Many partitioned scienti c programs can be modeled as iterative execution of computational tasks, represented by iterative task graphs (ITGs). In this paper, we consider the symbolic scheduling of ITGs on distributed memory architectures with nonzero communication overhead without searching the entire iteration space. An ITG may or may not have dependence cycles and we propose heuristic algorithms for mapping cyclic and acyclic ITGs, which incorporate techniques of software pipelining, graph unfolding, directed acyclic graph (DAG) scheduling and load balancing. We provide an analysis for computing near-optimal unfolding factors and comparing the performance of the proposed heuristic algorithms with the optimal solutions. We also study the stability of run-time performance when weights are not estimated accurately at compile-time. Our experiments study the scheduling performance of solving several scienti c computing problems and analyze the e ectiveness of optimization techniques used in our methods. We also present experimental results for mapping SOR-based iterative computation in solving sparse and banded matrix systems, and compare our approach with the multi-coloring method.
Introduction
Parallel programming on distributed memory machines requires coarse grain program partitioning since there is a large startup overhead in message transmission. There are many techniques proposed for program partitioning and granularity control CL95, GY93a, SK92, WF89] and partitioned programs can be represented in a format of coarse-grain task graphs. Finding a mapping of weighted 1 coarse-grain task graphs onto target architectures with the shortest parallel time is important for studying the impact of partitioning and generating e cient code. In this paper we consider the scheduling of a class of partitioned programs which can be modeled as iterative task graphs (ITGs). An iterative task graph (ITG) represents a sequence of task computations where each iteration consists of the execution of a set of tasks. There exists task dependence within an iteration and across iterations.
Iterative task computations involve both task and loop parallelism YFGS] . There is a lot of research work on scheduling task parallelism based on directed acyclic graphs (DAG), e. N88] have also been developed to allow a compiler to explore more parallelism. Our work has been motivated by the above research but takes into consideration the characteristics of asynchronous parallelism and the impact of communication. We will examine how task graph and loop scheduling techniques can be combined together for mapping iterative task computation on message-passing architectures. ELA94] presented an approach which uses a DAG scheduling algorithm for one iteration of the computation and computes the optimal unfolding for a special case. We will show how to derive the near-optimal unfolding factor for general ITGs.
Load balancing and communication minimization are the important aspects of optimization. It is known that data locality must be explored for reducing unnecessary communication. In the task computation model SK89, KB88] , exploring data locality means to localize data communication between tasks by assigning them in the same processor and then tasks could exchange data through a local memory to avoid high-cost inter-processor communication GY93a, SK89] . The performance of scheduling is also greatly a ected by the granularity of task graphs. GY93a ] presents a quantitative analysis on the impact of granularity over the choice of optimal DAG scheduling strategies.
What distinguishes our work with that of others is the incorporation of both communication and computation costs in ITG scheduling. The optimal solution for a special class of ITGs has been studied in Ch93] . Our heuristic algorithm is for general ITGs. The proposed algorithm contains the following parts: graph characterization, unfolding, graph transformation, DAG scheduling and the ITG schedule transformation. The optimal loop unfolding factor is derived so that the scheduling performance can be guaranteed compared to the optimal solution. We will show that the quality of DAG scheduling is critical to the performance of ITG scheduling. To make use of the scheduling result, we also present a method that executes the derived ITG schedule on distributed memory machines. We analyze the performance of run-time execution of static schedules when the estimation of task weights at static-time is not accurate. We conduct several experiments to verify our approach. We also investigate the applications of our techniques for scienti c computing such as SOR DY95, FYG95] .
This paper is organized as follows. Section 2 gives the problem de nition and assumptions. Section 3 discusses the possible approaches for mapping ITGs. Section 4 presents the scheduling algorithm and the optimality analysis in details for the cyclic ITGs. Section 5 gives the algorithm and the analysis for acyclic ITGs. Section 6 discusses the schedule executing scheme and examines the run-time performance of static schedules by studying the sensitivity of scheduling performance on inaccurate weights. Section 7 presents the experimental results.
De nitions and Backgrounds
Task graphs have been traditionally used to model task parallelism in which a set of functional tasks communicate with each other. Their dependence structure is expressed as a directed acyclic graph (DAG). A task is an indivisible unit of computation which reads its input data, it performs operations and then produces new data SK89, WG88] . The scheduling problem is to nd the mapping of tasks to processors and assign the execution starting times to those tasks. An iterative task graph (ITG) represents a sequence of task computations where each iteration consists of the execution of a set of tasks. There exists task dependence within an iteration and across iterations. Control dependence is not considered in this paper. Fig. 2 (a) depicts a task graph arising from the PDE computation and it contains three iterations. Since the number of PDE 3 iterations could be unknown or large, an ITG representation can capture the dependence between iterations, as depicted in Fig. 2(b) . A formal de nition of the ITG is given below. Formally, an ITG G contains v tasks T 1 ; T 2 ; ; T v , and e data dependence edges. This graph is repeatedly executed for N iterations. Let the instance of task T i at iteration k be T k i (1 k N). We de ne the following terms.
Let (T i ; T j ) be a dependence edge from T i to T j in G, labeled d i;j . d i;j is a non-negative integer called dependence distance 1 . Thus task instance T k+d i;j j depends on T k i and it cannot start to execute until T k i has completed its computation and the data produced by T k i is available at the local memory.
Let Children(T x ; G) be a set of tasks in G that depend on task T x . Let Parent(T x ; G) be a set of tasks in G that T x depends.
The expanded graph E(G; N) contains all N v task instances T k i (1 i v, 1 k N), and these tasks and their dependencies in E(G; N) constitute a DAG. Each task T x in an ITG has a computation cost x and there is a communication cost for sending a message from task T x in one processor to task T y in another processor. The cost is denoted as c x;y , which is usually estimated as communication latency + transmission speed size of message Dun91].
The scheduling problem for an ITG is to assign the instances of tasks to the given p processors and determine the execution order of tasks within each processor. The parallel time is the completion time of the last task instance and the goal of scheduling is to minimize the parallel time. Fig. 3(a) shows an ITG where edges are labeled with the dependence distances between tasks. Fig. 3(b) shows another ITG. For these two examples, the computation weights of all tasks are 20 and the communication edge weights are 10 between any two processors. The expanded graph for Fig. 3(b) is in (c). A schedule for this graph when N = 3 is depicted in Fig. 3(d) . The parallel time is 290. Loop unrolling or graph unfolding techniques PM91, N88] have been found useful to explore more parallelism. Let f be the unfolding factor. When a graph is unfolded f times, the number of tasks increases by a factor of f and the number of iterations for the new ITG is bN=fc. For example, Exploiting both task and loop parallelism together is essential to achieve a good performance.
Finally there is a trade-o between the performance and complexity of scheduling. The best solution can be found by examining the entire iteration space. However it is too expensive or not feasible to search the entire iteration space. It is possible to investigate few iterations to identify regularities of available parallelism. But examining as few task iterations as possible is important to balance the optimality of the solution and the complexity of scheduling. Since we do not search the entire iteration space, an important issue is to make sure the derived schedule is correct in the sense that all dependence constraints are satis ed.
There are several solutions for mapping ITGs:
Direct application of DAG algorithms for an expanded graph to map task instances from all iterations can produce a good solution, but this approach may not be feasible since the iteration number may not be known in advance, or the complexity is too high.
Applying a DAG algorithm for one iteration is possible; however, parallelism across iterations cannot be exploited in such an approach. For example, for Fig. 3(b) we can derive the DAG of each iteration by keeping all edges with distance zero, as shown in Fig. 4(a) . We schedule this DAG and derive a schedule for it in Fig. 4(b) . Then the schedule for the original ITG is shown in Fig. 4 (c), which is longer than the schedule of Fig. 3(d) . In Fig. 3(d) , task instances in di erent iterations are executed at the same time. The idea of exploring parallelism within and across iterations has been proposed in the context of instruction-level software pipelining AN88, L88]. In GS92], a near-optimal scheduling algorithm for pipelining with no communication delay is proposed. We need to extend this 6 result to incorporate communication optimization with load balancing since communication is a major overhead in a message-passing machine.
As we mentioned above, graph unfolding techniques PM91, N88] have been also developed. Unfolding can increase the number of tasks within each iteration and thus more parallelism could be explored. Our goal is to only examine tasks from few iterations (limited by the unfolding factor f) but explore parallelism of the entire iteration space. We need to derive a small unfolding factor since the complexity of scheduling could be too high when the unfolding factor is large.
We will use the concept of integer-periodic scheduling RE68] as in instruction-level software pipelining and structure a schedule as follows: all instances of task T i are assigned to the same processor. In order to derive a schedule which is near optimal, we need to examine the performance of the optimal schedule for this problem. The optimal solution is one that obtains the minimum time for the expanded graph E(G; N). We conduct a comparison of our schedule with an optimal solution. We should emphasize that our analysis measures the asymptotic performance of scheduling and study the competitive ratio of the derived parallel time over the optimal time when N is very large. We will use symbol o(N) in our analysis to represent a term which is negligible when N is very large.
In order to execute a schedule on a parallel machine, we need to construct a code that follows the scheduling result. If we study the schedule in Fig. 3(d) , we can see that the execution of each processor possesses certain local periodic patterns. In general, the local schedule of a processor contains three parts: prelude, iterative execution based on periodic pattern and postlude. For example, Processor 0 has a periodic pattern T 6 ; T 8 at interval 20; 110). It will repeat at 110; 200) and so on. There is no prelude and postlude. However the periodic pattern is not unique. Another pattern is T 8 ; T 6 at 90; 180). The prelude is T 1 6 and the postlude is T 3 8 . Notice that in instructionlevel software pipelining AN88], a global periodic pattern (a pattern which starts at the same time for all processors) is computed and this is not feasible in message-passing machines with asynchronous communication. In our approach, a local periodic pattern is produced. We will also uniformly express the prelude and postlude within a periodic pattern to avoid explicitly enumerating and sorting tasks in the prelude and postlude. The generated code is MPMD (multiple-program multiple-data) in nature, since each processor has a di erent local schedule. However, in practice, it is convenient to use a single SPMD program for all processors, and to let the program determine 7 which local schedule should be used based on the processor identi cation. 
Computing the performance parameters
In this step, we determine the unfolding factor based on the performance bound analysis presented in Section 4.6.
Smallest iteration interval. We rst need to estimate the maximum performance this task graph could achieve. The most important factor is
where d(C) is the summation of edge distances in this cycle C and (C) is the summation of computation weights of tasks in this cycle. This is the well-known optimal rate (the smallest iteration interval) of the pipelining when the communication cost is zero and there are a su cient number of processors RE68]. To compute this value, we set up the following inequalities for all dependence edges (T x ; T y ) in G:
where y is a non-negative unknown associated with task T y and is an approximation of (G). These inequalities can be solved in a complexity of O( p ve log 2 Seq(G) ) using the shortest path algorithm GR89] where constant is the desired accuracy in nding the minimum value such that (G) (G) + . For example, for Fig.3 (a), is chosen as 0.5 and the value of is computed as 40.
Granularity. We need to estimate the granularity of the graph, this value will be useful in computing the performance bound of the produced schedule and the unfolding factor. We use the 8 following value to represent the granularity for a given graph: Unfolding factor. Then we compute the unfolding factor f. The analysis in Section 4.6 shows that such a factor will guarantee that the asymptotic performance of this scheduling is competitive to an optimal solution. The performance bound compared to an optimal solution is estimated as:
where S1 = 1 ? 1=p + 1=g(G). The unfolding factor f is dmax(L1; L2)e where L1 = S1( max + ) + max =g(G) 
Graph transformation
We transform G f into a DAG so that the DAG scheduling technique could be applied. De ne DIV (x; y) as the largest integer that is less than or equal to x=y, i.e. bx=yc. De ne MOD(x; y) = x ? DIV (x; y) y. Let the tasks in G f be renumbered as T 1 ; T 2 ; ; T fv . We rst compute the minimum integer values for f and i (1 i fv) based on the following inequalities for each dependence edge (T i ; T j ) in G f : j ? i + f d i;j i : where j is a non-negative unknown associated with task T j in G f and f is an approximation of (G f ). Since we know that (G f ) = f (G), itcan be shown that f ?f (G f ) f . Thus the value f can be easily found in this range using the binary searching and the shortest path algorithm. The result satis es (G) f (G) + . 
Mapping the kernel graph
In this stage, a DAG schedule is derived for K(G f ). The starting time i and the processor assignment Proc(i) of each task T i are provided. In mapping the kernel DAG, we use two algorithms: one is a multi-stage algorithm designed for the PYRROS system YG92], another is a one-stage algorithm. We will choose the smaller one between two solutions produced by these algorithms. The result of the DAG scheduling is used for constructing the ITG schedule in Section 4.5. The rst algorithm involves the following multiple stages in scheduling: 1) It clusters tasks with intensive communication using the DSC algorithm YG94].
2) It merges clusters to p processors using a load balancing algorithm.
3) It reorders tasks within each processor to overlap computation with communication. The algorithm has a complexity O((v + e) log v) 2 and works well in practice but we are not able to derive a performance bound. The second algorithm has the same complexity but only involves one-stage. We will derive the performance bound of the second algorithm, which is used in deriving the near-optimal unfolding factor.
In the one-stage approach, we use the idea of the DSC algorithm YG94] but limit the number of clusters to p. The algorithm rst computes the blevel value for each task where blevel(T x ) is the length of the longest path from this task T x to an exit node. The algorithm updates the value of tlevel for each unscheduled task where tlevel(T x ) is the length of the longest path from an entry task to this task T x (the partial order includes the execution order between tasks within the same processors). The algorithm maintains a list of free tasks and at each step it picks one free task T 1 with the smallest tlevel value and another free task T 2 with the biggest tlevel(T 2 ) + blevel(T 2 ) value. Then we will try to schedule these two tasks separately through the minimization procedure described in the following. And we will select one of these two tasks that has earlier starting time.
If a tie occurs, we will pick the latter one. The tlevel value of the selected task will be the latest starting time x of this task unless there are no processors available. The minimization procedure is also used in DSC ( YG94], page 958). The idea is to further reduce the starting time of the selected task by merging one or more parents of this examined task to one processor. In this way the unnecessary communication could be saved. Assume that the parents of T x as depicted in Fig. 6 are sorted such that:
arrival(T j ; T x ) arrival(T j+1 ; T x ) for 1 j m ? 1:
Then T 1 is called the most dominating parent for T x and tlevel(T x ) = arrival(T x ; T 1 ). tlevel(T x ) is the worst starting time of T x if there is an idle processor available at that time. However the starting time of T x could be further reduced if T x is assigned to the processor of T 1 and some other parents of T x are assigned to the same processors. Assuming k parents are assigned to Proc(T 1 ), then the starting time of T x is
where avail(i) is the available time of processor i. Optimal k that leads to the minimum x can be found by a binary searching algorithm. When a parent of task T x (say T 2 ) is moved from one processor to Proc(T 1 ), we need to impose a constraint that such relocation should not a ect the tlevel or starting times of children of T 2 other than T x . Currently we use a simple but restrictive strategy: a parent T 2 is movable to Proc(T 1 ) if T 2 does not have a child other than T x . When T 2 is moved to the new processor location, the avail time of this processor may decrease.
The complexity of this algorithm is O((v +e) log v). We apply this algorithm to the given DAG and also the reversed graph, and we can show that this algorithm has the following properties: when there is a su cient number of processors, the algorithm reaches the optimum for fork, join, coarse grained tree DAGs. Also the performance bound is studied in the following theorem.
Theorem 1 Given a DAG R and p processors, the parallel time produced by this algorithm is bounded by
where CP(R) is the length of the critical path (the longest path, excluding edge communication weights) in this graph R.
Proof: We call a task is \relocated" if it is rescheduled from one processor to another by the minimization procedure. Otherwise this task is called \position-xed". In the proof, we will restructure the derived schedule, move relocated tasks back to its original processor positions if there are empty slots (partial lling is possible if the original positions are lled by other tasks, but this does not a ect the proof).
Let task T x be the task examined at each step of the scheduling. Let T y be the most dominating parent for T x , tlevel(T x ) = arrival(T y ; T x ). Then if T x is a position-xed task, we can see two properties hold:
T y is also a position-xed task.
If arrival(T y ; T x ) < x , then there is no processor idle in time interval arrival(T y ; T x ); x ]. De ne CT(T x ) to be the completion time of task T x , CT(T x ) = x + x . The total processor idle area in time interval CT(T y ); CT(T x )] is at most p c y;x + (p ? 1) x p x =g(R) + (p ? 1) x = (p ? 1 + p=g(R)) x :
Notice that the relocated parents of a task T x could be moved back to its original positions in the above schedule reconstruction (for the purpose of proof only) , there could be some idle space 12 left in Proc(T x ) before time x . But it is easy to verify that these idle slots are in time interval CT(T y ); arrival(T y ; T x )].
Let task T x be the last task in the DAG schedule. Then it must be a position-xed task. Then we can search its most dominating parent. Repeat this reasoning, we can get a dependence chain of tasks, such that T 1 ; T 2 ; ; T x , such all these tasks are position-xed. And it is also easy to verify that there is no idle time from time 0 to the starting time of task T 1 because all tasks that are scheduled before T 1 have to be entry tasks which are independent to each other. So the total idle area from time 0 to the completion of task T x is at most
Since p PT(R) is equal to the summation of working and idle areas, thus we have PT(R) (1 ? 1=p + 1=g(R))CP (R) + Seq(R)=p:
The ETF algorithm H89] has a similar performance bound with a complexity O(pv 2 ) while our algorithm has a lower complexity O((v + e) log v).
Schedule Transformation
The DAG scheduling produces processor assignment Proc(i) and the starting time i for each task T i in K(G f ). 
Analysis for scheduling cyclic ITGs
We can show below that the schedule derived by the above algorithm satis es the resource and dependence constraints. The proof is based on the analysis in GS92]. The main di erence is that we consider communication delay and also we add a proof on resource constraint satisfaction.
Theorem 2 The algorithm correctly produces a schedule for an ITG.
Proof: Let We also conduct an analysis to compute the unfolding factor that gives a guarantee on the asymptotic performance bound of this heuristic algorithm. Because the result of this analysis is used in the rst step of our algorithm, the expression for estimating the unfolding factor can only rely on information from the given graph G, not G f . Thus a connection between the characteristics of G and G f must be constructed. Proof: We notice the following relationship between G, G f , and K(G f ): This theorem indicates that since could be chosen small, the bound is about S1 + 1 = (1 ? 1=p + 1=g(G)) + 1. When the granularity of the G is not too small, this bound is tight. Notice that our granularity estimation may not be e ective when some tasks perform very small computation but receive large messages from others. In practice, the size of data a task communicates is proportional to the amount of computation work it does, we expect our algorithm performs well and the bound is relatively tight. We will verify this in our experiments. On the other hand, the theorem indicates that program partitioning that produces ITGs should make g(G) not too small. This is consistent to the previous results GY93a, CR92].
Scheduling for acyclic ITGs
If an ITG does not have a dependence cycle, then the ITG itself is a DAG. The following algorithm is used to handle such a graph.
Step 1: We unfold the graph G in a factor f = dmax(L3; L4)e where
and B is the desired performance competitive bound. We set B = 2p p+1 + 1.
Step 2: We construct a weighted DAG which as the same structure as the given ITG G f except the weight of an edge (T i ; T j ) is assigned as 1 ? d i;j and the weight of each node is zero. This graph is called R(G f ). Then we use the topological sorting algorithm to compute value H i for each task T i , which is the length of the longest path from this task T i to an exit task in R(G f ).
Step 3: We assign tasks in the ITG G f to p processors such that load among those processors is balanced. Our load balancing method uses the classical list scheduling algorithm CD73] with the highest-weight-task-rst principle. The result of this mapping is the starting time i for each task and the processor assignment PA. Let the length of this schedule be LPT. This algorithm has a time complexity of O(v log p + e) and has the following properties.
Theorem 4 The above 4-step algorithm produces a correct schedule for an ITG with no dependence cycle.
Proof: (1) The proof for the resource constraint satisfaction is the same as the one for Theorem 2.
(2) Now we show that all dependence is satis ed. where B has to be set as B > 2p p+1 .
Proof: The load balancing algorithm used at Step 3 can achieve the following performance: LPT Thus we choose f = dmax(L3; L4)e.
We can see that when for coarse grain graphs with g(G) 1, L3 and L4 are very small if max p=Seq(G) is not large. Thus without unfolding, the scheduling algorithm can produce a reasonable performance. Another note is that if a given ITG is cyclic, and it only has self dependence, then this above algorithm without unfolding still works since the self-dependence in G f is actually satis ed automatically. Thus for a coarse-grain ITG which only has self-dependence cycles, we can still apply the above algorithm without unfolding and the performance should be good according to the above analysis. In Section 7, we will examine such an ITG arising from matrix-matrix multiplication.
Run-time Execution and Sensitivity Analysis
The scheduling algorithm can determine the processor assignment of tasks and the starting time of each task. In order to utilize this result, we need to study the scheme that executes this schedule. We will rst discuss the run-time method for executing parallel tasks following the static schedule and then discuss the run-time performance variation due to inaccurate estimation of task and communication weights.
The schedule execution
We assume that the target is a message-passing parallel machine, each processor has its own private memory and a communication bu er to hold the outgoing and incoming messages. In Section 4, we have derived the starting function value for each task instance. If we execute tasks by sorting the starting time of these tasks at run-time, then the cost of sorting contributes a signi cant amount of overhead in the overall performance. Our approach is based on the idea of \periodic execution" in software pipelining. The di erence is that in instruction-level pipelining, periodic pattern is global usually AN88] while in a distributed memory machine, we use local period patterns to utilize 20 asynchronous parallelism. Our code does not have separate prelude and postlude parts, and in this way the code length can be shortened.
The starting time of task instances depends on the iteration number k. We use the following method to nd the periodic pattern for processor j from the coe cients in starting time function ST(T x ) = i;p + p (k ? 1). Let Tasks(j) be all tasks assigned to processor j. Let Based on the above theorem, we give the code structure for executing the schedule at processor j in Fig. 8 . When executing each individual task, the task receives the data items it needs, performs computation, and then sends out data produced to its children (using aggregate multicasting if possible). The style is similar to the one used in the PYRROS system YG92].
Next we examine the performance of this code structure and provide an analysis on performance variation between predicted time and actual run-time performance. 21 6.2 An analysis of run-time performance
Code in Fig.8 executes the given ITG G N times using the derived periodic pattern. When it executes a task, it starts task computation as soon as this task is ready. Thus if run-time weights are the same as the static time weights, it is possible that the code may execute tasks in an iteration interval smaller than computed p . We call the periodic schedule used in Fig.8 to be S, and de ne PT(G) to be the parallel time estimated by our algorithm using parameter p . Let PT(S; G) be the parallel time produced by the code in Fig.8 . Then PT(S; G) PT(G).
At run-time, the computation time of each task and communication delay between tasks may not be the same as the one predicted at compile-time. The following parameters a ect the performance of a static schedule in a run-time environment:
Variation of task computation weight, due to inaccurate estimation of static weight, the cost of fetching data from the communication bu er and copying the data from the local memory to the communication bu er.
Variation of communication delay, due to inaccurate estimation of static communication weight, unexpected run-time network contention.
We address the sensitivity of the scheduling performance on the variation of task communication and computation weights. Let G be the graph for which the scheduling algorithm produces a schedule S. At the run-time, the weights of this graph change, we call the new graph as G r . The run-time method in Fig. 8 uses the mapping and task ordering of S to execute graph G r , resulting in a schedule S r in a length of PT(S; G r ). Assume that PT opt (G r ) is the optimal schedule length for G r . The interesting question is: what is the competitive ratio of PT(S; G r ) over PT opt (G r ).
We assume that at run-time, each communication weight c i;j is changed to c r i;j and the computation weight i is changed to r i . They satisfy:
(1 ? 1 )c i;j c r i;j (1 + 2 )c i;j ; (1 ? 1 ) i r i
(1 + 2 ) i : 22 We have shown in Theorem 3 and 5 that the derived schedule satis es: PT(G) B PT opt (G) where PT opt (G) is the optimal schedule length for G, B = (1 ? 1=p + 1=g(G))(1 + max( ? ;Seq(G)=p) ) + 1 for the cyclic case and B = 2p p+1 +1 for the acyclic case. The following theorem identi es the relationship between PT(S; G r ) and PT opt (G r ). Proof: We rst estimate PT(S; G). The length of the critical path in E(G; N) is usually smaller than PT(S; G) because in this schedule independent tasks could be mapped to the same processor, which may not be re ected in a critical path of E(G; N). However, after adding an execution edge from T i x to T j y if T i x is executed immediately before T j y in the same processor at schedule S and these two tasks are independent, the length of the critical path in the new graph is the execution time, i.e. PT(S; G). We call this new graph EE(S; G; N). At run-time, the weights of EE(S; G; N) may change. We call the run-time graph EE(S; G r ; N). Note that EE(S; G r ; N) has the same dependence structure as EE(S; G; N).
In the rest of the proof, we use CP(S; G) and CP(S; G r ) to denote a critical path in EE(S; G; N) and EE(S; G r ; N) respectively based on schedule S. We also de ne W(P) as the summation of task weights and communication weights between adjacent tasks along the path P (the communication weight will be zero if two tasks are assigned to the same processor in the schedule considered).
First it is easy to see that for any path P in EE(S; G; N), we have PT(S; G) W(P). But at run-time, PT(S; G r ) = W(CP(S; G r )) (1 + 2 )W (P x ), where P x is the path in EE(S; G; N) contains the same tasks and edges as the critical path CP(S; G r ) of EE(S; G r ; N). Note that P x may not be a critical path of EE(S; G; N). This analysis shows that if the run-time variation of weights is not signi cant, a static schedule still produces a competitive performance. We will verify this in our experiments.
Experiments
We have implemented our algorithms in a scheduling system which takes an ITG as input and produces a schedule for each processor. Currently the system is able to simulate the scheduling performance on message-passing machines using their communication and computation parameters. In our experiments, we have examined the performance of scheduling 50 randomly generated ITGs, and several application programs including matrix multiplication, solving a Laplace equation, solving a banded matrix system, and the SOR method for sparse matrix systems. We have also hand-coded some of schedules using our executing scheme in nCUBE-2 distributed memory machines to verify the correctness of the simulation in a real machine. In this section we report the experimental results which address several aspects of the performance of this algorithm.
Overall performance of scheduling. We report the performance of the simulation and also the actual performance on nCUBE-2. In the simulation, we compute the asymptotic speedup of an ITG schedule as:
In the nCUBE2 implementation, we use the sequential time divided by the actual parallel time. In Fig. 10 , we list the results of scheduling three application programs. We rst show the simulated and actual performance on nCUBE-2 for the matrix multiplication 3 when the matrix dimension is n = 200 and n = 400. We also show the performance of solving Laplace PDEs (200x200 and 400x400 grids), and banded matrices (n=2000 and 4000 with band width=35) using the SOR method. We can see that the simulated performance is close to the actual performance for the studied programs. There is still some discrepancy especially when p is getting large. It indicates the communication overhead prediction still needs some future investigation. The ITGs for those test cases are depicted in Fig. 2(b) and Fig. 9 . Performance sensitivity on inaccurate weights. We change the weights of PDE and banded matrix ITGs in a range of ?10%; +10%]. We compare the variation ratio Speedupnew Speedup old ?1. The result is in Fig. 11 . We can see the performance is sensitive but variation is small, within 5%.
The e ectiveness of unfolding and the performance bound. We examine if the computed factor identi es the range of unfolding in improving performance for three groups of randomly generated acyclic and cyclic ITGs: coarse grained graphs with g(G) 1; ne grained graphs where 0:4 < g(G) < 1 and the local granularity (local computation weights vs. x Banded matrix n=4000, width=35
Figure 11: Performance variation ratio of the PDE ITG and banded matrix ITG using the inaccurate and accurate weights.
percentage which is calculated with respect to the unfolding factor obtained by our scheduling algorithm. Assume that f is the calculated factor. 50% unfolding percentage means that 0:5f is used. The Y axis is the speedup improvement ratio after unfolding: Speedupnew Speedup old ? 1. For the acyclic coarse-grain ITGs, the calculated unfolding factors are all 1 and thus the unfolding performance is not listed. We examine the unfolding lower bounds L3 and L4 derived in Theorem 5 and we can see that when g(G) 1, L3 and L4 are very small if max p=Seq(G) is not large. In general, we can see that unfolding greatly improves the performance. This is because unfolding produces more tasks for the DAG scheduling, which gives the algorithm more exibility. The average estimated upper bound after unfolding and actual bound are listed in Since the value of unfolding factor a ects the complexity and also the size of code, we study how large the unfolding factor is for these random ITGs. The unfolding factors computed are small (4-8) in general except for three cases (30-50). We studied those cases and found that B bound is set to be too small. If we increase B by 1, then the average unfolding factor is reduced to 6.
With and without software pipelining. We have demonstrated in Fig. 4 that the parallelism is not fully utilized if using the DAG scheduling for one iteration without pipelining iterations (called simple DAG scheduling approach). It should be noted that after unfolding an ITG, the simple DAG scheduling approach could also explore parallelism between iterations for the original ITG but not for the new ITG. We examine the 100 random ITGs and compare the performance of this simple DAG scheduling approach with our algorithm. Fig 13 shows the average performance improvements of our approach over this simple DAG scheduling approach before graph unfolding and after unfolding by a factor of 2. It can be seen that before unfolding, our approach can achieve as high as 50% performance improvement due to the exploration of the across-iteration parallelism especially when the number of processors is large. After unfolding, both approaches have gained more exibility. But our approach is still a lot better.
The role of DAG scheduling and load balancing. We investigate how critical the DAG scheduling in our algorithm for constructing the nal ITG schedule. Notice that p = max(D; PT(K(G f )). We compute the ratio PT(K(G f ))= p for these randomly generated graphs. The larger this value is, the more crucial the DAG scheduling is. The left part of Fig. 14 shows the average ratios for three groups of cyclic ITGs. It shows that when p is small, each processor has a heavy load and the DAG schedule dominates the ITG performance. When p is large (p = 32), the load per processor becomes smaller (notice that the graph sizes are not too big), the role of communication comes to play. But still the DAG scheduling contributes more than 60% to the nal performance. Similar results for acyclic ITG scheduling can be seen in the right part of Fig. 14 on the role of the load balancing algorithm at Step 3.
Iterative sparse matrix computation. In solving a sparse matrix system, the SOR iterative methods could be used AJ86]. We use submatrix partitioning and a partitioned sparse matrix 27 The top left is the average improvement ratio using our cyclic ITG algorithm instead of applying the DAG scheduling for one iteration. The top right is the average improvement ratio after the graphs are unfolded by a factor of 2. The bottom left is the improvement ratio for acyclic ITGs and the bottom right is the ratio after unfolding those ITGs by 2. usually still contains many zeros. The left side of Fig. 15 shows a typical ITG based on the SOR method for a sparse matrix, assuming that the convergence test is not conducted in every iteration but every few hundreds of iterations instead. Some of the edges in this graph are marked distance 1. The unmarked edges have distance 0. We have tested the performance of sparse SOR ITGs. The test matrices are the part of the Harwell-Boeing Test Suites D92]. We use matrix BCSSTK14 arising from structural analysis for the roof of Omni Coliseum at Atlanta and matrix BCSSTK15 for Module of an o shore platform. The simulated performance is shown in the right part of Fig.15 . While the parallelism in a sparse matrix computation is limited, this algorithm is able to explore a decent amount of parallelism. The main di culty in e ciently parallelizing the SOR methods stems from the intra-iteration data dependencies (cf. Jacobi method), as iterates values depend on computed values of the same iteration. The multi-coloring method has been proposed AJ86] for executing SOR-based iterative computation. This scheme performs variable relabeling to increase the amount of exploitable parallelism since nodes assigned the same color can be computed concurrently. Adams and Jordan AJ86] have shown that the multi-coloring method is numerically equivalent to performing several SOR iterations computed simultaneously in the absence of convergence tests. However, in order to preserve inter-color data dependencies, processors must communicate between each computation phase associated with each color. This constitutes a drawback in the implementation of the multi-coloring scheme for message-passing machines as communication and computation cannot be overlapped. Our approach based on software pipelining is essentially to overlap computation and communication of several SOR iterations. We conduct the experiments in using the SOR method for solving banded matrices to compare the multi-coloring approach with our approach DY95]. Table 2 shows the mega ops obtained in nCUBE-2 and Intel Paragon machines (vector units are not used). The left part shows the mega op performance for xed matrix size n and p = 64. The right part is for a xed matrix band width. The experiments indicate that our approach that overlaps several SOR iterations outperforms the coloring method when many colors are needed. Table 2 : The left part is the mega ops obtained with the xed matrix size and p = 64 and the right part is with the xed matrix bandwidth.
Concluding Remarks
Our experiments show that the automatic scheduling algorithms for ITGs delivers good performance on message-passing architectures. We have also presented a method for executing ITG schedules and provided analytic and experimental results to show that run-time performance of static schedules is stable if variations between predicted static weights and actual run-time weights are not signi cant.
Currently we are implementing a run-time support system for e ciently executing graph schedules. CFH95] proposed a hierarchical tiling framework and our work is useful and can be extended for assisting performance prediction. Our algorithms have not considered the overhead of memory accessing and e ect of caching and WF93] discussed an approach for addressing this issue.
The scheduling algorithms we present have not considered the processor distance. It has been shown (e.g. Dun91]) in the modern processor architectures, the distance factor does not a ect the cost of communication signi cantly because of wormhole routing technology. However, this routing scheme requires the exclusive use of channels, and there exists communication contention if network tra c is heavy. Assigning communication-intensive tasks close to each other in terms of processor distances would reduce the chance of contention. The communication contention is hard to model. In YG92], the processor distance information is incorporated in computing communication volume after virtual processor assignments of tasks in a DAG are determined. Then physical mapping can be adjusted B90]. We are investigating a similar method for mapping ITGs.
