This paper presents a neighborhood search algorithm for heterogeneous multiprocessor scheduling in which loop pipelining is used to exploit parallelism between iterations. The method adopts a realistic model for interprocessor communication where resource contention is taken into consideration. The schedule representation scheme is flexible so that communication scheduling can be performed in a generic manner. Based on a general time formulation of the schedule performance, the algorithm improves an initial schedule in an efficient way. Experimental results show that significant improvement over existing methods can be obtained. Using the scheduling results, a parallel software video encoder was implemented and real time performance was achieved.
Introduction
Given a program modelled by a task graph, finding an optimal multiprocessor schedule is a well-known NP-complete problem [1]. When inter-processor communication (IPC) is taken into account, optimal solutions have been found only under restrictive assumptions on the task graph [2, 3] and with an unbounded number of processors connected by a contention-free network. These assumptions are too idealized for real applications and platforms.
More realistic approaches try to model IPC resource contention [4, 6]. For example, the Mapping Heuristic (MH) proposed in [4] estimates an additional contention delay for each message with respect to the system state. Unfortunately, no actual implementation was given based on the model. In [5, 6], the Ordered Transaction model was proposed and implemented on a board containing four DSP96002 processors and a memory access controller. The shared memory access pattern is determined at compile time, so that run-time resource contention is eliminated. For a 1024-point complex FFT, a speedup of 3 is obtained. Based on a similar IPC model, Dynamic Level Scheduling (DLS) [7] performs list scheduling in which, at each step, the best-matched task-processor pair is found based on the system state. Similarly, the genetic algorithm (GA) proposed in [8] represents a schedule by matching and scheduling strings. Both algorithms have an implicit restriction in that the input data transfers for each task are scheduled only when the task is being considered.
For iterative applications, the rotation operation was proposed in [9] for loop pipelining without consideration of IPC. In [10], although IPC is included in the model, its scheduling has a similar restriction to that of [7, 8]. Moreover, both [9, 10] assume synchronous control steps and so are unsuitable for the asynchronous processors that are common in most distributed or shared memory systems.
As discussed, optimal solutions have been found only for restricted problem instances and ideal platforms such as a contention-free network. When IPC contention is considered, there are often unnecessary restrictions on the IPC scheduling. Therefore, one of our objectives is to develop a realistic and general model for computation and IPC scheduling. Based on this model, we developed a novel neighborhood search algorithm with pipelining to exploit inter-iteration parallelism. Experimental results show that significant improvement can be obtained over existing methods. Using the resulting schedules, a parallel video encoder was implemented that achieved over 30 frames/sec at 352x240 resolution using 24 processors. This is about 2 times the rate of the GA tested and 37% better than a manually optimized video coding algorithm. This paper is organized as follows: Section 2 states the model for scheduling. Section 3 presents the method of neighborhood search. Section 4 gives experimental results and discussions. The paper is concluded in Section 5.
Problem modelling
In order to obtain true overall performance, the scheduling model should take into account IPC resource contention. For example, ignoring IPC contention, the task graph in Fig. 1(a) has an optimal schedule in Fig. 1(b). In the presence of link contention, the schedule is no longer optimal, as shown in Fig. 1(c). As illustrated, the resource contention and the flow of data should be emphasized, which can be represented by a data flow graph (DFG).
In general, the DFG model consists of a number of non-preemptive computation tasks and a number of data objects connected according to $G(V_T \cup V_D, E_{TD} \cup E_{DT})$, with the definitions of notations given in the APPENDIX. Each task takes some data objects as input and produces some data objects as output. Fig. 2 to an earlier time slot in P1, resulting in a longer schedule. Moreover, the scheduler should not impose unnecessary restrictions on the IPC scheduling as in [7, 8, 10]. For instance, the data transfer (T0->T~) in Fig. 1(c) can be moved before (T1->T2), giving a better schedule in Fig. 3(c). We also consider overlapping of successive iterations to exploit inter-iteration parallelism. Fig. 3(b) shows an example of overlapped iterations in which significant improvement is obtained over Fig. 1(c). In the parallel platform model adopted, each data transfer is scheduled to channel resources by dedicating them throughout the duration of the transfer [7, 8]. For each ordered pair of processors, there is a channel that contains the resources involved. For each computation task, the execution time is assumed to be known a priori and it can be different on different processors. The data transmission time may be modelled as a channel setup time plus the data size divided by an effective bandwidth.
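The platform model above can be illustrated with a minimal sketch. It assumes the transmission time is the channel setup time plus data size divided by effective bandwidth (the dimensionally consistent reading of the model); the function name `transfer_time`, the `exec_time` table and all numeric values are hypothetical and not taken from the paper's implementation.

```python
def transfer_time(size_bytes, setup_s, bandwidth_bytes_per_s):
    """Estimated duration (seconds) of one data transfer on a channel.

    Assumption: time = setup + size / bandwidth.
    """
    if bandwidth_bytes_per_s <= 0:
        raise ValueError("bandwidth must be positive")
    return setup_s + size_bytes / bandwidth_bytes_per_s


# Heterogeneous execution times: one entry per (task, processor) pair,
# assumed known a priori. Values are illustrative only.
exec_time = {
    ("T0", "P0"): 2.0,
    ("T0", "P1"): 3.5,
}

# Illustrative call: a 1,050,000-byte object over a channel with a
# 27.5 microsecond setup time and 105 MBytes/s effective bandwidth.
t = transfer_time(1_050_000, 27.5e-6, 105e6)
```

Keeping execution time as a per-(task, processor) table is one simple way to express heterogeneity; a scheduler would look up `exec_time[(T, Map(T))]` when placing a task.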
Proposed method
Schedules generated by heuristic and non-deterministic approaches are often sub-optimal. They can therefore be improved by neighborhood search, in which a solution is modified to obtain a neighbor solution that is adopted if it is better. The optimization criterion should be the overall schedule length, rather than the task start time [11].
Schedule characterization and evaluation
For an iterative program, all the loops execute according to the same static schedule (RI, Map, DS, Seq). Table 1 shows an example schedule for the DFG of Fig. 2. The modelled schedule performance is evaluated with several intermediate graphs, as depicted in Fig. 6. First, the DFG G is transformed into Gc. Second, the precedence relations between the tasks are determined with respect to their relative iteration indices (RI). Then, Gp' is derived according to the platform resource constraints.
Precedence graph (Gp).
The precedence relation of the tasks in the schedule, represented by $G_p(V_T \cup V_M, E_p)$, is derived from Gc and RI. For $(T_i, T_j) \in E_c$, $(T_i, T_j) \in E_p$ if $RI(T_i) = RI(T_j) - d(T_i, T_j)$.
Obviously, Ep is a subset of the edges of Ec. Fig. 7 depicts the Gp obtained from the example schedule and the Gc.
3.1.3. Map, Seq and DS. Given a schedule, the execution time line is formed by traversing and scheduling the tasks in the order of Seq, which is a topologically ordered sequence that satisfies the precedence relation of Gp. Each resource has a task list to guide its execution. During the scheduling, a computation task T is appended to the task list of processor Map(T). A communication task T is appended to the channel resources between the source processor, Map(Producer(T)), and the destination processor, Map(Consumer(T)). If an alternative data source (Data Forwarder [8]) is specified by DS(T), the source is the destination of DS(T). If the data object is already present in the destination processor, T is not scheduled and ET(T) is set to zero. Fig. 8 shows the time line of the example schedule. The resources involved in each scheduling step are tabulated in 
$$tlevel(T) = \max\{0,\; tlevel(T_i) + ET(T_i) \mid (T_i, T) \in E_p'\}, \qquad (2)$$
$$blevel(T) = \max\{0,\; blevel(T_j) \mid (T, T_j) \in E_p'\} + ET(T). \qquad (3)$$
Assuming that the number of input and output data objects for each computation task and the number of resources in each channel are bounded by constants, finding tlevel and blevel for each task takes constant time. As there are at most e+v tasks, it takes O(e+v) time to find the schedule length.
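The level recurrences (2) and (3) can be evaluated in a single forward and a single backward pass over Seq. The sketch below is illustrative only: the function name `schedule_length` and its argument layout are hypothetical, the edge list stands for the precedence and resource edges of the schedule, and taking the schedule length as the maximum of tlevel(T) + blevel(T) is an assumption, since the defining equation is not shown here.

```python
from collections import defaultdict


def schedule_length(seq, edges, et):
    """Compute tlevel, blevel and the schedule length.

    seq   : tasks in a topological order of the edge set
    edges : iterable of (predecessor, successor) pairs
    et    : dict mapping each task to its execution time
    """
    preds = defaultdict(list)
    succs = defaultdict(list)
    for u, v in edges:
        preds[v].append(u)
        succs[u].append(v)

    # Forward pass, eq. (2): earliest start time of each task.
    tlevel = {}
    for t in seq:
        tlevel[t] = max([0.0] + [tlevel[u] + et[u] for u in preds[t]])

    # Backward pass, eq. (3): longest path from the start of t to the end.
    blevel = {}
    for t in reversed(seq):
        blevel[t] = max([0.0] + [blevel[v] for v in succs[t]]) + et[t]

    # Assumed definition: the longest path through any task.
    sl = max(tlevel[t] + blevel[t] for t in seq)
    return tlevel, blevel, sl


# Toy graph: A precedes both B and C.
tl, bl, sl = schedule_length(
    ["A", "B", "C"],
    [("A", "B"), ("A", "C")],
    {"A": 2.0, "B": 3.0, "C": 1.0},
)
```

Each edge is visited once per pass, so the running time is linear in the number of tasks plus edges, in line with the O(e+v) bound when in- and out-degrees are bounded by constants.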
Neighborhood search
Below are the three phases of search employed.
3.2.1. Phase NSP-MAP. In this phase, neighbor solutions are obtained by changing the processor mapping of one computation task while keeping the processor mapping of the other tasks fixed. The algorithm cycles through all the computation tasks, evaluating each under different processor mappings. It terminates if no improvement is found for any of the tasks. In the best and worst cases, it requires p and pv evaluations per improvement respectively. Thus, the time complexity for finding an improvement is O[pv(e+v)].
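The remapping loop of this phase can be sketched as follows. This is a simplified illustration, not the paper's implementation: `nsp_map`, `evaluate` and the toy `makespan` cost (processor load only, no communication or precedence) are hypothetical names and assumptions introduced for the example.

```python
def nsp_map(tasks, processors, mapping, evaluate):
    """Remap one task at a time; stop after a full pass with no improvement.

    evaluate(mapping) must return the schedule length of a mapping.
    """
    mapping = dict(mapping)
    best = evaluate(mapping)
    improved = True
    while improved:
        improved = False
        for t in tasks:
            original = mapping[t]
            best_p = original
            for p in processors:
                if p == original:
                    continue
                mapping[t] = p              # neighbor: only t is remapped
                sl = evaluate(mapping)
                if sl < best:
                    best, best_p = sl, p
            mapping[t] = best_p             # adopt the best mapping found for t
            if best_p != original:
                improved = True
    return mapping, best


# Toy evaluation (assumption): schedule length = heaviest processor load.
cost = {("a", "P0"): 2, ("a", "P1"): 3,
        ("b", "P0"): 2, ("b", "P1"): 2,
        ("c", "P0"): 4, ("c", "P1"): 1}


def makespan(mapping):
    load = {}
    for t, p in mapping.items():
        load[p] = load.get(p, 0) + cost[(t, p)]
    return max(load.values())


final_map, final_sl = nsp_map(["a", "b", "c"], ["P0", "P1"],
                              {"a": "P0", "b": "P0", "c": "P0"}, makespan)
```

On this toy input the search stops at a local optimum rather than the global one, which is the expected behavior of a pure improvement-only neighborhood search.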
3.2.2. Phase NSP-SEQ. This phase searches, for each task T, for a new position in Seq that gives the shortest schedule length. Due to precedence constraints, the search starts by shifting T backward from its original position until reaching a predecessor task. If this shifting reaches the head of Seq, T is wrapped around to the end of Seq with RI(T) increased by 1 and Gp updated. Then the shifting continues from the end of Seq. In this way, T is effectively shifted to the previous loop while the instance of T in the next loop is shifted in. After backward shifting, T is shifted forward from its original position until reaching a successor task. Similarly, when T reaches the end of Seq, it is wrapped around to the head with RI(T) decreased by 1 and Gp updated. Wrap-around in either direction is not performed if the maximum latency would exceed L. After the shifting, T is moved to the best position.
For each task, there are at most e+v possible positions in Seq and L-1 wrap-arounds in each direction. The algorithm terminates when no improvement is found after inspecting all the e+v tasks. As each evaluation takes O(e+v) time, an improvement takes O[L(e+v)^3] time.
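The bound follows from multiplying the three factors named above. A worked version, assuming each wrap-around exposes up to e+v further candidate positions:

```latex
\underbrace{(e+v)}_{\text{tasks inspected}}
\times
\underbrace{O\bigl(L\,(e+v)\bigr)}_{\text{positions per task, incl. wrap-arounds}}
\times
\underbrace{O(e+v)}_{\text{one schedule evaluation}}
\;=\; O\bigl[L\,(e+v)^{3}\bigr]
```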
This time complexity can be reduced. The idea is to first remove T, then use the level functions of the remaining tasks in the absence of T to find the levels for 
In (4), tlevel(T) and blevel(T) are obtained by (2) 
$O\{(SL_0/SL^* - 1)\,[pv(e+v) + L(e+v)^2]/\varepsilon\}$.
Fine improvements are ignored with a large ε. With a small ε, the search is likely to give a better result at the cost of a longer search time. In all the tests done, the value of ε is IO-', which can be considered a typical value.
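The role of the threshold can be illustrated with a small sketch. The acceptance rule below (a relative reduction of at least ε) is an assumption, since the exact rule is not shown in this section; the function names `accepts` and `max_improvements` and the value 0.01 are hypothetical.

```python
def accepts(sl_current, sl_candidate, eps):
    """Assumed rule: adopt a neighbor only if it shortens the schedule
    by at least a fraction eps of the current schedule length."""
    return (sl_current - sl_candidate) / sl_current >= eps


def max_improvements(sl_initial, sl_final, eps):
    """Upper bound on the number of accepted steps.

    Every accepted step removes at least eps * sl_final, so at most
    (sl_initial - sl_final) / (eps * sl_final) steps can occur.
    """
    return (sl_initial / sl_final - 1.0) / eps


coarse = accepts(100.0, 99.5, 0.01)   # 0.5% gain, below a 1% threshold
fine = accepts(100.0, 98.0, 0.01)     # 2% gain, above a 1% threshold
bound = max_improvements(100.0, 80.0, 0.01)
```

Under this reading, a larger ε rejects small gains and shrinks the bound on the number of improvements, while a smaller ε admits more steps and lengthens the search.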
Experimental results and discussions

Comparison by random DFG
Comparisons were made with DLS [7] and the GA of [8], since they have a similar model of IPC to our approach. For acyclic DFGs, five tests were done with variation in the parameters as shown in Table 3. In each random DFG, v data objects are added to v computation tasks with the producer and consumer tasks selected randomly. Each computation task has an expected unit execution time selected from the range 0.001 to 1.999 with uniform distribution. Each data object has a size also from this range but post-scaled by the desired CCR. In each test, 50 DFGs were used, i.e. a total of 250 DFGs in the 5 tests. In Fig. 13, the improvement of NSP over GA and DLS starts to increase and reaches about 32% and 45% respectively at CCR=0.4. This substantial improvement reflects the better IPC scheduling given by NSP.
4.1.6. Execution time: The execution time of the algorithms was measured on a Pentium II 350 MHz system. For the case of CP=55, NSP starts from an initial solution from DLS, which takes 0.16 second for scheduling. As depicted in Fig. 14, GA attains a steady value after about 10 minutes, while the NSP phases show stepwise drops and stop at about 3.5 minutes. This shows that NSP gives a substantially better schedule in a comparatively short time.
Application to video encoding
Fig. 15 where macroblock (MB) is the basic unit of data decomposition. The platform used is the IBM-SP2 at the University of Hong Kong, which is composed of 48 160 MHz IBM P2SC RISC processors connected by the High Performance Switch (HPS) with a point-to-point bandwidth of 105 MBytes/s and a latency of 27.5 µsec.
Blocking send and receive operations of the Message Passing Library were used. All the task execution times were measured using gettimeofday. For simplicity, we assume that the HPS is a completely connected switch such that each IPC channel is composed of the source and destination processors only. The video tested consists of 50 frames of 352x240-pixel resolution. It shows a table tennis game that involves a zooming view. Each test was repeated 8 times and the average frame rates were taken. NSP gives about 31 frames/sec at p=24, which is about 2 times that of GA and 37% better than MMMS. From Fig. 17, the speedup of GA tends to level off at about 6 for p over 20, while both MMMS and NSP show an increasing trend and reach about 9 and 12 respectively at p=24. NSP shows a curve closer to linear than MMMS. In fact, MMMS is the product of manual optimization by experience, while NSP is an automatic scheduler for an arbitrary DFG and platform.
Owing to variation in message transfer time, deviation from the predicted performance is observed. Firstly, the HPS has message buffers so that the sender can complete earlier. Secondly, network congestion may cause the transfer time to be longer. In our case, the first effect dominates and the message transfer time is generally shorter than predicted. Furthermore, there is variation in the MB encoding time depending on the video content. Since the schedules were generated based on a frame with above-average encoding time, the result is better than predicted. Moreover, the latency from input frame to output bit-stream is only three frame-encoding times, which suits on-line applications such as video conferencing.
Conclusions
In this paper, a scheduling algorithm for heterogeneous multiprocessor systems is presented. First, a flexible representation scheme is used so that communication scheduling can be done in a generic way. Second, loop pipelining is used to exploit parallelism between iterations. Third, an efficient technique is incorporated into the search that reduces the time complexity by an order of magnitude. Fourth, experimental comparisons were made with DLS and a GA using different suites of random DFGs with variations in different parameters, including the effect of data sharing. Finally, the method is verified by an actual implementation of a video encoder in which over 30 frames/sec is obtained using 24 processors.
Appendix
Ep: The set of directed edges in Gp.
Ep': The set of directed edges in Gp'.
SL: The current schedule length.
SLT: The schedule length when T is removed and re-
(last improvement before the search stops.
NSHARE: Max. no. of consumer tasks sharing each data object.
