Novel neighborhood search for multiprocessor scheduling with pipelining by Cheung, PYS et al.
Title Novel neighborhood search for multiprocessor scheduling with pipelining
Author(s) Leung, KK; Yung, NHC; Cheung, PYS
Citation
The 4th International Conference/Exhibition on High
Performance Computing in the Asia-Pacific Region Proceedings,
Beijing, China, 14-17 May 2000, v. 1, p. 296-301
Issued Date 2000
URL http://hdl.handle.net/10722/46177
Rights
©2000 IEEE. Personal use of this material is permitted. However,
permission to reprint/republish this material for advertising or
promotional purposes or for creating new collective works for
resale or redistribution to servers or lists, or to reuse any
copyrighted component of this work in other works must be
obtained from the IEEE.
Novel Neighborhood Search for Multiprocessor Scheduling with Pipelining 
K. K. Leung, N. H. C. Yung & P. Y. S. Cheung 
Department of Electrical & Electronic Engineering, The University of Hong Kong 
kkleung@hkueee.hku.hk, nyung@hkueee.hku.hk, cheung@hkueee.hku.hk
Abstract 
This paper presents a neighborhood search algorithm 
for heterogeneous multiprocessor scheduling in which 
loop pipelining is used to exploit parallelism between 
iterations. The method adopts a realistic model for inter- 
processor communication where resource contention is 
taken into consideration. The schedule representation 
scheme is flexible so that communication scheduling can 
be performed in a generic manner. Based on a general 
time formulation of the schedule performance, the 
algorithm improves an initial schedule in an efficient way. 
Experimental results show that significant improvement 
over existing methods can be obtained. Using the 
scheduling results, a parallel software video encoder was 
implemented and real time performance was achieved. 
1. Introduction 
Given a program modelled by a task graph, finding an optimal multiprocessor schedule is a well-known NP-complete problem [1]. When inter-processor communication (IPC) is taken into account, optimal solutions have been found only under restrictive assumptions on the task graph [2,3] and with an unbounded number of processors connected by a contention-free network. These assumptions are rather idealized for real applications and platforms.
More realistic approaches try to model IPC resource contention [4,6]. For example, the Mapping Heuristic (MH) proposed in [4] estimates an additional contention delay for each message with respect to the system state. Unfortunately, no actual implementation was given based on the model. In [5,6], the Ordered Transaction model was proposed and implemented on a board containing four DSP96002 processors and a memory access controller. The shared memory access pattern is determined at compile time, so that run-time resource contention is eliminated. For a 1024-point complex FFT, a speedup of 3 is obtained. Based on a similar IPC model, the Dynamic Level Scheduling (DLS) [7] performs list scheduling where, in each step, the best matched task-processor pair is found based on the system state. Similarly, the genetic algorithm (GA) proposed in [8] represents a schedule by matching and scheduling strings. Both algorithms have implicit restrictions in that the input data transfers for each task are scheduled only when the task is being considered.
For iterative applications, the rotation operation was proposed in [9] for loop pipelining without consideration of IPC. In [10], although IPC is included in the model, its scheduling has a similar restriction to that of [7,8]. Moreover, both [9] and [10] assume synchronous control steps and so are unsuitable for asynchronous processors, which are common in most distributed or shared memory systems.
As discussed, optimal solutions have been found only for restricted problem instances and ideal platforms such as a contention-free network. When IPC contention is considered, there are often unnecessary restrictions on the IPC scheduling. Therefore, one of our objectives is to develop a realistic and general model for computation and IPC scheduling. Based on this model, we developed a novel neighborhood search algorithm with pipelining to exploit inter-iteration parallelism. Experimental results show that significant improvement can be obtained over existing methods. Using the resulting schedules, a parallel video encoder was implemented, which achieved over 30 frames/sec at 352x240 resolution using 24 processors. This is about 2 times the frame rate of the GA tested and 37% better than a manually optimized video coding algorithm.
This paper is organized as follows: Section 2 states the 
model for scheduling. Section 3 presents the method of 
neighborhood search. Section 4 gives experimental results 
and discussions. This paper is concluded in Section 5. 
2. Problem modelling 
In order to obtain true overall performance, the scheduling model should take into account IPC resource contention. For example, ignoring IPC contention, the task graph in Fig. 1(a) has an optimal schedule in Fig. 1(b). In the presence of link contention, the schedule is no longer optimal, as shown in Fig. 1(c). As illustrated, the resource contention and the flow of data should be emphasized, which can be represented by a data flow graph (DFG).
In general, the DFG model consists of a number of non-preemptive computation tasks and a number of data objects connected according to G(V_T ∪ V_D, E_TD ∪ E_DT) with
definition of notations given in the APPENDIX. Each task 
takes some data objects as input and produces some data 
objects as output. Fig. 2 depicts an example DFG. 
Figure 1. Schedule with IPC contention (panels: (a) task graph, with a two-processor system; (b) and (c) schedules; legend: computation, data transfer from T_i to T_j, idling; time axis from 0 to 15)
For an iterative DFG G, an iteration is the execution of all the tasks in G once. For tasks T_i, T_j in V_T and D in V_D, if (T_i, D) ∈ E_TD and (D, T_j) ∈ E_DT, then T_j is dependent on the instance of T_i in the d(D)-th past iteration. In order to maintain the precedence constraint, every cycle in the DFG should include at least one data object with positive dependence distance, such as D3 and D4 in the cycles in Fig. 2.
In the parallel platform model adopted, each data transfer is scheduled to channel resources by dedicating them throughout the duration of the transfer [7,8]. For each ordered pair of processors, there is a channel that contains the resources involved. For each computation task, the execution time is assumed to be known a priori, and it can be different on different processors. The data transmission time may be modelled as a channel setup time plus the data size divided by an effective bandwidth.
Figure 2. An example cyclic DFG (legend: edges in E_TD and E_DT; data objects; computation tasks; system output to P1; d(D3) = d(D4) = 1)
3. Proposed method
Schedules generated by heuristic and non-deterministic approaches are often sub-optimal. There is therefore an opportunity to improve them with neighborhood search, in which a solution is modified to obtain a neighbor solution that is adopted if it is better. The optimization criterion should be the overall schedule length rather than the task start time [11]. For example, Fig. 3(a) shows the modified schedule of Fig. 1(c) with T4 migrated to an earlier time slot in P1, resulting in a longer schedule. Moreover, the scheduler should not impose unnecessary restrictions on the IPC scheduling as in [7,8,10]. For instance, a data transfer from T0 in Fig. 1(c) can be moved before the transfer (T1->T2), giving a better schedule in Fig. 3(c). We also consider overlapping of successive iterations to exploit inter-iteration parallelism. Fig. 3(b) shows an example of overlapped iterations in which significant improvement is obtained over Fig. 1(c).
Figure 3. Modified schedules of Fig. 1(c) (panels: (a) T4 migrated to an earlier time slot in P1; (b) overlapping of successive iterations, where T0 and T1 are moved to the previous iteration; (c) data transfer moved to an earlier position; legend: computation, data transfer from T_i to T_j, idling)
At this juncture, we introduce a set of communication tasks, which is formed from the input DFG according to Fig. 4. In case 1, data object D may be used by a number of computation tasks. Each outlet edge (D, T_j) is associated with a communication task for transferring the instance of D in the d(D)-th past iteration. In case 2, if D is a system output, its inlet edge is associated with a communication task. The Gc of the example DFG is shown in Fig. 5. We collectively call any computation or communication task a task, and V_T ∪ V_CT is then the set of all tasks.
Figure 5. Gc of the example DFG (legend: an edge labelled i has dependence distance i; an unlabelled edge has zero dependence distance; computation tasks)
3.1. Schedule characterization and evaluation 
For an iterative program, all the loops execute according to the same static schedule (RI, Map, DS, Seq). Table 1 shows an example schedule for the DFG of Fig. 2. The modelled schedule performance is evaluated with several intermediate graphs, as depicted in Fig. 6. First, the DFG G is transformed into Gc. Second, the precedence relations between the tasks are determined with respect to their relative iteration indices (RI). Then, Gp' is derived according to the platform resource constraints.
1 0  0 -1 1 0  0 0 0 1 0  0 0 -1 
Figure 6. Evaluation of modelled performance (flow from the DFG G and the platform to the schedule length)
Figure 7. Gp derived from the example Gc and RI (legend: precedence edges; communication tasks; computation tasks; T_i,j denotes task T_i of iteration j)
3.1.1. Relative iteration index (RI). The tasks in the schedule may belong to different iterations. RI(T) is the relative iteration index of task T. As indicated by RI in the example schedule, 3 successive iterations are overlapped.
3.1.2. Precedence graph (Gp). The precedence relation of the tasks in the schedule, as represented by Gp(V_T ∪ V_CT, Ep), is derived from Gc and RI. For (T_i, T_j) ∈ Ec, (T_i, T_j) ∈ Ep if RI(T_i) = RI(T_j) − d(T_i, T_j). Obviously, Ep is a subset of Ec. Fig. 7 depicts the Gp obtained from the example schedule and the Gc.
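The rule for deriving Ep can be illustrated with a minimal Python sketch. The function name, the dictionary layout and the three-task example are assumptions made for illustration; the example is not the DFG of Fig. 2.

```python
def precedence_edges(ec, ri):
    """Return Ep: the edges of Ec whose endpoints meet in the same loop.

    ec: dict mapping an edge (ti, tj) to its dependence distance d(ti, tj)
    ri: dict mapping a task to its relative iteration index RI(task)
    """
    return {(ti, tj) for (ti, tj), d in ec.items() if ri[ti] == ri[tj] - d}


# Hypothetical cyclic example: C feeds A of the next iteration (distance 1).
ec = {("A", "B"): 0, ("B", "C"): 0, ("C", "A"): 1}

# No overlapping: the loop-carried edge (C, A) drops out of Ep.
ep_flat = precedence_edges(ec, {"A": 0, "B": 0, "C": 0})

# A taken from the next iteration: (C, A) becomes a precedence edge,
# while (A, B) no longer constrains the current loop.
ep_shifted = precedence_edges(ec, {"A": 1, "B": 0, "C": 0})
```

Changing RI therefore changes which edges of Ec constrain the order of tasks inside one loop, which is what the later search phases exploit.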
3.1.3. Map, Seq and DS. Given a schedule, the execution time line is formed by traversing and scheduling the tasks in the order of Seq, which is a topologically ordered sequence that satisfies the precedence relation of Gp. Each resource has a task list to guide its execution. During the scheduling, a computation task T is appended to the task list of processor Map(T). A communication task T is appended to the task lists of the channel resources between the source processor, Map(Producer(T)), and the destination processor, Map(Consumer(T)). If an alternative data source (Data Forwarder [8]) is specified by DS(T), the source is the destination of DS(T). If the data object is already present in the destination processor, T is not scheduled and ET(T) is set to zero. Fig. 8 shows the time line of the example schedule. The resources involved in each scheduling step are tabulated in Table 2.
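A minimal Python sketch of how the task lists could be built from Seq and Map is given here. It assumes that each channel consists of the source and destination processors only (the simplification adopted in Section 4.2), ignores DS, and treats "already present" as "producer and consumer on the same processor". All names and the example are illustrative.

```python
from collections import defaultdict


def build_task_lists(seq, kind, mapping, producer, consumer):
    """Append each task, in Seq order, to the task list of every resource it uses.

    kind:     dict task -> "comp" or "comm"
    mapping:  dict computation task -> processor (Map)
    producer: dict communication task -> producing computation task
    consumer: dict communication task -> consuming computation task
    """
    lists = defaultdict(list)
    for t in seq:
        if kind[t] == "comp":
            resources = [mapping[t]]
        else:
            src, dst = mapping[producer[t]], mapping[consumer[t]]
            if src == dst:
                continue  # data already on the destination: not scheduled
            resources = [src, dst]  # assumed channel: the two processors
        for r in resources:
            lists[r].append(t)
    return dict(lists)


# Hypothetical example: A on P0 sends data to B on P1 through transfer c.
seq = ["A", "c", "B"]
kind = {"A": "comp", "B": "comp", "c": "comm"}
producer, consumer = {"c": "A"}, {"c": "B"}
remote = build_task_lists(seq, kind, {"A": "P0", "B": "P1"}, producer, consumer)
local = build_task_lists(seq, kind, {"A": "P0", "B": "P0"}, producer, consumer)
```

In the remote case the transfer occupies both processors, so it appears in both task lists; in the local case it is skipped.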
3.1.4. Augmented precedence graph (Gp'). The schedule length is the length of the critical path of the augmented precedence graph Gp'(V_T ∪ V_CT, Ep'). For tasks T_i and T_j, (T_i, T_j) ∈ Ep' if (T_i, T_j) ∈ Ep, or T_i = DS(T_j), or T_i is scheduled just before T_j in some resource. The schedule length can be obtained using (1). Using (2), tlevel can be obtained by traversing the tasks in the order of Seq, since Seq satisfies the precedence of Gp'. Similarly, blevel can be found using (3) by traversing the tasks in the reverse order of Seq.
\[ \text{Sch. length} = \max\{\, tlevel(T) + blevel(T) \mid T \in V_T \cup V_{CT} \,\} \tag{1} \]
\[ tlevel(T) = \max\{\, 0,\; tlevel(T_i) + ET(T_i) \mid (T_i, T) \in E_p' \,\} \tag{2} \]
\[ blevel(T) = \max\{\, 0,\; blevel(T_j) \mid (T, T_j) \in E_p' \,\} + ET(T) \tag{3} \]
Assuming that the number of input and output data objects for each computation task and the number of resources in each channel are bounded by constants, it takes constant time to find tlevel and blevel for each task. As there are at most e+v tasks, it takes O(e+v) time to find the schedule length.
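The two passes over Seq can be sketched in Python as follows. This is a minimal illustration of (1)-(3) under the assumption that the edge set of the augmented precedence graph is already available; the function name and the four-task example are hypothetical.

```python
def schedule_length(seq, et, edges):
    """Compute tlevel, blevel and the schedule length from Eqs. (1)-(3).

    seq:   tasks in an order that satisfies the augmented precedence graph
    et:    dict task -> execution time ET(task)
    edges: iterable of (ti, tj) edges of the augmented precedence graph
    """
    preds = {t: [] for t in seq}
    succs = {t: [] for t in seq}
    for ti, tj in edges:
        preds[tj].append(ti)
        succs[ti].append(tj)

    tlevel = {}
    for t in seq:  # Eq. (2): forward pass in the order of Seq
        tlevel[t] = max([0.0] + [tlevel[p] + et[p] for p in preds[t]])

    blevel = {}
    for t in reversed(seq):  # Eq. (3): backward pass in reverse order of Seq
        blevel[t] = max([0.0] + [blevel[s] for s in succs[t]]) + et[t]

    sl = max(tlevel[t] + blevel[t] for t in seq)  # Eq. (1)
    return tlevel, blevel, sl


# Hypothetical example: A feeds B through transfer c, and A also precedes C.
seq = ["A", "c", "B", "C"]
et = {"A": 2, "c": 1, "B": 3, "C": 1}
edges = [("A", "c"), ("c", "B"), ("A", "C")]
tlevel, blevel, sl = schedule_length(seq, et, edges)
```

Each task is visited once per pass and each edge is examined a constant number of times, which matches the O(e+v) bound stated above.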
Figure 8. Execution time line (legend: T_i,j denotes task T_i of iteration j; idling; horizontal axis: time)
3.2. Neighborhood search 
Below are the three phases of search employed. 
3.2.1. Phase NSP-MAP. In this phase, neighbor solutions are obtained by changing the processor mapping of one computation task while keeping the processor mapping of the other tasks fixed. The algorithm cycles through all the computation tasks, evaluating each under different processor mappings. It terminates if no improvement is found for any of the tasks. In the best and worst cases, it requires p and pv evaluations per improvement, respectively. Since each evaluation takes O(e+v) time, the time complexity for finding an improvement is O[pv(e+v)].
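A minimal Python sketch of a remapping loop in the spirit of NSP-MAP follows. It is not the authors' implementation: the acceptance rule, the threshold value eps=1e-3 and the toy evaluation function (independent tasks, no communication) are assumptions made only to keep the example runnable.

```python
def nsp_map(tasks, processors, mapping, evaluate, eps=1e-3):
    """Remap one task at a time; adopt a move if it shortens the schedule.

    evaluate: function mapping -> schedule length (assumed to be supplied)
    eps:      relative decrease counted as an improvement (assumed value)
    """
    mapping = dict(mapping)
    best = evaluate(mapping)
    improved = True
    while improved:  # stop after a full cycle without improvement
        improved = False
        for t in tasks:
            keep = mapping[t]
            for p in processors:
                if p == keep:
                    continue
                mapping[t] = p
                sl = evaluate(mapping)
                if sl < best * (1 - eps):
                    best, keep, improved = sl, p, True
            mapping[t] = keep  # best processor found for t so far
    return mapping, best


# Toy evaluation (assumption): independent tasks, length = largest processor load.
et = {"A": 3, "B": 2, "C": 2}


def toy_evaluate(mapping):
    load = {}
    for t, p in mapping.items():
        load[p] = load.get(p, 0) + et[t]
    return max(load.values())


start = {"A": "P0", "B": "P0", "C": "P0"}
final_map, final_sl = nsp_map(["A", "B", "C"], ["P0", "P1"], start, toy_evaluate)
```

In a full scheduler, `evaluate` would rebuild the task lists and apply (1)-(3) for each candidate mapping.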
3.2.2. Phase NSP-SEQ. This phase searches, for each task T, a new position in Seq that gives the shortest schedule length. Due to precedence constraints, the search starts by shifting T backward from its original position until reaching a predecessor task. If this shifting reaches the head of Seq, T is wrapped around to the end of Seq with RI(T) increased by 1 and Gp updated. Then the shifting continues from the end of Seq. In this way, T is effectively shifted to the previous loop while the instance of T in the next loop is shifted in. After backward shifting, T is shifted forward from its original position until reaching a successor task. Similarly, when T reaches the end of Seq, it is wrapped around to the head with RI(T) decreased by 1 and Gp updated. Wrap around in either direction is not performed if the maximum latency would exceed L. After the shifting, T is moved to the best position.
For each task, there are at most e+v possible positions in Seq and L-1 wrap arounds in both directions. The algorithm terminates when no improvement is found after inspection of all the e+v tasks. As each evaluation takes O(e+v) time, an improvement takes O[L(e+v)^3] time.
This time complexity can be reduced. The idea is to first remove T, then use the level functions of the remaining tasks in the absence of T to find the levels for T at each insertion position in constant time. After removing T, if SL_T < SL × (1 − ε), the algorithm proceeds by shifting T. At each new position of T, SL_NEW is given by
\[ SL_{NEW} = \max\{\, tlevel(T) + blevel(T),\; SL_T \,\} \tag{4} \]
In (4), tlevel(T) and blevel(T) are obtained by (2) and (3). During the shifting, pointers are used to keep the positions of T in the resources where it is scheduled. When T is exchanged with its left (right) neighbor T', those pointers corresponding to resources shared with T' are decreased (increased) by 1. Then the immediate predecessors and successors of T in Gp' can be identified and the levels for T can be obtained in constant time. For each task, evaluation of SL_T, updating of Gp during wrap around and inspection of all positions take O[L(v+e)] time. So, an improvement now takes O[L(v+e)^2] time.
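The elementary moves of NSP-SEQ, including the wrap around at either end of Seq, can be sketched in Python as follows. The sketch only updates Seq and RI; the precedence checks, the latency limit L and the update of Gp are omitted, and the function names are illustrative.

```python
def shift_backward(seq, ri, t):
    """Move t one position toward the head of Seq, wrapping around at the head."""
    seq, ri = list(seq), dict(ri)
    i = seq.index(t)
    if i == 0:
        seq.append(seq.pop(0))  # wrap around to the end of Seq
        ri[t] += 1              # RI(T) increased by 1
    else:
        seq[i - 1], seq[i] = seq[i], seq[i - 1]
    return seq, ri


def shift_forward(seq, ri, t):
    """Move t one position toward the end of Seq, wrapping around at the end."""
    seq, ri = list(seq), dict(ri)
    i = seq.index(t)
    if i == len(seq) - 1:
        seq.insert(0, seq.pop())  # wrap around to the head of Seq
        ri[t] -= 1                # RI(T) decreased by 1
    else:
        seq[i], seq[i + 1] = seq[i + 1], seq[i]
    return seq, ri


# Hypothetical three-task sequence with no overlapping at the start.
seq0, ri0 = ["A", "B", "C"], {"A": 0, "B": 0, "C": 0}
seq1, ri1 = shift_backward(seq0, ri0, "B")  # B swaps with A
seq2, ri2 = shift_backward(seq1, ri1, "B")  # B wraps to the end, RI(B) = 1
seq3, ri3 = shift_forward(seq2, ri2, "B")   # B wraps back to the head, RI(B) = 0
```

A backward wrap followed by a forward wrap restores the earlier sequence and RI, so the two moves are inverses of each other.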
3.2.3. Phase NSP-SEQ-DF. NSP-SEQ may get stuck at a local optimum. We use the degree of freedom as a heuristic function to guide the search, which is defined as
\[ DF(T) = SL - tlevel(T) - blevel(T) \tag{5} \]
The search method is similar to NSP-SEQ, but without checking that SL_T < SL × (1 − ε). During the search, SL may be reduced or unchanged at the new position for T. In order to limit the search time, the algorithm terminates if there is no improvement to SL after cycling through all the tasks n_DF times (n_DF is fixed at 2 in all the tests). Similar to NSP-SEQ, it takes O[L(v+e)^2] time to find an improvement.
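As a small illustration of (5), the following Python sketch computes DF for a hypothetical set of levels. The numbers are invented for the example; how DF is then used to order or select moves is not specified here and is left out.

```python
def degree_of_freedom(tlevel, blevel, sl):
    """DF(T) = SL - tlevel(T) - blevel(T) for every task, as in Eq. (5)."""
    return {t: sl - tlevel[t] - blevel[t] for t in tlevel}


# Hypothetical levels for four tasks with schedule length 6.
tlevel = {"A": 0, "c": 2, "B": 3, "C": 2}
blevel = {"A": 6, "c": 4, "B": 3, "C": 1}
df = degree_of_freedom(tlevel, blevel, 6)

# Tasks with DF = 0 lie on a critical path; C has slack of 3 time units.
critical = [t for t, slack in df.items() if slack == 0]
```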
3.2.4. The overall NSP algorithm. In the overall NSP algorithm, the above 3 phases are cycled repeatedly until all of them fail to reduce SL. At worst, it takes the sum of the worst-case times of the 3 phases to find an improvement. The overall time complexity is O{n_I × [pv(e+v) + L(e+v)^2]}. As n_I ≤ (SL_i/SL* − 1)/ε, this time complexity becomes O{(SL_i/SL* − 1) × [pv(e+v) + L(e+v)^2]/ε}. Fine improvements are ignored with a large ε. With a small ε, the search is likely to give a better result at the cost of a longer search time. In all the tests done, the value of ε is 10^-’, which can be considered a typical value.
4. Experimental results and discussions
4.1. Comparison by random DFG
Comparisons were made with DLS [7] and the GA of [8] since they have a similar model of IPC to our approach. For acyclic DFGs, five tests were done with variation in the parameters as shown in Table 3. In each random DFG, v data objects are added to v computation tasks with the producer and consumer tasks selected randomly. Each computation task has an expected unit execution time selected from the range 0.001 to 1.999 with uniform distribution. Each data object has a size also from this range but post-scaled by the desired CCR. In each test, 50 DFGs were used, i.e. a total of 250 DFGs in the 5 tests.
Table 3. Test parameter setting
Test  v         CP     N_SHARE  p     CCR
1     100-1000  5      5        10    0.2
2     300       10-55  5        10    0.2
3     300       5      1-19     10    0.2
4     300       5      5        2-20  0.2
5     300       5      5        10    0.025-12.8
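The random DFG construction described above can be sketched in Python as follows. Several details are assumptions rather than the authors' procedure: acyclicity is enforced by picking consumers with a higher index than the producer, the number of consumers is drawn uniformly from 1 to N_SHARE, "post-scaled" is read as multiplication by CCR, and the control of CP is not modelled.

```python
import random


def random_dfg(v, n_share, ccr, seed=None):
    """Generate v computation tasks and v data objects (requires v >= 2).

    Returns (exec_time, data), where exec_time[i] is the execution time of
    task i and each data object records its producer, consumers and size.
    """
    rng = random.Random(seed)
    exec_time = [rng.uniform(0.001, 1.999) for _ in range(v)]
    data = []
    for _ in range(v):
        producer = rng.randrange(v - 1)      # leaves at least one later task
        later = range(producer + 1, v)       # assumed rule to keep the DFG acyclic
        k = min(rng.randint(1, n_share), len(later))
        consumers = sorted(rng.sample(later, k))
        size = rng.uniform(0.001, 1.999) * ccr  # assumed meaning of post-scaling
        data.append({"producer": producer, "consumers": consumers, "size": size})
    return exec_time, data


# One instance with N_SHARE = 5 and CCR = 0.2, using a small v for illustration.
exec_time, data = random_dfg(v=20, n_share=5, ccr=0.2, seed=1)
```

Fixing the seed makes a test suite reproducible, so that different schedulers can be compared on the same set of graphs.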
4.1.5. Test 5. In Fig. 13, the improvement of NSP over GA and DLS increases at first and reaches about 32% and 45% respectively at CCR=0.4. This substantial improvement reflects the better IPC scheduling given by NSP.
4.1.6. Execution time. The execution time of the algorithms was measured on a Pentium® II 350 MHz system. For the case of CP=55, NSP starts from an initial solution from DLS, which takes 0.16 second for scheduling. As depicted in Fig. 14, GA attains a steady value after about 10 minutes, while the NSP phases show stepwise drops and stop at about 3.5 minutes. This shows that NSP gives a substantially better schedule in a comparatively short time.
Figure 9. Test 1 (horizontal axis: No. of computation tasks (v))
Figure 10. Test 2 (horizontal axis: CP)
Figure 11. Test 3
Figure 12. Test 4 (horizontal axis: No. of processors (p))
Figure 13. Test 5 (horizontal axis: CCR)
Figure 14. Running time (horizontal axis: elapsed scheduling time (min))
4.2. Application to video encoding 
An H.261 [12] video encoder was implemented and comparisons were made with GA and the Multiple Master Multiple Slave (MMMS) solution [13]. DLS was not compared since it cannot be applied to a cyclic DFG. The encoding algorithm is represented by the cyclic iterative DFG of Fig. 15, where the macroblock (MB) is the basic unit of data decomposition. The platform used is the IBM SP2 at the University of Hong Kong, which is composed of 48 160 MHz IBM P2SC RISC processors connected by the High Performance Switch (HPS) with a point-to-point bandwidth of 105 MBytes/s and a latency of 27.5 µsec. Blocking send and receive operations of the Message Passing Library were used. All the task execution times were measured using gettimeofday. For simplicity, we assume that the HPS is a completely connected switch such that each IPC channel is composed of the source and destination processors only. The video tested consists of 50 frames of 352x240-pixel resolution. It shows a table tennis game that involves a zooming view. Each test was repeated 8 times and the average frame rates were taken.
Figure 15. DFG for video coding (labels: the decoded MB generated and its 8 neighbouring decoded MBs, dependence distance = 1; bit-stream; legend: data object, computation)
Figure 16. Frame rate (horizontal axis: No. of processors)
Figure 17. Speedup (horizontal axis: No. of processors)
4.2.1. Results & discussions
As depicted in Fig. 16, NSP gives about 31 frames/sec at p=24, which is about 2 times that of GA and 37% better than MMMS. From Fig. 17, the speedup of GA tends to level off at about 6 for p over 20, while both MMMS and NSP show an increasing trend and reach about 9 and 12 respectively at p=24. NSP shows a curve closer to linear than MMMS. In fact, MMMS is the product of manual optimization by experience, while NSP is an automatic scheduler for arbitrary DFGs and platforms.
Owing to variation in message transfer time, deviation from the predicted performance is observed. Firstly, the HPS has message buffers so that the sender can complete earlier. Secondly, network congestion may cause the transfer time to be longer. In our case, the first effect dominates and the message transfer time is generally shorter than predicted. Furthermore, there is variation in the MB encoding time depending on the video content. Since the schedules were generated based on a frame with above-average encoding time, the result is better than predicted. Moreover, the latency from input frame to output bit-stream is only 3 frame encoding times, which suits on-line applications such as video conferencing.
5. Conclusions 
In this paper, a scheduling algorithm for heterogeneous multiprocessor systems is presented. First, a flexible representation scheme is used so that communication scheduling can be done in a generic way. Second, loop pipelining is used to exploit parallelism between iterations. Third, an efficient technique is incorporated into the search that reduces the time complexity by an order of magnitude. Fourth, experimental comparisons were made with DLS and a GA using different suites of random DFGs with variations in different parameters, including the effect of data sharing. Finally, the method is verified by an actual implementation of a video encoder in which over 30 frames/sec is obtained using 24 processors.
APPENDIX. Definition of symbols
DFG G(V_T ∪ V_D, E_TD ∪ E_DT):
V_T, V_D: The sets of computation tasks and data objects.
E_TD, E_DT: The sets of directed edges from V_T to V_D and vice versa.
d(D): Dependence distance of data object D.
v, e: The number of computation tasks and edges in G.
CCR: Mean ratio of communication to computation time.
CP: Critical path length of G ignoring IPC.
N_SHARE: Max. no. of consumer tasks sharing each data object.
Communication task graph Gc(V_T ∪ V_CT, Ec):
V_CT: The set of communication tasks.
Platform model:
S_R, S_P: The sets of resources and processors, S_P ⊂ S_R.
p: The number of processors.
Precedence graph Gp(V_T ∪ V_CT, Ep):
Ep: The set of directed edges in Gp.
Augmented precedence graph Gp'(V_T ∪ V_CT, Ep'):
Ep': The set of directed edges in Gp'.
Solution characterization:
RI(T): The relative iteration index of T.
L: Max. latency from input to output in terms of ...
SL: The current schedule length.
SL_T, SL_NEW: The schedule length when T is removed and re-inserted in a new position, respectively.
SL*, SL_i: The optimal and the initial schedule lengths.
DF(T): Degree of freedom of T.
n_DF: No. of times that all the tasks are cycled since the last improvement before the search stops.
ε: Percentage decrease in SL that is counted as an improvement step.
n_I: Total number of improvement steps.
ACKNOWLEDGMENT. The authors would like to express 
their gratitude to the Computer Center at the University of 
Hong Kong for their support of the IBM SP2 system. 
REFERENCES 
[1] M. R. Garey, D. S. Johnson, Computers and Intractability, A Guide to the Theory of NP-Completeness, W. H. Freeman and Co., 1979.
[2] T. Yang, A. Gerasoulis, “DSC: Scheduling Parallel Tasks on an Unbounded Number of Processors”, IEEE Trans. Parallel Distrib. Syst. 5, No. 9 (Nov. 1994), 951-967.
[3] S. Darbha, D. P. Agrawal, “Optimal Scheduling Algorithm for Distributed-Memory Machines”, IEEE Trans. Parallel Distrib. Syst. 9, No. 1 (Jan. 1998), 87-95.
[4] H. El-Rewini, “Scheduling Parallel Program Tasks onto Arbitrary Target Machines”, J. Parallel Distrib. Comput. 9, No. 2 (June 1990), 138-153.
[5] S. Sriram, E. A. Lee, “Statically Scheduling Communication Resources in Multiprocessor DSP Architectures”, Conf. Rec. of 28th Asilomar Conf. on Signals, Systems and Computers 2, 1994, 1046-1051.
[6] S. Sriram, E. A. Lee, “Design and Implementation of an Ordered Memory Access Architecture”, Proc. of the International Conference on Acoustics Speech and Signal Processing, Apr. 1993, pp. 1345-1348.
[7] G. C. Sih, E. A. Lee, “A Compile-Time Scheduling Heuristic for Interconnection-Constrained Heterogeneous Processor Architectures”, IEEE Trans. Parallel Distrib. Syst. 4, No. 2 (Feb. 1993), 175-187.
[8] L. Wang, H. J. Siegel, V. P. Roychowdhury, A. A. Maciejewski, “Task Matching and Scheduling in Heterogeneous Computing Environments Using a Genetic-Algorithm-Based Approach”, Journal of Parallel & Distributed Computing 47, No. 1 (Nov. 1997), 8-22.
[9] L. F. Chao, A. S. LaPaugh, E. H. M. Sha, “Rotation Scheduling: A Loop Pipelining Algorithm”, IEEE Trans. on Computer-Aided Design of Integrated Circuits and Systems 16, No. 3 (Mar. 1997), 229-239.
[10] S. Tongsima, E. H. M. Sha, N. L. Passos, “Communication-Sensitive Loop Scheduling for DSP Applications”, IEEE Trans. on Signal Processing 45, No. 5 (May 1997), 1309-1322.
[11] Y. K. Kwok, I. Ahmad, “Bubble Scheduling: A Quasi Dynamic Algorithm for Static Allocation of Tasks to Parallel Architectures”, Proc. of 7th Symp. on Parallel & Dist. Proc., Oct. 1995, pp. 36-43.
[12] “ITU-T recommendation H.261: video codec for audiovisual services at p x 64 kbit/s”, ITU, 1990.
[13] N. H. C. Yung, K. K. Leung, “Parallelization of the H.261 video coding algorithm on the IBM SP2 multiprocessor system”, Proc. of 3rd ICA3PP-97, Dec. 1997, 571-578.
