Directed taskgraph scheduling using simulated annealing by D'Hollander, Erik & Devis, Yves
1991 International Conference on Parallel Processing 
Directed Taskgraph Scheduling 
Using Simulated Annealing 
Erik H. D'Hollander and Yves Devis 
Department of Electrical Engineering 
State University of Ghent 
B-9000 Ghent, Belgium 
Abstract 
Simulated annealing is recognized as a novel method to 
optimize the load in multicomputer systems, subject to 
the interprocessor communication overhead. Recently, 
highly nonlinear mapping and load balancing of undi-
rected taskgraphs has been solved in a successful way. In 
this paper the scope is extended to directed taskgraphs, 
representing the data and control dependencies in com-
mon programs. The annealing algorithm operates in stages. 
In each stage an annealing packet of ready tasks is formed 
and the tasks are allocated to the idle processors. The cost 
function is based on the priority level of the tasks in the 
taskgraph and the intertask communication requirements. 
The resulting schedules of four programs on three architectures
show a significant speedup improvement compared
to the Highest Level First list algorithm. 
1 Introduction 
For the execution on a multiprocessor a program is par-
titioned into tasks and these tasks are allocated to the 
available processors. The scheduling process must pursue
two conflicting objectives: to maximize the processor
utilization and to minimize the interprocessor communication.
This problem is known to be NP-complete and a
solution is approximated by suboptimal heuristics such as
the well-known Highest Level First (HLF) list algorithm
[7,1,9]. These algorithms have proven adequate when the 
communication overhead is moderate, e.g. for strongly 
coupled shared memory multiprocessor systems. 
Since the revival of neural networks several assignment 
problems have been addressed by simulated annealing. In 
these problems an undirected taskgraph is mapped onto a 
machine graph. An undirected taskgraph represents a set 
of communicating tasks without precedence constraints. 
The cost function aims to balance the load subject to a 
minimal communication overhead. 
In this paper a directed graph is scheduled by simulated 
annealing. The algorithm performs a load balance and 
minimizes the communication, subject to the precedence 
constraints between the tasks. 
1 This research was supported by the Belgian Ministry of Science, 
under the contract OOA~87/93-117. 
In particular the algorithm takes into account the changing
communication patterns during the execution of the taskgraph.
The performance of simulated annealing scheduling
has been measured by simulating the execution of four
programs on three different multicomputer topologies. In
all cases where communication is taken into account,
simulated annealing outperformed the best list algorithm. 
2 Definitions and Notations 
Host Configuration 
Consider a distributed processing system HC = {P, L}
consisting of a set of processors P and an interconnection
network L. The N_p processors are represented by
P = {P_i, i = 1, ..., N_p}. The network topology is described
by the processor interconnection matrix L, where
l_ij = 1 indicates the presence of a point-to-point link between
two processors P_i and P_j. This includes a bus (star),
a hypercube or a ring network. The distance d(i,j) between
two processors equals the number of links on the
shortest path joining the processors P_i and P_j. The links 
are bidirectional (L is symmetrical), have a bandwidth 
BW (Mbits per second) and can carry only one message 
at a time. It is assumed that incoming messages preempt 
an active processor. 
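The distance d(i,j) follows from the interconnection matrix by a breadth-first search. The sketch below is only an illustration under stated assumptions: the helper names (`ring_links`, `hypercube_links`, `distances`) are hypothetical and do not come from the paper, and processors are numbered from 0 instead of 1.

```python
from collections import deque

def ring_links(n):
    """Symmetric link matrix of an n-processor ring (hypothetical helper)."""
    links = [[0] * n for _ in range(n)]
    for i in range(n):
        j = (i + 1) % n
        links[i][j] = links[j][i] = 1
    return links

def hypercube_links(dim):
    """Symmetric link matrix of a hypercube with 2**dim processors."""
    n = 1 << dim
    links = [[0] * n for _ in range(n)]
    for i in range(n):
        for bit in range(dim):
            links[i][i ^ (1 << bit)] = 1   # neighbors differ in exactly one bit
    return links

def distances(links):
    """dist[i][j] = number of links on the shortest path from i to j (BFS)."""
    n = len(links)
    dist = [[None] * n for _ in range(n)]
    for src in range(n):
        dist[src][src] = 0
        queue = deque([src])
        while queue:
            u = queue.popleft()
            for v in range(n):
                if links[u][v] and dist[src][v] is None:
                    dist[src][v] = dist[src][u] + 1
                    queue.append(v)
    return dist
```

For the 8-processor hypercube and the 9-processor ring used later in the experiments, the largest distances under this model are 3 and 4 links respectively.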
Taskgraph 
The program is partitioned into a directed taskgraph TG =
{T, R, W, <*}. This quadruple consists of the set of tasks
T = {t_i, i = 1, ..., N_T}, the load requirements R = {r_i}, the
communication weights W = {w_ij} and the precedence
constraints <*. The nodes t_i represent tasks and have
an estimated CPU load r_i. The edges are labeled with
weights w_ij, indicating the communication time between
task t_i and task t_j. The relation t_i <* t_j indicates that t_j
must start after the termination of t_i; t_i is a predecessor
of t_j and t_j is a successor of t_i. 
Simulated Annealing 
With the arrival of neural networks, statistical methods 
have gained success in the area of highly complex and 
combinatorial optimization problems with many interacting
variables [10]. Most of these problems are NP-complete, 
and require ingenious heuristic approaches. Yet often the 
heuristics are trapped in local minima of the multidimen-
sional cost surface. Simulated annealing is able to over-
come this barrier by statistical hill climbing. Instead of 
following a steepest descent trajectory, the path is per-
turbed by random walks with a decreasing probability. 
The minimization process is controlled by a cooling tem-
perature which makes the trajectory evolve from a purely 
random walk towards a deterministic path. The idea is 
to find the global minimum by escaping the local cavities 
during the cooling process. 
The simulated annealing technique is governed by the following
components: the mapping function, the cost function,
the mapping scheme, and the cooling function. 
The mapping function m : T → P assigns the tasks to
the processors, such that P_k = m(t_j) if task t_j is allocated
onto processor P_k. The cost function F(m) measures the
quality of the mapping with respect to the load balance
and the communication overhead. 
The mapping scheme randomly redistributes the allocation
of tasks to processors, thereby producing a new mapping
function m'. The simulated annealing process will
accept the new mapping m' depending on the cost F(m')
and the temperature Temp, with a probability

$$B(F, \mathit{Temp}) = \frac{1}{1 + e^{F/\mathit{Temp}}} \qquad (1)$$

where 0 ≤ Temp ≤ ∞. For the extreme values Temp = 0
and Temp = ∞, the mapping m' is accepted with the
following probabilities:

$$B(F, \infty) = 0.5, \qquad B(F, 0) = \begin{cases} 1 & \text{if } F < 0 \text{ (accept move)} \\ 0 & \text{if } F \geq 0 \text{ (reject move)} \end{cases} \qquad (2)$$
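A minimal sketch of the acceptance rule of equations (1) and (2) is given below. It assumes that F is read as the cost change F(m') - F(m) of the proposed move, which the text does not state explicitly; the function name `acceptance_probability` is hypothetical. The two branches only serve to evaluate the same expression without overflowing the exponential.

```python
import math

def acceptance_probability(delta_f, temp):
    """B(F, Temp) = 1 / (1 + exp(F / Temp)), with F taken as a cost change."""
    if temp == 0:
        return 1.0 if delta_f < 0 else 0.0      # limit of equation (2)
    if math.isinf(temp):
        return 0.5                              # purely random acceptance
    x = delta_f / temp
    if x >= 0:
        e = math.exp(-x)                        # e <= 1, cannot overflow
        return e / (1.0 + e)                    # equals 1 / (1 + exp(x))
    return 1.0 / (1.0 + math.exp(x))            # exp(x) < 1 here
```

At a high temperature almost any move is accepted half of the time, while at a low temperature only cost-decreasing moves pass.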
The cooling function generates a sequence of temperatures 
Temp_i, varying from ∞ (an arbitrary acceptance) to 0 (a 
deterministic acceptance). The cooling policy influences 
the convergence speed and the quality of the obtained 
solution. 
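The paper does not specify its cooling policy. Purely as an illustration, the sketch below uses geometric cooling, a common choice in the simulated annealing literature; the function name and the parameters `temp_start` and `alpha` are assumptions, not values taken from the source.

```python
def geometric_cooling(temp_start, alpha, n_steps):
    """Yield n_steps temperatures: temp_start, temp_start*alpha, ...

    Assumed schedule (not from the paper); requires 0 < alpha < 1.
    """
    if not 0.0 < alpha < 1.0:
        raise ValueError("alpha must lie strictly between 0 and 1")
    temp = temp_start
    for _ in range(n_steps):
        yield temp
        temp *= alpha
```

A value of `alpha` close to 1 cools slowly and explores more mappings, at the price of more iterations per annealing run.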
3 Related Work 
The assignment problems solved using simulated annealing
differ by the assumptions on the host architecture, 
the taskgraph and the cost function. Depending on the 
number of tasks and processors, the following assignment 
schemes were investigated: 
• The mapping problem [3]: N_T ≤ N_p, <* = ∅. 
• The balancing problem [8]: N_T > N_p, <* = ∅. 
• The scheduling problem (this paper): N_T > N_p, <* ≠ ∅. 
In the mapping problem [3], Bollinger and Midkiff map
an undirected taskgraph on the host architecture. There
is at most one task per processor and the objective is to
minimize simultaneously the total communication and the
maximal point-to-point communication on a single link.
The authors take into account the communication weights
between the tasks, W, and allow arbitrary routing. The
simulated annealing approach made it possible to adopt a
more realistic communication model than the one used in
other approaches [2,11], except for the condition N_T ≤ N_p. 
Hwang and Xu removed the restriction on the number of
tasks in the balancing problem [8]. With more tasks than
processors there are two objectives: to balance the load
and to minimize the interprocessor communication. The
cost function therefore has a balance term and a communication
term. The balance term sums the absolute deviation
from the average processor load and the communication
term sums the traffic on the interprocessor links. 
In the balancing problem it is assumed that all modules 
execute concurrently and communicate during the whole 
execution of the program. While this is true when the 
modules are independent (e.g. production systems), in 
many partitioned programs data and control dependen-
cies create precedence constraints. In this case one has a 
scheduling problem. A load balancing scheme which takes 
into account the precedence rules to solve the scheduling 
problem is presented in this paper. 
4 The Scheduling Problem 
4.1 Annealing Packets 
In programs characterized by a directed taskgraph, the
communication and the load patterns vary widely during
the execution time, invalidating the assumptions of
the balancing problem. We solve the scheduling problem
by creating annealing packets at discrete assignment
epochs. The first epoch is at time zero and successive
epochs occur when one or more processors become idle.
An annealing packet contains the ready tasks and the idle
processors. The ready tasks have no unfinished predecessors.
At each epoch a simulated annealing process maps
the tasks of one packet onto the processors. Unassigned
tasks are moved to the following annealing packet and
new annealing packets are created until all tasks are assigned.
The tasks compete for an assignment based on
their priority and on the communication overhead with
the other tasks. 
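The construction of one annealing packet can be sketched as follows. This is an illustration only: the function names and the dictionary-of-predecessors representation are assumptions made for the example, not the authors' data structures.

```python
def ready_tasks(predecessors, finished, assigned):
    """Tasks not yet assigned whose predecessors have all finished."""
    return sorted(
        task for task, preds in predecessors.items()
        if task not in assigned and all(p in finished for p in preds)
    )

def annealing_packet(predecessors, finished, assigned, idle_processors):
    """Packet formed at one assignment epoch: (ready tasks, idle processors)."""
    return ready_tasks(predecessors, finished, assigned), sorted(idle_processors)
```

With `predecessors = {"a": [], "b": [], "c": ["a"], "d": ["a", "b"]}`, the packet at time zero holds tasks `a` and `b`; task `d` only becomes ready after both have finished.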
4.2 Cost Function 
The cost function consists of a load balancing term and a 
communication term. 
a) Load Balancing Cost 
The critical path of a directed taskgraph consists of the 
longest chain joining the root task and a leaf task. In 
order to minimize the execution time, the cost function 
must encourage the assignment of tasks on the critical 
path. Therefore tasks are given a priority measured by 
the tasklevel [4]. The level n_i of a task t_i equals the accumulated
execution time of every task on the longest path 
connecting ti with a leaf task. In other words, in a system 
with an arbitrary number of processors and no commu-
nication overhead, the tasklevel represents the minimal 
remaining execution time when the task is started. The 
annealing process should favor the selection of high-level 
tasks. This is realized using the following load balancing 
cost function 
$$F_b = -\sum_{i=1}^{N} n_i \, s(i) \qquad (3)$$

where N is the number of tasks in the annealing packet, and
s(i) = 1 when task t_i is selected, else s(i) = 0. Minimizing this
function corresponds to assigning first the highest level
tasks of the annealing packet. 
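The level definition and equation (3) can be sketched as below. The names `task_levels` and `load_balancing_cost` and the dictionary representation are assumptions for the example; the recursion is adequate for small graphs, while a very deep taskgraph would call for an explicit reverse topological traversal.

```python
from functools import lru_cache

def task_levels(duration, successors):
    """Level n_i = r_i plus the largest level among the successors of t_i.

    A leaf task has level r_i, so n_i is the length of the longest
    path from t_i to a leaf, including t_i itself.
    """
    @lru_cache(maxsize=None)
    def level(task):
        nxt = successors.get(task, ())
        return duration[task] + max((level(s) for s in nxt), default=0)
    return {task: level(task) for task in duration}

def load_balancing_cost(levels, selected):
    """F_b = -(sum of n_i over the selected tasks), equation (3)."""
    return -sum(levels[task] for task in selected)
```

For durations `{"a": 2, "b": 3, "c": 4, "d": 1}` with `a` preceding `c` and `d`, and `b` preceding `d`, the levels are 6, 4, 4 and 1, so selecting `a` and `b` gives a cost of -10.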
b) Interprocessor Communication Cost 
Two parameters characterize the cost of sending a message
between processors P_i and P_j: σ, the time to forward
one message, and τ, the time to receive or to route
one message. These parameters account for the following
events: the context switches (S) to save and restore the
processor state, the output setup (O) to prepare the I/O
hardware and the header control (H) to determine if an
incoming message needs to be routed to other processors.
With these parameters, one has

$$\sigma = 2S + O, \qquad \tau = 2S + H + O$$

For the bit-serial linked hypercube processor systems the
parameters were set to O = 3 μs, S = H = 2 μs; this gives
σ = 7 μs and τ = 9 μs. 
On a connection link of BW bits per second, the time to
carry a message of length L over a path between processors
i and j equals

$$w_{ij} = \frac{L}{BW}$$
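As a worked example, assume a message that carries a single 40-bit variable over the 10 Mb/s link used in the experiments of Section 6 (the one-variable message is an assumption made for illustration):

$$w_{ij} = \frac{40\ \text{bit}}{10 \times 10^{6}\ \text{bit/s}} = 4\ \mu\text{s}$$

This is of the same order as the average communication times listed in Table 1.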
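The effective communication cost c_ij to send a message of
weight w_ij between tasks t_i and t_j located at processors
m_i = m(t_i) and m_j = m(t_j) respectively is

$$c_{ij} = w_{ij} d_{ij} + (d_{ij} - 1 + \delta_{m_i m_j}) \tau + (1 - \delta_{m_i m_j}) \sigma \qquad (4)$$

where d_ij is the distance between the processors m_i and m_j
and δ is the Kronecker delta. The communication cost 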
has three parts. 
1. The distance-volume product measures the communication
time on the links connecting the two processors
m(t_i) and m(t_j). 
2. The intermediate processors contribute by routing 
the message. This term vanishes if t_i and t_j are 
located on neighboring processors. 
3. The third term represents the extra cost to set up 
a communication link. This term vanishes if both 
tasks reside on the same processor. 
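The three parts of equation (4) can be checked with a short sketch. The function name `communication_cost` is hypothetical; the default values of σ and τ are the 7 μs and 9 μs quoted above, and the code assumes that two tasks share a processor exactly when the distance is zero.

```python
SIGMA = 7.0  # time to forward one message, in microseconds (value from the text)
TAU = 9.0    # time to receive or route one message, in microseconds

def communication_cost(w_ij, d_ij, sigma=SIGMA, tau=TAU):
    """c_ij of equation (4) for message weight w_ij and distance d_ij."""
    same = 1 if d_ij == 0 else 0   # Kronecker delta of the two processors
    return w_ij * d_ij + (d_ij - 1 + same) * tau + (1 - same) * sigma
```

For a 4 μs message the cost is 0 on the same processor, 4 + 7 = 11 μs between neighbors (no routing term), and 12 + 18 + 7 = 37 μs over three links.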
The communication cost of the annealing packet is defined 
(5) 
c) Normalized cost function 
For different architecture graphs and taskgraphs, the load
and communication terms can vary widely. A simple addition
of the load balancing and the communication terms
could let one cost dominate and effectively discard the
other. Therefore the load balancing and the communication
costs are normalized, each by its proper range. The range
of the balancing term is

$$\Delta F_b = (\mathit{Max} - \mathit{Min}) / N_{idle}$$

where Max and Min represent the cumulative level values
when the N_idle free processors would execute the tasks
with the highest or the lowest levels respectively. The
communication range is obtained by placing the tasks
with the highest communication at the largest distance,
giving an estimate of the maximum communication cost,
ΔF_c. 
The cost function is a weighted sum of the normalized
communication and load balancing terms,

$$F(m) = w_c \frac{F_c}{\Delta F_c} + w_b \frac{F_b}{\Delta F_b} \qquad (6)$$

This function minimizes the communication and balances
the load, while the weight factors w_b and w_c make it
possible to emphasize one or the other element in the cost
function. They are chosen such that w_b + w_c = 1 and can be
tuned to optimize the allocation for the highest speed-up. 
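The normalization and the weighted sum of equation (6) might be coded as in the sketch below. The function names are hypothetical, and the handling of a zero range is an assumption of this example, since the paper does not say what happens when all candidate levels are equal.

```python
def level_range(packet_levels, n_idle):
    """Delta F_b = (Max - Min) / N_idle over the levels of the ready tasks."""
    if n_idle < 1 or not packet_levels:
        raise ValueError("need at least one idle processor and one ready task")
    ordered = sorted(packet_levels)
    k = min(n_idle, len(ordered))
    # Max: the k highest levels; Min: the k lowest levels.
    return (sum(ordered[-k:]) - sum(ordered[:k])) / n_idle

def total_cost(f_c, f_b, range_c, range_b, w_c=0.5):
    """F(m) = w_c * F_c / dF_c + w_b * F_b / dF_b, with w_b = 1 - w_c."""
    w_b = 1.0 - w_c
    # Assumption (not in the source): a term with zero range cannot
    # discriminate between mappings, so it is dropped rather than divided by 0.
    comm = f_c / range_c if range_c else 0.0
    load = f_b / range_b if range_b else 0.0
    return w_c * comm + w_b * load
```

With ready-task levels 6, 4, 4, 1 and two idle processors, the range is (10 - 5) / 2 = 2.5, so a level cost of -6 contributes -2.4 before weighting.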
5 Annealing Algorithm 
For notational convenience, we introduce the following 
abbreviation for the mapping function: m_i = m(t_i). 
Until all tasks t_i ∈ T are assigned, do: 
1. Assemble an annealing packet (AP) consisting of 
the free processors and the ready tasks (i.e. tasks 
without unfinished predecessors). 
2. for cooling temperatures Temp_k, k = 1, ..., N_I, until 
convergence or until exceeding the maximum number 
of iterations N_I, do: 
(a) Arbitrarily select a task t_i and a processor P_j, 
where P_j ≠ m_i. 
• If processor P_j is idle, assign t_i to P_j (possibly 
by removing t_i from another processor): 
m_i := P_j; 
• If processor P_j is busy executing t_j ∈ AP, 
exchange t_i and t_j: m_i := P_j, m_j := P_i. 
(b) Accept the assignment with a probability given 
by the Boltzmann function B(F, Temp_k) (equation 1). 
endfor 
3. Repeat from (1) if not all tasks are assigned. 
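Step 2 of the algorithm, for a single annealing packet, can be sketched as follows. This is an illustrative reading, not the authors' implementation: all names are hypothetical, the cost function is passed in by the caller, F in equation (1) is taken as the cost change of the move, and an unassigned task that displaces another one is assumed to send that task back to the unassigned pool.

```python
import math
import random

def accept_probability(delta, temp):
    """Equation (1), with F read as the cost change of the proposed move."""
    if temp <= 0:
        return 1.0 if delta < 0 else 0.0
    x = delta / temp
    if x > 700:                       # exp() would overflow; probability ~ 0
        return 0.0
    return 1.0 / (1.0 + math.exp(x))

def propose(mapping, tasks, processors, rng):
    """Step (a): pick a task and a different processor, then move or exchange."""
    task = rng.choice(tasks)
    candidates = [p for p in processors if mapping.get(task) != p]
    if not candidates:
        return None
    proc = rng.choice(candidates)
    trial = dict(mapping)
    holder = next((t for t, p in trial.items() if p == proc), None)
    old = trial.pop(task, None)       # processor previously holding the task
    if holder is not None:
        if old is None:
            del trial[holder]         # assumption: holder becomes unassigned
        else:
            trial[holder] = old       # exchange the two tasks
    trial[task] = proc
    return trial

def anneal_packet(tasks, processors, cost, temperatures, seed=0, patience=5):
    """Return {task: processor}; tasks left out stay unassigned."""
    rng = random.Random(seed)
    mapping, current, unchanged = {}, cost({}), 0
    for temp in temperatures:
        previous = current
        trial = propose(mapping, tasks, processors, rng)
        if trial is not None:
            new = cost(trial)
            if rng.random() < accept_probability(new - current, temp):   # step (b)
                mapping, current = trial, new
        # Stop when the cost has stayed constant for `patience` iterations.
        unchanged = unchanged + 1 if current == previous else 0
        if unchanged >= patience:
            break
    return mapping
```

The default `patience=5` mirrors the stopping rule reported in the experiments (cost constant for five iterations); the length of `temperatures` plays the role of the maximum number of iterations.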
6 Experimental Results 
Four programs were scheduled on three different multi-
computer architectures. In each case the execution was 
simulated to record the achieved speedup. The performance
analysis [5] covers both the simulated annealing
process and the speedup improvement over scheduling by
the HLF list algorithm. 
The scheduled programs are: 
1. Newton-Euler Inverse Dynamics for robot control 
(NE) 
2. Gauss-Jordan linear system solver (GJ) 
3. Matrix multiply (MM) 
4. Fast Fourier Transform (FFT) 
The programs GJ, FFT and MM are partitioned into vector
operations and the NE program consists of scalar operations.
The taskgraph characteristics are given in Table 
1. The communication time is calculated for a 10Mb/s 
link between two processors and 40 bit data per variable. 
Extra communication overhead occurs due to the cost to 
send, route and receive the messages (equation 4). 
The taskgraphs were mapped onto the following architec-
tures: 
1. A Hypercube with 8 processors 
2. A Bus (star) topology with 8 processors 
3. A Ring topology with 9 processors 
a) Annealing Process 
Figure 1 shows the trajectories of the level, the
communication and the total cost, F_b, F_c and
F_tot (equations 3, 5, 6) of one annealing packet in the
Newton-Euler problem. It can be seen that the annealing
process decreases both the balancing and the communication
costs. The program contains 95 tasks, which are
assigned in 65 annealing packets. On average there
are 15 candidates for 1.46 free processors. The annealing
stops when the cost function remains constant for five
iterations, or when a preset maximum number of iterations
is reached. 
b) Speedup 
To estimate the speedup improvement over
a heuristic task placement by the Highest Level First
(HLF) algorithm, a simulation program was developed
which accurately records the execution and interprocessor
communication. Figure 2 shows the start of the Newton-Euler
program partitioned on an 8 processor hypercube.
Furthermore Table 2 gives the speedups for both the simulated
annealing and the heuristic HLF algorithm. 
These results give rise to two observations. First, when
the communication is not taken into account, simulated
annealing gives the same or slightly better results than
the HLF algorithm. This occurs despite the fact that an
extensive statistical comparison of various list algorithms
indicates that the HLF generated schedules remain within
5% of the optimal solution in all but one of 900 randomly
generated taskgraphs [1]. Moreover we observed that the
SA algorithm is able to optimally solve the Graham list
scheduling anomalies [6]. Second, when the communication
is taken into account, the simulated annealing
algorithm outperforms the HLF algorithm by 3.5 to 52.8%.
This reveals that the algorithm is a worthwhile alternative
to the arbitrary placement of the HLF tasks when the
interprocessor communication is not negligible. 
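The percentages quoted here appear to be relative speedup improvements. Assuming that reading of the "% gain" column of Table 2 (the paper does not give the formula), the gain is

$$\text{gain} = 100 \left( \frac{(S_p)_{SA}}{(S_p)_{HLF}} - 1 \right) \%$$

For example, the Matrix Multiply entries for the hypercube with communication give $100\,(6.11 / 5.19 - 1) \approx 17.7\,\%$, and the Newton-Euler entries for the ring give $100\,(5.5 / 3.6 - 1) \approx 52.8\,\%$.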
7 Conclusion 
In recent years, simulated annealing has been recognized 
as a novel method to balance the load in loosely coupled 
multicomputers. We extended the use of simulated annealing
to the scheduling of directed taskgraphs. This
implies minimizing the communication and balancing the
processor load, while preserving the data and control
dependent precedence constraints. The results indicate that
the presented algorithm is able to improve the speedup in
real program taskgraphs by up to 52.8% over the HLF
algorithm. 
References 
[1] Adam T.L., Chandy K.M., Dickinson J.R., A comparison
of list schedules for parallel processing systems,
Communications of the ACM 17, 12, 685-690, 1974 
[2] Bianchini R.P., Shen J.P., Interprocessor Traffic
Scheduling Algorithm for Multiple-Processor Networks,
IEEE Trans. on Computers 36, 4, 396-409, 1987 
[3] Bollinger S. Wayne, Midkiff Scott F., Processor and
Link Assignment in Multicomputers using Simulated
Annealing, Proceedings of the Intl. Conf. on Parallel
Processing '88, I - Architecture, pp. 1-6, 1988 
[4] Coffman E.G. Jr. (Ed.), Computer and Job-Shop
Scheduling Theory, J. Wiley and Sons, New York,
1976 
[5] Devis Yves, Process allocation in a distributed computer
system using a neural model, MS. Thesis, State
Univ. of Ghent, Report LEM-T9021, 1990 
[6] Graham R.L., Bounds on Certain Multiprocessing
Anomalies, SIAM Journal on Applied Mathematics,
Vol. 17, 2, pp. 416-429, 1969 
[7] Hu T.C., Parallel sequencing and assembly line problems,
Op. Res., 9, 6, 841-848, 1961 
[8] Hwang K., Xu Jian, Mapping Partitioned Program
Modules onto Multicomputer Nodes Using Simulated
Annealing, Proceedings of the Intl. Conf. on Parallel
Processing '90, II - Software, August 13-17, pp. 292-293,
1990 
[9] Kaufman M.T., An almost optimal algorithm for the
assembly line problem, IEEE Trans. on Computers
23, 11, 1169-1174, 1974 
[10] Kirkpatrick S., Gelatt C.D., Vecchi M.P., Optimization
by Simulated Annealing, Science, Vol. 220, Number
4598, May, pp. 671-680, 1983 
[11] Lee S-Y., Aggarwal J.K., A Mapping Strategy for
Parallel Processing, IEEE Trans. on Computers, Vol.
36, 4, pp. 433-442, 1987 
Table 1: Principal program characteristics. The C/C ratio represents the 
communication vs. computation ratio. Times are in μs. 

Program          Tasks  Average   Average   C/C      Max.
                        Duration  Commun.   Ratio    Speedup
Newton-Euler        95      9.12      3.96  43.0 %     7.86
Gauss-Jordan       111     84.77      6.85   8.1 %     9.14
FFT                 73     72.74      6.41   8.8 %    40.85
Matrix Multiply    111     73.96      7.21   9.7 %    82.10
Table 2: Speedup figures for the benchmark programs. (S_p)SA and (S_p)HLF 
denote the speedup obtained with Simulated Annealing and with the HLF 
heuristic respectively. 

                        w/o Comm.                    with Comm.
                 (S_p)SA  (S_p)HLF  % gain   (S_p)SA  (S_p)HLF  % gain
Newton-Euler
Hypercube (8p)     7.20     6.90      4.4      5.6      4.9      14.3
Bus (8p)           7.20     6.90      4.4      6.2      5.2      11.5
Ring (9p)          8.00     8.00      0.0      5.5      3.6      52.8
Gauss-Jordan
Hypercube (8p)     6.67     6.67      0.0      4.80     4.64      3.5
Bus (8p)           6.76     6.67      1.4      4.93     4.74      3.9
Ring (9p)          8.25     8.25      0.0      5.02     4.77      5.0
Matrix Multiply
Hypercube (8p)     7.75     7.75      0.0      6.11     5.19     17.7
Bus (8p)           7.75     7.75      0.0      6.34     5.71     11.0
Ring (9p)          8.38     8.38      0.0      6.04     4.96     21.8
FFT
Hypercube (8p)     7.38     7.38      0.0      6.23     4.93     26.3
Bus (8p)           7.48     7.38      1.4      6.27     5.58     12.3
Ring (9p)          8.43     8.43      0.0      5.97     5.10     17.0
[Figure 1 plot: cost (vertical axis) versus iterations (horizontal axis), with curves for Comm. Cost, Level Cost and Tot. Cost] 
Figure 1: Cost trajectories F_b (level), F_c (communication) and F_tot 
(weighted sum) of a Newton-Euler annealing packet for an 8 node hypercube. 
The weights are w_b = w_c = 0.5. 
[Figure 2 chart: Gantt chart with one row per processor along a horizontal time axis] 
Figure 2: Gantt chart of the Newton-Euler program on an 8 processor 
Hypercube (detail). Numbered blocks represent tasks, half-height blocks 
above and below the base line denote sending and receiving messages 
respectively, quarter-height blocks represent routing messages. 
