






















Critical Path Driven Cosynthesis for Heterogeneous Target Architectures
Bjørn-Jørgensen, Peter; Madsen, Jan
Published in:
International Workshop on Hardware/Software Codesign





Citation (APA):
Bjørn-Jørgensen, P., & Madsen, J. (1997). Critical Path Driven Cosynthesis for Heterogeneous Target
Architectures. In International Workshop on Hardware/Software Codesign (pp. 15-19). IEEE. DOI:
10.1109/HSC.1997.584573
Critical Path Driven Cosynthesis for Heterogeneous Target Architectures 
Peter Bjørn-Jørgensen, Jan Madsen
Department of Information Technology 
Technical University of Denmark 
Abstract 
This paper presents a critical path driven algorithm
that produces a static schedule of a single-rate system onto
a heterogeneous target architecture. Our algorithm is a
list-based scheduling algorithm which concurrently assigns
tasks to processors and allocates nets to interprocessor
communication. Experimental results show that our algorithm
finds good results, compared to other methods, in a
small amount of CPU time.
1. Introduction 
Embedded systems are usually implemented using a mixture
of technologies including off-the-shelf components,
such as microprocessors, and dedicated hardware, such as
full- or semi-custom ASICs. This results in a heterogeneous
architecture in which the communication links between
the components also use different technologies, i.e.
point-to-point communication and busses with various bandwidths.
In this paper we address the problem of cosynthesis of
single-rate systems onto a heterogeneous target architecture.
In particular, we solve the problem of mapping a system
behavior onto a given target architecture, i.e. performing
task scheduling.
Our scheduling objective is to minimize the schedule 
length, taking into account interprocessor communication 
overhead due to data dependencies between tasks and the 
heterogeneity of the target architecture. 
2. Related Work 
Many multiprocessor scheduling approaches have taken
interprocessor communication into account, e.g. [1, 2, 3, 6, 7],
but only very few have addressed scheduling onto
a heterogeneous target architecture, e.g. [5, 8]. Most of
these approaches are based on list scheduling and, hence,
mainly differ in the way they select among the tasks which
are ready to execute. The system behavior is typically
represented as a directed acyclic graph of tasks (or processes)¹
called a task graph.
A popular algorithm is the Highest Levels First with
Estimated Times (HLFET) [1]. The static level (SL) of a task
is defined as the largest sum of computation times along any
path from the task to an end-task. Among the tasks ready to
execute, the algorithm selects the one with the largest SL,
i.e. the most critical, and schedules it on the first available
processor. Communication is not taken into account during
task and processor selection.
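
As an illustration, the static level can be computed by a memoized traversal of the task graph. The following is a minimal Python sketch; the dictionary layout (`succ`, `comp`), the function name `static_levels` and the four-task graph are assumptions made for this example and are not taken from [1].

```python
from functools import lru_cache


def static_levels(succ, comp):
    """Return the static level (SL) of every task.

    succ: dict mapping each task to a list of its direct successors
    comp: dict mapping each task to its computation time
    """

    @lru_cache(maxsize=None)
    def sl(task):
        # SL = own computation time + largest SL among the successors
        # (an end-task has no successors, so the second term is 0).
        return comp[task] + max((sl(s) for s in succ[task]), default=0)

    return {task: sl(task) for task in comp}


# Hypothetical task graph: a -> b, a -> c, b -> d, c -> d
succ = {"a": ["b", "c"], "b": ["d"], "c": ["d"], "d": []}
comp = {"a": 2, "b": 3, "c": 1, "d": 4}
levels = static_levels(succ, comp)
```

With these numbers task `a` has the largest static level, so a scheduler that orders ready tasks by SL would consider it first.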
Wu and Gajski [6] propose two scheduling algorithms:
the Modified Critical Path (MCP) algorithm and the Mobility
Directed (MD) algorithm. The MCP algorithm first computes
the latest start time (LST) for each task. Then each
task is associated with a list of tasks containing itself
followed by its successors in decreasing order of LST. The
algorithm then constructs a list of tasks in increasing
lexicographical order of the task lists. While there are tasks in
the list, the algorithm selects the top task and schedules it on
the processor which results in the earliest start time (EST).
The MD algorithm selects tasks based on the relative
mobilities (RM) of the tasks. RM is defined as the difference
between EST and LST of a task divided by the computation
time. The algorithm schedules the task with the smallest RM
on the processor with the earliest time slot large enough to
hold the execution of the task.
Hwang et al. [2] propose an algorithm called Earliest
Task First (ETF). Among the tasks ready to execute, the
algorithm selects the task and processor which result in the
earliest start time, EST.
The Dominant Sequence Clustering (DSC) algorithm [7]
has been proposed by Yang and Gerasoulis. It defines the
dominant sequence (DS) length of a task as the longest path
from any entry-task to any end-task containing the task
itself. The path length is calculated as the sum of computation
and communication times along the path. The task with the
highest DS is selected. If the task is not ready, the algorithm
selects among its predecessors the one with the highest DS.
The EST determines where to schedule the selected task.

¹ In this paper we will use the term task for the smallest computing entity,
as opposed to process which is used by some authors.

0-8186-7895-X/97 $10.00 © 1997 IEEE
Authorized licensed use limited to: Danmarks Tekniske Informationscenter. Downloaded on March 23, 2010 at 10:25:39 EDT from IEEE Xplore. Restrictions apply.
Kwok and Ahmad [3] present an algorithm called
Dynamic Critical-Path scheduling (DCP). The algorithm first
schedules all tasks on different processors in order to find
EST and LST. The task with the minimum LST-EST is then
selected and scheduled on the processor that has the
earliest time slot large enough to execute the task. The algorithm
continues until all tasks are scheduled.
Sih and Lee [5] propose the Dynamic Level Scheduling
(DLS) algorithm, which selects tasks based on the largest
Dynamic Level (DL), defined as SL - EST, and schedules
them on the processor which results in the earliest start
time, EST. The algorithm is extended to handle heterogeneous
target architectures by associating to each task the
average execution time of the task over the different processors.
Hence, SL is now the maximum sum of average execution
times. Further, three terms are added, accounting for processor
variation, descendant consideration, and the cost of
the task not being executed on its preferred processor.
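
The basic dynamic level rule, DL = SL - EST, can be illustrated with a few lines of Python. This is only a sketch of the homogeneous selection step; the function name, the dictionary layout and all numbers are assumptions made for the example, and the heterogeneous extensions of [5] are not modelled.

```python
def select_by_dynamic_level(sl, est):
    """Pick the (task, processor) pair with the largest DL = SL - EST.

    sl:  dict task -> static level
    est: dict (task, processor) -> earliest start time on that processor
    """
    return max(est, key=lambda pair: sl[pair[0]] - est[pair])


# Hypothetical values for two ready tasks and two processors.
sl = {"t1": 10, "t2": 7}
est = {("t1", "p1"): 4, ("t1", "p2"): 6, ("t2", "p1"): 0, ("t2", "p2"): 2}
best = select_by_dynamic_level(sl, est)
```

For a fixed task, maximizing DL is the same as minimizing EST, which matches the processor choice described above.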
One problem with the execution model used by
many of these algorithms is that two dependencies can be
scheduled on a communication net in the same time slot
without increasing the total communication delay.
3. Problem Formulation
Our aim is to map an embedded system's behavior onto a
given target architecture which may be heterogeneous.
3.1. Target Architecture 
A heterogeneous target architecture is represented by a
graph, $G_A = (V_A, E_A)$, in which each vertex describes an
architecture component and the edges describe interconnections
among the architecture components. Each architecture
component may be a processor, $p$, or a device, $d$. A processor
represents an active component, e.g. a CPU, an ASIP, or
an ASIC. A device represents a passive component which
may be controlled by a processor, e.g. an IO-device, a sensor,
a display, or a memory. An edge, $n_i$, represents a net
connecting two or more architecture components, i.e. a bus.
Each net $n_i$ is annotated with a bandwidth, $b_i$, which represents
the amount of data which can be transferred over the
net per time unit. It is assumed that each processor has dedicated
communication hardware so that computation can be
overlapped with communication, i.e. a new task may be invoked
on the processor while the results of the previous task
are transferred to another processor. It is further assumed
that there exists at least one net between any two processors
in the architecture.
Figure 1. Target architecture with devices.

We will, however, use a somewhat simpler example to illustrate our
scheduling algorithm, i.e. the target architecture shown in
figure 2. Note that this architecture has no devices.

Figure 2. Simple target architecture.
3.2. System Specification 
The behavior of an embedded system is described by a
task graph, $G_T = (V_T, E_T)$, which is a partially-ordered
set of tasks represented as a directed acyclic graph. Hence,
each vertex, $\tau_i \in V_T$, in the task graph represents a task
describing a single thread of execution which cannot be
preempted, i.e. an atomic operation. An edge, $e_{i,j} \in E_T$,
describes a data dependency between the tasks $\tau_i$ and $\tau_j$. Each
edge is annotated with the amount of data which has to be
transferred between the two tasks, denoted as $d_{i,j}$. Without
loss of generality [5], we assume that the data transferred on
each edge emanating from a task are the same.
We assume that an estimate of the execution time of task
$\tau_i$ on processor $p_k$, $t_e(\tau_i, p_k)$, is available at compile-time.
Further, a task may require one or more devices for its
execution; these devices will be allocated to the processor
during the complete execution of the task. If a task $\tau_j$ cannot
be executed on a processor $p_l$, for instance because the
required devices are not connected to the processor, the
execution time is set to infinity.
The communication time between two processors is defined
as follows.
Definition 1 Let $\tau_i$ and $\tau_j$ be two data dependent tasks
scheduled on $p_k$ and $p_l$ respectively. The communication
time is then given by

$$t_{comm}(\tau_i, \tau_j, n_m) = \frac{d_{i,j}}{b_m}$$

where $b_m$ is the bandwidth of net $n_m$ which connects $p_k$ and
$p_l$.
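
To make the role of the bandwidth concrete, the sketch below computes the transfer time of a data dependency on every net that connects two processors, taking the communication time as the amount of data divided by the bandwidth of the net. The net names, bandwidths and the `nets` dictionary layout are hypothetical and only serve the example.

```python
def comm_time(d_ij, b_m):
    """Time to transfer d_ij units of data over a net with bandwidth b_m."""
    return d_ij / b_m


def comm_times(d_ij, p_k, p_l, nets):
    """Map every net connecting p_k and p_l to the resulting transfer time.

    nets: dict net name -> (bandwidth, set of processors attached to the net)
    """
    return {
        name: comm_time(d_ij, bandwidth)
        for name, (bandwidth, attached) in nets.items()
        if p_k in attached and p_l in attached
    }


# Hypothetical architecture: a shared bus n1 and a faster link n2.
nets = {
    "n1": (2, {"p1", "p2", "p3"}),
    "n2": (5, {"p1", "p2"}),
}
between_p1_p2 = comm_times(10, "p1", "p2", nets)
between_p1_p3 = comm_times(10, "p1", "p3", nets)
```

A scheduler can then choose among the candidate nets, which is the degree of freedom that net allocation exploits.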
Example 2: Figure 3 shows a simple task graph and the
execution time of each task on each processor.

Figure 3. Task graph and processor execution times.
4. The HCP Scheduling Algorithm 
The scheduling problem when taking interprocessor 
communication into account contains two main aspects: as- 
signing tasks to processors, and allocating nets for interpro- 
cessor communication. Our scheduling strategy addresses 
both problems concurrently, and is based on the classical 
list scheduling algorithm, e.g. [4]. In the following we 
present the Heterogeneous Critical Path (HCP) scheduling 
algorithm, as outlined below, 
HCP(G_T, G_A) {
    ∀τ_i ∈ V_T, ∀p_j ∈ V_A : calculate cpl(τ_i, p_j);
    update L;
    while L ≠ ∅ do {
        τ_i = select-task(L);
        p_j = select-processor(τ_i);
        schedule(τ_i, p_j);
        update L; }
}
First a priority metric is calculated for each task-processor
combination. A list $L$ contains the tasks which are ready
to execute, i.e. their predecessors have all been scheduled.
A task is then selected based on the priority metric, and
assigned to the best processor when taking communication
into account. Finally, the task is scheduled on the processor
and the possible communication is scheduled on the
appropriate net. In the following we will describe each step in the
algorithm in more detail.
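
The control flow of the outline can be sketched as a generic list-scheduling loop. In the Python sketch below the three callables are placeholders for the steps described in the following subsections; their names, their signatures and the small predecessor table are assumptions made for illustration, not the authors' implementation.

```python
def list_schedule(pred, select_task, select_processor, schedule):
    """Generic list-scheduling loop following the HCP outline.

    pred: dict task -> set of direct predecessors
    Returns the order in which (task, processor) pairs were scheduled.
    """
    done, order = set(), []
    # L: tasks whose predecessors have all been scheduled.
    ready = {t for t, ps in pred.items() if not ps}
    while ready:
        task = select_task(ready)
        proc = select_processor(task)
        schedule(task, proc)
        done.add(task)
        order.append((task, proc))
        # update L
        ready = {t for t, ps in pred.items() if t not in done and ps <= done}
    return order


# Hypothetical graph: t1 -> t2, t1 -> t3, {t2, t3} -> t4
pred = {"t1": set(), "t2": {"t1"}, "t3": {"t1"}, "t4": {"t2", "t3"}}
order = list_schedule(
    pred,
    select_task=lambda ready: min(ready),   # placeholder priority: name order
    select_processor=lambda task: "p1",     # placeholder: a single processor
    schedule=lambda task, proc: None,       # placeholder: no bookkeeping
)
```

Replacing the placeholders with a CPL-based task selection and an EST-based processor selection gives the structure that the rest of this section fills in.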
4.1. Task Selection
Tasks are selected from $L$ according to their critical path
length (CPL), i.e. the task with the largest CPL is selected.
The CPL of a task $\tau_i$ is the shortest path in the task graph
from $\tau_i$ to any end-task, that is, a task with no successors.
The path length is calculated as the sum of execution times
of the tasks on the path and communication times of inter-task
dependencies under the assumption that all tasks are
scheduled on their preferred processors and all data dependencies
are scheduled on their preferred nets. The preferred
processors and nets are those resulting in the shortest critical
path length.
Definition 2 The critical path length of a task $\tau_i$ scheduled
on a processor $p_k$ is defined as:

$$cpl(\tau_i, p_k) = t_e(\tau_i, p_k) + \ldots$$

where $V_{T,succ}(\tau_i)$ is the set of direct successor tasks of $\tau_i$.
Example 3: Using the definition of CPL, we get the following
values for the task graph of figure 3.
The critical path lengths of the end-tasks ($\tau_4$, $\tau_6$ and $\tau_7$) are
the same as their execution times. As an example of calculating
CPL, consider the calculation of $cpl(\tau_3, p_1)$:

$$cpl(\tau_3, p_1) = 1 + \min(0 + 11, 10 + 14, 10 + 1) = 12$$
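
The calculation in Example 3 can be reproduced with a short recursive sketch. Note the assumptions: the term following the execution time in Definition 2 is read here as a maximum over the direct successors of the minimum, over all processors, of communication time plus successor CPL. This reading is inferred from the example and from the description of preferred processors; it is not confirmed by the text. The task name `end`, the helper `make_cpl` and the flat communication time of 10 between different processors are likewise hypothetical.

```python
from functools import lru_cache


def make_cpl(succ, t_e, t_comm, processors):
    """Build a memoized cpl(task, proc) function.

    succ:   dict task -> list of direct successors
    t_e:    dict (task, proc) -> execution time
    t_comm: function (task, successor, proc, succ_proc) -> communication time
    """

    @lru_cache(maxsize=None)
    def cpl(task, proc):
        tail = 0
        for nxt in succ[task]:
            # Preferred (shortest) continuation for this successor.
            best = min(t_comm(task, nxt, proc, q) + cpl(nxt, q) for q in processors)
            # ASSUMPTION: the most demanding successor determines the CPL.
            tail = max(tail, best)
        return t_e[(task, proc)] + tail

    return cpl


def t_comm(task, nxt, proc, succ_proc):
    # ASSUMPTION: no cost on the same processor, 10 time units otherwise.
    return 0 if proc == succ_proc else 10


processors = ("p1", "p2", "p3")
succ = {"tau3": ["end"], "end": []}
t_e = {("tau3", "p1"): 1, ("end", "p1"): 11, ("end", "p2"): 14, ("end", "p3"): 1}
cpl = make_cpl(succ, t_e, t_comm, processors)
value = cpl("tau3", "p1")  # 1 + min(0 + 11, 10 + 14, 10 + 1)
```

With a single successor the maximum over successors has no effect, so the sketch agrees with the example regardless of how that part of the definition is read.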
From the list of ready tasks $L$, the task with the longest
minimum CPL is selected, i.e. this task is the most critical
in order to obtain a short overall schedule length. That is, the
selected task $\tau_i$ is found as

$$\max_{\tau_i \in L} \, \min_{p_k \in V_A} cpl(\tau_i, p_k)$$
Example 4: Consider our example from figure 3 and assume
that $\tau_1$ has been scheduled on processor $p_2$. The ready
list contains two tasks, $\tau_2$ and $\tau_5$, and hence,

$$\max(\min(16, 15, 13), \min(11, 9, 9)) = 13$$

which means that task $\tau_2$ is selected.
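
The max-min rule of Example 4 can be written directly in Python. The CPL values below are the ones quoted in the example; the dictionary layout and the function name `select_task` are assumptions made for this sketch.

```python
def select_task(ready, cpl):
    """Return the ready task whose minimum CPL over all processors is largest.

    cpl: dict task -> dict processor -> critical path length
    """
    return max(ready, key=lambda task: min(cpl[task].values()))


cpl = {
    "tau2": {"p1": 16, "p2": 15, "p3": 13},
    "tau5": {"p1": 11, "p2": 9, "p3": 9},
}
chosen = select_task(["tau2", "tau5"], cpl)
```

Taking the minimum first means a task is judged by its best case, so a task is only considered critical if even its preferred processor leaves a long remaining path.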
4.2. Processor Selection 
Having selected a task $\tau_i$, we have to select the processor
$p_k$ on which to schedule it. Here we have to consider two
aspects:
• The earliest start time (EST) at which $\tau_i$ can start its
execution on $p_k$ when taking into account the communication
from predecessors of $\tau_i$ and their schedule on the
available nets. We denote this term $est(\tau_i, p_k)$.
• The possible influence on the overall schedule length
by scheduling $\tau_i$ on $p_k$. This is expressed as the critical
path length, $cpl(\tau_i, p_k)$.
Hence, the processor $p_k$ on which to schedule $\tau_i$ is the one
which satisfies the expression:

$$\min_{p_k \in V_A} \left( est(\tau_i, p_k) + cpl(\tau_i, p_k) \right) \qquad (1)$$
The actual scheduling of a task on a processor is defined
as follows:
Definition 3 Let $p_k$ be a processor on which to schedule a
task $\tau_i$, and let $(\tau_1, \tau_2, \ldots, \tau_n)$ denote the sequence of
already scheduled tasks on $p_k$. The schedule $s(\tau_i, p_k)$ is
defined as the earliest possible time $t$ at which the following
equations are fulfilled:

$$t \ge t_{end}(\tau_j, p_k) \ge est(\tau_i, p_k)$$
$$t \le t_{end}(\tau_{j+1}, p_k) - t_e(\tau_i, p_k)$$

The first equation ensures that all data is ready before
executing $\tau_i$. The second equation states that there should be
time for executing $\tau_i$ before the next task, $\tau_{j+1}$, in the
schedule starts its execution. If no such time slot can be found, $\tau_i$
is scheduled after the last task, i.e. $\tau_n$.
The scheduling, $s(e_{i,j}, n_m)$, of data dependencies on
nets is defined in the same way.
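
The idea behind Definition 3, inserting a task into the earliest sufficiently large idle slot and appending it otherwise, can be sketched as follows. This is a simplified illustration rather than a transcription of the two inequalities: the sketch represents the resource as a sorted list of `(start, end)` busy intervals and tests each gap against the start of the next interval, which is an assumption of the example.

```python
def earliest_slot(busy, est, duration):
    """Earliest start time >= est at which a job of length `duration` fits.

    busy: list of (start, end) intervals already scheduled on the resource,
          sorted by start time and non-overlapping.
    """
    t = est
    for start, end in busy:
        if t + duration <= start:
            return t       # the job fits in the gap before this interval
        t = max(t, end)    # otherwise try again after this interval
    return t               # no gap found: schedule after the last interval


# Hypothetical processor timeline with an idle slot between 3 and 5.
busy = [(0, 3), (5, 9)]
fits_gap = earliest_slot(busy, est=1, duration=2)
too_long = earliest_slot(busy, est=1, duration=3)
```

Because the function only needs busy intervals, the same routine can serve for data dependencies on nets, in line with the last sentence above.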
Let $t_{end}(\tau_i, p_k)$ indicate the actual end-time of task $\tau_i$
when scheduled on processor $p_k$. Let $V_{T,pred}(\tau_i)$ denote the
set of predecessor tasks of $\tau_i$, and $E_{T,pred}(\tau_i)$ the set of data
dependencies $e_{j,i}$, where $\tau_j \in V_{T,pred}(\tau_i)$. Now, for all
$\tau_j \in V_{T,pred}(\tau_i)$, we select the one with the earliest
end-time, i.e.

$$\min_{\tau_j \in V_{T,pred}(\tau_i)} t_{end}(\tau_j, p_l)$$

and try to schedule the data dependency $e_{j,i}$ on all possible
nets $n_m$, that is, nets which connect $p_k$ and $p_l$. We select the
net resulting in the earliest completion time of the data
transfer; let $t_{end}(e_{j,i}, n_m)$ denote this time. We continue this
process until all data dependencies $E_{T,pred}(\tau_i)$ have been
scheduled. Then,

$$est(\tau_i, p_k) = \max_{e_{j,i} \in E_{T,pred}(\tau_i)} t_{end}(e_{j,i}, n_m)$$

From equation 1 we can now select the processor on
which to schedule $\tau_i$, and we perform the scheduling of $\tau_i$
and $E_{T,pred}(\tau_i)$ according to definition 3.
If two tasks, $\tau_2$ and $\tau_3$, depend on the same task $\tau_1$, and
$\tau_3$ is to be scheduled on the same processor, $p_l$, as $\tau_2$, the
algorithm takes into account that the data transfer has already
been allocated, i.e. $t_{comm}(\tau_1, \tau_3, n_m) = 0$. However, the task
cannot start execution before data is available.
Example 5: Consider our previous example where task $\tau_2$
was selected. The processor on which to schedule $\tau_2$ is
determined as,

$$\min(\max(1 + 1) + 16, \max(1 + 0) + 15, \max(1 + 1) + 13) = 15$$

which means processor $p_3$. Note that this is actually the
processor on which $\tau_2$ has its worst execution time. Figure 4
shows the complete schedule of the task graph of figure 3.

Figure 4. Scheduled task graph.
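
The arithmetic of Example 5 follows equation 1 and can be checked with a small Python sketch. The EST and CPL values are the ones appearing in the example; the function name `select_processor` and the dictionary layout are assumptions made here.

```python
def select_processor(est, cpl):
    """Return (processor, cost) minimizing est + cpl, as in equation 1.

    est: dict processor -> earliest start time of the selected task
    cpl: dict processor -> critical path length of the selected task
    """
    proc = min(est, key=lambda p: est[p] + cpl[p])
    return proc, est[proc] + cpl[proc]


# Values from Example 5: end-time of the predecessor plus communication time.
est = {"p1": 1 + 1, "p2": 1 + 0, "p3": 1 + 1}
cpl = {"p1": 16, "p2": 15, "p3": 13}
proc, cost = select_processor(est, cpl)
```

The example shows why both terms matter: `p2` offers the earliest start, but the shorter remaining critical path on `p3` gives the smaller sum.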
5. Experimental Results 
The HCP algorithm has been implemented in C++ along
with the DLS [5], ETF [2], and ETF2 algorithms. The DLS
algorithm is implemented for heterogeneous target architectures
using the generalized dynamic level GDL1, which
accounts for processor variation and resource scarcity, see [5].
The GDL1 is exhaustively examined for all the ready tasks.
The ETF algorithm is implemented using SL as the task
priority, while ETF2 uses CPL as the task priority. ETF2
selects the task-processor pair based on the earliest end-time of
the task, i.e. it includes the task execution time when choosing
a task.
Table 1 shows the resulting schedule lengths when ap- 
plied to some benchmarks from the literature. A1 is the tar- 
get architecture of figure 2 while A2 is a two processor ar- 
chitecture. 
example       | target | HCP | DLS | ETF | ETF2
fig. 3        | A1     |  15 |  20 |  19 |  20
fig. 1 in [5] | A1     |  30 |  25 |  26 |  28
fig. 2 in [5] | A2     |  32 |  29 |  28 |  29
gausseli [3]  | A1     | 490 | 520 | 480 | 560
Table 1. Schedule lengths for benchmarks from the literature.

We have implemented a task graph generator,
taskgen($s_t$, $s_d$, $t_e$, $d$), somewhat like the ones described in [7] and
[3]. $s_t$ indicates the number of tasks, $s_d$ the number of
data dependencies, $t_e$ the maximum execution time, and $d$
the maximum size of data dependencies. For each
task-processor pair an execution time is randomly selected from
the uniformly distributed interval $[1; t_e]$. The size of data
produced by each task is randomly taken from $[1; d]$. Data
dependencies are generated by randomly choosing two tasks
with no dependency.
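
A generator along these lines might look as follows in Python. This is a sketch under stated assumptions, not the authors' C++ implementation: execution times and data sizes are drawn as integers, edges are oriented from the lower to the higher task index to keep the graph acyclic, only duplicate direct edges are rejected, and the `processors` and `seed` parameters are additions made for the example.

```python
import random


def taskgen(s_t, s_d, t_e, d, processors, seed=None):
    """Generate a random task graph.

    s_t: number of tasks, s_d: number of data dependencies,
    t_e: maximum execution time, d: maximum data size.
    Returns (exec_time, data, edges).
    """
    rng = random.Random(seed)
    tasks = list(range(s_t))
    # One execution time per task-processor pair, uniform in [1, t_e].
    exec_time = {(t, p): rng.randint(1, t_e) for t in tasks for p in processors}
    # One data size per task, uniform in [1, d].
    data = {t: rng.randint(1, d) for t in tasks}
    edges = set()
    max_edges = s_t * (s_t - 1) // 2
    while len(edges) < min(s_d, max_edges):
        a, b = rng.sample(tasks, 2)
        # ASSUMPTION: orient low -> high index so no cycle can arise.
        edges.add((min(a, b), max(a, b)))
    return exec_time, data, edges


exec_time, data, edges = taskgen(10, 20, 10, 10, ["p1", "p2", "p3"], seed=0)
```

Fixing the seed makes a generated benchmark reproducible, which is convenient when comparing several schedulers on the same set of graphs.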
For our experiments, we have generated and scheduled
100 task graphs for each set of parameters. The resulting
schedule length is taken as the average over these 100 task
graphs. In all experiments, $s_d$ was set to twice the number of
tasks, and the target architecture was the one from figure 2.
Table 2 shows the results of the four algorithms when
scheduled onto a homogeneous target architecture, while
table 3 shows the results on a heterogeneous target architecture.
In both cases the HCP algorithm produces on average
the best schedules, though there are cases where one
of the other algorithms is best, as indicated in table 1. All
schedules were produced in less than one second on a SUN
SPARC workstation.
$s_t$ | $t_e, d$ | HCP | DLS | ETF | ETF2
10    | 10,10    |  39 |  39 |  38 |  40
Table 2. Average schedule lengths for synthetic benchmarks
on a homogeneous architecture.
6. Conclusion 
We have presented a new list based scheduling algorithm 
which concurrently assigns tasks to processors and allocates 
nets to interprocessor communication. The main advantage 
over previous approaches is that we take communication 
into account while selecting tasks and processors, and that 
we do not allow multiple communications to take place si- 
multaneously over the same net. 
Experiments have shown that our algorithm compares
well to previous approaches. However, all of the compared
algorithms, including ours, fail to favor the processor on
which predecessors of the successors of a task have been
scheduled.
$t_e, d$ | HCP | DLS | ETF | ETF2
40,10    |  53 |  59 |  65 |  57
10,10    |  41 |  70 |  68 |  69
10,40    |
Table 3. Average schedule lengths for synthetic benchmarks
on a heterogeneous architecture.
7. Acknowledgements 
This research has partly been sponsored by the Danish
Technical Research Council under the "Codesign" program.
References 
[1] T. Adam, K. Chandy, and J. Dickson. A comparison of list
schedules for parallel processing systems. Comm. ACM,
17(12):685-690, December 1974.
[2] J. Hwang, Y. Chow, F. Anger, and C. Lee. Scheduling
precedence graphs in systems with interprocessor communication
times. SIAM J. Computing, 18(2):244-257, April 1989.
[3] Y. Kwok and I. Ahmad. Dynamic critical-path scheduling: An
effective technique for allocating task graphs to
multiprocessors. IEEE Trans. Parallel Distrib. Syst., 7(5):506-521, May
1996.
[4] G. De Micheli. Synthesis and Optimization of Digital Circuits.
McGraw-Hill, Inc., Princeton Road, S-1, 1994.
[5] G. Sih and E. Lee. A compile-time scheduling heuristic for
interconnection-constrained heterogeneous processor
architectures. IEEE Trans. Parallel Distrib. Syst., 4(2):175-187,
February 1993.
[6] M. Wu and D. Gajski. Hypertool: A programming aid for
message-passing systems. IEEE Trans. Parallel Distrib. Syst.,
1(3):330-343, July 1990.
[7] T. Yang and A. Gerasoulis. DSC: Scheduling parallel tasks on
an unbounded number of processors. IEEE Trans. Parallel
Distrib. Syst., 5(9):951-967, September 1994.
[8] T.-Y. Yen. Hardware-Software Co-Synthesis of Distributed
Embedded Systems. PhD thesis, Princeton University, 1996.
