Allocation and scheduling of conditional task graph in hardware/software
co-synthesis
Yuan Xie and Wayne Wolf 
Electrical Engineering Department 
Princeton University 
Princeton, NJ 08540, USA 
{yuanxie,wolf}@ee.princeton.edu
Abstract 
This paper introduces an allocation and scheduling algorithm
that efficiently handles conditional execution in multi-rate
embedded systems. Control dependencies are introduced into the
task graph model. We propose a mutual exclusion detection
algorithm that helps the scheduling algorithm exploit resource
sharing. Allocation and scheduling are performed simultaneously
to take advantage of resource sharing among mutually exclusive
tasks. The algorithm is fast and efficient, and is therefore
suitable for use in the inner loop of our hardware/software
co-synthesis framework, which must call the scheduling routine
many times.
1. Introduction
Hardware/software co-synthesis partitions an embedded
system specification into hardware and software modules to
meet performance, power and cost goals. A common model
for describing the system specification is the task graph [1].
The task graph has a structure similar to a data flow graph,
except that the tasks in a task graph represent larger units
of functionality. Allocation and scheduling of a set of
data-dependent tasks, described by task graphs, on a
multiprocessor architecture has been intensively researched.
However, most previous work used a task graph model without
control dependency information, and so can capture only the
data dependencies in the system specification. Recently,
some researchers in the co-synthesis domain have used
conditional task graphs to capture both the data dependencies
and the control dependencies of the system specification [7][8].
However, because of their time complexity, these algorithms
are not well suited to large task graphs or to use in the
inner loop of the co-synthesis procedure.
1530-1591/01 $10.00 © 2001 IEEE
This paper describes an allocation and scheduling algorithm
that is used in the inner loop of our co-synthesis process
for distributed embedded computing systems [11]. The
co-synthesis process synthesizes a distributed multiprocessor
architecture and allocates processes to the target architecture
such that the allocation and scheduling of the task graph meet
the deadline of the system while the cost of the system is
minimized. The algorithm targets periodic multi-rate task
graphs. The target architecture is a heterogeneous
multiprocessor architecture that consists of multiple
processing elements (PEs) of various types: general-purpose
CPUs or domain-specific CPUs, and ASICs. The allocation
and scheduling algorithm has been implemented as part of
our co-synthesis tool [11], which is the first co-synthesis
tool that takes into account the impact of different custom
ASIC implementations of tasks on system performance and
cost in the co-synthesis process.
This paper is organized as follows.  Section 2 reviews 
previous related work. Section 3 describes the problem for- 
mulation and section 4 introduces a method to detect mutual 
exclusion among tasks. We then present our scheduling and 
allocation algorithm.  Finally we discuss the experimental 
results of our algorithm. 
2. Related Work 
Previous work in hardware/software co-design has addressed
various aspects of HW/SW co-synthesis. Hardware/software
partitioning algorithms implement the system specification
on some sort of architectural template,
usually a single CPU with one or more ASICs connected to 
the bus. On the other hand, distributed system co-synthesis 
creates a multiprocessor architecture.  The target architec- 
ture  is  usually  heterogeneous  in  both  its  processing  ele- 
ments and its communication channels. It can employ mul- 
tiple CPUs, ASICs and FPGAs. 
Task allocation and scheduling are important aspects of 
the co-synthesis process. The scheduling routine is not only used to generate the final allocation and schedules for the 
design, but also used in  the inner loop of co-synthesis to 
evaluate the performance of intermediate solutions, and to 
help to generate new solutions. Both its result quality and 
its efficiency are critical to the co-synthesis algorithm. 
Some scheduling problems can be modeled as integer
linear programming (ILP) problems, and an ILP solver is used
to find the optimal solution. An early example of an optimal
approach is the SOS system, which used mixed integer linear
programming (MILP) [5]. Because of their time complexity,
optimal approaches are suitable only for small task graphs
and are impractical otherwise. Heuristic approaches are by
far the most widely used. Many heuristic scheduling
algorithms are variants and extensions of list scheduling
[1][2][4]. However, most of those works consider only the data
dependencies in the task graph model, although scheduling
of control and data flow graphs has been a very active
research area in high-level synthesis [9][10]. Recently,
researchers in system-level synthesis have begun to address
the control dependencies in the task graph model. Eles et
al. described the conditional task graph model and used a
list-scheduling algorithm to generate schedule tables
for all processing elements in the architecture [7][13].
The schedule table lists the schedules for the different
condition combinations in the task graph. Their algorithm has
two limitations: (1) it assumes that the allocation of each
task is fixed; (2) it has to enumerate the possible schedules
for all condition combinations, so it is not suitable for
control-intensive applications. Kuchcinski et al. [8] used
constraint programming techniques to model the scheduling
problem of conditional task graphs, and then used a commercial
constraint solver to find the solution. This approach does
not take advantage of heuristics unique to co-synthesis.
3. Problem formulation 
Many real-time applications are periodic and run at
multiple rates. We use a task graph model [3] to describe
each application. Each application is partitioned into a task
graph, which is a directed acyclic graph, as shown in Figure 1.
In a task graph, each node represents a task that may have
moderate to large granularity; the directed edges represent data
dependencies between tasks. An edge, say A -> D, implies
that task D cannot start execution until A has finished. Data
dependency edges ensure the correct order of execution.
Each edge is associated with a scalar describing the amount
of data that must be transferred between the two connected
tasks, which determines the communication time between the
tasks if they are allocated to different PEs. We assume that
if two tasks are allocated to the same PE, the communication
time is 0. An edge with an assigned condition value is
a conditional edge (drawn with dotted lines in Figure 1).
A task with outgoing conditional edges is a branch fork task.
Conditional paths meet at a branch join task. For example,
in Figure 1, A is a branch fork task and E is a branch join
task. Depending on the condition, one of the out-branches
of task A (A -> B or A -> C) is activated. The task graph is
executed periodically at its specified rate. For simplicity, in
this paper we assume that the deadline, by which the task
graph must complete its execution, is equal to the period.
In general, the deadline can be shorter or longer than the period.
[Figure 1 annotations: period = 150; deadline = 150; A->B: condition = False; A->C: condition = True]
Figure 1. Conditional task graph
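As an illustration of the task graph model described above, the following Python sketch encodes the three edges of Figure 1 that are named in the text and applies the rule that communication on the same PE costs nothing. The names Edge, comm_time and branch_fork_tasks, the data amounts, the allocation, and the linear cost model (data amount times a per-unit time) are assumptions made for this sketch only.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Edge:
    src: str
    dst: str
    data: int                         # amount of data carried by the edge (hypothetical values)
    condition: Optional[bool] = None  # None marks an unconditional (pure data) edge


def comm_time(edge, allocation, time_per_unit=1):
    """Communication time of an edge for a given task -> PE allocation."""
    if allocation[edge.src] == allocation[edge.dst]:
        return 0                      # same PE: communication time is assumed to be 0
    return edge.data * time_per_unit  # assumed linear cost on the shared link


def branch_fork_tasks(edges):
    """A branch fork task is a task with at least one outgoing conditional edge."""
    return {e.src for e in edges if e.condition is not None}


# Edges of Figure 1 that the text names explicitly; data amounts are made up.
figure1_edges = [
    Edge("A", "B", data=1, condition=False),
    Edge("A", "C", data=1, condition=True),
    Edge("A", "D", data=1),
]
allocation = {"A": "CPU1", "B": "CPU2", "C": "CPU2", "D": "ASIC"}
```

With this allocation, comm_time(figure1_edges[0], allocation) is charged because A and B sit on different PEs, while placing both on CPU1 would make it 0.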
We use a heterogeneous shared-memory multiprocessor
as the target architecture, as shown in Figure 2. The
architecture has a number of processing elements (PEs), which
may be CPUs or ASICs and which communicate with
each other via communication links (such as a bus). Each
CPU has its own private instruction cache and data cache. The
task-level cache performance model we use was proposed
by Li [2]. Each task can have many implementation options
that differ in PE type, cost and execution time.
Figure 2.  The target architecture. 
The technology library provides a choice of CPU types
and the worst-case execution time (WCET) of each task on
each type of CPU. Part of the technology library for the
example in Figure 1 is shown in Table 1. The WCET of a task
on a CPU can be estimated using the techniques described in
[14]. If a task can be implemented as an ASIC, then there is
a corresponding behavioral VHDL file for this task. An
architectural exploration system called Monet [6], which is
provided by Mentor Graphics, is used to determine the
performance and cost of the ASIC. By using Monet as an
estimation tool for the custom ASICs, our co-synthesis
system [11] can explore the tradeoffs of different
ASIC implementations during the co-synthesis process.
Tasks | WCET on CPU1 | WCET on CPU2
A     | 10           | 20
[remaining rows garbled in the source: "R I In I IQ", "20", "30 20"]
cost of CPU1 = 100, cost of CPU2 = 150
Table 1. Part of the technology library.
Given the conditional task graph, the target architecture and
the technology library, the co-synthesis algorithm produces an
allocation of tasks on the target architecture and constructs a
static global schedule of the tasks on the specified processing
elements.
4. Detection of mutual exclusion 
In the conditional task graph, if two tasks belong to
different conditional branch paths that have conflicting
condition combinations, they are mutually exclusive and it is
impossible for them to be executed at the same time. For example,
in Figure 1, depending on the condition value, either task B or
task C is executed, but not both. If task B and task C are
allocated to the same CPU, their scheduled execution times can
overlap, since they are mutually exclusive.
We can use a branch labeling method, similar to
the branch numbering procedure described in [9], to identify
the mutually exclusive tasks. Each task in the task graph
is associated with a branch information structure, which is
defined as follows:
struct branch_info {
    int level;
    branch_label[i];
    branch_condition[i];
};
level: the number of branch fork tasks that have to be
executed before reaching this task.
branch_label[i]: the name of the i-th level branch fork task.
branch_condition[i]: the condition value for the i-th level
branch.
For the example in Figure 3, tasks A and F are
not in any branch, so their branch level is 0 and they have
no branch_label or branch_condition information. Task
J belongs to one of the branches beginning from C, as
well as to one of the branches beginning from I. Task J is
executed when the condition at C is C3 and the condition at I
is I1. So the branch level of task J is 2, with
branch_label = [C, I] and branch_condition = [C3, I1]. We can
use the algorithm shown in Figure 4 to traverse the task graph
and calculate the branch_info struct for each task. The result
is shown in Table 2. The fictitious branch join tasks are
created in Figure 3 to outline the control structure. Note that
the conditional process graph model in [7][13] considers only
Boolean conditions, while our approach allows multi-way
conditions (such as the switch statement in the C language).
Figure 3.  An  example conditional task graph. 
Branch_labeling(task ps)
{
    if (ps is the branch join task) {
        delete ps->branch_label[ps.level] and
               ps->branch_condition[ps.level];
        ps.level--;
    }
    foreach child task ps_child of ps {
        if (ps is the branch fork task) {
            i = ps_child.level++;
            ps_child->branch_label[i] = ps.name;
            ps_child->branch_condition[i] = branch_condition;
        }
        else
            ps_child has the same branch_info as ps
        Branch_labeling(ps_child)
    }
}
Figure 4. Branch-labeling recursion algorithm outline.
Table 2. The branch_info struct for Figure 3.
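The branch labeling traversal can be sketched in Python as follows. This is a minimal sketch rather than the authors' implementation: it walks the graph iteratively instead of recursively, stores each task's label and condition together as a list of pairs (so level is the list length), and assumes branches are well nested. The graph used here is a hypothetical reading of Figure 3 built only from relationships stated in the text; the join task names JB, JI and JC and all condition names other than C3 and I1 are invented for illustration.

```python
def label_branches(edges, joins, root):
    """Return {task: [(branch_label, branch_condition), ...]}, outermost branch first.

    edges: {task: [(child, condition), ...]}; condition is None on an
           unconditional edge, so a fork is a task with conditional out-edges.
    joins: set of branch join tasks (including fictitious ones).
    """
    info = {root: []}
    stack = [root]
    while stack:
        ps = stack.pop()
        for child, condition in edges.get(ps, []):
            if child in info:          # already labeled via another path
                continue
            labels = list(info[ps])    # child starts with the same info as ps
            if condition is not None:  # ps is a branch fork task
                labels.append((ps, condition))
            if child in joins and labels:
                labels.pop()           # a join closes the innermost branch
            info[child] = labels
            stack.append(child)
    return info


# Hypothetical graph consistent with the statements made about Figure 3.
edges = {
    "A": [("B", None), ("C", None)],
    "B": [("D", "B1"), ("E", "B2")],
    "C": [("G", "C1"), ("H", "C2"), ("I", "C3")],
    "I": [("J", "I1"), ("K", "I2")],
    "D": [("JB", None)], "E": [("JB", None)],
    "J": [("JI", None)], "K": [("JI", None)],
    "G": [("JC", None)], "H": [("JC", None)], "JI": [("JC", None)],
    "JB": [("F", None)], "JC": [("F", None)],
}
joins = {"JB", "JI", "JC"}
branch_info = label_branches(edges, joins, "A")
```

Under these assumptions, branch_info["J"] holds the pairs (C, C3) and (I, I1), matching the level-2 labeling described for task J, while A and F end up with empty lists (level 0).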
By using this scheme, it is easy to decide whether two tasks are
mutually exclusive. For any two tasks,
1. If the level of either task is 0, then they are not mutually
   exclusive, such as task A and task I.
2. If task1.level = n1 and task2.level = n2,
   let N = min(n1, n2); then we compare the first N
   branch_label and branch_condition entries, beginning from
   branch_label[1] and branch_condition[1]:
   a. if task1.branch_label[i] ≠ task2.branch_label[i],
      they are not mutually exclusive, such as task I and
      task D.
   b. if task1.branch_label[i] = task2.branch_label[i]
      and task1.branch_condition[i] ≠
      task2.branch_condition[i], they are mutually
      exclusive, such as G and J.
   c. else, compare level i+1. If i > N, they are not
      mutually exclusive.
Using the same scheme, we can identify mutually
exclusive communication edges. For example, the communication
edges C -> G and C -> H in Figure 3 are mutually
exclusive; if they are scheduled on the same communication
link, their time slots can overlap. Note
that two tasks on different branches might not be mutually
exclusive, such as D and I.
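The comparison rules of this section can be written as a short Python function. In this sketch each task's branch information is a list of (branch_label, branch_condition) pairs, outermost first; that representation, the function name, and the sample data (in particular the fork B for tasks D and E, and every condition name other than C3 and I1) are assumptions for illustration, since Table 2 is not reproduced here.

```python
def mutually_exclusive(info_a, info_b):
    """Apply rules 1 and 2 to two lists of (branch_label, branch_condition) pairs."""
    # zip stops after N = min(level_a, level_b) entries; a level-0 task
    # therefore yields no comparisons and falls through to "not exclusive".
    for (label_a, cond_a), (label_b, cond_b) in zip(info_a, info_b):
        if label_a != label_b:
            return False  # rule 2a: different fork tasks at this level
        if cond_a != cond_b:
            return True   # rule 2b: same fork, conflicting conditions
        # rule 2c: same fork and same condition, so compare the next level
    return False


# Hypothetical branch information for some tasks of Figure 3.
info = {
    "A": [],
    "D": [("B", "B1")],
    "E": [("B", "B2")],
    "G": [("C", "C1")],
    "I": [("C", "C3")],
    "J": [("C", "C3"), ("I", "I1")],
    "K": [("C", "C3"), ("I", "I2")],
}
```

For example, mutually_exclusive(info["G"], info["J"]) is true, while the pairs (I, D), (A, I) and (K, E) are reported as not mutually exclusive, in line with the examples given in the text. The same test could be applied to communication edges if each edge carries such a list; how that list is attached to an edge is left open here.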
5. Allocation and scheduling algorithm 
During the co-synthesis process [2], when the architecture
space is explored and the partition of tasks between software
(CPUs) and hardware (ASICs) is generated, a scheduling
routine is used in the inner loop of co-synthesis to evaluate
the performance of intermediate solutions and to help
generate new solutions. Because it sits in the inner loop of the
co-synthesis process, it is called by the synthesis procedure
many times. The efficiency and the quality of the scheduling
are therefore very important to the quality of the co-synthesis
result.
Our scheduling algorithm performs allocation of the
tasks on CPUs and scheduling of the tasks on CPUs and
ASICs simultaneously, so that it can take advantage of
resource sharing among mutually exclusive tasks that belong
to different branches. This differs from the algorithms
proposed in [7][13], which assume that the allocation of
tasks on CPUs is fixed.
Our allocation and scheduling algorithm is similar to those
designed by Li [2] and Sih [4]. However, our approach takes
into account the mutually exclusive tasks identified by the
earlier phase. Figure 5 outlines the allocation and scheduling
procedure.
1. for each task, calculate static_urgency(task)
2. if there is a task i in the ready list that is partitioned on ASIC
3.     schedule task i; goto 9
4. else
5.     for each ready task i and each CPU pe_j
6.         calculate dynamic_urgency(task_i, pe_j)
7.     schedule task_i on pe_j with maximum
8.         dynamic_urgency value
9. update ready task list and goto 2 until all tasks are
   scheduled.
Figure 5. Outline of allocation and scheduling algorithm
The static urgency is calculated for each task based on
the maximum distance from the task to the end task of the task
graph. This is similar to the priority assignment in some
list schedulers. For example, in Figure 1, we assume that
the communication time for each edge is 1 and that task D is
partitioned to be implemented as an ASIC. The execution time
of D as an ASIC is estimated using Monet, as mentioned
in Section 3. Suppose that the execution time of D on the ASIC
is 5; then the static urgency (SU) of the tasks is as shown in
Table 3. Note that the weight of each task allocated
to CPUs is calculated as the average WCET over the CPUs; the
user can also specify that the median WCET be used as the
weight of the task. For tasks allocated to the ASIC,
the weight is the WCET estimated by Monet from
the VHDL description of the task. The
longest branch path is used to calculate the static urgency of
a branch fork task.
Table 3. The static urgency for the example in Figure 1.
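The static urgency computation amounts to a longest-path calculation toward the end task, which the following Python sketch illustrates. Several points are assumptions rather than facts from the paper: that SU includes the task's own weight, that every edge costs a communication time of 1 as in the example above, the edges into E, and all weights except A (the average of its two WCETs from Table 1) and D (the ASIC time of 5). The values produced are therefore not those of Table 3.

```python
from functools import lru_cache


def average_wcet(wcets):
    """Weight of a CPU task: the average of its WCETs over the CPU types."""
    return sum(wcets) / len(wcets)


def static_urgency(successors, weight, comm_time=1):
    """Longest weighted distance from each task to the end of the task graph."""

    @lru_cache(maxsize=None)
    def su(task):
        # Taking the maximum over all successors means a branch fork task
        # is charged with its longest branch path.
        tails = [comm_time + su(nxt) for nxt in successors.get(task, ())]
        return weight[task] + max(tails, default=0)

    return {task: su(task) for task in weight}


# Assumed shape of Figure 1: A fans out to B, C and D, which all feed E.
successors = {"A": ("B", "C", "D"), "B": ("E",), "C": ("E",), "D": ("E",)}
weight = {
    "A": average_wcet([10, 20]),  # WCETs of A on CPU1 and CPU2
    "B": 20,                      # hypothetical
    "C": 25,                      # hypothetical
    "D": 5,                       # ASIC execution time quoted in the text
    "E": 10,                      # hypothetical
}
su = static_urgency(successors, weight)
```

With these made-up weights the fork task A takes its urgency from the branch through C, the longest of the three paths.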
The dynamic urgency is defined as:
DU(task, CPU) = SU(task)
              - max{ready_time(task), CPU_available_time}
              - WCET(task, CPU)
The dynamic urgency is related to the following factors:
1. Static urgency (SU). If a task's SU is high, the task is
   critical and should be given a high priority.
2. The earliest start time of the task on the CPU. Note
   that ready_time(task) takes into account the communication
   time from its predecessors. We assume
   that the communication time between two tasks on the
   same CPU is 0. Furthermore, mutually exclusive
   communication edges can share the same communication
   link with overlapping time slots.
3. The worst-case execution time (WCET) of the task on
   the CPU. When the CPU available time is calculated, if
   an already allocated task is mutually exclusive with the
   ready task, then the two tasks can share the same time slot
   of the processing element.
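The dynamic urgency formula and the selection of the (task, CPU) pair with the greatest value can be sketched as below. The function and parameter names are illustrative, and every number in the candidate table is made up for this example; the values are not taken from Table 4, although they are chosen so that the pair (C, CPU2) comes out at 35, the value quoted in the text.

```python
def dynamic_urgency(su, ready_time, cpu_available_time, wcet):
    """DU(task, CPU) = SU(task) - max{ready_time, CPU available time} - WCET."""
    return su - max(ready_time, cpu_available_time) - wcet


# Hypothetical inputs for two ready tasks and two CPUs.
candidates = {
    ("B", "CPU1"): dict(su=60, ready_time=10, cpu_available_time=10, wcet=25),
    ("B", "CPU2"): dict(su=60, ready_time=12, cpu_available_time=0, wcet=18),
    ("C", "CPU1"): dict(su=67, ready_time=10, cpu_available_time=10, wcet=30),
    ("C", "CPU2"): dict(su=67, ready_time=12, cpu_available_time=0, wcet=20),
}

urgency = {pair: dynamic_urgency(**args) for pair, args in candidates.items()}

# The scheduler allocates the ready task to the CPU with the maximum DU.
best = max(urgency, key=urgency.get)
```

Because the start time is subtracted, a CPU that is free earlier or runs the task faster raises the urgency of that pairing, which is how allocation and scheduling are decided in the same step.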
Figure 6 shows the subroutine in the scheduler that
determines the CPU available time for a task, using the
information obtained from the mutual exclusion detection
procedure described in Section 4.
PE_available(Task ready_task, CPU pe)  // ready_task is the
                                       // task to be allocated on pe
{
1. if no task is scheduled on pe, return 0;
2. ps1 = latest allocated task on pe;
3. if ready_task is not mutually exclusive with ps1
       return ps1.completion_time;
4. else ps1 = previously scheduled task on pe,
       goto 3
}
Figure 6. Calculation of CPU available time.
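A possible Python rendering of the Figure 6 subroutine is shown below. It is a sketch under stated assumptions: the tasks already placed on a PE are kept as a list of (task, completion_time) pairs in allocation order, the mutual exclusion test is passed in as a function, and the routine returns 0 when every task on the PE is mutually exclusive with the ready task, a case the outline leaves implicit. The schedules used in the example are hypothetical.

```python
def pe_available(ready_task, scheduled, mutually_exclusive):
    """Earliest time at which `ready_task` may use a PE.

    scheduled: list of (task, completion_time) pairs already placed on the
               PE, in the order in which they were allocated.
    mutually_exclusive: function (task_a, task_b) -> bool.
    """
    # Walk back from the latest allocated task, skipping tasks whose time
    # slot can be shared because they exclude the ready task.
    for task, completion_time in reversed(scheduled):
        if not mutually_exclusive(ready_task, task):
            return completion_time
    return 0  # empty PE, or every scheduled task is mutually exclusive


# Hypothetical exclusion relation: only B and C exclude each other.
exclusive_pairs = {frozenset(("B", "C"))}


def excl(a, b):
    return frozenset((a, b)) in exclusive_pairs


cpu2 = [("C", 32)]             # C occupies CPU2 until time 32
shared = [("A", 10), ("C", 32)]  # a PE that ran A before C
```

Here pe_available("B", cpu2, excl) is 0, so B may overlap C's slot, whereas a task that is not exclusive with C would have to wait until 32.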
Table 4 shows the first several steps of scheduling the task
graph in Figure 1 on a target architecture with one CPU1
and one CPU2. Task D is implemented as an ASIC, as
decided by the iterative improvement procedure during the
co-synthesis procedure [2].
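Task / edge | Schedule (from, to, PE)
A           | (0, 10, CPU1)
B           | (12, 30, CPU2)
C           | (12, 32, CPU2)
D           | (11, 16, ASIC)
A -> D      | (10, 11, BUS)
A -> B      | (11, 12, BUS)
A -> C      | (11, 12, BUS)
[Task names in the first four rows are inferred from the accompanying text. Residual urgency values whose column headers are lost in the source: row B: 29, 31; row C: 25, 35; row D: 31]
Table 4. The first several scheduling steps for Figure 1.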
After A is scheduled on CPU1 from 0 to 10, tasks B, C and D
are all ready to be scheduled. Since D is partitioned as an
ASIC, D and the communication edge from A to D are
scheduled. Then C is allocated and scheduled on CPU2,
since DU(C, CPU2) = 35 is the greatest among all combinations
of B and C with CPU1 and CPU2. When DU(B, CPU2) is
calculated, since B and C are mutually exclusive, B can still
be allocated and scheduled on CPU2 with a schedule time that
overlaps that of task C. Furthermore, the communication
edges A -> C and A -> B are mutually exclusive and so can
be scheduled on the bus in the same time slot.
6. Experimental results 
We have implemented our conditional task graph allocation
and scheduling algorithm as part of our co-synthesis
framework [11]. For the example shown in Figure 3, the
schedule on two CPUs is shown in Figure 7 (the communication
schedule is not shown). According to our mutual exclusion
test, tasks G, H and J are mutually exclusive, so their
execution time slots can overlap on CPU1. Similarly, D and E
are mutually exclusive, so they can share CPU2 at the same
time. Tasks K and E, however, are not mutually exclusive even
though they belong to different conditional branches, so K
must wait until E is done.
Figure 7. Scheduling result for the example in Figure 3.
Compared with the algorithms proposed by Eles et al.
and Pop [7][13], which generate a schedule table for each
condition combination, our global schedule is not necessarily
optimal for a given subset of the task graph. For
example, under a certain condition, only the subset
A, C, B, D, G, F of the task graph is executed. For this subset,
a shorter schedule exists. However, our global schedule
guarantees that a worst-case schedule is available. For example,
the global schedule fits the subset A, C, B, E, I, J, F, which
has the longest schedule length in this case. This guarantee
is important during co-synthesis, which has to find an
architecture that fits all cases. After the co-synthesis
process, we can use on-line resource reclaiming to perform
on-line scheduling, which can produce a better schedule for a
subset of the task graph. Furthermore, the ASIC cost reduction
procedure presented in [11], which is facilitated by Monet,
can also be used to reduce the ASIC cost.
We ran our algorithm on examples from Eles et al. and
Pop [7][13], and compared our schedule with their schedule
tables, which have different schedules for each subset of
the task graph. We found that our global schedule
corresponds to the worst-case schedule for each task in their
schedule table.
We  also did  some statistical  experiments.  We  modi- 
fied TGFF [12] to generate an extensive number of random 
conditional task graphs with different structures. The task 
graphs were then allocated and scheduled on various archi- 
tectures. Table 5 shows part of the experimental results. 
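Table 5. Part of the experimental results.
[Columns: example, # of tasks, # of edges, # of conditions, # of CPUs, average time; the table body is not present in the source]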
The fourth column shows the number of condition
branch fork tasks in the task graph. The fifth column shows
the number of CPUs in the target architecture (ASIC
information is not shown here; our co-synthesis algorithm in
[11] decides the partitioning of the task graph between CPUs
and ASICs). The last column shows the average running
time of the algorithm on a Celeron 533 computer running
Linux. Example EX3 shows that even though the number of
condition variables increases, which means that the number of
execution subsets of the task graph increases, the running
time of the algorithm does not increase dramatically. The
reason is that our algorithm does not schedule each subset of
the task graph corresponding to each condition combination
separately. Instead, it calculates the mutual exclusion
information for the whole task graph and then schedules the
whole task graph by exploiting the resource sharing among
mutually exclusive tasks.
7. Conclusion
This paper introduces an allocation and scheduling algorithm
that is used in the inner loop of the co-synthesis procedure.
Conditional task graphs are handled with the help of a
mutual exclusion detection procedure.
8. Acknowledgement
This work was funded by Mentor Graphics with additional
funding from NSF.
References
[1] Wayne Wolf, "An architectural co-synthesis algorithm for distributed,
embedded computing systems", IEEE Transactions on VLSI Systems, vol. 5,
no. 2, pp. 218-229, June 1997.
[2] Yanbing Li, "Hardware-software co-synthesis of embedded real-time
multiprocessor", Ph.D. dissertation, Princeton University, 1998.
[3] Wayne Wolf and Jorgen Staunstrup, "Hardware/Software Co-design:
Principles and Practice", Kluwer Academic Publishers, 1997.
[4] G. Sih and E. A. Lee, "A compile-time scheduling heuristic for
interconnection constrained heterogeneous processor architectures", IEEE
Transactions on Parallel and Distributed Systems, vol. 4, no. 2,
pp. 175-187, Feb. 1993.
[5] Prakash, Parker, "SOS: synthesis of application specific heterogeneous
multiprocessor systems", Journal of Parallel and Distributed
Computing, vol. 16, pp. 338-351, 1992.
[6] Monet reference manual, Mentor Graphics Company.
[7] Petru Eles et al., "Scheduling of conditional process graphs for the
synthesis of embedded systems", Proceedings of DATE, pp. 23-26,
1998.
[8] K. Kuchcinski, "Embedded system synthesis by timing constraint
solving", Proc. Int. Symp. on Syst. Synth., pp. 50-57, 1997.
[9] Said Amellal et al., "Functional synthesis of digital systems with
TASS", IEEE Trans. on CAD, vol. 13, no. 5, pp. 537-551, 1994.
[10] G. Lakshminarayana, N. Jha, "Wavesched: A novel scheduling
technique for control-flow intensive behavioral descriptions", Proceedings
of ICCAD, pp. 244-250, 1997.
[11] Yuan Xie, Wayne Wolf, "Co-synthesis with custom ASIC",
Proceedings of ASP-DAC 2000, pp. 129-135, 2000.
[12] R. P. Dick, D. L. Rhodes, W. Wolf, "TGFF: Task graphs for free",
Proc. Int. Workshop Hardware/Software Codesign, pp. 97-101,
Mar. 1998.
[13] Paul Pop, "Scheduling and communication synthesis for distributed
real-time systems", Ph.D. thesis, Linköping University, 2000.
[14] Y-T. S. Li, "Performance analysis of real-time embedded software",
Ph.D. thesis, Princeton University, 1997.