Advanced list scheduling heuristic for task scheduling with communication contention for parallel embedded systems by Mu, Pengcheng et al.
Pengcheng Mu, Jean François Nezan, Mickael Raulet, Jean-Gabriel Cousin
To cite this version:
Pengcheng Mu, Jean François Nezan, Mickael Raulet, Jean-Gabriel Cousin. Advanced list
scheduling heuristic for task scheduling with communication contention for parallel embedded
systems. Science China Information Sciences, Springer, 2010, 53 (11), pp.2272-2286.
<10.1007/s11432-010-4097-3>. <hal-00526387>
HAL Id: hal-00526387
https://hal.archives-ouvertes.fr/hal-00526387
Submitted on 14 Oct 2010
MU PengCheng1∗, NEZAN Jean-François2, RAULET Mickaël2 & COUSIN Jean-Gabriel2
1Ministry of Education Key Lab for Intelligent Networks and Network Security,
School of Electronics and Information Engineering, Xi’an Jiaotong University, Xi’an 710049, China,
2IETR/Image and Remote Sensing Group, CNRS UMR 6164/INSA Rennes, 35043 RENNES Cedex, France
Received September 18, 2009; accepted April 2, 2010
Abstract Modern embedded systems tend to use multiple cores or processors to run parallel applications. This paper
addresses task scheduling with communication contention for parallel embedded systems and proposes three advanced
techniques to improve the list scheduling heuristic. First, five groups of node levels (two existing groups and three new groups) are
used as node priorities to generate node lists. Second, the critical child technique improves the selection of a processor in the
scheduling process. Third, the communication delay technique enlarges the idle time intervals on communication links. We also
propose an advanced dynamic list scheduling heuristic that combines the three techniques. Experimental results show that the
combined advanced dynamic heuristic shortens the schedule length for most of the randomly generated DAGs in the
cases of medium and high communication. Our method accelerates an application by up to 80% in the case of high communication
and can also reduce the use of hardware resources.
Keywords list scheduling, communication contention, node level, critical child, communication delay
Citation Mu P C. Advanced List Scheduling Heuristic for Task Scheduling with Communication Contention for Parallel Embedded Systems.
1 Introduction
The recent evolution of digital communication and video compression applications has dramatically increased
the complexity of both algorithms and embedded systems. To face this problem, the System on a Chip (SoC),
which embeds several cores (e.g. multi-core DSPs) and several hardware accelerators (e.g. Intellectual Properties),
has become the basic element for building complex embedded systems, and dataflow programming has been proposed for
multiprocessor programming[1]. Task scheduling of a dataflow program over a multi-component embedded system
is becoming more and more important due to the growing requirements of applications. However, task scheduling
is not straightforward; when performed manually, the result is usually a suboptimal solution. Scheduling on general
parallel computer architectures has been actively researched, but task scheduling on parallel embedded systems[2]
differs from the general scheduling problem: communications between cores have a very important impact on
the schedule and on the resulting use of hardware resources. Hence, it is necessary to find new task scheduling
methodologies that produce optimal or near-optimal results for parallel embedded systems.
In the task scheduling problem, the program is represented as a task graph modeled as a Directed Acyclic Graph
(DAG)[2,3], where nodes represent tasks (i.e. computations) and edges represent dataflows (i.e. communications)
between tasks. The objective of task scheduling is to assign computations and communications, respectively, to
∗Corresponding author (email: pengchengmu@gmail.com)
2 Mu P C, et al.
processors and buses (communication links) of the target system in order to get the minimum schedule length
(makespan). The scheduling can be static (done at compile time) or dynamic (done at run time). Static scheduling
is more suitable than dynamic scheduling for deterministic applications in parallel embedded systems because it
leads to smaller code size and higher computation efficiency. This paper tackles the static scheduling problem for
programming on parallel embedded systems, and all the task scheduling heuristics in the following parts are applied
at compile time.
The general task scheduling problem is proven to be NP-hard[3,4]; hence, many works try to find heuristics
that approach the optimal solution. Early task scheduling heuristics do not consider communications between
tasks[5,6]. As communications increase in modern applications, many scheduling heuristics have to take them
into account[3,7-10]. Most of these heuristics assume fully connected system topologies in which all
communications can be performed concurrently. Different arbitrary processor networks are then used in refs. [11-15]
to describe real parallel systems accurately, and the task scheduling takes into account communication contention
on communication links.
Most of the above heuristics are based on the list scheduling approach. Basic techniques are given in ref.
[16] for list scheduling with communication contention. This paper gives an advanced list scheduling heuristic
with several advanced techniques for task scheduling with communication contention in parallel embedded
systems. First, three new groups of node levels are defined and used as node priorities to generate node lists in
addition to the two existing groups; second, a technique using a node's critical child is given to improve
the selection of a processor for a node; and third, the communication delay technique delays a
communication when necessary in order to enlarge idle time intervals on communication links. This paper
finally combines these three techniques and shows the efficiency of the results.
The paper is organized as follows. Section 2 introduces the necessary models and definitions and then describes
the task scheduling problem with communication contention. Different node levels are
given in section 3 by considering the communication contention. Section 4 gives the list scheduling heuristics,
including the classic static heuristic and our advanced heuristic. Experimental results comparing our heuristic with
the classic one are given in section 5. The paper is concluded in section 6.
2 Models and Definitions
The program to be scheduled is called an algorithm and is modeled as a DAG in this paper. The multiprocessor
parallel embedded system is called an architecture and is modeled as a topology graph. These two models are
detailed as follows.
2.1 DAG Model
A DAG is a directed acyclic graph $G = (V, E, w, c)$, where $V$ is the set of nodes and $E$ is the set of edges. For
two nodes $n_i, n_j \in V$, $e_{ij}$ denotes the edge from the origin node $n_i$ to the destination node $n_j$. A node represents
a computation, and the weight of node $n_i$ (denoted by $w(n_i)$) represents the time cost of the computation. An edge
represents the communication between two nodes, and the weight of edge $e_{ij}$ (denoted by $c(e_{ij})$) represents the
time cost of the communication. In this model, the set $\{n_x \in V : e_{xi} \in E\}$ of all the direct predecessors of node $n_i$ is
denoted by $\mathit{pred}(n_i)$; the set $\{n_x \in V : e_{ix} \in E\}$ of all the direct successors of node $n_i$ is denoted by $\mathit{succ}(n_i)$.
A node $n_i$ with $\mathit{pred}(n_i) = \emptyset$ is named a source node, and a node $n_i$ with $\mathit{succ}(n_i) = \emptyset$ is named a sink node,
where $\emptyset$ is the empty set.
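The DAG model above can be sketched as a small data structure. This is only an illustration: the class name `DAG`, its method names, and the example weights are assumptions made here, not part of the paper.

```python
from dataclasses import dataclass, field


@dataclass
class DAG:
    """Task graph G = (V, E, w, c) (illustrative container, not the authors' code)."""
    w: dict = field(default_factory=dict)  # node -> computation cost w(n)
    c: dict = field(default_factory=dict)  # (origin, destination) -> communication cost c(e)

    def pred(self, n):
        """Direct predecessors of n: all n_x with an edge n_x -> n."""
        return {a for (a, b) in self.c if b == n}

    def succ(self, n):
        """Direct successors of n: all n_x with an edge n -> n_x."""
        return {b for (a, b) in self.c if a == n}

    def sources(self):
        """Nodes without predecessors."""
        return {n for n in self.w if not self.pred(n)}

    def sinks(self):
        """Nodes without successors."""
        return {n for n in self.w if not self.succ(n)}


# Hypothetical join graph: three parents feeding one child.
g = DAG(w={"n1": 1, "n2": 1, "n3": 1, "n4": 1},
        c={("n1", "n4"): 2, ("n2", "n4"): 2, ("n3", "n4"): 2})
```

Here `g.sources()` contains the three parents and `g.sinks()` contains only `n4`.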
The execution of computations on a processor is sequential. A computation cannot be divided into parts. A
computation cannot start until all its input communications finish, and none of its output communications can start until
this computation finishes. Communications are also sequential on a communication link, but different computations
and communications can be executed simultaneously as long as the input and output constraints given above are respected.
Figure 1(a) gives a DAG example used in ref. [17] to illustrate the performance of different scheduling heuristics. It
is also used in subsection 5.1 to show the performance of our method.
Mu P C, et al. Sci China Inf Sci ? 2010 Vol. ? No. ? 3
[Figure 1: System models. (a) Algorithm; (b) Architecture 1; (c) Architecture 2; (d) Architecture 3.]
2.2 Topology Graph Model
A topology graph $\mathit{TG} = (N, P, D, H, b)$ has been used to model a target system of multiple processors interconnected
by communication links and switches[14]. $N$ is the set of vertices; $P$ is a subset of $N$, $P \subseteq N$; $D$ is the set
of directed edges; $H$ is the set of hyperedges; $b$ is the relative data rate of an edge. The union of the two edge sets $D$
and $H$ is designated the link set $L$, $L = D \cup H$; an element of this set is denoted by $l \in L$. The topology graph is
also denoted by $\mathit{TG} = (N, P, L, b)$.
Since a parallel embedded system usually consists of multiple heterogeneous components, the topology graph is
used to model it in this paper. A vertex $p \in P$ represents a processor; a vertex $n \in N, n \notin P$ represents a switch.
It is supposed that directed edges are not used in a topology graph. Hence, a link $l \in L$ is actually a hyperedge
$h$, which is a subset of two or more vertices of $N$, $h \subseteq N, |h| > 1$. A hyperedge connects multiple vertices and
represents a half-duplex multidirectional communication link (e.g. a bus). The weight $b(l)$ associated with a link
$l \in L$ represents its relative data rate.
Unlike a processor, a switch is an ideal vertex used only for connecting communication links, and no
computation can be executed on it.
Ideal Switch: For a switch $s$, let $l_1, l_2, \ldots, l_n$ be all the communication links connected to $s$. If two of these links, $l_{i_1}$
and $l_{i_2}$, are not in use at a given moment, a communication can be transferred from $l_{i_1}$ to $l_{i_2}$ without any impact
from/to communications on other communication links connected to $s$.
Figure 1(b) gives an architecture example with six processors ($P_1$, $P_2$, $P_3$, $P_4$, $P_5$ and $P_6$) interconnected by
seven links ($L_1$, $L_2$, $L_3$, $L_4$, $L_5$, $L_6$ and $L_7$) and two switches ($S_1$ and $S_2$). This architecture models TI's C6474
Evaluation Module (EVM), which includes two C6474 multicore DSPs (see footnote 1). Figures 1(c) and 1(d) show two other
architectures, which will be used for the experimental results in subsections 5.1 and 5.2, respectively.
A route is used to transfer data from one processor to another in a parallel embedded system. It is a chain of
links connected by switches from the origin processor to the destination processor. For example, $L_1 \to L_7 \to L_4$
is a route from $P_1$ to $P_4$ in Figure 1(b). A link $l$ on a route $R$ is denoted by $l \in R$. All the routes from processor
$p_i$ to processor $p_j$ compose a set of routes $\mathit{RS}(p_i, p_j)$. If $p_i = p_j$, then $\mathit{RS}(p_i, p_j) = \emptyset$, which means no route is
needed.
Routing is the procedure of generating routes and is an important aspect of task scheduling. In ref. [15], routes
are created dynamically during scheduling to improve performance, but the system architecture there does not use
switches. In parallel embedded systems that use switches, routes are usually determined once and stored in a table,
which is called static routing. This paper uses static routing and supposes that there is at least one route
between any two processors. Hence, routing during scheduling reduces to looking up the table of routes.
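A table lookup of this kind might look as follows. The table layout, the names `ROUTES` and `route_set`, and the single entry (taken from the $P_1$ to $P_4$ example above) are illustrative assumptions; a real table would hold at least one route for every ordered pair of distinct processors.

```python
# Hypothetical static routing table: (origin, destination) -> list of routes.
# Each route is a tuple of link names in traversal order.
ROUTES = {
    ("P1", "P4"): [("L1", "L7", "L4")],
}


def route_set(p_i, p_j, table=ROUTES):
    """Return RS(p_i, p_j); it is empty when both ends are the same processor."""
    if p_i == p_j:
        return []  # no communication link is needed on a single processor
    return table[(p_i, p_j)]
```

With static routing, the scheduler never computes a path at scheduling time; it only iterates over the routes returned by the lookup.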
2.3 Task Scheduling with Communication Contention
A schedule of a DAG is the association of a start time and a processor with each node of the DAG. When
communication contention is considered, a schedule also includes allocating communications to links and associating
start times on these links with each communication. A schedule $S$ of a DAG $G = (V, E, w, c)$ over a topology
graph $\mathit{TG} = (N, P, L, b)$ is described by the following terms.
Footnote 1: http://www.ti.com/
The start time of a node $n_i \in V$ on a processor $p \in P$ is denoted by $t_s(n_i, p)$; the finish time is given by
$$t_f(n_i, p) = t_s(n_i, p) + w(n_i, p)$$
where $w(n_i, p)$ is the execution duration of $n_i$ on $p$. The schedule length of $S$ is the maximum finish time among
all the nodes,
$$\mathit{sl}(S) = \max_{n_i \in V} \{ t_f(n_i, \mathit{proc}(n_i)) \}$$
where $\mathit{proc}(n_i)$ denotes the processor on which $n_i$ is allocated.
Since execution durations of a node on different processors can be very different ($w(n_i, p_j) \gg w(n_i, p_k)$),
this node is usually constrained to some processors which give relatively small execution durations. The set of
processors on which $n_i$ can be executed is denoted by $\mathit{Proc}(n_i)$. The average computation duration of a node on
these processors is used to represent the node weight, which is given by
$$w(n_i) = \frac{1}{|\mathit{Proc}(n_i)|} \sum_{p \in \mathit{Proc}(n_i)} w(n_i, p)$$
where $|\mathit{Proc}(n_i)|$ is the number of processors in $\mathit{Proc}(n_i)$.
The communication represented by an edge is needed only when the edge's origin node and destination node
are not allocated on the same processor. The start time of an edge $e_{ij} \in E$ on a link $l$ of route $R$ is denoted
by $t_s(e_{ij}, l, R)$. Communications are handled in a cut-through manner on a route because of the use of circuit
switching in embedded systems. Hence, $e_{ij}$ is aligned on all the links of the route $R = l_1 \to l_2 \to \ldots \to l_k$ with
$t_s(e_{ij}, l_1, R) = t_s(e_{ij}, l_2, R) = \ldots = t_s(e_{ij}, l_k, R)$. The route on which $e_{ij}$ is allocated is denoted by $R(e_{ij})$.
The start time and finish time of $e_{ij}$ on all the links of the route $R = R(e_{ij})$ are uniformly denoted by $t_s(e_{ij}, R)$
and $t_f(e_{ij}, R)$ with
$$t_f(e_{ij}, R) = t_s(e_{ij}, R) + \frac{d(e_{ij})}{\min_{l \in R} \{b(l)\}}$$
where $d(e_{ij})$ is the amount of data to be transferred by $e_{ij}$, and $\min_{l \in R} \{b(l)\}$ is the minimum data rate of the
links in the route $R$. The average communication duration of an edge on all its possible routes is used to represent
the edge weight, which is given by
$$c(e_{ij}) = \frac{1}{\sum_{p_x, p_y} |\mathit{RS}(p_x, p_y)|} \sum_{p_x, p_y} \left\{ \sum_{R \in \mathit{RS}(p_x, p_y)} \frac{d(e_{ij})}{\min_{l \in R} \{b(l)\}} \right\}$$
where $p_x \in \mathit{Proc}(n_i)$, $p_y \in \mathit{Proc}(n_j)$. This calculation of $c(e_{ij})$ is first proposed in this paper and is
more suitable for task scheduling in parallel embedded systems.
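The two averages defined above (node weight and edge weight) can be sketched as follows. The function names, the dictionary-based inputs, and the choice to return 0 when no route exists at all are assumptions made for this illustration.

```python
def node_weight(durations):
    """w(n_i): mean of w(n_i, p) over p in Proc(n_i).

    durations maps each allowed processor to the execution duration on it.
    """
    return sum(durations.values()) / len(durations)


def edge_weight(d, procs_i, procs_j, routes, rate):
    """c(e_ij): d(e_ij) / min b(l), averaged over every route of every pair (p_x, p_y).

    d       -- amount of data carried by the edge
    procs_i -- Proc(n_i), processors allowed for the origin node
    procs_j -- Proc(n_j), processors allowed for the destination node
    routes  -- (p_x, p_y) -> list of routes, each a tuple of link names
    rate    -- link name -> relative data rate b(l)
    """
    total, count = 0.0, 0
    for p_x in procs_i:
        for p_y in procs_j:
            # A pair on the same processor has an empty route set and adds nothing.
            for route in routes.get((p_x, p_y), []):
                total += d / min(rate[l] for l in route)
                count += 1
    # Assumption: with no route at all (single processor) the cost is taken as 0.
    return total / count if count else 0.0


# Hypothetical two-processor system with three routes in total.
routes = {("P1", "P2"): [("L1",), ("L2", "L3")], ("P2", "P1"): [("L1",)]}
rate = {"L1": 2, "L2": 1, "L3": 4}
avg = edge_weight(8, ["P1", "P2"], ["P1", "P2"], routes, rate)
```

In this example the three routes cost 4, 8 and 4 time units, so `avg` is 16/3.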
A node (computation) can start on a processor at the time when all the node's input edges (communications)
finish. This time is called the Data Ready Time (DRT) and is denoted by
$$\mathit{DRT}(n_j, p) = \max_{e_{ij} \in E} \{ t_f(e_{ij}, R(e_{ij})) \}$$
DRT is the earliest time when a node can start. If $n_j$ is a node without input edges, then $\mathit{DRT}(n_j, p) = 0, \forall p \in P$,
which means the data of $n_j$ are ready at the beginning (time 0).
The insertion technique is usually used for node and edge scheduling[16]. The conditions to use the insertion
technique for node and edge scheduling are explained as follows.
Node Scheduling Condition: For a node $n_i$, let $[A, B]$ ($A, B \in [0, \infty]$) be an idle time interval on the processor
$p$. $n_i$ can be scheduled on $p$ within $[A, B]$ if $\max\{A, \mathit{DRT}(n_i, p)\} + w(n_i, p) \leq B$. The start time of $n_i$ on $p$ is
given by $t_s(n_i, p) = \max\{A, \mathit{DRT}(n_i, p)\}$.
Edge Scheduling Condition: For an edge $e_{ij}$, let $R$ be a route for this edge and let $[A, B]$ ($A, B \in [0, \infty]$)
be a common idle time interval on all the links of this route. $e_{ij}$ can be scheduled on $R$ within $[A, B]$ if
$\max\{A, t_f(n_i, \mathit{proc}(n_i))\} + \frac{d(e_{ij})}{\min_{l \in R} \{b(l)\}} \leq B$. The start time of $e_{ij}$ on this route is given by
$t_s(e_{ij}, R) = \max\{A, t_f(n_i, \mathit{proc}(n_i))\}$.
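Both conditions share one pattern: scan the idle intervals in time order and take the first one in which the item fits. The sketch below illustrates that pattern only; the function name `earliest_start` and the representation of a resource as a sorted list of busy slots are assumptions, not the implementation of ref. [16].

```python
def earliest_start(busy, ready, duration):
    """Earliest start time satisfying max(A, ready) + duration <= B.

    busy     -- sorted, non-overlapping (start, finish) slots already occupied
                on the processor (or jointly on all links of a route)
    ready    -- DRT(n_i, p) for a node, or t_f(n_i, proc(n_i)) for an edge
    duration -- w(n_i, p) for a node, or d(e_ij) / min b(l) for an edge
    """
    a = 0.0  # left end A of the current idle interval
    for start, finish in busy:
        s = max(a, ready)
        if s + duration <= start:  # fits in the idle interval [a, start]
            return s
        a = finish
    return max(a, ready)  # last idle interval [a, infinity)
```

For example, with busy slots `[(0, 2), (5, 7)]`, a ready time of 1 and a duration of 3, the item is inserted at time 2; with a duration of 4 it no longer fits in the gap and starts at 7.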
3 Node Levels with Communication Contention
The top level and bottom level are usually used as node priorities, which are important for DAG scheduling[11,18].
The top level of a node is the length of the longest path from any source node to this node, excluding the weight
of this node; the bottom level of a node is the length of the longest path from this node to any sink node, including
the weight of this node. Two groups of top and bottom levels have been used in task scheduling heuristics:
1) computation top and bottom levels ($\mathit{tl}_{comp}$ and $\mathit{bl}_{comp}$), 2) top and bottom levels ($\mathit{tl}$ and $\mathit{bl}$). In addition,
this paper proposes three new groups: 3) input top and bottom levels ($\mathit{tl}_{in}$ and $\mathit{bl}_{in}$), 4) output
top and bottom levels ($\mathit{tl}_{out}$ and $\mathit{bl}_{out}$), 5) input/output top and bottom levels ($\mathit{tl}_{io}$ and $\mathit{bl}_{io}$). Figure 2 illustrates the
dependencies between nodes that define the different top levels and bottom levels, where the red dotted nodes and edges
are used to recursively define the top levels and bottom levels of $n_i$.
[Figure 2: Five groups of node levels. (a) tl_comp and bl_comp; (b) tl and bl; (c) tl_in and bl_in; (d) tl_out and bl_out; (e) tl_io and bl_io.]
1. Computation top level and bottom level (Figure 2(a))
The computation top level of a node is the length of the longest path from any source node to this node only
including the weights of nodes; the computation bottom level of a node is the length of the longest path from
this node to any sink node only including the weights of nodes. The weights of edges are not taken into
account in the computation top level and bottom level. They are recursively defined as follows:
$$\mathit{tl}_{comp}(n_i) = \begin{cases} 0, & \text{if } n_i \text{ is a source node} \\ \max_{n_k \in \mathit{pred}(n_i)} \{ \mathit{tl}_{comp}(n_k) + w(n_k) \}, & \text{otherwise} \end{cases}$$
$$\mathit{bl}_{comp}(n_i) = \begin{cases} w(n_i), & \text{if } n_i \text{ is a sink node} \\ \max_{n_k \in \mathit{succ}(n_i)} \{ \mathit{bl}_{comp}(n_k) \} + w(n_i), & \text{otherwise} \end{cases}$$
2. Top level and bottom level (Figure 2(b))
In contrast with the computation top level and bottom level, the top level and bottom level additionally take into
account the weights of edges on the path. They are recursively defined as follows:
$$\mathit{tl}(n_i) = \begin{cases} 0, & \text{if } n_i \text{ is a source node} \\ \max_{n_k \in \mathit{pred}(n_i)} \{ \mathit{tl}(n_k) + w(n_k) + c(e_{ki}) \}, & \text{otherwise} \end{cases}$$
$$\mathit{bl}(n_i) = \begin{cases} w(n_i), & \text{if } n_i \text{ is a sink node} \\ \max_{n_k \in \mathit{succ}(n_i)} \{ \mathit{bl}(n_k) + c(e_{ik}) \} + w(n_i), & \text{otherwise} \end{cases}$$
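3. Input top level and bottom level (Figure 2(c))
The input top level and bottom level take into account weights of nodes on the path as well as weights of all
the input edges of a node on the path. They are recursively defined as follows:
$$\mathit{tl}_{in}(n_i) = \begin{cases} 0, & \text{if } n_i \text{ is a source node} \\ \max_{n_k \in \mathit{pred}(n_i)} \{ \mathit{tl}_{in}(n_k) + w(n_k) \} + \sum_{e_{li} \in E} c(e_{li}), & \text{otherwise} \end{cases}$$
$$\mathit{bl}_{in}(n_i) = \begin{cases} w(n_i), & \text{if } n_i \text{ is a sink node} \\ \max_{n_k \in \mathit{succ}(n_i)} \left\{ \mathit{bl}_{in}(n_k) + \sum_{e_{lk} \in E} c(e_{lk}) \right\} + w(n_i), & \text{otherwise} \end{cases}$$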
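4. Output top level and bottom level (Figure 2(d))
The output top level and bottom level take into account weights of nodes on the path as well as weights of
all the output edges of a node on the path. They are recursively defined as follows:
$$\mathit{tl}_{out}(n_i) = \begin{cases} 0, & \text{if } n_i \text{ is a source node} \\ \max_{n_k \in \mathit{pred}(n_i)} \left\{ \mathit{tl}_{out}(n_k) + w(n_k) + \sum_{e_{kl} \in E} c(e_{kl}) \right\}, & \text{otherwise} \end{cases}$$
$$\mathit{bl}_{out}(n_i) = \begin{cases} w(n_i), & \text{if } n_i \text{ is a sink node} \\ \max_{n_k \in \mathit{succ}(n_i)} \{ \mathit{bl}_{out}(n_k) \} + \sum_{e_{il} \in E} c(e_{il}) + w(n_i), & \text{otherwise} \end{cases}$$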
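5. Input/output top level and bottom level (Figure 2(e))
The input/output top level and bottom level take into account weights of nodes on the path as well as weights
of all the input and output edges of a node on the path. They are recursively defined as follows:
$$\mathit{tl}_{io}(n_i) = \begin{cases} 0, & \text{if } n_i \text{ is a source node} \\ \max_{n_k \in \mathit{pred}(n_i)} \left\{ \mathit{tl}_{io}(n_k) + w(n_k) + \sum_{e_{kl} \in E} c(e_{kl}) - c(e_{ki}) \right\} + \sum_{e_{li} \in E} c(e_{li}), & \text{otherwise} \end{cases}$$
$$\mathit{bl}_{io}(n_i) = \begin{cases} w(n_i), & \text{if } n_i \text{ is a sink node} \\ \max_{n_k \in \mathit{succ}(n_i)} \left\{ \mathit{bl}_{io}(n_k) + \sum_{e_{lk} \in E} c(e_{lk}) - c(e_{ik}) \right\} + \sum_{e_{il} \in E} c(e_{il}) + w(n_i), & \text{otherwise} \end{cases}$$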
The three new groups take into account the communication contention between nodes in comparison with the
two existing groups which are usually used in the list scheduling without communication contention. Table 1 gives
all the five groups of top levels and bottom levels for the DAG given in Figure 1(a). This table will be used in
subsection 5.1.
Table 1: Different node levels
Node  tl_comp  bl_comp  tl  bl  tl_in  bl_in  tl_out  bl_out  tl_io  bl_io
n1    0        11       0   23  0      41     0       35      0      55
n2    2        8        6   15  6      35     19      16      19     36
n3    2        8        3   14  3      26     19      14      19     26
n4    2        9        3   15  3      27     19      15      19     27
n5    2        5        3   5   3      5      19      5       19     5
n6    5        5        10  10  10     21     24      10      24     21
n7    5        5        12  11  20     21     24      11      34     21
n8    6        5        8   10  9      21     24      10      25     21
n9    10       1        22  1   40     1      34      1       54     1
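The five recursive definitions can be checked with a short script. The node weights `W` and edge weights `C` below are an assumption: the drawing of Figure 1(a) is not recoverable from the text, so they were reconstructed such that every value of Table 1 is reproduced. The function and group names are illustrative only (`"plain"` stands for the $\mathit{tl}$/$\mathit{bl}$ group).

```python
# ASSUMED weights for the DAG of Figure 1(a), reconstructed to match Table 1.
W = {1: 2, 2: 3, 3: 3, 4: 4, 5: 5, 6: 4, 7: 4, 8: 4, 9: 1}
C = {(1, 2): 4, (1, 3): 1, (1, 4): 1, (1, 5): 1, (1, 7): 10,
     (2, 6): 1, (2, 7): 1, (3, 8): 1, (4, 8): 1,
     (6, 9): 5, (7, 9): 6, (8, 9): 5}

PRED = {n: [a for (a, b) in C if b == n] for n in W}
SUCC = {n: [b for (a, b) in C if a == n] for n in W}
IN = {n: sum(C[k, n] for k in PRED[n]) for n in W}    # sum of input edge weights
OUT = {n: sum(C[n, k] for k in SUCC[n]) for n in W}   # sum of output edge weights


def top_levels(group):
    """Top levels for group in {'comp', 'plain', 'in', 'out', 'io'}."""
    inside = {"comp": lambda k, i: 0,                 # term inside the max for edge k -> i
              "plain": lambda k, i: C[k, i],
              "in": lambda k, i: 0,
              "out": lambda k, i: OUT[k],
              "io": lambda k, i: OUT[k] - C[k, i]}[group]
    outside = (lambda i: IN[i]) if group in ("in", "io") else (lambda i: 0)
    tl = {}

    def visit(i):
        if i not in tl:
            tl[i] = 0 if not PRED[i] else max(
                visit(k) + W[k] + inside(k, i) for k in PRED[i]) + outside(i)
        return tl[i]

    return {n: visit(n) for n in W}


def bottom_levels(group):
    """Bottom levels for group in {'comp', 'plain', 'in', 'out', 'io'}."""
    inside = {"comp": lambda i, k: 0,                 # term inside the max for edge i -> k
              "plain": lambda i, k: C[i, k],
              "in": lambda i, k: IN[k],
              "out": lambda i, k: 0,
              "io": lambda i, k: IN[k] - C[i, k]}[group]
    outside = (lambda i: OUT[i]) if group in ("out", "io") else (lambda i: 0)
    bl = {}

    def visit(i):
        if i not in bl:
            bl[i] = W[i] if not SUCC[i] else max(
                visit(k) + inside(i, k) for k in SUCC[i]) + outside(i) + W[i]
        return bl[i]

    return {n: visit(n) for n in W}
```

Calling `bottom_levels("io")` on these assumed weights gives 55 for $n_1$ and 36 for $n_2$, in agreement with the last column of Table 1.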
4 List Scheduling Heuristics
List scheduling is an important task scheduling heuristic. Algorithm 1 gives a commonly used static list scheduling
heuristic as given in ref. [14].
This algorithm consists of three procedures. Nodes are first sorted into a static list by the procedure
Sort Nodes(); then a processor is selected for each node by Select Processor(), and
the node is scheduled by Schedule Node(). Since the order of nodes in the list affects the schedule result,
many different priority schemes have been proposed to sort nodes[10,11]. Experiments in ref. [18] show that list
scheduling with a static list sorted by bottom level outperforms the other contention-aware algorithms compared. Hence,
this paper uses the following rule to sort nodes.
Rule of Sorting Nodes: Nodes are sorted in decreasing order of their bottom levels; if two nodes have
equal bottom levels, the one with the greater top level is placed before the other; if both the bottom level and the top
level are equal, these nodes are randomly sorted.
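The sorting rule maps directly onto a three-part sort key. The sketch below is an illustration with an assumed function name; the `bl` and `tl` values are the $\mathit{bl}$ and $\mathit{tl}$ columns of Table 1.

```python
import random


def sort_nodes(nodes, bl, tl, rng=random):
    """Decreasing bottom level, then greater top level first, then random order."""
    # The random component is drawn once per node and only matters for full ties.
    return sorted(nodes, key=lambda n: (-bl[n], -tl[n], rng.random()))


# bl and tl columns of Table 1 for nodes n1..n9.
bl = {1: 23, 2: 15, 3: 14, 4: 15, 5: 5, 6: 10, 7: 11, 8: 10, 9: 1}
tl = {1: 0, 2: 6, 3: 3, 4: 3, 5: 3, 6: 10, 7: 12, 8: 8, 9: 22}
node_list = sort_nodes(range(1, 10), bl, tl)
```

No two nodes tie on both levels here, so the result is deterministic: $n_1, n_2, n_4, n_3, n_7, n_6, n_8, n_5, n_9$. The nodes $n_2$ and $n_4$ share a bottom level of 15, and $n_2$ comes first because its top level is greater.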
Algorithm 1: Static List Scheduling(G, TG)
Input: A DAG G = (V, E, w, c) and a topology graph TG = (N, P, L, b)
Output: A schedule of G on TG
NodeList ← Sort Nodes(V);
for each n ∈ NodeList do
    p_best ← Select Processor(n, P);
    Schedule Node(n, p_best);
end
Details about the static list scheduling heuristic can be found in ref. [16]. This heuristic is regarded as the classic
list scheduling heuristic and will be used for comparison with our advanced method. The following subsections give two
advanced list scheduling techniques and an advanced dynamic list scheduling heuristic that uses these two techniques.
4.1 Processor Selection with Critical Child
The classic list scheduling heuristic selects the processor allowing the earliest finish time for a node. This rule
gives a result that is at best locally optimized. In fact, it usually gives bad results for the join structure of a
DAG, especially in the case of high communication cost and communication contention. Figure 3(a) shows such
an example; Figure 3(b) gives the schedule result with the classic processor selection method, which selects a new
processor for each of $n_1$, $n_2$ and $n_3$ to provide the earliest finish time. Hence, the execution of node $n_4$ has to
wait until the communications from $n_2$ and $n_3$ finish, and the schedule length is 6. By contrast, the schedule
of all nodes on the same processor is shown in Figure 3(c) and has a schedule length of 4. The reason for the bad
result of the classic method is that the successor is not taken into account during the processor selection; hence, we
propose a technique of critical child to avoid this bad result.
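The two schedule lengths can be retraced with a short calculation. It assumes, as Figure 3 suggests, unit node weights, edge weights of 2, a single shared link $L_1$, and $n_4$ placed on the processor of $n_1$, so that only $e_{2,4}$ and $e_{3,4}$ need the link and are serialized on it:

```latex
% Separate processors: n_1, n_2, n_3 run in parallel during [0, 1];
% the two communications then occupy L_1 one after the other.
\begin{align*}
sl_{\text{separate}} &= \underbrace{1}_{n_2,\,n_3}
  + \underbrace{2 + 2}_{e_{2,4},\ e_{3,4}\ \text{on } L_1}
  + \underbrace{1}_{n_4} = 6 \\
% Single processor: no communication is needed, nodes run back to back.
sl_{\text{single}} &= w(n_1) + w(n_2) + w(n_3) + w(n_4) = 1 + 1 + 1 + 1 = 4
\end{align*}
```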
[Figure 3: A join DAG and two different schedule results. (a) The join DAG; (b) schedule on processors P1, P2, P3 with link L1 (schedule length 6); (c) schedule of all nodes on one processor (schedule length 4).]
In ref. [10], the critical child of a node is defined as the successor that has the smallest difference between
the absolute latest possible start time (ALST) and the absolute earliest possible start time (AEST). It is used for
scheduling in the case of an unbounded number of processors and without communication contention. We use the
concept of critical child for list scheduling in the case of a bounded number of processors and with communication
contention. The critical child is defined differently, as follows.
Critical Child: Given a static node list $\mathit{NodeList}$, the critical child of node $n_i$ is denoted by $\mathit{cc}(n_i)$ and is the
successor of $n_i$ that appears first in $\mathit{NodeList}$.
According to this definition, the critical child of $n_i$ may be different if $\mathit{NodeList}$ differs even though the DAG is
not changed. This is the difference between our critical child and that in ref. [10]. Using the critical child makes the
processor selection take into account not only the predecessors of a node, but also its most important successor.
Our method of using the critical child to select a processor is given in Algorithm 2.
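The definition of the critical child amounts to a single scan of the node list. The sketch below uses an assumed function name and a hypothetical four-node DAG; it also shows how the same DAG yields different critical children for different node lists.

```python
def critical_child(n, node_list, succ):
    """Return the successor of n that appears first in node_list (None for a sink)."""
    children = set(succ[n])
    return next((m for m in node_list if m in children), None)


# Hypothetical diamond-shaped DAG: a -> b, a -> c, b -> d, c -> d.
succ = {"a": ["b", "c"], "b": ["d"], "c": ["d"], "d": []}
cc_first = critical_child("a", ["a", "b", "c", "d"], succ)   # "b"
cc_second = critical_child("a", ["a", "c", "b", "d"], succ)  # "c"
```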
An unscheduled node with all its predecessors having been scheduled is called a free node. Since it is possible
that $\mathit{cc}(n_i)$ is not a free node during the processor selection for $n_i$, the scheduling of $\mathit{cc}(n_i)$ only takes into account
the critical child's scheduled predecessors in the procedure Select Processor(), as will be shown in
the algorithm of edge scheduling.
Algorithm 2: Select Processor(n_i, P)
Input: A node n_i ∈ V and the set P of all the processors
Output: The best processor p_best for the input node n_i
Find the critical child cc(n_i);
BestFinishTime ← ∞;
for each p ∈ Proc(n_i) do
    FinishTime ← Schedule Node(n_i, p, true);
    MinFinishTime ← ∞;
    if cc(n_i) ≠ null then
        for each p′ ∈ Proc(cc(n_i)) do
            FinishTime ← Schedule Node(cc(n_i), p′, true);
            MinFinishTime ← min{MinFinishTime, FinishTime};
            Unschedule the input edges of cc(n_i);
            Unschedule cc(n_i) from p′;
        end
    else
        MinFinishTime ← FinishTime;
    end
    if MinFinishTime < BestFinishTime then
        BestFinishTime ← MinFinishTime;
        p_best ← p;
    end
    Unschedule the input edges of n_i;
    Unschedule n_i from p;
end
4.2 Node and Edge Scheduling with Communication Delay
Our methods of node and edge scheduling differ from the classic ones by using the As Late As Possible
(ALAP) start time to delay communications. Given the route $R = l_1 \to l_2 \to \ldots \to l_k$ for edge $e_{ij}$, let $e_m$ be the
edge before which $e_{ij}$ is scheduled on link $l_m$. The ALAP of $e_{ij}$ is defined as
$$\mathit{ALAP}(e_{ij}) = \min\{ t_s(e_1, R(e_1)), t_s(e_2, R(e_2)), \ldots, t_s(e_k, R(e_k)), t_s(n_j, \mathit{proc}(n_j)) \} - \frac{d(e_{ij})}{\min_{l \in R} \{b(l)\}}$$
If $e_m$ does not exist, which means $e_{ij}$ is the last edge scheduled on $l_m$, then $t_s(e_m, R(e_m)) = \infty$.
The communication can be delayed by using the ALAP; hence, an idle time interval is enlarged on the link.
The idle time interval between two successive edges $e_{n-1}$ and $e_n$ on link $l$ changes from
$[t_f(e_{n-1}, R(e_{n-1})), t_s(e_n, R(e_n))]$ to $[t_f(e_{n-1}, R(e_{n-1})), \mathit{ALAP}(e_n)]$.
If $e_n$ is the first edge on link $l$, then $t_f(e_{n-1}, R(e_{n-1})) = 0$;
and if $e_{n-1}$ is the last edge on link $l$, then $t_s(e_n, R(e_n)) = \mathit{ALAP}(e_n) = \infty$.
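The ALAP definition can be written as a small function. This is a sketch under stated assumptions: the function name and its argument layout are hypothetical, and the start times of the following edges are assumed to be collected by the caller, with `math.inf` standing for a link on which no later edge exists.

```python
import math


def alap(next_edge_starts, dest_node_start, data, rates):
    """ALAP(e_ij): latest start that delays neither the following edges nor n_j.

    next_edge_starts -- for each link l_m of the route, the start time of the edge
                        scheduled right after e_ij (math.inf if there is none)
    dest_node_start  -- t_s(n_j, proc(n_j)), start time of the destination node
    data             -- d(e_ij), amount of data to transfer
    rates            -- b(l) for each link l of the route
    """
    duration = data / min(rates)  # the slowest link dictates the transfer time
    return min(list(next_edge_starts) + [dest_node_start]) - duration


# Single-link route, no later edge on the link, destination node starts at time 4.
latest = alap([math.inf], 4, data=2, rates=[2])
```

In this example the transfer takes 1 time unit, so the edge may start as late as time 3 without delaying its destination node.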
Figure 4(a) shows the use of ALAP. If $e_{ij}$ is delayed to its ALAP, the idle time interval on $L_1$ between $e_{ab}$ and
$e_{ij}$ will be enlarged and a larger communication can be inserted between $e_{ab}$ and $e_{ij}$.
The method of scheduling a node $n_i$ onto a processor $p$ is given in Algorithm 3. When a node is scheduled,
the ALAPs of its input edges are then calculated (lines 6 to 10 in Algorithm 3). The ALAP of an edge cannot
be calculated during the processor selection. Hence, a Boolean value is used to indicate whether the procedure
Schedule Node() is called from the procedure Select Processor() or not.
Algorithm 4 gives our method for edge scheduling, which is similar to that of the classic heuristic. However,
there are also some improvements: the origin node $n_i$ of $e_{ij}$ is tested because some predecessors of the critical
child may not yet be scheduled; the best route is chosen to give the earliest finish time; and the ALAP is considered
in the edge scheduling condition.
Figure 4(b) gives a DAG example to show the effect of communication delay. Nodes are sorted into a static list
of $n_1, n_2, n_3, n_4, n_5, n_6$ by using the priority of $\mathit{bl}$ & $\mathit{tl}$. Figure 4(c) gives a partial schedule result with $n_1$, $n_2$,
$n_3$, $n_4$ having been scheduled. As to $n_5$, the input edge $e_{1,4}$ for $n_4$ can start at its ALAP of time 3. Hence, the
edge $e_{1,5}$ is inserted between $e_{1,3}$ and $e_{1,4}$ as shown in Figure 4(d), and finally a schedule length of 8 is obtained in
Figure 4(e). If ALAP is not used, another schedule result is obtained in Figure 4(f) with a schedule length of 9.
[Figure 4: Communication delay. (a) ALAP; (b) A DAG example; (c) Partial schedule result; (d) Schedule n5 with communication delay; (e) Schedule result with communication delay (schedule length 8); (f) Schedule result without communication delay (schedule length 9).]
Algorithm 3: Schedule Node(n_i, p, IsTemporary)
Input: n_i ∈ V, a processor p ∈ Proc(n_i) and a Boolean value IsTemporary
Output: The finish time of n_i on p
1   for each n_l ∈ pred(n_i), proc(n_l) ≠ p do
2       Schedule Edge(e_li, p);
3   end
4   Calculate DRT of node n_i;
5   Find the earliest idle time interval for node n_i on processor p respecting the node scheduling condition;
6   if IsTemporary = false then
7       for each n_l ∈ pred(n_i), proc(n_l) ≠ p do
8           Calculate the ALAP of e_li;
9       end
10  end
11  Schedule n_i on p and calculate the finish time;
4.3 Advanced Dynamic List Scheduling
Algorithm 5 shows our advanced dynamic list scheduling heuristic. Here, "dynamic" means that the node list is not
determined before the scheduling but created during the scheduling. Hence, the procedure Sort Nodes() of
the static list scheduling heuristic is no longer necessary. The procedure Choose Node() is used in place of
the procedure Sort Nodes() to choose a node for scheduling. The procedures Select Processor() and
Schedule Node() use the new methods given above.
As used for sorting nodes into static lists, node levels are also effective to create dynamic node lists. In the
dynamic list scheduling, any free node can be scheduled in the next step, but we should choose the most critical
one. Since the length of the longest path passing a free node during the scheduling is crucial to the final schedule
length, the free node in this path, which is the first unscheduled node in this path, must be treated immediately in
order to be executed as early as possible. This node is named the critical node and is obtained by considering the
bottom level as shown in Algorithm 6.
In this algorithm, the bottom level ($\mathit{bl}(n_i)$) is used as the node priority. The bottom level reflects the time
needed from this node to the end of the DAG; our new bottom levels better reflect reality in the case of
communication contention. Hence, $\mathit{bl}(n_i)$ can be replaced by other bottom levels like $\mathit{bl}_{comp}(n_i)$, $\mathit{bl}_{in}(n_i)$,
$\mathit{bl}_{out}(n_i)$ and $\mathit{bl}_{io}(n_i)$. Different bottom levels may give different dynamic node lists and can finally lead to
different schedule results.
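One plausible reading of this node choice is sketched below: among the free nodes, take the one with the largest bottom level. This is an assumption based on the description above, not a transcription of Algorithm 6, and all names and values in it are hypothetical.

```python
def choose_node(unscheduled, pred, bl):
    """Pick the free node with the largest bottom level (assumed reading)."""
    # A node is free when none of its predecessors is still unscheduled.
    free = [n for n in unscheduled if all(k not in unscheduled for k in pred[n])]
    return max(free, key=lambda n: bl[n])


def dynamic_order(nodes, pred, bl):
    """Order in which nodes would be picked if the choice ignored processor state."""
    unscheduled, order = list(nodes), []
    while unscheduled:
        n = choose_node(unscheduled, pred, bl)
        order.append(n)
        unscheduled.remove(n)
    return order


# Hypothetical join DAG with made-up bottom levels.
pred = {"n1": [], "n2": [], "n3": [], "n4": ["n1", "n2", "n3"]}
bl = {"n1": 4, "n2": 5, "n3": 3, "n4": 1}
order = dynamic_order(["n1", "n2", "n3", "n4"], pred, bl)
```

Passing a different bottom level dictionary (for example one computed from $\mathit{bl}_{in}$ or $\mathit{bl}_{io}$) changes only the `bl` argument, which is how the different node priorities would lead to different dynamic node lists.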
Algorithm 4: Schedule Edge(e_ij, p)
Input: e_ij ∈ E and a processor p ∈ Proc(n_j) on which the node n_j is to be scheduled
Output: None
if n_i is scheduled then
    if proc(n_i) ≠ p then
        FinishTime ← ∞;
        for each R ∈ RS(proc(n_i), p) do
            Find the earliest common idle time interval on all the links of R respecting the edge scheduling condition with ALAP;
            if t_f(e_ij, R) < FinishTime then
                FinishTime ← t_f(e_ij, R);
                R_best ← R;
            end
        end
        Schedule e_ij on R_best;
    end
end
Algorithm 5: Dynamic List Scheduling(G, TG)
Input: A DAG G = (V, E, w, c) and a topology graph TG = (N, P, L, b)
Output: A schedule of G on TG
UNS ← V;
while existing nodes in UNS do
    n ← Choose Node(UNS);
    p_best ← Select Processor(n, P);
    Schedule Node(n, p_best, false);
    Remove n from UNS;
end
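The loop structure of Algorithm 5 can be illustrated with a deliberately reduced Python sketch. The assumptions are strong and should be read as such: communication costs and contention are ignored entirely, the Choose Node step is reduced to picking the free node with the largest given priority, and the Select Processor step simply takes the processor offering the earliest start. The function and parameter names are hypothetical.

```python
def dynamic_list_schedule(weights, preds, priority, num_procs):
    """Toy dynamic list scheduling loop (no communication modelled).

    weights:  dict node -> computation time
    preds:    dict node -> list of predecessors (the graph must be a DAG)
    priority: dict node -> priority value, e.g. a bottom level
    """
    unscheduled = set(weights)
    finish, proc_of, order = {}, {}, []
    proc_free = [0] * num_procs  # time at which each processor becomes idle
    while unscheduled:
        # Free nodes: all predecessors already scheduled.
        free = sorted(n for n in unscheduled
                      if all(p in finish for p in preds.get(n, ())))
        n = max(free, key=lambda v: priority[v])           # choose node
        ready = max((finish[p] for p in preds.get(n, ())), default=0)
        best = min(range(num_procs),
                   key=lambda q: max(ready, proc_free[q]))  # select processor
        start = max(ready, proc_free[best])
        finish[n] = start + weights[n]                       # schedule node
        proc_free[best] = finish[n]
        proc_of[n] = best
        unscheduled.remove(n)
        order.append(n)
    return order, proc_of, finish


# Hypothetical three-node DAG: a -> c and b -> c.
order, proc_of, finish = dynamic_list_schedule(
    weights={"a": 2, "b": 3, "c": 1},
    preds={"c": ["a", "b"]},
    priority={"a": 3, "b": 4, "c": 1},
    num_procs=2,
)
print(order, proc_of, max(finish.values()))
```

The node list is not fixed in advance: it emerges from the order in which nodes become free and are chosen, which is the difference from the static variant.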
5 Experimental Results
This section gives experimental results of our proposed list scheduling heuristics compared to the classic one given
in ref. [14]. The architectures in Figures 1(c) and 1(d) are used for the comparisons in subsections 5.1 and 5.2,
respectively.
5.1 Comparison with an Example
The DAG given in Figure 1(a) is used in this section to show the improvement obtained by using the advanced dynamic
heuristic with different node priorities. Table 1 gives all five groups of top levels and bottom levels for
this DAG. The resulting static lists, obtained according to the node sorting rule, are given in Table 2, which also
shows the critical children according to these different static node lists.
Figure 5 gives the schedule result of the classic static list scheduling heuristic with nodes sorted by $bl$ & $tl$.
In this figure, two different symbols for an edge represent the sending and the receiving of this edge, respectively.
The classic heuristic gives a schedule length of 21.
Our advanced dynamic heuristic with different node priorities may give different dynamic node lists and finally
different schedule results. Table 3 shows the dynamic node lists generated with the five node priorities; four
different node lists (from (a) to (d)) are obtained.
The schedule result for the node priority $bl_{comp}$ is shown in Figure 6(a). A schedule length of 18 is obtained
by using 3 processors. The schedule result for the node priority $bl$ is shown in Figure 6(b), and the schedule length
is also 18 with 3 processors. Figure 6(c) shows the schedule result with the node priority $bl_{in}$. The schedule
length is also 18 but with 4 processors. Figure 6(d) gives the schedule result for the same node list obtained by
$bl_{out}$ and
Algorithm 6: Choose Node(UN)
Input: A set UN of all the unscheduled nodes
Output: The critical node n_c among all the unscheduled nodes
Create a set FN of all the free nodes from UN;
MaxLength ← 0;
for each n_i ∈ FN do
    Length ← 0;
    for each n_l ∈ pred(n_i) do
        Length ← max{Length, t_f(n_l, proc(n_l)) + bl(n_i)};
    end
    if MaxLength < Length then
        MaxLength ← Length;
        n_c ← n_i;
    else if MaxLength = Length then
        if bl(n_c) < bl(n_i) then
            n_c ← n_i;
        end
    end
end
Table 2: Different static node lists and corresponding critical children
Node priority     | Static node list                   | Critical child of
                  |                                    | n1 | n2 | n3 | n4 | n5   | n6 | n7 | n8 | n9
bl_comp & tl_comp | n1, n4, n3, n2, n8, n7, n6, n5, n9 | n4 | n7 | n8 | n8 | null | n9 | n9 | n9 | null
bl & tl           | n1, n2, n4, n3, n7, n6, n8, n5, n9 | n2 | n7 | n8 | n8 | null | n9 | n9 | n9 | null
bl_in & tl_in     | n1, n2, n4, n3, n7, n6, n8, n5, n9 | n2 | n7 | n8 | n8 | null | n9 | n9 | n9 | null
bl_out & tl_out   | n1, n2, n4, n3, n7, n8, n6, n5, n9 | n2 | n7 | n8 | n8 | null | n9 | n9 | n9 | null
bl_io & tl_io     | n1, n2, n4, n3, n7, n8, n6, n5, n9 | n2 | n7 | n8 | n8 | null | n9 | n9 | n9 | null
$bl_{io}$. The schedule length is 17 with 4 processors and is better than the three former schedule lengths of 18. All
the schedule results of the advanced dynamic heuristic are better than that of the classic heuristic; sometimes the
number of used processors is also reduced.
5.2 Comparison with Random DAG
Random graphs are commonly used to compare scheduling algorithms in order to get statistical results, which are
more persuasive than the results for a few particular graphs. We implement a graph generator based on SDF3 to
generate random SDF graphs [19], except that the SDF graphs are constrained to be DAGs (no cycles).
A random DAG is constrained in five aspects: (1) the number of nodes, (2) the average in degree, (3) the
average out degree, (4) the random weights of nodes, (5) the random weights of edges. The average in degree and
out degree are assumed to be the same in this paper. The weights of nodes vary randomly from $w_{min}$ to $w_{max}$.
The communication to computation ratio ($CCR$) is used to generate random weights of edges. The $CCR$ is defined
Figure 5: Schedule result of classic heuristic. (Gantt chart over processors P1 to P4 and links L1 to L4, time axis
from 0 to 21; the sending and the receiving of each edge e_ij are drawn with different symbols; schedule length 21.)
Table 3: Different dynamic node lists
Node priority | Dynamic node list                  | No.
bl_comp       | n1, n4, n2, n6, n7, n3, n8, n9, n5 | (a)
bl            | n1, n4, n2, n7, n6, n3, n8, n9, n5 | (b)
bl_in         | n1, n2, n4, n3, n8, n6, n7, n9, n5 | (c)
bl_out        | n1, n2, n4, n3, n8, n7, n6, n9, n5 | (d)
bl_io         | n1, n2, n4, n3, n8, n7, n6, n9, n5 | (d)
Figure 6: Schedule results of advanced dynamic heuristic. (Four Gantt charts over processors P1 to P4 and links L1 to
L4, time axis from 0 to 20; schedule lengths: (a) 18, (b) 18, (c) 18, (d) 17.)
as the average weight of edges divided by the average weight of nodes in this paper, that is,
$$CCR = \frac{\frac{1}{|E|} \sum_{e \in E} c(e)}{\frac{1}{|V|} \sum_{n \in V} w(n)}.$$
The $CCR$'s typical values of 0.1, 1 and 10 represent the low, medium and high communication cases, respectively.
The weights of edges are generated randomly from $w_{min} \times CCR$ to $w_{max} \times CCR$.
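The weight generation described above can be sketched in Python. This is only an illustration: the paper states that weights vary randomly within the given ranges, and the uniform distribution used here is an assumption, as are the function names.

```python
import random


def ccr(node_weights, edge_weights):
    """Average edge weight divided by average node weight."""
    avg_edge = sum(edge_weights.values()) / len(edge_weights)
    avg_node = sum(node_weights.values()) / len(node_weights)
    return avg_edge / avg_node


def random_weights(nodes, edges, w_min, w_max, target_ccr, rng):
    """Draw node weights in [w_min, w_max] and edge weights in
    [w_min * CCR, w_max * CCR]; uniform sampling is an assumption."""
    node_w = {n: rng.uniform(w_min, w_max) for n in nodes}
    edge_w = {e: rng.uniform(w_min * target_ccr, w_max * target_ccr)
              for e in edges}
    return node_w, edge_w


# Hypothetical chain of 100 nodes in the high communication case.
rng = random.Random(0)
nodes = range(100)
edges = [(i, i + 1) for i in range(99)]
node_w, edge_w = random_weights(nodes, edges, 100, 1000, 10, rng)
print(round(ccr(node_w, edge_w), 2))
```

Since both ranges are scaled by the same factor, the realised ratio of a generated graph scatters around the target value rather than matching it exactly.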
The advanced dynamic list scheduling heuristic can use the five groups of node priorities to get different results.
We combine the five groups of node priorities with the advanced dynamic heuristic and choose the best result; the
whole process is called a combined advanced dynamic heuristic. To compare the performance of the combined advanced
dynamic heuristic with that of the classic list scheduling heuristic with the node priority $bl$ & $tl$,
we generate random DAGs as follows: the number of nodes is fixed at 100, the weights of nodes vary randomly
from $w_{min} = 100$ to $w_{max} = 1000$, and according to the average in/out degree and $CCR$, 9 groups of random
DAGs are generated with 1000 samples in each group. Table 4 compares the combined advanced dynamic heuristic
with the classic heuristic. Although the combined advanced dynamic heuristic is worse than the classic heuristic for
most random DAGs in the case of $CCR = 0.1$ (74.6% to 89.0% of the samples), it is better for most random DAGs as the
$CCR$ increases (at least 76.6% of the samples for $CCR = 1$ and at least 94.7% for $CCR = 10$).
Table 4: Comparison of the combined advanced dynamic heuristic with the classic list scheduling heuristic
Average in/out degree | 2     | 2     | 2     | 3     | 3     | 3     | 4     | 4     | 4
CCR                   | 0.1   | 1     | 10    | 0.1   | 1     | 10    | 0.1   | 1     | 10
Better                | 1.2%  | 86.4% | 94.7% | 1.9%  | 78.2% | 95.6% | 2.3%  | 76.6% | 95.3%
Equal                 | 24.2% | 0.9%  | 0.0%  | 13.7% | 0.0%  | 0.0%  | 8.7%  | 0.0%  | 0.0%
Worse                 | 74.6% | 12.7% | 5.3%  | 84.4% | 21.8% | 4.4%  | 89.0% | 23.4% | 4.7%
To illustrate more clearly the performance of the combined advanced dynamic heuristic, we define the acceleration
factor ($Acc$) as $Acc = \frac{sl_{classic}}{sl_{advanced}}$ to show the speed-up of the advanced heuristic. We tested
27 groups of random DAGs, and Figure 7(a) shows the average $Acc$ of the combined advanced dynamic list scheduling
heuristics. It can be seen that their performances are similar and that the schedule results are sped up ($Acc > 1$)
by using the combined advanced heuristic in the cases of $CCR = 1$ and $CCR = 10$. We can see that the average $Acc$
increases when $CCR$ varies from 0.1 to 10. The schedule result can be accelerated by up to 80% when $CCR = 10$.
If the number of nodes is fixed, the average $Acc$ increases as the average in/out degree increases when $CCR = 10$.
The reason for this phenomenon is that the critical child technique helps to select better processors for nodes with
multiple predecessors. The greater the in/out degree is, the better the critical child works. Since the communication
cost is increasing in modern embedded applications like digital communication and video compression, our
method is suitable for scheduling these applications on parallel embedded systems.
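As a small worked illustration of the combined heuristic and of the acceleration factor, the sketch below uses the schedule lengths reported in Section 5.1 (21 for the classic heuristic; 18, 18, 18, 17 and 17 for the five node priorities). The function names and the tally helper are hypothetical and only mirror the Better/Equal/Worse rows of Table 4.

```python
def combined_schedule_length(lengths_by_priority):
    """Combined heuristic: keep the best (shortest) of the candidate results."""
    return min(lengths_by_priority.values())


def acceleration_factor(sl_classic, sl_advanced):
    """Acc = sl_classic / sl_advanced; Acc > 1 means a shorter schedule."""
    return sl_classic / sl_advanced


def tally(pairs):
    """Percentage of (classic, advanced) pairs that are better/equal/worse."""
    total = len(pairs)
    better = sum(1 for c, a in pairs if a < c)
    equal = sum(1 for c, a in pairs if a == c)
    worse = total - better - equal
    return {name: 100.0 * count / total
            for name, count in (("better", better), ("equal", equal),
                                ("worse", worse))}


# Schedule lengths of the example in Section 5.1.
advanced = {"bl_comp": 18, "bl": 18, "bl_in": 18, "bl_out": 17, "bl_io": 17}
best = combined_schedule_length(advanced)
print(best, round(acceleration_factor(21, best), 3))
```

For this example the combined result is 17, so the acceleration factor is 21/17, that is, a speed-up of roughly 24% over the classic schedule.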
Figure 7: Average $Acc$ of the advanced dynamic heuristic and its time complexity. (a) Average $Acc$: average
acceleration factor of the combined advanced dynamic list scheduling for random DAGs, plotted against (number of
nodes; average in/out degree) from (50;2) to (200;4), with one curve each for CCR = 0.1, 1 and 10. (b) Time complexity
with $V$: time (ms) against $V$ from 100 to 500, for P = 4, 8, 12 and 16. (c) Time complexity with $P$: time (ms)
against $P$ from 2 to 16, for V = 100, 300 and 500.
5.3 Time Complexity of the Advanced Dynamic Heuristic
The classic list scheduling heuristic has the time complexity of $O(P E^2 O(routing) + V^2)$, where $P$, $V$ and $E$
are the number of processors, the number of nodes and the number of edges, respectively. $O(routing)$ represents
the maximum number of links on a route and is usually fixed because of the static routing strategy in parallel
embedded systems. The time complexity increases by a factor of $P$ when using the critical child. Hence, the
time complexity of our advanced dynamic heuristic is $O(P (P E^2 O(routing) + V^2))$, while the combination of the
advanced dynamic heuristic with the five node priorities does not increase the degree of the time complexity.
Figures 7(b) and 7(c) show the time consumed to schedule different sizes of DAGs on architectures with different
numbers of processors by our combined advanced dynamic heuristic. All the DAGs have an average in/out degree
of 4; all the processors are connected to the same switch by different communication links. It is shown that the
time increases with the square of $V$ and also with the square of $P$. We ran our heuristic on a Pentium Dual-Core
PC at 2.4 GHz, and it took about 3 minutes to schedule a DAG with 500 nodes on an architecture of 16 processors.
In fact, a complicated embedded application usually has no more than 500 nodes in models of coarse and medium
grain, and $P$ is usually much smaller than $V$ and $E$ in a parallel embedded system. Hence, the increase of time
complexity is reasonable and acceptable for rapid prototyping methodologies.
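The observed quadratic growth in both $V$ and $P$ agrees with the complexity bound under the conditions of this experiment. As a short sketch, assume that $O(routing)$ is a constant (all routes pass through a single switch) and write $d$ for the average in/out degree, so that $E = dV$ because every edge contributes to exactly one in degree; the symbol $d$ is introduced here only for this illustration.

```latex
\begin{aligned}
O\bigl(P\,(P E^2\, O(routing) + V^2)\bigr)
  &= O\bigl(P^2 E^2 + P V^2\bigr)     && \text{(constant } O(routing)\text{)} \\
  &= O\bigl(P^2 d^2 V^2 + P V^2\bigr) && (E = dV) \\
  &= O\bigl(P^2 V^2\bigr)             && \text{(constant } d\text{, here } d = 4\text{)}
\end{aligned}
```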
6 Conclusions
This paper proposes three new groups of node levels (top level and bottom level) and two advanced techniques
(critical child and communication delay) for list scheduling with communication contention. We also give an
advanced dynamic list scheduling heuristic using the new node levels and the two advanced techniques. Our method
is applicable to heterogeneous parallel embedded systems. The new node levels take into account the communication
contention and are used as node priorities to generate different node lists; the critical child technique helps to select
a better processor for a node; and the communication delay technique delays communications when necessary in
order to enlarge idle time intervals on communication links.
The advanced dynamic heuristic can use different node lists to get different scheduling results for a given
DAG. We combine the five groups of node priorities with the advanced dynamic heuristic and choose the best
result; the whole process is a combined advanced dynamic heuristic. To compare with the classic method, we
use homogeneous parallel systems and randomly generated DAGs. Experimental results show that the combined
advanced dynamic heuristic is effective in shortening the schedule length for most of the randomly generated DAGs
in the cases of medium and high communication. Our method accelerates a scheduling result by up to 80% in the
case of high communication and sometimes also reduces the use of hardware resources. Since the communication
cost is increasing from low to medium and even to high in modern digital communication and video compression
applications, our method is well suited to scheduling these applications on parallel embedded systems.
Acknowledgements
This work was supported by the China Scholarship Council. We thank Prof. YIN QinYe of Xi’an Jiaotong University for
valuable suggestions during the writing of this paper.
References
1 Lee E, Parks T. Dataflow process networks. Proceedings of the IEEE, 1995, 83(5): 773–801
2 Sriram S, Bhattacharyya S S. Embedded multiprocessors - scheduling and synchronization. New York, NY, USA: Marcel Dekker, Inc, 2000
3 Sarkar V. Partitioning and scheduling parallel programs for multiprocessors. Cambridge, MA, USA: MIT Press, 1989
4 Garey M R, Johnson D S. Computers and intractability: A guide to the theory of NP-completeness. New York, NY, USA: W H Freeman & Co, 1990
5 Adam T L, Chandy K M, Dickson J R. A comparison of list schedules for parallel processing systems. Commun ACM, 1974, 17(12): 685–690
6 Kasahara H, Narita S. Practical multiprocessor scheduling algorithms for efficient parallel processing. IEEE Trans Comput, 1984, 33(11): 1023–1029
7 Hwang J J, Chow Y C, Anger F D, Lee C Y. Scheduling precedence graphs in systems with interprocessor communication times. SIAM J Comput, 1989, 18(2): 244–257
8 Wu M Y, Gajski D. Hypertool: A programming aid for message-passing systems. IEEE Trans Parallel Distr Syst, 1990, 1(3): 330–343
9 Yang T, Gerasoulis A. DSC: Scheduling parallel tasks on an unbounded number of processors. IEEE Trans Parallel Distr Syst, 1994, 5(9): 951–967
10 Kwok Y K, Ahmad I. Dynamic critical-path scheduling: An effective technique for allocating task graphs onto multiprocessors. IEEE Trans Parallel Distr Syst, 1996, 7(5): 506–521
11 Sih G, Lee E. A compile-time scheduling heuristic for interconnection-constrained heterogeneous processor architectures. IEEE Trans Parallel Distr Syst, 1993, 4: 175–187
12 Kwok Y K, Ahmad I. Bubble scheduling: A quasi dynamic algorithm for static allocation of tasks to parallel architectures. In: Proceedings of the 7th IEEE Symposium on Parallel and Distributed Processing, Washington, DC, USA, 1995
13 Grandpierre T, Lavarenne C, Sorel Y. Optimized rapid prototyping for real-time embedded heterogeneous multiprocessors. In: Proceedings of 7th International Workshop on Hardware/Software Co-Design, Rome, Italy, 1999
14 Sinnen O, Sousa L. Communication contention in task scheduling. IEEE Trans Parallel Distr Syst, 2005, 16(6): 503–515
15 Tang X, Li K, Padua D. Communication contention in APN list scheduling algorithm. Sci China Inf Sci, 2009, 52(1): 59–69
16 Sinnen O. Task scheduling for parallel systems. Hoboken, NJ, USA: John Wiley & Sons, Inc, 2007
17 Kwok Y K, Ahmad I. Static scheduling algorithms for allocating directed task graphs to multiprocessors. ACM Computing Surveys, 1999, 31(4): 406–471
18 Sinnen O, Sousa L. List scheduling: Extension for contention awareness and evaluation of node priorities for heterogeneous cluster architectures. Parallel Computing, 2004, 30(1): 81–101
19 Stuijk S, Geilen M, Basten T. SDF3: SDF for free. In: Proceedings of 6th International Conference on Application of Concurrency to System Design, Los Alamitos, CA, USA, 2006
