Iterative schedule optimisation for voltage scalable distributed embedded systems by Schmitz, Marcus T. et al.
Iterative Schedule Optimisation for Voltage
Scalable Distributed Embedded Systems
MARCUS T. SCHMITZ and BASHIR M. AL-HASHIMI
University of Southampton
and
PETRU ELES
Linköping University
We present an iterative schedule optimisation for multi-rate system specifications, mapped onto
heterogeneous distributed architectures containing dynamic voltage scalable processing elements
(DVS-PEs). To achieve a high degree of energy reduction, we formulate a generalised DVS problem,
taking into account the power variations among the executing tasks. An efficient heuristic is
presented that identifies optimised supply voltages not only by exploiting slack time, but
also by considering the power profiles. In this way, the algorithm effectively minimises the
energy dissipation of heterogeneous architectures, including power managed processing elements.
Further, we address the simultaneous schedule optimisation towards timing behaviour
and DVS utilisation by integrating the proposed DVS heuristic into a genetic list scheduling
approach. We investigate and analyse the possible energy reduction at both steps of the co-synthesis
(voltage scaling and scheduling), including the power variation effects. Extensive experiments
indicate that the presented work produces high-quality solutions.
Categories and Subject Descriptors: C.3 [Special-purpose and application-based systems]:
Real-time and embedded systems; J.6 [Computer-aided engineering]: Computer-aided design
General Terms: Algorithms, Design, Optimization
Additional Key Words and Phrases: Dynamic voltage scaling, Embedded systems, Energy
minimisation, Scheduling, System synthesis, Heterogeneous distributed systems
1. INTRODUCTION AND RELATED WORK
The rapidly growing market segment for embedded computing systems is
driven by the ever-increasing demand for new application-specific devices, which
can be found in almost every application domain, such as consumer electronics,
home appliances, automotive, and avionic devices. To help balance the
production costs with development time and cost, these embedded systems are
commonly composed of several heterogeneous processing elements (PEs), which
are interconnected by communication links (CLs) [Wolf 1994]. For example, very
Authors’ addresses: M. T. Schmitz and B. M. Hashimi, Department of Electronics and Computer
Science, University of Southampton, SO17 1BJ Southampton, UK, email: {ms99r,
bmah}@ecs.soton.ac.uk; P. Eles, Department of Computer and Information Science, Linköping
University, S-581 83 Linköping, Sweden, email: petel@ida.liu.se.
Permission to make digital/hard copy of all or part of this material without fee for personal
or classroom use provided that the copies are not made or distributed for proﬁt or commercial
advantage, the ACM copyright/server notice, the title of the publication, and its date appear, and
notice is given that copying is by permission of the ACM, Inc. To copy otherwise, to republish,
to post on servers, or to redistribute to lists requires prior speciﬁc permission and/or a fee.
© 2003 ACM 0000-0000/2003/0000-0001 $5.00
ACM Journal Name, Vol. V, No. N, May 2003, Pages 1–35.
often the combination of less powerful and cheap PEs leads to a more cost eﬃcient
design implementation than the usage of a powerful single processor system. Typi-
cally, such embedded systems have to concurrently perform a multitude of complex
tasks under a strict timing behaviour, given in the system speciﬁcation.
System-level co-design is a methodology that aims to aid system designers in
solving the difficult problem of finding the "best" implementation for a
system specification. The traditional co-design flow for distributed systems involves
solving three subproblems, namely:
(1) Allocation: determining the numbers and types of PEs and CLs used to com-
pose the system architecture,
(2) Mapping: assignment of computational tasks to PEs and of data transfers
between diﬀerent PEs to CLs,
(3) Scheduling: determining the execution order (sequencing) of tasks mapped to
a PE and communications mapped to a CL.
These problems (allocation/mapping and scheduling) are well known to be NP-complete
[Garey and Johnson 1979], and therefore an optimal co-design of distributed
systems is intractable. This justifies the usage of heuristic optimisation
algorithms of different types, e.g., simulated annealing [Henkel et al. 1993; Eles et al.
1997], tabu search [Eles et al. 1997], genetic algorithms [Dick and Jha 1998; Teich
et al. 1997], or constructive techniques [Wolf 1997], to tackle the computational
complexity.
In the last decade power dissipation has become a mandatory issue of concern
in the design of embedded systems because of: (a) The popularity of mobile appli-
cations powered by batteries with limited capacity, (b) the operational costs and
environmental reasons aﬀected by the high electrical power consumption of large
computing systems, and (c) the reliability and feasibility problems caused by exten-
sive heat production exceeding the physical substrate limitations, especially when
implementing systems on a single chip (SoCs). Several useful techniques have been
proposed to reduce the power dissipation of integrated circuits, targeted at diﬀer-
ent levels of abstraction [Devadas and Malik 1995; Pedram 1996]. One approach
aiming to reduce the power dissipation at the system-level is recently receiving a
lot of attention from the research community and industry, namely, dynamic volt-
age scaling (DVS) [Weiser et al. 1994; Gutnik and Chandrakasan 1997; Hong et al.
1999; Ishihara and Yasuura 1998; Okuma et al. 1999; Quan and Hu 2001; Shin and
Choi 1999; Shin et al. 2000; Simunic et al. 2001]. The main idea behind DVS is
to conjointly scale the supply voltage Vdd and operational frequency f dynamically
during run-time in accordance to the temporal performance requirements of the
application. In this way the dynamic power dissipation Pdyn (disregarding short-circuit
power) is reduced in a near cubic manner, since it depends quadratically on
the supply voltage and linearly on the operational frequency. The exact relation is
expressed by the following two equations,
\[ P_{dyn} = C_L \cdot N_{0 \rightarrow 1} \cdot f \cdot V_{dd}^2 \tag{1} \]
\[ f = k \cdot \frac{(V_{dd} - V_t)^2}{V_{dd}} \tag{2} \]
where CL denotes the load capacitance of the digital circuit, N0→1 represents the
zero to one switching activity, k is a circuit dependent constant, and Vt is the
threshold voltage. DVS is thereby able to exploit the idle and slack times (time
intervals where system components do not carry out any computations), given in the
system schedule, in order to lower the power dissipation. The occurrence of idle and
slack times has three reasons: (a) It is often the case for a given application to show
various degrees of parallelism, i.e., not all PEs will be utilised constantly during
run-time, (b) the performance of the allocated architecture cannot be adapted
perfectly to the application needs, since the allocation of "performance" is not
given as a continuous range, but is rather quantised, and (c) schedules for hard real-time
systems are constructed by considering worst case execution times (WCETs),
however, actual execution times of tasks during operation are, for most of their
activations, smaller than their WCETs. Several state-of-the-art implementations of
DVS enabled processors [Burd et al. 2000; Gutnik and Chandrakasan 1997; Klaiber
2000] have successfully shown that power consumption can be reduced signiﬁcantly
(by up to 10 times compared to ﬁxed voltage approaches) when running real world
applications. In order to achieve such a high level of power and energy eﬃciency,
it is essential to identify optimised scaling voltages for the task executions [Okuma
et al. 2001] to exploit the available idle and slack times eﬃciently. Such voltage
scheduling algorithms can be divided into two broad categories: on-line (dynamic)
[Lee and Sakurai 2000; Quan and Hu 2001; Shin and Choi 1999] and oﬀ-line (static)
approaches [Bambha et al. 2001; Gruian and Kuchcinski 2001; Ishihara and Yasuura
1998]. The ﬁrst class dynamically re-calculates the priorities and scaling voltages
of tasks at run-time, i.e., the voltage schedule is changed during the execution of
the application. Obviously, such approaches consume additional power and time
during execution. On the other hand, they are able to make use of the dynamic
slack introduced by execution times smaller than the WCET. In the second class,
a static voltage schedule is calculated once before the application is executed, i.e.,
the voltage schedule is maintained unchanged during run-time. Hence, power and
time overheads are avoided. The technique proposed in this paper falls into the
class of static voltage schedulers.
Voltage selection is already a complex problem when only single DVS proces-
sor systems, executing independent tasks, are considered [Hong et al. 1999]. The
problem is further complicated in the presence of distributed systems speciﬁed by
dependent tasks where the allocation, mapping, and scheduling inﬂuence the possi-
bility to exploit DVS [Bambha et al. 2001; Gruian 2000; Luo and Jha 2000; Schmitz
and Al-Hashimi 2001]. Most previous DVS approaches [Hong et al. 1999; Lee and
Sakurai 2000; Quan and Hu 2001; 2002; Shin and Choi 1999] concentrate on sin-
gle processor systems executing independent task sets and, hence, are not directly
applicable to the problem addressed here. Nevertheless, we need to consider DVS
at all these optimisation steps during co-synthesis, in order to ﬁnd high quality
system implementations. In this paper, we will concentrate on the scheduling and
voltage scaling aspects of such systems. Further details concerning the mapping
and allocation steps can be found in [Schmitz et al. 2002; Schmitz 2003].
Previous research in system-level co-synthesis is extensive but has mainly focused
on traditional architectures excluding issues related to power consumption [Ernst
et al. 1993; Henkel and Ernst 2001; Micheli and Gupta 1997; Prakash and Parker
1992; Wolf 1997; Xie and Wolf 2001] or considering energy optimisation with com-
ponents that are not DVS enabled [Dick and Jha 1998; Kirovski and Potkonjak
1997]. A system-level scheduling technique for power-aware systems in mission-
critical applications was presented in [Liu et al. 2001]. This approach satisﬁes
min/max timing constraints as well as max power constraints taking into account
not only processor power consumption but additionally the power dissipated by
peripheral system components. All this research provides a valuable basis for the
work presented here. However, three research groups recently proposed approaches
for the voltage scaling problem in distributed systems that are closely related
to the problems we address in this paper. Bambha et al. [Bambha et al. 2001] presented
a hybrid search strategy based on simulated heating. This method uses a global ge-
netic algorithm to ﬁnd appropriate parameter settings for a local search algorithm.
The local search algorithms are based on hill climbing and Monte Carlo techniques.
In [Luo and Jha 2000], a power conscious joint scheduling of aperiodic and pe-
riodic tasks is introduced, which reserves execution slots for aperiodically arriving
tasks within a static schedule of a task graph. Their algorithm aims for energy min-
imisation through DVS by distributing the available deadline slack evenly among
all tasks. They further extend their approach towards a battery-aware scheduling
with the aim to improve the battery discharge proﬁle [Luo and Jha 2001]. Gruian
and Kuchcinski [Gruian and Kuchcinski 2001] extend a dynamic list based schedul-
ing heuristic to support DVS by making the priority function energy aware. In each
scheduling step the energy sensitive task priorities are re-calculated. If a schedul-
ing attempt fails (exceeded hard deadline), the priority function is adjusted and
the application is re-scheduled. Despite their power reduction efficiency, none of these
DVS approaches [Bambha et al. 2001; Gruian and Kuchcinski 2001; Luo and Jha
2000] targets heterogeneous distributed architectures containing
power managed DVS-PEs in which the dissipated power might vary from one task
execution to another. It was shown in [Ishihara and Yasuura 1998] and [Manzak and
Chakrabarti 2000] that the variations in the average switching activity (equivalent
to variations in power) inﬂuence the optimal voltage schedule and hence need to be
considered during the voltage selection. However, neither approach targets
distributed systems with multiple PEs executing tasks with dependencies. Thus,
new system-level co-synthesis approaches for DVS-enabled architectures, which take
into account that power varies among the executed tasks, are needed. Recently,
an approach to solve this problem has been presented in [Zhang et al. 2002]. The
scheduling optimisation towards DVS utilisation in this approach is based on a
constructive technique, as opposed to our iterative scheduling optimisation which
allows a thorough search to ﬁnd schedules of high quality.
In this paper, we formulate a generalised DVS problem that considers power
variation eﬀects and is based on an iterative scheduling optimisation. We assume
that typical embedded architectures employ gate level power reduction techniques,
such as gated clocks, to switch oﬀ un-utilised blocks in the circuit [Devadas and
Malik 1995; Tiwari et al. 1994]. It is therefore necessary to take into account
that power varies considerably among the tasks carried out by the system. This
holds also for DVS-PEs [Burd 2001]. For example, in the case of a general purpose
processor (GPP) including an integer and a ﬂoating point unit, it is not desirable to
keep the ﬂoating point unit active if only integer instructions are executed. Thereby,
diﬀerent tasks (diﬀerent use of instructions) dissipate diﬀerent amounts of power on
the same PE. In the case of an ARM7TDMI processor the current varies between
5.7 and 18.3mA, depending on the functionality which is carried out [Brandolese
et al. 2000]. Taking this into account, the assumptions of Lemma 1 and Lemma
2 in [Ishihara and Yasuura 1998], stating that energy consumption is independent
of the type of operations and input data and depends only on the supply voltage,
have to be rejected. In addition to this, our problem formulation also takes into
consideration the diﬀerent power dissipations among diﬀerent processing elements.
This is important since high power consuming PEs are likely to have a greater
impact on the energy saving (when scaled to lower performance) than low power
consuming PEs.
The aim of this paper is twofold. Firstly, we formulate and examine a
generalised DVS problem which allows power variations, in the following also called
the PV-DVS problem. We introduce a new, generalised DVS heuristic for distributed
systems containing heterogeneous and power managed PEs. Secondly, we illustrate
the incorporation of this scaling technique into a genetic list scheduling approach,
which optimises the system schedule simultaneously towards timing feasibility and
DVS exploitability. This incorporation necessitates a careful adaptation of the
employed list scheduler to ensure its suitability for both optimisation goals. We provide
a detailed analysis of the DVS and scheduling approach, revealing how scheduling
influences the DVS utilisation. This analysis is carried out on several benchmark
examples, both taken from the literature [Bambha et al. 2001; Gruian and Kuchcinski 2001]
and generated for experimental purposes, as well as on a real-life optical flow detection
example.
The remainder of the paper is organised in the following way. In Section 2, we
formulate the system-level synthesis problem and give a general and brief overview
of genetic algorithms, since they are used for the schedule optimisation. Section 3
describes in detail our approach to the system-level scheduling problem for architec-
tures including DVS components. In Section 4 numerous benchmark examples are
evaluated and compared with approaches that neglect power proﬁle information.
Finally, in Section 5 we give some conclusions drawn from the presented work.
2. PROBLEM FORMULATION AND PRELIMINARIES
In this work, we consider that a multi-rate application is specified as a set of
communicating tasks, represented by a task graph GS(T, C). This (hyper) task
graph might be the combination of several smaller task graphs, capturing all task
activations for the hyper-period (LCM of all graph periods). Figure 1(a) shows a
task graph example. Each node τ ∈ T in these acyclic directed graphs represents a
task, an atomic unit of functionality to be executed without preemption. Further,
each task might have a specific hard deadline θ. These deadlines must be met
to fulfil the feasibility requirements of the specified application. In addition, the
task graph has a period p which specifies the maximal allowed time between
two successive invocations of the initial task. The edges γ ∈ C in the task graph
denote precedence constraints and data dependencies between tasks. If two tasks,
τi and τj, are connected by an edge then the execution of task τi must be finished
before task τj can be started. Data dependencies carry a data value, reflecting the
quantity of information to be exchanged by two tasks. A feasible implementation
of an application must respect all timing constraints and precedence requirements
when executed on an underlying architecture. This type of specification model is
most suitable for data flow intensive applications with a repetitive behaviour, as
found in systems for image, speech, and video processing.

[Figure 1: (a) Single task graph example with specified deadlines (tasks τ0 to τ4,
θ3 = 1.5ms, θ4 = 1.6ms, p = 2ms); (b) Architecture containing two DVS-PEs with local
memory linked by a single bus (DVS-PE1 (GPP) with MEM1, DVS-PE2 (DSP) with MEM2,
each attached through a CI to the CL)]
Fig. 1. Specification and Architectural Models
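The specification model described above can be illustrated with a minimal Python sketch. It is not taken from the paper: the class name `TaskGraph`, the functions `hyper_period` and `unroll`, and the use of integer microseconds are assumptions made for the example, as is the rule that the k-th copy of a graph is released at k times its period.

```python
from dataclasses import dataclass, field
from math import lcm  # available from Python 3.9
from typing import Dict, List, Optional, Tuple


@dataclass
class TaskGraph:
    period_us: int                              # period p of the graph
    deadlines_us: Dict[str, Optional[int]]      # task -> hard deadline θ (None if absent)
    edges: List[Tuple[str, str]] = field(default_factory=list)  # (τi, τj): τi precedes τj


def hyper_period(graphs):
    """Hyper-period: least common multiple of all graph periods."""
    return lcm(*(g.period_us for g in graphs))


def unroll(graphs):
    """List every task activation within one hyper-period.

    Assumption: copy number k of a graph is released at time k * period,
    so its deadlines are shifted by that offset.
    """
    hp = hyper_period(graphs)
    activations = []
    for gi, g in enumerate(graphs):
        for copy in range(hp // g.period_us):
            offset = copy * g.period_us
            for task, dl in g.deadlines_us.items():
                abs_dl = None if dl is None else offset + dl
                activations.append((gi, copy, task, abs_dl))
    return hp, activations


# Hypothetical two-graph specification with periods of 2 ms and 3 ms.
g1 = TaskGraph(2000, {"t0": None, "t3": 1500}, [("t0", "t3")])
g2 = TaskGraph(3000, {"u0": 2500})
hp, acts = unroll([g1, g2])
print(hp, len(acts))  # 6000 8
```

With periods of 2 ms and 3 ms the hyper task graph covers 6 ms and contains three copies of the first graph and two of the second.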
The architectures we consider here consist of heterogeneous PEs, like general pur-
pose processors (GPPs), ASIPs, FPGAs, and ASICs. These components include
state-of-the-art DVS-PEs. An infrastructure of communication links, like buses
and point-to-point connections, interconnects these PEs. Processors are capable
of executing software tasks, which are accommodated in local memory, in a sequential
manner, while tasks implemented on FPGAs or ASICs can be performed in
parallel and occupy silicon area. Figure 1(b) shows an example architecture built
out of two DVS-PEs connected by a single bus. Such architectures can be found
in application domains which target multimedia and telecommunication systems.
The architecture is captured using a directed graph GA(P,L) where nodes π ∈ P
represent processing elements and edges λ ∈ L denote communication links.
Each task of the system speciﬁcation might have multiple implementation alter-
natives and can therefore be potentially mapped to several PEs able to execute
this task. If two communicating tasks are accommodated on diﬀerent PEs, πn and
πm with n ≠ m, then the communication takes place over a CL, involving a communication
time overhead. For each possible task mapping certain implementation
properties, such as execution time, dynamic power dissipation, memory, and area
requirements, are given in a technology library. These values are either based on
previous design experience or on estimation and measurement techniques [Bran-
dolese et al. 2000; Fornaciari et al. 1999; Li et al. 1995; Tiwari et al. 1994; Muresan
and Gebotys 2001]. The technology library further includes information about the
available PEs and CLs, such as price, DVS enable ﬂag, etc.
The overall co-synthesis process includes three traditional co-design problems,
namely, allocation, mapping, and scheduling. These optimisation steps determine
[...]
information consists of two main parts, a system speciﬁcation and a component
library. The system speciﬁcation is captured using the directed acyclic graph, as
outlined in Section 2, and includes the performance requirements. The properties
of processing elements and communication links (price, idle power dissipation, etc.)
are collected in the component library, which additionally includes estimated information
(e.g., execution time and dynamic power dissipation) about each task/PE
and communication/CL combination. The input to our co-design approach further
involves a knowledge-based pre-allocation of system components. Using the pre-
sented synthesis approach the designer evaluates the suitability of this allocation.
If an architecture proves to be unsuitable or of low quality the designer modiﬁes the
allocation and re-evaluates the design. The presented co-synthesis system takes this
input information and establishes the necessary data structures. This is followed
by the allocation step (Step 4 in Figure 2) that determines the type and quantity
of PEs and CLs used to compose the architecture. An appropriate allocation min-
imises system cost while providing suﬃcient computational performance. In Step 3,
the task mapping is carried out. This step determines the mapping of tasks to PEs
and uses a GA-based iterative improvement technique. Task mapping optimises
the distribution of tasks towards energy savings, but additionally aims to satisfy
imposed area constraints on hardware components. After a mapping is established,
the next step involves the scheduling (Step 2) of the tasks and communications in
order to meet the hard time constraints of the application and to further minimise
the energy dissipation in the presence of DVS-PEs. This optimisation is based on a
list scheduling heuristic using a GA for the determination of priorities. At the core
of this co-synthesis approach, as shown in Figure 2, is the PV-DVS algorithm (Step
1). In this step the algorithm identiﬁes scaling voltages for the task executions on
DVS-PEs under the consideration of power variations in order to eﬃciently reduce
the energy dissipation of the distributed system. The output of the proposed co-
design ﬂow consists of three results: (a) an allocated architecture, (b) a mapping of
tasks and communications onto that architecture, and (c) a feasible schedule for the
task executions and the communication activities, such that no time constraints are
violated. In addition to these traditional aspects, the proposed technique further
outputs scaling voltages for the tasks executed by DVS-PEs. Note that the architec-
ture, the mapping, and the schedule are optimised for the exploitation of DVS and
therefore diﬀer from the results obtained by traditional co-design approaches [Dick
and Jha 1998; Ernst et al. 1993; Kirovski and Potkonjak 1997; Micheli and Gupta
1997; Prakash and Parker 1992; Wolf 1997]. The outcome, which is of relevance to
the designer/architect, consists of the system cost (system price), the total system
energy dissipation, and the implementation quality (e.g. performance related to
soft deadlines). Using these values, the designer is able to judge the overall quality
of the implementation and can make changes if necessary.
3.1 Generalised DVS Approach for Distributed Systems containing Power Managed PEs
In this section, which is concerned with the identiﬁcation of scaling voltages, we ﬁrst
motivate the consideration of power variation eﬀects using an illustrative example.
This is followed (Subsection 3.1.1) by the formulation of a generalised DVS problem
for distributed systems. In Subsection 3.1.2, we introduce a heuristic algorithm
to solve the formulated problem.
The aim of the generalised DVS approach is to identify scaling voltages under the
consideration of power variation effects. This is done for the scheduled and mapped
system speciﬁcation such that the total dynamic energy dissipation is minimised.
The presented approach assumes that no restrictions are placed on the scaling volt-
ages, i.e., our technique targets variable-voltage systems (nearly continuous range
of possible supply voltages) rather than multi-voltage systems (small and limited
number of potential supply voltages). However, we will explain in Section 3.1.2 how
the obtained scaling voltages can be easily adapted to suit multi-voltage systems.
The term generalised DVS refers to the key observation that the power dissipation
varies considerably depending on the PE types and the instructions executed by the PEs.
This observation is not new and is well known [Burd and Brodersen 1996; Tiwari et al. 1994].
However, unlike previous approaches to DVS for distributed systems [Bambha et al.
2001; Gruian and Kuchcinski 2001; Luo and Jha 2000], the presented technique
takes these power variation effects into account and is sufficiently fast to be used in
the inner optimisation loop of a co-synthesis tool. The following example is used to
motivate the necessity of considering power variations during the voltage selection,
in order to minimise the dynamic energy dissipated by the system. Before we start
with the example, it is necessary to deﬁne the term energy diﬀerence, which will be
used throughout this section.
Definition 1. We define an energy difference ∆Eτ as the difference between the
energy dissipation of task τ with the execution time t and the reduced energy
dissipation (due to voltage and clock scaling) of the same task when extended by a
time quantum ∆t. Formally:
\[ \Delta E_\tau = E_\tau(t) - E_\tau(t + \Delta t) \tag{3} \]
where Eτ(t) and Eτ(t + ∆t) are calculated using Equations (1) and (2).
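Definition 1 can be made concrete with a minimal Python sketch. It assumes that the number of clock cycles and the switching activity of a task stay constant under scaling, so that by Equation (1) its energy is proportional to the square of Vdd, and it inverts Equation (2) to find the voltage that yields a given execution time. The function names and the numerical values in the usage line are illustrative assumptions.

```python
from math import sqrt


def voltage_for_time(t, t_min, v_max, v_t):
    """Supply voltage at which a task with nominal time t_min takes time t (t >= t_min).

    By Equation (2), f is proportional to (V - Vt)^2 / V, and t is proportional to 1/f.
    """
    target = (t_min / t) * (v_max - v_t) ** 2 / v_max  # required value of (V - Vt)^2 / V
    b = 2.0 * v_t + target
    # Larger root of V^2 - b*V + Vt^2 = 0; the smaller root lies below Vt.
    return (b + sqrt(b * b - 4.0 * v_t * v_t)) / 2.0


def energy(t, t_min, p_max, v_max, v_t):
    """Dynamic energy of one task execution stretched to time t.

    With a fixed cycle count, Equation (1) makes the energy scale with Vdd^2.
    """
    v = voltage_for_time(t, t_min, v_max, v_t)
    return p_max * t_min * (v / v_max) ** 2


def energy_difference(t, dt, t_min, p_max, v_max, v_t):
    """Equation (3): energy saved by extending the execution time from t to t + dt."""
    return energy(t, t_min, p_max, v_max, v_t) - energy(t + dt, t_min, p_max, v_max, v_t)


# Illustrative values: a 0.15 ms task dissipating 100 mW at Vmax = 5 V, Vt = 1.2 V,
# extended by a time quantum of 0.01 ms. With mW and ms the result is in µJ.
delta_e = energy_difference(0.15, 0.01, 0.15, 100.0, 5.0, 1.2)
print(round(delta_e, 3))  # 1.13
```

Because the energy curve flattens as the voltage drops, repeated extensions of the same task yield a steadily shrinking energy difference.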
Motivational Example 1: Considering Power Variations during Voltage Scaling
The intention of this illustrative example is to motivate the consideration of
power variation eﬀects during the voltage scaling of heterogeneous distributed sys-
tems. This is done by using two diﬀerent models during the voltage scaling: (a) a
ﬁxed power model which does not allow power variations and (b) a variable power
model which takes power variation into account (as used in the proposed approach).
The starting point for the DVS technique is a system speciﬁcation scheduled (at
nominal voltage) and mapped onto an allocated architecture which includes power
managed DVS components. In this simple example, we consider an architecture
composed of two hypothetical, heterogeneous DVS-PEs connected through a single
bus as illustrated in Figure 1(b). The system is speciﬁed by the task graph shown
in Figure 1(a).
Nominal supply voltage Vmax and threshold voltage Vt for the two PEs are given
in Table I(a). This table further shows the nominal execution times and dynamic
power dissipations of tasks, according to their mapping. Furthermore, the
transfer times and power dissipations of the communication activities are shown in
Table I(b), reflecting the inter PE communications through the bus. Communications
between tasks on the same PE are assumed to be instantaneous, and their
power dissipation is neglected, as in most co-synthesis approaches.

Table I. Execution times and power dissipations for the motivational example

(a) Task execution times and power dissipations at nominal supply voltage

        PE0 (Vmax = 5V, Vt = 1.2V)          PE1 (Vmax = 3.3V, Vt = 0.8V)
task    exe. time (ms)   power dis. (mW)    exe. time (ms)   power dis. (mW)
τ0      0.15             85                 0.70             30
τ1      0.40             90                 0.30             20
τ2      0.10             75                 0.75             15
τ3      0.10             50                 0.15             80
τ4      0.15             100                0.20             60

(b) Communication times and power dissipations of communication activities mapped to the bus

comm.   comm. time (ms)   power dis. (mW)
γ0→1    0.05              5
γ1→2    0.05              5
γ1→3    0.15              5
γ2→4    0.10              5
A possible mapping and scheduling of the system tasks onto the underlying ar-
chitecture is shown in Figure 3, which describes the power dissipation over time,
hence, the power proﬁle of PEs and CLs. It can be observed that PE0 accommo-
dates tasks τ0 and τ4, while the remaining tasks are mapped to PE1. The com-
munication link, connecting both PEs, shows two communications, γ0→1 = (τ0,τ1)
and γ2→4 = (τ2,τ4). The dynamic system energy dissipation of this conﬁguration
at nominal supply voltage can be calculated as 57.75µJ, using the dynamic power
values and execution times given in Tables I(a) and I(b). Obviously, since the
execution of task τ3 ﬁnishes at 1.4ms and the task deadline is at 1.5ms, a slack
of 0.1ms is available, as indicated in Figure 3. The same holds for task τ4, which
ﬁnishes its execution after 1.5ms, leaving a slack of 0.1ms until the deadline is
reached. These slacks can be used to extend the task execution times. Thus, the
DVS-PEs can be slowed down by scaling the supply voltage and accordingly the
clock frequency, following the relation given in Equation (2). Let us consider two
cases for the identiﬁcation of scaling voltages: (a) When a ﬁxed power model is
used (power variations are neglected), i.e., all tasks mapped to the same PE are as-
sumed to consume the same constant amount of power, and (b) a more generalised
and more realistic power model allowing for power variations among the tasks (as
proposed in this work).
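The nominal energy and the slack values quoted above can be checked with a short Python sketch. It is a reconstruction under stated assumptions: times are held as integer microseconds, the communication times are read as milliseconds (this reading reproduces the 57.75µJ figure), the execution order on each PE follows the task indices, and bus contention is ignored. The names `TASKS`, `COMMS`, `ORDER`, and `schedule` are illustrative.

```python
# Power in mW, time in µs, so power * time is in nJ.
TASKS = {  # task: (PE, execution time, power) for the mapping described in the text
    "t0": ("PE0", 150, 85),
    "t4": ("PE0", 150, 100),
    "t1": ("PE1", 300, 20),
    "t2": ("PE1", 750, 15),
    "t3": ("PE1", 150, 80),
}
COMMS = {("t0", "t1"): (50, 5), ("t2", "t4"): (100, 5)}  # bus transfers: (time, power)
EDGES = [("t0", "t1"), ("t1", "t2"), ("t1", "t3"), ("t2", "t4")]
ORDER = ["t0", "t1", "t2", "t3", "t4"]  # assumed execution order (topological)
DEADLINES = {"t3": 1500, "t4": 1600}


def schedule(exec_time):
    """Finish time of every task for the fixed mapping and execution order."""
    finish, pe_free = {}, {}
    for task in ORDER:
        pe = TASKS[task][0]
        ready = pe_free.get(pe, 0)
        for src, dst in EDGES:
            if dst == task:
                comm = COMMS.get((src, dst), (0, 0))[0]  # same-PE transfers take no time
                ready = max(ready, finish[src] + comm)
        finish[task] = ready + exec_time[task]
        pe_free[pe] = finish[task]
    return finish


nominal = {task: values[1] for task, values in TASKS.items()}
finish = schedule(nominal)
slack = {task: DEADLINES[task] - finish[task] for task in DEADLINES}
energy_nj = (sum(t * p for _, t, p in TASKS.values())
             + sum(t * p for t, p in COMMS.values()))
print(finish["t3"], finish["t4"], slack, energy_nj / 1000)
# 1400 1500 {'t3': 100, 't4': 100} 57.75
```

The tasks contribute 57.0µJ and the two bus transfers 0.75µJ, and both deadline tasks end with 0.1ms of slack.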
One approach to optimise the energy dissipation, which neglects the power pro-
[...]
Table II. Energy differences during the execution of the PV-DVS algorithm

            Energy difference ∆E (µJ)
iteration   τ0       τ1       τ2       τ3       τ4
1           0.960    0.234    0.156    0.899    1.130*
2           0.960    0.234    0.156    0.899    0.965*
3           0.960*   0.234    0.156    0.899    0.833
4           0.820    0.234    0.156    0.899*   0.833
5           0.820    0.234    0.156    0.768    0.833*
6           0.820*   0.234    0.156    0.768    0.725
7           0.708    0.234    0.156    0.768*   0.725
8           0.708    0.234    0.156    0.663    0.725*
9           0.708*   0.234    0.156    0.663    0.636
10          0.616    0.234    0.156    0.663*   0.636
11          0.616    0.234    0.156    0.578    0.636*
12          0.616*   0.234    0.156    0.578    0.562
13          0.541    0.234    0.156    0.578*   0.562
14          0.541    0.234    0.156    0.507    0.562*
15          -        -        -        0.507*   -
16          -        -        -        0.451*   -
extension   4        0        0        6        6

the extendable tasks and their potential energy gain. The numbers marked with an asterisk
indicate which task is extended in each iteration, and in this simple example task τ4 is the
ﬁrst task to be extended (iteration 1). Observing iteration 2, the energy diﬀerence
of task τ4 has changed to ∆E4 = 0.965µJ; however, τ4 is still the task which will
gain most from an extension. This iterative extension of tasks is repeated until no
slack is left. The last row of Table II shows how many extensions are distributed
to each task. Accordingly, the new execution times are as follows: t0 = 0.19ms,
t1 = 0.3ms, t2 = 0.75ms, t3 = 0.21ms, and t4 = 0.21ms. These extended execution
times allow the supply voltages to be lowered, which results in the following power
dissipations: P0 = 50.77mW, P1 = 20mW, P2 = 15mW, P3 = 38.74mW, and
P4 = 48.33mW. The total energy dissipation is E = 45.93µJ. This is an
energy reduction of 20.5%, compared to a reduction of 8.2% obtained with an
approach that neglects the power profile.
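The iteration described above can be retraced with a minimal greedy sketch in Python. It is not the PV-DVS algorithm itself, only an illustration consistent with Table II: in each step, the task with the largest energy difference among those that can still be extended without missing a deadline is stretched by one time quantum. The sketch assumes a quantum of 0.01ms (inferred from the extended execution times), communication times read as milliseconds, a fixed execution order, and no bus contention. All identifiers are illustrative.

```python
from math import sqrt

DT = 10  # time quantum in µs (0.01 ms, inferred from the extended execution times)
PES = {"PE0": (5.0, 1.2), "PE1": (3.3, 0.8)}  # PE: (Vmax, Vt) in volts
TASKS = {"t0": ("PE0", 150, 85), "t1": ("PE1", 300, 20), "t2": ("PE1", 750, 15),
         "t3": ("PE1", 150, 80), "t4": ("PE0", 150, 100)}  # (PE, tmin in µs, Pmax in mW)
EDGES = {("t0", "t1"): 50, ("t1", "t2"): 0, ("t1", "t3"): 0, ("t2", "t4"): 100}  # comm µs
ORDER = ["t0", "t1", "t2", "t3", "t4"]  # assumed execution order
DEADLINES = {"t3": 1500, "t4": 1600}
COMM_ENERGY_NJ = 50 * 5 + 100 * 5  # two bus transfers at 5 mW


def energy_nj(task, t):
    """Energy of `task` stretched to t µs, derived from Equations (1) and (2)."""
    pe, t_min, p_max = TASKS[task]
    v_max, v_t = PES[pe]
    b = 2 * v_t + (t_min / t) * (v_max - v_t) ** 2 / v_max
    v = (b + sqrt(b * b - 4 * v_t * v_t)) / 2  # scaled supply voltage
    return p_max * t_min * (v / v_max) ** 2


def feasible(times):
    """Recompute the fixed schedule and check all hard deadlines."""
    finish, pe_free = {}, {}
    for task in ORDER:
        pe = TASKS[task][0]
        ready = max([pe_free.get(pe, 0)] +
                    [finish[s] + c for (s, d), c in EDGES.items() if d == task])
        finish[task] = pe_free[pe] = ready + times[task]
    return all(finish[t] <= dl for t, dl in DEADLINES.items())


def greedy_extend():
    times = {t: v[1] for t, v in TASKS.items()}
    extensions = {t: 0 for t in TASKS}
    while True:
        gains = {}
        for t in TASKS:
            if feasible({**times, t: times[t] + DT}):  # task still has slack
                gains[t] = energy_nj(t, times[t]) - energy_nj(t, times[t] + DT)
        if not gains:
            return times, extensions
        best = max(gains, key=gains.get)  # largest energy difference wins
        times[best] += DT
        extensions[best] += 1


times, extensions = greedy_extend()
total_uj = (sum(energy_nj(t, times[t]) for t in TASKS) + COMM_ENERGY_NJ) / 1000
print(extensions, round(total_uj, 2))
# {'t0': 4, 't1': 0, 't2': 0, 't3': 6, 't4': 6} 45.93
```

Under these assumptions the sketch distributes the extensions as in the last row of Table II and arrives at the same total energy.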
3.1.1 Generalised DVS Problem Formulation. The DVS problem, including power
variation eﬀects, can be stated as follows:
Find for all DVS-PE mapped tasks τ ∈ TDVS of the system speciﬁcation
a single scaling voltage Vdd(τ) (between the threshold voltage Vt and the
nominal supply voltage Vmax) under consideration of individual power
dissipations Pmax(τ) such that the dynamic energy dissipation EΣ is
minimised and no deadline and precedence constraints are violated.
The problem can be mathematically expressed using the following definitions, where
R₀⁺ = {x | x ∈ R, 0 ≤ x < +∞} and R⁺ = R₀⁺ \ {0}:
—GS(T ,C) is the system speciﬁcation graph, where T is the set of tasks and C is
the set of communications, as deﬁned in Section 2
—GA(P,L) is a directed architecture graph, where P is the set of PEs and L is the
set of CLs, as deﬁned in Section 2
—PDVS ⊆ P denotes the set of all DVS-enabled processing elements
—A = T ∪ C defines the set of all activities
—K = P ∪ L defines the set of all allocated components
—TDVS ⊆ T denotes the set of all tasks mapped to DVS-PEs PDVS
—Pmax : T → R⁺ is a function returning the power dissipation of task τ ∈ T executed
at maximal PE supply voltage Vmax
—tmin : T → R⁺ is a function returning the minimal execution time of task τ ∈ T
at maximal PE supply voltage Vmax
—Vt : T → R⁺ is a function returning the threshold voltage of the
PE to which task τ ∈ T is mapped
—Vmax : T → R⁺ is a function returning the maximal supply voltage of the PE
to which task τ ∈ T is mapped
—Td ⊆ T denotes the set of all tasks having a hard deadline
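—texe : A → R⁺ is a function defined by:
$$t_{exe}(a) = \begin{cases} t_{min}(a) \cdot \dfrac{V_{dd}(a)}{(V_{dd}(a) - V_t(a))^2} \cdot \dfrac{(V_{max}(a) - V_t(a))^2}{V_{max}(a)} & \text{if } a \in \mathcal{T} \\ t_C & \text{if } a \in \mathcal{C} \end{cases}$$
where tC is the communication time for the communication activity γ ∈ C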
—td : Td → R₀⁺ is a function returning the deadline of task τ ∈ Td
—Cin : T → 2^C returns the set of all ingoing edges of task τ ∈ T
—tS : A → R₀⁺ is a function which returns the start time of an activity a ∈ A
(i.e., the time when the activity begins execution)
—A : K → 2^A defines a function returning the set of all activities mapped to a
component κ ∈ K
—I = [tS(a), (tS(a) + texe(a))] is the execution interval of activity a ∈ A
—i : A → R₀⁺ × R₀⁺ is a function returning the execution interval of an activity
a ∈ A
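Using these definitions, it is possible to formalise the problem mathematically as the
minimisation of
$$E_\Sigma = \sum_{\tau \in \mathcal{T}_{DVS}} P_{max}(\tau) \cdot t_{min}(\tau) \cdot \frac{V_{dd}^2(\tau)}{V_{max}^2(\tau)}$$
subject to
$$V_t(\tau) < V_{dd}(\tau) \le V_{max}(\tau), \quad \forall \tau \in \mathcal{T}_{DVS}$$
$$t_S(\tau) + t_{exe}(\tau) \le t_d(\tau), \quad \forall \tau \in \mathcal{T}_d$$
$$t_S(\gamma) + t_{exe}(\gamma) \le t_S(\tau), \quad \forall \tau \in \mathcal{T},\ \gamma \in C_{in}(\tau)$$
$$i(n) \cap i(m) = \emptyset, \quad \forall (n, m) \text{ so that } n \in A(\kappa_1),\ m \in A(\kappa_2) \Rightarrow \kappa_1 = \kappa_2$$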
Please note that a single scaling voltage for each task executing on a DVS-PE has
to be calculated for the statically scheduled application. However, in dynamically
scheduled systems the voltage of a single task might not be restricted to one
voltage, in order to dynamically adapt the system performance to the performance
requirements.
Algorithm: PV DVS OPTIMISATION
Input: - task graph GS(T ,C), mapping, schedule, architectural information,
minimum extension time ∆tmin
Output: - energy optimised voltages Vdd(τ), dissipated dynamic energy EΣ
01: Generate MSTG from GS
02: QE ← ∅
03: for all (τ ∈ Td) {∆td(τ) := td(τ) − (tS(τ) + texe(τ))}
04: for all (τ ∈ T ) {calculate tε}
05: for all (τ ∈ T ) {if tε ≥ ∆tmin then QE := QE + τ}
06: ∆t := (min tε) / |QE|; if ∆t < ∆tmin then ∆t := ∆tmin
07: for all (τ ∈ QE) {calculate ∆E(τ)}
08: reorder QE in decreasing order of ∆E
09: while (QE ≠ ∅) {
10:   select first task τ∆Emax ∈ QE
11:   tτ∆Emax := tτ∆Emax + ∆t
12:   update Eτ∆Emax
13:   for all (τ ∈ T ) {update tS, tE and tε}
14:   for all (τ ∈ QE) {if (tε(τ) < ∆tmin) ∨ (Vdd(τ) ≤ Vt(τ))
        then QE := QE − τ}
15:   ∆t := (min tε) / |QE|; if ∆t < ∆tmin then ∆t := ∆tmin
16:   for all (τ ∈ QE) {update ∆E(τ)}
17:   reorder QE in decreasing order of ∆E
18: }
19: delete MSTG
20: return EΣ, and Vdd(τ) for all (τ ∈ T )
Fig. 5. Pseudo code of the proposed heuristic (PV-DVS) for the generalised DVS problem
3.1.2 Generalised DVS algorithm for heterogeneous distributed systems. Having
formalised the problem, described the eﬀects of power variations on the voltage
selection and the necessity for their consideration in a generalised power model,
we introduce next our DVS algorithm. The algorithm, summarised in Figure 5,
is based on a constructive heuristic using the deﬁned energy diﬀerence (Equation
(3)). The starting point of the presented algorithm is a mapped and scheduled task
graph (MSTG), i.e., it is known where and in which order the tasks are executed.
Execution times and power dissipations are part of the architectural information,
which also includes other necessary component properties, like the nominal supply
voltage Vmax, the threshold voltages Vt, etc. The minimal extension time ∆tmin
denotes the minimal time quantum to be distributed in each step of the algorithm.
It is defined in order to speed up the determination of the voltage selection by
preventing insignificantly small extensions that lead to only trivial power reductions.
To allow for a fast and correct extension of task executions, which might inﬂu-
ence other tasks and communications of the system, it is beneﬁcial to capture the
schedule and mapping information into the task graph (line 01 in Figure 5). This
can be performed by generating a mapped and scheduled task graph, which is a
transformed copy of the initial task graph, as shown in Figure 6. The transforma-
the extendable task queue QE, if their available slack tε is smaller than the minimal
extension time ∆tmin, or their scaled supply voltage Vdd is smaller than or equal to the
threshold voltage Vt. Taking into account the tasks in the new extendable queue,
the time quantum ∆t is recalculated (line 15) to enable a potential distribution
of slack to all tasks in the queue. Based on this ∆t value, the energy diﬀerences
∆E are updated (line 16). The priority queue QE is reordered according to the
new energy diﬀerences (line 17). At this point, the algorithm either invokes a new
iteration or ends, based on the state of the extendable task queue. If it terminates,
the scaling voltages for each task execution and the total dynamic energy dissipation
are returned (lines 19 and 20).
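The greedy core of the heuristic can be sketched in Python for a deliberately simplified setting. The sketch assumes that all tasks run back to back on a single DVS-PE and share one deadline, so that every task sees the same slack and no MSTG propagation is needed; it also assumes that the energy difference ∆E is the energy saved by extending a task by ∆t. The function names and the example values are hypothetical, and the code is not the authors' implementation.

```python
import math


def voltage_for_time(t_exe, t_min, v_max, v_t):
    """Invert the delay model: supply voltage at which the task takes t_exe."""
    k = t_min * (v_max - v_t) ** 2 / v_max
    b = 2.0 * t_exe * v_t + k
    # Larger root of t_exe*V^2 - b*V + t_exe*Vt^2 = 0 (the one above Vt).
    return (b + math.sqrt(b * b - 4.0 * (t_exe * v_t) ** 2)) / (2.0 * t_exe)


def energy(t_exe, t_min, p_max, v_max, v_t):
    """Dynamic energy of a task stretched to t_exe: Pmax*tmin*(Vdd/Vmax)^2."""
    v_dd = voltage_for_time(t_exe, t_min, v_max, v_t)
    return p_max * t_min * (v_dd / v_max) ** 2


def greedy_extend(tasks, deadline, dt_min, v_max, v_t):
    """tasks: list of (t_min, p_max). Returns the extended execution times."""
    t = [t_min for t_min, _ in tasks]
    slack = deadline - sum(t)          # shared by all tasks (simplification)
    while slack >= dt_min:
        # Time quantum: slack spread over all tasks, but never below dt_min.
        dt = max(slack / len(tasks), dt_min)

        def gain(i):
            t_min, p_max = tasks[i]
            return (energy(t[i], t_min, p_max, v_max, v_t)
                    - energy(t[i] + dt, t_min, p_max, v_max, v_t))

        # Extend the task that gains most from the extension.
        best = max(range(len(tasks)), key=gain)
        t[best] += dt
        slack -= dt
    return t
```

For two equally long tasks with very different power dissipations, for instance `greedy_extend([(1.0, 100.0), (1.0, 10.0)], 3.0, 0.05, 3.3, 0.8)`, most of the slack goes to the high-power task, which is the behaviour the power-profile-aware heuristic is designed to produce.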
The algorithm, as described above, produces scaling voltages under the assump-
tion that variable-voltage PEs are available that support continuous voltage scaling.
However, it is possible to adapt the generated scaling voltages towards multi-voltage
PEs, which are able to run at a restricted number of predeﬁned voltages. It has
been shown in [Ishihara and Yasuura 1998] that the two discrete supply voltages
Vdis1 and Vdis2 that immediately neighbour the continuously selected voltage Vdd
(Vdis1 < Vdd < Vdis2) are the ones which minimise the energy dissipation, under the
assumption that the time overhead for switching between different voltages can be
neglected. Thus, our approach can be used for voltage selection on multi-voltage PEs.
Given a task τ with execution time texe at the continuously selected voltage Vdd, then,
in order to achieve minimal energy consumption, the same task τ will execute on the
multi-voltage PE for tdis1 time units at the supply voltage Vdis1 and for tdis2 time
units at supply voltage Vdis2, where
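$$t_{exe} = t_{dis1} + t_{dis2} \qquad (7)$$
$$t_{dis1} = t_{exe} \cdot \frac{V_{dis1} \cdot (V_{dd} - V_t)^2}{(V_{dis1} - V_t)^2 \cdot V_{dd}} \cdot \frac{\dfrac{V_{dd}}{(V_{dd} - V_t)^2} - \dfrac{V_{dis2}}{(V_{dis2} - V_t)^2}}{\dfrac{V_{dis1}}{(V_{dis1} - V_t)^2} - \dfrac{V_{dis2}}{(V_{dis2} - V_t)^2}} \qquad (8)$$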
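Equations (7) and (8) can be checked numerically with a short Python sketch. The helper names are assumptions for this example; the delay factor V/(V − Vt)² is the voltage-dependent part of the texe definition used throughout this section.

```python
def delay_factor(v, v_t):
    """Relative cycle time at supply voltage v: V / (V - Vt)^2."""
    return v / (v - v_t) ** 2


def split_execution(t_exe, v_dd, v_dis1, v_dis2, v_t):
    """Return (t_dis1, t_dis2) according to Equations (7) and (8)."""
    if not v_t < v_dis1 < v_dd < v_dis2:
        raise ValueError("expected Vt < Vdis1 < Vdd < Vdis2")
    d = delay_factor(v_dd, v_t)
    d1 = delay_factor(v_dis1, v_t)
    d2 = delay_factor(v_dis2, v_t)
    # Equation (8); d1 / d equals Vdis1*(Vdd-Vt)^2 / ((Vdis1-Vt)^2 * Vdd).
    t_dis1 = t_exe * (d1 / d) * (d - d2) / (d1 - d2)
    # Equation (7): the remaining time is spent at the higher voltage.
    return t_dis1, t_exe - t_dis1
```

For example, `split_execution(10.0, 2.0, 1.7, 2.5, 0.4)` splits a task between 1.7V and 2.5V such that the total execution time and the number of executed clock cycles are the same as at the continuous voltage of 2.0V.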
Complexity Analysis. The complexity of the proposed PV-DVS algorithm can be
calculated as follows: The while loop (line 09) is executed in the worst case n · m
times, where n = |T | is the number of nodes in the graph, since all tasks might be
extendable. However, depending on ∆tmin and ∆t, tasks might be extended more
than once, and m, for the worst case, is the maximum number of such extensions.
The inner part of the while loop shows the following complexities: The propagation
of extensions takes n + c in the worst case (c = |C| is the number of edges in the
graph), since all nodes and edges might have to be visited by the breadth-ﬁrst search
(line 13). Removing inextensible tasks might, again, take n steps. Determining
the new extension time ∆t takes at most n steps. Finally, updating
the extendable queue takes n operations (the queue is implemented as a Fibonacci
heap). All other calculations inside the while loop are executed in constant time.
Therefore, the ﬁnal time complexity of the proposed PV-DVS algorithm is given
as O(n · m(4n + c)). Note that the extendable task queue QE is progressively
reduced from length n to zero. The reduction is not uniform since it might occur
that suddenly (at the same time) many tasks become inextensible and are excluded
from the queue. This, additionally, indicates that the complexity is valid for the
worst case. 
[Figure 7: task graph with tasks τ0 to τ6; deadlines θ6 = 1.4ms and θ4,5 = 1.6ms]
Fig. 7. Second task graph example
3.2 DVS optimised Scheduling
This section is concerned with the scheduling problem for heterogeneous distributed
systems containing power managed DVS-PEs. In Section 3.1, we have shown that
our generalised DVS algorithm is able to further improve the scaling voltages for the
already scheduled tasks, which are mapped to DVS-PEs. However, as mentioned
in Section 2, the task scheduling greatly inﬂuences how eﬃciently DVS can be
exploited. Simply put, the more slack is available in the schedule, the higher the
energy savings achieved by exploiting DVS will be. However, this relation becomes more
complex and does not always hold for distributed systems under the proposed
generalised power model (which considers the power profiles), as opposed to a
fixed power model. Under the generalised model, the available slack for high energy
dissipating tasks should be considered more important than the slack of tasks
consuming a minor amount of power.
Motivational Example 2: Energy Conscious Scheduling
The purpose of this motivational example is to illustrate the importance of taking the
PE power profile, i.e., the power dissipations of the different DVS-PEs, into account
when scheduling tasks and communications in the presence of DVS-PEs, in order to make
DVS conscious scheduling decisions.
The speciﬁcation task graph shown in Figure 7 is mapped to an architecture
consisting of three heterogeneous and power managed DVS-PEs, linked through
a single bus. Table IV(a) gives the execution time, power dissipation, and the
mapping of each task. Additionally, the values for the nominal supply voltage Vmax
and the threshold voltage Vt of each PE are given in Table IV(b). For the sake of
simplicity, in this example, the communications are considered to be instantaneous.
Figure 8(a) shows a feasible schedule for the mapped tasks, executing at nominal
supply voltage. This schedule results in an energy dissipation of 71µJ, according
to the values given in Table IV(a). It can be observed that task τ6 has a deadline at
1.4ms but it ﬁnishes its execution after 1.0ms, which results in an available deadline
slack of 0.4ms. This slack time can be used to extend the tasks and hence reduce
the supply voltage of the PE during the task execution. However, τ3 and τ6 are the
only extendable tasks, and any other extension of the remaining tasks cannot be
tolerated, since task τ5 ﬁnishes execution just on deadline and the tasks τ0, τ1, and
τ2 inﬂuence the start and end time of task τ5. Therefore, an optimal DVS schedule
—The objective, which needs to be optimised, can be based on an arbitrarily complex
function.
—The enlarged search space (at most (|T | + |C|)! diﬀerent schedules can be pro-
duced) provides the opportunity to ﬁnd solutions of potentially higher quality.
—There is a large freedom to trade-oﬀ between acceptable synthesis time and so-
lution quality, as opposed to constructive techniques where only one solution is
produced.
—GAs with parallel populations and migration scheme provide a powerful approach
to leverage additional computational power of computer clusters, which are be-
coming more and more commonplace.
—Multi-objective optimisation is an important feature which is supported by ge-
netic algorithms. It provides the opportunity to simultaneously optimise the im-
plementation towards competing goals and allows the system designer to choose
among several suitable implementations with diﬀerent properties.
A detailed functional description of genetic list scheduling approaches can be found
in [Dhodhi et al. 1995; Grajcar 1999]. Nevertheless, our implementation varies in
two fundamental issues from this previous research:
—Instead of optimising the schedule solely for timing behaviour (reducing the
makespan¹), we additionally consider the issue of energy minimisation with respect
to DVS.
—The algorithms described in [Dhodhi et al. 1995] and [Grajcar 1999] employ a
list scheduler which determines not only the execution order of tasks but also
their mapping. We avoid this combination because of the greediness problems
described in [Kalavade 1995] which might lead to infeasible mappings due to
exceeded area constraints (memory and gates) of pre-allocated hardware compo-
nents. A list scheduling, including the mapping step, serially traverses all nodes
of the task graphs and maps them to allocated components based on local deci-
sions taken in each step. This might lead to low quality solutions, as opposed to
approaches in which mapping is decided in an external loop, based on iterative
improvement techniques. Another problem, which occurs when determining the
mapping during the list scheduling processes, is that the execution times and
power dissipations of the mapped tasks are inﬂuenced by the voltage scaling.
Therefore, the mapping decisions based on these values might prove to be wrong.
For example, mapping a task to a low power consuming ASIC might involve an
expensive development of hardware, while the mapping of the same task onto a
DVS-enabled ASIP might prove satisfactory when the task execution is scaled.
¹ Makespan is the duration from the start of the first task until the last task finishes execution.
[Figure 10: a task graph whose task priorities (Pr) are collected, one entry per task, in a priority string]
Fig. 10. Task priority encoding into a priority string
List scheduling algorithms make scheduling decisions based on task priorities and
determine static schedules. Unlike constructive list scheduling techniques that use
a sophisticated algorithm for the priority assignment, genetic list scheduling
techniques construct and evaluate many different schedules during an iterative priority
optimisation process. By encoding the task priorities into a priority string, it
becomes possible to utilise genetic operators (crossover and mutation) to change task
priorities and hence generate new scheduling solutions using static list scheduling.
Figure 10 shows the encoding and the relations between priority string and tasks.
To preserve some string locality, important for an eﬃcient search when using GAs
[Goldberg 1989], the priorities are ordered in the same way as visited by a breadth-
ﬁrst search. Now we give an overview of our DVS optimised genetic list scheduling
algorithm, as shown in the optimisation Step 2 of Figure 2. The solution pool (25
individuals) of the ﬁrst generation is initialised half by mobility-based [Wu and
Gajski 1990] and half by randomly generated priorities (with values between the
lowest and highest mobility), respectively. This initial population was empirically
found to be a good starting point, leading to fast convergence. The algorithm then
enters the main schedule optimisation loop, which is repeated until no improve-
ment of at least 1% (with respect to the best found feasible schedule) is made for
10 generations. Each iteration of the loop goes successively through the following
steps: All new priority candidate strings in the solution pool are used by the list
scheduling algorithm to generate schedules at nominal supply voltage. Our implemented
list scheduler relies solely on the task priorities to make scheduling decisions,
i.e., no other techniques, such as hole filling, are used to optimise the schedule.
Although such techniques can improve the timing behaviour by eliminating idle
periods in the schedule, we refrain from using them since the DVS technique exploits
exactly these idle times. The algorithm proceeds by passing the built schedules to
the previously presented PV-DVS algorithm (Section 3.1.2), which identiﬁes scaling
voltages that minimise the energy dissipation. Note that schedules which exceed
hard deadline constraints are still scaled as much as possible and are not excluded
from the optimisation, since good solutions are likely to be found as result of trans-
formations performed on invalid conﬁgurations. However, a violation penalty is
applied in such cases, as explained next. The scaled schedule is evaluated in terms
of deadline violations and energy dissipation including DVS reductions. Based on
this evaluation, the ﬁtness FS of each schedule candidate is calculated using the
following equation:
$$F_S = \Bigg( \underbrace{\sum_{\tau \in \mathcal{T}} P(\tau) \cdot t_{exe}(\tau)}_{\text{task energy}} + \underbrace{\sum_{\gamma \in \mathcal{C}} P(\gamma) \cdot t_{exe}(\gamma)}_{\text{comm. energy}} \Bigg) \cdot \underbrace{\Bigg( 1 + \frac{\sum_{\tau \in \mathcal{T}_d} DV_\tau^2}{T_{HP}^2} \Bigg)}_{\text{time penalty}}, \qquad (9)$$
$$DV_\tau = \max\big(0,\ (t_S(\tau) + t_{exe}(\tau)) - t_d(\tau)\big)$$
where P and texe denote power dissipation and execution time of task τ or com-
munication activity γ, summed to calculate the total dynamic energy dissipation
which needs to be minimised. Note that the power dissipations and the execution
times of the tasks depend on the found scaling voltages Vdd. In order to assign
a deadline violation penalty, the energy value is multiplied with a penalty factor
based on the sum of the squared deadline violations. THP is the hyper task graph
period (least common multiple of all task graph periods) used to normalise the
deadline violation. Squaring has been applied in order to apply a higher penalty
to larger violations of imposed deadlines. By guiding the optimisation with this
ﬁtness function, the search for schedules is pushed into regions where low energy
and feasible schedules are likely to be found. The algorithm then checks the halting
criterion as mentioned above. If the end of the optimisation has not been reached
the algorithm continues, and the new priority candidates are ranked and inserted
into the solution pool based on their ﬁtness values. Low ranked individuals of
the pool are replaced by new ones, which are generated through genetic crossover
and mutation. We use a steady state GA, due to its performance advantage com-
pared to generational GAs as indicated in [Rogers and Pr¨ ugel-Bennett 1999], with
a generation gap of 50%, i.e., half of the individuals in the solution pool survive
unchanged in each generation. The crossover is carried out by means of a random
two point crossover. To avoid a premature convergence towards suboptimal sched-
ules we leverage the idea of a dynamic mutation probability [Fogarty 1989]. This
approach gives the algorithm the additional capability to easily escape local min-
ima in the beginning of the optimisation run. The mutation probability follows the
equation 1/exp(NS ·0.05) and is never allowed to drop below 15%. NS denotes the
current generation during the schedule optimisation. At this point, the next itera-
tion is invoked and so diﬀerent schedules are tried out. The experimental results,
given later in Section 4.2, indicate the advantages of our approach in optimising
the schedule towards DVS usability when compared to conventional constructive
list scheduling approaches.
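The fitness evaluation of Equation (9) and the dynamic mutation probability described above can be sketched in a few lines of Python. The function names and the argument layout (dictionaries keyed by task name) are assumptions for this illustration; the energy values are expected to be computed beforehand from the scaled schedule.

```python
import math


def fitness(task_energy, comm_energy, finish_times, deadlines, t_hp):
    """Equation (9): total dynamic energy times the deadline violation penalty.

    finish_times: dict task -> tS + texe of the scaled schedule
    deadlines:    dict task -> hard deadline td (only tasks in Td)
    t_hp:         hyper task graph period used for normalisation
    """
    # DV_tau = max(0, finish - deadline), squared and summed over Td.
    violation_sq = sum(max(0.0, finish_times[task] - t_d) ** 2
                       for task, t_d in deadlines.items())
    return (task_energy + comm_energy) * (1.0 + violation_sq / t_hp ** 2)


def mutation_probability(generation):
    """1 / exp(0.05 * N_S), never allowed to drop below 15%."""
    return max(0.15, 1.0 / math.exp(0.05 * generation))
```

A feasible schedule keeps a penalty factor of exactly 1, so its fitness equals its dynamic energy, whereas each violated deadline inflates the fitness quadratically with the size of the violation. The mutation probability starts at 1 in generation 0 and reaches its floor of 15% after roughly 38 generations.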
4. SYNTHESIS EXPERIMENTS
To demonstrate the eﬃciency and the applicability of the proposed generalised
DVS synthesis technique in reducing the energy dissipation of heterogeneous dis-
tributed systems containing power managed PEs, we have carried out numerous
experiments and comparisons with power neglecting approaches. The PV-DVS and
scheduling algorithms, as outlined in the previous section, have been implemented on
a Pentium-III/750MHz Linux PC with 128MB RAM. We have used 68 experimental
benchmark examples, partially taken from previously published literature [Gruian
2000; Bambha et al. 2001; Hou and Wolf 1996] and partially generated using TGFF [Dick
et al. 1998], to cover a wide spectrum of application diversity. To demonstrate the
real-world applicability of the presented work, we carried out an additional set of
experiments on an optical flow detection real-life example. The complexity of the
used task graph examples varies between 8 and 100 tasks and between 7 and 151 edges.
The number of PEs and CLs in the component libraries varies between 4 and 16. These
benchmarks are grouped into five major sets:
(1) Our TGFF generated task graphs (tgff1-tgff25) consist of 8 to 100 tasks
and are mapped to heterogeneous architectures containing power managed DVS-PEs
and non-DVS enabled PEs. Therefore, these examples show various power
characteristics and component properties. The variations in power are up to 2.6
times on the same PEs. The examples tgff4_t and tgff4_fixed are identical
to tgff4 with slight modifications; tgff4_t denotes a task graph alternative
with a critical tight deadline, while tgff4_fixed uses only DVS-PEs with a
fixed power dissipation.
(2) The examples of Hou et al. [Hou and Wolf 1996] are hypothetical task graphs.
Hou clustered represents the same functionality as Hou, but the task graph
is collapsed from 20 to 8 tasks. Since the initial technology library does not
contain any DVS-enabled PEs, we extended the given PEs to DVS-PEs with
Vt = 0.8V and Vmax = 3.3V. These examples also show different power
dissipations (power variations) among the tasks.
(3) Gruian’s and Kuchcinski’s graphs [Gruian and Kuchcinski 2001], used in our
experiments, represent two sets (TG1 and TG2) of 30 randomly generated com-
municating tasks with tight deadlines (determined by a critical path scheduling
algorithm). These graphs show a high degree of parallelism and are mapped
to architectures built of 3 or 10 identical DVS-PEs, assuming constant power
consumption. These PEs are multi-voltage processors able to run at 3.3V, 2.5V,
1.7V, and 0.9V, while the threshold voltage Vt is 0.4V.
(4) The applications used by Bambha et al. [Bambha et al. 2001] consist of two
diﬀerently implemented Fast Fourier Transforms (fft1 and fft3), a Karplus-
Strong music synthesis algorithm (Karp10), a quadrature mirror ﬁlter bank
(qmf4), and a measurement application (meas). These benchmarks are small
real-life examples and use architectures composed of 2 to 6 identical DVS-PEs,
assuming constant power consumption. Supply voltages are between 0.8 and 7
volts. The throughput constraints and initial average power consumptions are
calculated at a reference voltage of 5 volts.
(5) The final benchmark represents a real-life example, consisting of 32 tasks. It is
a traﬃc monitoring system based on an optical ﬂow detection (OFD) algorithm.
This application is a sub-system of an autonomous model helicopter [WITAS ;
Gruian and Kuchcinski 2001].
In our experiments, we assume that computation and voltage scaling can be carried
out concurrently, as is the case of the processor introduced in [Burd 2001]. Further,
we neglected the time overhead needed by the processor to switch between two
supply voltages (for real-life DVS processors this is in the range of 10–70µs for a
full transition from the highest to the lowest supply voltage and vice versa [Burd
2001]), since the used tasks are considered to be of coarse granularity (in the range
of 1–100ms). Therefore, the switching overhead can be considered to be only a small
fraction of the total task execution time. However, in the case of ﬁne grained tasks
this overhead might inﬂuence the voltage selection and should then be considered.
All results presented here, except the deterministic ones given in Section 4.1, were
obtained by running the optimisation process ten times and averaging the outcomes.
Table IV. Comparison of the presented PV-DVS optimisation with the EVEN-DVS approach
(scheduling and mapping are fixed)
              No. of     NO-DVS     EVEN-DVS              Presented Approach
Example       tasks /    Energy     Energy        Reduc.  Energy        Reduc.
              edges      Dissip.    Dissip.       (%)     Dissip.       (%)
tgff1*        8 / 9      355        193.49        45.50   112.87        68.21
tgff2         26 / 43    743224     722412.15     2.80    683954.54     7.97
tgff3         40 / 77    554779     410653.67     25.98   267651.03     51.76
tgff4         20 / 33    431631     402904.08     6.66    375914.03     12.91
tgff4_t       20 / 33    431631     412854.36     4.35    397201.93     7.98
tgff4_fixed   20 / 33    176723     142986.91     19.09   124905.61     29.32
tgff5         40 / 77    4187382    3963647.60    5.34    3767450.25    10.03
tgff6         20 / 26    1419124    1401605.68    1.23    1396445.06    1.60
tgff7         20 / 27    2548751    2289878.34    10.16   1951579.52    23.43
tgff8         18 / 26    1913519    1774151.42    7.28    1668485.33    12.81
tgff9*        16 / 15    996590     974159.01     2.25    918048.34     7.88
tgff10        16 / 21    69352      51263.60      26.08   46483.97      32.97
tgff11        30 / 29    4349627    4293736.56    1.28    4263279.98    1.99
tgff12        36 / 50    2316431    2243710.55    3.14    2212111.25    4.50
tgff13        37 / 36    2912660    2425431.77    16.73   2333338.86    19.89
tgff14        24 / 33    15532      13546.62      12.78   12479.41      19.65
tgff15        40 / 63    62607      62078.93      0.84    60334.62      3.63
tgff16        31 / 56    3494478    2913341.14    16.63   2518711.99    27.92
tgff17        29 / 56    23459      20396.41      13.06   18334.01      21.85
tgff18        12 / 15    1851688    1851687.99    0.00    1526059.97    17.59
tgff19        14 / 19    5939       4713.59       20.63   4395.37       25.99
tgff20*       19 / 25    77673      48334.30      37.77   40280.98      48.14
tgff21        70 / 99    3177705    3175497.22    0.07    2658534.22    16.34
tgff22        100 / 135  5821498    5036657.40    13.48   4445545.63    23.64
tgff23*       84 / 151   11567283   10791880.89   6.70    10133912.03   12.39
tgff24        80 / 112   5352217    5349024.86    0.06    5238478.58    2.13
tgff25        49 / 92    5735038    5648816.00    1.50    5502681.64    4.05
Hou*          20 / 29    13712      10337.05      24.61   7474.55       45.49
Hou clust.*   8 / 7      14546      11543.35      20.64   10270.32      29.39
* Components used for these examples consist of DVS-PEs only
4.1 Performance of the Generalised DVS Algorithm
To demonstrate the inﬂuence of power variations on the eﬃciency of DVS, we
compare our approach, which takes the power proﬁle into account, with a power
neglecting approach. This power neglecting approach (in the following referred to
as EVEN-DVS) is based on the idea of distributing the available slack time evenly among
the processing elements, somewhat similar to the voltage scaling idea used in [Luo
and Jha 2000]. However, since the mapping and scheduling approach proposed in
[Luo and Jha 2000] also targets additional, different objectives, a direct comparison
is not valid.
Table V. PV-DVS results using the benchmark set of Bambha et al.
           No. of    NO-DVS    Proposed Approach
Example    Nodes/    Energy    Energy    CPU        Reduction
           Edges     Dissip.   Dissip.   time (s)   (%)
fft1       28/32     29600     18172     0.21       38.61
fft3       28/32     48000     36890     0.14       23.15
karp10     21/20     59400     44038     0.12       25.86
meas       12/12     28300     25973     0.11       8.22
qmf4       14/21     16000     12762     0.11       20.24
Table IV shows a comparison between the EVEN-DVS and the proposed DVS
approach. In order to judge the complexity of the individual benchmark examples,
the table gives the number of nodes and edges in the task graphs. The comparison
between the two DVS approaches is carried out with respect to the energy
dissipation when no DVS is employed (see Column NO-DVS). Consider, for example,
benchmark tgff17, which consists of 29 tasks and 56 communications between
tasks. The unscaled execution (NO-DVS) of the application dissipates an energy of
23459. Using an even distribution of slack time (EVEN-DVS), this energy dissipation
can be reduced to 20396, a reduction of 13.1%. However, using the proposed
generalised DVS algorithm, the dissipated energy is further reduced to 18334,
a reduction of 21.8% compared to NO-DVS.
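The reductions quoted for tgff17 follow directly from the energy columns of Table IV. A minimal Python check (the helper name `reduction_percent` is chosen for this example and is not part of the paper):

```python
def reduction_percent(e_nominal, e_scaled):
    """Energy reduction relative to the unscaled (NO-DVS) dissipation, in percent."""
    return 100.0 * (e_nominal - e_scaled) / e_nominal


# Energy values for tgff17 as listed in Table IV.
even_dvs = reduction_percent(23459, 20396.41)
pv_dvs = reduction_percent(23459, 18334.01)
print(f"EVEN-DVS: {even_dvs:.2f}%  PV-DVS: {pv_dvs:.2f}%")
```

The same helper reproduces the other reduction columns of Tables IV and V from their energy columns.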
For all examples shown in Table IV it is assumed that the mapping and scheduling
have been pre-determined, using a fixed mapping and a schedule generated by
a mobility based list scheduling. Thus, the energy reductions are solely achieved
through voltage scaling. As expected, both the EVEN-DVS and the presented scaling
technique reduced the energy dissipation of the systems in all cases (see the two
reduction columns), except for tgff18, where the even distribution of slack could not
achieve any improvement. It can be observed that the proposed DVS heuristic was able to
further improve the energy dissipation of all examples, when compared to EVEN-DVS.
Even in the case of tgff18 a reduction of 17.59% could be achieved. The higher energy
reductions of the proposed DVS algorithm are due to two facts.
Firstly, our particular implementation of EVEN-DVS distributes slack evenly among
the PEs, so slack time is also allocated on non-DVS-PEs. These times, of course,
cannot be exploited to lower the power consumption. Secondly, the proposed DVS
technique considers the power profile information during the voltage scaling. This
leads to better energy reductions (see Motivational Example 1). To distinguish
between both effects, we have indicated in Table IV the architectures which consist of
DVS-PEs only. In these examples, the higher energy reduction is solely achieved by
taking the power profile into account. The remaining examples achieve the increased
energy efficiency due to both effects.
We have further conducted experiments with
the benchmark set used by Bambha et al. [Bambha et al. 2001]. Since they use a
diﬀerent communication model (contention, requests for the bus, etc.), we had to
re-calculate the throughput constraints. Therefore, a direct comparison between
the results reported in [Bambha et al. 2001] and the results presented here is not
possible. Nevertheless, the re-calculation of the throughput was carried out for the
same task mapping and execution order as in [Bambha et al. 2001], which is based
on a dynamic level scheduling approach [Sih and Lee 1993]. The results of these
ﬁve examples, scaled by our PV-DVS method, are given in Table V. It can be
observed that in all cases the energy was reduced by 8.22 to 38.61%. Further, the
highly serialised structure of meas allowed us to calculate the theoretically optimal
voltage schedule for this example. Using these optimal supply voltages results in
a 13% energy saving. Our PV-DVS algorithm achieved for this example a reduction
of 8.22%, which is only 4.78 percentage points short of the theoretically optimal solution.
To give insight into the dependencies between the computational effort, solution
quality, and the minimum extension time ∆tmin (see also the complexity analysis
in Section 3.1.2), we have conducted two experiments. In order to achieve accurate
results, especially for the time measurement, the experiments are carried out using
two large task graphs with 80 (tgff23) and 400 (large1) tasks. Figure 11(a)
illustrates the dependency between the minimum extension time ∆tmin and the solution
quality (given as reduced energy E over nominal energy E0). It can be observed
that no energy reduction can be achieved until ∆tmin is smaller than the largest
slack available in the task schedule (2364 for tgff23 and 29581 for large1, see
Figure 11(a)). Certainly, if the algorithm must distribute time quanta bigger than any
slack, it cannot perform any voltage reduction, and therefore E/E0 = 1. At the
same time, it is not desirable to decrease the minimum extension time too much,
since the additional reductions become insignificant (the curves level out) and
only increase the computational time of the optimisation. This can be seen in
Figure 11(b), which gives the dependency between the minimum extension time ∆tmin and
the execution time of the DVS algorithm. It is therefore important to find a good
value for ∆tmin, which trades off solution quality against optimisation time. In our
experiments we use the following heuristic approach to find an appropriate ∆tmin
setting for each solution candidate. It is based on the observation that for all used
benchmarks the characteristics shown in Figure 11(a) and Figure 11(b) hold.
[Figure 11: two semi-logarithmic plots over the minimum extension time ∆tmin (1e−06 to 1e+06) for tgff23 and large1. (a) Energy reduction quality (E/E0) dependent on minimum extension time ∆tmin, with the interpolation and the estimated and found ∆tmin marked. (b) Optimisation time (s) dependent on minimum extension time ∆tmin.]
Fig. 11. The influence of ∆tmin on energy reduction and optimisation time
With reference to Figure 11(a), we interpolate the nearly linear (in a semi-logarithmic
scale) energy drop, after decreasing ∆tmin below the highest DVS slack,
using the logarithmic function,
$$y = \alpha \cdot \log x + \beta \qquad (10)$$
where the constants α and β are calculated using two initial points in the quasi
linear part of the graph. The first point corresponds to the highest available slack
$s_h$ on any of the DVS-PEs, hence, it matches the nominal energy dissipation. This
point can be found in linear time. To establish the second point, needed for the
interpolation, the DVS algorithm is run with a $\Delta t^*_{\min}$ three times smaller than
the highest DVS slack to find its corresponding reduced energy dissipation $E^*$. For
all used examples this was still in the steeply dropping part of the graph. Using
these points, the constants α and β are given by
$$\alpha = \frac{1 - E^*}{\log(s_h) - \log(\Delta t^*_{\min})}, \qquad \beta = E^* - \alpha \cdot \log(\Delta t^*_{\min})$$
Such a linear interpolation is shown in Figure 11(a) for the large1 example. Of
course, finding the second point has a computational overhead; however, as can
be seen from Figure 11(a) and Figure 11(b), this "investment" pays off when compared
to a wrong choice of ∆tmin, which could result in a much higher computational
time or a much higher energy consumption. The next step towards a good value for
∆tmin is to find a "rough" estimation for the achievable energy reduction. We
calculate the estimation for the scaled energy consumption based on the average
power dissipation on each DVS-PE and the sum of the maximal available slack on
these PEs. An estimated energy dissipation for large1 is indicated in Figure 11(a).
The minimum extension time ∆tmin could be set to the intersection of the energy
estimation and the interpolated energy drop (as shown in Figure 11(a)). However,
we set it one order of magnitude lower, as indicated by an arrow in the figure.
This is done to account for the fact that in the case of an energy estimation close
to the real achievable energy reduction, the intersection would be approximately
one order of magnitude too high. In the case that the energy estimation would
be far below the real achievable energy reduction, the calculated ∆tmin would
become unnecessarily small. Therefore, we allow no ∆tmin smaller than 2.5 orders of
magnitude below the maximal DVS slack. This is based on the observation
that all used benchmarks show a similar characteristic, and an ideal ∆tmin can be
found at most 2.5 orders of magnitude from the maximal DVS slack.
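The ∆tmin selection heuristic described above can be sketched in a few lines of Python.
This is a minimal sketch under stated assumptions, not the implementation used for the
experiments: all function and parameter names are hypothetical, energies are assumed to
be normalised to the nominal energy E0 (so the first interpolation point is (s_h, 1)),
base-10 logarithms are assumed (the base cancels out in the intersection), and the
fallback for a flat interpolation line is an added safeguard.

```python
import math


def interpolate_energy_drop(s_h, dt_star, e_star):
    """Fit y = alpha * log10(x) + beta through (s_h, 1) and (dt_star, e_star)."""
    alpha = (1.0 - e_star) / (math.log10(s_h) - math.log10(dt_star))
    beta = e_star - alpha * math.log10(dt_star)
    return alpha, beta


def choose_dt_min(s_h, e_star, e_est, probe_divisor=3.0,
                  safety_orders=1.0, max_orders=2.5):
    """Pick a minimum extension time from the highest DVS slack s_h.

    e_star: normalised energy E*/E0 obtained with dt_star = s_h / probe_divisor.
    e_est:  rough estimate of the achievable normalised energy.
    """
    # Never go more than max_orders orders of magnitude below the highest slack.
    lower_bound = s_h / 10.0 ** max_orders
    dt_star = s_h / probe_divisor
    alpha, beta = interpolate_energy_drop(s_h, dt_star, e_star)
    if alpha <= 0.0:
        # Assumption: the probe run saved no energy, so the line is flat;
        # fall back to the smallest permitted value.
        return lower_bound
    # Intersection of the interpolated line with the estimated energy level.
    intersection = 10.0 ** ((e_est - beta) / alpha)
    # Place dt_min one order of magnitude below the intersection, then clamp.
    dt_min = intersection / 10.0 ** safety_orders
    return min(max(dt_min, lower_bound), s_h)
```

For instance, with the highest slack of large1 (29581) and hypothetical values of 0.9 for
the probe energy and 0.7 for the estimate, `choose_dt_min(29581.0, 0.9, 0.7)` returns a
value between the lower bound 29581 / 10^2.5 and the slack itself; a much more optimistic
estimate such as 0.3 is clamped to the lower bound.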
4.2 Schedule optimisation using the Generalised DVS Approach
To assess the capability of the proposed DVS optimised genetic list scheduler (presented
in Section 3.2) to reduce the power consumption as well as to find feasible
schedules, we have conducted several experiments. Table VI shows, for the same
benchmarks as in the previous section, the achieved energy reductions and computational
overheads, after including the EVEN-DVS and our PV-DVS algorithm inside
the schedule optimisation loop. Comparing the achieved reductions (Table VI) with
the results obtained by a mobility based scheduling (Table IV) reveals that for most
examples the energy consumption was further reduced, i.e., the schedule optimisation was
able to find execution orders which allow a more effective exploitation of DVS. For
instance, consider benchmark tgff23. We can observe that the energy reduction
was increased from 6.7% to 15.05% when using EVEN-DVS and from 12.39% to
23.44% when utilising PV-DVS. Certainly, the GA based schedule optimisation introduces
a computational overhead which results in a necessary trade-off between
energy reduction and optimisation time. Of course, the linear time complexity of
the EVEN-DVS approach results in lower optimisation times compared to PV-DVS,
which has a polynomial complexity. However, the achieved reductions justify
this overhead, which is in the worst case 21.2s compared to 0.59s for a task graph
with 84 nodes (tgff23).
To further conﬁrm the quality of the proposed DVS optimised scheduling tech-
nique, we compare it next with the DVS scheduling approach proposed by Gruian
et al. [Gruian and Kuchcinski 2001], using the benchmark collections TG1 and TG2,
which contain 60 task graph examples. The reported average energy reductions in
[Gruian and Kuchcinski 2001] are 28% and 13% for the tight deadline task graph
collections TG1 and TG2, respectively. Table VII presents the results obtained us-
ing the proposed DVS optimised scheduling technique. The table is divided into
Table VI. Experimental results obtained using the generalised DVS algorithm integrated into a
genetic list scheduling heuristic
               EVEN-DVS + GLSA                        PV-DVS + GLSA
Example        Energy    CPU time (s)  Reduction (%)  Energy    CPU time (s)  Reduction (%)
Tgff1          191       0.12          46.27          102       0.14          71.16
Tgff2          572920    0.20          22.91          545451    0.27          26.61
Tgff3          266907    0.33          51.89          170838    3.11          69.21
Tgff4          377445    0.24          12.55          375778    0.97          12.94
Tgff4 t        405473    0.25          6.06           396579    0.62          8.12
Tgff4 fixed    127867    0.26          27.65          124419    1.00          29.60
Tgff5          3721137   0.37          11.13          3450292   2.41          17.60
Tgff6          1399968   0.23          1.35           1396445   0.25          1.60
Tgff7          1925000   0.20          24.47          1797520   0.27          29.47
Tgff8          1722056   0.19          10.01          1648322   0.20          13.86
Tgff9          829608    0.18          16.76          774994    0.26          22.24
Tgff10         45325     0.17          34.65          44529     0.22          35.79
Tgff11         3755206   0.22          13.67          3621740   0.42          16.73
Tgff12         2212405   0.34          4.49           2198978   3.73          5.07
Tgff13         2342892   0.28          19.56          2315766   0.80          20.49
Tgff14         11891     0.21          23.44          11753     0.25          24.33
Tgff15         61271     0.41          2.13           60129     1.07          3.96
Tgff16         2492365   0.26          28.68          2449747   0.55          29.90
Tgff17         18923     0.27          19.34          18249     0.56          22.21
Tgff18         1724421   0.14          6.87           1421224   0.16          23.25
Tgff19         4515      0.17          23.98          4357      0.16          26.63
Tgff20         42704     0.19          45.02          37223     0.60          52.08
Tgff21         2983044   0.56          6.13           2578046   3.92          18.87
Tgff22         4664876   1.19          19.87          4119749   3.44          29.23
Tgff23         9826644   0.64          15.05          8855575   21.25         23.44
Tgff24         5240977   0.59          2.08           4881188   10.48         8.80
Tgff25         4922085   0.39          14.18          4545250   1.91          20.75
Hou            10211     0.21          25.53          7474      0.23          45.49
Hou clustered  11543     0.18          20.64          10270     0.12          23.39
Table VII. Experimental results obtained using our generalised DVS optimised scheduling
approach for benchmark examples TG1 and TG2
Benchmark   Continuous      Discrete        CPU time
            Reduction (%)   Reduction (%)   (s)
TG1         41.52           37.86           3.96
TG2         18.90           15.93           0.74
the two benchmark collections. Although the examples do not allow our approach
to leverage power variations, since the specified power values are constant, the
achieved energy reductions for TG1 and TG2 are 41.52% and 18.90% (continuous
reduction in Table VII), respectively. This is an improvement of 13.52 and 5.90
percentage points, which indicates the effectiveness of the proposed optimisation
technique, even when using constant
power benchmark examples. However, since the results in [Gruian and Kuchcin-
ski 2001] are obtained using multi-voltage PEs rather than variable-voltage PEs,
we have conducted an additional set of experiments, using the same multiple volt-
ages as given in [Gruian and Kuchcinski 2001]. Each supply voltage found by
our PV-DVS algorithm is split into its two neighbouring discrete voltages of the
multi-voltage PE, and the corresponding run-times for each voltage are calculated
Table VIII. Mapping optimisation of the benchmark set TG1 using NO-DVS, EVEN-DVS, and
PV-DVS
         NO-DVS            EVEN-DVS + MOB            PV-DVS + GLSA
Example  Energy    time    Energy    time    Red.    Energy    time      Red.    Red.
         Dissip.   (s)     Dissip.   (s)     (%)     Dissip.   (s)       (%)     Fac.
r000     798700    53.87   unsolved  18.60   n/a     586806    194.86    26.53   n/a
r001     759500    56.16   592674    13.87   21.97   399839    804.73    47.35   2.16
r002     744800    55.64   unsolved  16.51   n/a     551944    189.97    25.89   n/a
r003     994700    27.76   711887    15.98   28.43   554171    769.58    44.29   1.56
r004     886900    54.00   unsolved  19.97   n/a     566263    360.58    36.15   n/a
r005     744800    54.94   465853    16.75   37.45   373677    1596.67   49.83   1.33
r006     901600    36.88   unsolved  17.55   n/a     589469    827.22    34.62   n/a
r007     837900    55.20   unsolved  20.20   n/a     565731    269.07    32.48   n/a
r008     862400    30.63   unsolved  19.25   n/a     635426    207.46    26.32   n/a
r009     681100    53.24   424723    14.99   37.64   311751    1535.28   54.23   1.44
using Equations (7) and (8). The results of the discrete voltage optimisation are
shown in Table VII (see the discrete reduction column). For the
two benchmark sets the achieved average energy reductions are 37.86% and 15.93%,
respectively, which represent improvements of 9.86 and 2.93 percentage points. Note that these
reductions were obtained on benchmarks which do not show any power variations
and so this optimisation feature of the proposed DVS algorithm stays unexploited.
The achieved improvements are due to the fact that our iterative GA-based ap-
proach is able to explore a large space of potentially energy saving schedules, as
opposed to the constructive list scheduling used in [Gruian and Kuchcinski 2001].
Regarding the computational times, Gruian et al. reported average times for the
30-node task graphs of 10s to 120s, while the proposed algorithm executes on aver-
age in 0.74s to 3.96s, indicating a performance advantage of the presented scaling
technique.
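The splitting of a continuous supply voltage into its two neighbouring discrete voltages
can be illustrated as follows. The sketch does not reproduce Equations (7) and (8); it
assumes a commonly used delay model in which the time per clock cycle is proportional to
V/(V − Vt)^2 and the dynamic energy per cycle is proportional to V^2. The threshold
voltage, the voltage levels in the usage note and all function names are hypothetical.

```python
def cycle_time(v, v_t=0.8):
    """Relative time per clock cycle at supply voltage v (assumed delay model)."""
    return v / (v - v_t) ** 2


def split_voltage(v_cont, t_exec, levels, v_t=0.8):
    """Emulate the continuous voltage v_cont with two neighbouring discrete levels.

    Returns a list of (voltage, run_time, cycles) tuples that executes the same
    number of cycles in the same total time t_exec. v_cont must lie within the
    range spanned by levels.
    """
    cycles = t_exec / cycle_time(v_cont, v_t)
    if v_cont in levels:
        return [(v_cont, t_exec, cycles)]
    v_lo = max(v for v in levels if v < v_cont)
    v_hi = min(v for v in levels if v > v_cont)
    d_lo, d_hi = cycle_time(v_lo, v_t), cycle_time(v_hi, v_t)
    # Solve cycles_hi * d_hi + (cycles - cycles_hi) * d_lo = t_exec.
    cycles_hi = (cycles * d_lo - t_exec) / (d_lo - d_hi)
    cycles_lo = cycles - cycles_hi
    return [(v_hi, cycles_hi * d_hi, cycles_hi),
            (v_lo, cycles_lo * d_lo, cycles_lo)]


def dynamic_energy(parts):
    """Relative dynamic energy: number of cycles times squared supply voltage."""
    return sum(cycles * v ** 2 for v, _, cycles in parts)
```

With hypothetical levels `[1.2, 1.9, 2.5, 3.3]`, `split_voltage(2.2, 10.0, levels)` yields
one part at 2.5 V and one at 1.9 V whose run-times add up to the original 10 time units.
Under this model the discrete emulation dissipates somewhat more energy than the ideal
continuous voltage, which is in line with the lower discrete reductions in Table VII.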
Another feature of the proposed scheduling approach is worth mentioning.
The scheduling optimisation (GLSA) not only significantly reduces the
dissipated energy in the presence of DVS-PEs, but also increases the possibility to
ﬁnd feasible schedules, when compared to constructive techniques, such as mobility
based scheduling. This is of great importance since high quality solutions could
be found in design space regions where infeasible and feasible solutions are spa-
tially placed closely together. Making a wrong decision might involve a more costly
implementation of the system speciﬁcation. To clarify this, consider the results
obtained with the benchmark set TG1 from Gruian et al. [Gruian and Kuchcinski
2001], as shown in Table VIII. The results shown in Column 4 are based on EVEN-
DVS and a constructive list scheduling heuristic which uses the mobility of tasks
as priorities. Consider for example benchmark r000. In the case of this benchmark
the scheduling attempt fails and the implementation is infeasible (Column 4, un-
solved), making it necessary to increase the performance of the allocated system
for the given mapping. On the other hand, our iterative GA-based list scheduling
technique (GLSA) is able to improve infeasible schedules by providing feedback to
the optimisation process and therefore feasible schedules might be found, as in the
case of the task graph example r000 (Column 7). This eﬀect is likely to appear in
the presence of tight deadline specifications, as is the case with the benchmark
set TG1. It can be observed that for 6 out of 10 examples no feasible mapping
could be found when using a mobility based scheduling algorithm. Similarly, for
Table IX. Increasing architectural parallelism to allow voltage scaling of the OFD algorithm
Architecture   Static Power   Dynamic Power   Total Power   Reduction   CPU time
               (W)            (W)             (W)           (%)         (s)
2 DSPs         0.383          2.137           2.52          –           –
3 DVS-DSPs     0.574          1.563           2.137         15.2        0.49
4 DVS-DSPs     0.736          1.053           1.789         29.0        0.69
5 DVS-DSPs     0.898          1.000           1.898         24.7        0.76
the remaining 20 task graphs of the TG1 benchmark set only 8 could be scheduled
using a mobility based scheduling approach. Clearly, the improved schedules are
solely introduced by the GA-based list scheduling and are not dependent on the
diﬀerent voltage scaling approaches.
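The reduction columns of Table VIII can be cross-checked with a short calculation. The
sketch below is illustrative only: the energy values are copied from the table for the
four examples that mobility based scheduling could solve, and the interpretation of
"Red. Fac." as the ratio of the two percentage reductions is an assumption that happens
to match the printed values.

```python
def reduction_percent(e_nominal, e_scaled):
    """Energy reduction relative to the NO-DVS energy, in percent."""
    return 100.0 * (1.0 - e_scaled / e_nominal)


# (NO-DVS, EVEN-DVS + MOB, PV-DVS + GLSA) energy dissipation from Table VIII.
table_viii = {
    "r001": (759500, 592674, 399839),
    "r003": (994700, 711887, 554171),
    "r005": (744800, 465853, 373677),
    "r009": (681100, 424723, 311751),
}

for name, (e_no, e_even, e_pv) in table_viii.items():
    red_even = reduction_percent(e_no, e_even)
    red_pv = reduction_percent(e_no, e_pv)
    # Assumption: "Red. Fac." is the ratio of the two percentage reductions.
    factor = red_pv / red_even
    print(f"{name}: EVEN {red_even:.2f}%  PV {red_pv:.2f}%  factor {factor:.2f}")
```

For r001, for example, this gives reductions of about 21.97% and 47.35% and a factor of
about 2.16, in agreement with the table.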
In addition to the experiments presented above, we have validated the energy
reduction capability of the proposed scheduling and voltage scaling techniques,
using the real-life example of an optical flow detection (OFD) algorithm. This
application is part of an autonomous helicopter and is used for traffic monitoring
purposes. In its current implementation the OFD algorithm runs on two ADSP-21061L
digital signal processors (DSPs), with an average current of 760mA at 3.3V,
resulting in an average power dissipation of approximately 2.5W. However, due to
the stringent power budget on board the helicopter, including application critical
sub-systems, it is necessary to keep the overall power dissipation under a certain
limit. With respect to the performance of the two DSPs, this implementation is
able to process 12.5 frames of 78x120 pixels per second. We have conducted two sets
of experiments regarding the OFD algorithm. In both we consider a hypothetical
extension of the DSPs towards DVS capability (DVS-DSP) and take into account
that such an extension increases the static power consumption of the processors.
This increase was estimated to be 10% for the systems presented in [Pering et al. 1998].
In the ﬁrst experiment the performance constraint is kept ﬁxed, i.e., the ﬂow
detection has to perform 12.5 frames per second. Since the 2 DSP implementation
needs to utilise the processors completely to achieve the 12.5Hz repetitions, we
increase the system performance by allocating additional DSPs. In this way it is
possible to utilise the application parallelism more effectively and hence achieve
a higher performance. This surplus performance can then be exploited by the
DVS-DSPs, in order to lower the dynamic power consumption. Table IX reports on
our ﬁndings. From this table it can be observed that with increasing number
of PEs the static power consumption increases as well, while the dynamic power
consumption decreases. Nevertheless, from the battery point of view the total
power dissipation is the limiting factor and it can be seen that the implementation
with 4 DVS-DSPs shows the lowest power consumption. It is important to note
that the implementations shown in Table IX do not necessitate any performance
degradation, though the energy dissipation is reduced by up to 29%. The proposed
scheduling and voltage scaling techniques optimised the execution of the 32 tasks
in less than 0.8s. Of course, the more DVS-DSPs are allocated, the more costly the
implementation becomes.
The last experiment is based on the fact that a 12.5Hz repetition rate is unnecessarily
high. We therefore relax the performance constraints to a repetition rate
of 8.33Hz, which is still high enough to allow a correct flow detection, i.e., a correct
operation of the OFD algorithm. In this case even the implementation built
out of 2 PEs is not fully utilised and the resulting idle times can be exploited by
DVS to reduce the power consumption. Table X shows the results for different
architectural alternatives, consisting of 2 to 5 DVS-DSPs. Among all alternatives,
the system built out of two DVS-PEs is the clear favourite, since it achieves the
lowest energy consumption at the lowest cost. Clearly, the dynamic power reductions
achieved for the 3–5 DVS-DSP systems do not justify the increased static
power consumption. The optimisation of schedule and voltage scaling for these
systems was carried out in at most 3.54s.
Table X. Relaxed performance constraints of the OFD algorithm at 8.33Hz
Architecture   Static Power   Dynamic Power   Total Power   Reduction   CPU time
               (W)            (W)             (W)           (%)         (s)
2 DSPs         0.383          2.137           2.52          –           –
2 DVS-DSPs     0.413          0.766           1.179         53.2        1.10
3 DVS-DSPs     0.574          0.699           1.273         49.5        1.78
4 DVS-DSPs     0.736          0.497           1.233         51.1        2.27
5 DVS-DSPs     0.898          0.503           1.401         44.4        3.54
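The totals and reductions reported in Tables IX and X follow from simple arithmetic on
the static and dynamic power figures, and the preferred architecture is the one with the
lowest total power. The following sketch recomputes them; the power values are copied
from the two tables, while the function and variable names are hypothetical.

```python
BASELINE_W = 0.383 + 2.137  # static + dynamic power of the 2-DSP system, in watts

# (architecture, static power in W, dynamic power in W) from Tables IX and X.
table_ix = [("3 DVS-DSPs", 0.574, 1.563), ("4 DVS-DSPs", 0.736, 1.053),
            ("5 DVS-DSPs", 0.898, 1.000)]
table_x = [("2 DVS-DSPs", 0.413, 0.766), ("3 DVS-DSPs", 0.574, 0.699),
           ("4 DVS-DSPs", 0.736, 0.497), ("5 DVS-DSPs", 0.898, 0.503)]


def evaluate(rows, baseline=BASELINE_W):
    """Return (architecture, total power, reduction in %) for each row."""
    result = []
    for name, static, dynamic in rows:
        total = static + dynamic
        result.append((name, total, 100.0 * (1.0 - total / baseline)))
    return result


def lowest_power(rows):
    """Name of the architecture with the smallest total power."""
    return min(evaluate(rows), key=lambda r: r[1])[0]


print(lowest_power(table_ix))  # architecture chosen at 12.5Hz
print(lowest_power(table_x))   # architecture chosen at 8.33Hz
```

The calculation selects the 4 DVS-DSP system at 12.5Hz and the 2 DVS-DSP system at
8.33Hz, which shows how the growing static power eventually outweighs the savings in
dynamic power as more processors are allocated.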
5. CONCLUSIONS
In this work, we have demonstrated that the consideration of power variations is
essential during the energy optimised synthesis of heterogeneous distributed hard-
ware/software systems containing power managed PEs, especially in the presence of
DVS-PEs. This has been mostly neglected in previous work on distributed systems
which include DVS-PEs. We have presented a novel DVS algorithm which identi-
ﬁes supply voltages for the tasks executing on DVS-PEs, under the consideration
of power variation eﬀects in order to minimise the dynamic energy dissipation. The
approach is based on the defined energy difference. This DVS technique was successfully
integrated into a genetic list scheduling approach so as to iteratively optimise a
mapped system specification towards an efficient exploitation of the available DVS-PEs.
The integration was achieved by adapting the employed list scheduler for the
particular problems involved in dynamic voltage scaling. We have further comprehensively
investigated the effects of scheduling on the achievable energy reductions
when the generalised DVS technique is employed. The extensive experimental results
show the necessity to take the PE power profiles into account when optimising
the scaled supply voltages for energy minimisation. Due to recent developments in
embedded computing systems and the availability of various implementations of
state-of-the-art DVS processors [Intel® XScale™ 2000; Mobile AMD Athlon™ 4
2000; Klaiber 2000] with power management techniques, voltage scaling algorithms
(such as the one presented here) are becoming an important part of the synthesis flow.
Acknowledgements
The authors wish to thank EPSRC for the ﬁnancial support in this project. They
would also like to thank Flavius Gruian (Lund University, Sweden) and Neal K.
Bambha (University of Maryland, USA) for kindly providing their benchmark sets.
Additionally, the authors wish to thank the reviewers for their excellent critical
assessment of the work and their useful suggestions.
REFERENCES
Bambha, N., Bhattacharyya, S., Teich, J., and Zitzler, E. 2001. Hybrid Global/Local Search
Strategies for Dynamic Voltage Scaling in Embedded Multiprocessors. In Proc. 1st Int. Symp.
Hardware/Software Co-Design (CODES’01). 243–248.
Brandolese, C., Fornaciari, W., Salice, F., and Sciuto, D. 2000. Energy Estimation for 32
bit Microprocessors. In Proc. 8th Int. Workshop Hardware/Software Co-Design (CODES’00).
24–28.
Burd, T. D. 2001. Energy-Eﬃcient Processor System Design. Ph.D. thesis, University of Cali-
fornia at Berkeley.
Burd, T. D. and Brodersen, R. W. 1996. Processor Design for Portable Systems. J. VLSI
Signal Processing 13, 2 (August), 203–222.
Burd, T. D., Pering, T. A., Stratakos, A. J., and Brodersen, R. W. 2000. A Dynamic Voltage
Scaled Microprocessor System. IEEE J. Solid-State Circuits 35, 11 (November), 1571–1580.
Chretienne, P., Coffman, E. G., Lenstra, J. K., and Liu, Z. 1995. Scheduling Theory and
its Applications. John Wiley & Sons.
Devadas, S. and Malik, S. 1995. A Survey of Optimization Techniques Targeting Low Power
VLSI Circuits. In Proc. IEEE 32nd Design Automation Conf. (DAC95). 242–247.
Dhodhi, M. K., Ahmad, I., and Storer, R. 1995. SHEMUS: Synthesis of Heterogeneous Mul-
tiprocessor Systems. J. Microprocessors and Microsystems 19, 6 (August), 311–319.
Dick, R., Rhodes, D., and Wolf, W. 1998. TGFF: Task Graphs for free. In Proc. 5th Int.
Workshop Hardware/Software Co-Design (Codes/CASHE’97). 97–101.
Dick, R. P. and Jha, N. K. 1998. MOGAC: A Multiobjective Genetic Algorithm for Hardware-
Software Co-Synthesis of Distributed Embedded Systems. IEEE Trans. Computer-Aided De-
sign 17, 10 (Oct), 920–935.
Eles, P., Peng, Z., Kuchcinski, K., and Doboli, A. 1997. System Level Hardware/Software
Partitioning Based on Simulated Annealing and Tabu Search. J. Design Automation for Em-
bedded Systems 2, 5–32.
Ernst, R., Henkel, J., and Benner, T. 1993. Hardware-Software Co-synthesis for
Micro-Controllers. IEEE Design & Test of Comp. 10, 4 (Dec), 64–75.
Fogarty, T. C. 1989. Varying the probability of mutation in the genetic algorithm. In Proc. 3rd
Int. Conf. Genetic Algorithms (ICGA). 104–109.
Fornaciari, W., Sciuto, D., and Silvano, C. 1999. Power Estimation for Architectural Ex-
ploration of HW/SW Communication on System-Level Buses. In Proc. 7th Int. Workshop
Hardware/Software Co-Design (CODES’99). 152–156.
Garey, M. R. and Johnson, D. S. 1979. Computers and Intractability: A Guide to the theory
of NP-Completeness. W.H. Freeman and Company.
Goldberg, D. E. 1989. Genetic Algorithms in Search, Optimization & Machine Learning.
Addison-Wesley Publishing Company.
Grajcar, M. 1999. Genetic List Scheduling Algorithm for Scheduling and Allocation on a Loosely
Coupled Heterogeneous Multiprocessor System. In Proc. IEEE 36th Design Automation Conf.
(DAC99). 280–285.
Gruian, F. 2000. System-Level Design Methods for Low-Energy Architectures Containing Vari-
able Voltage Processors. In Workshop Power-Aware Computing Systems.
Gruian, F. and Kuchcinski, K. 2001. LEneS: Task Scheduling for Low-Energy Systems Using
Variable Supply Voltage Processors. In Proc. Asia South Paciﬁc - Design Automation Conf.
(ASP-DAC’01). 449–455.
Gutnik, V. and Chandrakasan, A. 1997. Embedded Power Supply for Low-Power DSP. IEEE
Trans. VLSI Systems 5, 4, 425–435.
Henkel, J., Benner, T., and Ernst, R. 1993. Hardware Generation and Partitioning Ef-
fects in the COSYMA System. In Proc. Int. Workshop Hardware/Software Co-Design
(Codes/CASHE’93).
Henkel, J. and Ernst, R. 2001. An Approach to Automated Hardware/Software Partitioning
using a Flexible Granularity that is driven by High-Level Estimation Techniques. IEEE Trans.
VLSI Systems 9, 2, 273–289.
Hong, I., Kirovski, D., Qu, G., Potkonjak, M., and Srivastava, M. B. 1999. Power Opti-
mization of Variable-Voltage Core-Based Systems. IEEE Trans. Computer-Aided Design 18, 12
(Dec), 1702–1714.
Hou, J. and Wolf, W. 1996. Process Partitioning for Distributed Embedded Systems. In Proc.
CODES. 70 – 76.
Intel® XScale™. 2000. Developer’s Manual. Order Number 273473-001.
Ishihara, T. and Yasuura, H. 1998. Voltage Scheduling Problem for Dynamically Variable
Voltage Processors. In Proc. Int. Symp. Low Power Electronics and Design (ISLPED’98).
197–202.
Kalavade, A. 1995. System-Level Codesign of Mixed Hardware-Software Systems. Ph.D. thesis,
University of California, Berkeley.
Kirovski, D. and Potkonjak, M. 1997. System-level Synthesis of Low-Power Hard Real-Time
Systems. In Proc. IEEE 34th Design Automation Conf. (DAC97). 697–702.
Klaiber, A. 2000. The Technology behind Crusoe Processors. http://www.transmeta.com.
Lee, S. and Sakurai, T. 2000. Run-time Voltage Hopping for Low-power Real-time Systems. In
Proc. IEEE 37th Design Automation Conf. (DAC00). 806–809.
Li, Y.-T. S., Malik, S., and Wolfe, A. 1995. Performance Estimation of Embedded Software
with Instruction Cache Modeling. In Proc. IEEE/ACM Int. Conf. Computer-Aided Design
(ICCAD-95). 380–387.
Liu, J., Chou, P. H., Bagherzadeh, N., and Kurdahi, F. 2001. Power-Aware Scheduling
under Timing Constraints for Mission-Critical Embedded Systems. In Proc. IEEE 38th Design
Automation Conf. (DAC01). 840–845.
Luo, J. and Jha, N. K. 2000. Power-conscious Joint Scheduling of Periodic Task Graphs and
Aperiodic Tasks in Distributed Real-time Embedded Systems. In Proc. IEEE/ACM Int. Conf.
Computer-Aided Design (ICCAD-00). 357–364.
Luo, J. and Jha, N. K. 2001. Battery-aware Static Scheduling for Distributed Real-Time Em-
bedded Systems. In Proc. IEEE 38th Design Automation Conf. (DAC01). 444–449.
Manzak, A. and Chakrabarti, C. 2000. Variable Voltage Task Scheduling for Minimizing
Energy or Minimizing Power. In Proc. Int. Conf. Acoustics, Speech, and Signal Processing
(ICASSP00). 3239–3242.
Micheli, G. D. and Gupta, R. K. 1997. Hardware/Software Co-Design. In Proceedings of the
IEEE. 349–365.
Mobile AMD Athlon™ 4. 2000. Processor Model 6 CPGA Data Sheet. Publication No 24319
Rev E.
Muresan, R. and Gebotys, C. H. 2001. Current Consumption Dynamics at Instruction and
Program Level for a VLIW DSP Processor. In Proc. Int. Symp. System Synthesis (ISSS’01).
130–135.
Okuma, T., Ishihara, T., and Yasuura, H. 1999. Real-Time Task Scheduling for a Variable
Voltage Processor. In Proc. Int. Symp. System Synthesis (ISSS’99). 24–29.
Okuma, T., Ishihara, T., and Yasuura, H. 2001. Software Energy Reduction Techniques for
Variable-Voltage Processors. IEEE Design & Test of Comp. 18, 2 (March–April), 31–41.
Pedram, M. 1996. Power Minimization in IC Design: Principles and Applications. ACM Trans.
Design Automation of Electronic Systems (TODAES) 1, 1 (Jan), 3–56.
Pering, T., Burd, T. D., and Brodersen, R. B. 1998. The Simulation and Evaluation for
Dynamic Voltage Scaling Algorithms. In Proc. Int. Symp. Low Power Electronics and Design
(ISLPED’98). 76–81.
Prakash, S. and Parker, A. 1992. SOS: Synthesis of Application-Speciﬁc Heterogeneous Mul-
tiprocessor Systems. J. Parallel & Distributed Computing, 338–351.
Quan, G. and Hu, X. S. 2001. Energy Eﬃcient Fixed-Priority Scheduling for Real-Time Systems
on Variable Voltage Processors. In Proc. IEEE 38th Design Automation Conf. (DAC01). 828–
833.
Quan, G. and Hu, X. S. 2002. Minimum Energy Fixed-Priority Scheduling for Variable Voltage
Processors. In Proc. Design, Automation and Test in Europe Conf. (DATE2002). 782–787.
Rogers, A. and Prügel-Bennett, A. 1999. Modelling the dynamics of a steady-state genetic
algorithm. In Foundations of Genetic Algorithms (FOGA-5). 57–68.
Schmitz, M. T. 2003. Energy Minimisation Techniques for Distributed Embedded Systems. Ph.D.
thesis, University of Southampton.
Schmitz, M. T. and Al-Hashimi, B. M. 2001. Considering Power Variations of DVS Process-
ing Elements for Energy Minimisation in Distributed Systems. In Proc. Int. Symp. System
Synthesis (ISSS’01). 250–255.
Schmitz, M. T., Al-Hashimi, B. M., and Eles, P. 2002. Energy-Eﬃcient Mapping and Schedul-
ing for DVS Enabled Distributed Embedded Systems. In Proc. Design, Automation and Test
in Europe Conf. (DATE2002). 514–521.
Shin, Y. and Choi, K. 1999. Power Conscious Fixed Priority Scheduling for Hard Real-Time
Systems. In Proc. IEEE 36th Design Automation Conf. (DAC99). 134–139.
Shin, Y., Choi, K., and Sakurai, T. 2000. Power Optimization of Real-Time Embedded Sys-
tems on Variable Speed Processors. In Proc. IEEE/ACM Int. Conf. Computer-Aided Design
(ICCAD-00). 365–368.
Sih, G. C. and Lee, E. A. 1993. A Compile-time scheduling heuristic for interconnection-
constrained heterogeneous processor architectures. IEEE Trans. Parallel and Distributed Sys-
tems 4, 2 (Feb.), 175–187.
Simunic, T., Benini, L., Acquaviva, A., Glynn, P., and Micheli, G. D. 2001. Dynamic Voltage
Scaling and Power Management for Portable Systems. In Proc. IEEE 38th Design Automation
Conf. (DAC01). 524–529.
Teich, J., Blickle, T., and Thiele, L. 1997. An Evolutionary Approach to System-Level Syn-
thesis. In Proc. 5th Int. Workshop Hardware/Software Co-Design (Codes/CASHE’97). 167 –
171.
Tiwari, V., Malik, S., and Wolfe, A. 1994. Power Analysis of Embedded Software: A First
Step Towards Software Power Minimization. IEEE Trans. VLSI Systems.
Weiser, M., Welch, B., Demers, A., and Shenker, S. 1994. Scheduling for Reduced CPU
Energy. In Proc. USENIX Symposium on Operating Systems Design and Implementation
(OSDI). 13–23.
WITAS. The Wallenberg laboratory for research on Information Technology and Autonomous
System. http://www.ida.liu.se/ext/witas/.
Wolf, W. H. 1994. Hardware/Software Co-Design of Embedded Systems. In Proceedings of the
IEEE. 967–989.
Wolf, W. H. 1997. An Architectural Co-Synthesis Algorithm for Distributed, Embedded Com-
puting Systems. IEEE Trans. VLSI Systems 5, 2 (June), 218–229.
Wu, M. and Gajski, D. 1990. Hypertool: A Programming Aid for Message-passing Systems.
IEEE Trans. Parallel and Distributed Systems 1, 3 (July), 330–343.
Xie, Y. and Wolf, W. 2001. Allocation and Scheduling of Conditional Task Graph in
Hardware/Software Co-Synthesis. In Proc. Design, Automation and Test in Europe Conf.
(DATE2001). 620 – 625.
Zhang, Y., Hu, X., and Chen, D. Z. 2002. Task Scheduling and Voltage Selection for Energy
Minimization. In Proc. IEEE 39th Design Automation Conf. (DAC02). 183–188.