Energy-conscious tasks partitioning onto a heterogeneous multi-core platform by Awan, Muhammad Ali & Petters, Stefan M.
  
 
 
 
 
 
 
Energy-conscious tasks partitioning onto a 
heterogeneous multi-core platform 
 
 
 
 
www.hurray.isep.ipp.pt 
Technical Report 
HURRAY-TR-121006 
Version:  
Date: 10/18/2012 
Muhammad Ali Awan 
Stefan M. Petters 
 
Technical Report HURRAY-TR-121006  Energy-conscious tasks partitioning onto 
  a heterogeneous multi-core platform 
© IPP Hurray! Research Group 
www.hurray.isep.ipp.pt   
1 
 
Energy-conscious tasks partitioning onto a heterogeneous multi-core platform 
Muhammad Ali Awan, Stefan M. Petters 
IPP-HURRAY! 
Polytechnic Institute of Porto (ISEP-IPP) 
Rua Dr. António Bernardino de Almeida, 431 
4200-072 Porto 
Portugal 
Tel.: +351.22.8340509, Fax: +351.22.8340509 
E-mail: muaan@isep.ipp.pt, smp@isep.ipp.pt 
http://www.hurray.isep.ipp.pt 
 
Abstract 
Modern multicore processors for the embedded market are often heterogeneous in nature. One feature often available 
are multiple sleep states with varying transition cost for entering and leaving said sleep states. This research effort 
explores the energy efficient task-mapping on such a heterogeneous multicore platform to reduce overall energy 
consumption of the system. This is performed in the context of a partitioned scheduling approach and a very realistic 
power model, which improves over some of the simplifying assumptions often made in the state-of-the-art. The 
developed heuristic consists of two phases, in the first phase, tasks are allocated to minimise their active energy 
consumption, while the second phase trades off a higher active energy consumption for an increased ability to exploit 
savings through more efficient sleep states. Extensive simulations demonstrate the effectiveness of the approach.  
 
Energy-conscious tasks partitioning onto a
heterogeneous multi-core platform
Muhammad Ali Awan Stefan M. Petters
CISTER/INESC-TEC, ISEP, Polytechnic Institute of Porto, Portugal
maan,smp@isep.ipp.pt
Abstract—Modern multicore processors for the embedded mar-
ket are often heterogeneous in nature. One feature often available
are multiple sleep states with varying transition cost for entering
and leaving said sleep states. This research effort explores the
energy efficient task-mapping on such a heterogeneous multicore
platform to reduce overall energy consumption of the system.
This is performed in the context of a partitioned scheduling
approach and a very realistic power model, which improves
over some of the simplifying assumptions often made in the
state-of-the-art. The developed heuristic consists of two phases,
in the first phase, tasks are allocated to minimise their active
energy consumption, while the second phase trades off a higher
active energy consumption for an increased ability to exploit
savings through more efficient sleep states. Extensive simulations
demonstrate the effectiveness of the approach.
I. INTRODUCTION
For embedded real-time (RT) systems it is imperative that
timing constraints posed by the environment are met. In
this general context a number of trends can be identified.
Firstly, Moore’s law is no longer sustained by increasing clock
frequencies, but rather by addition of extra cores in multi-
processors. This is driven for example, by the performance
per watt ratio, as higher clock ratios demand also higher
supply voltages. Besides symmetric multicore processors, ho-
mogeneous and heterogeneous multicores gain in popularity.
The move beyond symmetric multicores is driven by both
using cores geared to perform specific tasks well and cheap.
A second trend is an increased interest in multi-criticality
devices, where part of the system is critical and other parts
are executed in a best effort manner. Finally, there is the
move towards increased use of embedded devices with limited
energy supply. These might be, for example, solar powered
devices in the field or handheld rechargeable devices. In these
kind of devices effective management of the limited resource
(energy) is another constraint in the system requirements.
The real-time community has recognised these trends and
provided solutions to these challenges. However, in most cases,
the power and energy models used make many simplifying
assumptions, which limit the applicability of the presented so-
lutions. Common assumptions are, on one side homogeneous
multicore processors with a constant speed factor between the
different cores; on the other side, the energy consumption of
different applications are only a function of execution time
rather than other task characteristics (e.g. number of cache
misses). The latter has been shown to be widely off the mark
[1]. Finally, the use of multiple available sleep states is rare.
The fact that task characteristics, like the cache miss pattern
have an influence on the energy consumption beyond the mere
change of execution time, means that analytical solutions are
bound to be suboptimal for most specific cases. As such,
the way forward is an effective heuristic to be used for
energy management. Within this work, we assume that the
system has such non-linear dependencies on execution time
and energy consumption and several sleep states. In order to
guarantee the temporal isolation requirement, we work with a
partitioned scheduling approach. The underlying approach per
CPU, ERTH [2], allows reconfiguration at run-time and thus
enables limited migration, however, in this work we focus on
the task partitioning and mapping problem. In the allocation
stage the approach considers average-case energy consumption
as objective function, considering real-time constraints based
on worst-case execution and minimum inter-arrival time.
The proposed approach is divided into two steps. Firstly,
the novel algorithm performs assignments with an objective
to reduce the active energy consumption of the system by
allocating tasks to their favourite processors, on which their
active energy consumption is minimal compared to other
processor types. In the second phase, it trades off the higher
active energy consumption of tasks to enhance the processor’s
ability to use more efficient sleep states. The sleep states allow
the processor to reduce the static power consumption of the
system in idle intervals. The second stage is motivated by
the fact that the static power consumption has become non
negligible portion of the overall energy consumption of the
system. Traditional task assignment algorithms aim to reduce
the active power consumption of the system by assigning the
tasks to their favourite processor, while ignoring the static
power consumption. The management of the static power
consumption of the processor is an orthogonal issue as it
depends on the properties of the tasks such as its minimum
inter-arrival time and worst-case execution time. For instance,
assume the task assignment is such that it generates large
amount of idle intervals. The processor may not be able to
exploit it to use deeper sleep states due to a combination of
the larger transition overhead of those and a short period task.
The paper is organised as follows. Section II discusses
the related work followed by the system model. Section IV
presents the two phase approach to do the task assignment
followed by the experimental setup and results. We conclude
and provide future directions in Section VI.
II. RELATED WORK
Energy efficient scheduling for the homogeneous multipro-
cessors has been widely explored in RT systems in the last
decade. For instance, Kandhalue et al. [3] recently presented
a Single-clock domain multi-processor Frequency Assignment
Algorithm (SFAA) for the periodic, implicit deadline tasks
under fixed priority (rate-monotonic) scheduling. It exploits
the task period relationships to determine energy efficient fre-
quency assignment. Chen et al. [4] provided a comprehensive
survey of such techniques. In contrast, the state-of-the-art in
power-aware heterogeneous multiprocessors is limited.
Yu and Prasanna [5] proposed the static allocation of the
tasks in a RT system for the heterogeneous processing units
under Dynamic Voltage Scaling (DVS). They formulated the
problem as an Integer Linear Programming (ILP) and provided
a linearisation heuristics. A pseudo polynomial time greedy
algorithm [6] is proposed by Huang et al. for the frame-based
RT task model and heterogeneous systems. Furthermore, a
greedy heuristics is provided to migrate the tasks from the
overloaded processor to reduce energy consumption. Luo and
Jha addressed the tasks model with precedence constraints
and proposed the list-scheduling strategy [7] for the hetero-
geneous distributed systems. Chen and Thiele [8] considered
a case of 2 type heterogeneous processors and proposed a
polynomial time approximation scheme based on the ratio of
task execution times on the different processor types. The
synthesis problem for heterogeneous platform is addressed
by Hsu et al. [9] for the RT task model. They proposed
approximation algorithm based on a rounding technique by
applying a parametric relaxation on an ILP to minimise the
processor cost under the given timing and energy cost. Hung et
al. [10] considered a heterogeneous platform with 2 processing
elements, one with DVS enabled core and second without
DVS capability, with an objective to reduce the overall energy
consumption and maximise the energy saving in migration
from DVS enabled core to non-DVS core. While DVS has its
advantages, the state-of-the-art ( [5]–[10]) ignores the static
power consumption. We focus on the shut-down mechanism
in this paper that effectively exploits the idle intervals in the
schedule to reduce the static power consumption of the system
that has become a considerable factor of the overall power
consumption of the modern embedded systems.
Yang et al. [11] proposed an approximation algorithm
based on dynamic programming and provides polynomial-
time solution when the number of processor types is a small
constant. However, in the general case when the restriction
over the number of processor types is relaxed, this scheme has
exponential time/space complexity. They also assume static
power consumption of the system as a constant factor. The
work of Chen et al. [12] presented a task assignment algorithm
for periodic real-time tasks on heterogeneous platforms. The
problem is formulated as an ILP problem. They relax some of
the assumptions to adopt it into linear programming (LP) and
solve it through extreme point theory [13]. The tasks assigned
fractionally in the previous steps are reassigned through known
heuristics such first-fit, best-fit, worst-fit or last-fit. They ( [11],
[12]) assume the static power consumption of the system is a
constant factor and it cannot be reduced due to the significant
overhead of the sleep transitions. This assumption does not
hold for modern processors which contains several sleep states
to reduce the static power consumption of the system. More-
over, the static power consumption has become a considerable
part of the overall energy consumption. Therefore, the effect
of the task allocation on the power consumption in the sleep
states should be considered to avoid suboptimal assignments.
Our proposed algorithm is based on the realistic power
model. It considers the effect of task properties on both
active and static power consumption of an assigned processor.
In the context of heterogeneous multicores, the state-of-the-
art assumes only dynamic power consumption, ignores static
power consumption or considers it a constant factor while
doing task allocation on such platforms.
III. MODEL
A. Platform
We assume a partitioned multicore architecture, withM dif-
ferent types of heterogeneous processors/cores. Each processor
type has a unique characteristic of power consumption and
execution capability when compared to others. Each processor
type ⇡m has zm processing units of type m. The total number
of processors in the system will be equal to
PM
i=1(z
i). The
utilisation of a given core is denoted as Umzm .
B. Task Model
We assume sporadic task-model with ` independent tasks
⌧ = ⌧1, ⌧2, · · · , ⌧`. Each task ⌧i is represented as a quadruple
hCmi , Di, Ti, E¯mi i, where Cmi is a vector of worst-case exe-
cution times of ⌧i on M different processor types. Di is the
deadline and Ti is the minimum-inter arrival time. As derived
value Umi = Cmi /Ti. For the sake of simplicity, we assume
implicit deadline meaning Di = Ti. E¯mi is a vector of the
average-case energy consumption of ⌧i onM different proces-
sor types at their maximum speed. Each independent task will
release a sequence of unlimited jobs jmi,k = hri,k, cˆi,k, di,ki.
Where ri,k, cˆi,k and di,k are the absolute release time, actual
execution time and absolute deadline respectively. Jobs of the
same task are allowed to vary their execution between ⌧i’s
best-case execution time (BCET) and the worst-case execution
time (WCET).
The Enhanced Race-To-Halt (ERTH) algorithm [2] is used
on each processor, which is a leakage aware energy manage-
ment approach for dynamic priority systems. It allows multiple
sleep states per processor and utilises spare capacity available
online to save total energy consumption of the system. ERTH
is based on the Rate-Based Earliest Deadline first (RBED)
framework [14], which provides temporal isolation via an
enforced budget associated with each task. This temporal iso-
lation allows for mixed criticality workloads. Though RBED
supports many application classes (such as Hard RT, Soft RT
and BE tasks), we focus in our discussion on BE and Hard
RT tasks without loss of generality.
2
C. Power Model
The power model used in state-of-the-art assumes two
different parts: dynamic (active) power and static (leakage)
power. Dynamic power consumption varies with the frequency
of the processor, while static power consumption is considered
as a constant factor. Consequently, this power model assumes
the energy consumption of an application on a processor is
only a function of its execution time. However, in real terms,
energy consumption on a certain processor depends also on
the set of instructions it has to execute to perform the desired
functionality. Different instructions use different parts of CPU,
and hence result in a different energy consumption. Therefore,
two application with identical execution time may consume
different energy depending on the characteristics of the instruc-
tions used, and the number of cache misses involved. Secondly,
the static power consumption of the system cannot be regarded
as a constant factor. If the energy saving mechanism is based
on sleep states then the static power consumption of the system
depends on the energy characteristics of the used sleep states.
We employ this more refined power model where energy
consumption of a system is not constant per unit time, rather
depends on the behaviour of the application, the sleep-states
characteristics of the processor and the use of sleep states by
the scheduling algorithm.
We assume only a single speed per core, i.e. no DVS.
The power consumption of the processor type ⇡m in active
mode and idle mode are Pma and Pmi respectively. Similarly,
we assume each processor has N sleep states (low power
states). Each sleep state Smn is characterised with the tuple
hPmn , trmn , Esmn i, where Pmn is the power consumption of the
system in the sleep state Smn , trmn is the transition overhead
of going into or out of sleep state and Eswn is the energy
overhead associated to each sleep transition. For brevity, it
is assumed that the transition overhead of going into or out
of sleep state is same i.e. trmn . The break-even-time BETmn
of the sleep state Smn describes the minimum interval for
which entering a given sleep state is more efficient than any
shallower sleep state, despite the extra overhead (time/energy)
of entering and leaving this sleep state. The sleeps states
parameters can be used to derive its break-even-time BETmn
using any known techniques [2]. Note that the BETmn for
practical consideration is atleast 2 ⇤ trmn .
The average energy consumption of all tasks on all proces-
sor types is determined offline using any known techniques
(for instance, energy measurement technique based on perfor-
mance monitoring counter [15]). Nevertheless, one can also
use our approach with the naı¨ve power model that assumes in
active mode, the energy consumption of the processor is con-
stant per unit time or consider worst case energy consumption
as optimisation target. The preference of the task to any core is
set with respect to its ascending order of energy consumption.
The most favourite core type for the task is the one where
its energy consumption is minimal. Similarly, a core type is
least preferred where the energy consumption of the task is
maximal. We assume the static power consumption of the
system is not constant. It can be reduced by using efficient
low power sleep states in the idle intervals.
D. Problem Statement
We consider M-type Heterogeneous platform with per core
several sleep states assuming their energy/time overhead in a
setting of partitioned scheduling and map a given task-set onto
this platform such that the overall energy consumption (active
+ sleep) of the system is minimised.
IV. ALLOCATION HEURISTICS
In order to tackle active and static power consumption, a two
phase algorithm is proposed to perform the task assignment
for the given M-type heterogeneous platform. The first phase
of the algorithm optimises the assignments such that it reduces
the active energy consumption of the system. The second
phase trades tasks active energy consumption to enhance the
ability of the processors to use efficient sleep states to reduce
static power consumption of the system. For simplicity sake,
we assume in the discussion only a single core of each
processor type and has a utilisation of Um. Multiple cores
per type essentially only increase the processing capacity, but
do not provide new insights. Consequently we will use the
terms processor type, core type and core interchangeably.
A. First Phase of Allocation
We propose two different assignment algorithms to reduce
the dynamic power consumption of the system.
1) Least Loss Energy Density Algorithm (LLED): This
algorithm attempts to allocate tasks to their favourite core to
optimise the individual task energy consumption of the system.
However, not all tasks may be allocated to their respective
favourite core type due to the limited capacity on each core.
In such scenario, where more than one tasks are competing
for their favourite core type, we need to rank the tasks among
each other on same core type.
We defined the energy density EDmi = E¯mi /Ti of a task ⌧i
on a core ⇡m. The energy density of a task gives its average
energy consumption per unit time on the respective core type.
This value does not provide any global perspective on how
the power consumption of the system changes when a certain
task is not allocated to its preferred core type. The global
perspective can be achieved through a metric termed as density
difference (DD). The density difference can be determined by
subtracting the energy density of the task on the current core
type from the next higher energy density value of the same task
on another core. It can computed with the following expression
DDmi = min{EDki : k 6= m ^ EDki   EDmi }   EDmi . It
defines how much extra energy it will consume, if the task is
allocated to the next higher energy consumption core instead of
its current preferred core type. To get the ranking of the task on
the given core, we sort all the tasks on this core in descending
order with respect to their DD values. The tasks from the top
of the list are allocated to core. The intuition behind such
mechanism is to reduce the losses by allocating the tasks
with higher energy density difference first. The process can
3
be started from any core type. An allocated task to a core is
not considered for an allocation on any other core where it
consumes more energy than its currently allocated core. The
same procedure is repeated for all cores. In worst-case, the
process is iterated over each core at most ` times.
The pseudo-code of Least Loss Energy Density algorithm
(LLED) is given in Algorithm 1. Initially, we compute the
energy density EDmi of every task on all core types (line
2). Using energy density values, the DD values of all tasks
are estimated on each core and stored in a matrix called
MT (line 3-6, 10). (Note: MT qw value in a matrix MT
corresponds to the DD value of ⌧w on a core type ⇡q). To
obtain the DD value of the task ⌧w on its least preferred core
type ( max
x=1,···,M
EDxw), its energy density value on the least
preferred core type is subtracted from 0 (line 8) to obtain
a negative value. Afterwards, the algorithm iterates through
the processors in any order (for example, we used processors
indices to order them). Starting from the first core type ⇡q , all
tasks on ⇡q have their entries in MT q sorted in descending
order with respect to their MT qw or DD values. Our algorithm
iterates by picking a task from the top of the sorted list and
attempts to allocate it to ⇡q . For instance, ⌧x is the current
task on top of the sorted list with respect to DD values on
core ⇡q . The algorithm attempts to allocate ⌧x to ⇡q . If ⇡q can
accommodate ⌧x (line 17-18), it does not consider ⌧x on other
cores for which this inequality E¯mx   E¯qx holds and removes
its entries of DD values in MT matrix (line 19). In other
words, ⌧x is not considered for allocation on other core types
where it consumes more or equal energy compared to this core
type ⇡q . If the task ⌧x was previously allocated to these higher
energy consuming core types, it is deallocated on such cores
(line 20). Once the allocation for ⌧x is completed on ⇡q , LLED
attempts to allocate the next task in the sorted list. If any of
the task in the order cannot be allocated to ⇡q , the algorithm
moves to the next core type instead of checking the next tasks
in the order. This action is performed to avoid allocation of any
unfavourable task to the current core type, which may have a
chance of allocation in the next iteration. The same procedure
is repeated for the next core type and so on. On completion
of the first iteration, the algorithm starts again from the first
processor type. These iterations are repeated unless all the
tasks are allocated to exactly one core type. In worst-case,
the algorithm has to check each task in each core type for `
times. Lines 13 26 in Algorithm 1 corresponds to these steps.
Therefore, complexity of this algorithm in O(`2 ⇥M). The
working of the algorithm is demonstrated with an example.
a) Example: Assume, we have 4 tasks and 3 core types.
The tasks specifications are given in Figure 1(a). Entries under
each core type specifies Ci, E¯i, EDmi for ⌧i. The DD values
are computed for all tasks and presented in Figure 1(b). As
an example, the DD value of ⌧1 in ⇡1 is computed by an
expression ED21   ED11 . We start from the first core type
⇡1 and sort the tasks in descending order of DD values as
presented in the first column of Figure 1(c). ⌧4 can be allocated
to ⇡1, therefore, its entry that consumes more energy compared
Algorithm 1 First Phase: Least Loss Energy Density (LLED)
1: Um = 0 for each core ⇡m
2: Compute EDmi for each ⌧i on each core
3: for q = 1 to M do {/* For all processor types */}
4: for w = 1 to ` do {/* For all tasks */}
5: if EDqw 6= max
x=1,···,M
EDxw then
6: EDrw = min
x={1,···,M}\q&&EDxw EDqw
EDxw
7: else
8: EDrw = 0
9: end if
10: MT qw = ED
r
w   EDqw
11: end for
12: end for
13: for all Tasks ` do
14: for q = 1 to M do {/* For all processors types */}
15: Sort all tasks having entry in MT q , w.r.t MT qw values in
descending order
16: for all ⌧w 2 ⌧ on core type q in descending order of Mqw
values do
17: if Uq + Uqw  1 then
18: Assign ⌧w to ⇡q
19: 8x2[1,···,M ]\q remove MT xw iff(E¯xw   E¯qw)
20: 8x2[1,···,M ]\q Ux  = Uxw iff(E¯xw   E¯qw&&⌧w is
assigned)
21: else
22: Break;
23: end if
24: end for
25: end for
26: end for
to this core type is deleted in ⇡3 type. ⌧2 cannot be allocated,
therefore we move to ⇡2 and sort the task-set according. In
core type ⇡2, ⌧1 and ⌧4 can be allocated. ⌧1’s entry in ⇡3 and
⌧4’s entries on ⇡1&⇡3 will be deleted due to higher energy
consumption. Similarly, after appropriate sorting of tasks with
respect to their DD values on ⇡3, ⌧2 and ⌧3 can be allocated
to ⇡3. Therefore, ⌧2’s entry in ⇡2 and ⌧3’s entry in ⇡2,⇡1
are deleted. This completes our first iteration and status of the
tasks after first iteration are shown in Figure 1(c). Similarly, we
perform the second iteration. On ⇡1, the ⌧4’s entry is deleted,
so it is not considered for allocation and system attempts to
allocated next task in the order (⌧2). The rest of the process is
similar to the first iteration. The end result of 2nd iteration is
shown in Figure 1(d). We do not need any further iterations as
all the tasks are assigned. The worst-case number of iterations
is equal to a task-set size.
2) MaxMin Algorithm (MM): Another simple heuristic
MaxMin labelled as MM can be used to assign tasks in
M-type heterogeneous platform to reduce the active power
consumption is given in Algorithm 2. Assume, EDmini is
the energy density of task ⌧i on its most favourite core type,
while EDmaxi corresponds to its energy density on the least
preferred core type. This heuristic for each task computes the
difference of EDmaxi and EDmini , i.e. EDmaxi  EDmini . All
tasks are globally sorted in descending order with respect to
this difference (line 5). The MM algorithm picks a task from
the top of the list and assigns to its favourite core type. If
4
Fig. 1: First Phase Mapping of Least Loss Energy Density Algorithm
(a) Ci/E¯i/EDmi Values
⇡1 ⇡2 ⇡3 Ti
⌧1 4.5/16.5/1.65 3/17.2/1.72 7/52.5/5.25 10
⌧2 8/37.65/2.51 10/65.1/4.34 8/57/3.80 15
⌧3 18/84/2.80 12/78.9/2.63 10/75.9/2.53 30
⌧4 60/259.2/2.16 35/210/1.75 80/649.2/5.41 120
(b) Density Difference (DD) in MT
⇡1 ⇡2 ⇡3
⌧1 0.07 3.53  5.25
⌧2 1.29  4.34 0.54
⌧3  2.8 0.17 0.10
⌧4 3.25 0.41  5.41
(c) 1st Iteration
⇡1 ⇡2 ⇡3
⌧4 ⌧1 ⌧2
⌧2 ⌧4 ⌧3
⌧1 ⌧3 ⌧1
⌧3 ⌧2 ⌧4
(d) 2nd Iteration
⇡1 ⇡2 ⇡3
⌧4 ⌧1 ⌧2
⌧2 ⌧4 ⌧3
⌧1 ⌧3 ⌧1
⌧3 ⌧2 ⌧4
Algorithm 2 Alternative First Phase: MaxMin (MM)
1: Um = 0 for each core ⇡m
2: Compute EDmi for each ⌧i on each core
3: 8⌧i : Find EDmaxi = max
x=1,···,M
EDxw
4: 8⌧i : Find EDmini = min
x=1,···,M
EDxw
5: Sort task-set with respect to
 
EDmaxi   EDmini
 
in descending
order
6: for all Tasks i = 1 to ` do
7: Sort cores with respect to the energy consumption of ⌧i in
ascending order
8: for all Processors j = 1 to M do
9: if U j + U ji  1 then
10: Assign ⌧i to ⇡j
11: U j+ = U ji
12: Break
13: end if
14: end for
15: end for
the favourite core cannot accommodate this task, an allocation
attempt is made for next core type in its ascending order of
energy consumption (line 8-14). If the task is assigned to a
core type, the utilisation of the corresponding core type is
incremented accordingly. The MaxMin algorithm is simple and
has a complexity of O(`⇥M).
B. Second Level of Optimisations
While, the first phase of allocations is derived with an objec-
tive to optimise an individual task’s active energy consumption
in the system, it ignores its effect on the mechanism to reduce
the static power consumption. For instance, a core may have
less active energy consumption but some small group of tasks
allocated to it may prevent it from using a more efficient
deeper sleep state in the idle intervals of the schedule to reduce
the static power consumption of the system. In this second
phase of optimisation, our algorithm analyses the properties
of the allocated tasks to a core in this broader context and
considers its effect on the core’s ability to use more efficient
sleep states by trading off higher active energy consumption
of a task for energy savings in sleep states. (Note: we use a
single core per type to simplify notations.)
As mentioned previously, we assume ERTH per core. The
ERTH scheduler is based on a race-to-halt strategy and reduces
static power consumption with a shut-down mechanism. It
determines the maximum time interval offline for which the
processor may be enforced in a sleep state without causing any
tasks to miss their deadlines under worst-case assumptions.
This maximum time interval of a sleep state is termed as
maximum-feasible-sleep-threshold thm and it can be deter-
mined using the demand bound function (DBF) [2]. Assuming
synchronous release of all tasks allocated to a core ⇡m,
thm = min
8L2L⇤
(L   dbf(L)), where L is an absolute deadline
and L⇤ is the first idle time in the schedule. In other words,
thm is the minimum distance between the supply and the
request bound functions in the first busy interval, assuming
the synchronous release of all tasks allocated to ⇡m. ERTH
initiates a sleep transition online when system is idle or has
sufficient slack. The length of thm defines which sleep state
can be used online on core ⇡m. For instance, if a sleep state
Smn has a break-even-time BETmn > thm, it is not beneficial
(energy-wise) to use such sleep state, for the obvious reason
that a sleep state saves energy when used for greater than
BETmn time interval.
ERTH selects a sleep state for a core ⇡m to be used online
based on the value of thm. However, the properties of tasks
involved in the computation of thm have a high impact on
its value. For example, tasks with shorter difference between
their Ti and Ci give a small value of thm and restrict usage of
those sleep states with BETmn > Ti Ci. The intuition behind
the second phase is to collate tasks on a core with similar
properties such that it can use a more efficient sleep state. As
we are using a heterogeneous platform, each core has sleep
states with different characteristics. A task(s) restricting a more
efficient sleep state on one core may not effect the sleep state
on the other core and hence can be considered for migration.
However, algorithm must ensure that such migration reduces
the overall average energy consumption of a system.
We propose the heuristics given in Algorithm 3 to do such
a trade-off. Tasks assigned in the first phase are sorted in
each core with respect to their difference between Ti and Ci
in descending order. Consider one of the core type ⇡m and
assume `m are the number of tasks allocated to it in the first
phase (through LLED or MM). The second phase initially
computes the maximum time interval of the sleep duration
also known as a maximum-feasible-sleep-threshold with just
one task picked from the top the sorted list (w.r.t Ti   Ci)
of tasks allocated to ⇡m. This value is denoted as thm1 and
computed through DBF. As there is just one task, therefore,
thm1 = Ti   Ci. Now we superimpose the next task on
the current DBF and new maximum-feasible-sleep-threshold
thm2 = min8L2L⇤
(L dbf(L)) is computed. Similarly, a third task
is superimposed and correspondingly thm3 is computed. This
5
Fig. 2: Demand Bound Function
(a) Individual Demand
2
2
4
4
6
6
8
8
10
10
12
12
⌧1(1,4)
⌧2(0.75,3)
⌧3(0.5,2)
(b) Superimposed
2
2
4
4
6
6
8
8
10
10
12
12
⌧1
⌧1 + ⌧2
⌧1 + ⌧2 + ⌧3
process is repeated for all sorted tasks allocated to ⇡m and
at the end we have a set of maximum-feasible-sleep-threshold
values called ⇢m = {thm1 , thm2 , thm3 , · · · , thm`m}. As the tasks
are superimposed in the descending order of Ti Ci, therefore,
one of the property of ⇢m is that thm1   thm2   thm3   · · ·  
thm`m . Moreover, th
m = thm`m . To illustrate the computation
of ⇢m set, lets consider an example. Assume, we have three
task ⌧(Ci, Ti) ) ⌧1(1, 4), ⌧2(0.75, 3), ⌧3(0.5, 2) sorted in the
descending order of Ti   Ci and allocated to ⇡m. Individual
demands of these tasks are shown in Figure 2(a). Firstly, thm1
with ⌧1 is computed, i.e. 3 units. Then ⌧2 is superimposed on
⌧1 and thm2 is computed, which is equal to 2.25. Finally, ⌧3 is
superimposed on the demand of ⌧1+ ⌧2 and thm3 is estimated
to be 1.25. These steps are demonstrated in Figure 2(b). This
example has ⇢m = {3, 2.25, 1.25}.
The number of elements in ⇢m is equal to `m. Each element
in ⇢m gives the maximum sleep interval with the correspond-
ing number of tasks. A core ⇡m can use this sleep interval to
initiate a sleep state, if the tasks used to compute such interval
are allocated to it. We determine the most efficient sleep state
(among the available set of sleep states in ⇡m) for all element
of ⇢m using the following expression {8x 2 ⇢m, find Smn : Smn
minimises (x⇤Pma +Esmn )}. We know, ⇢m holds the property
that thm1   thm2   thm3   · · ·   thm`m . Therefore, th2 cannot
get a better sleep state when compared to thm1 and so on.
After computing the sleep states for each ⇢m element
we group the tasks that allows the same sleep state. We
define a set Gmn that holds the tasks for a sleep state Smn .
Starting from thm1 , ⌧1 is added to a set of its computed sleep
state. Similarly, ⌧2 is added to the set corresponding to a
sleep state determined for thm2 and so on. Lets demonstrate
this step with an example. Suppose, we have five elements
in ⇢m = {thm1 , thm2 , thm3 , thm4 , thm5 }. Assume, sleep states
corresponding to these ⇢m elements are determined to be
{Sm1 , Sm1 , Sm2 , Sm3 , Sm3 }. Then the tasks are added to the sets
corresponding to the sleep states as follow: Gm1 = {⌧1, ⌧2},
Gm2 = {⌧3} and Gm3 = {⌧4, ⌧5}. We refer to these sets as
groups of tasks corresponding to different sleep states. These
groups of tasks are ordered from the least efficient to the most
efficient sleep states. Thus, if we remove the top most group
of tasks, a core can achieve the next better sleep state. This
complete process is repeated for all cores and finally we have
different groups of tasks on each core corresponding to its
different sleep states. This complete step is given in line 4 of
Algorithm 3.
All cores compete to gain the next more efficient sleep state
to save energy by getting rid of their tasks in the top most
group that enforces the less efficient sleep state. However, the
algorithm will first consider the core which would result in the
most system energy gain. To identify this core, each core will
remove all the tasks associated to the first group (that cause
less efficient sleep state). Let Gmtop corresponds to the tasks
in the top least efficient sleep state group on ⇡m. The energy
saving by removing such group from this core will be equal to
 E¯m as given in Equation 1. Where thmold and th
m
new are the
maximum-feasible-sleep-threshold intervals before and after
removing Gmtop respectively on ⇡m. After computing thmold
and thmnew, their corresponding sleep states are determined.
Suppose, Smn1 and Smn2 are the sleep states selected for thmold
and thmnew respectively. Moreover, E¯old = (thmold   trmn1) ⇤
Pmn1+Es
m
n1 and E¯new = (thmnew trmn2)⇤Pmn2+Esmn2 represent
the energy consumption of a single sleep transition with and
without Gmtop. All Cores are sorted in descending order with
respect to  E¯m as given in line 5 of Algorithm 3, which will
be used to attempt a reallocation of the top most group of the
cores in this determined order.
Assume, {⇡1,⇡2, · · · ,⇡m} represent the cores in descend-
ing order of  E¯m. Initially, we select ⇡1. G1top are the tasks
in the top least efficient sleep state group of ⇡1. The local cost
of migration LCo⌧j of each task ⌧j 2 G1top will be computed
on every other core type ⇡o excluding ⇡1. The expression
to determine the local cost of migration LCo⌧j is given in
Equation 2. This value is computed by finding ⌧j’s energy
density on ⇡o plus the energy consumption per unit of time in
idle period with ⌧j on ⇡o minus the energy consumption per
unit of time in idle period without ⌧j on ⇡o. Where, thonew
and thoold are the maximum-feasible-sleep-thresholds with and
without including ⌧j respectively on ⇡o. The sleep states
corresponding to thonew and thoold are determined. E¯new and
E¯old are the energy consumption of a single sleep transition
with or without ⌧j respectively. The algorithm sort all the core
types in ascending order of LCo⌧j to move ⌧j .
 E¯m =
0@P8⌧i2⇡m E¯iTi +
⇣
1 
P
8⌧i2⇡m
Ui
⌘
E¯old
thm
old
1A 
0@P8⌧i2⇡m\Gmtop E¯iTi +
⇣
1 
P
8⌧i2⇡m\Gmtop
Ui
⌘
E¯new
thmnew
1A (1)
LCo⌧j =
E¯oj
Tj
+
⇣
1 
P
8⌧i2⇡o+⌧j
Ui
⌘
E¯new
thonew
 
⇣
1 
P
8⌧i2⇡o
Ui
⌘
E¯old
tho
old
(2)
TE =
P
8⇡m
8<:⇣P⌧i2⇡m E¯miTi ⌘+
0@⇣1 P⌧i2⇡m Ui⌘E¯m
thm
1A9=; (3)
The algorithm attempts to assign it to a core type with the least
migration cost provided it is schedulable on that core. This
6
Algorithm 3 Second Phase of Task Mapping (SP)
1: repeat
2: Previous Assignment = Current Assignment
3: Energy Old = Energy New
4: Group tasks per core such that next better sleep state can be
achieved
5: Order core by gains when removing group
6: Feasible = TRUE
7: for all Processors M do
8: for all Tasks in a group do
9: Compute the local cost of migration on energy consump-
tion of this task for all other cores
10: Sort other cores by decreasing order of cost
11: for all Cores except the core of the currently assigned
task do
12: if Feasible on core then
13: Assign to a core
14: Success = True
15: Break
16: end if
17: end for
18: if !Success then
19: Feasible = FALSE
20: Break
21: end if
22: end for
23: if Feasible && Energy New < Energy Old then
24: Break
25: else
26: Undo all Assignments
27: end if
28: end for
29: until Previous Assignment == Current Assignment
process is repeated 8⌧j 2 G1top. Line 8 to 22 in Algorithm 3
corresponds to this step. In case any of the tasks ⌧j 2 G1top is
not schedulable, all the assignments are undone and we move
to the next core type. On the other side, if the assignments of
G1top are successful, we compute the new expected total energy
TE consumption of the system with Equation 3 and compare
it with the previous expected total energy consumption. If it
is less than previous expected TE consumption, we iterate
over the algorithm again unless the energy consumption of
the previous iteration is greater than this iteration.
The maximum number of groups (of tasks) in each pro-
cessor is equal to its number of sleep states and we migrate
the complete group to another core. While the theoretical
complexity of the second phase is exponential as we need to
find the combinations of all tasks in a group with all other
groups on other cores, the actual computation time in our
experiments is very fast. The execution time and the number
of migrations are discussed in Section V-B(2).
V. EVALUATION
A. Experimental Setup
In order to evaluate the effectiveness of our algorithms, we
have extended the SPARTS (Simulator for Power Aware and
Real-Time Systems) [16] and implemented our algorithms for
the experiments. SPARTS is used with the parameters defined
in Table I. The underlined values are the default values if
TABLE I: Overview of Simulator Parameters
Parameters Specifications
Task-set sizes |T| {100, 200, 500}
Inter-arrival time Ti for RT tasks [30ms, 50ms]
Inter-arrival time Ti for BE tasks [50ms, 200ms]
Sporadic delay limit ⌥ {10%}
Best-Case execution-time limit Cb {10%}
Share of RT/BE tasks ⇠ h30%, 70%i
Characteristic Factor   {10%, 20%, 40%}
TABLE II: Device Power Model Parameters
⇡m Pma P
m
i ⌘
m Sm0 S
m
1 S
m
2 S
m
3
⇡0 1.0 0.39 1.0 0.31 0.21 0.12 0.05
⇡1 2.2 0.86 0.5 0.67 0.47 0.27 0.11
⇡2 6.0 2.33 0.2 1.83 1.29 0.74 0.30
⇡3 13.0 5.05 0.1 3.98 2.79 1.61 0.64
⇡4 12.1 4.7 0.15 3.70 2.60 1.50 0.6
not specified in the description of an individual experiment.
Heterogeneous multicore platforms are used for a wide variety
of complex applications, therefore, the task-set size is varied
from small number of coarse grained 100 tasks to fine grained
large tasks-set sizes of 500 tasks. The share distributions ⇠
divide the task-set size and overall effective system utilisation
between RT and BE tasks. Moreover, the utilisation allocated
to each task type is randomly distributed among the tasks of
the same class. The minimum inter-arrival time of RT and BE
tasks is randomly chosen within a range of [30ms; 50ms] and
[50ms; 200ms] respectively. SPARTS selects one of the core
type and reference it as a default core type ⇡D. The task-set
is initially generated for ⇡D type. The WCET time CDi of ⌧i
is deemed to be UDi ⇥ Ti, where UDi is the utilisation of ⌧i
on the default core type ⇡D.
The average system capacity Ua of the given platform is
computed through the average speed-up-factor ⌘m. The speed-
up-factor defines a ratio of the clock cycle of a core ⇡m with
reference to ⇡D. Suppose speed-up-factor of a core type ⇡m
is ⌘m, then the average capacity of the system will be Ua =
z1/⌘1+z2/⌘2+· · ·+zm/⌘m. However the effective utilisation
U of the task-set in the experiments is controlled through a
helper variable ⇣, and U = Ua ⇥ ⇣. The range of ⇣ is (0; 1].
In our experiments ⇣ is varied from 0.5 to 0.9 with a step size
of 0.05. Individual utilisation of ⌧i on each ⇡m is a random
number within a range of Ui = [(1  ); (1+ )]⇥ ⌘m⇥UDi ,
where   is a characteristic factor that models the fact that
different tasks will respond differently in terms of execution
time when moved from one core to another.
Beyond those initial settings, a two level approach is used
for generating a wide variety of different tasks and their
subsequently varying jobs on all cores. Tasks are further
annotated with a limit on the sporadic delay  si in the interval
[0,⌥ ⇤ Ti] and on the best-case execution time Cbi in the
interval [Cb ⇤Cmi , Cmi ]. The second level varies the behaviour
of individual jobs of the same task. The interested reader is
referred to [16] for details. Each set point of parameters is
7
9 9.9 10.8 11.7 12.6 13.5 14.4 15.3 16.2
0.9
0.905
0.91
0.915
0.92
0.925
System Utilization
No
rm
ali
se
d 
To
ta
l E
ne
rg
y
 
 
LLED−SP
MM−SP
Fig. 3: 4 Core Types (S1)
9 9.9 10.8 11.7 12.6 13.5 14.4 15.3 16.2
0.8
0.85
0.9
0.95
System Utilization
No
rm
ali
se
d 
To
ta
l E
ne
rg
y
 
 
LLED−SP,   β = 10%
MM−SP, β = 10%
LLED−SP,   β = 20%
MM−SP, β = 20%
LLED−SP,   β = 40%
MM−SP, β = 40%
Fig. 4: Variation in   (S1)
evaluated with 100 different task sets.
The hardware parameters of heterogeneous platform used
in our experiments are shown in Table II. The power model
for the default core in our experiments is modelled after
the FreeScale PowerQUICC III Integrated Communication
Processor MPC8536 [17]. The FreeScalePowerQUICC III
core specifications are given in Table II under m = 4. The
values of the other core types are derived from this core type to
generate a heterogeneous platform. We assume each core type
has four sleep states, with {Smx : x 2 0, 1, 2, 3} representing
different sleep states such as Doze, Nap, Sleep and Deep Sleep
respectively. We have assumed their transition overheads and
estimated break-even-time accordingly. We assume a single
unit of each core type. The average system capacity Ua =
1
1 +
1
0.5 +
1
0.2 +
1
0.1 = 18. As we are changing ⇣ in an interval
of [0.5; 0.9], therefore, the effective utilisation of the system
U is within a range of [0.5; 0.9] ⇤ 18 = [9; 16.2]. The energy
consumption of a task is, however, not a mere function of its
execution time. As such the values of Emi are computed using
the average execution time C¯mi and a random value similar to
the utilisation conversion Emi = [1   ; 1 +  ]⇥ Pma ⇥ C¯mi .
B. Results
The parameters described previously remain the same, ex-
cept where explicitly specified. In the state-of-the-art there
is no such algorithm proposed that has a power model
such that this work could be compared with it. Moreover,
fundamental assumptions made in the state-of-the-art restrict
their extension to the more realistic power model proposed
in this paper. Therefore, we have implemented a worst-fit
decreasing (WFD) and first-fit (FF ) algorithm as a base
line to compared against our algorithms. It has been shown
by Aydin and Yang [18] that WFD performs better when
compared to other conventional bin packing algorithms for
homogeneous platforms. In our experiments, we observed that
WFD performs worst in heterogeneous platforms. It was able
to schedule few tasks-set at higher utilisations making it hard
to compare against our algorithms. Therefore, initially, we
compare our approached against the FF algorithm. Later on,
the experiments of WFD are also presented and compared
against the FF algorithm. Moreover, the FF algorithm allo-
cates the tasks sorted with respect to their Di or Ti following
the order from the slowest core type to the fastest core type.
The results under labels LLED-SP and MM -SP represent
the second phase applied on the allocation of LLED and
MM respectively. We have created 2 different scenarios. In the
first scenario, we have modelled the system with very efficient
sleep states having low transition overhead (time and energy).
The second scenario models the system, with substantially
less efficient sleep states. All results are normalised to the
corresponding values of the FF algorithm.
1) First Scenario: In this scenario, as the overhead of the
sleep state is low, therefore, different cores can still achieve
the most efficient sleep state even at high utilisation. This
scenario does not leave much room for the second phase
to save any additional energy when compared to LLED.
Nevertheless, MM -SP saves in some cases energy but it is
fairly minimal. Therefore, for this scenario, we compare the
energy consumption of MM -SP and LLED-SP .
Firstly, we analyse the performance of LLED-SP and
MM -SP for different number of core types. Figure 3 shows
the normalised energy consumption of the system for only 4
core types. The figure for 2 cores looks similar to Figure 3
but does not provide as high energy gains over FF due to
the limited scope for optimisation. Similarly, in the second
case of 4 core types, initially, the difference of LLED-SP
and MM -SP increases but then starts to shrink towards the
higher utilisations. The reason for this behaviour is quite
obvious that LLED-SP and MM -SP has more chance at
low utilisation to allocate task to their favourite core. However,
towards, high utilisation, this flexibility decreases along with
their difference. In best-case, LLED-SP consumes 10% less
energy consumption when compared to FF , while MM -SP
saves energy slightly under 10%.
We evaluate the effect of variation in the characteristic factor
  on the normalised total energy consumption of the system.
8
9 9.9 10.8 11.7 12.6 13.5 14.4 15.3 16.2
0.895
0.9
0.905
0.91
0.915
0.92
0.925
0.93
System Utilization
No
rm
ali
se
d 
To
ta
l E
ne
rg
y
 
 
LLED−SP,  |τ| = 100
MM−SP, |τ| = 100
LLED−SP,  |τ| = 200
MM−SP, |τ| = 200
LLED−SP,  |τ| = 500
MM−SP, |τ| = 500
Fig. 5: Variation in Task-set Size (S1)
4.52 4.93 5.34 5.76 6.17 6.58 6.990.55
0.6
0.65
0.7
0.75
0.8
0.85
System Utilization
No
rm
ali
se
d 
To
ta
l E
ne
rg
y
 
 
LLED−SP, Asimilar
MM−SP, Asimilar
Fig. 6: Asimilar Platform (S1)
9 9.9 10.8 11.7 12.6 13.5 14.4 15.3 16.2
1.01
1.02
1.03
1.04
1.05
1.06
1.07
System Utilization
No
rm
ali
se
d 
To
ta
l E
ne
rg
y
 
 
WFD
Fig. 7: 4 Core Types (S1, WFD)
9 9.9 10.8 11.7 12.6 13.5 14.4 15.3 16.2
1.01
1.02
1.03
1.04
1.05
1.06
1.07
System Utilization
No
rm
ali
se
d 
To
ta
l E
ne
rg
y
 
 
WFD, β = 10%
WFD, β = 20%
WFD, β = 40%
Fig. 8: Variation in   (S1, WFD)
9 9.9 10.8 11.7 12.6 13.5 14.4
1.01
1.02
1.03
1.04
1.05
1.06
1.07
System Utilization
No
rm
ali
se
d 
To
ta
l E
ne
rg
y
 
 
WFD, |τ| = 100
WFD, |τ| = 200
WFD, |τ| = 500
Fig. 9: Variation in Task-set Size (S1, WFD)
4.52 4.93 5.34 5.76 6.17 6.58 6.99
0.92
0.93
0.94
0.95
0.96
0.97
0.98
System Utilization
No
rm
ali
se
d 
To
ta
l E
ne
rg
y
 
 
WFD, Asimilar
Fig. 10: Asimilar Platform (S1, WFD)
  controls the variation of task dynamic power consumption
from the average dynamic power consumption of the core.
Figure 4 demonstrates that the energy consumption of both
approaches decreases with an increase in the range of  . The
developed power model on average favours the slow core.
However, this factor ( ) can change this behaviour. With
9
  = 10%, small portion of tasks are more favourable to the
fast cores. Hence, the FF algorithm that fills the slowest core
first does a few task allocation to their unfavourable cores.
Consequently, the gains of LLED-SP and MM -SP are less
at   = 10%. However, as the   range increases, the tasks
probability to favour a fast core becomes higher. Therefore,
LLED-SP and MM -SP give better allocations for higher
values of  . Similar to the previous observation, the difference
of MM -SP and LLED-SP is higher at low utilisation and
decreases with an increase in the system utilisation.
Figure 5 demonstrates the effect of task-set size variation
on the given allocations mechanism. In general a large task-
set size increases the probability of the tasks to be allo-
cated to their unfavourable core with FF . Therefore, energy
consumption of the LLED-SP and MM -SP algorithms
decreases with an increase in the task-set size. However, this
saving reduces with an increase in the effective utilisation.
In the beginning LLED-SP with the different task-set sizes
do the same allocations but with an increase in effective
system utilisation, the difference in allocation also increases.
The same observations hold for the MM -SP as well. For
small task-set size of 100, FF also performs well at low
utilisation. However, this effect deteriorates with an increase
in the effective system utilisation.
Processors types given in Table II have approximately simi-
lar ratio of P xa /P ya ⇡ ⇣y/⇣x. We have generated a case where
this ratio is not the same and tasks always favour the same core
i.e. P xa /P ya 6= ⇣y/⇣x. This case allows us to evaluate a system,
where all the tasks are competing for the best core types.
For this experiment, we modified the heterogeneous platform
given in Table II and generated an asimilar heterogeneous
platform by changing the ⌘m values from 1, 0.5, 0.2, 0.1 to
1, 0.6, 0.45, 0.3. The average capacity of the asimilar platform
is Ua = 11 +
1
0.6 +
1
0.45 +
1
0.3 = 8.22. The effective utilisation
U is varied within a range of [0.5; 0.9] ⇤ 8.22 = [4.11; 7.4].
Figure 6 presents the results for the asimilar platform. The
energy consumption of LLED-SP and MM -SP is low at
low utilisation and gradually increases gradually towards high
utilisation. All the algorithms, attempt to allocate tasks in order
from the slowest core to the fastest core. LLED-SP can rank
tasks in an efficient way and saves more energy. Similarly,
MM -SP also performs better when compared to FF as it
also does some ranking of the tasks but FF does not prioritise
the tasks to account for global energy benefits.
a) Comparison With Worst Fit Decreasing (WFD):
We have simulated the WFD algorithm and compared it
against the FF algorithm. All values are normalised to the
corresponding results of FF . Figure 7 demonstrate the energy
consumption of WFD for four different cores types. It is
evident that WFD performs worse when compared to FF
algorithm. Its performance slightly increases with an increase
in the system utilisation and the difference of energy consump-
tion with FF decreases. WFD follows the similar trend for
different value of   as shown in Figure 8. One interesting
observation is that change of   has very similar effect on both
WFD and FF decreasing algorithms. As all the values of
WFD are normalised to the corresponding values of FF ,
therefore, the results of different   values are similar.
We have also compared the effect of different task-set sizes
on the WDF algorithm as shown in Figure 9. The perfor-
mance of the WFD algorithm increases with an increase in
the task-set size. Moreover, the utilisation of the system is
only varied upto 14.4 in this experiment because the WFD
algorithm was not able to schedule most of the task-sets on
higher utilisations. For the asimilar platform, Figure 10 shows
WFD performs better than FF . However, its performance is
a way more worse when compared to our algorithms.
2) Scenario 2: In this scenario, we have modelled a system,
in which the core types have large overheads of sleep transi-
tions (time/energy). To generate such model, we have scaled
the transition delays of all the sleeps states by a factor of
12 and determined their BET accordingly. We have observed
a very interesting result, which shows, it is not necessary
that tasks assigned to their favourite core will always reduce
the overall system energy consumption of the system. In
this scenario, the overall energy consumption depends mostly
on the characteristics of the core and it depends less on
those of the tasks. This fact will be evident in the following
experiments, in which we are comparing the LLED, MM ,
LLED-SP and MM -SP algorithms. The base line is still
the corresponding energy consumption of FF . Furthermore,
we have increased the range of ⇣ to [0.4; 0.9] with a step size
of 0.05 for this scenario.
Figure 11 shows the normalised total energy consumption
of system for 4 core types. At low utilisation, though LLED
and MM had a chance to allocate tasks to their favourite
core but globally it is not energy efficient. The reasons is
that these algorithms are not accounting the effect of their
allocations on the core sleep states. The FF algorithm which
is also sleep state agnostic allocation mechanism surprisingly
performs well compared to LLED and MM . It allocates
the core from the slowest one and allows fast core to have
empty space to use their efficient sleep state. However, our
LLED-SP and MM -SP algorithm compares well to FF
at low utilisations and compensates for the wrong allocations
done by LLED and MM respectively. It is interesting to see
that for low utilisation LLED-SP and MM -SP achieves
substantial gain. For high utilisations, LLED and MM en-
ergy consumption reduces when compared to FF . Hence, a
combination of initial first phase allocation (LLED or MM )
with the second phase is a good choice for most of the system
utilisations, except for some corner cases (at a utilisation of 9.9
in Figure 11). In the detailed analysis of utilisations between
7.2 and 9, we have observed that FF loses the efficient
sleep states earlier than LLED-SP or MM -SP . Hence, the
energy consumption of LLED-SP and MM -SP is dropped
at U = 8.1 when compared to FF It is also evident from
Figure 11 that the performance of LLED is always dominant
overMM , and similarly, the performance of LLED-SP over
MM -SP .
The variation in the characteristics factor   is demonstrated
in Figure 12 and Figure 13. Similar to the results in scenario
10
7.2 8.1 9 9.9 10.8 11.7 12.6 13.5 14.4 15.3 16.20.8
0.9
1
1.1
1.2
1.3
System Utilization
No
rm
ali
se
d 
To
ta
l E
ne
rg
y
 
 
LLED
LLED−SP
MM
MM−SP
Fig. 11: 4 Core Types (S2)
7.2 8.1 9 9.9 10.8 11.7 12.6 13.5 14.4 15.3 16.2
0.75
0.8
0.85
0.9
0.95
1
1.05
1.1
System Utilization
No
rm
ali
se
d 
To
ta
l E
ne
rg
y
 
 
LLED−SP,   β = 10%
MM−SP, β = 10%
LLED−SP,   β = 20%
MM−SP, β = 20%
LLED−SP,   β = 40%
MM−SP, β = 40%
Fig. 12: Variation in   (S2)
7.2 8.1 9 9.9 10.8 11.7 12.6 13.5 14.4 15.3 16.2
0.9
1
1.1
1.2
1.3
1.4
System Utilization
No
rm
ali
se
d 
To
ta
l E
ne
rg
y
 
 
LLED,   β = 10%
MM,   β = 10%
LLED,   β = 20%
MM,   β = 20%
SP,   β = 40%
MM,   β = 40%
Fig. 13: Variation in   (S2)
7.2 8.1 9 9.9 10.8 11.7 12.6 13.5 14.4 15.3 16.2
0.8
0.85
0.9
0.95
1
System Utilization
No
rm
ali
se
d 
To
ta
l E
ne
rg
y
 
 
LLED−SP,  |τ| = 100
MM−SP, |τ| = 100
LLED−SP,  |τ| = 200
MM−SP, |τ| = 200
LLED−SP,  |τ| = 500
MM−SP, |τ| = 500
Fig. 14: Variation in Task-set Size (S2)
7.2 8.1 9 9.9 10.8 11.7 12.6 13.5 14.4 15.3 16.2
0.95
1
1.05
1.1
1.15
1.2
1.25
1.3
1.35
System Utilization
No
rm
ali
se
d 
To
ta
l E
ne
rg
y
 
 
LLED,  |τ| = 100
MM, |τ| = 100
LLED,  |τ| = 200
MM, |τ| = 200
LLED,  |τ| = 500
MM, |τ| = 500
Fig. 15: Variation in Task-set Size (S2)
3.29 3.7 4.11 4.52 4.93 5.34 5.76 6.17 6.58 6.990.55
0.6
0.65
0.7
0.75
0.8
0.85
0.9
System Utilization
No
rm
ali
se
d 
To
ta
l E
ne
rg
y
 
 
LLED, Asimilar
MM, Asimilar
LLED−SP, Asimilar
MM−SP, Asimilar
Fig. 16: Asimilar (S2)
1 (Figure 4), the performance of LLED-SP and MM -SP
given in Figure 12 increases with an increase in the value of
  and the similar trend is followed by LLED and MM in
Figure 13. Another observation which is clear from Figure 12
is that LLED-SP always dominates MM -SP and the same
is true in Figure 13 for LLED and MM .
11
7.2 8.1 9 9.9 10.8 11.7 12.6 13.5 14.4 15.3 16.2
50
100
150
200
250
System Utilization
No
 o
f T
as
ks
 S
hif
ts
 
 
|τ| = 100
|τ| = 200
|τ| = 500
Fig. 17: Decisions (S2)
7.2 8.1 9 9.9 10.8 11.7 12.6 13.5 14.4 15.3 16.2
4
6
8
10
12
14
16
18
20
System Utilization
Ex
ec
ut
ion
 T
im
e 
(m
se
c)
 
 
|τ| = 100
|τ| = 200
|τ| = 500
Fig. 18: Time Calculation (S2)
7.2 8.1 9 9.9 10.8 11.7 12.6 13.5 14.4 15.3 16.2
1.05
1.1
1.15
1.2
1.25
1.3
1.35
1.4
1.45
1.5
System Utilization
No
rm
ali
se
d 
To
ta
l E
ne
rg
y
 
 
WFD
Fig. 19: 4 Core Types (S2, WFD)
7.2 8.1 9 9.9 10.8 11.7 12.6 13.5 14.4 15.3 16.2
1.05
1.1
1.15
1.2
1.25
1.3
1.35
1.4
1.45
1.5
System Utilization
No
rm
ali
se
d 
To
ta
l E
ne
rg
y
 
 
WFD,   β = 10%
WFD,   β = 20%
WFD,   β = 40%
Fig. 20: Variation in   (S2, WFD)
7.2 8.1 9 9.9 10.8 11.7 12.6 13.5 14.4
1.05
1.1
1.15
1.2
1.25
1.3
1.35
1.4
1.45
1.5
System Utilization
No
rm
ali
se
d 
To
ta
l E
ne
rg
y
 
 
WFD, |τ| = 100
WFD, |τ| = 200
WFD, |τ| = 500
Fig. 21: Variation in Task-set Size (S2, WFD)
3.29 3.7 4.11 4.52 4.93 5.34 5.76 6.17 6.58 6.99
1
1.05
1.1
1.15
1.2
System Utilization
No
rm
ali
se
d 
To
ta
l E
ne
rg
y
 
 
WFD, Asimilar
Fig. 22: Asimilar Platform (S2, WFD)
The effect of variation in the task-set size is presented in
Figure 14 and Figure 15. Unlike to Figure 5, in this scenario
the task-set size does not make any difference on the perfor-
mance of all the algorithms. To evaluate the platform, where all
the tasks prefer similar core type, the same setup of Figure 6 is
adopted. The results of this experiment are shown in Figure 16.
12
All the algorithms follow the same race to allocate tasks to
the slowest core. Furthermore, LLED performance dominated
over MM , and towards high utilisation, it even consumes less
energy compared to MM -SP . Overall, LLED-SP performs
better for all utilisations.
Figure 17 and Figure 18 present the execution time and
the number of tasks migrations between different core types
of the second phase of allocation respectively. Its allocation
process is very fast even for a large task-set size of 500.
Figure 17 shows that the number of migrations (also execution
time) decrease with an increase in effective utilisation as the
tasks have less freedom to manoeuvre due to high utilisations.
Less loaded systems (U = 7.2) allow cores to use their more
efficient sleep anyway. Therefore, U = 7.2 has fewer number
of migrations (executions time) when compared to U = 8.1.
a) Comparison With Worst Fit Decreasing (WFD):
Similar to the previous scenario, we have also compared
WFD with FF in the second scenario. All the values of the
WFD algorithm are normalised to the corresponding values
of FF for consistency. For four core types, normalised energy
consumption of the WFD algorithm is shown in Figure 19.
The shape of the curve is similar to the previous scenario
(Figure 7). The difference of energy consumption between
FF and WFD is higher in this scenario when compared to
first scenario. The variation in   and task-set sizes also have
the similar effects when compared to previous scenario. The
normalised energy consumption for different values of   and
different sizes are shown in Figure 20 and Figure 21 respec-
tively. In this scenario, the energy consumption of WFD is
higher for the asimilar platform as shown in Figure 22.
VI. CONCLUSION
Heterogeneous multicore platforms are becoming popu-
lar in industry. This trend demands an advancement in RT
scheduling theory. We have explored the problem of task
assignment with an objective to reduce the average-case energy
consumption of the system, while satisfying RT constraints.
This research effort demonstrates the importance of a realistic
power model and its effect on the overall energy consumption.
In the future, we have an intention to extend this work to allow
the job migration to further reduce the energy consumption of
the system.
REFERENCES
[1] D. C. Snowdon, S. M. Petters, and G. Heiser, “Accurate on-line
prediction of processor and memory energy usage under voltage scaling,”
in 7th EMSOFT, Salzburg, Austria, Oct 2007, pp. 84–93.
[2] M. A. Awan and S. M. Petters, “Enhanced race-to-halt: A leakage-aware
energy management approach for dynamic priority systems,” in 23rd
ECRTS, 2011, pp. 92–101.
[3] A. Kandhalu, J. Kim, K. Lakshmanan, and R. Rajkumar, “Energy-aware
partitioned fixed-priority scheduling for chip multi-processors,” in 17th
RTCSA, vol. 1, aug. 2011, pp. 93 –102.
[4] J.-J. Chen and C.-F. Kuo, “Energy-efficient scheduling for real-time
systems on dynamic voltage scaling (dvs) platforms,” in 13th RTCSA,
aug. 2007, pp. 28 –38.
[5] Y. Yu and V. Prasanna, “Power-aware resource allocation for independent
tasks in heterogeneous real-time systems,” in Parallel and Distributed
Systems, 2002, pp. 341 – 348.
[6] T.-Y. Huang, Y.-C. Tsai, and E.-H. Chu, “A near-optimal solution for
the heterogeneous multi-processor single-level voltage setup problem,”
in IPDPS, 2007, pp. 1–10.
[7] J. Luo and N. K. Jha, “Static and dynamic variable voltage scheduling
algorithms for real-time heterogeneous distributed embedded systems,”
in ASP-DAC, 2002.
[8] J.-J. Chen and L. Thiele, “Energy-efficient task partition for periodic
real-time tasks on platforms with dual processing elements,” in 14th
ICPADS, dec. 2008, pp. 161 –168.
[9] H.-R. Hsu, J.-J. Chen, and T.-W. Kuo, “Multiprocessor synthesis for
periodic hard real-time tasks under a given energy constraint,” in 43rd
DATE, 2006.
[10] C.-M. Hung, J.-J. Chen, and T.-W. Kuo, “Energy-efficient real-time task
scheduling for a dvs system with a non-dvs processing element,” in 27th
RTSS, 2006.
[11] C.-Y. Yang, J.-J. Chen, T.-W. Kuo, and L. Thiele, “An approximation
scheme for energy-efficient scheduling of real-time tasks in heteroge-
neous multiprocessor systems,” in 46th DATE, 2009, pp. 694–699.
[12] J.-J. Chen, A. Schranzhofer, and L. Thiele, “Energy minimization for
periodic real-time tasks on heterogeneous processing units,” in IPDPS,
2009, pp. 1 –12.
[13] G. B. Dantzig and M. N. Thapa, “Linear programming: 1: Introduction,”
in Springer Verlag, 1997.
[14] S. A. Brandt, S. Banachowski, C. Lin, and T. Bisson, “Dynamic
integrated scheduling of hard real-time, soft real-time and non-real-time
processes,” in 24th RTSS, Cancun, Mexico, Dec 2003, p. 396.
[15] D. C. Snowdon, E. Le Sueur, S. M. Petters, and G. Heiser, “Koala:
A platform for OS-level power management,” in 4th EuroSys Conf.,
Nuremberg, Germany, Apr 2009.
[16] B. Nikolic, M. A. Awan, and S. M. Petters, “SPARTS: Simulator for
power aware and real-time systems,” in 8th IEEE Int. Conf. Emb. Softw.
& Syst. Changsha, China: IEEE, Nov 2011, pp. 999–1004.
[17] MPC8536E PowerQUICC III Integrated Processor Hardware
Specifications, FreeScale Semiconductor, document Number:
MPC8536EEC, Rev 3, Nov. 2010. [Online]. Available: http://cache.
freescale.com/files/32bit/doc/data sheet/MPC8536EEC.pdf?fpsp=1
[18] H. Aydin and Q. Yang, “Energy-aware partitioning for multiprocessor
real-time systems,” in IPDPS 2003, april 2003.
13
