Energy Minimization for Parallel Real-Time Systems with Malleable Jobs
  and Homogeneous Frequencies by Fisher, Nathan et al.
Nathan Fisher
Wayne State University
fishern@wayne.edu
Joël Goossens
Université Libre de Bruxelles
joel.goossens@ulb.ac.be
Pradeep M. Hettiarachchi
Wayne State University
pradeepmh@wayne.edu
Antonio Paolillo
Université Libre de Bruxelles
antonio.paolillo@ulb.ac.be
Abstract—In this work, we investigate the potential utility of
parallelization for meeting real-time constraints and minimizing
energy. We consider malleable Gang scheduling of implicit-deadline
sporadic tasks upon multiprocessors. We first show that
dynamic voltage/frequency scaling is not necessary for optimality
in our scheduling problem. We adapt the canonical schedule for
DVFS multiprocessor platforms and propose a polynomial-time
optimal processor/frequency-selection algorithm. We evaluate the
performance of our algorithm via simulations using parameters
obtained from a hardware testbed implementation. Our algorithm
achieves up to a 60-watt decrease in power consumption over
the optimal non-parallel approach.
I. INTRODUCTION
Power-aware computing is at the forefront of embedded
systems research due to market demands for increased battery
life in portable devices and decreasing the carbon footprint
of embedded systems in general. The drive to reduce system
power consumption has led embedded system designers to
increasingly utilize multicore processing architectures. An oft-
repeated benefit of multicore platforms over computationally-
equivalent single-core platforms is increased energy efficiency
and thermal dissipation [1]. For these power benefits to be
fully realized, a computer system must possess the ability
to parallelize its computational workload across the multiple
processing cores. However, parallel computation often comes
at a cost of increasing the total, overall computation that the
system must perform due to communication and synchroniza-
tion overhead of the cooperating parallel processes. In this
paper, we explore the trade-off between parallelization of real-
time applications and energy consumption.
II. RELATED WORK
There are two main models of parallel tasks (i.e., tasks that
may use several processors simultaneously): the Gang [2], [3],
[4], [5] and the Thread model [6], [7], [8]. With the Gang
model, all parallel instances of the same task start and stop
using the processors synchronously. On the other hand, with
the Thread model, there is no such constraint. Hence, once a
thread has been released, it can be executed on the processing
platform independently of the execution of the other threads.
Very little research has addressed both real-time paralleliza-
tion and power-consumption issues [9], [10]. Furthermore,
some basic fundamental questions on the potential utility of
parallelization for meeting real-time constraints and minimiz-
ing energy have not been addressed at all in the literature.
III. MODELS
A. Parallel Job Model
In real-time systems, a job $J_\ell$ is characterized by its arrival
time $A_\ell$, execution time $E_\ell$, and relative deadline $D_\ell$. The
interpretation of these parameters is that the system must
schedule $E_\ell$ units of execution on the processing platform
in the interval $[A_\ell, A_\ell + D_\ell)$. Traditionally, most real-time
systems research has assumed that the execution of $J_\ell$ must
occur sequentially (i.e., $J_\ell$ may not execute concurrently with
itself on two or more different processors). However,
in this paper, we deal with jobs which may be executed on
different processors at the very same instant, in which case we
say that job parallelism is allowed. Various kinds of task models
exist; Goossens et al. [4] adapted parallel terminology [11] to
recurrent (real-time) tasks as follows.
Definition 1 (Rigid, Moldable and Malleable Job). A job is
said to be (i) rigid if the number of processors assigned to
this job is specified externally to the scheduler a priori, and
does not change throughout its execution; (ii) moldable if the
number of processors assigned to this job is determined by the
scheduler, and does not change throughout its execution; (iii)
malleable if the number of processors assigned to this job can
be changed by the scheduler during the job’s execution.
As a starting point for investigating the tradeoff between
energy consumption and parallelism in real-time systems,
we will work with the malleable job model in this paper.
Schedulability analysis is more complicated for the rigid and
moldable job models, and we defer study of these models for
future research.
B. Parallel Task Model
In real-time systems, jobs are generated by (recurring) tasks.
One general and popular real-time task model is the sporadic
task model [12] where each sporadic task is characterized by
its worst-case execution time $e_i$, task relative deadline $d_i$, and
minimum inter-arrival time $p_i$ (also called the task's period).
A task $\tau_i$ can generate a (potentially) infinite sequence of jobs
$J_1, J_2, \ldots$ such that: 1) $J_1$ may arrive at any time after system
start time; 2) successive jobs must be separated by at least $p_i$
time units (i.e., $A_{\ell+1} \geq A_\ell + p_i$); 3) each job has an execution
requirement no larger than the task's worst-case execution time
(i.e., $E_\ell \leq e_i$); and 4) each job's relative deadline is equal to
the task relative deadline (i.e., $D_\ell = d_i$). A useful metric
of a task's computational requirement upon the system is
utilization, denoted by $u_i$ and computed as $e_i/p_i$. A collection
of sporadic tasks $\tau \stackrel{\text{def}}{=} \{\tau_1, \tau_2, \ldots, \tau_n\}$ is called a sporadic
task system. In this paper, we assume a common subclass of
sporadic task systems called implicit-deadline sporadic task
systems where each $\tau_i \in \tau$ must have relative deadline equal
to its period (i.e., $d_i = p_i$).
At the task level, the literature distinguishes between at least
two kinds of parallelism:
• Multithread. Each task is a sequence of phases, each phase
is composed of several threads, and each thread requires a
single processor for execution and can be scheduled
simultaneously [13]. A particular case is the Fork-Join
task model, where a task begins as a single master thread
that executes sequentially until it encounters the first fork
construct, where it splits into multiple parallel threads
which execute the parallelizable part of the computation [7],
and so on.
• Gang. Each task corresponds to an $e \times k$ rectangle where $e$
is the execution time requirement and $k$ the number of
required processors, with the restriction that the $k$ processors
execute the task in unison [5].
In this paper, we assume malleable Gang task scheduling.
Due to the overhead of communication and synchroniza-
tion required in parallel processing, there are fundamental
limitations on the speedup obtainable by any real-time job.
Assuming that a job $J_\ell$ generated by task $\tau_i$ is assigned
to $k_\ell$ processors for parallel execution over some $t$-length
interval, the speedup factor obtainable is $\gamma_{i,k_\ell}$. The
interpretation of this parameter is that over this $t$-length interval
$J_\ell$ will complete $\gamma_{i,k_\ell} \cdot t$ units of execution. We let
$\Gamma_i = (\gamma_{i,0}, \gamma_{i,1}, \ldots, \gamma_{i,m}, \gamma_{i,m+1})$ denote the multiprocessor
speedup vector for jobs of task $\tau_i$ (assuming $m$ identical
processing cores). The variables $\gamma_{i,0}$ and $\gamma_{i,m+1}$ are sentinel
values used to simplify the algorithm of Section V; the values
of $\gamma_{i,0}$ and $\gamma_{i,m+1}$ are $0$ and $\infty$ respectively. Throughout the
rest of the paper, we will characterize a parallel sporadic task
$\tau_i$ by $(e_i, p_i, \Gamma_i)$.
We consider the following two restrictions on the multipro-
cessor speedup vector:
• Sub-linear speedup ratio [9]: $1 < \frac{\gamma_{i,j'}}{\gamma_{i,j}} < \frac{j'}{j}$ where $0 < j < j' \leq m$.
• Work-limited parallelism [2]: $\gamma_{i,j'+1} - \gamma_{i,j'} \leq \gamma_{i,j+1} - \gamma_{i,j}$ where $0 \leq j < j' < m$.
The sub-linear speedup ratio restriction represents the fact
that no task can truly achieve an ideal or better than ideal
speedup due to the overhead in parallelization. It also requires
that the speedup factor strictly increases with the number of
processors. The work-limited parallelism restriction ensures
that the overhead only increases as more processors are used
by the job. These restrictions place realistic bounds on the
types of speedups observable by parallel applications.
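As an illustration of the two restrictions, the following minimal Python sketch checks them for a speedup vector given without its sentinels. The function names `is_sublinear` and `is_work_limited` are ours (hypothetical), and the first two test vectors are the ones used in the example of Section III-D; the third is an ideal linear speedup added for contrast.

```python
from itertools import combinations


def is_sublinear(gamma):
    """Sub-linear speedup ratio: 1 < gamma_{j'} / gamma_j < j' / j for j < j'.

    gamma[j - 1] holds the speedup on j processors (sentinels omitted).
    """
    m = len(gamma)
    return all(
        1 < gamma[jp - 1] / gamma[j - 1] < jp / j
        for j, jp in combinations(range(1, m + 1), 2)
    )


def is_work_limited(gamma):
    """Work-limited parallelism: the marginal speedup never grows with more cores."""
    g = [0.0] + list(gamma)  # prepend the sentinel gamma_{i,0} = 0
    gains = [g[j + 1] - g[j] for j in range(len(gamma))]
    # Checking adjacent pairs suffices because <= is transitive.
    return all(later <= earlier for earlier, later in zip(gains, gains[1:]))


for gamma in [(1.0, 1.5, 2.0), (1.0, 1.2, 1.3), (1.0, 2.0, 3.0)]:
    print(gamma, is_sublinear(gamma), is_work_limited(gamma))
```

The ideal vector (1.0, 2.0, 3.0) is rejected by the sub-linear test because the ratio bound is strict.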
C. Power/Processor Model
We assume that the parallel sporadic task system τ executes
upon a multiprocessor platform with m identical processing
cores. The processing platform is enabled with both dynamic
power management (DPM) and dynamic voltage and frequency
scaling (DVFS) capabilities. With respect to DPM
capabilities, we assume that the processing platform has the
ability to turn off any number of cores between 0 and $m - 1$.
For DVFS capabilities, in this work, we assume that there is a
system-wide homogeneous frequency f > 0 which indicates
the frequency at which all cores are executing at any given
moment. The power function P (f, k) indicates the power
dissipation rate of the processing platform when executing
with k active cores at a frequency of f . We only assume
that P (f, k) is a non-decreasing, convex function. While we
consider the setting where the system may dynamically change
frequency without penalty, we consider that there is significant
overhead to turning a core on or off. Therefore, in this
paper, we will only consider core speed/activation assignment
schemes where the number of active cores is decided prior to
system runtime and does not change dynamically.
The interpretation of the frequency is that if $\tau_i$ is executing
job $J_\ell$ on $k_\ell$ processors at frequency $f$ over a $t$-length interval
then it will have executed $t \cdot \gamma_{i,k_\ell} \cdot f$ units of computation. The
total energy consumed by executing $k$ cores over the $t$-length
interval at frequency $f$ is $t \cdot P(f, k)$.
D. Scheduling Algorithm
In this paper, we use a scheduling algorithm originally de-
veloped for non-power-aware parallel real-time systems called
the canonical parallel schedule [2]. The canonical scheduling
approach is optimal for implicit-deadline sporadic real-time
tasks with work-limited parallelism and sub-linear speedup
ratio upon an identical multiprocessor platform (i.e., each
processor has identical processing capabilities and speed).
In this paper, we consider also an identical multiprocessor
platform, but permit both the number of active processors
and homogeneous frequency f for all active processors to be
chosen prior to system runtime. In this subsection, we briefly
define the canonical scheduling approach with respect to our
power-aware setting.
Assuming the processor frequencies are identical and fixed at a
value $f$, a task $\tau_i$ requires more than
$k$ processors simultaneously if $u_i > \gamma_{i,k} \cdot f$; for the unitary
frequency, we denote by $k_i$ the largest such $k$ (meaning that
$k_i + 1$ is the smallest number of processors on which the task $\tau_i$
is schedulable at frequency $f = 1$):
$$ k_i \stackrel{\text{def}}{=} \begin{cases} 0 & \text{if } u_i \leq \gamma_{i,1} \\ \max_{k=1}^{m} \{k \mid \gamma_{i,k} < u_i\} & \text{otherwise.} \end{cases} \tag{1} $$
For example, let us consider the task system τ = {τ1, τ2}
to be scheduled on three processors with f = 1. We have
τ1 = (6, 4,Γ1) with Γ1 = (1.0, 1.5, 2.0) and τ2 = (3, 4,Γ2)
with Γ2 = (1.0, 1.2, 1.3). Notice that the system is infeasible
at this frequency if job parallelism is not allowed since τ1 will
never meet its deadline unless it is scheduled on at least two
processors (i.e., k1 = 1). There is a feasible schedule if the
task τ1 is scheduled on two processors and τ2 on a third one
(i.e., k2 = 0).
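The values of $k_1$ and $k_2$ in this example can be recomputed with a short Python sketch of Eq. (1). The helper name `k_unit_frequency` and the tuple layout of the tasks are assumptions made for illustration only.

```python
def k_unit_frequency(u, gamma):
    """Eq. (1): largest k with gamma_{i,k} < u_i, or 0 if u_i <= gamma_{i,1}.

    gamma[k - 1] holds the speedup on k processors (sentinels omitted).
    """
    if u <= gamma[0]:
        return 0
    return max(k for k in range(1, len(gamma) + 1) if gamma[k - 1] < u)


# (e_i, p_i, Gamma_i) for the two tasks of the example.
tasks = {"tau1": (6, 4, (1.0, 1.5, 2.0)), "tau2": (3, 4, (1.0, 1.2, 1.3))}
for name, (e, p, gamma) in tasks.items():
    k = k_unit_frequency(e / p, gamma)
    print(f"{name}: u = {e / p}, k_i = {k}, needs {k + 1} processor(s)")
```

For these inputs the sketch yields $k_1 = 1$ and $k_2 = 0$, matching the text.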
The canonical schedule: This scheduler assigns $k_i$ processor(s)
permanently to $\tau_i$ and an additional processor sporadically
(see [2] for details). In this work we will extend that
technique for dynamic voltage and frequency scaling (DVFS)
and dynamic power management (DPM) capabilities.
IV. NON-NECESSITY OF DVFS FOR MALLEABLE JOBS
Property 1. In a multiprocessor system with a global homogeneous
frequency in a continuous range, choosing the frequency
dynamically is not necessary for optimality in terms of
consumed energy.
Proof: A similar result was presented in [14]; here we prove the
property for our framework. Although we have a proof of this
property for any convex form of $P(v)$ ($v$ is the voltage chosen,
directly linked to the resulting frequency of the system),
due to space limitations we will consider in the following that
$P(v) \propto v^3$. Assuming we have a schedule at the constant
speed/voltage $v$ on the (multiprocessor) platform, we will
show that any dynamic frequency schedule (which schedules
the same amount of work) consumes no less energy. First
notice that from any dynamic frequency schedule we can
obtain a constant frequency schedule (which schedules the
same amount of work) by applying, sequentially, the following
transformation: given a dynamic frequency schedule in the
interval $[a, b]$ which works at voltage $v_1$ in $[a, \ell)$ and at voltage
$v_2$ in $[\ell, b]$, we can define the constant voltage such that at that
speed/voltage the amount of work is identical.
Without loss of generality we will consider the constant
voltage schedule on the interval $[0, 1]$ working at voltage $v$ and
the dynamic schedule working at voltage $v + \Delta$ in $[0, \ell)$ and
at the voltage $v - \Delta'$ in $[\ell, 1]$.
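Since the transformation must preserve the amount of work
completed we must have:
$$ v = \ell(v + \Delta) + (1 - \ell)(v - \Delta') \;\Leftrightarrow\; \Delta' \stackrel{\text{def}}{=} \frac{\ell\Delta}{1 - \ell} \tag{2} $$
since the extra work in $[0, \ell)$ (i.e., $\Delta\ell$) must be equal to the
spare work in $[\ell, 1]$ (i.e., $\Delta'(1 - \ell)$).
Now we will compare the relative energy consumed by both
the schedules, i.e., we will show that
$$ \ell(v + \Delta)^3 + (1 - \ell)\left(v - \frac{\ell\Delta}{1 - \ell}\right)^3 \geq v^3 \tag{3} $$
We know that
$$ \ell(v + \Delta)^3 = \ell\,(v^3 + 3v^2\Delta + 3v\Delta^2 + \Delta^3) $$
and
$$ (1 - \ell)\left(v - \frac{\ell\Delta}{1 - \ell}\right)^3 = (1 - \ell)\left(v^3 - 3v^2\frac{\ell\Delta}{1 - \ell} + 3v\frac{\ell^2\Delta^2}{(1 - \ell)^2} - \frac{\ell^3\Delta^3}{(1 - \ell)^3}\right). $$
(3) is equivalent to (by subtracting $v^3$ on both sides)
$$ \Delta\left[3\ell v\Delta + \ell\Delta^2 + 3v\frac{\ell^2\Delta}{1 - \ell} - \frac{\ell^3\Delta^2}{(1 - \ell)^2}\right] \geq 0 $$
Or equivalently (dividing by $\ell\Delta$):
$$ 3\Delta v + \Delta^2 + 3v\frac{\ell\Delta}{1 - \ell} - \frac{\ell^2\Delta^2}{(1 - \ell)^2} \geq 0 $$
$\Leftarrow$ ($v - \Delta' \geq 0$ and, by (2))
$$ 3\Delta v + \Delta^2 + 3\,\frac{\ell\Delta}{1 - \ell}\cdot\frac{\ell\Delta}{1 - \ell} - \frac{\ell^2\Delta^2}{(1 - \ell)^2} \geq 0 $$
$\Leftarrow$
$$ 3\Delta v + \Delta^2 + 2\,\frac{\ell^2\Delta^2}{(1 - \ell)^2} \geq 0 $$
which always holds because $\Delta > 0$ and $v > 0$.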
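The inequality (3) can also be spot-checked numerically. The sketch below is not part of the proof; it assumes the cubic power function $P(v) = v^3$ used above, and the function names and the sampled values of $\ell$ and $\Delta$ are ours.

```python
def energy_constant(v):
    """Energy over the unit interval [0, 1] at constant voltage v, with P(v) = v**3."""
    return v ** 3


def energy_two_level(v, delta, ell):
    """Energy when running at v + delta on [0, ell) and v - delta' on [ell, 1]."""
    delta_prime = ell * delta / (1 - ell)  # Eq. (2): total work is preserved
    return ell * (v + delta) ** 3 + (1 - ell) * (v - delta_prime) ** 3


v = 1.0
for ell in (0.1, 0.25, 0.5, 0.75):
    for delta in (0.05, 0.1, 0.2):
        extra = energy_two_level(v, delta, ell) - energy_constant(v)
        print(f"ell={ell:4.2f} delta={delta:4.2f} extra energy={extra:.6f}")
```

Every sampled two-level schedule consumes more energy than the constant-voltage schedule doing the same work, as Property 1 predicts.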
V. OPTIMAL PROCESSOR/FREQUENCY-SELECTION
ALGORITHM
Property 1 implies, for a homogeneous frequency upon the
different processing cores, that for each DVFS schedule there
exists a constant-frequency schedule which consumes no
more energy. Thus, the frequency that minimizes consumed
energy can be computed prior to the execution of the system. So,
in the following, we will design an offline algorithm to find this
optimal minimal frequency. This parameter will allow us to use
the canonical schedule [2] to find a schedule of the system.
First, we will present the feasibility criterion adapted to variable
homogeneous frequency. Next, we will use this criterion to
determine constraints on the frequency for the system to be
feasible on a fixed number of processors. We will then
present an algorithm which uses those constraints to compute
the exact optimal frequency for the system to be feasible.
Finally, we will prove the correctness of this algorithm.
In the following we denote by $f$ the frequency of our
multiprocessor platform. Notice that we made the hypothesis
that time is continuous (as in [2]). More specifically, we can
also choose the frequency in the positive continuous range
($f \in \mathbb{R}^+_0$).
A. Background
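Notice that a task $\tau_i$ requires more than $k$ processors
simultaneously if $u_i > \gamma_{i,k} \times f$; we denote by $k_i(f)$ the
largest such $k$ (meaning that $k_i(f) + 1$ is the smallest number of
processors on which the task $\tau_i$ is schedulable at frequency $f$):
$$ k_i(f) \stackrel{\text{def}}{=} \begin{cases} 0, & \text{if } u_i \leq \gamma_{i,1} \times f \\ \max_{k=1}^{m} \{k \mid \gamma_{i,k} \times f < u_i\}, & \text{otherwise.} \end{cases} \tag{4} $$
This definition extends the one of $k_i$, Eq. (1). Notice that
we have $k_i = k_i(1)$. For a given number of processors
$\kappa \in \{0, \ldots, m - 1, m\}$, we wish to determine the range of
frequencies $[f_1, f_2)$ such that $k_i(f) = \kappa$ for all $f \in [f_1, f_2)$.
We denote this range by the inverse function
$$ k_i^{-1}(\kappa) \stackrel{\text{def}}{=} \begin{cases} \left\{ f \;\middle|\; \frac{u_i}{\gamma_{i,\kappa+1}} \leq f < \frac{u_i}{\gamma_{i,\kappa}} \right\} & \text{if } 0 < \kappa \leq m \\ \left[ \frac{u_i}{\gamma_{i,1}}, \infty \right) & \text{otherwise.} \end{cases} \tag{5} $$
We denote the left endpoint (resp., right endpoint) of $k_i^{-1}(\kappa)$
as $k_i^{-1}(\kappa).f_1$ (resp., $k_i^{-1}(\kappa).f_2$).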
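To make Eqs. (4) and (5) concrete, here is a minimal Python sketch of both functions. The names `k_of_f` and `k_inverse`, and the representation of an interval as an `(f1, f2)` pair, are assumptions of this illustration; the sample task is $\tau_1$ from Section III-D.

```python
import math


def k_of_f(u, gamma, f):
    """Eq. (4): k_i(f) for utilization u and speedups gamma[k - 1] on k cores."""
    if u <= gamma[0] * f:
        return 0
    return max(k for k in range(1, len(gamma) + 1) if gamma[k - 1] * f < u)


def k_inverse(u, gamma, kappa):
    """Eq. (5): the half-open interval [f1, f2) on which k_i(f) == kappa."""
    g = [0.0] + list(gamma) + [math.inf]  # sentinels gamma_0 = 0, gamma_{m+1} = inf
    if kappa == 0:
        return (u / g[1], math.inf)
    return (u / g[kappa + 1], u / g[kappa])


u, gamma = 6 / 4, (1.0, 1.5, 2.0)
for kappa in range(len(gamma) + 1):
    f1, f2 = k_inverse(u, gamma, kappa)
    print(f"kappa={kappa}: [{f1}, {f2}), k_i(f1 + 0.01) = {k_of_f(u, gamma, f1 + 0.01)}")
```

Evaluating `k_of_f` just inside each left endpoint returns the corresponding $\kappa$, which is a quick consistency check between the two definitions.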
B. Feasibility criteria with variable homogeneous frequency
We will now present a necessary and sufficient condition for
the feasibility of a task system $\tau$ on $m$ identical processors at
frequency $f > 0$.
Theorem 1. A sporadic task system $\tau \stackrel{\text{def}}{=} \{\tau_1, \tau_2, \ldots, \tau_n\}$
is feasible on an identical platform with $m$ processing cores
at frequency $f > 0$ if and only if the task system
$\tau' \stackrel{\text{def}}{=} \{\tau'_1, \tau'_2, \ldots, \tau'_n\}$ is feasible on the same system with $m$
processing cores at frequency 1.
$\tau'$ is defined as follows:
$$ \forall\, 1 \leq i \leq n : \tau'_i = (e_i, p_i, \Gamma'_i) $$
$$ \Gamma'_i = (\gamma'_{i,0}, \gamma'_{i,1}, \ldots, \gamma'_{i,m}, \gamma'_{i,m+1}) $$
$$ \forall\, 0 \leq k \leq m + 1 : \gamma'_{i,k} \stackrel{\text{def}}{=} \gamma_{i,k} \times f. $$
Proof: First of all, it is easy to see that $\tau$ respects the sub-linear
speedup ratio and work-limited parallelism restrictions if and only
if $\tau'$ respects them as well.
We know that if $\tau_i$ is executing a job on $k$ processors at
frequency $f$ over a $t$-length interval then it will have executed
$t \cdot \gamma_{i,k} \cdot f$ units of computation. Over the same interval, $\tau'_i$, at
frequency 1, executes $t \cdot \gamma'_{i,k} \cdot 1 = t \cdot \gamma_{i,k} \cdot f$ units of
computation. The amount of work executed per unit of time
is exactly the same for every task of both systems. So if there
exists one schedule without any deadline miss for one of the
two systems, we can use the same one to schedule the other
system. Thus, we can conclude that $\tau$ is feasible if and only
if $\tau'$ is feasible.
Theorem 2. A necessary and sufficient condition for a sporadic
task system $\tau$ (respecting sub-linear speedup ratio and
work-limited parallelism) to be feasible on $m$ processors at
frequency $f$ is given by:
$$ m \geq \sum_{i=1}^{n} \left( k_i(f) + \frac{u_i - \gamma_{i,k_i(f)} \times f}{(\gamma_{i,k_i(f)+1} - \gamma_{i,k_i(f)}) \times f} \right). \tag{6} $$
Proof: By Theorem 1, we know that $\tau$ is feasible at
frequency $f$ on $m$ processing cores if and only if $\tau'$ is
feasible at frequency 1. In [2], there is a necessary and
sufficient feasibility condition for any sporadic task system
(work-limited and sub-linear speedup ratio) for fixed frequency
($f = 1$). This result can be used to establish the schedulability
of $\tau'$.
So, using the result given by [2], we know that $\tau'$ is feasible
if and only if this inequality holds:
$$ m \geq \sum_{i=1}^{n} \left( k'_i + \frac{u_i - \gamma'_{i,k'_i}}{\gamma'_{i,k'_i+1} - \gamma'_{i,k'_i}} \right), \tag{7} $$
where $k'_i$ denotes the value of $k_i$ (cf. the definition given by (4))
calculated for the system $\tau'$ at frequency 1:
$\forall\, 1 \leq i \leq n : k'_i = k'_i(1) = k_i(f)$.
We can now replace $k'_i$ and $\gamma'_{i,k}$ by their values in (7):
$$ m \geq \sum_{i=1}^{n} \left( k_i(f) + \frac{u_i - \gamma_{i,k_i(f)} \times f}{\gamma_{i,k_i(f)+1} \times f - \gamma_{i,k_i(f)} \times f} \right) $$
$$ \Leftrightarrow\; m \geq \sum_{i=1}^{n} \left( k_i(f) + \frac{u_i - \gamma_{i,k_i(f)} \times f}{(\gamma_{i,k_i(f)+1} - \gamma_{i,k_i(f)}) \times f} \right), $$
which corresponds exactly to (6). So $\tau'$ is feasible if and
only if (6) holds. Thus, by Theorem 1, $\tau$ is feasible if and
only if (6) holds.
Definition 2 (Minimum number of processors function for
parallel tasks and system). For any $\tau_i \in \tau$:
$$ M_i(f) \stackrel{\text{def}}{=} k_i(f) + \frac{u_i - \gamma_{i,k_i(f)} \times f}{(\gamma_{i,k_i(f)+1} - \gamma_{i,k_i(f)}) \times f} $$
Therefore, we can define the same notion system-wide:
$$ M_\tau(f) \stackrel{\text{def}}{=} \sum_{i=1}^{n} M_i(f) = \sum_{i=1}^{n} \left( k_i(f) + \frac{u_i - \gamma_{i,k_i(f)} \times f}{(\gamma_{i,k_i(f)+1} - \gamma_{i,k_i(f)}) \times f} \right) $$
Based on this definition, the feasibility criterion (6) becomes:
$$ m \geq M_\tau(f) \tag{8} $$
Notice that, for a fixed frequency $f$, the minimum number
of processors necessary and sufficient to schedule the system
is $\lceil M_\tau(f) \rceil$.
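Definition 2 translates directly into code. The following Python sketch evaluates $M_i(f)$ and $M_\tau(f)$ for the example task system of Section III-D; the function names, the `(e, p, gamma)` tuple layout, and the sampled frequencies are assumptions of this illustration, and tasks are assumed to have $u_i > 0$.

```python
import math


def m_i(u, gamma, f):
    """M_i(f) from Definition 2 for one task with utilization u > 0."""
    g = [0.0] + list(gamma) + [math.inf]  # sentinels gamma_0 = 0, gamma_{m+1} = inf
    m = len(gamma)
    # With the sentinel gamma_0 = 0, Eq. (4) reduces to a single max over 0..m.
    k = max(j for j in range(m + 1) if g[j] * f < u)
    return k + (u - g[k] * f) / ((g[k + 1] - g[k]) * f)


def m_tau(tasks, f):
    """System-wide M_tau(f); tasks are (e_i, p_i, Gamma_i) triples."""
    return sum(m_i(e / p, gamma, f) for e, p, gamma in tasks)


tasks = [(6, 4, (1.0, 1.5, 2.0)), (3, 4, (1.0, 1.2, 1.3))]
for f in (0.8, 1.0, 1.2):
    need = m_tau(tasks, f)
    print(f"f={f}: M_tau={need:.4f}, processors needed={math.ceil(need)}")
```

At $f = 1$ the sketch gives $M_\tau = 2.75$, so three processors suffice, in agreement with the earlier example.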
In the following, we will show that in our model, the feasi-
bility of the system is sustainable regarding the frequency i.e.
increasing the value of the frequency maintains the feasibility
of the system. For this, we will need the following theorem.
Theorem 3. $M_\tau(f)$ is a monotonically decreasing function
for $f > 0$.
Proof: We will first prove something stronger:
$$ \forall \tau_i \in \tau,\ \forall f_1, f_2 \in \mathbb{R}^+_0 : \quad 0 < f_1 \leq f_2 \Rightarrow M_i(f_1) \geq M_i(f_2) \tag{9} $$
First, notice that $\forall\, \tau_i \in \tau$, $k_i(f)$ is a decreasing staircase
function. Indeed, the value of $k_i(f)$ depends on the satisfaction
of $\gamma_{i,k} \times f < u_i$. In this inequality, the greater $f$ is, the smaller
$\gamma_{i,k}$ has to be for it to hold, and so does $k$, because $\Gamma_i$ is ordered
by model assumption ($0 < \gamma_{i,1} < \gamma_{i,2} < \ldots < \gamma_{i,m}$).
In order to confirm the decrease of $M_i(f)$, we have to
consider both cases:
• When $k_i(f)$ remains constant between two frequencies.
• When $k_i(f)$ jumps a step between two frequencies.
The case when $\kappa = m$ is trivial: for $f < \frac{u_i}{\gamma_{i,m}}$, $M_i(f) = m$
and task $\tau_i$ is not schedulable on $m$ processors at this frequency.
So before $f = \frac{u_i}{\gamma_{i,m}}$, $M_i(f)$ is constant and therefore
monotonically decreases.
The first case to consider is when $k_i(f) = \kappa$ is fixed
($\kappa \in \{0, 1, \ldots, m - 1\}$). By (5), we have that for $f$ in
$\left[ \frac{u_i}{\gamma_{i,\kappa+1}}, \frac{u_i}{\gamma_{i,\kappa}} \right)$,
$k_i(f) = \kappa$ remains constant and
$$ M_i(f) = \kappa + \frac{u_i - \gamma_{i,\kappa} \times f}{(\gamma_{i,\kappa+1} - \gamma_{i,\kappa}) \times f} = \kappa + \frac{u_i}{(\gamma_{i,\kappa+1} - \gamma_{i,\kappa}) \times f} - \frac{\gamma_{i,\kappa}}{\gamma_{i,\kappa+1} - \gamma_{i,\kappa}} $$
decreases as a multiplicative inverse function (terms which
do not depend on $f$ are fixed in this interval).
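Now consider the case when there is a variation in the value
of $k_i(f)$. This occurs only when $f = \frac{u_i}{\gamma_{i,k}}$ for $k = m, m - 1, \ldots, 1$.
At this exact value of the frequency, the value of
$k_i(f)$ jumps from $k$ to $k - 1$. We will prove that even in those
cases, the function $M_i(f)$ still decreases.
We have to prove the following:
$$ \frac{u_i}{\gamma_{i,k+1}} \leq f' < \frac{u_i}{\gamma_{i,k}} = f \;\Rightarrow\; M_i(f') > M_i(f) $$
Let us compute their values individually:
$$ \forall\, 1 \leq k \leq m : \quad k_i(f) = k_i\!\left( \frac{u_i}{\gamma_{i,k}} \right) = k - 1 $$
$$ \Rightarrow\; M_i(f) = M_i\!\left( \frac{u_i}{\gamma_{i,k}} \right) = k - 1 + \frac{u_i - \gamma_{i,k-1} \times \frac{u_i}{\gamma_{i,k}}}{(\gamma_{i,k} - \gamma_{i,k-1}) \times \frac{u_i}{\gamma_{i,k}}} = k - 1 + 1 = k $$
$$ k_i(f') = k \ \text{ because } f' \in k_i^{-1}(k) \;\Rightarrow\; M_i(f') = k + \frac{u_i - \gamma_{i,k} \times f'}{(\gamma_{i,k+1} - \gamma_{i,k}) \times f'} $$
We know $u_i - \gamma_{i,k} \times f' > 0$ because $f' < \frac{u_i}{\gamma_{i,k}}$; $\gamma_{i,k+1} > \gamma_{i,k}$
is true by model assumption, and $f' > 0$. Thus we have, with
$\varepsilon > 0$:
$$ M_i(f') = M_i(f) + \varepsilon \;\Rightarrow\; M_i(f') > M_i(f) $$
We have now proved (9) for every possible value of $k_i(f)$.
Thus:
$$ f_1, f_2 \in \mathbb{R}^+_0 : 0 < f_1 \leq f_2 \;\Rightarrow\; M_i(f_1) \geq M_i(f_2) \ \ \forall \tau_i \in \tau $$
$$ \Rightarrow\; \sum_{i=1}^{n} M_i(f_1) \geq \sum_{i=1}^{n} M_i(f_2) \;\Rightarrow\; M_\tau(f_1) \geq M_\tau(f_2) $$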
This theorem directly implies the following property.
Property 2. The feasibility of the system is sustainable regarding
the frequency (i.e., increasing the frequency preserves the
system schedulability).
Proof: By (8), $\tau$ is feasible on $m$ processors at frequency
$f > 0$ if and only if $m \geq M_\tau(f)$.
If $\tau$ is feasible on $m$ processors at frequency $f > 0$, then $\tau$
is feasible on $m$ processors at any greater frequency $f' \geq f$
because, by Theorem 3, the following holds:
$$ f \leq f' \text{ and } m \geq M_\tau(f) \;\Rightarrow\; m \geq M_\tau(f) \geq M_\tau(f') \;\Rightarrow\; m \geq M_\tau(f'), $$
which corresponds to the feasibility criterion at frequency $f'$.
C. Minimum optimal frequency
Property 2 implies that there is a minimum frequency for
the system to be feasible. It would therefore be interesting to have
an algorithm to compute it for a particular task system $\tau$ and
a maximum number of processors $m$. We will first derive a
constraint on the frequency from the feasibility criterion. After
that, we will use this constraint to design an algorithm that
computes the optimal minimum frequency in $O(n^2 \log_2^2 m)$
time.
Definition 3 (Minimum frequency notation).
$$ \Psi(\tau, m, \vec{\kappa}) \stackrel{\text{def}}{=} \frac{\sum_{i=1}^{n} \frac{u_i}{\gamma_{i,\kappa_i+1} - \gamma_{i,\kappa_i}}}{m - \sum_{i=1}^{n} \left( \kappa_i - \frac{\gamma_{i,\kappa_i}}{\gamma_{i,\kappa_i+1} - \gamma_{i,\kappa_i}} \right)}, $$
where $\vec{\kappa} = (\kappa_1, \kappa_2, \ldots, \kappa_n)$.
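Property 3. If $\vec{k}(f) = (k_1(f), k_2(f), \ldots, k_n(f))$, then the
following holds:
$$ m \geq M_\tau(f) \;\Leftrightarrow\; f \geq \Psi(\tau, m, \vec{k}(f)) \tag{10} $$
Proof: Let us define some additional notation:
$$ M'_\tau(f) = \sum_{i=1}^{n} \left( k_i(f) - \frac{\gamma_{i,k_i(f)}}{\gamma_{i,k_i(f)+1} - \gamma_{i,k_i(f)}} \right) $$
$$ M''_\tau(f) = \sum_{i=1}^{n} \left( \frac{u_i}{\gamma_{i,k_i(f)+1} - \gamma_{i,k_i(f)}} \right) $$
A few things to notice:
• we have $M''_\tau(f) > 0$ because $u_i > 0$ (tasks are not
trivial) and $\gamma_{i,k} < \gamma_{i,k+1}\ \forall\, k \in \{0, 1, \ldots, m\}$ (sub-linear
speedup ratio).
• $M_\tau(f) = M'_\tau(f) + \frac{M''_\tau(f)}{f}$ (with $f > 0$).
• the last two items imply that $M_\tau(f) > M'_\tau(f)$.
We have:
$$ m \geq M_\tau(f) > M'_\tau(f) \;\Rightarrow\; m > M'_\tau(f) \;\Leftrightarrow\; m - M'_\tau(f) > 0 \tag{11} $$
And:
$$ m \geq M_\tau(f) \;\Leftrightarrow\; m \geq M'_\tau(f) + \frac{M''_\tau(f)}{f} \;\Leftrightarrow\; \underbrace{m - M'_\tau(f)}_{> 0 \text{ by (11)}} \geq \frac{M''_\tau(f)}{\underbrace{f}_{> 0}} $$
$$ \Leftrightarrow\; f \geq \frac{M''_\tau(f)}{m - M'_\tau(f)} = \Psi(\tau, m, \vec{k}(f)) $$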
Let us denote by $f_{MIN}$ the optimal minimum frequency
such that the system $\tau$ is feasible on $m$ processors. By
Property 3, $f_{MIN}$ is the smallest positive real number $f$ such
that
$$ f \geq \Psi(\tau, m, \vec{k}(f)). \tag{12} $$
Consider fixing each $k_i(f)$ term such that they are equal to
$k_i(f_{MIN})$. From there, it would be easy to calculate $f_{MIN}$
with the function $\Psi$ (in $O(n)$ time).
The first thing the algorithm will do is therefore to search for
those values (denoted by $\bar{\kappa}_1, \bar{\kappa}_2, \ldots, \bar{\kappa}_n$ such that
$\bar{\kappa}_i = k_i(f_{MIN})\ \forall \tau_i \in \tau$) and then compute the value of $f_{MIN}$ with
the expression $\Psi(\tau, m, \bar{\kappa} = (\bar{\kappa}_1, \bar{\kappa}_2, \ldots, \bar{\kappa}_n))$. The algorithm
will be presented in the next section.
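As a hedged illustration of Definition 3 and Property 3, the Python sketch below evaluates $\Psi$ for the vector $\vec{k}(f)$ at two sample frequencies, using the example task system of Section III-D with $m = 3$. The names `with_sentinels`, `k_vector`, and `psi` are ours, as are the sample frequencies.

```python
import math


def with_sentinels(gamma):
    """Add the sentinels gamma_0 = 0 and gamma_{m+1} = infinity."""
    return [0.0] + list(gamma) + [math.inf]


def k_vector(tasks, f):
    """Vector (k_1(f), ..., k_n(f)) from Eq. (4), assuming every u_i > 0."""
    result = []
    for e, p, gamma in tasks:
        g = with_sentinels(gamma)
        result.append(max(j for j in range(len(gamma) + 1) if g[j] * f < e / p))
    return result


def psi(tasks, m, kappa):
    """Psi(tau, m, kappa) from Definition 3."""
    numerator = 0.0
    correction = 0.0
    for (e, p, gamma), k in zip(tasks, kappa):
        g = with_sentinels(gamma)
        step = g[k + 1] - g[k]
        numerator += (e / p) / step
        correction += k - g[k] / step
    return numerator / (m - correction)


tasks = [(6, 4, (1.0, 1.5, 2.0)), (3, 4, (1.0, 1.2, 1.3))]
for f in (0.9, 1.0):
    kv = k_vector(tasks, f)
    bound = psi(tasks, 3, kv)
    print(f"f={f}: k(f)={kv}, Psi={bound}, f >= Psi: {f >= bound}")
```

By Property 3, the comparison `f >= Psi` should agree with the feasibility test $m \geq M_\tau(f)$ at the same frequency.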
D. Algorithm Description
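Algorithm 1: feasible($\tau$, $m$, $f$)
    sum ← 0
    for $\tau_i \in \tau$ do
        $\kappa_i \leftarrow k_i(f)$
        sum ← sum + $\kappa_i + \frac{u_i - \gamma_{i,\kappa_i} \times f}{(\gamma_{i,\kappa_i+1} - \gamma_{i,\kappa_i}) \times f}$
    return $m \geq$ sum
Algorithm 2: minimumOptimalFrequency($\tau$, $m$)
    for $i \in \{1, 2, \ldots, n\}$ do
        if feasible($\tau$, $m$, $\frac{u_i}{\gamma_{i,m}}$) then
            $\bar{\kappa}_i \leftarrow m - 1$
        else
            $\bar{\kappa}_i \leftarrow \min_{\kappa=0}^{m-1} \{\kappa \mid \text{not feasible}(\tau, m, \frac{u_i}{\gamma_{i,\kappa+1}})\}$
    $\bar{\kappa} \stackrel{\text{def}}{=} (\bar{\kappa}_1, \bar{\kappa}_2, \ldots, \bar{\kappa}_n)$
    $f_{MIN} \leftarrow \Psi(\tau, m, \bar{\kappa})$
    return $f_{MIN}$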
We have designed an algorithm to determine the optimal
minimum frequency (see Algorithm 2). The algorithm
systematically searches for the minimum frequency that
satisfies the constraint (6) by calling the feasibility test
function (Algorithm 1). For each value $\kappa$ that we want to test,
we determine from (5) the minimum frequency $f$ such that $\tau_i$
requires $\kappa + 1$ processors (i.e., $f = k_i^{-1}(\kappa).f_1 = \frac{u_i}{\gamma_{i,\kappa+1}}$). The
value of $f$ can be determined in $O(1)$ time.
In the feasibility test, we determine the value of $k_i(f)$
from frequency $f$ according to (4), which can be obtained
in $O(\log_2 m)$ time by binary search over $m$ values. Thus, to
calculate $k_i(f)$ for all $\tau_i \in \tau$ and sum all the $M_i(f)$ terms, the
total time complexity of the feasibility test is $O(n \log_2 m)$.
In the main algorithm aimed at calculating $f_{MIN}$, the value
of $\bar{\kappa}_i$ can also be found by binary search and thus takes
$O(\log_2 m)$ calls to the feasibility test. This is made possible by the
sustainability of the system regarding the frequency (proved
by Property 2). Indeed, if $\tau$ is feasible on $m$ processors
with $\kappa_i$ ($f = \frac{u_i}{\gamma_{i,\kappa_i+1}}$), then it is also feasible with $\kappa_i - 1$
($f = \frac{u_i}{\gamma_{i,\kappa_i}} > \frac{u_i}{\gamma_{i,\kappa_i+1}}$).
In order to calculate the complete vector $\bar{\kappa}$, there will
be $O(n \log_2 m)$ calls to the feasibility test. Since computing
$\Psi$ is linear-time when the vector $\bar{\kappa}$ is already stored in
memory, the total time complexity to determine the optimal
feasible frequency is $O(n^2 \log_2^2 m)$. In order to determine the
optimal combination of frequency and number of processors,
we simply iterate over all possible numbers of active processors
$\ell = 1, 2, \ldots, m$, executing Algorithm 2 with inputs $\tau$ and $\ell$.
We return the combination that results in the minimum overall
power-dissipation rate. Thus, the overall complexity to find the
optimal combination is $O(m n^2 \log_2^2 m)$.
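The binary search for $\bar{\kappa}_i$ relies only on the monotonicity guaranteed by Property 2. A minimal Python sketch of that search step is given below; the function name `first_infeasible_kappa` and the callback interface are assumptions of this illustration, and the feasibility outcomes in the usage lines are hard-coded for the example task system of Section III-D rather than computed.

```python
def first_infeasible_kappa(m, feasible_at_kappa):
    """Smallest kappa in 0..m-1 for which feasible_at_kappa(kappa) is False.

    Assumes, by Property 2, that the predicate is monotone (feasible for
    small kappa, i.e. high frequency, then infeasible from some kappa on)
    and that kappa = m - 1 is infeasible (the 'else' branch of Algorithm 2).
    """
    lo, hi = 0, m - 1  # invariant: the answer lies in [lo, hi]
    while lo < hi:
        mid = (lo + hi) // 2
        if feasible_at_kappa(mid):
            lo = mid + 1  # the answer is strictly above mid
        else:
            hi = mid      # mid is infeasible, so the answer is mid or below
    return lo


# Outcomes of feasible(tau, 3, u_i / gamma_{i,kappa+1}) for kappa = 0, 1, 2.
outcomes_tau1 = [True, True, False]     # frequencies 1.5, 1.0, 0.75
outcomes_tau2 = [False, False, False]   # frequencies 0.75, 0.625, ~0.577
print(first_infeasible_kappa(3, lambda k: outcomes_tau1[k]))
print(first_infeasible_kappa(3, lambda k: outcomes_tau2[k]))
```

Each call evaluates the predicate $O(\log_2 m)$ times, which is where the logarithmic factor in the number of feasibility tests comes from.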
E. An Example
Let us use the same example system as previously introduced
in Section III-D. Consider $\tau = \{\tau_1, \tau_2\}$ to be scheduled
on $m = 3$ identical processors. Tasks are defined as follows:
$\tau_1 = (6, 4, \Gamma_1)$ with $\Gamma_1 = (1.0, 1.5, 2.0)$ and $\tau_2 = (3, 4, \Gamma_2)$
with $\Gamma_2 = (1.0, 1.2, 1.3)$. The vector $\bar{\kappa}$ corresponding to
this configuration computed by the algorithm is equal to
$(\bar{\kappa}_1 = 2, \bar{\kappa}_2 = 0)$. This implies that the optimal minimum
frequency for this system to be feasible on 3 processors is
equal to $f_{MIN} = \Psi(\tau, 3, \langle 2, 0 \rangle) = 0.9375$. We can see that if
we call the feasibility test function for any frequency greater
than or equal to 0.9375, it will return True; it will return False
for any lower value.
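The example above can be reproduced with a compact Python sketch of Algorithms 1 and 2. This is an illustration under stated assumptions, not the authors' implementation: it uses a linear scan where the paper uses binary search, it adds an explicit guard for the case $k_i(f) = m$, and all function names and the `(e, p, gamma)` tuple layout are ours.

```python
import math


def with_sentinels(gamma):
    """Add the sentinels gamma_0 = 0 and gamma_{m+1} = infinity."""
    return [0.0] + list(gamma) + [math.inf]


def feasible(tasks, m, f):
    """Algorithm 1: does m >= M_tau(f) hold at frequency f?"""
    total = 0.0
    for e, p, gamma in tasks:
        u, g = e / p, with_sentinels(gamma)
        k = max(j for j in range(m + 1) if g[j] * f < u)  # k_i(f), Eq. (4)
        if k == m:
            # Guard added in this sketch: the task would need more than m cores.
            return False
        total += k + (u - g[k] * f) / ((g[k + 1] - g[k]) * f)
    return m >= total


def psi(tasks, m, kappa):
    """Psi(tau, m, kappa) from Definition 3."""
    numerator = correction = 0.0
    for (e, p, gamma), k in zip(tasks, kappa):
        g = with_sentinels(gamma)
        step = g[k + 1] - g[k]
        numerator += (e / p) / step
        correction += k - g[k] / step
    return numerator / (m - correction)


def minimum_optimal_frequency(tasks, m):
    """Algorithm 2, with a linear scan in place of the binary search."""
    kappa_bar = []
    for e, p, gamma in tasks:
        u = e / p
        if feasible(tasks, m, u / gamma[m - 1]):
            kappa_bar.append(m - 1)
        else:
            # gamma[k] is gamma_{i,k+1} because the tuple is 0-indexed.
            kappa_bar.append(
                next(k for k in range(m) if not feasible(tasks, m, u / gamma[k]))
            )
    return psi(tasks, m, kappa_bar), kappa_bar


tasks = [(6, 4, (1.0, 1.5, 2.0)), (3, 4, (1.0, 1.2, 1.3))]
f_min, kappa_bar = minimum_optimal_frequency(tasks, 3)
print(kappa_bar, f_min)
print(feasible(tasks, 3, 0.94), feasible(tasks, 3, 0.93))
```

For this task system the sketch should return $\bar{\kappa} = (2, 0)$ and $f_{MIN} = 0.9375$, with the feasibility test succeeding just above that frequency and failing just below it. Frequencies exactly at the boundary are avoided in the last line because floating-point rounding can affect the comparison there.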
F. Proof of Correctness
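The efficiency and correctness of the above algorithm depend
upon the theorems presented in Sections V-B and V-C.
Furthermore, the algorithm is correct if the value $\Psi$ computed
using the previously calculated vector $\bar{\kappa}$ is equal to the
minimum optimal frequency as defined by (12). That will be
the goal of our last theorem.
Theorem 4.
$$ f_{MIN} = \Psi(\tau, m, \bar{\kappa}) \tag{13} $$
Proof: We will need an auxiliary notion:
$$ M_i(f, \kappa) \stackrel{\text{def}}{=} \kappa + \frac{u_i - \gamma_{i,\kappa} \times f}{(\gamma_{i,\kappa+1} - \gamma_{i,\kappa}) \times f}, \qquad \kappa \in \{0, 1, \ldots, m - 1\} $$
Notice the following:
$$ M_i(f) = M_i(f, k_i(f)) \quad \forall\, \tau_i \in \tau $$
By definition,
$$ f_{MIN} = \min\{f \in \mathbb{R}^+_0 \mid m \geq M_\tau(f)\} = \min\{f \in \mathbb{R}^+_0 \mid f \geq \Psi(\tau, m, \vec{k}(f))\} \ \text{(by Property 3)} = \Psi(\tau, m, \vec{k}(f_{MIN})). $$
This is equivalent to
$$ m = M_\tau(f_{MIN}) \quad \text{by Property 3.} $$
We will prove the following:
$$ \forall\, 1 \leq i \leq n : \quad M_i(f_{MIN}, \bar{\kappa}_i) = M_i(f_{MIN}, k_i(f_{MIN})) = M_i(f_{MIN}) $$
This would imply:
$$ \sum_{i=1}^{n} M_i(f_{MIN}, \bar{\kappa}_i) = M_\tau(f_{MIN}) = m \;\Leftrightarrow\; \Psi(\tau, m, \bar{\kappa}) = \Psi(\tau, m, \vec{k}(f_{MIN})) = f_{MIN} $$
Notice that when $k_i(f_{MIN}) = \bar{\kappa}_i$, then we have
$M_i(f_{MIN}) = M_i(f_{MIN}, \bar{\kappa}_i)$.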
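We have four cases of possible values of $\bar{\kappa}_i$ to investigate:
• $\bar{\kappa}_i = m - 1$, the basic case of the algorithm,
• $\bar{\kappa}_i = 0$,
• $\bar{\kappa}_i > 0$, when $f_{MIN} \neq \frac{u_i}{\gamma_{i,\bar{\kappa}_i}}$,
• $\bar{\kappa}_i > 0$, when $f_{MIN} = \frac{u_i}{\gamma_{i,\bar{\kappa}_i}}$.
Basic case, $\bar{\kappa}_i = m - 1$:
$$ \text{feasible}\!\left(\tau, m, \frac{u_i}{\gamma_{i,m}}\right) \Rightarrow f_{MIN} \leq \frac{u_i}{\gamma_{i,m}} < \frac{u_i}{\gamma_{i,m-1}} < \ldots < \frac{u_i}{\gamma_{i,1}} \;\Rightarrow\; k_i(f_{MIN}) \geq k_i\!\left(\frac{u_i}{\gamma_{i,m}}\right) = m - 1, $$
but for the system to be feasible, we must have $k_i(f_{MIN}) < m$, so:
$$ m - 1 \leq k_i(f_{MIN}) < m \;\Rightarrow\; k_i(f_{MIN}) = m - 1 $$
Complex case, $\bar{\kappa}_i = 0$:
$$ \neg\text{feasible}\!\left(\tau, m, \frac{u_i}{\gamma_{i,1}}\right) \Rightarrow f_{MIN} > \frac{u_i}{\gamma_{i,1}} > \frac{u_i}{\gamma_{i,2}} > \ldots > \frac{u_i}{\gamma_{i,m}} $$
$$ \Rightarrow\; \frac{u_i}{\gamma_{i,1}} < f_{MIN} < \infty \;\Rightarrow\; f_{MIN} \in k_i^{-1}(0) \;\Rightarrow\; k_i(f_{MIN}) = 0 $$
Complex case, $\bar{\kappa}_i > 0$:
$$ \neg\text{feasible}\!\left(\tau, m, \frac{u_i}{\gamma_{i,\bar{\kappa}_i+1}}\right) \wedge \text{feasible}\!\left(\tau, m, \frac{u_i}{\gamma_{i,\bar{\kappa}_i}}\right) \;\Rightarrow\; \frac{u_i}{\gamma_{i,\bar{\kappa}_i+1}} < f_{MIN} \leq \frac{u_i}{\gamma_{i,\bar{\kappa}_i}} $$
Case $f_{MIN} \neq \frac{u_i}{\gamma_{i,\bar{\kappa}_i}}$:
$$ \Rightarrow\; \frac{u_i}{\gamma_{i,\bar{\kappa}_i+1}} < f_{MIN} < \frac{u_i}{\gamma_{i,\bar{\kappa}_i}} \;\Rightarrow\; f_{MIN} \in \left] \frac{u_i}{\gamma_{i,\bar{\kappa}_i+1}}, \frac{u_i}{\gamma_{i,\bar{\kappa}_i}} \right[ \subset k_i^{-1}(\bar{\kappa}_i) \;\Rightarrow\; k_i(f_{MIN}) = \bar{\kappa}_i $$
Case $f_{MIN} = \frac{u_i}{\gamma_{i,\bar{\kappa}_i}}$:
$$ k_i(f_{MIN}) = \bar{\kappa}_i - 1 $$
$$ \Rightarrow\; M_i(f_{MIN}) = M_i(f_{MIN}, k_i(f_{MIN})) = M_i\!\left(\frac{u_i}{\gamma_{i,\bar{\kappa}_i}}, \bar{\kappa}_i - 1\right) = \bar{\kappa}_i - 1 + \frac{u_i - \gamma_{i,\bar{\kappa}_i-1} \times \frac{u_i}{\gamma_{i,\bar{\kappa}_i}}}{(\gamma_{i,\bar{\kappa}_i} - \gamma_{i,\bar{\kappa}_i-1}) \times \frac{u_i}{\gamma_{i,\bar{\kappa}_i}}} = \bar{\kappa}_i - 1 + 1 = \bar{\kappa}_i $$
$$ M_i(f_{MIN}, \bar{\kappa}_i) = \bar{\kappa}_i + \frac{u_i - \gamma_{i,\bar{\kappa}_i} \times \frac{u_i}{\gamma_{i,\bar{\kappa}_i}}}{(\gamma_{i,\bar{\kappa}_i+1} - \gamma_{i,\bar{\kappa}_i}) \times \frac{u_i}{\gamma_{i,\bar{\kappa}_i}}} = \bar{\kappa}_i $$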
VI. EXPERIMENTAL EVALUATION & SIMULATION
In order to obtain realistic predictions regarding the effect of
parallelism upon power consumption, we have evaluated our
algorithm upon an actual hardware testbed. In this section, we
describe and discuss the high-level overview of the methodol-
ogy employed in our evaluation, the low-level details involved
in our evaluation methodology, and the results obtained from
our experiments.
A. Methodology Overview
Realistic predictions of the energy behavior of a real-time
parallel system using our frequency-selection algorithm
require a hard-real-time parallel application executing upon an
instrumented multicore hardware testbed. In the Compositional
and Parallel Real-Time Systems (CoPaRTS) laboratory at
Wayne State University, we have developed a power/thermal-
aware testbed infrastructure to obtain accurate power and
temperature readings. Thus, we may obtain realistic hardware
power measurements for any application executing on our
testbed.
Regarding the hard-real-time parallel application, we are
unfortunately not aware of any such available application that
matches the malleable job model used in this paper². However,
given the continuous march of the real-time and embedded
computing domains towards increasingly parallel architectures,
we fully expect that such applications will be developed in
the near future. Thus, it behooves us to obtain as close to
realistic as possible parameters for such future parallel real-
time applications. We have developed a methodology with this
goal in mind. Below is a high-level overview of the steps of
our design methodology. The details for each step are in the
next subsection.
1) Modify Testbed: We have modified a multicore platform
to obtain accurate instantaneous CPU power readings.
Furthermore, our hardware testbed has the ability to run
at a discrete set of frequencies and turn off individual
cores. Thus, our platform can approximately implement
the frequencies determined from the frequency/processor-
selection algorithm (Section V).
2) Obtain Realistic Speedup Vectors: Since we do not
possess a hard-real-time application with malleable par-
allel jobs, we have observed the execution behavior of
two different non-real-time parallel benchmarks (an I/O-
constrained and non-I/O-constrained application) over
different processing frequencies and levels of parallelism.
Our observations are used to construct two realistic
speedup vectors to use in our simulation (Step 4).
3) Obtain Realistic Power Rates: Using the same non-
real-time parallel benchmarks, we also construct a matrix
of power dissipation rates over a range of processing
frequencies and number of active cores. Again, our mea-
surements are utilized in the next simulation step.
4) Power-Savings Simulation: After obtaining the speedup
vectors and corresponding power dissipation rates, we
evaluate our algorithm over randomly generated task
systems. The power required by our frequency/processor-
selection algorithm is compared against the power required
by an optimal non-parallel real-time scheduling approach
(e.g., Pfair [15]).
² In fact, we are also unaware of any commercially or freely available
application for any of the other hard-real-time parallel job models.
B. Methodology Details
1) Testbed: For our testbed platform, we use an Intel i7 950
processor with eight cores (four physical cores with each
physical core having two “soft” cores – i.e., hyperthreads).
The processor supports 13 different frequency settings. (The
processor sets a single frequency level, and all cores execute
at this global frequency.) We use a Linux 2.6.27 kernel with
the PREEMPT-RT patch as our operating system. In addition, we
have developed kernel modules for individual core shutdown
and for frequency modulation.
The testbed requires a few hardware modifications to measure
the actual CPU power usage. To this end, we connect four
shunt resistors (0.05 Ω each) in series with the four-wire
eATX power connector interfaces of the motherboard, so that
each 12 V power line is shunted with a 0.05 Ω resistor.
We measure the current (A) drawn by the CPU using National
Instruments' NI 9205 Data Acquisition unit. Then, we
calculate the total instantaneous CPU power as the sum of
the individual powers delivered through the eATX +12 V
motherboard connectors. We run the testbed under every
combination of the 13 supported frequencies and the number
of active cores and record the corresponding power-dissipation
rates for the system. When the number of active cores is less
than eight, there is a choice of which cores to shut down. To
address this choice, we consider all the possible shutdown
scenarios for a given number of active cores and use the
average of the power rates of all the scenarios. For example,
in our eight-core processor, we have seven different ways to
shut down a single core.³ We measure the power consumption of
the system for each individual case, and the average power is
recorded as our final power-rate measurement for the given
combination of frequency and number of active cores.
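The measurement arithmetic described above can be sketched in Python as follows. This is only an illustration under stated assumptions: we assume the acquisition unit reports the voltage drop across each 0.05 Ω shunt and that each line sits at a nominal 12 V. The function names and the `measure` callback are hypothetical and are not part of the testbed software.

```python
from itertools import combinations
from statistics import mean

SHUNT_OHMS = 0.05   # resistance of each shunt (from the text)
LINE_VOLTS = 12.0   # nominal rail voltage; assumed constant in this sketch
NUM_CORES = 8
BOOT_CORE = 0       # core 0 (boot core) cannot be shut down


def cpu_power(shunt_drops):
    """Total instantaneous power (W) from the voltage drop (V) on each shunt."""
    # Ohm's law gives each line current; power is summed over the lines.
    return sum(LINE_VOLTS * (drop / SHUNT_OHMS) for drop in shunt_drops)


def shutdown_scenarios(active):
    """All sets of `active` running cores that keep the boot core on."""
    others = [c for c in range(NUM_CORES) if c != BOOT_CORE]
    return [(BOOT_CORE,) + rest for rest in combinations(others, active - 1)]


def average_power_rate(active, measure):
    """Mean power over every scenario; `measure` maps a core set to watts."""
    return mean(measure(cores) for cores in shutdown_scenarios(active))
```

With seven active cores, `shutdown_scenarios(7)` enumerates seven scenarios, which matches the count given in the text for shutting down a single core.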
2) Speedup Vectors and Power Functions: From our
testbed, we can generate both a speedup vector and a power-
dissipation-rate function for non-I/O-constrained (i.e., CPU-
bound) and I/O-constrained (i.e., memory-bound) parallel
applications. In order to obtain these parameters, we use two
parallel applications: a modified version of Jetbench [16] as
a non-I/O-constrained application and a modified parallel
version of the GNU Compiler Collection (GCC) [17] as
an I/O-constrained application. Jetbench is an open-source
OpenMP-based multicore benchmark application that emulates
jet engine performance using real jet engine parameters and
thermodynamic equations presented in NASA’s EngineSim
program. For GCC, using the “-j” option of GNU Make [18],
we concurrently compile a collection of source code files under
a variable number of active processor cores.
³ We cannot shut down core 0 (the boot core).
Fig. 1. Speedup Vectors for Jetbench and GCC [plot: speedup versus # of active processors, one to eight, for gcc and jetbench]
Fig. 2. Power Function Over Varying Number of Active Cores and Frequencies for Non-I/O-Constrained Parallel Application (Jetbench) [surface plot; axes: # of Active Cores, Core Frequency (GHz), Power (W)]
To obtain the speedup vectors for both Jetbench and GCC,
we execute the applications upon different numbers of active
cores, recording for each number of cores the response time
of the application. The speedup for x cores is the ratio of
the response time on one core to the response time of the
application running concurrently on x cores. Figure 1 plots
the speedup vectors for the two applications. Not surprisingly,
Jetbench benefits more from an increasing number of processors
due to its CPU-bound nature and inherently parallelizable
workload.
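The speedup definition above amounts to a single division per core count. A minimal Python sketch, in which the function name and the sample timings are made up for illustration and are not measurements from the testbed:

```python
def speedup_vector(response_times):
    """Speedup per core count; response_times[i] is measured on i + 1 cores."""
    if not response_times or any(t <= 0 for t in response_times):
        raise ValueError("response times must be positive")
    baseline = response_times[0]  # response time on a single core
    return [baseline / t for t in response_times]


# Hypothetical timings (seconds) on 1, 2, 3, and 4 active cores.
example = speedup_vector([80.0, 40.0, 32.0, 20.0])
```

By construction the first entry is always 1, and an entry below its core count reflects parallelization overhead.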
To determine the power-dissipation rates for both Jetbench
and GCC, we execute these applications for all combinations
of frequency and number of active cores and record
both the power-dissipation rate and the speedup values for the
application. The power-dissipation rates are determined using
the measurement hardware described above in Section VI-B1.
Each recorded value is the average of the power measured at
1 ms sampling intervals for the duration of the application.
Figure 2 plots the power-dissipation function for Jetbench;
Figure 3 plots the power-dissipation function for GCC. Observe
that the power-dissipation levels for Jetbench are slightly
higher in most cases than the levels for GCC; this is likely
because GCC idles the processor more often during I/O
operations.
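One way to organize these measurements is a table keyed by frequency and number of active cores, where each entry is the mean of the 1 ms power samples for that run. The sketch below assumes the samples are already available as lists of watt values; the function name, data layout, and numbers are hypothetical.

```python
from statistics import mean


def build_power_table(traces):
    """Map (frequency_ghz, active_cores) to the mean sampled power in watts."""
    return {setting: mean(samples) for setting, samples in traces.items()}


# Made-up 1 ms samples for two settings, for illustration only.
table = build_power_table({
    (1.6, 4): [30.0, 31.0, 29.0],
    (3.2, 8): [78.0, 82.0],
})
```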
Fig. 3. Power Function Over Varying Number of Active Cores and Frequencies for I/O-Constrained Parallel Application (GCC) [surface plot; axes: # of Active Cores, Core Frequency (GHz), Power (W)]
3) Power-Savings Simulation: We randomly generate task
systems using a variant of the UUnifast-Discard algorithm
by Davis and Burns [19]. In the UUnifast-Discard
algorithm, the user supplies a desired system-level utilization
and number of tasks, and the algorithm returns a task
system in which each task utilization is randomly generated
from a uniform distribution. The difference between
UUnifast-Discard and the original UUnifast algorithm
from Bini and Buttazzo [20] is that UUnifast-Discard
generates task systems with system utilizations exceeding
one, but task utilizations at most one. These restrictions
make UUnifast-Discard appropriate for multiprocessor
scheduling settings with non-parallel real-time jobs. To
extend UUnifast-Discard, we modified the algorithm to
permit task utilizations to exceed one (i.e., a job is required
to execute on more than one processor to complete by its
deadline) and to fix a single task at a given maximum
utilization. We call our extended algorithm
UUnifast-Discard-Max. The utilization for each task
generated by UUnifast-Discard-Max (except for the task
with fixed maximum utilization) is drawn from a uniform
distribution.
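A possible reading of this generator is sketched below in Python. The text does not give the discard rule of UUnifast-Discard-Max, so the sketch makes two assumptions: a candidate set is discarded when any non-fixed task exceeds the fixed maximum, and parameter combinations that cannot satisfy this are rejected. The function names are ours.

```python
import random


def uunifast(n, total_u, rng):
    """Classic UUnifast: n utilizations that sum to total_u."""
    utils = []
    remaining = total_u
    for i in range(1, n):
        next_remaining = remaining * rng.random() ** (1.0 / (n - i))
        utils.append(remaining - next_remaining)
        remaining = next_remaining
    utils.append(remaining)
    return utils


def uunifast_discard_max(n, total_u, u_max, rng, max_tries=10_000):
    """Fix the first task at u_max and draw the rest with UUnifast.

    Assumption of this sketch: sets in which another task exceeds u_max
    are discarded and redrawn.
    """
    rest = total_u - u_max
    if n < 2 or rest <= 0 or rest > (n - 1) * u_max:
        raise ValueError("no task system satisfies these parameters")
    for _ in range(max_tries):
        others = uunifast(n - 1, rest, rng)
        if max(others) <= u_max:
            return [u_max] + others
    raise RuntimeError("discard limit reached")


tasks = uunifast_discard_max(8, 3.0, 0.8, random.Random(0))
```

Passing an explicit `random.Random` instance keeps the generated task systems reproducible across simulation runs.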
Fig. 4. Average power savings for non-I/O-constrained workload (jetbench) when $U_{\max} = 0.4$ [surface plot; axes: Utilization, # of Active Cores, Power (W)]
Fig. 5. Average power savings for non-I/O-constrained workload (jetbench) when $U_{\max} = 0.8$ [surface plot; axes: Utilization, # of Active Cores, Power (W)]
Using the random-task generator, we generate task systems
with a total of eight tasks. The total system utilization is varied
from 1.5 to 8, and the UUnifast-Discard-Max algorithm
assigns a maximum utilization to the first task. We run our
simulations with the maximum utilization value $U_{\max}$
(i.e., $\max_{i=1}^{n} u_i$) equal to 0.4, 0.8, and 1.2. To
match our testbed, the number of available cores is varied
from one through eight, and the simulation covers every total
utilization from 1.5 to 8 in increments of 0.1. We run a
variant of Algorithm 2 that iterates through all combinations
of frequency and number of active cores instead of using a
binary search. (Our power function does not exactly satisfy the
non-decreasing property required for binary search to work.)
At each utilization point, we store the exact frequency returned
by our algorithm. For comparison, we determine the minimum
frequency required for an optimal (non-parallel) scheduling
algorithm to schedule the same task system. This value is
the smallest $f$ that satisfies $U(\tau) \le m f$ for the task
system $\tau$, i.e., $f = U(\tau)/m$. Using these resulting
frequencies, we obtain the optimal minimum frequency for the
non-parallel and parallel settings. We then use these
frequencies to look up the power-dissipation rates for the
respective application by using the functions displayed in
Figures 2 and 3. In the next subsection, we plot the power
savings; i.e., we plot the power-dissipation level required for
the optimal non-parallel algorithm minus the power-dissipation
level obtained from our algorithm. Each data point is the
average power saving for 1000 different randomly generated
task systems.
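The comparison described above can be sketched in Python as follows. This is not Algorithm 2 itself: `is_feasible` is a hypothetical stand-in for its feasibility check, frequencies are treated as normalized speeds, and the power table values are made up.

```python
def cheapest_setting(power, is_feasible):
    """Exhaustively pick the feasible (frequency, cores) pair of least power."""
    feasible = [(watts, f, k) for (f, k), watts in power.items()
                if is_feasible(f, k)]
    if not feasible:
        return None
    watts, f, k = min(feasible)
    return f, k, watts


def nonparallel_power(power, total_u, m):
    """Power at the lowest supported frequency f with total_u <= m * f."""
    usable = [f for (f, k) in power if k == m and f >= total_u / m]
    if not usable:
        return None
    return power[(min(usable), m)]


def power_saving(baseline_watts, our_watts):
    """Saving of our setting relative to the non-parallel baseline."""
    return baseline_watts - our_watts


# Toy power table: (normalized frequency, active cores) -> watts.
power = {(0.5, 1): 10.0, (0.5, 2): 18.0, (1.0, 1): 30.0, (1.0, 2): 55.0}
# Toy feasibility rule standing in for the real schedulability check.
ours = cheapest_setting(power, lambda f, k: f * k >= 0.5)
baseline = nonparallel_power(power, total_u=0.5, m=2)
saving = power_saving(baseline, ours[2])
```

Scanning every table entry avoids relying on the power function being non-decreasing, which is the reason the text gives for not using a binary search.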
C. Results & Discussion
Figures 4, 5, and 6 display the power savings obtained from
simulating over the parallel/power parameters obtained from
the Jetbench application. Figures 7, 8, and 9 display the power
savings for the GCC application. The largest power saving is
60 watts (for GCC when $U_{\max} = 0.4$ and both the utilization
and the number of active cores equal eight), which is
significant since, from Figures 2 and 3, the maximum
power-dissipation rate is around 80 watts.
Fig. 6. Average power savings for non-I/O-constrained workload (jetbench) when $U_{\max} = 1.2$ [surface plot; axes: Utilization, # of Active Cores, Power (W)]
Fig. 7. Average power savings for I/O-constrained workload (gcc) when $U_{\max} = 0.4$ [surface plot; axes: Utilization, # of Active Cores, Power (W)]
Fig. 8. Average power savings for I/O-constrained workload (gcc) when $U_{\max} = 0.8$ [surface plot; axes: Utilization, # of Active Cores, Power (W)]
From these plots, there are a few noticeable trends: 1) As
$U_{\max}$ increases, the power savings decrease for both
applications; the reason for this decrease is that
larger-utilization jobs require greater parallelization and thus
more parallel overhead, which reduces the power savings. 2) As
the total utilization increases, the power savings increase
(when more than two processors are active); in this case, the
savings appear to be due to the fact that the power-dissipation
rates are considerably higher at the highest core frequencies.
Thus, if our parallel algorithm can reduce the frequency
relative to the non-parallel algorithm by even a slight amount,
there are significant power savings. 3) The power savings for
both applications are similar; however, the I/O-constrained
application, GCC, appears to have slightly higher power
savings. A possible explanation is that the power-dissipation
function for GCC rewards small frequency reductions slightly
more than Jetbench’s function. Also, since we have a discrete
set of frequencies, many of the different frequencies returned
by Algorithm 2 are mapped to the same core frequency, which
reduces the differences between the two applications.
Fig. 9. Average power savings for I/O-constrained workload (gcc) when $U_{\max} = 1.2$ [surface plot; axes: Utilization, # of Active Cores, Power (W)]
VII. CONCLUSIONS
In this paper, we explore the potential energy savings that
could be obtained from exploiting parallelism present in a
real-time application. We consider the case of malleable Gang
scheduled parallel jobs and design an optimal polynomial-time
algorithm for determining the frequency to run each active core
when we have the constraint of homogeneous core frequencies.
Simulations with power data from an actual hardware testbed
confirm the efficacy of our approach by providing significant
power savings over the optimal non-parallel scheduling
approach. As real-time embedded systems are trending toward
multicore architectures, our research suggests the potential for
reducing the overall energy consumption of these devices by
exploiting task-level parallelism. In the future, we will extend
our research to investigate power saving potential when the
cores may execute at different frequencies and also incorporate
thermal constraints into the problem.
REFERENCES
[1] “The benefits of multiple CPU cores in mobile devices (white paper),”
NVIDIA Corporation, Tech. Rep.
[2] S. Collette, L. Cucu, and J. Goossens, “Integrating job parallelism in
real-time scheduling theory,” Information Processing Letters, vol. 106,
no. 5, pp. 180–187, 2008.
[3] V. Berten, P. Courbin, and J. Goossens, “Gang fixed priority scheduling
of periodic moldable real-time tasks,” in JRWRTC 2011, pp. 9–12.
[4] J. Goossens and V. Berten, “Gang FTP scheduling of periodic and
parallel rigid real-time tasks,” in RTNS 2010, pp. 189–196.
[5] S. Kato and Y. Ishikawa, “Gang EDF scheduling of parallel task
systems,” in RTSS 2009, pp. 459–468.
[6] P. Courbin, I. Lupu, and J. Goossens, “Scheduling of hard real-time
multi-phase multi-thread periodic tasks,” Real-Time Systems: The Inter-
national Journal of Time-Critical Computing, vol. 49, no. 2, pp. 239–
266, 2013.
[7] K. Lakshmanan, S. Kato, and R. Rajkumar, “Scheduling parallel real-
time tasks on multi-core processors,” in RTSS 2010, pp. 259–268.
[8] A. Saifullah, K. Agrawal, C. Lu, and C. Gill, “Multi-core real-time
scheduling for generalized parallel task models,” in RTSS 2011, pp. 217–
226.
[9] F. Kong, N. Guan, Q. Deng, and W. Yi, “Energy-efficient scheduling
for parallel real-time tasks based on level-packing,” in SAC 2011, pp.
635–640.
[10] S. Cho and R. Melhem, “Corollaries to Amdahl’s law for energy,”
Computer Architecture Letters, vol. 7, no. 1, pp. 25–28, 2007.
[11] R. Buyya, High Performance Cluster Computing: Architectures and
Systems. Upper Saddle River, NJ, USA: Prentice Hall PTR, 1999,
ch. Scheduling Parallel Jobs on Clusters, pp. 519–533.
[12] A. Mok, “Fundamental design problems of distributed systems for the
hard-real-time environment,” Ph.D. dissertation, Laboratory for Com-
puter Science, Massachusetts Institute of Technology, 1983.
[13] G. Nelissen, V. Berten, J. Goossens, and D. Milojevic, “Techniques
optimizing the number of processors to schedule multi-threaded tasks,”
in ECRTS 2012, pp. 321–330.
[14] T. Ishihara and H. Yasuura, “Voltage scheduling problem for dynami-
cally variable voltage processors,” in ISLPED 1998, pp. 197–202.
[15] S. Baruah, N. Cohen, G. Plaxton, and D. Varvel, “Proportionate progress:
A notion of fairness in resource allocation,” Algorithmica, vol. 15, no. 6,
pp. 600–625, 1996.
[16] M. Qadri, D. Matichard, and K. McDonald Maier, “Jetbench: An
open source real-time multiprocessor benchmark,” in Architecture of
Computing Systems - ARCS 2010, 2010, vol. 5974, pp. 211–221.
[17] “GCC, the GNU compiler collection,” http://gcc.gnu.org/.
[18] “GNU Make,” http://www.gnu.org/software/make/.
[19] R. Davis and A. Burns, “Priority assignment for global fixed priority
pre-emptive scheduling in multiprocessor real-time systems,” in RTSS
2009, pp. 398–409.
[20] E. Bini and G. Buttazzo, “Biasing effects in schedulability measures,”
in ECRTS 2004, pp. 196–203.
