Bounding the Execution Time of Task-based Parallel Applications on Unrelated Multiprocessors by Voudouris, Petros et al.
Abstract—Heterogeneous multiprocessors, that consist of pro-
cessor types with different execution capabilities, are critical
today and in the future for delivering high performance and high energy
efficiency. To use them in hard real-time systems that
support parallel processing, a tight upper bound
on the completion time (WCET) of parallel applications is needed.
This paper presents, for the first time, a closed-form solution
for calculating the WCET of task-based parallel applications
modeled as directed acyclic graphs (DAGs) under the general
unrelated multiprocessor model, which can represent a
wide range of heterogeneous multiprocessor platforms. The paper
contributes a polynomial-time algorithm to calculate the
WCET (i.e., makespan) for the unrelated model. In addition, it
presents simulation results that are based on modeling a set of
representative OpenMP task-based parallel applications from the
BOTS benchmark suite.
Index Terms—Real-time Scheduling, Parallel Applications,
Heterogeneous multiprocessors, Makespan
I. INTRODUCTION
The end of Dennard scaling at the beginning of this millennium
led to a shift to multi/manycore platforms, a.k.a.
multiprocessors. The demand for higher performance and energy efficiency
has turned the focus to heterogeneous multiprocessors [1].
Heterogeneous multiprocessors comprise cores with different
performance/energy and functional characteristics. To use them
in real-time systems, efficient and tight estimations of the
worst-case execution time (also known as the makespan) of
parallel applications are needed.
This paper focuses on parallel applications run on hetero-
geneous multiprocessors. As tasks in these applications have
different resource requirements, scheduling algorithms taking
heterogeneity into account can offer higher performance and
energy efficiency. As schedulability analysis for homogeneous
multiprocessors cannot be trivially applied to heterogeneous
multiprocessors [2], new scheduling algorithms are being
developed and analyzed for heterogeneous systems [3], [4].
The worst-case execution time (WCET) of a task on a
heterogeneous multiprocessor depends on the execution behavior
of the task, i.e., the task type (two tasks of different
types perform different functionalities), as well as on the
type of processor used, i.e., the processor type [5]. Sometimes,
however, it may not even be possible to execute a certain
task on a particular processor due to the specialization of
the processor (called an incompatible processor type for that
task). Based on the relation between the task type and the
processor type, multiprocessor architectures can be separated
into three categories: homogeneous, related and unrelated. In
homogeneous multiprocessors, there is a single processor type.
Hence, the WCET for a specific task type is the same on
all processors. In related multiprocessors, each processor type
is associated with a speed factor. The WCET of any task is
scaled with the speed factor of the processor type (related
multiprocessors are also known as uniform multiprocessor
platforms [5]). In unrelated multiprocessors, a speed factor is
associated with each task-type and processor-type pair. Hence,
the unrelated multiprocessor model is the most general model of
heterogeneous multiprocessor platforms, and it is the model we consider in this
paper for the execution of real-time parallel applications.
Prior work on task- and processor-type speed relations
has mainly focused on related multiprocessors [5]–[7]. This
model is not capable of modeling today’s heterogeneous
multiprocessor platforms, as the following example shows. Consider
two tasks: task $u_{ILP}$ with abundant instruction-level
parallelism (ILP), where all the instructions are independent of
each other, and task $u_{NO}$ with no ILP, where all the instructions
have data dependencies between each other. Next, assume
that the two tasks are executed on a big.LITTLE platform
[8], where the LITTLE processor is in-order with 8 pipeline
stages, and the big processor is out-of-order with 15 pipeline
stages. Task $u_{ILP}$ would have a shorter WCET on the big
processor than on the LITTLE processor because the big
processor can exploit the ILP and finish the execution
faster. In contrast, $u_{NO}$ has no ILP that the big processor
can exploit. Moreover, because of the extra cost of a deeper pipeline and the
cost of the mechanisms needed to preserve the correct
execution of instructions that are (mis-speculatively) executed
in parallel, the WCET of $u_{NO}$ on the big processor may be
longer than its WCET on the LITTLE processor. The
example above shows that the related multiprocessor model,
with the same speed factor for all task types, is too restrictive.
In contrast, the unrelated multiprocessor model is capable
of modeling the execution scenarios of different task types
on a broad range of heterogeneous multiprocessors, including
big.LITTLE, as it associates a speed factor with each task-type
and processor-type pair.
Prior work using the unrelated multiprocessor model targets
independent tasks [4], [9]. However, so far, no prior work
has targeted task-based real-time parallel applications on
unrelated multiprocessor platforms. This paper considers, for
the first time, an unrelated multiprocessor platform composed of
processors of different types and parallel applications that
are modeled as directed acyclic graphs (DAGs), where every
node is a task and a directed edge between two nodes is a
dependency. Every task is characterized by a different
WCET for each processor type, and the goal is to calculate
the makespan of the application.
We use a combinatorial approach to exhaustively analyze all possible
mappings of the tasks on different processor types
under a greedy scheduling policy, and we propose two methods
to calculate the makespan. These two approaches have
time complexity exponential in the number of processors and
the number of tasks, so they are useful only for analyzing
small-scale platforms. However, we use this analysis as a stepping
stone to develop a third, efficient makespan calculation method
(denoted by EM), which has polynomial-time complexity and
can also be used for large-scale platforms.
To evaluate our proposed makespan calculations we use
four OpenMP task-based parallel applications from the BOTS
benchmark suite [10] modeled as DAGs [11]: Fibonacci, Sort,
Strassen and FFT. We have also extended our evaluation with
synthetic DAGs to measure the sensitivity of the proposed
approach to different simulation parameters.
We could not find any literature on makespan computation
of DAGs on unrelated machines. Instead, we measure the
tightness of EM by comparing its makespan with that of our two
proposed exhaustive approaches. In addition, a lower bound
on the makespan is derived by simulating the actual execution
of the parallel applications under the assumed scheduler, and it is compared
with the makespan under our proposed analytical approach
EM to find the level of pessimism.
Comparing the results of the exhaustive approach and
the EM approach for up to 8 processors with a fixed number
of processor types, the EM approach overestimates the
makespan of the four applications by only 1% on average
and by up to 3%. Comparing EM to the simulation
of the execution for up to 1024 processors with up to 8
processor types, we observe 23% pessimism on average and up to 59%.
In other words, our estimated makespan is at most
59% larger than the exact makespan. Next, for a platform
with 8 processors and a varying number of processor types,
the EM makespan exceeds that of the two permutation-based
approaches by 1% on average and by at most 1.3%.
Comparing EM to the simulation of the execution, we observe
12% pessimism on average and up to 24%.
We have also analyzed the impact of processor heterogeneity
(i.e., how different the WCETs of the tasks are
on different processor types) on the makespan of all four
applications. For the applications under study, we could not
see a big difference for the tested cases, mainly because the
high number of tasks in the applications hides the adverse
impact of high heterogeneity. Finally, we have empirically
investigated the impact of making some tasks incompatible
with some processor types and found that, as the
number of incompatible tasks increases, the makespan of the application
increases since less parallelism is available.
The main contributions of the paper are:
• To the best of our knowledge, this is the first work that
provides a closed-form solution, called EM, to the problem of
calculating the makespan of task-based parallel applications
modeled as DAGs and executed on an unrelated multiprocessor
platform, with polynomial time complexity.
• Two exhaustive approaches that provide a tighter makespan
calculation than EM, at the cost of exponential time
complexity, and that can be applied to smaller platforms.
• Simulation results that are based on modeling OpenMP
task-based parallel applications from the BOTS bench-
mark suite [12] and simulations with synthetic DAGs.
The rest of the paper is organized as follows: Section
II introduces the system model. Next, Section III provides
the definitions for the platform characterization. Section IV
presents the proposed makespan calculations. Furthermore,
Section V describes the simulation framework and Section VI
reports the simulation results. Section VII presents the related
work before we conclude the paper in Section VIII.
II. SYSTEM MODEL
This section presents the system model considered in this
paper. The model of unrelated multiprocessor platform and the
model of parallel application are presented in Section II-A and
Section II-B, respectively. The task-processor speed relation
is formally introduced in Section II-C. Finally, Section II-D
presents the run-time scheduler.
A. Platform
We consider an unrelated multiprocessor platform with a total of
$M$ processors and $H$ different processor types. A processor
type is denoted by $t$ for $t = 1, 2, \ldots, H$. For example, $t = 3$
specifies processor type 3.
B. Application
A parallel application is modeled as a directed acyclic graph
(DAG) denoted $G = (V, E)$, where $V = \{u_1, \ldots, u_N\}$ is a set
of $N$ nodes (tasks) and $E \subseteq (V \times V)$ is a set of directed edges.
Each task is executed sequentially. If $(u_p, u_q) \in E$, then $u_q$
can start execution only after task $u_p$ completes.
The WCET of task $u_i$ on any processor of type $t$ is
denoted by $e_i^t$. A heterogeneous platform may have specialized
accelerators with limited functionality, and not all tasks
are compatible for execution on such accelerators. If task $u_i$
is not compatible with processor type $t$, then $e_i^t = \infty$. We
assume that every task has at least one compatible processor
type. Two functionally equivalent tasks belong to the same
task type. If two tasks $u_i$ and $u_j$ are functionally equivalent,
then $e_i^t = e_j^t$, $\forall t$, $1 \le t \le H$.
Let a path, denoted by $\gamma$, be a sequence of tasks that
are connected by edges. A task with no incoming edge is called a
source, and a task with no outgoing edge is called a sink. Without
loss of generality we assume that $G$ has exactly one source
(denoted as $v_{src}$) and one sink (denoted as $v_{sink}$). If
there is more than one source/sink node, a dummy task with
WCET zero is added as a new source/sink node.
Let $G^{\min}$ be a DAG isomorphic to $G$ [13] where
each node $u_i$ has only one WCET, denoted by $e_i^{\min}$, such that
$e_i^{\min} = \min_{t=1}^{H}\{e_i^t\}$. In other words, $G^{\min}$ is a DAG that
has the same structure as $G$ but each node $u_i$ has only one
WCET (the minimum over all processor types). The critical path
$cp$ of $G$ is defined as the path that has the
longest execution among all the paths in $G^{\min}$. Let $L$ be the
sum of the WCETs of the tasks that belong to $cp$ in $G^{\min}$.
The total workload of the DAG $G$ is denoted by $C$, which is
the sum of the WCETs of all the tasks belonging to $G^{\min}$.
Figure 1 shows an example DAG; the WCETs of
its tasks on an unrelated heterogeneous platform with four
processor types are presented in Table I. Calculating the
total workload and the workload of the critical path from the
corresponding $G^{\min}$ gives $C = 6$ and $L = 3$, respectively.
Fig. 1. Example of application model

TABLE I
WCETS OF THE TASKS OF THE APPLICATION OF FIGURE 1

        t1    t2    t3    t4
uA       1     1     3     ∞
uB       1     2     4     5
uC       2     1     ∞     5
uD       2     1     3     4
uE       1     3     6     4
uF       2     1     3     4
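As a minimal sketch of how $C$ and $L$ follow from Table I, the snippet below builds $G^{\min}$ and computes both values. The edge set is an assumption: Figure 1 is not reproduced in the text, so the edges are inferred from the schedule walk-through of Figure 2 (uA precedes uB, uC, uD and uE, which all precede uF). The names `WCET`, `EDGES` and `longest_from` are illustrative only.

```python
import math
from functools import lru_cache

INF = math.inf

# WCETs from Table I: task -> (t1, t2, t3, t4); INF marks an incompatible type.
WCET = {
    "A": (1, 1, 3, INF),
    "B": (1, 2, 4, 5),
    "C": (2, 1, INF, 5),
    "D": (2, 1, 3, 4),
    "E": (1, 3, 6, 4),
    "F": (2, 1, 3, 4),
}

# ASSUMPTION: edges inferred from the schedule description, not read from Figure 1.
EDGES = [("A", "B"), ("A", "C"), ("A", "D"), ("A", "E"),
         ("B", "F"), ("C", "F"), ("D", "F"), ("E", "F")]

# G_min keeps the structure of G but gives each node its minimum WCET.
e_min = {task: min(times) for task, times in WCET.items()}

successors = {task: [] for task in WCET}
for src, dst in EDGES:
    successors[src].append(dst)

@lru_cache(maxsize=None)
def longest_from(task):
    """Length of the longest path in G_min that starts at `task`."""
    tail = max((longest_from(nxt) for nxt in successors[task]), default=0)
    return e_min[task] + tail

C = sum(e_min.values())                    # total workload
L = max(longest_from(t) for t in WCET)     # critical-path workload

print(C, L)
```

Under the assumed edges, every task has a minimum WCET of 1, so the six tasks give $C = 6$ and the three-level structure gives $L = 3$, matching the values stated in the text.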
C. Task-processor speed relation
We define $\delta_t^i$, given by Eq. (1), as the (normalized) speed of
processor type $t$ with respect to task $u_i$.
$$\delta_t^i = \begin{cases} \dfrac{e_i^{\min}}{e_i^t} & \text{if } e_i^t \neq \infty \\ 0 & \text{otherwise} \end{cases} \tag{1}$$
The larger the value of $\delta_t^i$, the smaller the WCET of
task $u_i$ on processor type $t$. Note that $\delta_t^i = 1.0$ for the processor
type $t$ on which the task has the smallest WCET.
Note that tasks experience different speeds on different
types of processor, and two tasks of different types may not
enjoy the same speed on the same processor type. Such a feature
is not exhibited in uniform (i.e., related) multiprocessors
[5]–[7]. Table II shows the speeds (computed using Eq. (1))
that each task in Table I experiences on the four different types
of processors.
TABLE II
THE TASK-PROCESSOR SPEEDS FOR THE EXAMPLE DAG OF FIGURE 1

        $\delta_1^i$   $\delta_2^i$   $\delta_3^i$   $\delta_4^i$
uA      1.0    1.0    0.33   0
uB      1.0    0.5    0.25   0.2
uC      0.5    1.0    0      0.2
uD      0.5    1.0    0.33   0.25
uE      1.0    0.33   0.16   0.25
uF      0.5    1.0    0.33   0.25
We denote by $Prf^i$ the sequence of the speeds of task $u_i$ on all
the different processor types in non-increasing order. We
also denote by $Prf_x^i$ the speed of the $x$th fastest processor for
task $u_i$. Finally, the set of processors with speed slower than
the $x$th fastest processor for task $u_i$ is denoted by $Sl_x^i$.
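The speed definition of Eq. (1) and the sorted sequence $Prf^i$ can be sketched as follows, using the WCETs of Table I. This is an illustrative sketch, not the authors' implementation; the names `speeds`, `delta` and `prf` are hypothetical.

```python
import math

INF = math.inf

# WCETs from Table I: task -> (t1, t2, t3, t4); INF marks an incompatible type.
WCET = {
    "A": (1, 1, 3, INF),
    "B": (1, 2, 4, 5),
    "C": (2, 1, INF, 5),
    "D": (2, 1, 3, 4),
    "E": (1, 3, 6, 4),
    "F": (2, 1, 3, 4),
}

def speeds(times):
    """Eq. (1): normalized speed of each processor type for one task."""
    fastest = min(times)
    return [0.0 if e == INF else fastest / e for e in times]

# delta[task][t] is the speed of processor type t+1 for that task.
delta = {task: speeds(times) for task, times in WCET.items()}

# Prf^i: the speeds of task u_i in non-increasing order, so prf[task][x - 1]
# is the speed of the x-th fastest processor type for that task.
prf = {task: sorted(row, reverse=True) for task, row in delta.items()}

for task in WCET:
    print(task,
          [round(s, 2) for s in delta[task]],
          [round(s, 2) for s in prf[task]])
```

The printed speeds are rounded to two decimals, so the entry $1/6$ for uE on type 3 prints as 0.17, whereas Table II lists it truncated as 0.16.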
Fig. 2. An example of execution of the scheduler for the DAG of Figure 1
D. Scheduler
This subsection presents the details of our assumed scheduler,
which is greedy. The greedy scheduler dispatches a task
$u_i$ to the idle processor on which the task would execute the
fastest among all (if any) idle processors. Note that
if no compatible processor for $u_i$ is idle, then the task is not
scheduled on an incompatible processor. In addition, if task $u_i$
is currently executing on a processor of type $t_1$ and another
(better) processor of type $t_2$ becomes available later on (for
example, due to the completion of some other task $u_k$) such that
$e_i^{t_1} > e_i^{t_2}$, then task $u_i$ is allowed to migrate from the processor of
type $t_1$ to the processor of type $t_2$ to execute faster. If multiple tasks are eligible
for migration, then the selection of the task for migration is
arbitrary.
This paper presents an analysis of the greedy scheduler to
derive a closed-form equation that yields a safe upper
bound on the makespan of a parallel application. The proposed
analysis is directly applicable to broad classes of well-known
scheduling policies, such as fixed-priority and EDF, which
are greedy by nature.
Figure 2 depicts the execution of the parallel application in
Figure 1 based on the assumed greedy scheduling algorithm
on a heterogeneous platform (speeds specified in Table II) with
four processors each of a different type. At time 0, only node
uA is ready for execution and it is dispatched to processor
type t1 since uA has the minimum WCET on processor of
type t1. When uA completes its execution after one time
unit, nodes uB , uC , uD and uE are ready for execution.
By assuming that the newly released tasks are placed in the
ready queue in lexicographic order, node uB is dispatched
next. Since one processor of each type is available, node uB
is also dispatched to a processor of type t1 which executes it
the fastest. Similarly, uC is dispatched to t2 because it also
executes the task fastest (i.e., provides the minimum WCET).
Note that both uB and uC execute with speed 1 since they do
not compete for the same (fastest) processor type. Next, uD
will be dispatched to processor type t3 where it is executed
with speed 0.33 because both (relatively faster) processors of
types t1 and t2 are occupied. Task uE is then dispatched to
processor type t4 and executes with speed 0.25. At time 2,
both uB and uC complete their execution and processor types
t1 and t2 become available. At time 2, node uE and uD also
migrate respectively to relatively higher-speed processors of
type t1 and t2 to continue the rest of their execution. Task
uD and uE complete their execution at time 2.67 and 2.75,
respectively. After that, uF is dispatched to processor type t2
which provides speed 1 for this task since all the processor-
types are available. The entire application finishes at time 3.75.
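The completion times in this walk-through can be checked with a few lines of arithmetic. The sketch below assumes that a task's progress is its speed multiplied by elapsed time and that migration is free, which is what the numbers in the text imply; the helper `finish_after_migration` is a hypothetical name.

```python
# Speeds from Table II that matter for the migrating tasks.
SPEED_D = {"t2": 1.0, "t3": 1 / 3}
SPEED_E = {"t1": 1.0, "t4": 0.25}

WORK = 1.0  # each task needs e_min = 1 unit of normalized work

def finish_after_migration(start, migrate_at, slow, fast, work=WORK):
    """Completion time of a task that runs at `slow` speed from `start`
    until `migrate_at`, then at `fast` speed until its work is done."""
    done = (migrate_at - start) * slow
    return migrate_at + (work - done) / fast

# uD and uE start at time 1 on slow processors and migrate at time 2.
finish_d = finish_after_migration(1.0, 2.0, SPEED_D["t3"], SPEED_D["t2"])
finish_e = finish_after_migration(1.0, 2.0, SPEED_E["t4"], SPEED_E["t1"])

# uF becomes ready when its last predecessor finishes and runs at speed 1 on t2.
makespan = max(finish_d, finish_e) + WORK / 1.0

print(round(finish_d, 2), round(finish_e, 2), round(makespan, 2))
```

uD completes one third of its work before migrating and uE one quarter, which is why they finish at different times even though both run at speed 1 after time 2.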
It is clear that when a task $u_i$ is executing on its $x$th fastest
processor, all the processors that are
faster than the $x$th fastest processor for $u_i$ are busy. In
other words, when a task $u_i$ is ready to be dispatched to a
platform with $(x - 1) \le M$ busy processors, the scheduler
may in the worst case dispatch $u_i$ to the processor type with
speed $Prf_x^i$ (the $x$th fastest speed for $u_i$).
III. PLATFORM CHARACTERIZATION
Formally characterizing the platform by specifying its capacities
is a prerequisite for the schedulability analysis presented
in this paper. This section shows how the concepts
of processor capacity and uniformity already presented
in [5], [6] for uniform (related) multiprocessors are adapted
to general unrelated multiprocessor platforms. The term
“heterogeneity” is used instead of “uniformity” in this paper for
unrelated multiprocessors.
First, subsection III-A describes an approach to model all
the possible mappings (each such mapping is called a permu-
tation) of the tasks to the different types of processors (this is
the basis for the permutation based approach to compute the
makespan). Second, subsection III-B formally presents the no-
tion of processor capacity and heterogeneity for permutation-
based makespan calculation. In the first permutation-based
approach, the processor capacity and the heterogeneity are
computed for each individual permutation. Then individual
(i.e., permutation-specific) makespan is computed considering
individual permutation. Finally, the maximum over all the
computed makespans of all the permutations is reported as the
makespan of the entire application. In our second permutation-
based approach, we compute a common processor capacity and
heterogeneity that are applicable to any permutation. Then, the
makespan is computed for any arbitrary permutation but using
the common processor capacity and heterogeneity.
Subsection III-C presents a modified version of the notion of
processor capacity and heterogeneity that do not even need to
rely on any permutation and can be computed very efficiently
(i.e., in polynomial time), which in turn is the basis for an
efficient makespan calculation (our third approach).
A. Modeling of all possible mappings
In this paper, the first two permutation-based approaches
determine the makespan of a parallel application by con-
sidering all the different possible mappings (i.e., execution
scenarios) of the tasks on different processor types. The worst-
case execution scenario for which the completion time of the
application is maximized can be determined based on such
mappings. To that end, we define permutations of the tasks
where each such permutation corresponds to one particular
execution scenario as follows.
We define $\pi$ as one permutation of $p$ tasks selected from
the $N$ tasks of the application. For a given permutation $\pi$, we
denote by $\pi_x$ the $x$th task in $\pi$. For example, consider a
permutation $\pi = \langle u_2, u_1, u_5, u_9 \rangle$ of size 4, where $\pi_3 = u_5$.
The set of all the permutations¹ of size $p$ selected from
$N$ different tasks is denoted by $\sigma_p$.
Consider an arbitrary schedule of an application with
$N$ tasks. Let $B_p^{\pi}$ be the sum of the lengths of the intervals during
which exactly $p$ processors are busy executing the tasks of
permutation $\pi$ of size $p$. We define $B_p$ to be the sum of the
lengths of the time intervals during which exactly $p$ processors are
busy in the schedule. Consequently,
$$B_p = \sum_{\pi \in \sigma_p} B_p^{\pi} \tag{2}$$
For the example schedule of Figure 2, Figure
3 presents the active $B_p^{\pi}$. The horizontal axis is time and the vertical
axis presents the processor capacity. First,
$B_1$ is composed of three permutations with the tasks A, E
and F, since each of them is executed while exactly 1 processor is busy. Next,
$B_2$ is composed of one permutation with the
tasks D and E, since these are the only tasks that are executed
while exactly 2 processors are busy. Furthermore, $B_3 = 0$
since there is no instant at which exactly 3 processors are busy.
Finally, $B_4$ is composed of one permutation with the tasks
that are executing when exactly 4 processors are busy, namely
{B, C, D, E}.
Fig. 3. Busy processor capacity based on schedule given by Figure 2.
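As a sketch of how the busy-interval lengths $B_p$ can be extracted from a concrete schedule, the snippet below sweeps over the execution intervals described for Figure 2. The interval end points are taken from the walk-through (with 2.67 written as $2 + 2/3$); the function name `busy_lengths` is hypothetical, and each task is assumed to occupy exactly one processor while it runs.

```python
from collections import defaultdict

# Execution intervals (start, end) per task, read off the Figure 2 walk-through.
INTERVALS = {
    "A": (0.0, 1.0),
    "B": (1.0, 2.0),
    "C": (1.0, 2.0),
    "D": (1.0, 2.0 + 2 / 3),
    "E": (1.0, 2.75),
    "F": (2.75, 3.75),
}

def busy_lengths(intervals):
    """Return {p: B_p}, the total time during which exactly p tasks run."""
    points = sorted({t for span in intervals.values() for t in span})
    totals = defaultdict(float)
    for left, right in zip(points, points[1:]):
        # Count the tasks whose interval covers the whole slice [left, right].
        running = sum(1 for s, e in intervals.values() if s <= left and right <= e)
        if running:
            totals[running] += right - left
    return dict(totals)

B = busy_lengths(INTERVALS)
print({p: round(length, 3) for p, length in sorted(B.items())})
```

No entry is produced for $p = 3$, in line with $B_3 = 0$, and the values of $B_p$ sum to the schedule length of 3.75.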
When exactly $p$ processors are busy executing the
tasks that belong to $\pi$, a part of the total workload ($C$)
is consumed. When some of these tasks also belong to the
critical path, some part of the workload of the critical path
($L$) is also consumed. We define $C_p^{\pi}$ as the amount of the
total workload that is consumed when $p$ processors are busy
with tasks belonging to $\pi$. The amount of the total workload consumed
when $p$ processors are busy is denoted by $C_p$ such that
$C_p = \sum_{\pi \in \sigma_p} C_p^{\pi}$. Therefore, the following holds:
$$C = \sum_{p=1}^{M} C_p = \sum_{p=1}^{M} \sum_{\pi \in \sigma_p} C_p^{\pi} \tag{3}$$
Similarly, let $L_p^{\pi}$ be the amount of workload that belongs
to the critical path $cp$ and is consumed when $p$ processors are busy with the tasks
that are present in $\pi$. The sum of the workload that belongs
to $cp$ when $p$ processors are busy, over all the permutations of
size $p$, is denoted by $L_p$ such that $L_p = \sum_{\pi \in \sigma_p} L_p^{\pi}$. Since
the scheduler is greedy, the workload of the critical path is
executing whenever no more than $(M - 1)$ processors
are busy. Therefore, the workload on the critical path $L$ is
lower bounded as follows:
$$L \ge \sum_{p=1}^{M-1} \sum_{\pi \in \sigma_p} L_p^{\pi} \tag{4}$$
Note that if the schedule does not exhibit a certain permutation
$\pi$ of size $p$, then $L_p^{\pi} = 0$ and $C_p^{\pi} = 0$ for that
permutation.
¹ The total number of permutations of size $p$ selected from $N$ tasks is $\frac{N!}{(N-p)!}$.
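Eq. (3) can be checked numerically on the schedule of Figure 2. The sketch below interprets "workload consumed" as speed multiplied by time (an assumption consistent with the normalized speeds of Table II) and splits each migrating task into one segment per processor. The name `workload_by_busy_count` is hypothetical.

```python
from collections import defaultdict

# (task, start, end, speed) segments of the Figure 2 schedule; a migrating
# task contributes one segment per processor it runs on.
SEGMENTS = [
    ("A", 0.0, 1.0, 1.0),
    ("B", 1.0, 2.0, 1.0),
    ("C", 1.0, 2.0, 1.0),
    ("D", 1.0, 2.0, 1 / 3), ("D", 2.0, 2.0 + 2 / 3, 1.0),
    ("E", 1.0, 2.0, 0.25), ("E", 2.0, 2.75, 1.0),
    ("F", 2.75, 3.75, 1.0),
]

def workload_by_busy_count(segments):
    """Return {p: C_p}: normalized work done while exactly p processors are busy."""
    points = sorted({t for _, s, e, _ in segments for t in (s, e)})
    consumed = defaultdict(float)
    for left, right in zip(points, points[1:]):
        # Speeds of all segments that cover the whole slice [left, right].
        active = [spd for _, s, e, spd in segments if s <= left and right <= e]
        if active:
            consumed[len(active)] += sum(active) * (right - left)
    return dict(consumed)

C_p = workload_by_busy_count(SEGMENTS)
total = sum(C_p.values())
print({p: round(w, 3) for p, w in sorted(C_p.items())}, round(total, 3))
```

The per-level values add up to the total workload $C = 6$ of the example, as Eq. (3) requires.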
B. Parameters for the permutation based approaches
If exactly $x$ processors are busy when task $u_i$ in permutation
$\pi$ is being executed, then the execution of $u_i$ is the longest if
it is executed on its $x$th fastest processor type, where such a
processor provides speed $\delta_x^{\pi_x}$ to task $u_i$. To calculate the overall
processor capacity and the heterogeneity of the platform, we
need to consider, for the sake of worst-case analysis, the
speed $\delta_x^{\pi_x}$ for executing $u_i$ whenever exactly $x$ processors are
busy.
The overall processor capacity for a particular permutation
$\pi$, denoted by $S_x^{\pi}$, shows the overall capability of the platform
to execute the tasks in $\pi$ and is defined as follows:
$$S_x^{\pi} = \sum_{k=1}^{x} \delta_k^{\pi_k} \tag{5}$$
where $\delta_k^{\pi_k}$ is the speed which the $k$th task (i.e., the task with index
$\pi_k$) in permutation $\pi$ uses for execution.
The makespan of a parallel application also depends on
how much capacity is wasted in the worst case, i.e., the
capacity that cannot be used to execute the tasks of the parallel
application: the higher the wastage in capacity, the longer
the makespan. We quantify such wastage using the notion
of heterogeneity in Eq. (6). When at most $x$ processors are busy
executing the nodes in a permutation $\pi$, the accumulated
wastage is maximized if the node on the critical path executes
on the slowest of these $x$ processors, i.e., with speed $\delta_x^{\pi_x}$. Based on these observations, the
heterogeneity for a given permutation $\pi$ is given by Eq. (6),
which expresses the maximum accumulated capacity loss of the
platform. Assuming that the critical path is executing on a
processor having speed $\delta_x^{\pi_x}$, the numerator in Eq. (6) is the
sum of the speeds of the processors that are idle, while the
denominator is the speed of the processor executing the tasks
of the critical path.
$$\lambda^{\pi} = \max_{x=1}^{M} \left\{ \frac{S_M^{\pi} - S_x^{\pi}}{\delta_x^{\pi_x}} \right\} \quad \text{where } \delta_x^{\pi_x} \neq 0 \tag{6}$$
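The minimum processor capacity (denoted by $S_M^{\min}$) and
the maximum heterogeneity (denoted by $\lambda^{\max}$) over all the
permutations in the set $\sigma_M$ are given as follows:
$$S_M^{\min} = \min_{\pi \in \sigma_M} \{ S_M^{\pi} \} \tag{7}$$
$$\lambda^{\max} = \max_{\pi \in \sigma_M} \{ \lambda^{\pi} \} \tag{8}$$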
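The permutation-based parameters of Eqs. (5) to (8) can be evaluated by brute force on the running example. The sketch below rests on two assumptions that the text does not spell out: the platform has one processor of each of the four types ($M = 4$, as in Figure 2), and $\delta_k^{\pi_k}$ is read as the $k$th fastest speed of task $\pi_k$. Exact fractions are used to avoid rounding; all function names are hypothetical.

```python
from fractions import Fraction
from itertools import permutations

# WCETs from Table I; None marks an incompatible processor type.
WCET = {
    "A": (1, 1, 3, None),
    "B": (1, 2, 4, 5),
    "C": (2, 1, None, 5),
    "D": (2, 1, 3, 4),
    "E": (1, 3, 6, 4),
    "F": (2, 1, 3, 4),
}
M = 4  # ASSUMPTION: one processor of each of the four types

def prf(times):
    """Speeds of one task (Eq. (1)) sorted in non-increasing order."""
    fastest = min(e for e in times if e is not None)
    speeds = [Fraction(0) if e is None else Fraction(fastest, e) for e in times]
    return sorted(speeds, reverse=True)

PRF = {task: prf(times) for task, times in WCET.items()}

def capacity(pi, x):
    """Eq. (5): S_x^pi, the k-th task of pi running at its k-th fastest speed."""
    return sum(PRF[pi[k]][k] for k in range(x))

def heterogeneity(pi):
    """Eq. (6): lambda^pi, skipping positions whose speed is zero."""
    total = capacity(pi, len(pi))
    ratios = [(total - capacity(pi, x)) / PRF[pi[x - 1]][x - 1]
              for x in range(1, len(pi) + 1) if PRF[pi[x - 1]][x - 1] != 0]
    return max(ratios)

sigma_M = list(permutations(WCET, M))                 # all N!/(N-M)! permutations
S_min = min(capacity(pi, M) for pi in sigma_M)        # Eq. (7)
lam_max = max(heterogeneity(pi) for pi in sigma_M)    # Eq. (8)

print(len(sigma_M), S_min, lam_max)
```

Even for this six-task example the enumeration visits 360 permutations, which illustrates why the exhaustive approaches are practical only for small platforms.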
C. Parameters for the efficient approach
The definition of the busy interval $B_p$ in Eq. (2) requires us to
consider all the $O(N^M)$ permutations, which is exponential
and computationally impractical for large $N$ and $M$. To find
the makespan for larger systems, we need a computationally
efficient way to find the processor capacity and heterogeneity.
The assumed scheduler will dispatch a task $u_i$ in the worst
case to its $x$th fastest processor type ($Prf_x^i$) when exactly $x$
processors are busy. In the worst case, all the $(x - 1)$ preferred
processors of task $u_i$ are busy executing some other tasks.
As a result, the idle processor that guarantees the shortest
WCET for task $u_i$ is $Prf_x^i$.
The capacity of the platform can also be defined independently
of any permutation by assuming the minimum speed that
each processor offers to any task. Recall that the greedy scheduler
ensures that at least one processor executes some task with
the most preferred speed, one processor executes some task
at least with the second most preferred speed, one processor
executes some task at least with the third most preferred
speed, and so on. In other words, the minimum $x$th speed
that the platform provides to some task is $\min_{i=1}^{N}\{Prf_x^i\}$,
which can be computed in polynomial time. To this end, the
minimum processor capacity ($S_M$) is determined by summing
the minimum speeds of the $M$ processors as follows:
$$S_M = \sum_{x=1}^{M} \min_{i=1}^{N} \{ Prf_x^i \} \tag{9}$$
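A minimal sketch of Eq. (9) on the data of Table I is shown below. It assumes one processor per processor type ($M = 4$), so that $Prf^i$ has exactly $M$ entries; the names `prf`, `PRF` and `S_M` are illustrative.

```python
from fractions import Fraction

# WCETs from Table I; None marks an incompatible processor type.
WCET = {
    "A": (1, 1, 3, None),
    "B": (1, 2, 4, 5),
    "C": (2, 1, None, 5),
    "D": (2, 1, 3, 4),
    "E": (1, 3, 6, 4),
    "F": (2, 1, 3, 4),
}
M = 4  # ASSUMPTION: one processor of each of the four types

def prf(times):
    """Speeds of one task (Eq. (1)) sorted in non-increasing order."""
    fastest = min(e for e in times if e is not None)
    speeds = [Fraction(0) if e is None else Fraction(fastest, e) for e in times]
    return sorted(speeds, reverse=True)

PRF = {task: prf(times) for task, times in WCET.items()}

# Eq. (9): for each rank x, take the smallest x-th fastest speed over all
# tasks, then sum the M minima.
S_M = sum(min(PRF[task][x] for task in PRF) for x in range(M))

print(S_M, float(S_M))
```

The computation needs one sort per task and one pass per rank, so its cost grows polynomially with $N$ and $M$ instead of with the number of permutations.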
Similarly, we also define the total capacity loss. The maximum
speed that the $y$th processor provides to any task is
$\max_{j=1}^{N}\{\delta_y^j\}$. The maximum unused capacity $idle_x^i$ when task
$u_i$ is executing on its $x$th fastest processor is given by Eq. (10)
as follows:
$$idle_x^i = \sum_{y \in Sl_x^i} \max_{j=1}^{N} \{ \delta_y^j \} \tag{10}$$
where $Sl_x^i$ is the set of processors having speed slower than
the $x$th fastest processor for task $u_i$. Note that Eq. (10) can
also be computed in polynomial time.
When some processors are idle, some processor capacity is lost.
Because the scheduler is greedy, we
know that when some processors are idle, nodes of the critical
path are executed. The duration of the capacity loss
is determined by the execution time (i.e., the
cumulative length of the intervals of execution of the nodes in the
critical path) needed to consume the workload of the critical path.
Since we do not know where the
tasks that belong to the critical path are going to be executed during the actual execution,
we assume that in the worst case a task $u_i$ belonging to the
critical path will be executed on its $x$th fastest processor given
that exactly $x$ processors are busy. Eq. (11) maximizes the
accumulated capacity loss, denoted by $\lambda$, considering all the
tasks and all the processors.
$$\lambda = \max_{i=1}^{N} \left\{ \max_{x=1}^{M} \left\{ \frac{idle_x^i}{\delta_x^i} \right\} \right\} \quad \text{where } \delta_x^i \neq 0 \tag{11}$$
Comparing the heterogeneity of Eq. (6) and the heterogeneity
of Eq. (11), we note that they both express the processor
capacity loss. However, Eq. (11) introduces some additional
pessimism since it considers the maximum unused processor
capacity over all the $N$ tasks, while Eq. (6) is calculated
for a particular permutation. A table with the description of
the parameters can be found in Appendix C, Table IV.
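Eqs. (10) and (11) can be sketched on the example as follows. Several interpretive assumptions are made and should be read as such: there is one processor per type, so processor $y$ has type $y$; the set $Sl_x^i$ is taken to be the processors ranked after position $x$ when the processors are sorted by speed for task $u_i$ (ties broken by processor index); and the denominator $\delta_x^i$ is read as the speed of the $x$th fastest processor for $u_i$.

```python
from fractions import Fraction

# WCETs from Table I; None marks an incompatible processor type.
WCET = {
    "A": (1, 1, 3, None),
    "B": (1, 2, 4, 5),
    "C": (2, 1, None, 5),
    "D": (2, 1, 3, 4),
    "E": (1, 3, 6, 4),
    "F": (2, 1, 3, 4),
}
PROCS = range(4)  # ASSUMPTION: one processor per type, processor y has type y

def speeds(times):
    """Eq. (1): normalized speeds of one task, in processor order."""
    fastest = min(e for e in times if e is not None)
    return [Fraction(0) if e is None else Fraction(fastest, e) for e in times]

DELTA = {task: speeds(times) for task, times in WCET.items()}

# Largest speed that processor y offers to any task (inner max of Eq. (10)).
col_max = [max(DELTA[task][y] for task in DELTA) for y in PROCS]

def heterogeneity():
    """Eq. (11): maximum over tasks and ranks of idle_x^i / delta_x^i."""
    best = Fraction(0)
    for row in DELTA.values():
        # Processors from fastest to slowest for this task (stable on ties).
        order = sorted(PROCS, key=lambda y: -row[y])
        for x, proc in enumerate(order, start=1):
            if row[proc] == 0:
                continue  # Eq. (11) excludes zero speeds
            idle = sum(col_max[y] for y in order[x:])  # Eq. (10) over Sl_x^i
            best = max(best, idle / row[proc])
    return best

lam = heterogeneity()
print(lam, float(lam))
```

Under these assumptions the maximum is attained by uE on its second fastest processor, where the two slower processors could offer up to $1/3 + 1/4$ to other tasks while uE itself runs at speed $1/3$.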
IV. MAKESPAN CALCULATION
This section presents three different approaches for calculating
an upper bound on the makespan, denoted by $T_M$,
of a parallel application on a heterogeneous platform.
First, Section IV-A presents the first combinatorial approach to
calculate the makespan. Next, Section IV-B introduces
the second combinatorial approach, which is then used to
prove the efficient makespan calculation.
A. Permutation based makespan
We need the following lemma, which states that the value
of $\lambda^{\pi}$ in Eq. (6) for a permutation $\pi$ of size $M$ is an upper
bound on the value of $\lambda^{\pi}$ for the same permutation $\pi$ of size
$p$, where $p \le M$.
Lemma 1. Given a permutation $\pi$ of size $M$, the $\lambda^{\pi}$ in Eq. (6)
is larger than or equal to that of the same permutation of size $p$, where
$p \le M$.
Proof. From the definition of a permutation, for any permutation
$\pi^s$ of size $p \le M$ there is a corresponding permutation
$\pi^b$ of size $M$ whose first $p$ tasks are the tasks of
$\pi^s$. Applying the definition of $\lambda^{\pi}$ to the two permutations
$\pi^s$ and $\pi^b$, it can be seen that $\pi^b$ covers all the cases for the
different processor speeds of $\pi^s$, in addition to some
extra cases, since $\pi^b$ includes all the tasks in $\pi^s$.
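Theorem 1. Permutation based makespan (PM1): The
makespan of application $G$ characterized by $C$ and $L$ on an
unrelated multiprocessor platform is bounded by:
$$T_M \le \max_{\pi \in \sigma_M} \left( \frac{\lambda^{\pi}}{S_M^{\pi}} \right) \cdot L + \max_{\pi \in \sigma_M} \left( \frac{1}{S_M^{\pi}} \right) \cdot C \tag{12}$$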
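Proof. Consider the busy interval $B_p^{\pi}$ in an arbitrary schedule
of $G$ for the permutation $\pi$. During $B_p^{\pi}$, the $p$th task
belonging to $\pi$ will in the worst case be executed with speed
$\delta_p^{\pi_p} \neq 0$. Let $\chi(\gamma, \delta_p^{\pi_p})$ denote the total amount of workload
completed at speed $\delta_p^{\pi_p}$ along an arbitrary path $\gamma$, where it
holds that:
$$B_p^{\pi} \cdot \delta_p^{\pi_p} \le \chi(\gamma, \delta_p^{\pi_p})$$
Let the total workload of the nodes that belong to path $\gamma$ be
$WB_{\gamma}$. Since the longest path is $cp$ with total workload $L_p^{\pi}$
belonging to permutation $\pi$, the actual workload is bounded
as follows: $\chi(\gamma, \delta_p^{\pi_p}) \le WB_{\gamma} \le L_p^{\pi}$, and we have:
$$B_p^{\pi} \cdot \delta_p^{\pi_p} \le L_p^{\pi}$$
From the definition of $\lambda^{\pi}$ given in Eq. (6), $\lambda^{\pi} \ge (S_M^{\pi} - S_p^{\pi}) / \delta_p^{\pi_p}$,
i.e., $\delta_p^{\pi_p} \ge (S_M^{\pi} - S_p^{\pi}) / \lambda^{\pi}$, and we have:
$$\implies B_p^{\pi} \cdot \frac{S_M^{\pi} - S_p^{\pi}}{\lambda^{\pi}} \le L_p^{\pi} \tag{13}$$
Equivalently,
$$B_p^{\pi} \cdot (S_M^{\pi} - S_p^{\pi}) \le L_p^{\pi} \cdot \lambda^{\pi} \tag{14}$$
When $p$ processors are busy with tasks from permutation $\pi$,
$S_p^{\pi}$ processor capacity is used and the workload
that is consumed is bounded by the total workload $C_p^{\pi}$. So it
holds that $B_p^{\pi} \cdot S_p^{\pi} \le C_p^{\pi}$, and by adding this inequality to
Eq. (14) we have:
$$B_p^{\pi} \cdot (S_M^{\pi} - S_p^{\pi}) + B_p^{\pi} \cdot S_p^{\pi} \le L_p^{\pi} \cdot \lambda^{\pi} + C_p^{\pi}$$
$$B_p^{\pi} \le \frac{L_p^{\pi} \cdot \lambda^{\pi} + C_p^{\pi}}{S_M^{\pi}}$$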
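By summing over all $M$ processors for a given $\pi$:
$$\implies \sum_{p=1}^{M} B_p^{\pi} \le \sum_{p=1}^{M} \left( \frac{L_p^{\pi} \cdot \lambda^{\pi} + C_p^{\pi}}{S_M^{\pi}} \right)$$
$$\implies \sum_{p=1}^{M} \sum_{\pi \in \sigma_p} B_p^{\pi} \le \sum_{p=1}^{M} \sum_{\pi \in \sigma_p} \left( \frac{L_p^{\pi} \cdot \lambda^{\pi} + C_p^{\pi}}{S_M^{\pi}} \right)$$
Since the scheduler is greedy, from Eq. (2) it holds that
$T_M = \sum_{p=1}^{M} B_p = \sum_{p=1}^{M} \sum_{\pi \in \sigma_p} B_p^{\pi}$, and we have:
$$\implies T_M \le \sum_{p=1}^{M} \sum_{\pi \in \sigma_p} \frac{\lambda^{\pi}}{S_M^{\pi}} \cdot L_p^{\pi} + \sum_{p=1}^{M} \sum_{\pi \in \sigma_p} \frac{1}{S_M^{\pi}} \cdot C_p^{\pi}$$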
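From Lemma (1) and Eq. (5) (by setting $x = M$) we have:
$$\implies T_M \le \sum_{p=1}^{M} \sum_{\pi \in \sigma_p} \max_{\pi \in \sigma_M} \left\{ \frac{\lambda^{\pi}}{S_M^{\pi}} \right\} \cdot L_p^{\pi} + \sum_{p=1}^{M} \sum_{\pi \in \sigma_p} \max_{\pi \in \sigma_M} \left\{ \frac{1}{S_M^{\pi}} \right\} \cdot C_p^{\pi}$$
Since $\lambda^{\pi}$ and $S_M^{\pi}$ do not depend on the size of $\pi$:
$$T_M \le \max_{\pi \in \sigma_M} \left\{ \frac{\lambda^{\pi}}{S_M^{\pi}} \right\} \cdot \sum_{p=1}^{M} \sum_{\pi \in \sigma_p} L_p^{\pi} + \max_{\pi \in \sigma_M} \left\{ \frac{1}{S_M^{\pi}} \right\} \cdot \sum_{p=1}^{M} \sum_{\pi \in \sigma_p} C_p^{\pi}$$
From the definitions of $L$ and $C$ (Eq. (4) and Eq. (3)) and
because, in the worst case, $L_M^{\pi} = 0$ $\forall \pi \in \sigma_M$ (since we cannot guarantee that the critical
path executes when all the processors are busy), we have:
$$T_M \le \max_{\pi \in \sigma_M} \left( \frac{\lambda^{\pi}}{S_M^{\pi}} \right) \cdot L + \max_{\pi \in \sigma_M} \left( \frac{1}{S_M^{\pi}} \right) \cdot C$$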
B. Efficient makespan calculation
This section presents an efficient approach to calculate the
makespan of a parallel application modeled as a DAG and
executed on an unrelated multiprocessor platform. Lemma 2
first derives our second approach to compute the makespan
based on all the permutations. Lemma 3 and 4 show that
the heterogeneity and capacity given by Eq. (11) and Eq. (9)
calculated for the efficient approach in subsection III-C are
always more pessimistic than the heterogeneity and capacity
given by Eq. (8) and Eq. (7) derived for the permutation-based
approach. Finally, we present the EM makespan calculation
in Theorem 2.
Lemma 2. Permutation based makespan (PM2): The
makespan of application $G$ characterized by $C$ and $L$ on an
unrelated multiprocessor platform is bounded by:
$$T_M \le \frac{C + \lambda^{\max} \cdot L}{S_M^{\min}} \tag{15}$$
where $\lambda^{\max}$ and $S_M^{\min}$ are given in Eq. (8) and Eq. (7),
respectively.
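Proof. By applying Eq. (4) to Eq. (13), and since for an
arbitrary $\pi$ it holds that $\lambda^{\pi} \le \lambda^{\max}$, from the definition of $\lambda^{\pi}$
given in Eq. (6) we have:
$$\sum_{p=1}^{M-1} \sum_{\pi \in \sigma_p} B^{\pi}_{p} \cdot \frac{S^{\pi}_{M} - S^{\pi}_{p}}{\lambda^{\max}} \le L$$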
Equivalently,
$$\sum_{p=1}^{M-1} \sum_{\pi \in \sigma_p} B^{\pi}_{p} \cdot (S^{\pi}_{M} - S^{\pi}_{p}) \le \lambda^{\max} \cdot L \tag{16}$$
During $B^{\pi}_{p}$ the $p$ processors are busy with accumulated
processor capacity of $S^{\pi}_{p}$, which means that after $B^{\pi}_{p}$ time
units, the amount of workload that is done is $B^{\pi}_{p} \cdot S^{\pi}_{p}$. Since
the application completes when no processor is busy, the total
workload is given by $C = \sum_{p=1}^{M} \sum_{\pi \in \sigma_p} B^{\pi}_{p} \cdot S^{\pi}_{p}$. By adding
this term to both sides of Eq. (16), we get:
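$$\sum_{p=1}^{M-1} \sum_{\pi \in \sigma_p} \left[ B^{\pi}_{p} \cdot (S^{\pi}_{M} - S^{\pi}_{p}) + B^{\pi}_{p} \cdot S^{\pi}_{p} \right] + \sum_{\pi \in \sigma_M} B^{\pi}_{M} \cdot S^{\pi}_{M} \le C + \lambda^{\max} \cdot L$$
Equivalently,
$$\sum_{p=1}^{M-1} \sum_{\pi \in \sigma_p} B^{\pi}_{p} \cdot S^{\pi}_{M} + \sum_{\pi \in \sigma_M} B^{\pi}_{M} \cdot S^{\pi}_{M} \le C + \lambda^{\max} \cdot L$$
Equivalently,
$$\sum_{p=1}^{M} \sum_{\pi \in \sigma_p} B^{\pi}_{p} \cdot S^{\pi}_{M} \le C + \lambda^{\max} \cdot L$$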
Since the different permutations can have different processor
capacity, we lower bound the processor capacity with the
minimum processor capacity that is at least offered for any
permutation. Based on the definition of $S^{\min}_{M}$ given by Eq. (7)
and by the definition of $B_p$ given by Eq. (2), we have:
$$\implies \sum_{p=1}^{M} B_p \cdot S^{\min}_{M} \le C + \lambda^{\max} \cdot L$$
$$\sum_{p=1}^{M} B_p \le \frac{C + \lambda^{\max} \cdot L}{S^{\min}_{M}}$$
Since the scheduler is greedy it holds that $T_M = \sum_{p=1}^{M} B_p$
and consequently
$$T_M \le \frac{C + \lambda^{\max} \cdot L}{S^{\min}_{M}}$$
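To make the relation between the two permutation-based bounds concrete, the following is a minimal Python sketch that evaluates the final bound of Theorem 1 and the bound of Eq. (15) from per-permutation values. It assumes the values $\lambda^{\pi}$ and $S^{\pi}_{M}$ have already been computed for every permutation of size $M$; the function names and the numbers in `perms` are hypothetical and chosen only for illustration.

```python
from fractions import Fraction

def pm1_bound(C, L, perms):
    """Final bound of Theorem 1: max(lambda/S) * L + max(1/S) * C.

    perms holds one (lambda_pi, S_pi_M) pair per permutation of size M.
    """
    max_ratio = max(Fraction(lam) / Fraction(s) for lam, s in perms)
    max_inverse = max(1 / Fraction(s) for _, s in perms)
    return max_ratio * L + max_inverse * C

def pm2_bound(C, L, perms):
    """Eq. (15): (C + lambda_max * L) / S_min_M."""
    lam_max = max(Fraction(lam) for lam, _ in perms)   # Eq. (8), as assumed here
    s_min = min(Fraction(s) for _, s in perms)         # Eq. (7), as assumed here
    return (C + lam_max * L) / s_min

# Hypothetical per-permutation values, not taken from the paper.
perms = [(Fraction(3, 2), 3), (2, Fraction(7, 2)), (1, Fraction(5, 2))]
C, L = 1000, 100

print(pm1_bound(C, L, perms))  # 3200/7
print(pm2_bound(C, L, perms))  # 480
```

In this example the maximum heterogeneity and the minimum capacity come from different permutations, so the bound of Eq. (15) is larger than that of Theorem 1.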
The proofs of the following Lemma (3) and Lemma (4) are
given in the Appendix A.
Lemma 3. The heterogeneity $\lambda$ is always greater than or equal
to the maximum heterogeneity $\lambda^{\max}$ among the different
permutations.
$$\lambda^{\max} \le \lambda \tag{17}$$
Lemma 4. The processor capacity $S_M$ is always less than or equal
to the minimum processor capacity $S^{\min}_{M}$ among the different
permutations.
$$S^{\min}_{M} \ge S_M \tag{18}$$
Theorem 2. Efficient makespan (EM): The makespan of
application G characterized by C and L on an unrelated
multiprocessor platform is safely upper bounded by:
$$T_M \le \frac{C + \lambda \cdot L}{S_M} \tag{19}$$
Proof. From Lemma (3) and Lemma (4) it follows that the
makespan calculated by using $\lambda$ and $S_M$ is never smaller
than the makespan calculated from Lemma (2). Since
the makespan from Lemma (2) is safe, the makespan given by
Eq. (19) is also safe (i.e., an upper bound).
Since $\lambda$ and $S_M$ can be computed in polynomial time in
terms of the number of tasks and number of processors, the
makespan in Eq. (19) can be computed in polynomial time.
The EM makespan calculation method can also be applied
to the more specialized platform models: related and homogeneous
multiprocessors. The proposed makespan calculation
is then the same as the approaches provided in
[6], [14] for the related and the homogeneous platform model,
respectively. For example, the homogeneous model can be
modeled by assuming that $\delta^{i}_{t} = 1, \forall u_i \in G, 1 \le t \le H$.
With this assumption $S_M = M$ and $\lambda = M - 1$. As a
result, the makespan calculation with the EM approach given by
Theorem 2 is the same as $T_M = L + (C - L)/M$ given
by Lemma IV.1 in [14].
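The reduction to the homogeneous case can be checked numerically. The sketch below evaluates Eq. (19) with $S_M = M$ and $\lambda = M - 1$ and compares it with $L + (C - L)/M$. The function names are hypothetical; the values of C and L are those listed for Fibonacci in Table III.

```python
from fractions import Fraction

def em_bound(C, L, lam, s_m):
    """Eq. (19): T_M <= (C + lambda * L) / S_M, in exact arithmetic."""
    return (Fraction(C) + Fraction(lam) * Fraction(L)) / Fraction(s_m)

def homogeneous_bound(C, L, M):
    """Homogeneous bound L + (C - L) / M, as quoted from Lemma IV.1 in [14]."""
    return Fraction(L) + Fraction(C - L, M)

C, L = 8756400, 8000  # Fibonacci row of Table III
for M in (1, 2, 8, 1024):
    # Homogeneous platform: S_M = M and lambda = M - 1.
    assert em_bound(C, L, M - 1, M) == homogeneous_bound(C, L, M)

print(em_bound(C, L, 7, 8))  # 1101550
```

Exact rational arithmetic is used so that the equality check is not affected by floating-point rounding.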
V. SIMULATION FRAMEWORK
This section presents the simulation framework that is used
to evaluate the proposed methods for the makespan calculation.
We use the same technique to generate the DAG models as in
[11] for four applications, Fibonacci, Sort, FFT and Strassen,
from the BOTS benchmark suite [12]. These applications are
widely used in many different fields of computing (e.g., data
processing, sorting, scientific applications, image processing,
etc.). The application Fibonacci is a recursive parallel application
with a tree-like structure and is a good representative of
many recursive applications. The application Sort is a common
operation in almost all fields of computing. The application
Strassen is an efficient matrix multiplication algorithm that is
used in many scientific applications. Finally, FFT is widely
used in signal and image processing.
The considered applications have thousands of tasks. How-
ever, we note that the applications have only a few different
task types (tasks that are functionally equivalent). For example,
Fibonacci with input 20 has 32836 tasks while there is only
1 task type, the one that calculates the Fibonacci numbers.
Similarly, for Sort, FFT and Strassen, there are 2, 3 and 1 task
types, respectively. The number of task types is similarly small for
applications implemented with the OmpSs programming model:
Cholesky factorization, QR factorization, Heat diffusion, and
Integral Histogram have 4, 4, 3 and 2 task types, respectively,
for a few thousand tasks [15].
Based on the model of the applications from [11], there
are three categories of nodes for every task type: Spawn,
Base and Sync nodes, which can have different WCETs; we
use 300, 400 and 100 time units, respectively, for their
$e^{\min}_{i}$. The Spawn node generates the parallel work, the Base
node performs the actual functionality of the task type, and
the Sync node synchronizes the output of the tasks that are
generated from the same Spawn node. Although there are
three node categories for each task type, the total number of possible
pairs of task types and node categories is significantly smaller
than the total number of tasks. Tasks that have
the same WCET for the different processors will lead to the
same permutations. Consequently, we need to calculate
the permutations based only on the different task types, which
makes the complexity exponential in the number of different task
types rather than exponential in the number of tasks.
The WCET of a task for the different processor types is
generated by adding a randomly generated value to $e^{\min}_{i}$
(the minimum WCET among the different processor types),
which is given by the configuration of the applications. The
randomly generated value is limited to a range given
by the parameter Limit. More formally, the
WCET of a task for the different processor types is given
as follows: $e^{t}_{i} = e^{\min}_{i} + Rand(0, Limit)$.
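The WCET generation rule can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the text does not say whether $Rand(0, Limit)$ draws integers or reals, so a uniform integer draw with both bounds included is assumed here, and the function name `generate_wcets` is hypothetical.

```python
import random

def generate_wcets(e_min, num_types, limit, rng):
    """Return one WCET per processor type: e_min + Rand(0, limit)."""
    # Assumption: Rand(0, Limit) is a uniform integer draw, bounds included.
    return [e_min + rng.randint(0, limit) for _ in range(num_types)]

rng = random.Random(42)  # fixed seed so that a run can be repeated
# A Spawn node (e_min = 300) on a platform with 4 processor types, Limit = 100.
spawn_wcets = generate_wcets(300, 4, 100, rng)
print(spawn_wcets)
```

Every generated value lies in the range $[e^{\min}_{i}, e^{\min}_{i} + Limit]$, so a larger Limit widens the spread of the WCETs across the processor types.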
The configuration of the applications is presented in Table
III. The columns are the different applications (Fibonacci, Sort,
Strassen and FFT). The first row is the input of the applications,
and the second is the total number of nodes that the applications
have. The third and fourth rows give the total workload (C)
and the workload of the critical path (L), which characterize
the $G^{\min}$ of the applications. The next row presents the ratio of
the workload of the critical path to the total workload. Finally,
the last row shows the number of task types.
              Fib       Sort      Strassen  FFT
Input         20        32768     512       8192
#Nodes        32836     16043     22410     23748
C             8756400   4403300   7843300   6221400
L             8000      14900     2500      51020
L/C           0.0009    0.003     0.0003    0.008
#Task types   3         6         3         9
TABLE III
APPLICATION CONFIGURATIONS
In addition to the real applications, experiments with synthetic
DAGs were performed. A synthetic DAG is modeled by
following a structure similar to that of the applications. We generated
a fully balanced tree together with the mirror tree for the Sync
nodes. The maximum degree of the Spawn nodes and the
maximum height of the DAG can be set as parameters. A time
budget is assigned to every Spawn node, which is responsible
for distributing it to its child nodes and the corresponding
Sync node to get the desired C and L characteristics of the
DAG. Next, the number of task types is given as a parameter
to the DAG. The WCETs are then randomly generated using
the Limit parameter, with the same approach
that we used for the applications.
For our simulations we use the following evaluation metrics:
Tightness: To the best of our knowledge, no other related
work provides a closed form solution for the makespan
calculation of parallel applications modeled as DAGs on
an unrelated multiprocessor platform. We compare the three
approaches for makespan calculation that we present in Sec-
tion IV. The exhaustive makespan calculations PM1 and
PM2 given by Theorem (1) and Lemma (2) respectively are
compared to the EM makespan given by Theorem (2). The
tightness is defined as the ratios PM1/EM and PM2/EM .
Pessimism: To the best of our knowledge there is no
analysis that finds the exact makespan of DAGs on an
unrelated heterogeneous processor platform. However,
we derive a lower bound on the makespan by simulating
the actual execution of the parallel applications under the
assumed greedy scheduler where all the tasks are executed
for their WCET. If the application takes time Sim to finish its
execution, then the pessimism of our approach is defined as
the ratio Sim/EM. Note that even an exact analysis
would yield a makespan no smaller than Sim.
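The two metrics are plain ratios and can be written down directly. The following Python sketch uses hypothetical function names and hypothetical makespan values; it only illustrates the definitions PM1/EM, PM2/EM and Sim/EM given above.

```python
def tightness(pm, em):
    """Ratio PM/EM of a permutation-based makespan to the EM makespan."""
    if em <= 0:
        raise ValueError("EM makespan must be positive")
    return pm / em

def pessimism(sim, em):
    """Ratio Sim/EM of the simulated makespan to the EM makespan."""
    if em <= 0:
        raise ValueError("EM makespan must be positive")
    return sim / em

# Hypothetical makespans in time units, not measured values.
pm1, pm2, em, sim = 990.0, 995.0, 1000.0, 800.0
print(tightness(pm1, em), tightness(pm2, em), pessimism(sim, em))
```

Since EM is an upper bound on all three other quantities, both ratios are at most 1, and values closer to 1 indicate a tighter and less pessimistic bound.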
VI. SIMULATION RESULTS
This section presents the results of the evaluation. Section
VI-A presents the results of the makespan calculation for
different number of processors and Section VI-B for different
number of processor types. Section VI-C shows the impact of
processor heterogeneity on the makespan calculation. Finally
Section VI-D discusses the results from all the simulations.
Fig. 4. Tightness and pessimism of EM for different number of processors, where the number of processor types is H = min{8,M}
Fig. 5. Tightness and pessimism of EM for different processor types, where M = 8.
Fig. 6. Tightness and pessimism for different variations of the WCET.
A. Impact of changing the number of processors
Figure 4 presents the tightness and the pessimism of the
applications Fibonacci, Sort, Strassen and FFT for the different
number of processors where the number of processor types is
up to min{8,M}. The horizontal axis shows the number of
processors. The left vertical axis is the tightness and the right
vertical axis is the pessimism. The points without the dashed
line correspond to the tightness, while those with the dashed line
correspond to the pessimism. In this graph, the closer the values
are to 1, the better the tightness and the pessimism.
The makespan calculation of EM has polynomial time
complexity, and as a result, we generate the results for up to
1024 processors. In contrast, PM1 and PM2 are
permutation-based approaches with exponential time complexity,
so with our simulation setup we can evaluate them only for up
to 8 processors. The Limit is set to
100 and every makespan calculation was performed 100 times,
and the average is reported.
First, it can be seen that for 1 processor, all the approaches
are equal to C. Furthermore, for up
to 8 processors, the tightness (the overestimation of
the makespan) of EM compared to both PM1 and PM2 is
less than 1% on average and at most 1.2% for all the
applications (please note that PM1 and PM2 differ very
little and overlap in the graph). We have
performed the same simulations with Limit equal to 500
and 1000; the average tightness of the makespan is then slightly
higher than 1% and up to 3%. Next, regarding the pessimism,
it can be noted that as the number of processors increases
exponentially, the pessimism increases linearly. Compared to
Sim we have on average 23% and up to 59% pessimism.
B. Impact of changing the number of processor types
Figure 5 shows the tightness and the pessimism of the
makespan of EM for the different number of processor types
when the total number of processors is eight for the four
applications. The horizontal axis is the number of processor
types for the four applications, the left vertical axis is the
tightness and the right vertical axis is the pessimism. The
Limit for the random generation of the WCET is set to 100.
The tightness of EM, compared to the two permutation-based
approaches, is on average 1% and at maximum 1.3%.
Since for the calculation of the heterogeneity and the processor
capacity we did not distinguish the processor types, but
considered the total number of processors, it is expected to have
similar behavior to the results shown in Figure 4. As a result,
the three approaches are able to calculate the makespan
regardless of the number of different processor types that
exist in the system, and consequently the determining factor
for the makespan is the total number of processors.
Next, we can note that by increasing the number of processor
types the pessimism increases, since more processors do not
execute at speed 1, which leads to a longer
makespan. Compared to Sim we have on average 12% and
up to 24% more pessimism. This result shows that even if an exact
makespan could be calculated for a parallel application (which
is very unlikely to happen), our analysis would still provide an
upper bound on the makespan which is at most 24% larger than
the actual makespan. We, therefore, believe that our approach
to find the makespan using EM is quite effective for the
applications considered from the BOTS benchmark.
C. Impact of processor heterogeneity
To analyze the impact of the Limit factor, which shows
the variation of the WCET of a task among the different
processor types, we continue the simulations with a synthetic
DAG. Figure 6 illustrates the tightness and the pessimism
for different values of the Limit. The horizontal axis is the
Limit, the left vertical axis presents the tightness, and the right
vertical axis is the pessimism. Note that with Limit = 1000
we can have a variation on the WCET from 2.5x to 10x
for Spawn and Sync nodes, respectively, which have 400 and
100 time units for their $e^{\min}_{i}$ values; we intentionally use
extreme values to expose the limitations of EM. We have
C = 191400 and L = 5800 for 938 nodes and 3 task types,
with ratio $L/C = 0.03$. Note that this ratio is 1 to 2 orders of
magnitude higher compared to the BOTS applications, so the
impact of heterogeneity would be higher. We use a platform
with 4 processors and 2 processor types.
By increasing Limit, which can be seen as making the
platform more heterogeneous, the tightness of the EM approach
decreases: on average we have a 5% and at maximum an 11%
less tight makespan compared to the exhaustive approaches. This
increase in the makespan is due to the calculation of $\lambda$, which
takes the maximum unused accumulated capacity regardless
of how the tasks are executed, while PM1 and PM2 only
consider permutations where the same task is not allowed to
execute on more than one processor. Next, it can be observed
that the pessimism of EM compared to Sim increases as
the Limit increases, since more processors have smaller speed
and as a result the makespan of EM increases. Compared
to Sim we have on average 44% and up to 72% more
pessimism. Note that although such values may seem quite
high for our analysis, we would like to stress that the degree of
heterogeneity for higher Limit is quite pessimistic for many
practical heterogeneous platforms.
D. Discussion
In summary, we have seen that EM has little extra
pessimism compared to the exhaustive PM1 and PM2 approaches.
We have also noted that by increasing the hardware
heterogeneity (deviation of the WCET) the makespan can
increase significantly compared to Sim for the synthetic
DAGs (which are artificially created for stress testing), but for
the DAGs of the BOTS applications in consideration we have
observed small differences in tightness for the tested
values. Simulations to investigate compatibility are presented
in Appendix B to show how the makespan changes when
the number of incompatible tasks is increased. We have seen
that by increasing the number of incompatible processors the
makespan increases since the parallelism is limited.
VII. RELATED WORK
To the best of our knowledge, no other work considers
makespan calculation of parallel applications modeled as DAGs
executed on an unrelated multiprocessor platform.
A plethora of heterogeneous architectures have been proposed in the
literature: architectures with a single instruction set architecture
(ISA) and different microarchitectures [8], coexistence of
architectures with different ISAs [16], and special purpose
accelerators for convolution [17] and matrix multiplication [18].
The authors of [6] (in the first part of the paper) proposed a
makespan calculation for the related multiprocessor platform
with the use of uniformity and processor capacity, which
is based on previous work [5]. In [7] the authors extended the
scheduler of Cilk [19] for related heterogeneous systems where
the speeds of the processors are different and fixed throughout
the execution. They provide a makespan calculation methodology
for Cilk based applications and an approximation of
the makespan based on the average speed of the processors.
The related multiprocessor model is a special case of the
unrelated model. Our proposed makespan calculations can also
be applied to related multiprocessors, and our approach will
have the same makespan as in [6] for related machines.
The authors of [20], [21] provide static scheduling for
unrelated heterogeneous platforms where the applications are
modeled as DAGs with the goal to minimize the schedule
length. In [4], [9] the authors consider the problem of scheduling
independent tasks on a two-type unrelated heterogeneous
multiprocessor platform. Since the independent tasks model is
a special case of the DAG model, by setting L = 0 the
proposed makespan calculations can also be applied.
In [22], [23] the authors provide upper bounds by using
competitive analysis based on the different types of processors
for cloud applications. The authors consider a multiprocessor
platform where the type of the task and the type of the
processors need to match in order to be executed. In [24], the
authors provide response-time bounds for DAG based applications.
They consider a heterogeneous multiprocessor platform
where every task is compatible with one processor type.
Processors of the same type share a common ready queue
and non-preemptive global EDF is used. Initially they transform
the DAG to independent tasks with offsets to preserve
dependencies between the tasks of the DAG and then they
adapt their analysis with previous work [25]. In contrast, we
use a general greedy scheduler and consider computing the
makespan of a single DAG task without transforming the nodes
into independent tasks with offsets. The contribution of our work
is unique for real-time computing considering the more general
heterogeneous processor model, a general greedy scheduler,
and a DAG task model without requiring any transformation.
VIII. CONCLUSION
This paper considers the problem of calculating the
makespan of task-based parallel applications modeled as a
DAG and executed on heterogeneous multiprocessor platforms
modeled using the unrelated multiprocessor model. To the best
of our knowledge, this is the first work that provides a closed-form
solution to the makespan calculation under these assumptions.
A combinatorial analysis is used to construct two closed-form
makespan calculations that are used to build a third
makespan calculation with lower time complexity. The evaluation
is performed by modeling four OpenMP task-based parallel
applications as DAGs and synthetic DAGs. The simulation
results have shown that the length of the makespan using
the EM approach is very close to that of the two exhaustive
approaches.
REFERENCES
[1] Hadi Esmaeilzadeh, Emily Blem, Renee St Amant, Karthikeyan Sankar-
alingam, and Doug Burger. Dark silicon and the end of multicore scaling.
In IEEE ISCA, 2011.
[2] Anupam Gupta, Sungjin Im, Ravishankar Krishnaswamy, Benjamin
Moseley, and Kirk Pruhs. Scheduling heterogeneous processors isn’t
as easy as you think. In ACM-SIAM SODA, 2012.
[3] Sungjin Im, Janardhan Kulkarni, Kamesh Munagala, and Kirk Pruhs.
Selfishmigrate: A scalable algorithm for non-clairvoyantly scheduling
heterogeneous processors. In IEEE FOCS, 2014.
[4] Hoon Sung Chwa, Jaebaek Seo, Jinkyu Lee, and Insik Shin. Optimal
real-time scheduling on two-type heterogeneous multicore platforms. In
IEEE RTSS, 2015.
[5] Shelby Funk, Joel Goossens, and Sanjoy Baruah. On-line scheduling on
uniform multiprocessors. In IEEE RTSS, 2001.
[6] Xu Jiang, Nan Guan, Xiang Long, and Wang Yi. Semi-federated
scheduling of parallel real-time tasks on multiprocessors. IEEE RTSS,
2017.
[7] Michael A Bender and Michael O Rabin. Scheduling cilk multithreaded
parallel programs on processors of different speeds. In ACM SPAA,
2000.
[8] ARM Peter Greenhalgh. Big.little processing with arm
cortex-a15 and cortex-a7 improving energy efficiency in
high-performance mobile platforms. In White paper,
”http://www.cl.cam.ac.uk/ rdm34/big.LITTLE.pdf”, 2011.
[9] Gurulingesh Raravi, Björn Andersson, and Konstantinos Bletsas.
Assigning real-time tasks on heterogeneous multiprocessors with two
unrelated types of processors. Springer RTS, 2013.
[10] Eduard Ayguadé et al. The design of openmp tasks. IEEE TPDS, 2009.
[11] Petros Voudouris, Per Stenström, and Risat Pathan. Timing-anomaly
free dynamic scheduling of task-based parallel applications. In IEEE
RTAS, 2017.
[12] Alejandro Duran et al. Barcelona openmp tasks suite: A set of
benchmarks targeting the exploitation of task parallelism in openmp.
In ICPP, 2009.
[13] Douglas Brent West et al. Introduction to graph theory. Prentice hall
Upper Saddle River, 2001.
[14] Alessandra Melani, Marko Bertogna, Vincenzo Bonifaci, Alberto
Marchetti-Spaccamela, and Giorgio C Buttazzo. Response-time analysis
of conditional dag tasks in multiprocessor systems. In ECRTS, 2015.
[15] Kallia Chronaki et al. Criticality-aware dynamic task scheduling for
heterogeneous architectures. In ACM, ICS, 2015.
[16] Ali Bakhoda, George L Yuan, Wilson WL Fung, Henry Wong, and Tor M
Aamodt. Analyzing cuda workloads using a detailed gpu simulator. In
IEEE ISPASS, 2009.
[17] Wajahat Qadeer et al. Convolution engine: balancing efficiency &
flexibility in specialized computing. In ACM ISCA, 2013.
[18] Norman P Jouppi et al. In-datacenter performance analysis of a tensor
processing unit. ACM ISCA, 2017.
[19] Robert D Blumofe and Charles E Leiserson. Scheduling multithreaded
computations by work stealing. JACM, 1999.
[20] Gilbert C Sih and Edward A Lee. A compile-time scheduling heuristic
for interconnection-constrained heterogeneous processor architectures.
IEEE TPDS, 1993.
[21] Haluk Topcuoglu, Salim Hariri, and Min-you Wu. Performance-effective
and low-complexity task scheduling for heterogeneous computing. IEEE
TPDS, 2002.
[22] Yuxiong He, Hongyang Sun, and Wen-Jing Hsu. Adaptive scheduling of
parallel jobs on functionally heterogeneous resources. In ICPP, 2007.
[23] Yuxiong He, Jie Liu, and Hongyang Sun. Scheduling functionally
heterogeneous systems with utilization balancing. In IEEE IPDPS, 2011.
[24] Kecheng Yang, Ming Yang, and James H Anderson. Reducing response-
time bounds for dag-based task systems on heterogeneous multicore
platforms. In ACM RTNS, 2016.
[25] Kecheng Yang and James H Anderson. Optimal gedf-based schedulers
that allow intra-task parallelism on heterogeneous multiprocessors. In
IEEE ESTIMedia, 2014.
APPENDIX
A. Proofs of Lemma 3 and Lemma 4
This section presents the proofs for Lemma (3) and
Lemma (4) that are used for the proof of Theorem 2 in Section
IV.
Proof of Lemma (3)
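Proof. For an arbitrary task $1 \le \pi_y \le N$ that is executed
on its $y$th fastest processor it holds that:
$$\delta^{\pi_y}_{y} \le \max_{j=1}^{N} \{\delta^{j}_{y}\}$$
Summing over all $y \in Sl^{\pi_x}_{x}$,
$$\sum_{y \in Sl^{\pi_x}_{x}} \delta^{\pi_y}_{y} \le \sum_{y \in Sl^{\pi_x}_{x}} \max_{j=1}^{N} \{\delta^{j}_{y}\}$$
By dividing both sides by the speed of the $x$th fastest
processor of task $\pi_x$, $\delta^{\pi_x}_{x} \ne 0$, we have:
$$\frac{\sum_{y \in Sl^{\pi_x}_{x}} \delta^{\pi_y}_{y}}{\delta^{\pi_x}_{x}} \le \frac{\sum_{y \in Sl^{\pi_x}_{x}} \max_{j=1}^{N} \{\delta^{j}_{y}\}}{\delta^{\pi_x}_{x}}$$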
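Since the inequality holds for any processor $x$, it also holds
for the processor that maximizes the two sides of the inequality:
$$\max_{x=1}^{M} \left\{ \frac{\sum_{y \in Sl^{\pi_x}_{x}} \delta^{\pi_y}_{y}}{\delta^{\pi_x}_{x}} \right\} \le \max_{x=1}^{M} \left\{ \frac{\sum_{y \in Sl^{\pi_x}_{x}} \max_{j=1}^{N} \{\delta^{j}_{y}\}}{\delta^{\pi_x}_{x}} \right\}$$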
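Since it holds for an arbitrary task with index $\pi_x$ that
belongs to an arbitrary permutation $\pi$, it also holds for any
task of the application $1 \le k \le N$ and we have
$$\max_{x=1}^{M} \left\{ \frac{\sum_{y \in Sl^{\pi_x}_{x}} \max_{j=1}^{N} \{\delta^{j}_{y}\}}{\delta^{\pi_x}_{x}} \right\} = \max_{x=1}^{M} \left\{ \frac{\sum_{y \in Sl^{k}_{x}} \max_{j=1}^{N} \{\delta^{j}_{y}\}}{\delta^{k}_{x}} \right\} \tag{20}$$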
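Since $1 \le k \le N$ it also holds that
$$\max_{x=1}^{M} \left\{ \frac{\sum_{y \in Sl^{k}_{x}} \max_{j=1}^{N} \{\delta^{j}_{y}\}}{\delta^{k}_{x}} \right\} \le \max_{i=1}^{N} \left\{ \max_{x=1}^{M} \left\{ \frac{\sum_{y \in Sl^{i}_{x}} \max_{j=1}^{N} \{\delta^{j}_{y}\}}{\delta^{i}_{x}} \right\} \right\} \tag{21}$$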
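From (20) and (21) we have
$$\max_{x=1}^{M} \left\{ \frac{\sum_{y \in Sl^{\pi_x}_{x}} \max_{j=1}^{N} \{\delta^{j}_{y}\}}{\delta^{\pi_x}_{x}} \right\} \le \max_{i=1}^{N} \left\{ \max_{x=1}^{M} \left\{ \frac{\sum_{y \in Sl^{i}_{x}} \max_{j=1}^{N} \{\delta^{j}_{y}\}}{\delta^{i}_{x}} \right\} \right\}$$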
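Let $\pi'$ be the permutation that provides the maximum value
for the maximum accumulated capacity loss. Since the bound holds for
any $\pi$, it also holds for the permutation $\pi'$, and we have:
$$\max_{x=1}^{M} \left\{ \frac{\sum_{y \in Sl^{\pi'_x}_{x}} \delta^{\pi'_y}_{y}}{\delta^{\pi'_x}_{x}} \right\} \le \max_{i=1}^{N} \left\{ \max_{x=1}^{M} \left\{ \frac{\sum_{y \in Sl^{i}_{x}} \max_{j=1}^{N} \{\delta^{j}_{y}\}}{\delta^{i}_{x}} \right\} \right\}$$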
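Equivalently,
$$\max_{\pi \in \sigma_M} \left\{ \max_{x=1}^{M} \left\{ \frac{\sum_{y \in Sl^{\pi_x}_{x}} \delta^{\pi_y}_{y}}{\delta^{\pi_x}_{x}} \right\} \right\} \le \max_{i=1}^{N} \left\{ \max_{x=1}^{M} \left\{ \frac{\sum_{y \in Sl^{i}_{x}} \max_{j=1}^{N} \{\delta^{j}_{y}\}}{\delta^{i}_{x}} \right\} \right\}$$
From the definitions of $\lambda^{\max}$, $\lambda$ and $idle^{i}_{x}$ given by equations
(8), (11) and (10) we have $\lambda^{\max} \le \lambda$. As a result, $\lambda$
is always higher than or equal to $\lambda^{\max}$ (with equality for the case
where the maximum unused capacity and the speed of the node that
executes the critical path are derived from the same permutation).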
Proof of Lemma (4)
Proof. For an arbitrary task $\pi_x$ that is executing on its $x$th
fastest processor it holds that:
$$\delta^{\pi_x}_{x} \ge \min_{i=1}^{N} \{Prf^{i}_{x}\}$$
Since the size of $Prf^{i}$ is equal to the number of
processors $M$ it holds that:
$$\sum_{x=1}^{M} \delta^{\pi_x}_{x} \ge \sum_{x=1}^{M} \min_{i=1}^{N} \{Prf^{i}_{x}\}$$
Since it holds for an arbitrary permutation $\pi$, it also holds
for the permutation that provides the minimum value.
$$\min_{\pi \in \sigma_M} \left\{ \sum_{x=1}^{M} \delta^{\pi_x}_{x} \right\} \ge \sum_{x=1}^{M} \min_{i=1}^{N} \{Prf^{i}_{x}\}$$
From the definitions of $S^{\min}_{M}$ and $S_M$ given by equations
(7) and (9) respectively, the statement of the lemma holds.
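The inequality of Lemma 4 can be illustrated with a small brute-force check in Python. This is a sketch under assumptions that are not spelled out in this section: $S_M$ is taken to be the right-hand side of the last inequality above, a permutation is taken to assign $M$ distinct tasks to the ranks $x = 1, \dots, M$, and $\delta^{\pi_x}_{x}$ is read as $Prf^{\pi_x}_{x}$. The function names and the speed values are hypothetical.

```python
from fractions import Fraction
from itertools import permutations

def capacity_sm(prf):
    """S_M as assumed here: sum over ranks x of min over tasks i of Prf^i_x."""
    m = len(prf[0])
    return sum(min(row[x] for row in prf) for x in range(m))

def min_capacity_over_permutations(prf):
    """Brute-force minimum over all assignments of M distinct tasks to ranks."""
    n, m = len(prf), len(prf[0])
    return min(
        sum(prf[pi[x]][x] for x in range(m))
        for pi in permutations(range(n), m)
    )

# Hypothetical speeds of 3 tasks on 2 processors (one row per task).
speeds = [["0.1", "0.5"], ["1", "0.8"], ["0.9", "1"]]
# Prf^i: the speeds of task i sorted in non-increasing order.
prf = [sorted((Fraction(s) for s in row), reverse=True) for row in speeds]

print(capacity_sm(prf))                     # 3/5
print(min_capacity_over_permutations(prf))  # 11/10
```

In this example the same task is the slowest at both ranks, so the per-rank minimum is strictly smaller than what any permutation of distinct tasks can reach. The brute-force search is exponential and is only meant for very small inputs.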
B. Impact of task-processor compatibility
Figure 7 presents the impact of having tasks that are
not compatible with all the processors. The horizontal axis
presents the speed threshold for the applications Fibonacci,
Sort, Strassen, and FFT; a processor type is set to be
incompatible with a task type ($\delta^{i}_{t} = 0$) if its
speed is smaller than the threshold. The vertical axis presents
the pessimism of the EM approach with respect to Sim.
The considered platform has 32 processors of 4 processor
types. The Limit is set to 100 for this experiment.
Fig. 7. Impact of making tasks incompatible to a processor type that have speed smaller than a threshold (M = 32, H = 4).
Initially, for threshold 0.1 all the applications have pessimism
close to 25%, since the scheduler can find some
compatible available processor type due to the high number of
processors, but the calculation of the makespan needs to
pessimistically assume a speed equal to 0 for the calculation
of EM. Next, it can be seen that for all the applications, as
the threshold increases (relatively more incompatible tasks),
the pessimism initially remains constant and then increases,
since fewer processors are available for the tasks to execute
on. Next, we can note that the pessimism starts to increase
for Fibonacci and Strassen from speed threshold 0.5, while for
Sort and FFT the tightness begins to decrease after 0.8 and
0.7, respectively. Since Fibonacci and Strassen have fewer task
types (3 each) than Sort and FFT (6 and 9, respectively), more
tasks map to each task type in the former cases (i.e., Fibonacci
and Strassen). As a result, more tasks have fewer processors to
execute on for Fibonacci and Strassen. Finally, we can see that
the pessimism decreases for large values of the threshold, since
the scheduler cannot find available
processors to schedule the tasks and the total execution time of the
schedule increases. Consequently, the pessimism compared to
Sim decreases.
C. Table of symbols
Symbol            Description
$M$               Number of processors.
$H$               Number of processor types.
$N$               Number of application tasks.
$u_i$             Task with index $i$.
$e^{t}_{i}$       WCET of task $i$ for processor type $t$.
$G$               DAG of the application.
$\gamma$          A path in $G$.
$G^{\min}$        DAG isomorphic to $G$ where for each node $u_i$ it holds that $e^{\min}_{i} = \min_{t=1}^{H} \{e^{t}_{i}\}$ for $i = 1, 2, \ldots, N$.
$cp$              Critical path in $G^{\min}$.
$L$               The sum of the WCETs of all the tasks of $cp$.
$C$               The sum of the WCETs of all the tasks of $G^{\min}$.
$\delta^{i}_{t}$  Speed of processor type $t$ with respect to task $u_i$.
$Prf^{i}$         Set of processor-type speeds in non-increasing order for $u_i$.
$Prf^{i}_{x}$     The speed of the $x$th fastest processor for task $u_i$.
$Sl^{i}_{x}$      The set of the slower processors when $u_i$ is executing on its $x$th fastest processor.
$\pi$             Different mappings (i.e., execution scenarios) of the tasks on $p$ processors.
$\pi_x$           The $x$th task in permutation $\pi$.
$\sigma_p$        The set of all the permutations of size $p$.
$B_p$             The sum of time intervals where exactly $p$ processors are busy.
$B^{\pi}_{p}$     The sum of intervals where exactly $p$ processors are busy for the permutation $\pi$ of tasks of size $p$.
$L^{\pi}_{p}$     The amount of the workload that belongs to the $cp$ and is consumed when $p$ processors are busy by the tasks that are present in permutation $\pi$.
$L_p$             The amount of the workload that belongs to the $cp$ and is consumed when $p$ processors are busy for all the permutations of size $p$.
$C^{\pi}_{p}$     The amount of the total workload that is consumed when $p$ processors are busy by tasks belonging to the permutation $\pi$.
$C_p$             The amount of the total workload that is consumed when $p$ processors are busy.
$S^{\pi}_{x}$     The processor capacity for a permutation $\pi$.
$\lambda^{\pi}$   The heterogeneity for a permutation $\pi$.
$S^{\min}_{M}$    The minimum processor capacity among all permutations.
$\lambda^{\max}$  The maximum heterogeneity among all permutations.
$S_M$             The minimum processor capacity.
$idle^{i}_{x}$    The maximum unused capacity when task $u_i$ is executing on its $x$th fastest processor.
$\lambda$         Heterogeneity.
TABLE IV
DESCRIPTION OF THE MODEL PARAMETERS
