Assigning real-time tasks on heterogeneous multiprocessors with two types of processors by Andersson, Björn & Bletsas, Konstantinos
  
 
 
 
 
 
 
Assigning Real-Time Tasks on 
Heterogeneous Multiprocessors with Two 
Types of Processors 
 
 
 
 
www.hurray.isep.ipp.pt 
Technical Report 
HURRAY-TR-091104 
Version: 1 
Date: 11-03-2009 
Björn Andersson  
Konstantinos Bletsas 
© IPP Hurray! Research Group
Assigning Real-Time Tasks on Heterogeneous Multiprocessors with Two 
Types of Processors 
Björn Andersson and Konstantinos Bletsas 
IPP-HURRAY! 
Polytechnic Institute of Porto (ISEP-IPP) 
Rua Dr. António Bernardino de Almeida, 431 
4200-072 Porto 
Portugal 
Tel.: +351.22.8340509, Fax: +351.22.8340509 
E-mail:  
http://www.hurray.isep.ipp.pt 
 
Abstract 
Consider the problem of scheduling a set of implicit deadline sporadic tasks on a heterogeneous multiprocessor so as to 
meet all deadlines. Tasks cannot migrate and the platform is restricted in that each processor is either of type-1 or type-2 
(with each task characterized by a different speed of execution upon each type of processor). We present an algorithm 
for this problem with a time complexity of O(n*m), where n is the number of tasks and m is the number of processors. 
It offers the guarantee that if a task set can be scheduled by any non-migrative algorithm to meet deadlines then our 
algorithm meets deadlines as well if given processors twice as fast. Although this result is proven for only a restricted 
heterogeneous multiprocessor, we consider it significant for being the first real-time scheduling algorithm to use a
low-complexity bin-packing approach to schedule tasks on a heterogeneous multiprocessor with provably good performance.
 
Björn Andersson and Konstantinos Bletsas
IPP-HURRAY Research Group, CISTER/ISEP, Polytechnic Institute of Porto
Rua Dr. António Bernardino de Almeida 431, 4200-072 Porto, Portugal
bandersson@dei.isep.ipp.pt, ksbs@isep.ipp.pt
1. Introduction
Parallel processing platforms are spreading at an un-
precedented rate [8]. Traditionally, parallel processing was
used to speed up large computational jobs such as pre-
dicting the weather. Today however, parallel processing
platforms are also used in low-end and embedded real-time
systems thanks to the availability of multicore processors.
Such systems often consist of numerous independent tasks.
Designers are well-aware that processing units special-
ized for a specific function can offer significant perfor-
mance boost. For example, computer graphics are rendered
much faster with a graphics processor than with a general
purpose processor. Similar advantages can be obtained
using network processors, digital signal processors, SIMD
arrays, etc. Consequently, heterogeneous multiprocessors
(especially on a single chip) now enjoy widespread use.
Virtually all major semiconductor companies are offering
or have declared plans to offer heterogeneous multiproces-
sors implemented on a single chip [11][15][10][14][20].
Despite the widespread availability of heterogeneous
multiprocessor platforms and the eagerness to use them,
their deployment in embedded systems is a non-trivial
task for designers. The complicating factor is that many
embedded systems have real-time requirements, whose
satisfaction at run-time has to be proven/guaranteed a
priori. The way tasks are scheduled significantly influences
whether their timing requirements are met. For this
reason, a comprehensive toolbox of real-time scheduling
algorithms and analysis techniques [22][23] has been
developed in order to help designers. Unfortunately, few
results apply to heterogeneous multiprocessors.
An exact algorithm for deciding whether a task set can
be scheduled on a heterogeneous platform exists [5], but it
assumes that tasks can migrate. This algorithm is useful
for researchers but the assumption that tasks can migrate
is often unrealistic in practice, since processing units with
different functionalities typically have different instruction
sets and data layouts (big-endian/little-endian for exam-
ple). The problem of assigning tasks to processors and then
scheduling them with a uniprocessor scheduling algorithm
(i.e. without migration) is of much greater practical signif-
icance. It requires solving two sub-problems: (i) assigning
tasks to processors and (ii) once tasks are assigned to
processors, performing uniprocessor scheduling on each
processor. The latter problem is well-understood (e.g. one
may use EDF [19]); the difficult part is the task assignment.
Among known task assignment schemes for multipro-
cessors in general (i.e. not necessarily heterogeneous), only
(i) bin-packing schemes and (ii) Integer-Linear-Program-
ming (ILP) modeling offer provably good performance.
Bin-packing schemes are popular for task assignment
but unfortunately, the proof techniques used on identical
multiprocessors do not easily translate to heterogeneous
multiprocessors. Consequently, the current literature offers
no bin-packing scheme for assigning real-time tasks
on heterogeneous multiprocessors. Instead, the task-to-processor
assignment problem is modelled [6][7] as a Zero-One
Integer Linear Program (ILP). Such a formulation
can be solved directly but has high computational
complexity. In particular, the ILP decision problem is
NP-complete and, even with knowledge of the structure of the
constraints in the modeling of heterogeneous multiprocessor
scheduling, no polynomial-time algorithm is known
(see [12], p. 245). Via relaxation of the ILP formulation
to LP and certain tricks [21], it is possible to design a
polynomial-time approximation scheme [6][7]. The derived
linear program is solvable in polynomial time [16][17] but
unfortunately the degree of the polynomial is high.
Figure 1. An example of the task-to-processor assignment
problem over a multiprocessor with two types of processors.
The Cell processor is such an architecture: P1 is a
general purpose processor and P2-P9 are
8 vector processors (termed "synergistic").
Each task must be assigned to exactly one
processor; a processor may be assigned
zero, one or many tasks. All task deadlines
must be met with uniprocessor scheduling.
Yet, in practice many heterogeneous multiprocessors
only use two types of processors; for example one type is
a graphics processor and the other one is general-purpose.
Intel [15] and AMD [1] plan to ship such chips; FreeScale
already does [11]. Graphics processors were traditionally
meant just for graphics tasks, hence task assignment was
straightforward. Designers today [13] use graphics proces-
sors in a wide range of calculations though and this makes
task assignment non-trivial. This is accentuated in the Cell
processor [14][20] (Figure 1) where the “graphics proces-
sor” (called synergistic processor) is Turing-complete, thus
able to compute anything that the main processor can. Such
chips are now deployed in embedded systems for their ex-
cellent performance/cost ratio. CAD tool support for task
assignment algorithms exploiting their special structure
would tap more of the potential performance.
This motivates our new task assignment algorithm for
heterogeneous multiprocessors, which exploits the fact
that only two processor types exist. It is a bin-packing
algorithm whose time complexity is O(n · m), where n
and m are the number of tasks and processors respectively.
This algorithm offers the guarantee that if a task set can
be scheduled by any non-migrative algorithm to meet
deadlines then our algorithm meets deadlines as well
provided that it is given processors that are twice as fast.
In the remainder of this paper, Section 2 offers neces-
sary preliminaries. Section 3 presents some useful lem-
mata, used in Section 4, where we formulate the new
algorithm and prove its performance. Section 5 discusses
the context of the work and previous work and concludes.
2. Preliminaries
In a computer platform with two types of processors,
let $P^1$ be the set of type-1 processors and $P^2$ be the set
of type-2 processors. The workload is comprised of $\tau$, a
set of tasks each of which releases a (potentially infinite)
sequence of jobs. The sporadic model is used, i.e. the exact
time of a job release is unknown but the time between any
two successive job releases of a task $\tau_i$ is at least $T_i$.
A task is assigned to a processor and all jobs released
by this task must execute on this processor. The execution
requirement (in time units) of some task $\tau_i$ depends on the
type of processor to which it is assigned. It is $C_i / r_{i,1}$ upon
a type-1 processor but $C_i / r_{i,2}$ upon a type-2 processor.
Note that we allow $r_{i,1} = 0$ (or $r_{i,2} = 0$) if task $\tau_i$ cannot
be assigned at all to a type-1 (or type-2) processor.
Let $\tau[p]$ denote the set of tasks assigned to processor $p$.
Earliest-Deadline-First (EDF) is a very popular algorithm
in uniprocessor scheduling [19]. A slight adaptation of a
previously known result [19] gives us:
Lemma 1. If all tasks in $\tau[p]$ are scheduled under EDF
on processor $p$ (which is of type-$z$, where $z$ stands for 1
or 2) and $\sum_{\tau_i \in \tau[p]} \frac{C_i}{r_{i,z} \cdot T_i} \le 1$, then all deadlines are met.
Proof: Follows from Theorem 7 in [19].
Then the necessary and sufficient set of conditions for
schedulability on a partitioned heterogeneous multiprocessor
with two types of processors is the following:
\[ \sum_{\tau_i \in \tau[p]} \frac{C_i}{r_{i,1} \cdot T_i} \le 1 \quad \forall p \in P^1 \tag{1} \]
\[ \sum_{\tau_i \in \tau[p]} \frac{C_i}{r_{i,2} \cdot T_i} \le 1 \quad \forall p \in P^2 \tag{2} \]
Thus our problem of scheduling tasks on a heteroge-
neous multiprocessor with two types of processors is re-
duced to finding an assignment of tasks to processors such
that the above constraints are satisfied. In the general case,
however, even this problem is intractable; see Theorem 1.
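Checking a given assignment against Conditions 1 and 2 is straightforward, in contrast to finding one. The following Python sketch illustrates such a check. The tuple layout `(C, T, r1, r2)`, the function name `is_feasible` and the use of integer parameters (so that exact rational arithmetic can be used) are assumptions made for illustration only; they are not part of the original formulation.

```python
from fractions import Fraction

def is_feasible(tasks, proc_types, assignment):
    """Check Conditions (1) and (2) for a given task-to-processor assignment.

    tasks:       list of (C, T, r1, r2) tuples with integer entries (assumed layout)
    proc_types:  list with one entry per processor, each 1 or 2
    assignment:  list giving, for each task, the index of its processor
    """
    load = [Fraction(0)] * len(proc_types)
    for (C, T, r1, r2), p in zip(tasks, assignment):
        r = r1 if proc_types[p] == 1 else r2
        if r == 0:
            # r = 0 models a task that cannot run on this processor type
            return False
        load[p] += Fraction(C, r * T)
    # EDF meets all deadlines on a processor iff its utilization is at most 1
    return all(u <= 1 for u in load)

# Two tasks, one processor of each type (illustrative numbers only)
tasks = [(1, 1, 1, 2), (1, 1, 2, 1)]
print(is_feasible(tasks, [1, 2], [1, 0]))  # each task on the type where it runs faster
print(is_feasible(tasks, [1, 2], [0, 0]))  # both tasks on the type-1 processor
```

The check runs in time linear in the number of tasks; the difficulty addressed in the rest of the paper lies in choosing `assignment`, not in verifying it.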
Theorem 1. Deciding if a feasible mapping exists for a
given task set and computing platform is NP-complete.
Proof: If $|P^1| = |P^2| = 1$, $r_{i,1} = r_{i,2} = 1 \ \forall i$ and
$\sum_{\tau_i \in \tau} \frac{C_i}{T_i} = 2$,
the problem reduces to PARTITION (see [12], p. 223).
Faced with this fact, we opt for the design of a non-
optimal algorithm which would still offer good perfor-
mance but which would be of polynomial time complexity.
Commonly, the performance of an algorithm is characterized
using the notion of the utilization bound [19]:
an algorithm with a utilization bound of UB is always
capable of scheduling any task set with a utilization up
to UB so as to meet deadlines. This definition has been
used on uniprocessor scheduling [19] and multiprocessors
with identical processors [2]. However, it does not translate
to heterogeneous multiprocessors, where the utilization of a
task depends on the type of processor it is assigned to; hence we rely on
the resource augmentation framework to characterize the
performance of the algorithm under design.
The speed competitive ratio $CPT_A$ of an algorithm $A$ is
defined as the lowest number such that for every task set $\tau$
and computing platform $\Pi'$ it holds that if it is possible for
a non-migrative algorithm to meet all deadlines of $\tau$ on $\Pi'$
then algorithm $A$ meets all deadlines of $\tau$ on a computing
platform $\Pi$ whose every processor is $CPT_A$ times faster
than the corresponding processor in $\Pi'$ (see footnote 1).
A low speed competitive ratio indicates high perfor-
mance; the best achievable is 1. If a scheduling algorithm
has an infinite speed competitive ratio then a task set exists
which could be scheduled to meet deadlines (under another
algorithm) but which would miss a deadline with the
actually used algorithm even if processor speeds were mul-
tiplied by an “infinite” factor. Such behavior is undesired in
design tools, consequently we aim to design an algorithm
with a finite (ideally small) speed competitive ratio.
3. Useful results
Before describing the new algorithm, we will derive
some useful statements which facilitate various proofs.
Bin-packing task assignment algorithms are popular in
the context of identical [18] and also uniform [3] multipro-
cessors, as they run fast and offer a finite speed competitive
ratio. Yet, the straightforward application of a bin-packing
algorithm to heterogeneous multiprocessors with two types
of processors performs poorly, as illustrated by Example 1.
Example 1. Consider a set of $2k$ tasks and 2 processors
(for an integer $k \ge 3$). Processor P1 is of type-1 and
processor P2 is of type-2. Tasks indexed $1..k$ are characterized
by $C_i = T_i = 1$, $r_{i,1} = 1$, $r_{i,2} = k$ and tasks indexed $k+1..2k$
are characterized by $C_i = T_i = 1$, $r_{i,1} = k$, $r_{i,2} = 1$.
(Footnote 1: Our notion of speed competitive ratio in this paper is equivalent to
that in previous work by Baruah [5]. It differs from that used in [3][4].)
It is possible to assign tasks such that the condition
of Lemma 1 is met for both processors; assigning tasks
indexed 1..k to P2 and the rest to P1 does that. Yet, the
application of a normal bin-packing algorithm for identical
multiprocessors (such as Next-Fit or First-Fit) causes
failure. These algorithms consider tasks in a sequence and
each time use the condition of Lemma 1 to decide if the task
in consideration can be assigned to a processor. Whether
under Next-Fit or First-Fit, τ1 will end up on P1 (as pro-
cessors are considered by order of ascending index). Yet,
at most one task from among those indexed 1..k can be
assigned there. Thus, k−1 ≥ 2 tasks (those indexed 2..k)
will then have to be assigned to P2. The bin-packing scheme
would continue trying to assign tasks indexed k+1..2k
to P2; none would fit and the algorithm would fail.
Let us now provide the bin-packing algorithm with processors
k−1 times faster. Then, tasks indexed 1..k−1 will
be assigned to P1 and the kth task to P2 before considering
tasks indexed k+1..2k. Of the latter, many can be assigned
to P2 but not all and, since none can be assigned to P1,
the bin-packing algorithm would again fail.
The above reasoning holds for any k ≥ 3. By considering
k→∞ we obtain that the speed competitive ratio of
such bin-packing schemes is infinite.
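The behaviour described in Example 1 can be reproduced with a short simulation. The sketch below is illustrative only: the helper names `example_tasks` and `naive_first_fit`, the `(C, T, r1, r2)` tuple layout and the integer `speed` parameter are assumptions introduced here, and the first-fit shown is the plain type-blind variant discussed in the example, not the algorithm proposed in this paper.

```python
from fractions import Fraction

def example_tasks(k):
    """The 2k tasks of Example 1 as (C, T, r1, r2) tuples."""
    return [(1, 1, 1, k)] * k + [(1, 1, k, 1)] * k

def naive_first_fit(tasks, proc_types, speed=1):
    """Type-blind first-fit: tasks and processors are scanned in index order.

    Returns True iff every task could be placed using the test of Lemma 1.
    `speed` multiplies the rate of every processor (resource augmentation).
    """
    load = [Fraction(0)] * len(proc_types)
    for C, T, r1, r2 in tasks:
        for p, z in enumerate(proc_types):
            r = (r1 if z == 1 else r2) * speed
            u = Fraction(C, r * T)
            if load[p] + u <= 1:
                load[p] += u
                break
        else:
            return False  # the task fits on no processor
    return True

k = 5
platform = [1, 2]  # P1 is of type-1, P2 is of type-2
print(naive_first_fit(example_tasks(k), platform, speed=1))
print(naive_first_fit(example_tasks(k), platform, speed=k - 1))
```

Both calls report failure, although a feasible assignment exists at speed 1; increasing `k` increases the speedup at which this naive scheme still fails.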
It can be seen that the cause of the low performance
of such a bin-packing scheme is that, by considering tasks
one by one, it lacks a “global view” of the problem, hence
a task may be assigned to a processor where it executes
slowly. It seems like a good idea to try to assign each task
to the processor where it executes faster. We will use this
idea, therefore let us introduce the following definitions:
$P^1$ is the set of type-1 processors and $P^2$ is the set of
type-2 processors. The task set $\tau$ is viewed as two disjoint
subsets, $\tau^1$ and $\tau^2$. The set $\tau^1$ consists of those tasks which
run at least as fast on a type-1 processor as on a type-2
processor; $\tau^2$ consists of all other tasks. In notation:
\[ \tau = \tau^1 \cup \tau^2 \tag{3} \]
\[ \forall \tau_i \in \tau^1 : r_{i,1} \ge r_{i,2} \tag{4} \]
\[ \forall \tau_i \in \tau^2 : r_{i,1} < r_{i,2} \tag{5} \]
We proceed with two useful observations (their correctness
is evident; for formal proof, see the Appendix).
Lemma 2. If there is a task $\tau_i$ in $\tau^1$ such that $\frac{C_i}{r_{i,1} \cdot T_i} > 1$,
it is then impossible to meet deadlines. Likewise if there is
a task $\tau_i$ in $\tau^2$ such that $\frac{C_i}{r_{i,2} \cdot T_i} > 1$.
Proof: See the Appendix.
Lemma 3. It is impossible to meet deadlines if
\[ \sum_{\tau_i \in \tau^1} \frac{C_i}{r_{i,1} \cdot T_i} + \sum_{\tau_i \in \tau^2} \frac{C_i}{r_{i,2} \cdot T_i} > |P^1| + |P^2| \tag{6} \]
Proof: See the Appendix.
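Lemmas 2 and 3 give two quick necessary conditions that can be tested before any assignment is attempted. The sketch below illustrates them; the function name `obviously_infeasible`, the `(C, T, r1, r2)` tuple layout with integer entries, and the assumption that at least one of `r1`, `r2` is positive for every task are introduced here for illustration.

```python
from fractions import Fraction

def obviously_infeasible(tasks, n_type1, n_type2):
    """Return True if Lemma 2 or Lemma 3 already rules out every assignment.

    tasks: list of (C, T, r1, r2) tuples (assumed layout, integer entries,
           with max(r1, r2) > 0 for every task).
    """
    # Utilization of each task on the processor type where it runs fastest:
    # r1 is used for tasks in tau^1 (r1 >= r2) and r2 for tasks in tau^2.
    best = [Fraction(C, max(r1, r2) * T) for C, T, r1, r2 in tasks]
    if any(u > 1 for u in best):           # Lemma 2
        return True
    return sum(best) > n_type1 + n_type2   # Lemma 3 (Inequality 6)

print(obviously_infeasible([(3, 2, 1, 1)], 1, 1))       # a single task that is too heavy
print(obviously_infeasible([(1, 2, 1, 2)] * 4, 1, 1))   # passes both necessary tests
```

A result of `False` does not imply that a feasible assignment exists; the conditions are necessary, not sufficient.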
3.1. Inequalities which we will use
We highlight how the problem in consideration is rela-
ted to other known computational problems, to help with
proofs later. If you read this paper for the first time, you
may want to skip this section now and revisit later.
Fractional knapsack problem: A vector $x$ has $n$
elements. The problem instance is represented by vectors
$v$ and $w$ of real numbers, arranged such that $\frac{v_i}{w_i} \ge \frac{v_{i+1}}{w_{i+1}}$.
(Intuitively, $v_i$ and $w_i$ may be thought of as, respectively,
the "value" and "weight" of an item, indexed $i$, and $x_i$ as
the fraction of it that is employed). Consider the problem
of assigning values to the elements in vector $x$ so as to
maximize $\sum_{i=1}^{n} x_i \cdot v_i$
subject to $\sum_{i=1}^{n} x_i \cdot w_i \le CAP$
and $0 \le x_i \le 1$
and $x_i$ is a real number.
(Intuitively, determine how much of each item to use
such that cumulative value is maximized, subject to cumulative
weight not exceeding some bound).
Lemma 4. The Fractional Knapsack Problem can be
solved by the following algorithm:
1. reindex tuples {v_i, w_i} by order of descending v_i/w_i
2. for i := 1 to n do
3.   x_i := 0;
4. end for
5. i := 1;
6. SUMWEIGHT := 0;
7. SUMVALUE := 0;
8. while ((i ≤ n) and (SUMWEIGHT + w_i ≤ CAP)) do
9.   x_i := 1;
10.   SUMWEIGHT := SUMWEIGHT + w_i;
11.   SUMVALUE := SUMVALUE + v_i;
12.   i := i + 1;
13. end while
14. if i ≤ n then
15.   x_i := (CAP − SUMWEIGHT)/w_i;
16.   SUMWEIGHT := SUMWEIGHT + w_i · x_i;
17.   SUMVALUE := SUMVALUE + v_i · x_i;
18. end if
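As an illustrative sketch of the greedy procedure above, the following Python function solves the fractional knapsack problem. The function name, the argument names and the assumption of strictly positive weights are introduced here; the loop is organized slightly differently from the pseudocode (each item takes the largest fraction that still fits) but selects the same items under that assumption.

```python
from fractions import Fraction

def fractional_knapsack(values, weights, cap):
    """Greedy solution of the fractional knapsack problem.

    Assumes every weight is strictly positive. Returns (x, total_value),
    where x[i] is the fraction of item i that is used.
    """
    # Consider items by descending value-to-weight ratio
    order = sorted(range(len(values)),
                   key=lambda i: Fraction(values[i]) / weights[i],
                   reverse=True)
    x = [Fraction(0)] * len(values)
    remaining = Fraction(cap)
    total = Fraction(0)
    for i in order:
        if remaining <= 0:
            break
        # Take the whole item if it fits, otherwise the fraction that fills CAP
        take = min(Fraction(1), remaining / weights[i])
        x[i] = take
        remaining -= take * weights[i]
        total += take * values[i]
    return x, total

x, total = fractional_knapsack([60, 100, 120], [10, 20, 30], 50)
print(x, total)
```

In the example call, the first two items are taken whole and two thirds of the third item fills the remaining capacity.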
This is known from undergraduate textbooks (for ex-
ample, see Chapter 16.2 in [9]). We now consider a
multiprocessor scheduling problem.
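Lemma 5. Consider $n$ tasks and a heterogeneous multiprocessor
conforming to the system model (and notation)
of Section 2. Let $x$ denote a number such that $0 \le x \le \frac{|P^1|}{2}$.
Let $A1$ denote a subset of $\tau^1$ such that
\[ \sum_{\tau_i \in A1} \frac{C_i}{r_{i,1} \cdot T_i} > \frac{|P^1|}{2} - x \tag{7} \]
and for every pair of tasks $\tau_i \in A1$ and $\tau_j \in \tau \setminus A1$ it
holds that $\frac{r_{i,1}}{r_{i,2}} - 1 \ge \frac{r_{j,1}}{r_{j,2}} - 1$. Let $A2$ denote $\tau^1 \setminus A1$.
Let $B1$ denote a subset of $\tau^1$ such that
\[ \sum_{\tau_i \in B1} \frac{C_i}{r_{i,1} \cdot T_i} \le \frac{|P^1|}{2} - x \tag{8} \]
Let $B2$ denote $\tau \setminus B1$. It then holds that:
\[ \sum_{\tau_i \in A1} \frac{C_i}{r_{i,1} \cdot T_i} + \sum_{\tau_i \in A2} \frac{C_i}{r_{i,2} \cdot T_i} + \sum_{\tau_i \in \tau^2} \frac{C_i}{r_{i,2} \cdot T_i} \le \sum_{\tau_i \in B1} \frac{C_i}{r_{i,1} \cdot T_i} + \sum_{\tau_i \in B2} \frac{C_i}{r_{i,2} \cdot T_i} \tag{9} \]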
Proof: Let us arbitrarily choose $A1$ and $B1$ as defined
above. Using Inequalities 7 and 8 we clearly get:
\[ \sum_{\tau_i \in A1} \frac{C_i}{r_{i,1} \cdot T_i} > \sum_{\tau_i \in B1} \frac{C_i}{r_{i,1} \cdot T_i} \tag{10} \]
With this choice of $A1$ and $B1$, let us consider two
instances of the fractional knapsack problem:
Instance1.
$CAP$ = left-hand side of Inequality 10,
$v_i = \left( \frac{1}{r_{i,2}} - \frac{1}{r_{i,1}} \right) \frac{C_i}{T_i}$,
$w_i = \frac{C_i}{r_{i,1} \cdot T_i}$
Instance2.
$CAP$ = right-hand side of Inequality 10,
$v_i = \left( \frac{1}{r_{i,2}} - \frac{1}{r_{i,1}} \right) \frac{C_i}{T_i}$,
$w_i = \frac{C_i}{r_{i,1} \cdot T_i}$
Using Inequality 10, we get:
\[ CAP_{Instance1} > CAP_{Instance2} \tag{11} \]
where $CAP_{Instance1}$ is defined as $CAP$ in Instance1 and
$CAP_{Instance2}$ is defined analogously.
Observe that Instance1 and Instance2 differ only in
their value of $CAP$. Instance1 can be perceived as a relaxed
version of Instance2. Therefore we have:
\[ SUMVALUE_{Instance1} \ge SUMVALUE_{Instance2} \tag{12} \]
where $SUMVALUE_{Instance1}$ is defined as the value of
the variable SUMVALUE in Instance1 when the algorithm
in Lemma 4 terminates and $SUMVALUE_{Instance2}$
is defined analogously. Observe that this choice of
$CAP_{Instance1}$ ensures that, on line 15 of the pseudocode,
$x_i$ is assigned the value 0 when the algorithm of Lemma 4
takes Instance1 as input, since the tasks in $A1$ exhaust $CAP_{Instance1}$ exactly. Note that
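\[ SUMVALUE_{Instance1} = \sum_{\tau_i \in A1} \left( \frac{r_{i,1}}{r_{i,2}} - 1 \right) \frac{C_i}{r_{i,1} \cdot T_i} \tag{13} \]
because of the definition of $A1$. Also, note that
\[ SUMVALUE_{Instance2} \ge \sum_{\tau_i \in B1} \left( \frac{r_{i,1}}{r_{i,2}} - 1 \right) \frac{C_i}{r_{i,1} \cdot T_i} \tag{14} \]
because, in Instance2, the set $B1$ cannot produce a solution
for the fractional knapsack problem that is higher than the
optimal one. Applying Equation 13 and Inequality 14 on
Inequality 12 yields:
\[ \sum_{\tau_i \in A1} \left( \frac{r_{i,1}}{r_{i,2}} - 1 \right) \frac{C_i}{r_{i,1} \cdot T_i} \ge \sum_{\tau_i \in B1} \left( \frac{r_{i,1}}{r_{i,2}} - 1 \right) \frac{C_i}{r_{i,1} \cdot T_i} \tag{15} \]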
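Then we can reason as follows.
\[ \begin{aligned}
(15) &\Leftrightarrow \sum_{\tau_i \in A1} \left( \frac{1}{r_{i,2}} - \frac{1}{r_{i,1}} \right) \frac{C_i}{T_i} \ge \sum_{\tau_i \in B1} \left( \frac{1}{r_{i,2}} - \frac{1}{r_{i,1}} \right) \frac{C_i}{T_i} \\
&\Leftrightarrow -\left( \sum_{\tau_i \in A1} \frac{1}{r_{i,2}} \cdot \frac{C_i}{T_i} - \sum_{\tau_i \in A1} \frac{1}{r_{i,1}} \cdot \frac{C_i}{T_i} \right) \le -\left( \sum_{\tau_i \in B1} \frac{1}{r_{i,2}} \cdot \frac{C_i}{T_i} - \sum_{\tau_i \in B1} \frac{1}{r_{i,1}} \cdot \frac{C_i}{T_i} \right)
\end{aligned} \]
Now, observing that $\tau = \tau^1 \cup \tau^2 = B1 \cup B2$ gives us:
\[ \sum_{\tau_i \in \tau^1} \frac{C_i}{r_{i,2} \cdot T_i} + \sum_{\tau_i \in \tau^2} \frac{C_i}{r_{i,2} \cdot T_i} = \sum_{\tau_i \in B1} \frac{C_i}{r_{i,2} \cdot T_i} + \sum_{\tau_i \in B2} \frac{C_i}{r_{i,2} \cdot T_i} \tag{16} \]
Adding these two together produces the inequality:
\[ \begin{aligned}
& \sum_{\tau_i \in \tau^1} \frac{C_i}{r_{i,2} \cdot T_i} + \sum_{\tau_i \in \tau^2} \frac{C_i}{r_{i,2} \cdot T_i} - \left( \sum_{\tau_i \in A1} \frac{C_i}{r_{i,2} \cdot T_i} - \sum_{\tau_i \in A1} \frac{C_i}{r_{i,1} \cdot T_i} \right) \\
&\le \sum_{\tau_i \in B1} \frac{C_i}{r_{i,2} \cdot T_i} + \sum_{\tau_i \in B2} \frac{C_i}{r_{i,2} \cdot T_i} - \left( \sum_{\tau_i \in B1} \frac{C_i}{r_{i,2} \cdot T_i} - \sum_{\tau_i \in B1} \frac{C_i}{r_{i,1} \cdot T_i} \right)
\end{aligned} \tag{17} \]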
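Rearranging the terms yields:
\[ \begin{aligned}
& \sum_{\tau_i \in A1} \frac{C_i}{r_{i,1} \cdot T_i} + \left( \sum_{\tau_i \in \tau^1} \frac{C_i}{r_{i,2} \cdot T_i} - \sum_{\tau_i \in A1} \frac{C_i}{r_{i,2} \cdot T_i} \right) + \sum_{\tau_i \in \tau^2} \frac{C_i}{r_{i,2} \cdot T_i} \\
&\le \sum_{\tau_i \in B1} \frac{C_i}{r_{i,1} \cdot T_i} + \left( \sum_{\tau_i \in B1} \frac{C_i}{r_{i,2} \cdot T_i} - \sum_{\tau_i \in B1} \frac{C_i}{r_{i,2} \cdot T_i} \right) + \sum_{\tau_i \in B2} \frac{C_i}{r_{i,2} \cdot T_i} \\
&= \sum_{\tau_i \in B1} \frac{C_i}{r_{i,1} \cdot T_i} + \sum_{\tau_i \in B2} \frac{C_i}{r_{i,2} \cdot T_i}
\end{aligned} \tag{18} \]
Exploiting the fact that $A2 = \tau^1 \setminus A1$ gives us:
\[ \sum_{\tau_i \in A1} \frac{C_i}{r_{i,1} \cdot T_i} + \sum_{\tau_i \in A2} \frac{C_i}{r_{i,2} \cdot T_i} + \sum_{\tau_i \in \tau^2} \frac{C_i}{r_{i,2} \cdot T_i} \le \sum_{\tau_i \in B1} \frac{C_i}{r_{i,1} \cdot T_i} + \sum_{\tau_i \in B2} \frac{C_i}{r_{i,2} \cdot T_i} \]
This is the statement of the lemma.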
Lemma 5 considers the task set $\tau$. We can however
apply it to only a subset of $\tau$. Let us assume that $\tau^{H1}$
and $\tau^{H2}$ are two disjoint subsets of $\tau$. We apply Lemma 5
on $\tau \setminus (\tau^{H1} \cup \tau^{H2})$ and then add the same sum to both
sides of Inequality 9. This gives us:
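Lemma 6. Consider $n$ tasks and a heterogeneous multiprocessor
conforming to the system model (and notation)
of Section 2. Let $x$ denote a number such that $0 \le x \le \frac{|P^1|}{2}$.
Let $A1$ denote a subset of $(\tau^1 \setminus (\tau^{H1} \cup \tau^{H2}))$ such that
\[ \sum_{\tau_i \in A1} \frac{C_i}{r_{i,1} \cdot T_i} \ge \frac{|P^1|}{2} - x \tag{19} \]
and for every pair of tasks $\tau_i \in A1$ and
$\tau_j \in (\tau \setminus (\tau^{H1} \cup \tau^{H2})) \setminus A1$ it holds that $\frac{r_{i,1}}{r_{i,2}} - 1 \ge \frac{r_{j,1}}{r_{j,2}} - 1$. Let
$A2$ denote $(\tau^1 \setminus (\tau^{H1} \cup \tau^{H2})) \setminus A1$.
Let $B1$ denote a subset of $(\tau \setminus (\tau^{H1} \cup \tau^{H2}))$ such that
\[ \sum_{\tau_i \in B1} \frac{C_i}{r_{i,1} \cdot T_i} \le \frac{|P^1|}{2} - x \tag{20} \]
Let $B2$ denote $(\tau \setminus (\tau^{H1} \cup \tau^{H2})) \setminus B1$. It then holds that:
\[ \begin{aligned}
& \sum_{\tau_i \in \tau^{H1}} \frac{1}{r_{i,1}} \cdot \frac{C_i}{T_i} + \sum_{\tau_i \in \tau^{H2}} \frac{1}{r_{i,2}} \cdot \frac{C_i}{T_i} + \sum_{\tau_i \in A1} \frac{1}{r_{i,1}} \cdot \frac{C_i}{T_i} + \sum_{\tau_i \in A2} \frac{C_i}{r_{i,2} \cdot T_i} + \sum_{\tau_i \in \tau^2 \setminus (\tau^{H1} \cup \tau^{H2})} \frac{1}{r_{i,2}} \cdot \frac{C_i}{T_i} \\
&\le \sum_{\tau_i \in \tau^{H1}} \frac{1}{r_{i,1}} \cdot \frac{C_i}{T_i} + \sum_{\tau_i \in \tau^{H2}} \frac{1}{r_{i,2}} \cdot \frac{C_i}{T_i} + \sum_{\tau_i \in B1} \frac{1}{r_{i,1}} \cdot \frac{C_i}{T_i} + \sum_{\tau_i \in B2} \frac{1}{r_{i,2}} \cdot \frac{C_i}{T_i}
\end{aligned} \tag{21} \]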
Lemma 6 will be useful for proving the performance of
our new algorithm, formulated in Section 4.
4. The new algorithm
Our goal is to design an algorithm with a speed competitive
ratio of 2. The new algorithm is based on two ideas.
Idea1. A task should preferably be assigned to the type
of processor on which it runs faster.
Idea2. A task which has a utilization less than 50% on
one type of processor and utilization greater than
50% on the other type of processor
should be assigned to the former type of
processor. This is a special case of Idea1 but we
state it separately because this facilitates creating
an algorithm with the desired speed competitive
ratio. The rationale behind this idea is that we
are interested in comparing the performance of
our new algorithm versus every other algorithm
using processors of half the speed; by following
Idea2, we create assignments that mimic what
every other algorithm does (assuming that the
other algorithm successfully assigns tasks).
Based on these ideas, we will use the concepts of $\tau^1$
and $\tau^2$ (already defined in Section 3). We also define:
\[ \tau^{H1} = \left\{ \tau_i \in \tau^1 : \frac{C_i}{T_i \cdot r_{i,2}} > 1/2 \right\} \tag{22} \]
\[ \tau^{H2} = \left\{ \tau_i \in \tau^2 : \frac{C_i}{T_i \cdot r_{i,1}} > 1/2 \right\} \tag{23} \]
\[ \tau^{F1} = \tau^1 \setminus \tau^{H1} \tag{24} \]
\[ \tau^{F2} = \tau^2 \setminus \tau^{H2} \tag{25} \]
Intuitively, $\tau^{H1}$ and $\tau^{H2}$ identify those tasks which
should be assigned based on Idea2. $\tau^{H1}$ stands for "set
of tasks that are heavy if they are not assigned to their
favorite processor, of type-1". Analogously for $\tau^{H2}$. Also,
intuitively, $\tau^{F1}$ and $\tau^{F2}$ identify those tasks which should
be assigned based on Idea1. $\tau^{F1}$ stands for "set of tasks
that have a processor of type-1 as their favorite and for which
heaviness should not be considered". Analogously for $\tau^{F2}$.
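The partition defined by Equations 22-25 can be computed in a single pass over the task set. The sketch below illustrates this; the function name `classify`, the dictionary layout and the `(C, T, r1, r2)` tuples with integer entries are assumptions made here. Treating a rate of 0 on the non-favorite type as "heavy" is also an assumption of this sketch (the utilization is undefined in that case, and such a task can only run on its favorite type).

```python
from fractions import Fraction

def classify(tasks):
    """Split tasks into the sets H1, H2, F1, F2 of Equations 22-25.

    tasks: dict mapping a task name to (C, T, r1, r2) (assumed layout).
    """
    half = Fraction(1, 2)
    sets = {"H1": set(), "H2": set(), "F1": set(), "F2": set()}
    for name, (C, T, r1, r2) in tasks.items():
        if r1 >= r2:
            # Task belongs to tau^1; it is heavy if its utilization on type-2 exceeds 1/2
            heavy = r2 == 0 or Fraction(C, T * r2) > half
            sets["H1" if heavy else "F1"].add(name)
        else:
            # Task belongs to tau^2; it is heavy if its utilization on type-1 exceeds 1/2
            heavy = r1 == 0 or Fraction(C, T * r1) > half
            sets["H2" if heavy else "F2"].add(name)
    return sets

demo = {"a": (1, 1, 2, 1), "b": (1, 4, 2, 1), "c": (1, 1, 1, 2), "d": (1, 4, 1, 2)}
print(classify(demo))
```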
Figure 2 shows the new algorithm FF-3C. The intuition
behind the design of our algorithm is that first we assign
tasks to their favorite processors so that the tasks are
not heavy (lines 4-5). Then we assign the remaining
tasks to their favorite processors (lines 6-7). Then if there
are remaining tasks these tasks have to be assigned to
processors that are not their favorite (line 12 and line 20).
The name FF-3C is derived from the fact that first-fit is
used to assign a task to a processor and a task has three
chances to be used by first-fit. A task has the chance to be
assigned by first-fit if it follows Idea2 (to avoid making a
task heavy). Then a task has the chance to be assigned to
its favorite processor. And then a task has a chance to be
assigned to a processor which is not its favorite processor.
The algorithm FF-3C keeps track of processor utiliza-
tions in a global vector U, initialized to zero (line 2).
As already mentioned, the algorithm FF-3C performs
several passes with first-fit bin-packing. It uses a subrou-
tine first-fit which takes two parameters, a set of
tasks to be assigned using first-fit bin-packing and a set
of processors to assign these tasks, and it returns the set
of tasks that were successfully assigned. Figure 3 shows
pseudo-code for first-fit.
We next establish the speed competitive ratio of FF-3C.
Output: τ[p] specifies the tasks assigned to processor p.
1. Form sets τ^H1, τ^H2, τ^F1, τ^F2 as defined by Eq. 22-25
2. ∀p: U[p] := 0
3. ∀p: τ[p] := ∅
4. if first-fit(τ^H1, P^1) ≠ τ^H1 then declare FAILURE
5. if first-fit(τ^H2, P^2) ≠ τ^H2 then declare FAILURE
6. τ^F11 := first-fit(τ^F1, P^1)
7. τ^F22 := first-fit(τ^F2, P^2)
8. if τ^F11 = τ^F1 ∧ τ^F22 = τ^F2 then declare SUCCESS
9. if τ^F11 ≠ τ^F1 ∧ τ^F22 ≠ τ^F2 then declare FAILURE
10. if τ^F11 ≠ τ^F1 ∧ τ^F22 = τ^F2 then
11.   τ^F12 := τ^F1 \ τ^F11
12.   if first-fit(τ^F12, P^2) = τ^F12 then
13.     declare SUCCESS
14.   else
15.     declare FAILURE
16.   end
17. end
18. if τ^F11 = τ^F1 ∧ τ^F22 ≠ τ^F2 then
19.   τ^F21 := τ^F2 \ τ^F22
20.   if first-fit(τ^F21, P^1) = τ^F21 then
21.     declare SUCCESS
22.   else
23.     declare FAILURE
24.   end
25. end
Figure 2. The new algorithm, FF-3C
1. function first-fit(ts : set of tasks; ps : set of processors) return set of tasks
2. assigned_tasks := ∅
3. Order the tasks ts and order the processors ps. This order should be maintained during the execution of the function first-fit
4. τ_i := first task in ts
5. P_p := first processor in ps
6. Let k denote the type of processor P_p (either 1 or 2)
7. if U[p] + C_i/(T_i·r_{i,k}) ≤ 1 then
8.   U[p] := U[p] + C_i/(T_i·r_{i,k})
9.   τ[p] := τ[p] ∪ {τ_i}
10.   assigned_tasks := assigned_tasks ∪ {τ_i}
11.   if remaining tasks exist in ts then
12.     τ_i := next task in ts
13.     go to line 5.
14.   else
15.     return assigned_tasks
16.   end if
17. else
18.   if remaining processors exist in ps then
19.     P_p := next processor in ps
20.     go to line 6.
21.   else
22.     return assigned_tasks
23.   end if
24. end if
Figure 3. First-fit bin-packing
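The pseudocode of Figures 2 and 3 can be sketched in Python as follows. This is an illustrative transcription under stated assumptions, not a reference implementation: the names `util`, `first_fit` and `ff_3c`, the dictionary-based inputs, the `(C, T, r1, r2)` tuples with integer entries, the use of dictionary insertion order as the task and processor order, and the treatment of a rate of 0 as "cannot be placed" are all choices made here. Comments refer to the line numbers of Figure 2.

```python
from fractions import Fraction

def util(task, z):
    """Utilization of task (C, T, r1, r2) on a type-z processor, or None if r = 0."""
    C, T, r1, r2 = task
    r = r1 if z == 1 else r2
    return None if r == 0 else Fraction(C, T * r)

def first_fit(names, procs, tasks, ptype, U, placed):
    """Figure 3: place tasks in order; return at the first task that fits nowhere."""
    assigned = []
    for name in names:
        for p in procs:
            u = util(tasks[name], ptype[p])
            if u is not None and U[p] + u <= 1:
                U[p] += u
                placed[name] = p
                assigned.append(name)
                break
        else:
            return assigned  # no processor can take this task
    return assigned

def ff_3c(tasks, ptype):
    """Figure 2. tasks: name -> (C, T, r1, r2); ptype: processor -> 1 or 2.

    Returns a dict task -> processor on SUCCESS and None on FAILURE.
    """
    P1 = [p for p in ptype if ptype[p] == 1]
    P2 = [p for p in ptype if ptype[p] == 2]
    half = Fraction(1, 2)
    H1, H2, F1, F2 = [], [], [], []
    for name, t in tasks.items():                      # line 1
        if t[2] >= t[3]:
            u = util(t, 2)
            (H1 if u is None or u > half else F1).append(name)
        else:
            u = util(t, 1)
            (H2 if u is None or u > half else F2).append(name)
    U = {p: Fraction(0) for p in ptype}                # line 2
    placed = {}                                        # line 3

    def ff(names, procs):
        return first_fit(names, procs, tasks, ptype, U, placed)

    if ff(H1, P1) != H1 or ff(H2, P2) != H2:           # lines 4-5
        return None
    F11, F22 = ff(F1, P1), ff(F2, P2)                  # lines 6-7
    F12 = [n for n in F1 if n not in F11]              # line 11
    F21 = [n for n in F2 if n not in F22]              # line 19
    if F12 and F21:                                    # line 9
        return None
    if ff(F12, P2) != F12 or ff(F21, P1) != F21:       # lines 12 and 20
        return None
    return placed                                      # SUCCESS

# The task set of Example 1 with k = 3
example = {f"t{i}": (1, 1, 1, 3) for i in range(1, 4)}
example.update({f"t{i}": (1, 1, 3, 1) for i in range(4, 7)})
print(ff_3c(example, {"P1": 1, "P2": 2}))
```

On the task set of Example 1, every task is heavy on its non-favorite type, so all tasks are placed in the first pass (lines 4-5) on the processor type where they run faster.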
Theorem 2. The speed competitive ratio of FF-3C is
at most 2.
Proof: An equivalent claim is that any task set $\tau$
which is not schedulable under FF-3C over a computing
platform $\Pi$ would likewise be unschedulable, using any
algorithm, over computing platform $\Pi'$ with processors each
half as fast as the corresponding one in $\Pi$. This, we will
prove (by contradiction). From the definition of $\Pi'$:
\[ \frac{r'_{i,1}}{r_{i,1}} = \frac{r'_{i,2}}{r_{i,2}} = \frac{1}{2} \quad \forall i \tag{26} \]
Assume that FF-3C has failed to assign $\tau$ on $\Pi$ but it is
possible (using an algorithm OPT) to assign $\tau$ on $\Pi'$. Since
FF-3C failed to assign $\tau$ on $\Pi$, it follows that FF-3C
declared FAILURE. We explore all possibilities for this to occur:
Failure on line 4 in FF-3C.
If $|\tau^{H1}| \le |P^1|$ then there would be no failure on
line 4 in FF-3C. Therefore, we know that $|\tau^{H1}| \ge |P^1| + 1$.
Hence OPT must assign at least one task
in $\tau^{H1}$ on a processor in $P^2$. Let $\tau_i$ denote this
task. Using the definition of $\tau^{H1}$ gives us that
$\frac{C_i}{T_i \cdot r_{i,2}} > 1/2$ and applying Equation 26 yields
$\frac{C_i}{T_i \cdot r'_{i,2}} > 1$. But since OPT assigned $\tau_i$ on a
processor in $P^2$ it must be that $\frac{C_i}{T_i \cdot r'_{i,2}} \le 1$. This
is a contradiction.
Failure on line 5 in FF-3C.
This results in a contradiction, which can be shown
because this case is symmetric to the case above.
Failure on line 9 in FF-3C.
From the case, we obtain that $\tau^{F11} \subset \tau^{F1}$
and $\tau^{F22} \subset \tau^{F2}$. Therefore, there was a task
$\tau_{failed1} \in \tau^{F1}$ which could not be assigned
on any processor in $P^1$ and there was a task
$\tau_{failed2} \in \tau^{F2}$ which could not be assigned on
any processor in $P^2$. Consequently, we obtain:
\[ \forall p \in P^1 : U[p] + \frac{C_{failed1}}{T_{failed1} \cdot r_{failed1,1}} > 1 \tag{27} \]
\[ \text{and } \forall p \in P^2 : U[p] + \frac{C_{failed2}}{T_{failed2} \cdot r_{failed2,2}} > 1 \tag{28} \]
Suppose that $\frac{C_{failed1}}{T_{failed1} \cdot r_{failed1,1}} > 1/2$. We know
that $\tau_{failed1} \in \tau^{F1}$ and this gives us $r_{failed1,1} \ge r_{failed1,2}$
which gives us $\frac{C_{failed1}}{T_{failed1} \cdot r_{failed1,2}} > 1/2$.
This implies that $\tau_{failed1} \in \tau^{H1}$ but this is
impossible because $\tau^{H1}$ and $\tau^{F1}$ are disjoint.
Therefore, we have that $\frac{C_{failed1}}{T_{failed1} \cdot r_{failed1,1}} \le 1/2$.
With analogous reasoning, we obtain
$\frac{C_{failed2}}{T_{failed2} \cdot r_{failed2,2}} \le 1/2$. Using these inequalities
on Inequalities 27 and 28 gives:
\[ \forall p \in P^1 : U[p] > 1/2 \tag{29} \]
\[ \text{and } \forall p \in P^2 : U[p] > 1/2 \tag{30} \]
Observing that tasks assigned on processors in
$P^1$ are a subset of $\tau^1$ and using Inequality 29
gives us:
\[ \sum_{\tau_i \in \tau^1} \frac{C_i}{T_i \cdot r_{i,1}} > \frac{|P^1|}{2} \tag{31} \]
With analogous reasoning, we obtain:
\[ \sum_{\tau_i \in \tau^2} \frac{C_i}{T_i \cdot r_{i,2}} > \frac{|P^2|}{2} \tag{32} \]
Observing these two inequalities and Equation 26
and Lemma 3 gives us that OPT fails to assign
tasks on $\Pi'$. This is a contradiction.
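Failure on line 15 in FF-3C.
From the case, we obtain that $\tau^{F11} \subset \tau^{F1}$ and
$\tau^{F22} = \tau^{F2}$. Therefore, there was a task $\tau_{failed} \in (\tau^{F1} \setminus \tau^{F11})$
which was tried on each of
the processors in $P^2$ and fit on none of them.
Therefore, we have:
\[ \forall p \in P^2 : U[p] + \frac{C_{failed}}{T_{failed} \cdot r_{failed,2}} > 1 \tag{33} \]
We can add these inequalities together and get:
\[ \sum_{p \in P^2} U[p] > |P^2| \cdot \left( 1 - \frac{C_{failed}}{T_{failed} \cdot r_{failed,2}} \right) \tag{34} \]
We know that the tasks assigned to processors
in $P^2$ are $\tau^{H2} \cup \tau^{F22} \cup \tau^{F12assigned}$ where
$\tau^{F12assigned}$ is the set of tasks that were assigned
when executing line 12 of FF-3C. We also
know that $\tau^{F12assigned} \subset \tau^{F12}$. Hence:
\[ \sum_{\tau_i \in (\tau^{H2} \cup \tau^{F22} \cup \tau^{F12})} \frac{C_i}{T_i \cdot r_{i,2}} > |P^2| \cdot \left( 1 - \frac{C_{failed}}{T_{failed} \cdot r_{failed,2}} \right) \tag{35} \]
We also know that since $\tau_{failed} \in \tau^{F1}$ it follows
that $\tau_{failed}$ is not in $\tau^{H1}$ and hence:
\[ \frac{C_{failed}}{T_{failed} \cdot r_{failed,2}} \le 1/2 \tag{36} \]
But since $\tau_{failed} \in \tau^{F1} \subseteq \tau^1$, using Inequality 4
gives us:
\[ \frac{C_{failed}}{T_{failed} \cdot r_{failed,1}} \le 1/2 \tag{37} \]
Combining Inequality 36 with Inequality 35 yields:
\[ \sum_{\tau_i \in (\tau^{H2} \cup \tau^{F22} \cup \tau^{F12})} \frac{C_i}{T_i \cdot r_{i,2}} > \frac{|P^2|}{2} \tag{38} \]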
We also know that FF-3C has executed line 6
and when it performed first-fit bin-packing, there
must have been a task $\tau_{failed1} \in (\tau^{F1} \setminus \tau^{F11})$
which was tried on each of the processors
in $P^1$ and fit on none of them. Note that this task
$\tau_{failed1}$ may be the same as $\tau_{failed}$ mentioned
above or it may be different. Because it was
not possible to assign $\tau_{failed1}$ on any of the
processors in $P^1$, we have:
\[ \forall p \in P^1 : U[p] + \frac{C_{failed1}}{T_{failed1} \cdot r_{failed1,1}} > 1 \tag{39} \]
Adding these inequalities together gives us:
\[ \sum_{p \in P^1} U[p] > |P^1| \cdot \left( 1 - \frac{C_{failed1}}{T_{failed1} \cdot r_{failed1,1}} \right) \tag{40} \]
We know that the tasks assigned to processors in
$P^1$ are $\tau^{H1} \cup \tau^{F11}$. Therefore, we have:
\[ \sum_{\tau_i \in (\tau^{H1} \cup \tau^{F11})} \frac{C_i}{T_i \cdot r_{i,1}} > |P^1| \cdot \left( 1 - \frac{C_{failed1}}{T_{failed1} \cdot r_{failed1,1}} \right) \tag{41} \]
We also know that since $\tau_{failed1} \in \tau^{F1}$ it follows
(from the definition of $\tau^{H1}$ in the beginning of this
section) that $\tau_{failed1}$ is not in $\tau^{H1}$ and hence:
\[ \frac{C_{failed1}}{T_{failed1} \cdot r_{failed1,1}} \le 1/2 \tag{42} \]
Combining them yields:
\[ \sum_{\tau_i \in (\tau^{H1} \cup \tau^{F11})} \frac{C_i}{T_i \cdot r_{i,1}} > \frac{|P^1|}{2} \tag{43} \]
Let us now discuss OPT, the algorithm which
succeeds in assigning the task set τ on the
computer platform Π′ . Let us discuss tasks in
τH1. From the definition, we know that:
∀τi ∈ τ
H1 :
Ci
Ti · ri,2
> 1/2 (44)
Using Equation 26 gives us:
∀τi ∈ τ
H1 :
Ci
Ti · r′i,2
> 1 (45)
Therefore, OPT would fail if a task in $\tau^{H1}$ was assigned to a
processor in $P^2$. Since we know that OPT succeeded, it follows that
every task in $\tau^{H1}$ is assigned to a processor in $P^1$. With
analogous reasoning, we have that every task in $\tau^{H2}$ is assigned
to a processor in $P^2$. Let $\tau^{OPT1}$ denote the tasks (except
those from $\tau^{H1}$) that were assigned to processors in $P^1$ by
OPT. Analogously, let $\tau^{OPT2}$ denote the tasks (except those from
$\tau^{H2}$) that were assigned to processors in $P^2$ by OPT.
Therefore (using Inequalities 1 and 2), we know that:
\[ \sum_{\tau_i \in (\tau^{H1} \cup \tau^{OPT1})} \frac{C_i}{T_i \cdot r'_{i,1}} \le |P^1| \tag{46} \]
and
\[ \sum_{\tau_i \in (\tau^{H2} \cup \tau^{OPT2})} \frac{C_i}{T_i \cdot r'_{i,2}} \le |P^2| \tag{47} \]
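Using Equation 26 gives us:
\[ \sum_{\tau_i \in (\tau^{H1} \cup \tau^{OPT1})} \frac{C_i}{T_i \cdot r_{i,1}} \le \frac{|P^1|}{2} \tag{48} \]
and
\[ \sum_{\tau_i \in (\tau^{H2} \cup \tau^{OPT2})} \frac{C_i}{T_i \cdot r_{i,2}} \le \frac{|P^2|}{2} \tag{49} \]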
Having obtained inequalities about the assignments of FF-3C and OPT,
we can now reason about them. Rewriting Inequality 43 yields:
\[ \sum_{i \in \tau^{F11}} \frac{C_i}{T_i \cdot r_{i,1}} > \frac{|P^1|}{2} - \sum_{i \in \tau^{H1}} \frac{C_i}{T_i \cdot r_{i,1}} \tag{50} \]
Also, simple rewriting of Inequality 48 yields:
\[ \sum_{\tau_i \in \tau^{OPT1}} \frac{C_i}{T_i \cdot r_{i,1}} \le \frac{|P^1|}{2} - \sum_{\tau_i \in \tau^{H1}} \frac{C_i}{T_i \cdot r_{i,1}} \tag{51} \]
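We can see that Inequalities 50 and 51 with
$x = \sum_{i \in \tau^{H1}} \frac{C_i}{T_i \cdot r_{i,1}}$
ensure that the assumptions of Lemma 6 are true. Using Lemma 6 gives us:
\[
\begin{aligned}
& \sum_{i \in \tau^{H1}} \frac{1}{r_{i,1}} \cdot \frac{C_i}{T_i}
+ \sum_{i \in \tau^{H2}} \frac{1}{r_{i,2}} \cdot \frac{C_i}{T_i}
+ \sum_{i \in \tau^{F11}} \frac{1}{r_{i,1}} \cdot \frac{C_i}{T_i}
+ \sum_{i \in \tau^{F12}} \frac{C_i}{r_{i,2} \cdot T_i}
+ \sum_{i \in \tau^{F22}} \frac{1}{r_{i,2}} \cdot \frac{C_i}{T_i} \\
\le{} & \sum_{i \in \tau^{H1}} \frac{1}{r_{i,1}} \cdot \frac{C_i}{T_i}
+ \sum_{i \in \tau^{H2}} \frac{1}{r_{i,2}} \cdot \frac{C_i}{T_i}
+ \sum_{i \in \tau^{OPT1}} \frac{1}{r_{i,1}} \cdot \frac{C_i}{T_i}
+ \sum_{i \in \tau^{OPT2}} \frac{1}{r_{i,2}} \cdot \frac{C_i}{T_i}
\end{aligned}
\tag{52}
\]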
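Applying Inequality 48 and Inequality 49 on Inequality 52 gives us:
\[
\sum_{i \in \tau^{H1}} \frac{1}{r_{i,1}} \cdot \frac{C_i}{T_i}
+ \sum_{i \in \tau^{H2}} \frac{1}{r_{i,2}} \cdot \frac{C_i}{T_i}
+ \sum_{i \in \tau^{F11}} \frac{1}{r_{i,1}} \cdot \frac{C_i}{T_i}
+ \sum_{i \in \tau^{F12}} \frac{C_i}{r_{i,2} \cdot T_i}
+ \sum_{i \in \tau^{F22}} \frac{1}{r_{i,2}} \cdot \frac{C_i}{T_i}
\le \frac{|P^1|}{2} + \frac{|P^2|}{2}
\]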
Applying Inequality 38 and Inequality 43 on the above inequality gives us:
\[ \frac{|P^1|}{2} + \frac{|P^2|}{2} < \frac{|P^1|}{2} + \frac{|P^2|}{2} \tag{53} \]
This is a contradiction.
Failure on line 23 in FF-3C.
A contradiction results; the proof is analogous to that for the case
"Failure on line 15 in FF-3C".
We see that all cases where FF-3C declares FAILURE lead to a
contradiction. Hence, the theorem holds.
5. Discussion and Conclusions
The model considered is restricted but captures many current and
future single-chip heterogeneous multiprocessors. The Cell processor
[14][20] consists of a Power processor and 8 synergistic processor
elements. The planned AMD Fusion [1] is similarly arranged, with a
processor and graphics processors integrated onto a single chip. A
difference from the Cell processor is that the graphics processors are
not Turing-complete, hence there may be some programs which they cannot
execute but the main processor can. This can be easily modelled by
treating the execution rate of those tasks on the graphics processor as
zero. Intel has similar plans [15]. Similar solutions are already in
the marketplace, such as the MPC5121e processor [11] and the network
processor [10] from Freescale. In fact, most network processors are
heterogeneous multiprocessors with two types of processors.
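The zero-rate convention can be sketched in a few lines of Python. This is an illustration under assumptions, not code from the paper: the helper names `utilization` and `fits` are hypothetical, and mapping a zero rate to infinite utilization is one convenient way to make any utilization-based admission test reject the task on that processor type.

```python
import math

def utilization(C, T, rate):
    """Utilization C / (T * rate) of a task on one processor type.

    A rate of zero models a task that this processor type cannot
    execute; it is treated as infinite utilization.
    """
    if rate == 0:
        return math.inf
    return C / (T * rate)

def fits(load, C, T, rate):
    """Uniprocessor EDF admission test: total utilization at most 1."""
    return load + utilization(C, T, rate) <= 1.0

# Hypothetical task with C = 2, T = 10: runs on the main processor
# (rate 1.0) but cannot run on the graphics processor (rate 0).
on_main = fits(0.5, 2, 10, 1.0)
on_graphics = fits(0.0, 2, 10, 0)
```

With this convention such a task is never placed on a graphics processor, even an empty one, without any special case in the assignment logic.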
We specify preemptive EDF scheduling on each processor but do not
specify which processor performs the scheduling. In the Cell processor,
the synergistic processors can only execute tasks assigned to them by
the Power processor; they are not autonomous. Our approach can still be
applied by having the Power processor keep track of all tasks (ready,
runnable or blocked) of all processors (i.e. both the Power processor
and the synergistic ones) and then notify each processor which task to
execute and when. The same applies to AMD Fusion and the MPC5121e.
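The centralized arrangement described above can be sketched as follows. This is a hypothetical illustration in Python, not an interface of the Cell processor or of any real runtime: the class `CentralEDFDispatcher`, its methods and the processor names are all assumptions. One coordinator holds a separate deadline-ordered queue per processor, since tasks never migrate, and reports which job each processor should run.

```python
import heapq

class CentralEDFDispatcher:
    """One coordinating processor tracks ready jobs for every processor."""

    def __init__(self, processors):
        # One ready queue per processor, ordered by absolute deadline.
        self.ready = {p: [] for p in processors}

    def release(self, processor, deadline, job):
        """Record that a job became ready on its assigned processor."""
        heapq.heappush(self.ready[processor], (deadline, job))

    def complete(self, processor):
        """Remove and return the earliest-deadline job of a processor."""
        return heapq.heappop(self.ready[processor])[1]

    def decisions(self):
        """Job each processor should run now under preemptive EDF."""
        return {p: (q[0][1] if q else None) for p, q in self.ready.items()}

# Hypothetical platform: one main processor and one coprocessor.
d = CentralEDFDispatcher(["main", "co0"])
d.release("co0", 20, "job_a")
d.release("co0", 10, "job_b")      # earlier deadline, preempts job_a
first = d.decisions()
finished = d.complete("co0")
second = d.decisions()
```

The coordinator would call `decisions` after every release or completion and notify only the processors whose choice changed.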
As seen, the model studied in this paper is of significant practical
interest and its importance is expected to increase further in the
future. But while the hardware/software codesign community has dealt
with the problem of scheduling real-time tasks (see for example [18]),
its algorithms are evaluated by simulation only, not by proofs of their
performance. Scheduling theorists, however, offer proofs of algorithm
performance. Multiprocessors are commonly categorized as:
Identical: All processors have the same speed.
Uniform: Each processor has a speed. A processor with a higher speed
executes every task proportionately faster.
Heterogeneous (sometimes called unrelated parallel machines): A matrix,
indexed with i and p, gives the speed that task i has when it executes
on processor p.
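The three categories can all be written as a task-by-processor rate matrix, which makes the restriction to two processor types easy to see. The Python sketch below is illustrative only; the constructor names and the helper `processor_types` are hypothetical, and the rates are made-up values.

```python
def identical(n_tasks, m):
    """Every task runs at rate 1 on every one of the m processors."""
    return [[1.0] * m for _ in range(n_tasks)]

def uniform(n_tasks, speeds):
    """Processor p has one speed that applies to every task."""
    return [list(speeds) for _ in range(n_tasks)]

def two_type(rates, n_type1, n_type2):
    """rates[i] = (r_i1, r_i2): per-task rate on each processor type."""
    return [[r1] * n_type1 + [r2] * n_type2 for r1, r2 in rates]

def processor_types(matrix):
    """Number of distinct columns, i.e. distinct processor types."""
    return len(set(zip(*matrix)))

# Hypothetical platform: 1 type-1 processor and 3 type-2 processors.
# Task 0 is faster on type 2, task 1 is faster on type 1.
rates = two_type([(1.0, 4.0), (2.0, 0.5)], n_type1=1, n_type2=3)
```

A general heterogeneous platform may have as many distinct columns as processors, whereas the model studied here allows at most two.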
Clearly, heterogeneous multiprocessors are the most general model, but
scheduling algorithms for them still need a large computational
complexity if provably good performance is to be achieved. We claim
that our model of heterogeneous multiprocessors with two types of
processors (i) is capable of modelling many of those heterogeneous
multiprocessors that are of practical interest and (ii) allows the
design of algorithms that run much faster but still maintain provably
good performance. Indeed, our new algorithm for task assignment over a
heterogeneous multiprocessor with only two types of processors is
certainly faster than algorithms based on Integer Linear Programming
(or relaxation to Linear Programming).
References
[1] AMD Inc. AMD Completes ATI Acquisition and Creates Processing
Powerhouse (Press release).
http://www.amd.com/us-en/Corporate/VirtualPressRoom/0,,51_104_543~113741,00.html,
October 2006.
[2] B. Andersson, S. Baruah, and J. Jonsson. Static-Priority Scheduling
on Multiprocessors. In Proceedings of the 22nd IEEE Real-Time Systems
Symposium, pages 193–202, 2001.
[3] B. Andersson and E. Tovar. Competitive Analysis of Partitioned
Scheduling on Uniform Multiprocessors. In Proceedings of the 15th
International Workshop on Parallel and Distributed Real-Time Systems,
pages 1–8, 2007.
[4] B. Andersson and E. Tovar. Competitive Analysis of Static-Priority
of Partitioned Scheduling on Uniform Multiprocessors. In Proceedings of
the 13th IEEE International Conference on Embedded and Real-Time
Computing Systems and Applications, pages 111–119, 2007.
[5] S. Baruah. Feasibility analysis of preemptive real-time systems
upon heterogeneous multiprocessor platforms. In Proceedings of the 25th
IEEE International Real-Time Systems Symposium, pages 37–46, 2004.
[6] S. Baruah. Partitioning real-time tasks among heterogeneous
multiprocessors. In Proc. of the 33rd International Conference on
Parallel Processing, pages 467–474, 2004.
[7] S. Baruah. Task partitioning upon heterogeneous multiprocessor
platforms. In Proceedings of the 10th IEEE International Real-Time and
Embedded Technology and Applications Symposium, pages 536–543, 2004.
[8] S. Borkar. Thousand Core Chips - A Technology Perspective. In
Proceedings of the 44th ACM/IEEE Design Automation Conference, pages
746–749, 2007.
[9] T. H. Cormen, C. E. Leiserson, R. L. Rivest, and C. Stein.
Introduction to Algorithms, 2nd Ed. McGraw-Hill, 2001.
[10] Freescale Semiconductor. C-5 Network Processor (NP).
http://www.freescale.com/webapp/sps/site/prod_summary.jsp?code=C-5&nodeId=01DFTQ3126q62S.
[11] Freescale Semiconductor. Freescale Unveils Multicore Processor for
Telematics, Consumer, Industrial Applications.
http://parts.ihs.com/news/freescale-multicore-processor.htm, May 2007.
(Press release).
[12] M. R. Garey and D. S. Johnson. Computers and Intractability: A
Guide to the Theory of NP-Completeness. W. H. Freeman & Co, 1979.
[13] D. Geer. Taking the Graphics processor Beyond Graphics. IEEE
Computer, 38(9):14–16, 2005.
[14] M. Gschwind, H. P. Hofstee, B. Flachs, M. Hopkins, Y. Watanabe,
and T. Yamazaki. Synergistic Processing in Cell's Multicore
Architecture. IEEE Micro, 26(2):10–24, 2006.
[15] Intel Corporation. Intel Developer Forum Day 1 News Disclosures
From Beijing (Press release).
http://www.intel.com/pressroom/archive/releases/20070416supp.htm,
April 2007.
[16] N. Karmarkar. A new polynomial-time algorithm for linear
programming. Combinatorica, 4(4):373–395, 1984.
[17] L. G. Khachiyan. A polynomial algorithm in linear programming.
Doklady Akademia Nauk, 20:1093–1096, 1979.
[18] T. G. C. Lavarenne and Y. Sorel. Optimized rapid prototyping for
real-time embedded heterogeneous multiprocessors. In Proceedings of the
7th International Workshop on Hardware/Software Codesign, pages 74–48,
1999.
[19] C. L. Liu and J. W. Layland. Scheduling algorithms for
multiprogramming in a hard real-time environment. Journal of the ACM,
20:46–61, 1973.
[20] S. Maeda, S. Asano, T. Shimada, K. Awazu, and H. Tago. A real-time
software platform for the Cell processor. IEEE Micro, 25(5):20–29,
2005.
[21] C. N. Potts. Analysis of a linear programming heuristic for
scheduling unrelated parallel machines. Discrete Applied Mathematics,
10:155–164, 1985.
[22] L. Sha, R. Rajkumar, and S. S. Sathaye. Generalized rate-monotonic
scheduling theory: a framework for developing real-time systems. Proc.
of the IEEE, 82(1):68–82, 1994.
[23] K. W. Tindell. An Extensible Approach for Analysing Fixed Priority
Hard Real-Time Tasks. Technical Report YCS 189, Dept. of Computer
Science, University of York, UK, 1992.
Appendix
Lemma 2. If there is a task $\tau_i$ in $\tau^{1}$ such that
$1 < \frac{C_i}{r_{i,1} \cdot T_i}$, it is then impossible to meet
deadlines. Likewise if there is a task $\tau_i$ in $\tau^{2}$ such that
$1 < \frac{C_i}{r_{i,2} \cdot T_i}$.
Proof: Intuitively, if the execution time of $\tau_i$ exceeds its
deadline even on the type of processor where it runs fastest, it cannot
be assigned anywhere so as to meet deadlines.
Lemma 3. It is impossible to meet deadlines if
\[ \sum_{i \in \tau^{1}} \frac{C_i}{r_{i,1} \cdot T_i} + \sum_{i \in \tau^{2}} \frac{C_i}{r_{i,2} \cdot T_i} > |P^1| + |P^2| \tag{54} \]
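Proof: The proof is by contradiction. Let $\tau$ be a task set for
which Inequality 54 holds. Assume then that a feasible partitioning of
$\tau$ exists.
Given that $\tau$ is feasible, the set of constraints expressed by
Inequalities 1 and 2 must hold. Then, from Inequalities 1 and 2
respectively we have:
\[ \sum_{i \in \tau[p] \cap \tau^{1}} \frac{C_i}{r_{i,1} \cdot T_i} + \sum_{i \in \tau[p] \cap \tau^{2}} \frac{C_i}{r_{i,1} \cdot T_i} \le 1 \quad \forall p \in P^1 \tag{55} \]
\[ \sum_{i \in \tau[p] \cap \tau^{1}} \frac{C_i}{r_{i,2} \cdot T_i} + \sum_{i \in \tau[p] \cap \tau^{2}} \frac{C_i}{r_{i,2} \cdot T_i} \le 1 \quad \forall p \in P^2 \tag{56} \]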
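However, from Inequalities 5 and 4 respectively:
\[ (5) \Rightarrow \frac{C_i}{r_{i,1} \cdot T_i} > \frac{C_i}{r_{i,2} \cdot T_i} \quad \forall i \in \tau^{2} \tag{57} \]
and
\[ (4) \Rightarrow \frac{C_i}{r_{i,1} \cdot T_i} \le \frac{C_i}{r_{i,2} \cdot T_i} \quad \forall i \in \tau^{1} \tag{58} \]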
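Then (55) and (57) imply (with strict inequality whenever
$\tau[p] \cap \tau^{2}$ is nonempty):
\[ \sum_{i \in \tau[p] \cap \tau^{1}} \frac{C_i}{r_{i,1} \cdot T_i} + \sum_{i \in \tau[p] \cap \tau^{2}} \frac{C_i}{r_{i,2} \cdot T_i} \le 1 \quad \forall p \in P^1 \tag{59} \]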
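Likewise, (56) and (58) imply:
\[ \sum_{i \in \tau[p] \cap \tau^{1}} \frac{C_i}{r_{i,1} \cdot T_i} + \sum_{i \in \tau[p] \cap \tau^{2}} \frac{C_i}{r_{i,2} \cdot T_i} \le 1 \quad \forall p \in P^2 \tag{60} \]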
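We can combine Inequalities 59 and 60 into:
\[ \sum_{i \in \tau[p] \cap \tau^{1}} \frac{C_i}{r_{i,1} \cdot T_i} + \sum_{i \in \tau[p] \cap \tau^{2}} \frac{C_i}{r_{i,2} \cdot T_i} \le 1 \quad \forall p \tag{61} \]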
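Via summation of Inequality 61 over all p we obtain
\[
\begin{aligned}
& \sum_{p} \sum_{i \in \tau[p] \cap \tau^{1}} \frac{C_i}{r_{i,1} \cdot T_i} + \sum_{p} \sum_{i \in \tau[p] \cap \tau^{2}} \frac{C_i}{r_{i,2} \cdot T_i} \le \sum_{p} 1 \\
\Rightarrow{} & \sum_{i \in \tau^{1}} \frac{C_i}{r_{i,1} \cdot T_i} + \sum_{i \in \tau^{2}} \frac{C_i}{r_{i,2} \cdot T_i} \le |P^1| + |P^2|
\end{aligned}
\tag{62}
\]
This contradicts Inequality 54.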
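Lemmas 2 and 3 give necessary conditions that are cheap to test before attempting any assignment. The Python sketch below is an illustration under assumptions, not part of the paper: the function names are hypothetical, each task is given as a tuple `(C, T, r1, r2)` with integer `C` and `T`, and at least one rate is assumed positive. It relies on the fact that a task in $\tau^{1}$ has $r_{i,1} \ge r_{i,2}$ and a task in $\tau^{2}$ has $r_{i,2} > r_{i,1}$, so each term of Inequality 54 is the utilization on the faster type.

```python
from fractions import Fraction

def best_utilization(C, T, r1, r2):
    """Utilization on the processor type where the task runs fastest."""
    return Fraction(C, T) / max(r1, r2)

def necessary_conditions_hold(tasks, n_type1, n_type2):
    """Return False if Lemma 2 or Lemma 3 rules out every partitioning.

    A True result does not prove that the task set is feasible.
    """
    best = [best_utilization(*task) for task in tasks]
    if any(u > 1 for u in best):            # Lemma 2
        return False
    return sum(best) <= n_type1 + n_type2   # negation of Inequality 54

# Hypothetical task set: (C, T, r1, r2) per task.
tasks = [(3, 4, 1, 2), (3, 4, 2, 1), (1, 2, 1, 1)]
ok_two = necessary_conditions_hold(tasks, 1, 1)
ok_one = necessary_conditions_hold(tasks, 1, 0)
too_heavy = necessary_conditions_hold([(5, 4, 1, 1)], 4, 4)
```

Exact rational arithmetic is used so that task sets sitting exactly on the bound are not misclassified by rounding.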
