Scheduling of Stream-Based Real-Time Applications for Heterogeneous Systems by Virlet, Bruno M.
© 2010 Bruno Marie Joseph Virlet
SCHEDULING OF STREAM-BASED REAL-TIME APPLICATIONS
FOR HETEROGENEOUS SYSTEMS
BY
BRUNO MARIE JOSEPH VIRLET
THESIS
Submitted in partial fulfillment of the requirements
for the degree of Master of Science in Computer Science
in the Graduate College of the
University of Illinois at Urbana-Champaign, 2010
Urbana, Illinois
Adviser:
Professor David Padua
ABSTRACT
Recent mobile devices face the challenge of offering their users both ever
more processing power and longer battery life.
Heterogeneous systems offer opportunities to solve this challenge. In this
thesis, we study the scheduling of tasks in a real-time context on a
heterogeneous system-on-chip. We develop a heuristic scheduling algorithm
which minimizes the energy while still meeting the deadline. We introduce
the concept of task heterogeneity and model sets of tasks to conduct
extensive experiments. These experiments show that our heuristic has a
much higher success rate than existing state-of-the-art heuristics and derives
a solution whose energy requirements are close to those of the optimal solution.
To my Parents
ACKNOWLEDGMENTS
I would first like to express my gratitude to my advisor, Professor David
Padua, for the trust he put in me since the beginning and his guidance in
my research. His knowledge, patience and kindness made my graduate
student experience rich and exciting. He introduced me to the world of
academic research and allowed me to interact with great people. I would
also like to thank María Garzarán for her understanding, her guidance and
constant availability and all the Polaris group for their friendship.
A special thanks goes to the people at Intel: David Kuck was
immediately encouraging as I began my early research, and my thesis would
not have been possible without the advice and feedback from Bob Kuhn
and Jean-Pierre Giacalone. I would also like to thank Peng Tu for his
kindness and for enabling me to discover the industrial applications of a
whole field of research during my internship at Intel.
I would like to address special thanks to my roommates Bernd,
Guillaume and Martin. I will always remember our long dinner discussions,
often about trying to solve the world’s problems with ingenuity and
sometimes about our respective research topics; exchanging ideas with
Guillaume was especially productive.
Finally, I want to thank all my friends and family back in France,
especially my parents for their love, their support and for calling me every
week to encourage me. It is thanks to them that I received one of the best
possible educations.
This material is based upon work supported by the National Science
Foundation under Awards CCF 0702260, CNS 0509432, and by the
Universal Parallel Computing Research Center at the University of Illinois
at Urbana-Champaign, sponsored by Intel Corporation and Microsoft
Corporation.
CONTENTS
Chapter 1 INTRODUCTION
Chapter 2 RELATED WORK
Chapter 3 PROBLEM FORMULATION
3.1 Hardware and Power Model
3.2 Application Model
3.3 Scheduling Problem
Chapter 4 SCHEDULING ALGORITHM
4.1 Algorithm
4.1.1 Task Set Identification
4.1.2 Mapping Phase
4.1.3 Frequency Choice Phase
4.2 Improving the Heuristic
Chapter 5 OTHER APPROACHES
5.1 Solving ILP
5.2 The Greedy Heuristic
5.3 The LR-heuristic
Chapter 6 EXPERIMENTS
6.1 Experiment Setup
6.1.1 Task Set Generation
6.1.2 Time Constraint
6.2 Experiment Results
6.2.1 Energy Savings and Success Rate
6.2.2 Sorting by Heterogeneity vs. Sorting by Size
6.2.3 Sensibility Analysis
6.2.4 Execution Time of the Heuristic
Chapter 7 CONCLUSION
BIBLIOGRAPHY
Chapter 1
INTRODUCTION
Advances in portable devices demand increasing speed of computation as
well as longer operational autonomy. Techniques that balance program
execution speed and energy consumption are the topic of this thesis. These
techniques should be studied in the context of the most important
applications of portable devices. For this reason, we focus on streaming
computations, which are networks of tasks that operate on a data stream
flowing into the computation at a fixed rate [1]. Upon completion, each
task passes its output to its successor(s) in the network. Streaming
computations are often used to implement video post-processing
applications, which are among the most important for the future of mobile
devices. In video post-processing computations, tasks implement filters and
the data stream is a sequence of video frames. The computation is typically
subjected to a real-time constraint which is to display between 15 and 30
frames per second for many of today’s video post-processing applications.
The objective of the techniques discussed in this thesis is to minimize
the energy consumed by streaming computations under the constraint of a
minimum output rate. We assume that there could be multiple classes of
processors embedded in the mobile device and that they have dynamic
voltage scaling (DVS) functionality. Since the maximum possible frequency
is not typically required to achieve the desired output rate, energy
consumption can often be minimized by lowering the voltage and hence the
frequency as much as the real-time constraint allows. This minimization
process is complicated by three factors. First, only a discrete number of
frequencies are possible in today’s machines. Second, changing the voltage
consumes time, which means that such changes must be applied judiciously.
Third, energy efficiency can be controlled not only by DVS but also by
choosing the class of processor on which to map each task since different
processors are efficient for different types of tasks. A GPU will excel in
vector operations but will be inferior to a conventional CPU for
applications rich in control flow.
Our techniques take the form of a two-step heuristic that first chooses
the class of processors on which to map each task and then applies a
homogeneous scheduling algorithm to do the voltage scaling within each
homogeneous subsystem.
Chapter 2
RELATED WORK
Task scheduling for power efficiency is a fairly recent problem. Scheduling
on one processor with DVS capability was discussed as early as 2001 [2] and
has since then been extensively studied [3, 4, 5]. Scheduling for
homogeneous multiprocessors with DVS capability has also been
studied [6, 7, 8]. As discussed below, we use some of these techniques (those
described in [8]) as part of our scheduling strategy for heterogeneous
systems.
Mixed software-hardware strategies in which the application is
partitioned between hardware and software components [9] have also been
studied but they require some of the scheduling to happen at hardware
design time. Mixed strategies may offer more performance because of the
specialized hardware.
A few papers study the scheduling on heterogeneous processors with
multi-level voltages. The case of heterogeneous single-voltage setup systems
has been studied in [10, 11]. The single-voltage setup problem consists in
choosing a fixed frequency for each processor. This technique helps system
designers choose the most efficient operating point for their products.
However, such systems clearly lack flexibility when optimizing different
applications.
Luo and Jha addressed the heterogeneous scheduling problem with
continuous voltage scaling [12]. They assume that the frequency can be
scaled continuously between a minimum and a maximum value. Yang,
Chen, Kuo and Thiele [13] study the heterogeneous multi-level voltage
scheduling problem but they assume that the processors can be set at any
frequency or that the frequency will only be changed once [14] at any point
of the computation. In this approach, the tasks must execute frequency
scaling operations at dynamically selected points of the computation or an
external monitoring subsystem can change the frequencies at the right time.
It would be delicate to do a fair comparison with their algorithm (MTRIM)
because MTRIM also offers a guarantee of precision if the solution found is
a feasible schedule and choosing the precision in a fair manner would be
difficult. Finally, the dynamic programming approach they take has the
drawback that the amount of memory it requires is polynomial in the input
size and the required precision, which also makes its execution time far from
optimal. This is a concern on embedded devices, where memory is scarce, and
for compilers, which are expected to be fast; it restricts the use of this
algorithm to design time only.
To the best of our knowledge the paper by Yu and Prasanna [15] is the
only one so far to propose a reasonably efficient algorithm for the
heterogeneous multi-level discrete frequency scaling scheduling. They
propose a fast heuristic algorithm based on the linear relaxation of the
linear-programming problem but this heuristic fails in multiple cases where
a good solution can be found, especially for smaller size problems.
The strategy we introduce in this thesis is a two-step heuristic which
optimizes the partition of the computation between different heterogeneous
processors and then takes advantage of homogeneous scheduling techniques.
We allow frequency scaling at the granularity of the tasks. This enables us
to place the code for frequency scaling at natural locations. Our heuristic
algorithm outperforms previous heuristics, has a high success rate and, for
the cases we studied, often achieves optimality. It also offers very stable
results when the architecture and task heterogeneities vary. Additionally,
the memory usage is linear in the size of the input. Finally, we demonstrate
the importance of the concept of heterogeneity of a task in the choice of the
processors and show how combining two heuristics enables even better
results.
Chapter 3
PROBLEM FORMULATION
3.1 Hardware and Power Model
We assume that the target system contains m processing elements (PE) Pj
for j = 1, ..., m. We also assume a finite number of possible frequencies. The
system considered is heterogeneous, which means that it contains different
classes of processors: we assume that there are p different types of processors
PT1, ..., PTp in the system. We say that Pj ∈ PTk if Pj is of type PTk.
Let Fk be the (finite) set of allowed frequencies for the processor type
PTk. If the processor type does not support DVS, the set Fk is reduced to
the singleton containing the only frequency offered by processors of type
PTk.
The power consumed by the system will fluctuate over time depending
on which processor units are in use. The power function indicating the
average energy consumption rate of the processor as a function of the
frequency can be obtained from specifications or can be measured. For each
frequency f ∈ Fk, let pk(f) be the associated average dynamic power. We
will not consider the static power because we assume that it is constant.
We assume that ∀k ∈ [1, p], f ↦ pk(f) is an increasing function of the
frequency. Generally, pk is also convex, so that a slight increase in the
frequency at low frequency will not have much impact on power whereas
the same increase at high frequency produces a much higher power increase.
In fact, for CMOS DVS processors, the dynamic power function p(f) can be
approximated by $p(f) = C f^3 / \kappa^2$, where C is the switch capacitance and κ is
a design specific constant [16].
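The effect of this cubic power model on energy can be illustrated with a short sketch. It assumes the approximation above and multiplies power by the execution time (cycles divided by frequency); the function names and the default values C = κ = 1 are hypothetical and chosen only for illustration.

```python
def dynamic_power(f, c=1.0, kappa=1.0):
    """Approximate CMOS dynamic power p(f) = C * f**3 / kappa**2."""
    return c * f ** 3 / kappa ** 2


def task_energy(cycles, f, c=1.0, kappa=1.0):
    """Energy of a task: power multiplied by its execution time cycles / f."""
    return dynamic_power(f, c, kappa) * cycles / f


# Same task (1e9 cycles, an arbitrary value) at two frequencies.
e_fast = task_energy(cycles=1e9, f=2e9)
e_slow = task_energy(cycles=1e9, f=1e9)
ratio = e_fast / e_slow  # energy grows with the square of the frequency
```

Under this model, halving the frequency doubles the execution time but divides the energy by about four, which is why running as slowly as the deadline allows is attractive.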
3.2 Application Model
The scheduling problem of dependent tasks without dependence cycles can
be reduced to the scheduling of a kernel of independent tasks, as shown
by Liu et al. [8]. We will discuss this point in more depth in
Section 4.1.1.
Therefore, we will consider a set of n independent tasks Ti
(i = 1, 2, ..., n) and a deadline d defined as the maximum time allotted to
process one dataset (in the case of video post-processing, this would be the
time required to generate one video frame). We define $C^{PT}_{ik}$ as the number
of cycles required to execute task Ti on a processor of type PTk. Similarly, Cij is
defined as the number of cycles required to execute Ti on processor Pj. We assume
that the number of cycles is not data dependent. This is the case for the
video post-processing filters we studied. If it were data dependent, $C^{PT}_{ik}$
could be defined as the worst-case number of cycles so that the algorithm
can guarantee the feasibility of the schedule it finds.
Let $\hat{C}^{PT}_{ik} = \frac{1}{p} \sum_{k=1}^{p} C^{PT}_{ik}$ be the average cycles of a task on the different
types of processing units. We define the heterogeneity of a task by:
\[ H_i = \frac{\sum_{k=1}^{p} \left( C^{PT}_{ik} - \hat{C}^{PT}_{ik} \right)^2}{\hat{C}^{PT}_{ik}} \]
A task $T_{i_1}$ is more heterogeneous than another task $T_{i_2}$ if $H_{i_1} > H_{i_2}$. This
means that choosing the right processor for $T_{i_1}$ has a more significant
impact on its execution time than choosing a processor for $T_{i_2}$ would have
on $T_{i_2}$’s execution time.
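As a small worked example of this definition, the sketch below computes the heterogeneity of a task from its cycle counts on each processor type. The function name and the sample cycle counts are hypothetical.

```python
def heterogeneity(cycles_per_type):
    """H_i for one task, given its cycle counts on each processor type."""
    p = len(cycles_per_type)
    mean = sum(cycles_per_type) / p          # average cycles over the p types
    return sum((c - mean) ** 2 for c in cycles_per_type) / mean


# A task that costs the same everywhere versus one that strongly prefers a type.
h_uniform = heterogeneity([100, 100, 100])
h_skewed = heterogeneity([20, 100, 180])
```

With these sample values the first task has a heterogeneity of 0 and the second of 128, so the second task would be considered first when tasks are sorted by decreasing heterogeneity.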
Notice that the execution time of task Ti on processor Pj and at
frequency f is $C_{ij} / f$.
3.3 Scheduling Problem
Let $V_z = (P_{j_z}, f_z)$ be a processor-frequency pair. There are $v = \sum_{j=1}^{m} |F_j|$
pairs. Let $x_{iz}$ be 1 if task Ti is mapped to the pair $V_z$ (which means that
the task Ti will run on processor $P_{j_z}$ at frequency $f_z$) and 0 otherwise. The
value $e_{iz} = p_k(f_z) \frac{C_{ij_z}}{f_z}$ is the energy consumed by task Ti running at
frequency $f_z$ on processor $P_{j_z}$ of type PTk. Given $V_z = (P_{j_z}, f_z)$, the
fraction of processor $P_{j_z}$ utilized by the task Ti when mapped to a pair $V_z$ is
$u_{iz} = \frac{C_{ij_z}}{d \times f_z}$, where d is the deadline. If this fraction is smaller than 1, there is
still room on the processor to accommodate other tasks and still meet the
deadline. The utilization $U_j$ of processor Pj is the sum of the utilizations
of all the tasks mapped to it: $U_j = \sum_{i=1}^{n} \sum_{V_z \in VLG(j)} u_{iz} x_{iz}$. For each processor Pj, the
real-time constraint requires that $U_j \le 1$.
The scheduling problem can be formulated as an integer linear
programming problem ILP which consists in the minimization of:
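\[ \sum_{i=1}^{n} \sum_{z=1}^{v} e_{iz} x_{iz} \tag{3.1} \]
such that:
\[ U_j \le 1, \quad 1 \le j \le m \tag{3.2} \]
\[ \sum_{z=1}^{v} x_{iz} = 1, \quad 1 \le i \le n \tag{3.3} \]
\[ x_{iz} \in \{0, 1\}, \quad 1 \le i \le n,\ 1 \le z \le v \tag{3.4} \]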
While Equation 3.2 states that the deadline must be respected, Equations
3.3 and 3.4 specify that each task has to be mapped entirely to one
processor. We call optimal solution any solution to this problem, if one exists.
We call feasible schedule any mapping that satisfies the constraints
(Equations 3.2 to 3.4) without necessarily minimizing the objective
(Equation 3.1). Finally, a mapping of all the tasks
which does not respect the time constraint (Equation 3.2) but satisfies
Equations 3.3 and 3.4 is called an infeasible schedule.
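For very small instances, the formulation can be checked by enumerating every mapping. The sketch below is only an illustration of Equations 3.1 to 3.4, not the solver used in this thesis; the function name, the argument layout and the toy instance are assumptions.

```python
from itertools import product


def brute_force_schedule(cycles, proc_type, freqs, power, deadline):
    """Exhaustively solve the ILP for a tiny instance.

    cycles[i][j] : cycles of task i on processor j (C_ij)
    proc_type[j] : type index k of processor j
    freqs[k]     : allowed frequencies F_k of type k
    power(k, f)  : dynamic power p_k(f)
    Returns (energy, mapping) with mapping[i] = (j, f), or None if infeasible.
    """
    n, m = len(cycles), len(proc_type)
    # All processor-frequency pairs V_z = (P_jz, f_z).
    pairs = [(j, f) for j in range(m) for f in freqs[proc_type[j]]]
    best = None
    for choice in product(pairs, repeat=n):       # one pair per task (3.3, 3.4)
        util = [0.0] * m
        energy = 0.0
        for i, (j, f) in enumerate(choice):
            util[j] += cycles[i][j] / (deadline * f)             # u_iz
            energy += power(proc_type[j], f) * cycles[i][j] / f  # e_iz
        feasible = all(u <= 1.0 for u in util)                   # (3.2)
        if feasible and (best is None or energy < best[0]):
            best = (energy, list(choice))
    return best


# Toy instance: two tasks, two processors of different types, cubic power.
toy = dict(cycles=[[4, 8], [6, 2]], proc_type=[0, 1],
           freqs=[[1, 2], [1, 2]], power=lambda k, f: f ** 3)
optimum = brute_force_schedule(deadline=5, **toy)
```

The enumeration visits v^n mappings, which illustrates why an exact search is only practical for small task sets.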
For discrete frequencies, finding the optimal schedule that minimizes the
energy consumption while meeting the time constraint is clearly an NP-hard
problem since the scheduling problem is NP-hard even ignoring the energy
issues. We therefore solve the problem with a heuristic to avoid the
exponential complexity. The algorithm we use to address this problem is
presented in the next chapter.
Chapter 4
SCHEDULING ALGORITHM
4.1 Algorithm
Before scheduling an application, it is necessary to identify which tasks to
schedule. We first study how to build a kernel of independent
tasks. Our heuristic then has two phases which apply to this kernel. The
first, mapping, chooses the processor type for each task. The power
function is used in that phase to guide the choice. The second phase,
frequency choice, chooses one of the processors within the type selected in
the first phase and the frequency for each task.
4.1.1 Task Set Identification
Given a stream-based application, the preliminary step consists in
partitioning the total set of tasks into sets of independent tasks. Since the
same sequence of tasks is applied to every input unit, the partitioning can
be organized in such a way that a single set of independent tasks
suffices to represent all of the tasks except for a few at the beginning
(prolog) and at the end (epilog). We call this representative set the kernel. For
correctness, the kernel repeated in time and applied to all input units must
give the same output dataset as the original application.
The general approach to build a kernel is to use a technique similar to
software pipelining [8], provided that the tasks do not have dependence
cycles (in other words, the task graph can be represented as a directed
acyclic graph or DAG). One can reorganize the computation as kernels of
independent tasks where the different tasks in each kernel instance work on
different input data units: in the case of video post-processing applications,
each task within each instance of the kernel operates on a different frame.
For instance, we can consider different filters from MJPEGTools [17]
arranged in a pipeline: denoise, sharpen, increase frame rate, up-scale.
Each filter is a task and each frame goes through all the filters in a fixed
order. The data communication will then happen between two schedule
repetitions. Let Ti[k] be the filter Ti applied to frame k; the normal chain of
dependencies would be T1[k] → T2[k] → · · · → Tn[k]. At a given time, we
can clearly schedule the kernel
Tn[k + n − 1], Tn−1[k + n − 2], ..., T2[k + 1], T1[k], whose tasks are
independent of each other. Figure 4.1 illustrates this for n = 4.
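The idea of a kernel in which every filter works on a different frame can be sketched as follows. The indexing convention (filters and frames numbered from 0, with filter i working i frames behind filter 0) and both function names are assumptions of this sketch.

```python
def kernel_instance(step, n_filters):
    """Return the (filter, frame) pairs executed in kernel instance `step`.

    Filter i lags i frames behind filter 0, so no two tasks of one instance
    touch the same frame. Pairs with a negative frame index belong to the
    prolog and are dropped.
    """
    return [(i, step - i) for i in range(n_filters) if step - i >= 0]


def respects_chain(n_steps, n_filters):
    """Check that filter i processes a frame only after filter i - 1 did."""
    done = set()
    for step in range(n_steps):
        instance = kernel_instance(step, n_filters)
        # Every predecessor must have run in an earlier instance.
        if any(i > 0 and (i - 1, frame) not in done for i, frame in instance):
            return False
        done.update(instance)
    return True
```

For four filters, `kernel_instance(3, 4)` is the first complete instance; earlier steps form the prolog, in which only some of the filters have a frame to work on.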
Once a kernel of independent tasks has been built, another way to
manipulate this kernel is to split the tasks. Video post-processing filters are
generally easy to tile; for instance, instead of considering task Ti working on
input k, we can consider two subtasks, Ti,1 and Ti,2 working respectively on
two subsets of input k. The effect of splitting on the scheduling is to
increase the number of tasks and thus give more flexibility of choice to the
scheduler. As we will see in the experiments, increasing the number of tasks
has an impact on the optimality of the schedule.
In the case where the dependence graph of the tasks is not a DAG,
there will be multiple kernels instead of a single one and the methodology
described below can apply to each kernel.
4.1.2 Mapping Phase
Given a kernel of independent tasks, the mapping phase, presented in
Algorithm 1, assigns each task to a processor type. The power function is
an increasing function. The faster a processor computes, the greater the
energy consumed. The goal of the heuristic is to find a mapping of the
tasks to processor types so that they run as slowly as possible, and hence
consume the smallest amount of energy while still meeting the deadline.
The algorithm assigns processors to tasks following a decreasing
heterogeneity order (line 7). By following this order, we give more choice to
the tasks whose energy is most affected by the type of processor. The end
result is that the overall energy consumption is reduced. Section 6.2.2 shows
that following this order is effective in reducing the overall energy.
We then compute the “insertion frequency” of this task on each
Figure 4.1: Kernel identification for four video filter tasks and d frames.
Algorithm 1 Heuristic algorithm
1: Input: kernel of independent tasks Ti, set of processors Pj.
2: Output: variables xij set to 1 if task i is mapped to processor j.
3: for all i, j do
4:     xij = 0
5:     fj = 0
6: end for
7: for each task Ti in decreasing heterogeneity Hi order do
8:     for each processor Pj of type PTk do
9:         $f_{opt,j} = \frac{\sum_{i'} x_{i'j} C_{i'j} + C_{ij}}{d}$
10:        fnew,j = fopt,j rounded up to the next discrete frequency available on PTk, or max(Fk) if fopt,j ≥ max(Fk).
11:        $\delta e_j = p_k(f_{new,j}) \left[ \left( \sum_{i'} \frac{x_{i'j} C_{i'j}}{f_{new,j}} \right) + \frac{C_{ij}}{f_{new,j}} \right] - p_k(f_j) \sum_{i'} \frac{x_{i'j} C_{i'j}}{f_j}$
12:    end for
13:    Choose j0 such that δej0 is minimal and task Ti fits on Pj0 if Ti and all the tasks already assigned to Pj0 were to run at maximum frequency. Fail if no such processor exists.
14:    xij0 = 1.
15:    fj0 = fnew,j0.
16: end for
processing unit and the associated energy increase if the task were to be
mapped on this processing unit (lines 8 to 12).
The “insertion frequency” fopt,j is the ideal frequency [14, 18] at which
the processor Pj should run so that all the tasks already assigned to Pj,
together with the candidate task Ti, finish within the time constraint while
minimizing the energy consumed. This
frequency converges to an approximation of the frequency at which all the
tasks on the processor should run in an ideal situation. In general this
frequency is not available and the tasks would have to run at the smallest
higher discrete frequency available, fnew,j. If the insertion frequency is
higher than the highest available frequency, this processor cannot
execute this task and all the previously assigned tasks within the
deadline. When space is not available on any other processor either, the
scheduling algorithm fails.
At line 13, we choose the processor which produces the minimum energy
increase when the task is assigned to it at the insertion frequency. We also make
sure that the deadline is always met when running at the maximum
frequency. By checking this, we prevent this step of the algorithm from generating
an infeasible schedule. It might be possible that, at this step, no processor
can accommodate an additional task. If this is the case, the heuristic fails. It
does not mean, however, that no feasible schedule exists. Failure is inherent to
the heuristic approach. In Section 4.2, we will discuss ways to reduce the
failure rate in finding a feasible solution.
Finally, the current insertion frequency of the processor is reset to the
new frequency (line 15) and we proceed to the next task.
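The mapping phase can be sketched in a few lines of Python. This is a minimal illustration of Algorithm 1 under stated assumptions, not the implementation used for the experiments: the function name and argument layout are hypothetical, the energy of an empty processor is taken to be zero, and ties are broken by processor index.

```python
def mapping_phase(cycles, proc_type, freqs, power, deadline, heterogeneity):
    """Sketch of the mapping phase (Algorithm 1).

    cycles[i][j]     : cycles of task i on processor j (C_ij)
    proc_type[j]     : type index k of processor j
    freqs[k]         : allowed frequencies F_k of type k
    power(k, f)      : dynamic power p_k(f)
    heterogeneity[i] : H_i of task i
    Returns (assignment, f) with assignment[i] = j, or None on failure.
    """
    n, m = len(cycles), len(proc_type)
    load = [0.0] * m            # cycles already assigned to each processor
    f = [0.0] * m               # current insertion frequency of each processor
    assignment = [None] * n
    order = sorted(range(n), key=lambda i: heterogeneity[i], reverse=True)
    for i in order:
        best = None             # (energy increase, processor, new frequency)
        for j in range(m):
            k = proc_type[j]
            total = load[j] + cycles[i][j]
            f_opt = total / deadline
            if f_opt > max(freqs[k]):
                continue        # does not fit even at maximum frequency
            # Round up to the next available discrete frequency.
            f_new = min(x for x in freqs[k] if x >= f_opt)
            old = power(k, f[j]) * load[j] / f[j] if load[j] > 0 else 0.0
            delta = power(k, f_new) * total / f_new - old
            if best is None or delta < best[0]:
                best = (delta, j, f_new)
        if best is None:
            return None         # the heuristic fails
        _, j0, f_new = best
        assignment[i] = j0
        load[j0] += cycles[i][j0]
        f[j0] = f_new
    return assignment, f
```

The frequencies returned here are only the insertion frequencies; the final per-task frequencies are chosen by the frequency choice phase.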
4.1.3 Frequency Choice Phase
Once this first phase finishes, each task is associated with a given processor
type. In the process, we actually assigned each task to a specific processor.
This last phase rearranges the tasks among the processors of each
homogeneous group and assigns them a final frequency. We consider each
homogeneous group of processors (Algorithm 2) and apply to it a
homogeneous scheduling technique as shown in Algorithm 3. We chose to
use the SpringS algorithm [8]. SpringS reorganizes the tasks among the
processors of a homogeneous subgroup and takes care of choosing the
frequencies. Applying the SpringS algorithm is only possible because the
processors of a homogeneous group have the same characteristics (frequencies
and cycles for each task). This algorithm starts with an existing schedule.
In our application, we start with the schedule found by the mapping phase
and we reset all the frequencies to the minimum frequency (Algorithm 3,
line 3). Then the SpringS algorithm, as its name indicates, behaves like a
spring. If the schedule is too long, it finds the task whose frequency
increase leads to the smallest energy increase (line 7)
and tries to reschedule part of the tasks. On the other hand, if the
schedule is too short, it may be possible to slow down a task to
save some energy (line 16).
After this phase, the variables xiz are final and define a feasible schedule
for ILP.
We observe that in the case of homogeneous scheduling, the mapping
phase can be skipped and our heuristic is reduced to the SpringS heuristic.
Algorithm 2 Frequency Choice Phase
∀k ∈ [1, p], Hk = {Pj ∈ PTk | j ∈ [1, m]}
for each homogeneous subgroup Hk do
    Apply the SpringS algorithm to Hk with the schedule found by the mapping phase.
end for
4.2 Improving the Heuristic
Although heuristics are fast, they may fail to find a solution, as
shown by the success rates in the experiments of Chapter 6. However, different
solutions are available to improve the success rate of the heuristic. The first
solution is to run a different heuristic in combination with our heuristic.
We will call this combination the hybrid heuristic. This heuristic takes the
better result of the LR-heuristic [15], which we will present later, and our
heuristic. As we will see in the experiments, not only does the hybrid
heuristic help improve the optimality but it also reduces the failure rate of
the scheduler. Since both heuristics are fast to execute, this can be an
interesting technique to try to improve a given schedule.
Algorithm 3 SpringS Algorithm. The operators argmin and argmax refer
respectively to the index of the minimum and of the maximum in an ordered
set.
1: Input: homogeneous subgroup Hk of processors of type PTk and initial schedule of tasks from the set T.
2: Output: optimized schedule of tasks from T on processors from Hk.
3: ∀Ti ∈ T, ∃z such that xiz = 1: set fz = min(Fk)
4: while true do
5:     j0 = argmax(Uj | Pj ∈ Hk)
6:     if Uj0 > 1 then
7:         Find the task Tref on Pj0 with the minimum energy increase when increasing its assigned frequency to the next step fref+1 if its frequency is not already the highest.
8:         Let R be the set of tasks whose current execution time is smaller than the execution time of Tref running at fref+1.
9:         for each task Ti in R in decreasing number of cycles order do
10:            j1 = argmin(Uj | Pj ∈ Hk)
11:            Assign Ti to Pj1 without changing its frequency if all the tasks assigned to Pj1 and not in R fit at maximum frequency. If not, return the schedule previously found.
12:            Remove Ti from R.
13:        end for
14:    else
15:        j1 = argmin(Uj | Pj ∈ Hk)
16:        On Pj1, find the task T with the minimum execution time increase when decreasing its frequency to the lower step.
17:        Decrease T’s frequency if possible. If not, return: the current schedule is the result of the heuristic.
18:    end if
19: end while
Additionally, it is clear that solving the scheduling problem for a
deadline tighter than the required constraint also satisfies the original
problem. For instance, if the original problem requires a constraint d, any
solution to the same problem with a new constraint βd with 0 < β < 1 is
also a solution to the original problem. Although the failure rate is higher
for tighter constraints, the heuristic algorithms are sensitive to small
changes in the constraint: in our algorithm, the deadline is involved in the
computation of the optimal insertion frequency. Therefore, if our heuristic
fails for the initial constraint, we can retry with a slightly tighter one and
may find a valid schedule. In the experiments, we tightened the
constraint by up to 20% of the original constraint (β = 0.80), with a new try
every 1%. This requires up to 20 runs of the heuristic and,
consequently, the heuristic with retry is up to 20 times slower. We based the
heuristic with retry on our heuristic. It would also be possible to base
it on the LR-heuristic, which might improve it, or even on the hybrid
heuristic, which would lead to even better results.
As a last resort, it is still possible to run the linear solver. However, this
might take a long time and for a reasonable time constraint this should not
be necessary.
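The retry strategy amounts to a simple loop around any scheduling heuristic. In the sketch below, `heuristic` is a hypothetical callable that takes a deadline and returns a schedule, or None on failure; the function name and default parameters are assumptions chosen to mirror the 1% steps and the 20% limit described above.

```python
def schedule_with_retry(heuristic, deadline, max_tightening=0.20, step=0.01):
    """Call `heuristic(d)` with progressively tighter deadlines d = beta * deadline.

    Returns (schedule, beta) for the first success, else (None, None).
    """
    retries = round(max_tightening / step)
    for t in range(retries + 1):          # t = 0 is the original deadline
        beta = 1.0 - t * step
        schedule = heuristic(beta * deadline)
        if schedule is not None:
            return schedule, beta
    return None, None
```

Any schedule returned this way meets the tightened deadline βd and therefore also the original deadline d.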
Chapter 5
OTHER APPROACHES
We compare our results with a greedy heuristic and a heuristic based on the
linear relaxation of the integer linear programming problem (LR-heuristic)
both presented in [15]. To be able to compare the algorithms in an absolute
fashion, we also search for the optimal solution. We present these different
approaches in the following sections.
5.1 Solving ILP
ILP is an NP-hard problem. For problems of small size, it can be solved by
exhaustive search. We search for a solution of this previously defined
integer linear programming problem by using lp_solve, an open-source
linear solver [19]. By using a branch-and-bound approach, lp_solve gives us
the optimal solution if it exists. For problems small enough, lp_solve will
find the solution in a reasonable amount of time, which allows us to
evaluate the optimality of the algorithms. Depending on the characteristics
of the input data, searching for the optimal solution to ILP may
nevertheless be slow even for small problem sizes.
5.2 The Greedy Heuristic
The Greedy heuristic picks the tasks one after another (the order does not
matter) and, for each task, considers all the possible processors and
frequencies, and chooses the task mapping with the smallest energy among
all the possibilities. The Greedy heuristic is presented in Algorithm 4.
We can easily see that an issue with the Greedy algorithm is that it will
tend to allocate the first tasks considered at a low frequency, so that the remaining
Algorithm 4 The Greedy heuristic
T = set of all tasks
while T is not empty do
    For each task Ti in T, find the pair Vz such that eiz is minimum and
    Ujz ≤ 1. Save the value as (Ti, Vz, eiz).
    Among all the triples, choose the one with the minimum energy eiz
    and map the corresponding Ti to the processor Pjz and frequency fz:
    xiz = 1.
    Remove this task from T.
end while
tasks might not fit anymore. Therefore, we can expect that the success rate
of Greedy will be low.
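The weakness of the Greedy heuristic can be reproduced on a tiny instance. The following sketch is one possible reading of Algorithm 4; the function name, the argument layout and the tie-breaking by task index are assumptions.

```python
def greedy_schedule(cycles, proc_type, freqs, power, deadline):
    """Sketch of the Greedy heuristic (Algorithm 4).

    Returns mapping[i] = (j, f), or None when some task no longer fits.
    """
    m = len(proc_type)
    util = [0.0] * m
    mapping = {}
    remaining = set(range(len(cycles)))
    while remaining:
        best = None                       # (energy, task, processor, frequency)
        for i in sorted(remaining):
            for j in range(m):
                for f in freqs[proc_type[j]]:
                    u = cycles[i][j] / (deadline * f)
                    if util[j] + u > 1.0:
                        continue          # would violate U_j <= 1
                    e = power(proc_type[j], f) * cycles[i][j] / f
                    if best is None or e < best[0]:
                        best = (e, i, j, f)
        if best is None:
            return None                   # no remaining task fits anywhere
        _, i, j, f = best
        mapping[i] = (j, f)
        util[j] += cycles[i][j] / (deadline * f)
        remaining.remove(i)
    return mapping


# One processor, two identical tasks of 3 cycles, deadline 4, cubic power.
stuck = greedy_schedule([[3], [3]], [0], [[1, 2]], lambda k, f: f ** 3, 4)
```

In this example Greedy places the first task at the lowest frequency, after which the second task cannot fit at any frequency, although running both tasks at frequency 2 would meet the deadline.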
5.3 The LR-heuristic
The LR-heuristic [15] uses properties of the linear relaxation of the
scheduling problem to iteratively map tasks to processors while reducing
the size of the problem at each step by removing the tasks that are already
mapped. The LR-heuristic considers the integer linear programming
problem ILP defined previously. However, in that problem, the variables xiz
are constrained to be binary. The LR-heuristic solves the more general
relaxed problem LP in which Equation 3.4 is changed into
xiz ∈ [0, 1] 1 ≤ i ≤ n, 1 ≤ z ≤ v (5.1)
The only difference with ILP is that the variables xiz are allowed to take
any value between 0 and 1. The LR-heuristic is then as described in
Algorithm 5.
Algorithm 5 The LR-heuristic
repeat
    Remove every variable xiz which, if set to 1, would make Ujz > 1.
    Solve the linear relaxation problem LP. As proved in [15], at least one
    variable xiz will be equal to 1, in spite of being able to take any value
    between 0 and 1.
    Fix all the variables xiz = 1 and remove them from the problem.
until all tasks are mapped or no feasible schedule is found
Chapter 6
EXPERIMENTS
To demonstrate the quality of our heuristic compared to the previously
described heuristics, we run a large number of experiments using
synthetic task sets. In this chapter, we first describe the experiment setup
and then discuss the results.
6.1 Experiment Setup
6.1.1 Task Set Generation
In order to generate a synthetic task set, we define two parameters, the task
heterogeneity τ and the architecture heterogeneity η as presented in [20]. τ
represents how different the cycle counts of the tasks will be on the same
platform. η controls how different this number will be between the
various platforms. For a given task i, we draw τi from a uniform
distribution U(1, τ) and for each processor type we draw ηi,k from a
uniform distribution U(1, η) and we set $C^{PT}_{ik} = \tau_i \eta_{i,k}$.
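The generation procedure can be sketched directly from this description. The function name, the `seed` parameter and the use of Python's standard `random` module are assumptions made for illustration.

```python
import random


def generate_task_set(n_tasks, n_types, tau, eta, seed=None):
    """Draw C^PT_ik = tau_i * eta_ik with tau_i ~ U(1, tau), eta_ik ~ U(1, eta)."""
    rng = random.Random(seed)
    task_set = []
    for _ in range(n_tasks):
        tau_i = rng.uniform(1, tau)       # one draw per task
        # One architecture factor per (task, processor type) pair.
        task_set.append([tau_i * rng.uniform(1, eta) for _ in range(n_types)])
    return task_set
```

Each row of the result holds the cycle counts of one task on every processor type; fixing the seed makes an experiment reproducible.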
In order to have realistic parameters for these task sets, we use numbers
from experiments on Intel Atom. For the five filters denoise, sharpen, color
correction, increase frame rate and up-scale from MJPEGTools, we
measured an average cycle number of $10^{10}$. Therefore, we chose $\tau = 10^5$ and
$\eta = 10^5$ for our experiments to achieve an average cycle number of $10^{10}$. In
Section 6.2.3, we will consider different values of τ and η. We generate task
sets of different sizes.
The limit to the task set size is set by the execution time of the linear
solver. We limit ourselves to 40 tasks in the following experiments.
6.1.2 Time Constraint
Once we have a synthetic task set, we want to generate a reasonable
deadline. The minimum execution time of a task i is the execution time of
this task on the processor best suited for this task and at the maximum
frequency on this processor:
\[ t_{i,\min} = \min_k \left( \frac{CPT_{ik}}{\max(F_k)} \right) \]
We define the tight time constraint as the sum of the minimum execution
time of the n tasks distributed among the m available processors:
\[ d_{tight} = \frac{\sum_{i=1}^{n} t_{i,\min}}{m} \]
Unless the tasks can be perfectly balanced across the processors, the tight
time constraint is impossible to meet. This is why we define the relaxed
time constraint:
\[ d = \alpha \cdot d_{tight} \]
where α is an input parameter of the experiment. By increasing α from 1, we
obtain a progressively looser constraint. In our experiments with
MJPEG Tools, a deadline of 30 frames per second translated to a value of
α = 1.5. In Section 6.2, we present results for different values of α:
1.1, 1.5 and 2.0.
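As a worked illustration of the three definitions above, the following Python sketch computes the relaxed deadline from a matrix of cycle counts. The function name `relaxed_deadline` and the toy numbers are assumptions chosen for the example, not values from our experiments.

```python
def relaxed_deadline(cpt, max_freq_hz, n_processors, alpha):
    """Compute d = alpha * d_tight.

    cpt[i][k]      : cycles of task i on processor type k
    max_freq_hz[k] : maximum frequency of processor type k, in Hz
    n_processors   : m, the number of available processors
    """
    # t_{i,min}: each task on its best-suited processor type at full speed.
    t_min = [min(c / f for c, f in zip(row, max_freq_hz)) for row in cpt]
    d_tight = sum(t_min) / n_processors
    return alpha * d_tight


# Two tasks, two processor types (2.4 GHz and 0.8 GHz), m = 2 processors.
toy_cpt = [[4.8e9, 0.8e9],   # task 0: 2.0 s on type 0, 1.0 s on type 1
           [2.4e9, 3.2e9]]   # task 1: 1.0 s on type 0, 4.0 s on type 1
d = relaxed_deadline(toy_cpt, [2.4e9, 0.8e9], n_processors=2, alpha=1.5)
```

Here both tasks have a minimum execution time of 1 s, so the tight constraint is 1 s and the relaxed deadline with α = 1.5 is 1.5 s.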
6.2 Experiment Results
In this section, we first compare the different heuristics for different
numbers of tasks and for different constraints. Then we consider different
ways of sorting the tasks and we study their impact on the optimality of
the schedule. We also study the sensitivity of the different heuristics to
different values of the task and architecture heterogeneities τ and η. In the
last subsection, we compare the execution time of the different algorithms.
Figure 6.1: Fraction of feasible schedules for the linear solver with the first
hardware configuration. This shows what percentage of the generated task
sets leads to feasible schedules. If the linear solver fails, it means that there
is no feasible schedule for this task set.
6.2.1 Energy Savings and Success Rate
For each experiment, we generate one thousand synthetic task sets. Not
every task set allows a feasible schedule but the linear solver will always
find the optimal schedule if there is one. Running the linear solver first tells
us if there is a feasible schedule and, if there is one, gives us the optimal
energy. If there is no feasible schedule, we discard the task set. Figure 6.1
reports the percentage of feasible schedules over the thousand task sets
generated for different values of α and different numbers of tasks.
We measure the success rate of each heuristic as the ratio of the number
of schedules it finds to the number of feasible schedules.
When it succeeds, a heuristic returns a schedule and the associated
energy. We measure the optimality of each heuristic by the ratio of
the energy it finds to the optimal energy. We call this measure the error
to optimal.
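The two metrics can be computed as in the following Python sketch. It is an illustration under one stated assumption: the error to optimal is reported as the relative excess (E - E_opt) / E_opt, that is, the energy ratio minus one, which is consistent with the percentages quoted in this chapter but is not spelled out in the text. The function name `evaluate` and the sample energies are hypothetical.

```python
def evaluate(heuristic_energy, optimal_energy):
    """Return (success_rate, mean_error_to_optimal) over a batch of task sets.

    heuristic_energy[i] is None when the heuristic found no schedule.
    optimal_energy[i]   is None when the task set has no feasible schedule.
    """
    # Task sets without a feasible schedule are discarded.
    feasible = [(h, o) for h, o in zip(heuristic_energy, optimal_energy)
                if o is not None]
    found = [(h, o) for h, o in feasible if h is not None]
    success_rate = len(found) / len(feasible) if feasible else float("nan")
    # Assumed definition: relative excess energy over the optimum.
    errors = [(h - o) / o for h, o in found]
    mean_error = sum(errors) / len(errors) if errors else float("nan")
    return success_rate, mean_error


# Four task sets: one infeasible, one where the heuristic fails.
rate, err = evaluate([105.0, None, 100.0, 51.0],
                     [100.0, 80.0, None, 50.0])
```

In this toy batch, three task sets are feasible and the heuristic solves two of them, with errors of 5% and 2%.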
Two CPUs and one GPU. We consider a first hardware configuration
composed of three processing units. Two of the processing units follow the
power function and the available frequencies of an Atom CPU as described
in Table 6.1a. The other processing unit has the power characteristics of a
GPU with only one power state of 344 mW at 800 MHz as shown in
Table 6.1b.
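To give a feel for these power states, the following Python sketch converts the entries of Table 6.1 into energy figures. It assumes the usual model in which the energy of a task is the power of the chosen state multiplied by the execution time, with the execution time equal to the cycle count divided by the frequency; this model and the helper names are assumptions of the sketch.

```python
# Power tables transcribed from Table 6.1 (frequency in GHz -> power in mW).
ATOM_POWER_MW = {0.8: 240, 1.0: 300, 1.2: 360, 1.4: 750,
                 1.6: 1100, 1.8: 1620, 2.0: 2160, 2.4: 3240}
GPU_POWER_MW = {0.8: 344}


def energy_joules(cycles, freq_ghz, power_mw):
    """Energy to run `cycles` cycles in one power state (assumed E = P * t)."""
    time_s = cycles / (freq_ghz * 1e9)
    return (power_mw / 1000.0) * time_s


def energy_per_cycle_pj(power_table):
    """mW divided by GHz gives picojoules per cycle."""
    return {f: p / f for f, p in power_table.items()}


low = energy_joules(1e10, 0.8, ATOM_POWER_MW[0.8])   # slowest Atom state
high = energy_joules(1e10, 2.4, ATOM_POWER_MW[2.4])  # fastest Atom state
```

Under this model, a task of $10^{10}$ cycles costs about 3 J at 0.8 GHz and about 13.5 J at 2.4 GHz on the Atom-like unit. The three lowest Atom states all cost about 300 pJ per cycle, so slowing down below 1.2 GHz lengthens the execution without saving energy.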
The first set of experiments considers α = 1.1 with this first hardware
configuration. Such a value for α leads to a very tight constraint. The
number of feasible schedules is very low and is close to 0% for more than 20
(a) α = 1.1
(b) α = 1.5
(c) α = 2
Figure 6.2: Error of the heuristics for different values of α. Two CPUs and
one GPU. The horizontal axis is the number of tasks. The vertical axis
represents how far from the optimal the heuristics are. The optimal is
computed with the linear solver.
(a) α = 1.1
(b) α = 1.5
(c) α = 2
Figure 6.3: Success rate of the heuristics. Two CPUs and one GPU. A
success rate of 100% means that the heuristic found a schedule each time
the linear solver found one.
Frequency (GHz)   0.8   1.0   1.2   1.4   1.6    1.8    2.0    2.4
Power (mW)        240   300   360   750   1100   1620   2160   3240
(a) Atom CPU power function
Frequency (GHz)   0.8
Power (mW)        344
(b) GPU power function
Table 6.1: Power functions of the processing units
tasks, as seen in Figure 6.1. Therefore, we only present results up to 20
tasks for these experiments. The success rate of all the heuristics is quite
low but, as shown in Figure 6.3a, our original heuristic outperforms the
other heuristics, especially for 10 and 20 tasks. The Greedy heuristic finds
only a few schedules. Although the success rates are low, the heuristics
perform well when they do find a schedule, with an average error under
5% of the optimal. Figure 6.2a plots the error to optimal as a function of
the number of tasks.
For α = 1.5, the success rate of the heuristics, plotted in Figure 6.3b,
improves drastically. The fraction of feasible schedules is still close to 0%
for more than 20 tasks. The Greedy heuristic is still lagging behind, with
a success rate below 30% for 5 tasks and almost 0% for 10 or more tasks.
The LR-heuristic has a decent success rate, above 70%, but all our
heuristics outperform it with a success rate close to 100%.
Figure 6.2c presents the results for α = 2. Our heuristic is very close to
the optimal, within 3% on average, and outperforms the LR-heuristic and the
Greedy heuristic. For a small number of tasks, the Greedy heuristic and the
LR-heuristic do not perform very well. As shown in Figure 6.3c,
the success rate of our heuristic is 100%, whereas the Greedy heuristic
fails to find a schedule 20% of the time and, for small task sets, the
LR-heuristic fails to find a feasible schedule about 10% of the time.
The hybrid heuristic reduces the error most of the time.
The only exception in our results is for α = 1.1 and 20 tasks: there, the
success rate of the LR-heuristic is so low that the average error of the
hybrid heuristic is close to that of our heuristic, which is slightly
higher. The hybrid heuristic also helps improve the success rate. This is
particularly useful for tight constraints, as in Figure 6.3a, where
heuristics are more likely to fail. Finally,
Figure 6.4: Four CPUs and two GPUs with α = 2. Error of the heuristics.
The horizontal axis is the number of tasks. The vertical axis represents how
far from the optimal the heuristics are. The optimal is computed with the
linear solver.
the heuristic with retry helps improve the success rate for the very tight
schedules generated with α = 1.1.
Four CPUs and two GPUs. We also consider a second hardware
configuration, composed of four Atom-like and two GPU-like processing
units with the power characteristics presented in Tables 6.1a and 6.1b.
Figure 6.4 and Figure 6.5 present the results for α = 2. Our heuristic
stays under 2.5% error to optimal for all the task counts considered. The
Greedy heuristic and the LR-heuristic improve substantially as the number
of tasks increases, thanks to the greater scheduling freedom that finer
granularity allows. For α = 1.1 and α = 1.5, only a small fraction of the
1000 generated task sets admitted a feasible schedule. Comparing the
algorithms over such a small number of schedules would not be significant
enough.
6.2.2 Sorting by Heterogeneity vs. Sorting by Size
As discussed in Section 4.1, a key point in our heuristic is sorting tasks by
decreasing heterogeneity at the beginning of the mapping phase. An
alternative would be to sort the tasks by decreasing size, so that the
largest tasks, which could be the most difficult to fit in the schedule,
are mapped first; this is still better than picking tasks in a random order or
Figure 6.5: Four CPUs and two GPUs with α = 2. Success rate of the
heuristics. A success rate of 100% means that the heuristic found a
schedule each time the linear solver found one.
Figure 6.6: Error to optimal when sorting respectively by decreasing
heterogeneity, by decreasing cycles, not sorting at all or sorting by
increasing heterogeneity with the first hardware configuration. The y-axis
shows the average error to the optimal for 1000 experiments with α = 2.
sorting them by increasing heterogeneity. Figure 6.6 shows the importance
of sorting tasks by heterogeneity instead of sorting them by decreasing size.
The error to optimal is divided by more than 2 in most cases. The success
rate for all experiments presented in this figure is greater than 99%.
6.2.3 Sensitivity Analysis
In this subsection, we analyze the sensitivity of the heuristics to the task
and architecture heterogeneities. We chose the first hardware configuration
and let the task heterogeneity τ and the architecture heterogeneity η each
vary between 10 and $10^9$. Figures 6.7a and 6.7b present the results.
(a) Sensitivity to task heterogeneity τ: error to optimal as a
function of τ
(b) Sensitivity to architecture heterogeneity η: error to
optimal as a function of η
Figure 6.7: Error to optimal of the different heuristics as a function of the
task and architecture heterogeneity.
Our different heuristics are insensitive to variations over the whole range
of heterogeneity considered. On the other hand, both the LR-heuristic and
the Greedy heuristic are sensitive to changes in η and τ; in Figure 6.7b, we
see that the Greedy heuristic and the LR-heuristic are negatively impacted
by an increase in architecture heterogeneity.
6.2.4 Execution Time of the Heuristic
A fast scheduler makes it possible to test many different configurations in
a short time. It is also better suited to online scheduling, especially on
real-time systems, since the scheduling itself consumes time and power.
Finally, if the scheduler
Figure 6.8: Comparison of the execution time in milliseconds as a function
of the number of tasks for scheduling on the first hardware configuration
with α = 2. The curves for the heuristic and the heuristic with retry are
almost the same because the retry only happens when no schedule is found.
#tasks            5      10      20       30        40
Exhaustive (ms)   8.05   88.27   13,000   172,470   420,090
Table 6.2: Execution time in milliseconds of the linear solver as a function
of the number of tasks.
is embedded in a compiler, it will allow shorter compilation times.
Our heuristic was implemented in C++ without specific performance
optimizations. We also reimplemented in C++ the LR-heuristic and the
Greedy heuristic presented in [15]. We used the lp_solve library [19] to
solve the linear programming problems, both for the optimal case and for
the LR-heuristic. For the optimal case, we simply make a call to the
library, whereas for the LR-heuristic we wrap the call to lp_solve in some
code controlling the different iterations of Algorithm 5. We ran 100 serial
experiments on an Intel Core i7 machine, timed each experiment with
gettimeofday(), and present in Figure 6.8 the average running time of
each algorithm in milliseconds. We report the exhaustive search numbers
separately in Table 6.2 because of the huge difference in execution time.
Figure 6.8 shows that all the heuristics run very fast compared to the
linear solver, whose average execution time grows considerably with the
number of tasks. For many instances of the problem, even with a large
number of tasks, the exhaustive search finishes quickly, depending on how
many branches the branch-and-bound approach of the linear solver can
eliminate given the constraints. The average is nevertheless very high
because some instances of the problem take an extremely long time to
explore. In the experiment presented in Figure 6.8, the heuristic with
retry performs almost the same as the heuristic because, α = 2 being a
loose enough constraint, there is almost no need for retry. The execution
time of the hybrid heuristic is higher: it is exactly the sum of the
execution times of our heuristic and the LR-heuristic.
Chapter 7
CONCLUSION
In this thesis, we explored the scheduling of real-time tasks on a
heterogeneous platform with energy minimization as a goal. Our heuristic
offers a high success rate and improves significantly on state-of-the-art
heuristics, especially for small task sets. Our algorithm gives stable
results across task sets and platforms of different heterogeneities. We
also underlined the importance of the order in which tasks are selected for
scheduling, and we have shown how to further improve the results by
combining two heuristics and how to improve the success rate by
exploiting the sensitivity of the scheduler to the tightness of the constraint.
BIBLIOGRAPHY
[1] W. Thies, M. Karczmarek, M. Gordon, D. Z. Maze, J. Wong,
H. Hoffman, M. Brown, and S. Amarasinghe, “Streamit: A compiler for
streaming applications,” Massachusetts Institute of Technology,
Cambridge, MA, Technical Report MIT/LCS Technical Memo
LCS-TM-622, Dec 2001.
[2] P. Pillai and K. G. Shin, “Real-time dynamic voltage scaling for
low-power embedded operating systems,” in SOSP ’01: Proceedings of
the eighteenth ACM symposium on Operating systems principles. New
York, NY, USA: ACM, 2001, pp. 89–102.
[3] W. Yuan and K. Nahrstedt, “Energy-efficient soft real-time cpu
scheduling for mobile multimedia systems,” in SOSP ’03: Proceedings
of the nineteenth ACM symposium on Operating systems principles.
New York, NY, USA: ACM, 2003, pp. 149–163.
[4] C.-H. Hsu and U. Kremer, “The design, implementation, and
evaluation of a compiler algorithm for cpu energy reduction,” in PLDI
’03: Proceedings of the ACM SIGPLAN 2003 conference on
Programming language design and implementation. New York, NY,
USA: ACM, 2003, pp. 38–48.
[5] C. J. Hughes, J. Srinivasan, and S. V. Adve, “Saving energy with
architectural and frequency adaptations for multimedia applications,”
in MICRO 34: Proceedings of the 34th annual ACM/IEEE
international symposium on Microarchitecture. Washington, DC,
USA: IEEE Computer Society, 2001, pp. 250–261.
[6] J. H. Anderson and S. K. Baruah, “Energy-efficient synthesis of
periodic task systems upon identical multiprocessor platforms,” in
ICDCS ’04: Proceedings of the 24th International Conference on
Distributed Computing Systems (ICDCS’04). Washington, DC, USA:
IEEE Computer Society, 2004, pp. 428–435.
[7] J. Li and J. F. Martínez, “Power-performance considerations of parallel
computing on chip multiprocessors,” ACM Trans. Archit. Code Optim.,
vol. 2, no. 4, pp. 397–422, 2005.
[8] H. Liu, Z. Shao, M. Wang, and P. Chen, “Overhead-aware system-level
joint energy and performance optimization for streaming applications
on multiprocessor systems-on-chip,” in ECRTS ’08: Proceedings of the
2008 Euromicro Conference on Real-Time Systems. Washington, DC,
USA: IEEE Computer Society, 2008, pp. 92–101.
[9] J. Henkel and Y. Li, “Energy-conscious hw/sw-partitioning of
embedded systems: a case study on an mpeg-2 encoder,” in
CODES/CASHE ’98: Proceedings of the 6th international workshop on
Hardware/software codesign. Washington, DC, USA: IEEE Computer
Society, 1998, pp. 23–27.
[10] E. T.-H. Chu, T.-Y. Huang, and Y.-C. Tsai, “An optimal solution for
the heterogeneous multiprocessor single-level voltage-setup problem,”
Trans. Comp.-Aided Des. Integ. Cir. Sys., vol. 28, no. 11, pp.
1705–1718, 2009.
[11] T.-Y. Huang, Y.-C. Tsai, and E. T.-H. Chu, “A near-optimal solution
for the heterogeneous multi-processor single-level voltage setup
problem,” in IPDPS, 2007, pp. 1–10.
[12] J. Luo and N. K. Jha, “Power-efficient scheduling for heterogeneous
distributed real-time embedded systems,” IEEE Trans. on CAD of
Integrated Circuits and Systems, vol. 26, no. 6, pp. 1161–1170, 2007.
[13] C.-Y. Yang, J.-J. Chen, T.-W. Kuo, and L. Thiele, “An approximation
scheme for energy-efficient scheduling of real-time tasks in
heterogeneous multiprocessor systems,” in DATE, 2009, pp. 694–699.
[14] T. Ishihara and H. Yasuura, “Voltage scheduling problem for
dynamically variable voltage processors,” in ISLPED ’98: Proceedings
of the 1998 international symposium on Low power electronics and
design. New York, NY, USA: ACM, 1998, pp. 197–202.
[15] Y. Yu and V. K. Prasanna, “Power-aware resource allocation for
independent tasks in heterogeneous real-time systems,” in ICPADS
’02: Proceedings of the 9th International Conference on Parallel and
Distributed Systems. Washington, DC, USA: IEEE Computer Society,
2002, p. 341.
[16] J.-J. Chen, C.-Y. Yang, T.-W. Kuo, and C.-S. Shih, “Energy-efficient
real-time task scheduling in multiprocessor dvs systems,” Asia and
South Pacific Design Automation Conference, vol. 0, pp. 342–349,
2007.
[17] Mjpeg tools. [Online]. Available: http://mjpeg.sourceforge.net/
[18] H. Aydin, P. Mejía-Alvarez, D. Mossé, and R. Melhem, “Dynamic and
aggressive scheduling techniques for power-aware real-time systems,” in
RTSS ’01: Proceedings of the 22nd IEEE Real-Time Systems
Symposium. Washington, DC, USA: IEEE Computer Society, 2001,
p. 95.
[19] M. Berkelaar, K. Eikland, and P. Notebaert. lp_solve. [Online]. Available:
http://lpsolve.sourceforge.net/5.5/
[20] S. Ali, H. J. Siegel, M. Maheswaran, S. Ali, and D. Hensgen, “Task
execution time modeling for heterogeneous computing systems,” in
HCW ’00: Proceedings of the 9th Heterogeneous Computing Workshop.
Washington, DC, USA: IEEE Computer Society, 2000, p. 185.
[21] V. A. Korthikanti and G. Agha, “Analysis of parallel algorithms for
energy conservation in scalable multicore architectures,” in ICPP ’09:
Proceedings of the 2009 International Conference on Parallel
Processing. Washington, DC, USA: IEEE Computer Society, 2009,
pp. 212–219.
[22] P. Gai, G. Lipari, and M. D. Natale, “Minimizing memory utilization
of real-time task sets in single and multi-processor systems-on-a-chip,”
in RTSS ’01: Proceedings of the 22nd IEEE Real-Time Systems
Symposium. Washington, DC, USA: IEEE Computer Society, 2001,
p. 73.
[23] F. Liberato, S. Lauzac, R. Melhem, and D. Mosse, “Fault tolerant
real-time global scheduling on multiprocessors,” in Proc. of the 10th
IEEE Euromicro Workshop on Real-Time Systems, 1999.
[24] H. Aydin and Q. Yang, “Energy-aware partitioning for multiprocessor
real-time systems,” in IPDPS ’03: Proceedings of the 17th
International Symposium on Parallel and Distributed Processing.
Washington, DC, USA: IEEE Computer Society, 2003, p. 113.2.
[25] M. A. Trick, “A linear relaxation heuristic for the generalized
assignment problem,” Naval Research Logistics, vol. 39, pp. 137–151,
1992.
