Dynamic scheduling techniques for adaptive applications on real-time embedded systems by YU HENG
Dynamic Scheduling Techniques for Adaptive
Applications on Real-Time Embedded Systems
Yu Heng
(B.Eng, National University of Singapore, Singapore, 2006 )
A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE
REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY
DEPARTMENT OF ELECTRICAL AND COMPUTER ENGINEERING




Acknowledgements
This thesis would not have had the opportunity to progress and take shape without
the enduring guidance, cooperation, companionship, and encouragement of my
supervisors, colleagues, and family. I wish to express my gratitude to all of
them.
First of all, I would like to sincerely thank my supervisors, Prof. Ha Yajun
and Prof. Bharadwaj Veeravalli, for all their devoted support during my doctoral
studies. I am grateful that they opened the door to scientific exploration for me,
that they provided timely and valuable advice whenever there were obstacles ahead,
and that they enlightened me with their insights into life the way a role model
does. I will not forget the times they arrived before sunrise to help me revise a
paper before its submission. I could not be luckier than to have both of them as
my supervisors.
I would like to acknowledge the help of Dr. Zhu Guolei and Dr. Akash
Kumar for the discussions on key concepts in the NoC-related work. I am
deeply grateful to Dr. Wei Ying for introducing me to the LaTeX world
and for the encouragement during the hard times.
I appreciate the support from the smiling ladies in the Electronic Design Labs
on my GA duties, as well as the mutual assistance from Zhang Wenjuan, Chen
Xiaolei, and Ganesh Iyer.
I am lucky to have spent my best years in the VLSI Laboratory with all my fellow
labmates, and I thank them for the fun and the memories.
I cannot adequately express my love for my parents. They are where warmth and
encouragement originate. To them, this thesis is dedicated.
Abstract
The ability to trade off Quality-of-Service (QoS) against resources on modern
embedded platforms makes adaptive applications an interesting value proposition.
Applying dynamic scheduling to such applications brings further flexibility
in meeting the overall system’s performance goals. However, state-of-the-art
dynamic scheduling strategies are, in general, either incapable of QoS optimization
or ignore the growing platform-introduced impacts that may substantially
degrade the scheduling performance.
This thesis focuses on the design of dynamic scheduling algorithms for adaptive
applications, with the goal of maximizing QoS through runtime slack reclamation
and redistribution. For QoS modeling, both the Imprecise-Computation
(IC) model [1] and a proposed generic model are studied and validated. The
algorithms are built upon increasingly complex assumptions, namely scheduling
(1) IC-modeled tasks on uniprocessor systems, (2) dependent IC-modeled tasks
on homogeneous multiprocessors, and (3) tasks following a generic QoS model on
heterogeneous multiprocessors, considering the leakage energy and the QoS
deterioration due to inter-processor communications.
First, a dynamic algorithm for scheduling IC tasks mapped on a single processor
is presented. We prove that QoS maximization can be achieved by
employing intra-task Dynamic Voltage Scaling (DVS). The derived theorem
leads to a convenient selection of the slack receiver by comparing the QoS
gradients of the IC-modeled receivers. A Gradient Curve Shifting (GCS) approach is
proposed to make the theorem applicable to both linear and concave QoS models.
Second, we extend the approach to scheduling IC tasks on homogeneous multiprocessors.
Although it is possible to apply the uniprocessor algorithm and dedicate the whole
slack to only one receiver, we consider all parallel receivers in multiprocessors and
derive an optimal slack distribution strategy that outperforms the
uniprocessor-based algorithm. Beyond that, a heuristic slack receiver selection
strategy is presented to select the receiver set that potentially produces the
maximal QoS.
Third, we extend the idealized IC model by proposing a more practical generic
QoS model, and present a dynamic scheduling algorithm targeting heterogeneous
multiprocessors, where each processor has its individual frequency and energy
characteristics. We propose a Guided-Search algorithm that efficiently determines the
receiver execution speed in order to maximize QoS for the generic
model. A new receiver selection methodology is also designed for the generic
model. Moreover, we report an enhancement of the scheduling performance that
accounts for slack losses due to inter-processor communications.
Finally, to make our work self-contained, we develop a static scheduling algorithm
targeting inter-processor communications on Network-on-Chip (NoC) architectures.
While our dynamic approaches can adopt the results of any static scheduler,
the proposed method is a unified approach that optimally determines the
computation element mapping, the communication path selection, and the
execution time scheduling.
We support our proposed algorithms by evaluating their performance in scheduling
numerous synthesized task sets and realistic adaptive applications. The evalu-







Contents
List of Figures
List of Tables
1 Introduction
1.1 Motivation
1.2 Thesis Contributions
1.3 List of Publications
1.4 Organization of the Thesis
2 Related Work
2.1 Adaptive Applications
2.2 Application Scheduling Techniques
2.2.1 Real-Time Scheduling
2.2.2 Energy-Aware Scheduling
2.2.3 Scheduling for Adaptive Applications
2.3 NoC-Aware Scheduling and Mapping
3 System Modeling and Problem Formulation
3.1 Architectural and Energy Model
3.2 Application Model
3.3 Problem Definition
4 Scheduling Imprecise Computation Tasks on a Single Processor
4.1 Static Scheduling Strategy
4.2 Dynamic Slack Reclamation without DVS
4.2.1 Slack allocation for linear QoS functions
4.2.2 Slack allocation for concave QoS functions
4.3 Dynamic Slack Reclamation under DVS
4.3.1 Deciding maximal optional cycles
4.3.2 Allotting optional cycles
4.4 Results and Discussion
5 Scheduling Imprecise Computation Tasks on Multiprocessors
5.1 Motivational Example
5.2 Slack Distribution Optimality Analysis
5.3 Slack Receiver Selection
5.3.1 Task grouping
5.3.2 Receiver selections in FCS and PCS
5.3.3 Online distribution
5.4 Results and Discussion
6 Scheduling Generic Models on Multiprocessors with Realistic Considerations
6.1 Motivational Example
6.2 Slack Distribution with Frequency Scaling
6.2.1 Optimization
6.2.2 Guided-Search heuristic
6.3 Slack Receiver Selection
6.3.1 Graph decomposition
6.3.2 Receiver selection from FCS
6.3.3 Receiver selection from PCS
6.3.4 Runtime receiver selection
6.3.5 Implication to static scheduling
6.4 Slack Distribution Considering Inter-Processor Communication
6.5 Results and Discussion
6.5.1 Setups
6.5.2 Synthesized task simulation
6.5.3 The JPEG2000 decoder
6.5.4 Considering communication variation
7 Supplement: A Communication-Aware Static Scheduling Approach
7.1 Preliminaries
7.2 Algorithm Description
7.3 Results and Discussion






List of Figures
1.1 A JPEG2000 decoded image using (a) resolution = 3; (b) resolution = 1.
1.2 Aircraft pitch performance for controller task level 2 and 4.
1.3 Scope of the thesis.
3.1 Typical gate leakage behavior of Intel 45nm HK+MG transistors,
compared to 65nm Poly/SiON transistors [51].
4.1 (a) S within S’. Allocating S to i gives the maximal QoS. (b) Left
shifting i by S cycles.
4.2 (a) S larger than S’. S cannot be fully allocated to i. (b) Shifting i
by S’ so that curves i’ and j intercept at y-axis. (c) Shifting j by Sj,
i’ by Si, simultaneously.
4.3 The Energy-Time space.
4.4 Normalized dynamic QoS vs. no. of tasks.
4.5 Effects of no DVS applicable to GCS and optimal solutions.
4.6 Energy and time utilization of the three algorithms.
5.1 Framework of multiprocessor dynamic scheduling for IC tasks.
5.2 (a) Illustrative example where ② distributes slack. (b) Slack
distribution results on ④, where S is used to generate Δo4. Note that all
tasks in (a) are IC-modeled, thus are divided into mandatory and
optional parts, e.g. m4 and o4. For clarity purpose, this is not shown
in (a).
5.3 (a) Graph decomposition illustration for ⓐ. Note that the link
between ⓓ and ⓙ is omitted due to precedence redundancy. Same
as ⓔ and ⓜ. (b) A task can belong to PCS or FCS of different
slack generators.
5.4 An example showing runtime slack time uncertainty for PCS, S = τs.
5.5 QoS increase in percentage compared to static scheduled cycles, with
varied slack factors (SF): (a) SF = 0.1, (b) SF = 0.5, (c) SF = 0.9.
5.6 QoS increase percentage vs. number of processors. Number of tasks
= 60, SF = 0.6.
5.7 Algorithm efficiency comparison, our approach vs. MLSSR,
measured as the number of instructions.
6.1 Illustrative example showing DVS effect to increase extra cycles.
6.2 (a) Task d prevents c from receiving the full slack. (b) b and d
compete for the slack time, while d might have more residual cycles.
6.3 (a) Total slack time is 110 since ⓐ blocks ⓒ and ⓓ. (b) Total slack
time gained is 150.
6.4 (a) Graph decomposition illustration for ⓐ. Note that the link
between ⓓ and ⓙ is omitted due to precedence redundancy. Same
as ⓔ and ⓜ. (b) A task can belong to PCS or FCS of different
slack generators.
6.5 (a) The FCS that fully adopts τs. (b) The resulted graph after
transformation: all precedence tasks are connected. (c) A coloring
example that minimally uses three colors to identify the grouping of
tasks.
6.6 The slack received for PCS tasks depends on the online execution
status. (a) τs,e = 0. (b) τs,e = MIN(τs, tl).
6.7 (a) An FC selection instance by applying graph coloring, with their
runtime residual cycles. (b) The final FC 2 optimized by applying
Algorithm 6.4.
6.8 (a) A static DAG mapping on a 6-processor system in favor of
dynamic cycle generation. (b) A static mapping creating PCS nodes,
not preferred for dynamic scheduling.
6.9 The experiment tool set.
6.10 Normalized cycle gain on (a) 8, (b) 32, (c) 64 processors using three
methods.
6.11 Scheduler cycles compared with a typical synthesized task.
6.12 Cycle difference between w/ and w/o local scaling, vs. Gaussian
distribution variances in generating traffic time.
6.13 Performance of Algorithm 6.5 under different NoC routing schemes,
on various network size. (a) 3×4, (b) 4×6, (c) 5×6, (d) 6×6.
6.14 Efficiency of Algorithm 6.5 compared to the iterative approach.
7.1 A transmission scenario to illustrate the hierarchical definitions.
Γ(Φ(j), φ(i)) = {γ1(Φ(j), φ(i)), γ2(Φ(j), φ(i))} is the set of two routes
of routing {j1, j2} to i. The route γ1(Φ(j), φ(i)) = {p1,1, p1,2} is one
way of routing by using path p1,1 to connect φ(j1) and φ(i), while
using path p1,2 to connect φ(j2) and φ(i). γ2(Φ(j), φ(i)) = {p2,1, p2,2}
represents another route. Each path px,y from φ(jα), α = 1 or 2, to φ(i)
consists of two links.
7.2 Simulation results of averaged makespan on the three applications
by applying the three algorithms.
7.3 Simulation results of average transmission time on a 3×3 mesh using
3 algorithms on 3 applications.
List of Tables
1.1 QoS levels and timing requirements for Controller. P = primary, S
= secondary.
3.1 Frequency and energy-per-cycle relationship.
5.1 Task attributes in Fig. 5.2: static scheduled time, immediate parent
nodes, and ki.
6.1 List of frequencies and the corresponding energy-per-cycle.
6.2 Frequency and energy-per-cycle relationship of the experimental
processor.
6.3 DWT cycles to transform different levels of resolution.
6.4 Performance from scheduling a JPEG2000 decoder.
7.1 Facts about applications. Critical path is the longest execution path
in the task graph, no transmission delay. Level of parallelism is the





Chapter 1. Introduction
Advancements in silicon processing, integrated circuit (IC) design, and electronic
design automation (EDA) technologies continuously drive drastic performance
improvements in embedded computing systems. The complexity of the applications that
an embedded platform can handle increases as well. Definitions of application
execution performance have been extended from “hard” parameters, such as memory
utilization, energy consumption, and application response time, to the “soft” behaviors
of application execution that emphasize the execution Quality-of-Service (QoS). For
instance, the question of “at which quality level the video could be rendered to the
viewer” becomes a concern once the transmission reliability is ensured.
In view of this, adaptive applications are gaining growing attention owing
to their capability to provide scalable execution quality in reaction to the
execution environment. Rather than simply completing or failing the execution,
adaptive applications usually define multiple execution granularities such that a
finer-grained version produces better QoS, at the price of increased program cycles
and energy. This feature makes them promising as real-time embedded applications:
they provide tunable parameters to cope with the unpredictable execution environment,
intelligently reducing the service level when the system is overloaded, or boosting
the software performance when system resources are under-utilized.
One area where quality adaptation is applied is multimedia. For example,
the Scalable Video Coding (SVC) scheme in the H.264/MPEG-4 AVC standard is
proposed to provide customized QoS to accommodate varying network conditions and
device qualities [2]. Another concrete example is the JPEG2000 codec, which supports
multiple playback resolutions [3]. The JPEG2000 decoder allows the reconstruction
of images in a progressive manner. This is made possible by the Discrete Wavelet
Transform (DWT), which encodes an image into multiple subbands so that a lower
frequency subband contains a finer frequency resolution and a coarser time
resolution. At the decoder, as more data are received, higher resolution images can be
decoded by making use of the higher frequency information. Fig. 1.1 illustrates the
effects of image decoding using different resolution settings.
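The progressive-reconstruction idea can be sketched with a one-dimensional Haar transform. This is only an illustration under stated assumptions: JPEG2000 itself uses 5/3 and 9/7 wavelet filters on two-dimensional tiles, and the function names and sample values below are hypothetical, not taken from the thesis or from any codec library.

```python
def haar_step(signal):
    """One level of an unnormalized Haar transform: pairwise averages and differences."""
    approx = [(signal[i] + signal[i + 1]) / 2 for i in range(0, len(signal), 2)]
    detail = [(signal[i] - signal[i + 1]) / 2 for i in range(0, len(signal), 2)]
    return approx, detail


def haar_decompose(signal, levels):
    """Return the coarsest approximation and the detail subbands, coarsest first."""
    details = []
    approx = list(signal)
    for _ in range(levels):
        approx, detail = haar_step(approx)
        details.insert(0, detail)
    return approx, details


def haar_reconstruct(approx, details, use_levels):
    """Rebuild the signal from the first `use_levels` detail subbands only.

    Subbands that have not been "received" are treated as zero, which gives a
    coarser (blockier) reconstruction of the same length.
    """
    current = list(approx)
    for level, detail in enumerate(details):
        if level >= use_levels:
            detail = [0.0] * len(detail)
        rebuilt = []
        for a, d in zip(current, detail):
            rebuilt.extend([a + d, a - d])
        current = rebuilt
    return current


row = [10, 12, 20, 24, 8, 8, 2, 6]          # hypothetical row of pixel values
approx, details = haar_decompose(row, levels=3)
coarse = haar_reconstruct(approx, details, use_levels=1)  # low resolution
full = haar_reconstruct(approx, details, use_levels=3)    # all subbands used
print(coarse)
print(full)
```

Using more detail subbands costs more decoding work but yields a result closer to the original row, which is the quality-versus-cycles trade-off that makes such a decoder an adaptive application.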
Beyond multimedia applications, Fig. 1.2 and Table 1.1, excerpted from [4],
illustrate the application of an adaptive controller on an Aerial
Combat F-16 flight simulator, as well as the required CPU resources (timing). The
controller is able to command the flight behaviors at two quality levels: with the
primary actuator commands only (elevator, ailerons, rudder, and throttle), or
together with the secondary set of actuators that further improves the flight
performance. The secondary actuators include the F-16’s afterburner for extra engine
thrust, as well as wing flaps and a speed brake used to enhance the slow-airspeed
control.





Fig. 1.1: A JPEG2000 decoded image using (a) resolution = 3; (b) resolution = 1.
Table 1.1: QoS levels and timing requirements for Controller. P = primary, S = secondary.
Level   Reward   Exec Time (ms)   Period (s)   Version
1       100      60               1            P only
2       104      80               1            P + S
3       120      60               0.2          P only
4       124      80               0.2          P + S
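As a worked reading of Table 1.1, the CPU utilization of each level is its execution time divided by its period. The sketch below computes this and picks the highest-reward level that fits a given CPU share. The helper names and the idea of a single `cpu_budget` fraction are assumptions made for illustration; they are not the selection procedure of [4] or of this thesis.

```python
# Rows of Table 1.1: (level, reward, exec_time_ms, period_ms, version).
# The period column is converted from seconds to milliseconds here.
LEVELS = [
    (1, 100, 60, 1000, "P only"),
    (2, 104, 80, 1000, "P + S"),
    (3, 120, 60, 200, "P only"),
    (4, 124, 80, 200, "P + S"),
]


def utilization(exec_time_ms, period_ms):
    """Fraction of the CPU consumed by one periodic controller task."""
    return exec_time_ms / period_ms


def best_level(cpu_budget):
    """Highest-reward row whose utilization fits within cpu_budget, or None."""
    feasible = [row for row in LEVELS if utilization(row[2], row[3]) <= cpu_budget]
    return max(feasible, key=lambda row: row[1], default=None)


for level, reward, exec_ms, period_ms, version in LEVELS:
    print(level, reward, f"{utilization(exec_ms, period_ms):.0%}", version)
```

Under these numbers the reward grows from 100 to 124 while the utilization grows from 6% to 40%, so each extra unit of reward becomes considerably more expensive at the higher levels.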
and the resource utilization.
State-of-the-art embedded system design methodologies strive to achieve
optimizations in two phases: design-time optimization and runtime optimization.
For design-time optimization, hardware/software co-design strategies are
extensively applied; they partition functionalities into the respective hardware and
software components, synthesize them (including mapping and scheduling), and conduct
hardware/software co-simulations to iteratively improve the performance. On the other
hand, the runtime optimization strategies achieve, at all abstraction levels, per-




Fig. 1.2: Aircraft pitch performance for controller task level 2 and 4.
execution environment dynamism. In this thesis, we focus on the OS-level runtime
optimization techniques, specifically the design of real-time dynamic scheduling
algorithms for adaptive applications.
Dynamic scheduling algorithms differ from their static counterparts in several
ways. In static scheduling, task timings and processor frequencies are
determined prior to execution, and the efficiency of the algorithm itself is of less
concern. In dynamic scheduling, however, the task invocation time and execution speed
are adjusted at runtime, and the algorithm efficiency is of great importance.
Dynamic task scheduling results in less system idle time and better performance
by exploiting the substantial variation in the actual execution time of tasks. An
important input to the dynamic scheduler is the slack time/energy
generated by the preceding tasks [44, 46, 47]. In the context of adaptive
application scheduling, the slack is redistributed to the successor tasks to achieve
QoS improvements beyond what was statically determined, whereas contemporary
energy-minimization-based dynamic schedulers use the slack as room for slowing
down the execution speed.
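The two uses of slack described above can be contrasted in a few lines. This is a minimal sketch, assuming that the finished task and its successor run at the same nominal frequency so that slack can be counted in cycles; the class, function names, and numbers are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class FinishedTask:
    name: str
    budgeted_cycles: int  # worst-case cycles reserved by the static schedule
    actual_cycles: int    # cycles actually consumed at runtime


def slack_cycles(task: FinishedTask) -> int:
    """Cycles left unused by a task that finished early (never negative)."""
    return max(0, task.budgeted_cycles - task.actual_cycles)


def energy_oriented_frequency(freq_hz: float, successor_cycles: int, slack: int) -> float:
    """Energy-minimizing use of slack: run the same cycles at a lower frequency,
    stretching the successor over its own time budget plus the slack."""
    return freq_hz * successor_cycles / (successor_cycles + slack)


def qos_oriented_cycles(successor_cycles: int, slack: int) -> int:
    """QoS-oriented use of slack: keep the frequency and execute extra cycles."""
    return successor_cycles + slack


producer = FinishedTask("t1", budgeted_cycles=1_000_000, actual_cycles=600_000)
slack = slack_cycles(producer)
slower_freq = energy_oriented_frequency(100e6, 2_000_000, slack)
more_cycles = qos_oriented_cycles(2_000_000, slack)
print(slack, round(slower_freq), more_cycles)
```

Both options finish the successor at the same instant; the first spends the slack on a lower speed, the second on additional computation that an adaptive task can convert into quality.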
The design of efficient QoS-aware scheduling algorithms is challenging,
especially because it has to meet many simultaneous design requirements and
constraints. Some generic, as well as adaptive-specific, considerations in dynamic
scheduling algorithm design are listed below.
• Unlike general-purpose OS schedulers that pursue resource fairness,
real-time schedulers have strict temporal requirements. Execution
correctness is judged not only by the computational correctness, but also by the
timeliness of task completion. Carefully deciding the task execution order, as
well as the starting times, to avoid deadline violations is in general a primary
goal for real-time schedulers.
• The dynamic algorithm itself, since it runs in the runtime environment,
has to be efficient in terms of execution time. Established optimization
algorithms such as simulated annealing suffer from poor runtime efficiency.
Besides an appropriate formulation of the scheduling algorithm, heuristics are
sometimes necessary to trade off between optimization quality and efficiency.
• The design of embedded systems, especially battery-powered devices such as
smart phones and wireless sensors, greatly emphasizes energy efficiency. In the
last decade, the Dynamic Voltage Scaling (DVS) technique has been extensively
studied as the mainstream power reduction strategy for platforms with
DVS-enabled processors. However, scheduling is further complicated by the need




• Because embedded systems are usually made to cater to specific
applications, the execution time flexibility of adaptive applications introduces
another dimension to the decision space. That is, the task execution time is
not limited to discrete choices depending on the available DVS frequencies, but
becomes continuous within a range, leading to substantially increased design
complexity and optimization cost.
Besides the intrinsic complexity in adaptive application scheduling algorithm
designs, semiconductor technology trends further complicate the formulation and
solution of the scheduling problems.
• Multiprocessor platforms, usually heterogeneous in nature, introduce
thread-level concurrency and performance differentiation across distinct
processing components. The scheduling decision space is thus exponentially
enlarged, and optimization costs increase drastically.
• With semiconductor technology improvements, the device feature size keeps
shrinking, resulting in significant leakage power, which necessitates incorporating
both dynamic and leakage energy consumption into the scheduling
framework.
• Inter-processor transmissions, as the performance bottleneck of multiprocessor
systems, contribute a substantial portion of the application makespan.
If not specifically taken into account, transmission time variations could




Given the constrained timing and energy requirements, as well as the flexible
nature of adaptive applications, determining an optimized and efficient runtime
schedule is in general not easy, and involves trade-offs between conflicting
optimization objectives. Specifically, traditional DVS techniques can effectively
reduce system energy by scaling down the processor frequency, but they yield no
program quality improvement because the execution cycles remain unchanged. QoS-aware
DVS techniques are needed to strike a trade-off among three conflicting goals:
maximized execution QoS, minimized energy consumption, and real-time deadline
satisfaction.
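The three-way tension can be made concrete with a small search over discrete operating points. The sketch below treats the number of executable cycles as a stand-in for QoS and assumes a per-cycle energy that rises with frequency. The operating points, budgets, and function names are invented for illustration and are not the thesis's energy model or algorithm.

```python
# Hypothetical operating points: (frequency in MHz, energy per cycle in pJ).
OPERATING_POINTS = [(200, 100), (400, 250), (600, 500)]


def affordable_cycles(freq_mhz, energy_pj_per_cycle, time_budget_us, energy_budget_pj):
    """Cycles that fit within both the deadline and the energy budget."""
    by_deadline = freq_mhz * time_budget_us           # MHz * microseconds = cycles
    by_energy = energy_budget_pj // energy_pj_per_cycle
    return min(by_deadline, by_energy)


def best_operating_point(points, time_budget_us, energy_budget_pj, mandatory_cycles):
    """Pick the point that allows the most cycles while covering the mandatory part.

    Returns (freq_mhz, cycles), or None when no point can run the mandatory cycles.
    """
    best = None
    for freq_mhz, energy_pj_per_cycle in points:
        cycles = affordable_cycles(
            freq_mhz, energy_pj_per_cycle, time_budget_us, energy_budget_pj
        )
        if cycles < mandatory_cycles:
            continue
        if best is None or cycles > best[1]:
            best = (freq_mhz, cycles)
    return best


choice = best_operating_point(
    OPERATING_POINTS,
    time_budget_us=1000,
    energy_budget_pj=60_000_000,
    mandatory_cycles=150_000,
)
print(choice)
```

With these numbers the lowest frequency is limited by the deadline and the highest by the energy budget, so an intermediate setting allows the most cycles. Plain energy-minimizing DVS would instead pick the slowest feasible point and leave the cycle count unchanged.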
Contemporary dynamic scheduling approaches are not suitable for the emerging
adaptive applications, not only because they are incapable of taking application
adaptiveness into account, but also because they are slow to consider the
fast-evolving platform-introduced design complexity, such as processor heterogeneity
and the impact of bottlenecked inter-processor communication. Moreover, the lack of
a generic QoS-application model makes the currently available adaptiveness-aware
scheduling approaches ad hoc, as each usually deals with one specific adaptive
application model. A more generic adaptive application model is necessary; a dynamic
scheduling algorithm that targets such a model stands a better chance of being
widely adopted.
1.2 Thesis Contributions
This thesis presents an analytical framework of adaptive application scheduling
methodologies for embedded systems, with special emphasis on dynamic
approaches. The proposed methodologies aim at simultaneously maximizing the QoS
of adaptive applications and respecting the energy and timing budgets. The
proposed framework, as illustrated in Fig. 1.3, is capable of covering various adaptive
application models and platform features, and is developed in a logical manner
with increasing complexity in the problem assumptions: single processor →
homogeneous multiprocessors → heterogeneous multiprocessors with inter-processor
communication.
 Fig. 1.3: Scope of the thesis.
• Our work emphasizes two models of adaptive applications, namely a
representative existing model, the Imprecise Computation (IC) model, and
the proposed generic adaptive application model based
on [QoS, cycle range] pairing. It turns out that the available adaptive
application models can be treated as special cases of our proposed model.
• We start by exploring the dynamic scheduling of applications following the
imprecise computation model on a uniprocessor system. We formally
prove that the QoS gradient of the IC task should be used to
guide the slack distribution, and propose an intra-task voltage scaling scheme
named Gradient Curve Shifting (GCS) that maximizes the total QoS.
• The algorithm is then extended to multiprocessor systems. We provide an
optimized formulation to calculate the maximized QoS considering the slack
parallelization enabled by multiprocessors, and analyze the factors that
substantially impact the QoS gain. The analysis also leads to a two-stage slack
receiver selection heuristic.
• As one of the key merits of the framework, a scheduling methodology for
heterogeneous multiprocessor systems is proposed. It handles the proposed
generic model, which is universally adoptable for various adaptive applications,
and uses an energy model that includes both leakage and dynamic power
consumption. Moreover, we consider the platform impacts on the scheduling
algorithm efficiency, and propose a local scaling scheme to compensate for the
overheads caused by interconnection fluctuations on Network-on-Chip
(NoC) architectures.
• To make our work self-contained, we also propose a static scheduling
algorithm for NoC-based multiprocessor systems. By integrating the traffic time,
the algorithm aims at minimizing the application makespan while simultaneously
achieving two important NoC-based system-level design requirements, namely
application mapping and communication routing.
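The role of the QoS gradient in guiding slack distribution can be illustrated with a simple greedy allocation. This sketch assumes linear QoS functions (QoS gain = gradient × extra optional cycles) with a cap on each receiver's optional cycles, a uniprocessor, and no voltage scaling. It is not the GCS scheme itself, and all names and numbers are hypothetical.

```python
def distribute_slack(receivers, slack_cycles):
    """Greedy slack allocation by descending QoS gradient.

    receivers: list of (name, qos_gradient, max_extra_optional_cycles).
    Returns a dict mapping receiver name to the extra cycles granted.
    """
    allocation = {}
    remaining = slack_cycles
    # Serve the steepest QoS function first; spill leftovers to the next one.
    for name, _gradient, cap in sorted(receivers, key=lambda r: r[1], reverse=True):
        if remaining <= 0:
            break
        grant = min(cap, remaining)
        allocation[name] = grant
        remaining -= grant
    return allocation


def qos_gain(receivers, allocation):
    """Total QoS improvement under the linear-QoS assumption."""
    gradients = {name: gradient for name, gradient, _cap in receivers}
    return sum(gradients[name] * cycles for name, cycles in allocation.items())


receivers = [("t1", 0.5, 300), ("t2", 2.0, 100), ("t3", 1.0, 1000)]
allocation = distribute_slack(receivers, slack_cycles=250)
print(allocation, qos_gain(receivers, allocation))
```

For capped linear functions this greedy order yields the largest total gain, because every cycle goes to the steepest function that can still accept it. Concave QoS functions and DVS require the more careful treatment developed in Chapter 4.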
1.3 List of Publications
1. Heng Yu, Yajun Ha, and Bharadwaj Veeravalli, “Quality-Driven Dynamic
Scheduling for Real-time Adaptive Applications on Multiprocessor Systems
with Communication Awareness,” submitted to IEEE Trans. on Computers.
2. Heng Yu, Bharadwaj Veeravalli, and Yajun Ha, “Energy/QoS-Aware
Dynamic scheduling for Multiprocessor Real-Time Embedded Systems,”
preparing for journal submission.
3. Heng Yu, Bharadwaj Veeravalli, and Yajun Ha, “Leakage-aware Dynamic
Scheduling for Real-time Adaptive Applications on Multiprocessor Systems,”
Proc. Design Automation Conference (DAC’10), pp. 493-498, Anaheim, CA,
June 2010.
4. Heng Yu, Yajun Ha, and Bharadwaj Veeravalli, “Communication-Aware
Multi-Application Mapping and Scheduling for NoC-Based MPSoCs,” Proc. the
IEEE International Symposium on Circuits and Systems (ISCAS’10), pp.
3232-3235, Paris, France, May 2010.
5. Guolei Zhu, Heng Yu, and Yajun Ha, “A Multi-Application Mapping
Framework for Network-on-Chip Based MPSoC: An FPGA Implementation Case
Study,” Proc. the International Conference on Engineering of Reconfigurable
Systems and Algorithms (ERSA’09), pp. 267-270, Las Vegas, NV, June 2009.
6. Yanhui Li, S. Fernando, Heng Yu, Xiaolei Chen, Yajun Ha, and T. T. Tay,
“Tighter WCET Analysis of Input Dependent Programs with Classified-Cache
Memory Architecture,” Proc. of the 15th IEEE International Conference
on Electronics, Circuits, and Systems (ICECS’08), Malta, Aug. 2008.
7. Heng Yu, Bharadwaj Veeravalli, and Yajun Ha, “Dynamic Scheduling of
Imprecise-Computation Tasks for Maximizing QoS under Energy Constraints
for Embedded Systems,” Proc. the 13th Asia and South Pacific Design
Automation Conference (ASP-DAC’08), pp. 452-455, Seoul, South Korea, Jan.
2008.
8. Heng Yu and Yajun Ha, “CPU Scheduling of Imprecise-Computation
modeled DAGs in Maximizing QoS under Energy Constraint,” Proc. of the 2nd
International Ph.D. Student Workshop on SoC (IPS’07), Taiwan, July 2007.
1.4 Organization of the Thesis
The organization of this thesis is as follows. Chapter 2 reviews the historical and
state-of-the-art research related to adaptive application scheduling, with
emphasis on energy and platform issues. Chapter 3 describes the system models
used in the subsequent algorithm presentations, where besides introducing the IC
model, we also propose the generic model of adaptive applications. Chapter
4 presents our imprecise-computation scheduling algorithm for a single processor.
Chapter 5 addresses the extension of the single-processor algorithm to
multiprocessors, by identifying the major differences in the problem definition. In
Chapter 6, the multiprocessor-targeted algorithm is further generalized to consider
the proposed generic model, and to tackle the issues of heterogeneous multiprocessors,
leakage power, and platform overheads. Chapter 7 supplements the preceding dynamic
algorithms by providing a static scheduling algorithm. Chapter 8 concludes the




Chapter 2. Related Work
In this chapter, previous work related to the topic of this thesis is reviewed,
including overviews of existing adaptive application models and scheduling techniques
that are aware of real-time, energy, application adaptiveness, and infrastructural
requirements.
2.1 Adaptive Applications
Application adaptation ambiguously refers to two aspects: execution adaptation
and quality adaptation. As a conventional notion in distributed
computing, execution-adaptable applications feature irregular and unpredictable
computation and communication runtime loads imposed on an execution
platform. There exist many dynamic load balancing methodologies that exploit task
reallocation to alleviate workload “hot-spots” for performance improvement, e.g.
[5][6]. A well-known programming framework for those applications is the GrADS
project meant for Grid applications [7].
In contrast to spatial execution-adaptable applications, quality-adaptable
applications feature graceful degradation mechanisms that focus on
execution quality adjustment and customization, and can be applied in scenarios such
as runtime quality improvement and real-time fault tolerance.
One of the representative adaptive task models is the Imprecise Computation (IC)
model [1], which flexibly terminates the task execution as-is in the presence of
timing constraints, so that an approximate result of acceptable quality, rather than
the exact result, is obtained. Promising for embedded
processing with stringent timing requirements and transient overload
situations, the IC model is observed in real-life applications such as real-time image
transmission that is able to produce fuzzier images under limited network resources
[8], network traffic prediction that approximates the neighbor information
collection to trade off the search precision against time [9], and real-time database
processing that guards against catastrophes caused by transient overloads [10].
Additional models of quality-adaptable applications exist in the literature.
Multiple-version tasks [11, 12] define alternative task versions, with a primary
version producing full-quality results but taking longer to process, and a
backup version producing acceptable results in a timely manner. As a fault
tolerance strategy, a primary-backup framework is proposed to provide fast but
weakly consistent real-time data recovery under limited system resources [13].
Another approach, known as elastic scheduling [14], specifies a task with
multiple periods and an elastic coefficient, so that whenever the system is
overloaded, task periods (and thus the overall execution quality) are adjusted
to reduce the processor utilization. Moreover, an (m, k)-firm guarantee strategy
[15, 16] is proposed to model periodic tasks that can alter the overall quality
by meeting the deadlines of m out of any k consecutive execution instances.
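The (m, k)-firm condition can be illustrated with a small sliding-window check. This is only a sketch: the function name `meets_mk_firm` and the boolean encoding of deadline outcomes are assumptions made for illustration and do not come from [15, 16].

```python
from collections import deque

def meets_mk_firm(outcomes, m, k):
    """Return True if every window of k consecutive instances has >= m hits.

    outcomes: iterable of booleans, True meaning the instance met its deadline
    (hypothetical encoding chosen for this sketch).
    """
    window = deque(maxlen=k)  # keeps only the latest k outcomes
    for met in outcomes:
        window.append(met)
        if len(window) == k and sum(window) < m:
            return False
    return True

history = [True, True, False, True, True, False, True, True]
print(meets_mk_firm(history, m=2, k=3))
print(meets_mk_firm(history, m=3, k=3))
```

Skipping every third instance, as in `history`, satisfies a (2, 3)-firm guarantee but not a (3, 3)-firm one.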
From the practicality perspective, a QoS-negotiation model is proposed as a
methodology of building the QoS spectrum and its associated rewards/penalties [4].
2.2 Application Scheduling Techniques
In this section, scheduling strategies for real-time systems are reviewed. Although
it is a traditional topic, the scheduling algorithm design evolves with the tech-
nology advancements of real-time systems. The following subsections cover several
scheduling development stages, namely scheduling tasks for real-time requirements,
with additional energy requirements, and with additional QoS requirements.
2.2.1 Real-Time Scheduling
The seminal work of Liu and Layland [17] paved the way for priority-based
scheduling methods, which are widely studied and applied as the mainstream
real-time scheduling strategy. In [17], the optimality and feasibility of both
fixed- and dynamic-priority schemes, namely rate-monotonic (RM) and earliest
deadline first (EDF), are discussed. Variants of RM and EDF scheduling are
deadline-monotonic (DM) and least laxity first (LLF), whose optimality is
proved in [18] and [19] respectively.
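As a brief illustration of the feasibility results associated with [17], the sketch below applies the classical utilization tests for independent, preemptible periodic tasks on one processor with deadlines equal to periods: the sufficient RM bound n(2^(1/n) − 1) and the EDF condition U ≤ 1. The task set and function names are hypothetical.

```python
def utilization(tasks):
    """tasks: list of (wcet, period) pairs expressed in the same time unit."""
    return sum(c / t for c, t in tasks)

def rm_sufficient(tasks):
    """Sufficient (not necessary) rate-monotonic test: U <= n(2^(1/n) - 1)."""
    n = len(tasks)
    if n == 0:
        return True
    return utilization(tasks) <= n * (2 ** (1 / n) - 1)

def edf_feasible(tasks):
    """EDF test for implicit-deadline periodic tasks: U <= 1."""
    return utilization(tasks) <= 1.0

tasks = [(1, 4), (2, 6), (3, 10)]  # hypothetical task set, U is about 0.883
print(rm_sufficient(tasks), edf_feasible(tasks))
```

For this task set the RM bound (about 0.780 for three tasks) is exceeded, so the RM test is inconclusive, while the EDF condition holds.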
For multiprocessor strategies, the widely adopted approaches are partitioned and
global scheduling [25]. In partitioned scheduling, each task is assigned to a
designated processor for execution. Hence, well-studied single-processor
scheduling can be applied optimally to the tasks on each processor. However, an
optimal task-processor allocation is proved to exist only for the two-processor
case [26]. Global scheduling is a dynamic approach that manages a dispatch
queue and delivers the task at the queue head to the earliest available
processor. Its biggest advantage is that runtime load balance can be achieved.
Finding optimal schedules for multiprocessors is, in general, an NP-hard
problem [27]. Hence, heuristics are proposed to obtain sub-optimal solutions,
the majority of which are derived from the concept of list scheduling [28, 29].
List scheduling assigns priorities to tasks based on their precedence
constraints and relative deadlines, and allocates the tasks to processors
according to those priorities. Variations on the priority assignment method
include LPT (Longest Processing Time) [30], ETF (Earliest Task First) [31],
critical path-based [32, 33], and cluster-based [34] schemes.
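A minimal sketch of list scheduling with the LPT priority rule is given below. It assumes independent tasks without precedence constraints or deadlines, which is a simplification of the schemes in [28–30]; the function name and the example durations are hypothetical.

```python
import heapq

def lpt_list_schedule(durations, num_procs):
    """Assign independent tasks to processors, longest task first.

    Each task goes to the processor that becomes free earliest.
    Returns (makespan, assignment) with assignment[task_index] = processor.
    """
    order = sorted(range(len(durations)), key=lambda i: -durations[i])
    heap = [(0.0, p) for p in range(num_procs)]  # (time when free, processor)
    heapq.heapify(heap)
    assignment = {}
    for i in order:
        free_at, p = heapq.heappop(heap)
        assignment[i] = p
        heapq.heappush(heap, (free_at + durations[i], p))
    makespan = max(t for t, _ in heap)
    return makespan, assignment

makespan, assignment = lpt_list_schedule([7, 5, 4, 3, 3, 2], num_procs=2)
print(makespan, assignment)
```

Handling precedence constraints would additionally require that a task enters the priority list only after all of its predecessors have finished.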
2.2.2 Energy-Aware Scheduling
Timing and energy consumption turn out to be conflicting goals in embedded
system design, especially for battery-powered devices where energy efficiency
is an imperative design goal. The intuitive design is to suspend the processor
when no task requires execution, but re-activation brings considerable energy
and timing overhead. To improve the efficiency of the suspension approach,
history-based prediction heuristics are proposed in [35, 36]. However, the major
body of energy-aware scheduling is built on the Dynamic Voltage Scaling (DVS)
technique [37]. DVS is based on the fact that the dynamic power consumption is
quadratically related to the supply voltage and linearly related to the
execution frequency [38]. Since the execution frequency is linearly related to
the supply voltage, a reduced voltage leads to cubically reduced power
consumption at the price of linearly increased execution time.
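The trade-off described above can be checked numerically. The sketch below works with relative quantities only and assumes the idealized relations stated in the text (power proportional to voltage squared times frequency, frequency proportional to voltage); the function name is hypothetical.

```python
def dvs_scaling(voltage_ratio):
    """Relative power, execution time and energy for a fixed cycle count when
    the supply voltage (and hence the frequency) is scaled by voltage_ratio."""
    power = voltage_ratio ** 3        # P ~ V^2 * f, with f ~ V
    exec_time = 1.0 / voltage_ratio   # same cycles at a lower frequency
    energy = power * exec_time        # equals voltage_ratio ** 2
    return power, exec_time, energy

power, exec_time, energy = dvs_scaling(0.5)
print(power, exec_time, energy)
```

Halving the voltage cuts the power to one eighth and doubles the execution time, so the energy for the same work drops to one quarter.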
In the more general case where a multiprocessing environment is assumed,
application-level task scheduling methodologies achieve energy and performance
goals by deciding the invocation time, execution speed and volume, and
task-processor mapping of every task in the system. Since real-time tasks
feature variable execution times that are typically shorter than their
worst-case execution times (WCETs) [39], the scheduling process can be divided
into two phases: (1) Static Scheduling, in which task-processor assignments and
frequency scaling decisions are made offline prior to task execution, and (2)
Dynamic Scheduling, in which the task invocation time and execution speed are
adjusted at runtime to reclaim any unused slack time and energy, either to
further reduce energy or to enhance performance.
Static scheduling with energy minimization for more than two processors is
usually NP-hard [27]; therefore, heuristics have been proposed to obtain
sub-optimal results. The most common strategy is to first map the tasks onto
appropriate processors, and then apply voltage scaling for energy minimization
[40]. In [41], the author proposes task assignment heuristics based on simulated
annealing and applies list scheduling with a priority function. Zhang et al.
[42] adopt an integer programming formulation for the execution time decision,
while the initial task assignment is done by packing the schedule as tightly as
possible [42, 45]. Goh et al. [43] combine task mapping and voltage scheduling
into an integrated framework. Mishra et al. [44] assume that the task mapping is
given, and propose a static slack allocation scheme that explores the degree of
parallelism in the schedule. Moreover, studies (e.g. [49, 50]) have exploited
voltage switching opportunities within a task, which is called intra-task DVS.
Compared to static scheduling, dynamic scheduling strategies are relatively
insufficiently explored on multiprocessor systems. For uniprocessor systems,
Mossé et al. [46] propose and evaluate several heuristics for runtime task speed
determination, and conclude that greedy methods in general do not perform
better than methods that consider tasks globally. For multiprocessor systems,
Zhu et al. [47] propose a slack sharing scheme that divides the dynamic slack
among different processors, so that the application deadline is not missed for
both dependent and independent task sets. In [44], on a task graph with fixed
processor assignment, the dynamic slack is given to the next available task. In
[48], Luo et al. heuristically distribute the runtime slack evenly to the tasks
in the hyperperiod. Most of these approaches are greedy, namely, they give the
slack to the next ready task of the appropriate processor. However, task-wide
inspection approaches for better energy savings can be further explored.
Contemporary semiconductor technology has reached the nano-scale, at which the
significant leakage power contribution necessitates combining both dynamic and
leakage energy consumption in the scheduling framework [52]. One technique,
Adaptive Body Biasing (ABB), has been studied to flexibly change the threshold
voltage to achieve exponential leakage current reduction [53? ], enabling
embedded power-aware scheduling to consider both bias voltage and supply
voltage. Leakage-aware scheduling methodologies are proposed to reduce the
system-level energy consumption. At the instruction level, [54] exploits system
slack periods, during which the leakage reduction mechanism is invoked using
compiler-inserted commands. At the task level, a 3-approximation algorithm [55]
is proposed for combined leakage and dynamic energy minimization assuming a
continuous frequency range. In [56], procrastination-based voltage selection is
performed offline, and system on/off is employed as the online leakage
reduction strategy. The authors also propose a 2-approximation algorithm for
leakage minimization on multiprocessors [57].
Overhead-awareness is an essential indicator of the efficiency of a scheduling
algorithm in real-life situations. The works in [47, 58, 59] specifically
consider the timing and energy transition overheads incurred on DVS-capable
processors, by mathematically modeling the overheads and incorporating them
into their frameworks. However, no assumptions are made on the underlying
communication platforms; thus the impact of transmission fluctuations on
overheads is rarely addressed in the literature.
2.2.3 Scheduling for Adaptive Applications
Scheduling techniques for adaptive applications carry an additional goal: QoS
maximization. Together with the above-mentioned timing and energy constraints,
the problem formulation for adaptive applications is complicated by this extra
dimension. When QoS is measured as a function of computation volume, deciding
the execution time of a task becomes far more complicated. In the framework of
imprecise computation, while early works focus on various timing
characteristics [60–62], recent publications comprehensively consider the
timing, energy, and QoS aspects for optimization.
For single-processor systems, [67] proposes a dynamic DVS technique for
IC-modeled tasks, aimed at maximizing QoS under the available energy budget. The
authors present a quasi-static approach that obtains several
speed/optional-cycle candidate pairs in the offline stage, and dynamically
applies the most suitable candidate to maximize the QoS value. Aydin et al. [68]
provide an optimal static solution to the IC task scheduling problem using
convex programming. For multi-version tasks, [69] proposes the MV-Pack
algorithm, which selects the proper version for each task instance in order to
maximize rewards under a rechargeable energy budget model. Nonetheless, none of
the above work targets multiprocessing environments.
2.3 NoC-Aware Scheduling and Mapping
Network-on-Chip as an interconnection-network for Multiprocessor Systems-on-
Chip (MPSoC) has attracted great interest in the ﬁeld of embedded processing.
Several prototype works include the MIT RAW [76] and the Intel 80-tile architec-
ture [77]. While those architectures show greater advantages over traditional archi-
tectures in terms of throughput performance [78], customized application mapping
and scheduling techniques are essential for the execution eﬃciency by fully exploit-
ing the hardware features.
Recently, several application mapping and scheduling algorithms targeting the
NoC architecture have been presented [79–81]. [79] proposes a data
transmission-oriented task-to-processor mapping methodology, in which both
computation tasks and shared data are mapped with the objective of the shortest
data transmission path. [80] presents a task mapping method based on link
bandwidth balancing, and [81] proposes an energy-aware branch-and-bound
heuristic for mapping and scheduling a real-time application onto a tile-based
NoC. [82] proposes communication-aware scheduling algorithms that reduce the
total energy consumption including communication losses. These
communication-aware algorithms are offline approaches that assume the
worst-case delay to guarantee real-time requirements, while online/dynamic
strategies for further quality improvement are not exploited. Nevertheless, the
static approaches can serve as an offline entry point, on which our dynamic
approach further improves the execution quality online.
Whereas [79–82] are concerned with transmission lengths, at runtime it is the
variation of the transmission lengths that actually affects the delivered
slack. Transmission variations on NoC systems are present at both the data
level and the infrastructure level, depending on the type of QoS provided,
namely guaranteed (GS) or best-effort (BES) packet transmission [83]. GS can be
implemented by logical path reservation and time division multiplexing
(TDM)-based bandwidth allocation to ensure throughput requirements [84]. NoC
prototypes implementing GS include Æthereal [85] and Nostrum [86]. On the other
hand, BES provides packet-based transmission whose performance relies heavily
on the routing and switching mechanisms for packet relay. Existing routing
algorithms include XY routing [87], odd-even turn-model adaptive routing [88],
DyAD, which supports runtime switching between deterministic and dynamic
routing [89], and PROM, based on progressive and randomized on-hop path
decision [90]. It is arguable that BES may not suit real-time applications;
however, existing NoC prototypes, e.g. Æthereal, provide both GS and BES to
guarantee performance and improve efficiency. The authors of [91–93] have also
focused on soft/statistical GS or GS/BES hybrid research.
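Of the routing algorithms listed above, XY routing is the simplest to illustrate: on a 2D mesh, a packet first travels along the X dimension until it reaches the destination column, and then along Y. The sketch below assumes tiles addressed by integer (x, y) coordinates; the function name is hypothetical and the code is not taken from [87].

```python
def xy_route(src, dst):
    """Return the list of (x, y) tiles visited from src to dst on a 2D mesh,
    moving along the X dimension first and then along Y."""
    x, y = src
    path = [(x, y)]
    while x != dst[0]:
        x += 1 if dst[0] > x else -1
        path.append((x, y))
    while y != dst[1]:
        y += 1 if dst[1] > y else -1
        path.append((x, y))
    return path

print(xy_route((0, 0), (2, 1)))
```

Because the path depends only on the source and destination, the hop count is fixed, while the actual latency of a best-effort packet still varies with contention on the links it crosses.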
Chapter 3
System Modeling and Problem
Formulation
3.1 Architectural and Energy Model
The platforms that our algorithms target range broadly from uniprocessor
systems to heterogeneous multiprocessor systems with an underlying NoC
communication infrastructure.
We describe the platforms starting from heterogeneous processor systems, which
serve as the superset of both homogeneous and uniprocessor systems. A
heterogeneous multiprocessor system is represented by an undirected graph
$G_a(P, L)$, with processing element$^1$ $p_i \in P, \forall i \in [0, |P| - 1]$,
and link $l_{p_i,p_j} \in L, \forall \{p_i, p_j\} \subseteq P$ representing the
physical duplex connection of adjacent processors. The link set $L$ is part of
the underlying communication architecture. Each processor $p_i$ can operate
1In this work, we consider the processing elements as the devices with processing capability,
and not limited to CPUs. For simplicity, we use “processors” as the synonym for processing
elements in this work.
CHAPTER 3. System Modeling and Problem Formulation
in discrete frequency ranges, denoted as
$F^{p_i} = \{f_0^{p_i}, \ldots, f_{J-1}^{p_i}\}$, with $J = |F^{p_i}|$. Note that
both $(F^{p_i}, F^{p_j})$ and $(f_k^{p_i}, f_k^{p_j})$ are not necessarily equal
pairs due to processor heterogeneity.
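The architecture model can be captured by a small data structure. The following is a minimal sketch, assuming processors identified by strings and frequencies given in MHz; the class name `Platform`, its methods, and the example values are hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class Platform:
    """Undirected graph G_a(P, L): processors with discrete frequency sets
    and duplex links between adjacent processors."""
    freqs: dict = field(default_factory=dict)  # processor -> sorted freqs (MHz)
    links: set = field(default_factory=set)    # each link is frozenset({pi, pj})

    def add_processor(self, name, freqs_mhz):
        self.freqs[name] = sorted(freqs_mhz)

    def add_link(self, a, b):
        if a not in self.freqs or b not in self.freqs:
            raise KeyError("both endpoints must be known processors")
        self.links.add(frozenset((a, b)))  # undirected: (a, b) equals (b, a)

    def neighbors(self, p):
        return sorted(q for link in self.links if p in link
                      for q in link if q != p)

noc = Platform()
noc.add_processor("p0", [600, 400, 200])
noc.add_processor("p1", [500, 300, 100])  # heterogeneous frequency set
noc.add_processor("p2", [600, 300])
noc.add_link("p0", "p1")
noc.add_link("p1", "p2")
print(noc.neighbors("p1"))
```

A uniprocessor system is then the special case of a single processor with an empty link set.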
For every $p_i$, each of its operating frequencies $f_k^{p_i}$ corresponds to a
per-cycle energy consumption $\epsilon_k^{p_i}$. To quantify $\epsilon_k^{p_i}$,
we note that the processor energy consumption is dominated not only by the
dynamic power, but also, as devices scale down, by the leakage power.
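The dynamic power consumption is directly related to the processor clock
frequency and supply voltage:
$$P_{dyn} = C_{eff} V_{dd}^2 f_k^{p_i}, \quad (3.1)$$
where $C_{eff}$ is the effective switching capacitance. Suppose $p_i$ runs at
$f_k^{p_i}$ for $c_i$ cycles; the dynamic energy consumption is then
$$E_{dyn} = P_{dyn}\, t_i = C_{eff} V_{dd}^2 c_i, \quad (3.2)$$
where the relationship between the execution time $t_i$ (for $p_i$ to execute
$c_i$ cycles) and the supply voltage is
$$t_i = \frac{c_i}{f_k^{p_i}} = \frac{K V_{dd}}{(V_{dd} - V_{th})^{\alpha}} c_i, \quad (3.3)$$
where $K$ is a technology-dependent constant, $\alpha$ is the saturation velocity
index, typically between 1.4 and 2, and $V_{th}$ is the threshold voltage.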
The leakage power consumption is static and not directly related to processor
behaviors. The leakage contributors at nano-scale level include subthreshold, gate

Fig. 3.1: Typical gate leakage behavior of Intel 45nm HK+MG transistors, compared to 65nm
Poly/SiON transistors[51].
tunneling, and reverse bias junction [52]. Though gate tunneling leakage can be
considerable, its percentage is expected to drop signiﬁcantly because of improved
manufacturing process and new materials used. In fact, Intel has devised HK+MG
transistors in the 45nm technology, and reduced the gate leakage more than 1000×,
as shown in Fig. 3.1. Hence in this work, we consider subthreshold and junction
leakages adopting the formulation in [53], which proposes an adjustable reverse
bias voltage Vbs that ﬂexibly changes the CMOS device threshold voltage, in order
to achieve exponential leakage current reduction.
The leakage power consumption is deﬁned as
$$P_{sta} = V_{dd} K_3 e^{K_4 V_{dd}} e^{K_5 V_{bs}} + |V_{bs}| I_j, \quad (3.4)$$
where K3, K4, and K5 are derived constants dependent on process technology, and
Ij is the approximate constant junction leakage current.
Table 3.1: Frequency and energy-per-cycle relationship.
Freq. (MHz)   600    500    400    300    200    100
Ecyc (nJ)     2.46   1.95   1.58   1.14   0.86   0.67
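Thus the total power consumption, including dynamic and leakage components, is
$$P = C_{eff} V_{dd}^2 f_k^{p_i} + V_{dd} K_3 e^{K_4 V_{dd}} e^{K_5 V_{bs}} + |V_{bs}| I_j. \quad (3.5)$$
The energy consumption per cycle is then
$$E_{cyc} = C_{eff} V_{dd}^2 + L_g \left(f_k^{p_i}\right)^{-1} \left( V_{dd} K_3 e^{K_4 V_{dd}} e^{K_5 V_{bs}} + |V_{bs}| I_j \right), \quad (3.6)$$
where $L_g$ is logic path length of the circuit.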
As observed from (3.6), under a fixed $f_k^{p_i}$, there is a range of $E_{cyc}$
due to varying $(V_{dd}, V_{bs})$ pairs. By properly adjusting the
$(V_{dd}, V_{bs})$ values, $E_{cyc}$ can be minimized at each $f_k^{p_i}$, and the
minimum is taken as $\epsilon_k^{p_i}$. According to [53], $\epsilon_k^{p_i}$
changes monotonically with $f_k^{p_i}$. Table 3.1 shows the $E_{cyc}$-$f$
dependencies we derived for the Crusoe 5600 processor, based on the parameters
presented in [53].
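To make the use of Table 3.1 concrete, the sketch below computes the time and energy needed to execute a given number of cycles at each tabulated frequency, and picks the lowest frequency that still meets a deadline. The table values are taken from Table 3.1; the function names, the cycle counts, and the deadline are hypothetical.

```python
# Table 3.1: frequency (MHz) -> energy per cycle (nJ)
ECYC_NJ = {600: 2.46, 500: 1.95, 400: 1.58, 300: 1.14, 200: 0.86, 100: 0.67}

def run_cost(cycles, freq_mhz):
    """Return (time in ms, energy in mJ) for executing cycles at freq_mhz."""
    time_ms = cycles / (freq_mhz * 1e6) * 1e3
    energy_mj = cycles * ECYC_NJ[freq_mhz] * 1e-9 * 1e3
    return time_ms, energy_mj

def slowest_feasible(cycles, deadline_ms):
    """Lowest tabulated frequency that finishes within deadline_ms, or None."""
    for f in sorted(ECYC_NJ):
        if run_cost(cycles, f)[0] <= deadline_ms:
            return f
    return None

print(run_cost(1_000_000, 600))
print(slowest_feasible(1_200_000, deadline_ms=5.0))
```

Since the tabulated energy per cycle decreases with frequency, the slowest feasible frequency is also the cheapest one in energy for a fixed cycle count.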
3.2 Application Model
In our work, we study the adaptive applications represented in the imprecise com-
putation model in Chapters 4 and 5, and then extend our framework to consider the
generic cycle-QoS modeling in Chapter 6. Despite the diﬀerent models we studied,
the problems can be uniformly formulated.
We model the applications to be scheduled by a set of application identities,
which contain the two important application properties, namely adaptiveness and
precedence relationships, that are studied in our algorithms.
Definition 3.1: An application identity is modeled by a periodic Directed
Acyclic Graph (DAG) $G_t(T, E)$, where the vertex
$t_i \in T, \forall i \in [0, |T| - 1]$ represents the adaptive processing
task$^2$, and the edge $e_{t_i,t_j} \in E$ enforces the precedence relationship
between two tasks.
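Definition 3.1 can be illustrated with Python's standard `graphlib` module (Python 3.9 or later), which derives an execution order respecting all precedence edges and rejects cyclic graphs. The four-task application below is hypothetical.

```python
from graphlib import CycleError, TopologicalSorter

# Hypothetical application: each task maps to the set of tasks it depends on.
predecessors = {
    "t0": set(),
    "t1": {"t0"},
    "t2": {"t0"},
    "t3": {"t1", "t2"},
}

def execution_order(preds):
    """Return one task order that respects every precedence edge.

    Raises CycleError if the graph is not acyclic.
    """
    return list(TopologicalSorter(preds).static_order())

order = execution_order(predecessors)
print(order)

try:
    execution_order({"a": {"b"}, "b": {"a"}})
except CycleError:
    print("not a DAG")
```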
It should be noted that $t_i$ in our work represents a coarse-grained code chunk.
In a general notion, ti can be formed by a set of consecutive ﬁne-grained program
segments (usually represented by control ﬂow graphs (CFG)), and the quality adap-
tiveness is achievable by selectively executing the CFGs. Diﬀerent ways of relating
the execution length to quality exist, and the following two deﬁnitions describe the
two adaptive models that reﬂect diﬀerentiated QoS features for ti ∈ T in our work.
Definition 3.2: The imprecise computation model is defined such that a task
$t_i$ is logically divided into two execution stages, namely the mandatory and
optional portions. $t_i$ must execute $m_i$ mandatory cycles to guarantee a
minimum acceptable level of quality. Then, optionally, $o_i$ cycles can be used
to process the optional part in a best-effort manner [66].
For an imprecise-computation-modeled $t_i$, the more optional cycles $o_i$ are
executed, the more QoS is generated. We use $\Delta o_i$ to represent the
optional cycles generated at runtime.
2We use node and task interchangeably to indicate a task in the DAG.
Each $t_i$ is associated with a QoS function $f_i(o_i)$, which is a quantified
measure of the generated QoS. In this work we use a linear QoS function,
$f_i(o_i) = k_i o_i + c_i$, which is adequate to capture the monotonically
increasing relationship between QoS values and optional cycles, and more
complex curved functions can be approximated by





The imprecise computation model, a typical way of modeling QoS in terms of
execution cycles, can be very effective for applications that have fixed input
patterns or relatively unsophisticated computing architectures. For
applications with large execution time/cycle variations due to input or modern
architectural complexities, the imprecise computation model may not yield an
accurate one-to-one QoS-cycle correspondence. We therefore propose a generic
QoS-cycle model that reflects the discrete feature that a certain QoS level is
actually related to a cycle range, and that such ranges may overlap.
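For reference, the imprecise computation model with a linear QoS function can be sketched as a small data type. The class name `ICTask`, the field names, and the numeric values are hypothetical; `optional_max` plays the role of the upper bound on optional cycles.

```python
from dataclasses import dataclass

@dataclass
class ICTask:
    mandatory: int     # m_i: cycles that must always be executed
    optional_max: int  # upper bound on the optional cycles o_i
    k: float           # gradient of the linear QoS function
    c: float = 0.0     # offset of the linear QoS function

    def qos(self, optional):
        """Linear QoS f_i(o_i) = k_i * o_i + c_i for 0 <= o_i <= optional_max."""
        if not 0 <= optional <= self.optional_max:
            raise ValueError("optional cycles out of range")
        return self.k * optional + self.c

    def total_cycles(self, optional):
        return self.mandatory + optional

task = ICTask(mandatory=1000, optional_max=500, k=0.02, c=1.0)
print(task.qos(0), task.qos(500), task.total_cycles(500))
```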
Definition 3.3: The generic QoS modeling of $t_i$, denoted as $Q_i$, is defined
as a set of ascending numerical values that serve as the quantitative measure
of the application quality, with the following properties:
• Each $q_i \in Q_i$ defines a quality level of executing $t_i$. Details of
specifying $q_i$ can be found in [4]. Similar to non-adaptive applications, to
achieve $q_i$ (for the non-adaptive case, there exists exactly one $q_i$ in
$Q_i$ to indicate the completion of the task), the processing cycles are
bounded by the best-case $B_{q_i}$ and worst-case $W_{q_i}$ execution cycles.
• For $q_i < q_j$, it is possible for $[B_{q_i}, W_{q_i}]$ and
$[B_{q_j}, W_{q_j}]$ to overlap. However, $B_{q_i} < B_{q_j}$, to reflect the
fact that processing cycles strictly increase to achieve improved quality.
• For each actual execution at runtime, exactly one $q_i$ is associated with
$t_i$. The execution cycle is denoted as $c_i \in [B_{q_i}, W_{q_i}]$. If
$q_i < q_j$, then $c_i < c_j$, to reflect that processing cycles strictly
increase to achieve improved quality. The maximum cycles for $t_i$ is
$W_{q_{|Q_i|-1}}$.
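The structural properties of Definition 3.3 can be checked mechanically. The sketch below validates a list of quality levels and, as an additional assumption made only for illustration, selects the highest level whose worst-case cycles fit into a given cycle budget. All names and numbers are hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class QoSLevel:
    q: float    # quality value q_i
    best: int   # best-case cycles B_q
    worst: int  # worst-case cycles W_q

def validate_levels(levels):
    """Check ascending q, B <= W per level, and strictly increasing B.
    Overlapping [B, W] ranges are allowed."""
    if any(lv.best > lv.worst for lv in levels):
        return False
    return all(lo.q < hi.q and lo.best < hi.best
               for lo, hi in zip(levels, levels[1:]))

def guaranteed_level(levels, cycle_budget):
    """Highest level whose worst-case cycles fit in cycle_budget, or None."""
    fitting = [lv for lv in levels if lv.worst <= cycle_budget]
    return max(fitting, key=lambda lv: lv.q) if fitting else None

levels = [QoSLevel(1, 100, 200), QoSLevel(2, 150, 300), QoSLevel(3, 250, 450)]
print(validate_levels(levels), guaranteed_level(levels, 320))
```

In this example the ranges of adjacent levels overlap, yet the best-case bounds still increase strictly, so the list is accepted.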
It turns out that the generic model is a valid generalization of the available
representative QoS models. The imprecise computation model can be treated as a
special case in which each $q_i$ matches exactly one cycle number instead of the
range $[B_{q_i}, W_{q_i}]$. The two-version task model can also be treated as a
special case in which each $Q_i$ contains two $q_i$ values.
It is important to mention that the goal in our work is to maximize the
application execution quality, namely the total $q_i$. However, since the
mathematical description of $q_i$ can be fundamentally different for distinct
applications, we convert $q_i$ maximization to $c_i$ maximization, for the
following reasons: (1) As described above, $q_i$ can be improved with a larger
cycle “budget” for execution; (2) Even if the increased cycles cannot reach the
threshold for the next $q_i$ level, the leftover cycles can still be distributed
as extra slack to subsequent tasks; (3) By the time-frequency relationship,
$c_i$ can be handled more easily than $q_i$.
The goal of ci maximization for the generic model coincides with that for the
imprecise computation model, thus a uniﬁed problem deﬁnition can be derived with
the common scheduling goal.
3.3 Problem Deﬁnition
To properly define our dynamic algorithms, we give the whole picture of their
position in the scheduling framework, as well as the required input and output
parameters.
Definition 3.4: The Static Scheduling Algorithm is defined as a pre-execution
scheduling function $F_s : (G_t, G_a) \to SPAS$ that takes all the application
identities as input, and determines the task-processor binding, task execution
ordering, task execution speed (processor frequency), and application QoS. Its
output takes the form of a statically processed application set (SPAS), which
provides an initial status for the system execution.
Definition 3.5: The SPAS, represented as a DAG $G_t(T, E)$, contains all
statically scheduled tasks with the following properties:
• For multiprocessors, the task-processor binding, $\mathcal{P} : T \to P$, is
applied to all tasks of all application identities, with
$T_{G_t} = \bigcup_{\forall G_t} T$.
• As a result of the binding, we can obtain: (1) the statically determined
execution speed of $t_i$, denoted as $f_k^{\mathcal{P}(t_i)}$, as well as the
supply voltage $v_i$ to support the frequency (see (3.3)), (2) the statically
determined execution cycles $c_i$, and (3) the statically determined quality
level $q_i$, where the worst-case cycles $W_{q_i}$ that ensure that quality are
to be executed for $t_i$. For imprecise computation models, $c_i$ is matched to
$q_i$.
• The task execution order on $P$ is determined for $T_{G_t}$. Thus, $E_{G_t}$
contains both the innate task precedence $\bigcup_{\forall G_t} E$, and a set of
introduced precedence relationships due to processor allocation. In total, the
edge $e_{t_i,t_j} \in E_{G_t}$ can have either of two meanings: (1) $e_{t_i,t_j}$
implicitly enforces the execution order that $t_j$ starts after $t_i$
completes; (2) $e_{t_i,t_j}$ explicitly enforces the execution order if $t_j$ is
data-dependent on $t_i$, namely $t_j$ starts after $t_i$ completes and has
received the necessary data from $t_i$.
Definition 3.6: The slack time, $\tau_i$, is defined as the difference between
the statically determined (worst-case) and the actual execution times used to
achieve $q_i$. The associated slack energy, $E_i$, is thus calculated as the
product of the per-cycle energy and the slack cycles, namely
$E_i = \epsilon_k^{\mathcal{P}(t_i)} f_k^{\mathcal{P}(t_i)} \tau_i$.
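A small numerical sketch of Definition 3.6 follows. It assumes that the statically scheduled and the actual executions run at the same frequency, so that the slack cycles equal the frequency times the slack time; the function name and the cycle counts are hypothetical, while the frequency and per-cycle energy are the 600 MHz entry of Table 3.1.

```python
def slack(static_cycles, actual_cycles, freq_hz, energy_per_cycle_j):
    """Return (slack time in s, slack energy in J) for one completed task."""
    slack_cycles = static_cycles - actual_cycles
    tau = slack_cycles / freq_hz                 # slack time
    energy = energy_per_cycle_j * freq_hz * tau  # per-cycle energy x slack cycles
    return tau, energy

# 900k cycles reserved, 600k actually used, at 600 MHz and 2.46 nJ per cycle
tau, e_slack = slack(900_000, 600_000, 600e6, 2.46e-9)
print(tau, e_slack)
```

Here the task leaves 0.5 ms of slack time and roughly 0.74 mJ of slack energy for subsequent tasks.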
Deﬁnition 3.7: The Dynamic Scheduling Algorithm (DSA) is deﬁned as a runtime
scheduling function that aims at further enhancing the execution quality making
use of system slack resources τi and Ei.
The dynamic algorithm is invoked following the completion of any $t_i$, and
transforms the slack resources into extra execution quality of the subsequent
tasks. It takes as input the SPAS with the task-processor binding and
precedence information, as well as the task speed and quality (either from the
SPAS or resulting from a previous DSA invocation). Its output includes the
modified speed $f_k^{\mathcal{P}(t_d)}$ (hence the associated execution energy
$\epsilon_k^{\mathcal{P}(t_d)}$), as well as the increased execution cycles
$c_d$, for all tasks $t_d$ in the slack receiver group (SRG).
With few previous works proposing quality-aware dynamic scheduling algorithms,
this thesis focuses on dynamic scheduling algorithm design that aims at
maximizing the extra execution quality in a timely manner. To achieve this
goal, our work specifically deals with the following aspects: (1) For both
single- and multiprocessor systems, we identify the SRG that is capable of
generating the largest quality, in terms of extra cycles, upon receiving
slack resources $\tau_i$ and $E_i$. (2) Among the tasks in the SRG, we propose
efficient algorithms that achieve the largest cycles through optimization
processes that make use of dynamic frequency scaling technologies; different
algorithms are proposed for the imprecise computation and generic models.
(3) Practicality constraints such as leakage energy and inter-processor
communication are considered. To cope with possible quality deterioration due
to $\tau_i$ variation on the transmission platform, we also propose a fast local
frequency scaling method to further reduce the quality loss. (4) To make our
work self-contained, we





Chapter 4
Scheduling Imprecise Computation Tasks on a Single
Processor
In this chapter, we propose a dynamic scheduling algorithm named gradient curve
shifting (GCS) for IC-modeled tasks on single processor systems. The objective
of the algorithm is to maximize total application QoS without violating energy
constraints, by proper frequency scaling. This is an important problem to be
addressed for embedded devices since energy and QoS need to strike a trade-oﬀ
in performance optimization, given the fact that tasks impose real-time processing
requirements. The work also paves the road for formulating and solving more
complicated problems on multiprocessor systems.
Our approach first covers the static scheduling algorithm in brief, with an
optimized formulation. Then, we study how the optional cycles can be generated
CHAPTER 4. Scheduling Imprecise Computation Tasks on a Single Processor
from the slack under non-DVS condition, considering both linear and concave QoS
functions in the IC model. After that, the strategy can be easily extended to the
processor with DVS enabled to execute the optional tasks. The operating system
(OS) is assumed to have both inter- and intra-task DVS capability, namely, the IC-
task can adapt its execution speed before and during execution. This would provide
better QoS maximization opportunity than traditional inter-task DVS approaches.
Compared to the state-of-the-art approaches [67][68], GCS achieves better QoS
results with low complexity.
4.1 Static Scheduling Strategy
It is imperative to describe the static scheduling mechanism that sets the starting
point of designing a dynamic scheduling scheme. In the IC model, mandatory
tasks must be completely executed, while optional tasks may not be completed
under resource constraints. In such cases, optional task execution time has to
be decided to trade off certain metrics such as timing, QoS, and/or energy. The
optional execution cycles should be statically decided in a way that maximizes the
optional task QoS in (3.7). Here we adopt the same model proposed in [67] to
formulate the static optimization problem. For a task set N containing task i ∈ N ,






$$0 \le o_i \le O_i \quad (4.2)$$
$$V_{min} \le v_i \le V_{max} \quad (4.3)$$
$$\sum_{i \in N} t_i \le T_d \quad (4.4)$$
$$E_N \le E_{budget} \quad (4.5)$$
where $E_N$ stands for the total energy consumption from Eqn. (3.2). We assume the
starting time of the first task to be 0. The objective of the optional cycle decision
is to maximize the total QoS value represented in (3.7), subject to constraints in
(4.2)-(4.5). According to [67], this problem is polynomial time solvable.
4.2 Dynamic Slack Reclamation without DVS
Static scheduling naturally assumes that tasks run for their Worst Case
Execution Times (WCETs). Even for a statically determined QoS value, the
execution time needed to achieve it varies due to microarchitectural
uncertainties. Hence, at run time, both mandatory and optional tasks are likely
to complete before their statically scheduled completion time. The resulting
run-time slack can be reclaimed and redistributed to subsequent tasks to derive
additional QoS, making use of the flexibility of the IC model. In this section,
we first describe a method that strives to obtain the maximal QoS for a set of
IC tasks with linear QoS functions. The GCS algorithm dealing with concave QoS
functions is presented in the second part.
4.2.1 Slack allocation for linear QoS functions
Suppose a task with a linear QoS function finishes its scheduled work
(mandatory and optional) S cycles ahead of schedule. We claim:
Theorem 4.1 : Under timing and energy constraints, the maximal QoS gained
from dynamic slacks is achieved by allocating the slack cycles to the task with the
largest gradient in its linear QoS function.
Proof. Suppose task 1 has the largest QoS gradient, and a total of n tasks are
left in the system after the execution of task k. Then Theorem 4.1 is
equivalent to
$$f_1(o_1 + S) + \sum_{i=2}^{n} f_i(o_i) \ge \sum_{i=1}^{n} f_i(o_i + \bar{o}_i), \quad (4.6)$$
where oi is the statically scheduled optional execution cycles, o¯i represents optional
task cycles assigned from the slack using any other allocation methods, e.g. [67][68].




We prove (4.6) by contradiction. Suppose allocating S to task 1 does not give
the maximal extra QoS; then we have
$$f_1(o_1 + S) + \sum_{i=2}^{n} f_i(o_i) < \sum_{i=1}^{n} f_i(o_i + \bar{o}_i), \quad (4.7)$$
which is equivalent to
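$$f_1(o_1 + S) - f_1(o_1 + \bar{o}_1) < \sum_{i=2}^{n} \left( f_i(o_i + \bar{o}_i) - f_i(o_i) \right). \quad (4.8)$$
Dividing both sides by $S - \bar{o}_1$, (4.8) becomes
$$\frac{f_1(o_1 + S) - f_1(o_1 + \bar{o}_1)}{S - \bar{o}_1} < \sum_{i=2}^{n} \frac{f_i(o_i + \bar{o}_i) - f_i(o_i)}{S - \bar{o}_1}. \quad (4.9)$$
Note that $S - \bar{o}_1 = \sum_{j=2}^{n} \bar{o}_j$, so (4.9) can be written as
$$f'_1(o_1) < \sum_{i=2}^{n} \frac{\bar{o}_i}{\sum_{j=2}^{n} \bar{o}_j} f'_i(o_i), \quad (4.10)$$
where $f'_i(o_i)$ is the gradient of the linear QoS function of task i,
$\forall i \in [2, n]$. We consider a special case where all $f'_i(o_i)$,
$\forall i \in [2, n]$, are equal. Thus (4.10) turns out as
$$f'_1(o_1) < f'_i(o_i), \quad \forall i \in [2, n]. \quad (4.11)$$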
This contradicts the condition that task 1 has a larger gradient than all
others. Up to this point, Theorem 4.1 is partially proved without considering
timing and energy constraints.
Interestingly, this allocation scheme is immune to timing and energy violations,
because execution time and energy are both linear in the number of executed
cycles. This is easily observable from equations (3.2) and (3.3), treating $v_i$
as a constant under non-DVS conditions. At runtime, if S slack cycles are
generated by task i, they are reclaimed by other tasks, meaning the total
number of cycles remains unchanged. Because of the linear relationship
mentioned above, the total time and energy remain unchanged as well. Hence the
allocation scheme based on the maximum gradient does not violate the timing
and energy constraints. Hence the proof.
A special situation that may occur is when the remaining optional cycles of a
task are less than the slack cycles allocated. In this case, we allocate the remaining
slack in a greedy fashion. That is, if the node with the largest slope has no extra
optional cycles, we allocate the slack to the node with the second largest slope, and
so on.
Fig. 4.1: (a) S within S′. Allocating S to i gives the maximal QoS. (b) Left shifting i by S cycles.
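The greedy rule above can be sketched in a few lines of Python. This is only an illustrative sketch: the function name `allocate_linear` and the list-based task representation are assumptions, not part of the original work.

```python
def allocate_linear(slack, gradients, remaining):
    """Greedy slack allocation for linear QoS functions.

    gradients[i]: slope of task i's linear QoS function.
    remaining[i]: optional cycles of task i that are still unscheduled.
    Returns the extra cycles per task and the slack left over.
    """
    extra = [0] * len(gradients)
    # Visit tasks from the largest gradient to the smallest.
    order = sorted(range(len(gradients)), key=lambda i: gradients[i], reverse=True)
    for i in order:
        if slack <= 0:
            break
        # A task cannot absorb more than its remaining optional cycles.
        extra[i] = min(slack, remaining[i])
        slack -= extra[i]
    return extra, slack


extra, left = allocate_linear(100, [0.5, 0.9, 0.7], [200, 60, 30])
```

In this example the task with gradient 0.9 is saturated first, then the one with gradient 0.7, and the rest of the slack goes to the remaining task.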
4.2.2 Slack allocation for concave QoS functions
Unlike linear QoS functions with ﬁxed gradients, gradients of concave functions
can change from one time instance to another. However, we can still apply Theorem
4.1 iteratively in concave QoS scenarios.
For linear QoS functions, if the gradient of function fi(oi) is greater than that
of function fj(oj), then slack S should be given to task i at any instance. Similarly,
for concave QoS functions, if slack S is given to i within an interval when any
value of gradient i is larger than that of gradient j (see Fig. 4.1(a), S ′ being the
interval), then according to Theorem 4.1, the resultant extra QoS is the largest.
Allocating slack S to i can be viewed as left shifting curve i by S (see Fig. 4.1(b)).
Fig. 4.2: (a) S larger than S′. S cannot be fully allocated to i. (b) Shifting i by S′ so that curves
i′ and j intersect at the y-axis. (c) Shifting j by Sj and i′ by Si simultaneously.
On the other hand, if S exceeds S′, as shown in Fig. 4.2(a), then after left shifting
i by S′ as explained above (see Fig. 4.2(b)), the remaining slack (S − S′) has to be
distributed between tasks i and j. We shift curves i and j together. This is because
if we keep shifting curve i to the left, some portion of curve j must be above curve
i, meaning that slack is not given to curve j, which has the larger gradient. If
the two curves are shifted together, the resultant curves also have to intersect at
the same point on the Y-axis, as shown in Fig. 4.2(c). Assuming curve i is shifted
by Si, and curve j by Sj, their relationship can be expressed as an equation set:
\[ \begin{cases} S_i + S_j = S_{remain} = S - S' \\ f_i(S_i) = f_j(S_j) \end{cases} \tag{4.12} \]
Moreover, with more tasks shifting together, thus more gradient curves, this
equation set extends to
\[ \begin{cases} \sum_{i=1}^{k} S_i = S_{remain} \\ f_i(S_i) = f_j(S_j), \quad \forall i, j \in [1, k] \end{cases} \tag{4.13} \]
Algorithm 4.1: GCS(S)
    if curve i is the highest
        SHIFT i until it intersects the next highest curve
    else if i is as high as other curves
        SHIFT them together until they intersect the next highest curve
    RETURN the remaining slack
Function: MAIN(S)
    for each task i that finishes with slack S
        while S > 0
            S = GCS(S)
Note that functions fi and fj are the QoS functions updated after every previous
shift. The algorithm is shown in Algorithm 4.1. It is invoked each time a task has
finished execution and generated slack. If it returns some slack after the slack
allocation, it is invoked again for another round of scheduling.
The worst-case complexity of the GCS algorithm is $O(N^2)$, which happens only
when the slack is large enough that all other tasks are able to obtain some share.
During the scheduling process, the slack is first given to the curve with the largest
gradient until its gradient value falls to the maximum value of the second highest
curve; then the two curves are shifted together until they reach the third, and so
on, until all curves reach the same point on the Y-axis. Assuming each such shift
consumes O(1) time, the complexity of the whole process is O(1 + 2 + ... + N),
equivalent to $O(N^2)$.
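The curve-shifting idea can be approximated by a discretized sketch in Python. This is not the GCS implementation itself: instead of solving the intersection equations, it hands out the slack in small chunks, always to the task whose current gradient is largest, which for concave QoS functions also drives the gradients toward a common value. The function name `gcs_discrete`, the `step` parameter, and the representation of gradients as callables are assumptions made for illustration.

```python
import heapq


def gcs_discrete(gradients, allocated, slack, step=1):
    """Distribute `slack` cycles in chunks of `step` (discretized curve shifting).

    gradients[i]: callable returning the QoS gradient of task i at a cycle count.
    allocated[i]: optional cycles already scheduled for task i.
    Returns the extra cycles per task and the slack that could not be used.
    """
    extra = [0] * len(gradients)
    # Max-heap on the current gradient (negated, since heapq is a min-heap).
    heap = [(-g(o), i) for i, (g, o) in enumerate(zip(gradients, allocated))]
    heapq.heapify(heap)
    remaining = slack
    while remaining > 0 and heap:
        neg_grad, i = heapq.heappop(heap)
        if -neg_grad <= 0:
            break  # no task can gain QoS from further cycles
        chunk = min(step, remaining)
        extra[i] += chunk
        remaining -= chunk
        # Re-insert the task with its gradient after the shift.
        heapq.heappush(heap, (-gradients[i](allocated[i] + extra[i]), i))
    return extra, remaining


# Two hypothetical concave QoS functions with gradients 10 - o and 6 - o.
extra, left = gcs_discrete([lambda o: 10 - o, lambda o: 6 - o], [0, 0], 6)
```

In the example, the first task receives slack alone until its gradient drops to that of the second task, after which both are served, mirroring the "shift i, then shift i and j together" behaviour described above.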
4.3 Dynamic Slack Reclamation under DVS
In the previous section we explored the slack allocation methods assuming the
processor operates at a constant speed. In this section we take DVS into the
picture. This will add another dimension into the search space, in which both
optional cycles and operating voltages have to be determined to derive the maximal
QoS. However, rather than considering them comprehensively as in [69] or [67],
our approach tackles the problem in two phases. Firstly, given the slack energy
and slack time available, we derive the largest possible slack cycles oi by properly
choosing its operating voltage. Secondly, using the GCS algorithm, we distribute
the oi cycles to the most appropriate tasks.
4.3.1 Deciding maximal optional cycles
Before the onset of the slack, a task i works under its statically scheduled voltage,
denoted by $v_i^{sta}$. If there are any dynamic slacks generated from i, we have $\tau_i$ as
the slack time, and $E_i$ as the slack energy. Now, given constraints on extra energy
$E_i$ and time $\tau_i$, we need to determine the voltage $v_i$ such that the number of extra
cycles $o_i$ is maximum within $\tau_i$. Since QoS is represented as a non-decreasing
function of cycles, it is maximized only with maximum $o_i$. In order to find the
maximum $o_i$, we claim:
Theorem 4.2: To obtain the largest $o_i$ under energy and timing constraints $E_i$
and $\tau_i$, its operating voltage $v_i$ should remain at $v_i^{sta}$, i.e. $v_i = v_i^{sta}$.
Proof. We simplify Eqn. (3.3) by setting $V_{th} = 0$ and $\alpha = 2$, as in [67]. By
Eqns. (3.2) and (3.3), we then have
\[ v_i = \left( \frac{k E_i}{C \tau_i} \right)^{1/3}, \tag{4.14} \]
\[ o_i = \left( \frac{E_i \tau_i^2}{C k^2} \right)^{1/3}, \tag{4.15} \]
where $o_i$ replaces $M_i + o_i$ since only dynamic slack $o_i$ is considered. Because $v_i$,
$E_i$, and $\tau_i$ are intrinsically bounded, there exists a bounded $E_i$−$\tau_i$ space as shown
in the shaded region in Figure 4.3, where the two lines starting from the origin
have gradients $C V_{max}^3 / k$ and $C V_{min}^3 / k$ respectively. The hyperbola curve represents
the function $E_i = C k^2 o_i^3 \times \frac{1}{\tau_i^2}$, for which $o_i$ holds the value that makes this curve
contain the point $P(E_i, \tau_i)$. By Eqns. (4.14) and (4.15), every pair of $(E_i, \tau_i)$ values
determines a pair of $(v_i, o_i)$ values. Notice that in Eqn. (4.15), $o_i$ is maximized
when both $E_i$ and $\tau_i$ are maximum. It means that in the shaded region, point P
represents the voltage/cycle pair in which $o_i$ is maximum whilst $E_i$ and $\tau_i$ are both
fully utilized. On the other hand, if $E_i$ and $\tau_i$ are fully used, the statically assigned
energy E and time T of task i are fully used as well, so
\[ E = \frac{C (v_i^{sta})^3}{k} \times T. \tag{4.16} \]
Also, the cycles (both mandatory and optional ones) already executed by task i
under voltage $v_i^{sta}$ take $T - \tau_i$ units of time and consume $E - E_i$ units of energy, so
\[ E - E_i = \frac{C (v_i^{sta})^3}{k} \times (T - \tau_i). \tag{4.17} \]
From (4.16) and (4.17), it is immediately apparent that $E_i / \tau_i = C (v_i^{sta})^3 / k$, and
hence, by (4.14), $v_i = v_i^{sta}$.
In total, we can conclude that if $E_i$ and $\tau_i$ are maximally used, $v_i$ must be set to
$v_i^{sta}$ and simultaneously $o_i$ is maximized. Hence the proof.
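The relationship between the slack budget and the resulting voltage can be checked numerically. The Python sketch below assumes the simplified model implied by the hyperbola and line gradients in the proof, namely time $\tau = k\,o / v$ and energy $E = C v^2 o$ (with $V_{th} = 0$, $\alpha = 2$); the constants and the function name `slack_voltage_and_cycles` are hypothetical and chosen only for illustration.

```python
import math


def slack_voltage_and_cycles(energy, tau, c, k):
    """Voltage and cycle count that consume exactly `energy` within `tau`,
    under the assumed model tau = k * o / v and energy = c * v**2 * o."""
    v = (k * energy / (c * tau)) ** (1.0 / 3.0)
    o = (energy * tau ** 2 / (c * k ** 2)) ** (1.0 / 3.0)
    return v, o


# Hypothetical constants; a task statically scheduled at 1.4 V leaves
# 1000 cycles unexecuted, which defines the slack time and slack energy.
C, K = 1e-9, 1e-9
V_STA = 1.4
slack_cycles = 1000.0
tau = K * slack_cycles / V_STA          # slack time freed by the task
energy = C * V_STA ** 2 * slack_cycles  # slack energy freed by the task

v, o = slack_voltage_and_cycles(energy, tau, C, K)
```

Using both budgets in full returns the static voltage and the original number of slack cycles, which is the behaviour Theorem 4.2 states.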












Fig. 4.3: The Energy−Time space
4.3.2 Allotting optional cycles
With the maximum $o_i$ determined, we allocate the slacks to tasks to achieve
the largest dynamic QoS. As stated in Theorem 4.2, the $o_i$ slack cycles keep the
operating voltage of the slack generator. But the slack receiver may have been
previously allocated cycles whose voltage differs from $v_i$, so intra-task DVS is
applied for voltage switching within the slack receiver. On the other hand, during
the static optimization phase, tasks are generally assigned different operating
voltages, which means our approach is also inter-task DVS-based. The combination
of inter- and intra-task DVS provides more optimization opportunity. Our
experimental results show the superiority in dynamic QoS generation compared to
other schemes. Moreover, because $o_i$ conforms to the energy and timing
constraints, wherever these cycles are allocated, the total system energy and
timing constraints remain satisfied. The effect of voltage scaling overhead in time
and energy, as well as the online computation overhead, needs to be investigated
further since it should not be neglected.
4.4 Results and Discussion
To evaluate our GCS algorithm, we have performed extensive simulation studies
with over 700 synthesized task sets. Each task set is assigned a deadline
Td on the order of several milliseconds. The number of cycles of each task
is calculated based on Td and the number of tasks in the task set, and randomly
generated with a uniform distribution. Nevertheless, we ensure that the total task
execution time does not exceed Td for the sake of constrained optimization. The
energy budgets are also calculated based on task cycles, but are adjusted to
realize a reasonable energy constraint. We choose the voltage levels in the interval
of [1.2 V, 1.6 V], as in most realistic cases. Values of C and k used in our simulation
studies are adopted from [53]. The QoS functions for tasks are adopted from [67]




3θioi, where αi, βi, and θi are randomly generated between
0 and 1.
We have compared our GCS algorithm with two other methods, namely the
quasi-static approach with 4 points per task and deadline slack between 0.1 and
0.6 (see [67]), and the inter-task based optimal approach, which runs the static
optimization process each time a task ﬁnishes execution and generates slack time
and energy (see [68]). Fig. 4.4 shows the result, in which every point shows the
average result of evaluating 50 task set instances. The data for comparison are
additional QoS generated in the dynamic phase, represented by
QoSdynamic = QoSactual −QoSstatic, (4.18)
in which QoSactual is the total QoS measured after the actual execution of the
task set, and QoSstatic is the total QoS calculated after the static scheduling phase
(see Section 4.1). Results by the GCS algorithm and quasi-static approach are
normalized by the inter-task DVS optimal results. We observe from Fig. 4.4 that
our GCS algorithm generates up to 18% more extra QoS than the inter-task
DVS based optimal approach for different numbers of tasks in a task set.
Fig. 4.4: Normalized dynamic QoS vs. no. of tasks.
To show that the above superiority of GCS over the inter-task DVS optimal
solution is attributable to intra-task DVS, we conducted a second set of
experiments, setting Td to be sufficiently long so that only energy is the scarce
resource. Under this condition, the static optimization process results in every
task running at the lowest voltage, 1.2 V. Hence in both the inter-task DVS based
optimal approach and our GCS algorithm, no DVS is required since all tasks run
at 1.2 V. The results in Fig. 4.5 show that the GCS algorithm is very close to the
optimal but never exceeds it. Since DVS is the only factor that differs between
the two experiments, we conclude that the GCS algorithm gives better performance
because of the intra-task DVS capability.
Fig. 4.5: Effects of no DVS applicable to GCS and optimal solutions.
We also explore the time and energy budget utilization of the GCS algorithm,
shown in Fig. 4.6. We can see that the energy utilization of the GCS algorithm is
better than that of the inter-task DVS optimal approach. This explains why GCS
generates larger QoS despite the fact that its time utilization is smaller than that
of the inter-task DVS optimal approach.






In this chapter, we describe a dynamic scheduling algorithm for IC tasks on mul-
tiprocessor real-time embedded systems, with the primary objective of enhancing
the system QoS making use of runtime slack time and energy. More specifically, by
utilizing the runtime slack, we hope to generate the maximum $\sum (f_i(o_i + \Delta o_i) - f_i(o_i))$
from the optional parts of all slack receivers, without violating timing constraints
or incurring more energy consumption than Ei. The approach is able to address two
aspects that fundamentally complicate the scheduling problem on multiprocessing
environments. Firstly, for a set of slack receivers that execute in parallel, what
are the criteria of slack distribution to generate the maximum QoS gain, and how
differently τ and E are utilized compared to a single processor system. Secondly,
how those distribution criteria can be employed to guide the search for the most
QoS-rewarding slack receiver set.
CHAPTER 5. Scheduling Imprecise Computation Tasks on Multiprocessors
Fig. 5.1: Framework of multiprocessor dynamic scheduling for IC tasks.
Dynamic scheduling is part of the framework in Fig. 5.1. The framework
consists of the Static Scheduling and Dynamic Scheduling blocks. In this thesis,
we focus on the Dynamic Scheduling block, and assume that the Static Scheduling
block generates a static schedule to the Dynamic Scheduling block. To make our
work self-contained, we also present a static scheduling algorithm in Chapter 7 as
the prerequisite work of dynamic scheduling algorithms.
We begin describing the multiprocessor scheduling algorithm by illustrating a
motivational example, which gives a first impression of the challenges and
opportunities of scheduling IC tasks on multiprocessor systems. Then we derive the
optimal slack time/energy distribution rules for a slack receiver set. A
node-classification based methodology is also proposed to search for the most
QoS-rewarding receiver set for QoS maximization.
5.1 Motivational Example
Fig. 5.2 shows a runtime scenario to illustrate the multiprocessor challenges. The
processor mapping and execution sequence of tasks are generated from a static
scheduling algorithm and assumed not to alter dynamically. Attributes of the
tasks are listed in Table 5.1. At runtime, task ②¹ completes 30 μs earlier than
statically scheduled, i.e., generates 30 μs of slack time. Assuming $K = 10^{-9}$ and
$C_{eff} = 10^{-9}$, whose typical orders of magnitude are obtained from [53], and all





















Fig. 5.2: (a) Illustrative example where ② distributes slack. (b) Slack distribution results on ④,
where S is used to generate Δo4. Note that all tasks in (a) are IC-modeled, thus are divided into
mandatory and optional parts, e.g. m4 and o4. For clarity, this is not shown in (a).
¹ For simplicity of description, within this thesis, we denote a task in the dependent task set
by circling its ID.
Table 5.1: Task attributes in Fig. 5.2: statically scheduled time (SST), immediate parent nodes (IPN), and ki.
Task ID     1    2    3     4     5     6    7
SST (μs)    50   60   60    40    30    40   40
IPN ID      -    -    1,2   2     3     4    4
ki          1    1    1.4   1.7   1.5   1    1
To begin with, we focus on tasks ④, ⑥, and ⑦ that can receive the full slack
time S = 30 μs, and compare the QoS gain by distributing the slack to different
task combinations. According to Chapter 4, for each application sub-path, viz.
④ → ⑦ and ④ → ⑥, S should be given to only one task that has the maximal QoS
gradient. Hence, the possible slack receivers are {④} and {⑥, ⑦}, but not {④,
⑥} or {④, ⑦}. Between {④} and {⑥, ⑦}, it is intuitive to select the latter since
the cumulative QoS gradient (ki) is 2 compared to 1.7 of ④. However, as Es
is another constraint, optimally 4761 extra QoS gain can be obtained from {⑥,
⑦} by executing Δo6 and Δo7 at 79.4 MHz, while 5100 QoS can be obtained from
{④} by executing Δo4 at 100 MHz. In Section 5.2, we describe our optimization
formulation used to calculate the above QoS values. We show that, among task
sets that are equally eligible to claim the slack time, the QoS gradient is the only
criterion in comparing QoS gain, and can be used to guide receiver selection. Since
ki is constant, the slack receiver candidates can be prioritized offline for efficient
online selection.
Moreover, tasks ③ and ⑤ can only claim 10 μs if ① is not completed at the
moment, and such a decision can only be made online due to the uncertainty of ①.
Between {③} and {⑤}, the slack time (10 μs) is also equally eligible to be claimed,
i.e., ③ and ⑤ can claim the same amount of slack time. Hence, it is reasonable
to form subgraphs containing slack receivers of ②. In each subgraph, [④, ⑥, ⑦]
and [③, ⑤] respectively, each task has equal eligibility to claim the same amount
of slack time, 30 μs and 10 μs respectively. The sets of receivers are selected and
prioritized offline for each subgraph. At the online stage, only the amount of slack
given to each subgraph has to be determined, and slack distribution can be efficiently
calculated based on the slack and prioritized slack receivers (e.g. {④, ⑤} in which
{④} ∈ [④, ⑥, ⑦] and {⑤} ∈ [③, ⑤]) using our proposed formulation in Section
5.2. The offline and online algorithms are described in Section 5.3 in detail.
5.2 Slack Distribution Optimality Analysis
In this section we discuss the strategy to optimally determine the extra optional
cycles within a SRG, T , given available runtime slack τ and E . Three scheduling
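objectives have to be satisfied, namely (1) maximization of QoS, (2) execution time
of optional cycles cannot exceed τ to preserve timing constraints, and (3) energy
consumption of optional cycles cannot exceed E. The problem is formulated as:
\[ \text{maximize} \quad \sum_{i \in T} \left( f_i(o_i + \Delta o_i) - f_i(o_i) \right), \tag{5.1} \]
\[ \text{subject to} \quad \sum_{i \in T} \left( C_{eff} K^2 \frac{\Delta o_i^3}{\tau_i^2} \right) \leq E, \tag{5.2} \]
\[ f_{min} \leq \frac{\Delta o_i}{\tau_i} \leq f_{max}, \quad \forall i \in T, \tag{5.3} \]
\[ 0 \leq o_i + \Delta o_i \leq O_i. \tag{5.4} \]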
Constraint (5.2) is derived from Eqns. (3.2) and (3.3), and τi denotes the slack
time available to task i. The variables of interest are Δoi, ∀i ∈ T. The timing
constraint is satisfied by allocating τi to Δoi, and the energy constraint is satisfied by
utilizing only E for cycle generation. Constraint (5.3) limits the frequency within
the acceptable range.
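Theorem 5.1: For a set T of tasks with linear QoS functions $f_i(o_i) = k_i o_i + c_i$,
where $k_i$ is the QoS gradient, and with unlimited frequency range, the maximal
QoS gain is
\[ \Delta QoS_{max} = \left( \frac{E}{C_{eff} K^2} \right)^{1/3} \left( \sum_{i \in T} \tau_i k_i^{1.5} \right)^{2/3}, \]
which is achieved by allocating to each task $i \in T$
\[ \Delta o_i = \tau_i k_i^{0.5} \left( \frac{E}{C_{eff} K^2 \sum_{j \in T} \tau_j k_j^{1.5}} \right)^{1/3}. \]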
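Proof. With the assumption of infinite frequency range, we prove the theorem
by solving the standard optimization problem (5.1) − (5.2) using Lagrange
multipliers, while the frequency limitations (5.3) are analyzed afterwards.
We form the Lagrangian
\[ L(o_i, \lambda) = \sum_{i \in T} k_i \Delta o_i - \lambda \left( \sum_{i \in T} C_{eff} K^2 \frac{\Delta o_i^3}{\tau_i^2} - E \right), \]
and differentiate $L(o_i, \lambda)$ with respect to $\Delta o_i$. Applying Kuhn-Tucker conditions,
with $\Lambda = 3 \lambda C_{eff} K^2$, we have
\[ k_i - \Lambda \left( \frac{\Delta o_i}{\tau_i} \right)^2 = 0, \quad \forall i \in T, \tag{5.5} \]
\[ \sum_{i \in T} \left( C_{eff} K^2 \frac{\Delta o_i^3}{\tau_i^2} \right) = E. \tag{5.6} \]
From (5.5), $\Lambda = k_i \left( \frac{\tau_i}{\Delta o_i} \right)^2$. Substituting this into (5.6), and further processing the
result, we obtain
\[ \Delta o_i = \tau_i k_i^{0.5} \left( \frac{E}{C_{eff} K^2 \sum_{j \in T} \tau_j k_j^{1.5}} \right)^{1/3}, \tag{5.7} \]
\[ \Delta QoS = \sum_{i \in T} k_i \Delta o_i = \left( \frac{E}{C_{eff} K^2} \right)^{1/3} \left( \sum_{i \in T} \tau_i k_i^{1.5} \right)^{2/3}. \tag{5.8} \]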
The optimal Δoi obtained from (5.7) is unaware of the frequency constraints (5.3).
To deal with violations of (5.3): (1) If $\frac{\Delta o_i}{\tau_i} < f_{min}$, to maintain the value of Δoi,
τi can be reduced to $\frac{\Delta o_i}{f_{min}}$. (2) If $\frac{\Delta o_i}{\tau_i} > f_{max}$, we can only reduce Δoi to $\tau_i \times f_{max}$
because increasing τi violates the timing constraint. The cycle limit constraint
(5.4) will be dealt with in the next section.
Observed from (5.7), without considering the upper limit $O_i$, $\tau_i$ and $k_i^{0.5}$
determine $\Delta o_i$. Moreover, for a task set where every task can claim the same slack
time, $\sum k_i^{1.5}$ becomes the sole determinant for the overall QoS. Since $k_i$ is an
innate attribute of an IC-modeled task, receiver selection can be processed offline.
For the example in Fig. 5.2, if T = {④}, the QoS increase is proportional to
$(1.7^{1.5})^{2/3} = 1.7$, according to (5.8). If T = {⑥, ⑦}, the total QoS increase
is proportional to $(1^{1.5} + 1^{1.5})^{2/3} = 2^{2/3} \approx 1.59$. Thus, with the same τi, T = {④} is
prioritized offline. At the online stage, when ② generates slack, ④ can be used
directly. It implies that the dynamic execution overhead can be partially alleviated
by offline preparation, if receivers can be grouped into proper subgraphs, such that
in each subgraph tasks can claim identical slack time.
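The QoS values quoted in the motivational example can be reproduced with a short Python sketch. It assumes the energy model $E = C_{eff} K^2 \Delta o^3 / \tau^2$ per task (i.e. $V_{th} = 0$, $\alpha = 2$), under which the allocation maximizing $\sum k_i \Delta o_i$ is proportional to $\tau_i k_i^{0.5}$. The slack energy of $3 \times 10^{-8}$ J is a hypothetical value, back-computed so that ④ alone runs at 100 MHz; the function name `distribute_slack` is also illustrative.

```python
def distribute_slack(k, tau, energy, c_eff=1e-9, K=1e-9):
    """Closed-form slack distribution for linear QoS functions.

    k[i]: QoS gradient of receiver i, tau[i]: slack time it can claim,
    energy: slack energy shared by all receivers.
    Returns the extra cycles per receiver and the total QoS gain.
    """
    weight = sum(t * g ** 1.5 for g, t in zip(k, tau))
    scale = (energy / (c_eff * K ** 2 * weight)) ** (1.0 / 3.0)
    delta_o = [t * g ** 0.5 * scale for g, t in zip(k, tau)]
    gain = sum(g * d for g, d in zip(k, delta_o))
    return delta_o, gain


TAU = 30e-6      # 30 us of slack time from task 2
ENERGY = 3e-8    # assumed slack energy (see lead-in)

cycles_4, gain_4 = distribute_slack([1.7], [TAU], ENERGY)
cycles_67, gain_67 = distribute_slack([1.0, 1.0], [TAU, TAU], ENERGY)
freq_67 = cycles_67[0] / TAU  # operating frequency of tasks 6 and 7
```

Under these assumptions the receiver set {④} yields a gain of about 5100 and {⑥, ⑦} about 4762 at roughly 79.4 MHz, in line with the figures given in Section 5.1.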
5.3 Slack Receiver Selection
According to the previous discussion, to facilitate ranking the tasks by $\sum k_i^{1.5}$,
tasks receiving identical τi have to be grouped. Then, after the receiver selection
procedure within each group, all receiver candidates from all groups need to be
combined for final SRG determination, and this can be viewed as a divide-and-
conquer approach. To be specific, this approach consists of three steps: First, task
grouping should be conducted on all slack receivers receiving equal τi = τ. Second,
within each group, not all tasks are in parallel due to precedence constraints, thus
a receiver selection method should be applied. Third, while the first two phases can
be implemented offline, at runtime there are issues, such as slack time determination,
insufficient Δoi increasing budget, and task allocation conflict, that need to
be addressed online. In this section, we explain our methodology to deal with the
problems in the three steps.
5.3.1 Task grouping
For task grouping, we propose a graph decomposition routine that is applied
on each task ts, i.e., each slack generator, based on the analysis of the degree of
candidate slack receivers. Moreover, the group is named as either full candidate set
(FCS) or partial candidate set (PCS) as deﬁned below.
Deﬁnition 5.1: The degree d of a task is deﬁned as the number of its incident
edges in the dependence graph.
Deﬁnition 5.2: An FCS of ts is deﬁned as the set of slack receivers, ti, that
completely receive the full slack time of the slack generator, namely τi ≡ τ.
Definition 5.3: A PCS of ts is defined as the set of slack receivers, ti, that may
not receive the full slack time due to precedence constraints, namely τi ≤ τ.
An illustration of FCS and PCS is shown in Fig. 5.3(a), where tasks {ⓒ, ⓓ,
ⓖ, ⓗ} belong to the FCS of ⓐ, while tasks {ⓘ, ⓙ, ⓛ, ⓜ, ⓝ} belong to the PCS
of ⓐ. A natural but misleading observation is that a candidate receiver ti with
d(ti) ≥ 2 should be grouped into the PCS. However, it depends on whether the slack
generator is the sole slack source, i.e., whether there exist other independent slack
sources at the time of slack generation. For example, in Fig. 5.3(b), task ⓙ should
be grouped in the PCS of ⓗ, but in the FCS of ⓐ. The graph decomposition






















Fig. 5.3: (a) Graph decomposition illustration for ⓐ. Note that the link between ⓓ and ⓙ is
omitted due to precedence redundancy. The same applies to ⓔ and ⓜ. (b) A task can belong to
the PCS or FCS of different slack generators.
The graph decomposition algorithm is shown in Algorithm 5.1. Firstly,
all predecessor tasks of ts are removed since at runtime they have already been
completed. Then, graph traversal is conducted. For each child ti of ts, find its root
ancestor. If the root is other than ts, the child belongs to the PCS, and all descendant
nodes of the child are peeled off from the graph and put into the PCS. If ts is
the only root, then the child belongs to the FCS.
Algorithm 5.1: Graph_Decomp(G, ts)
    G′ = RMOV_PARENTS(G, ts)
    FCS_ts = PCS_ts = ∅
    TRAVERSE(G′), for each ti of ts
        t′s = FIND_ROOT(ti)
        if t′s ≠ ts
            PCS_ts = PCS_ts ∪ {ti} ∪ ALL_CHILD(ti)
        else
            FCS_ts = FCS_ts ∪ {ti}
Note that for ts, its FCS and PCS are in the same position in receiving slacks,
with the only diﬀerence in the amount of slack time that can be received. In the
subsequent sections, we will explain the slack time determination. Nevertheless,
the methods applied on receiver selections within FCS and PCS are identical.
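A possible way to realize the decomposition is sketched below in Python. It is a simplified reading of Algorithm 5.1, not the original implementation: the DAG is assumed to be given as a dictionary mapping each task to its children, a receiver is placed in the FCS only if every one of its remaining ancestors traces back to ts, and the function name `graph_decomp` is illustrative.

```python
def graph_decomp(children, ts):
    """Split the descendants of slack generator `ts` into (FCS, PCS).

    children: dict mapping a task to the list of its child tasks (a DAG).
    """
    # Build the reverse adjacency (task -> set of parents).
    parents = {}
    for task, kids in children.items():
        parents.setdefault(task, set())
        for child in kids:
            parents.setdefault(child, set()).add(task)

    def reach(start, edges):
        """All nodes reachable from `start` by following `edges`."""
        seen, stack = set(), list(edges.get(start, ()))
        while stack:
            node = stack.pop()
            if node not in seen:
                seen.add(node)
                stack.extend(edges.get(node, ()))
        return seen

    done = reach(ts, parents)        # predecessors of ts, already completed
    receivers = reach(ts, children)  # all candidate slack receivers
    memo = {}

    def full(task):
        # Full slack only if each remaining parent is ts or itself a full receiver.
        if task not in memo:
            memo[task] = all(
                p == ts or (p in receivers and full(p))
                for p in parents[task] - done
            )
        return memo[task]

    fcs = {t for t in receivers if full(t)}
    return fcs, receivers - fcs


# Task graph of Table 5.1 (IPN row), with task 2 as the slack generator.
dag = {1: [3], 2: [3, 4], 3: [5], 4: [6, 7]}
fcs, pcs = graph_decomp(dag, 2)
```

For the task graph of Table 5.1 this yields the two subgraphs discussed in Section 5.1: tasks 4, 6 and 7 can claim the full slack of task 2, while tasks 3 and 5 depend on the unfinished task 1.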
5.3.2 Receiver selections in FCS and PCS
Given that the slack time is duplicated for parallel receivers, the selection
process has to admit as many parallel receivers as possible, to create the maximal
$\sum k_i^{1.5}$. However, precedence relationships introduce mutual exclusiveness among
the dependent tasks. Moreover, the cycle increase potential should also be considered,
to deal with a cycle budget that is exhausted at runtime (as $o_i + \Delta o_i$ approaches $O_i$).
Hence, the receiver selection process should find a list of task sets, each of
which contains a maximal number of parallel tasks. Two such sets can be found
from Fig. 5.3(a), as {ⓖ, ⓒ, ⓓ} and {ⓖ, ⓗ, ⓓ}. At runtime, the slack can be given to
the set with not only the largest $\sum k_i^{1.5}$, but also sufficient cycle increase budget.
Deﬁnition 5.4: A slack receiving candidate (RC) is deﬁned as a subset of the
FCS or PCS (CS for simplicity), containing parallel tasks that have no precedence
relationships.
The problem of finding all RCs can be solved by exhaustive search at design
time, based on the fact that tasks in an RC are in parallel such that no two
of them have precedence constraints. A searching routine is shown in Algorithm
5.2, where searching is conducted on each task by eliminating its precedence-related
tasks, and the RC is formed by adding the remaining independent tasks to it. Finally,
RCs that are found more than once during the search are removed.
Algorithm 5.2: Enumerate_RC(CS)
    for each ti ∈ CS
        ADD ti → RCi
        /* Block ti's precedence tasks */
        FIND RC′ = CS \ PREC(ti)
        repeat CHOOSE tj ∈ RC′
            ADD tj → RCi
            RC′ = RC′ \ PREC(tj)
        until RC′ = ∅
    for all RCi, REMOV duplicated
With all RCs found for the CS, we sort them according to the QoS increase
potential, in descending order of $\sum k_i^{1.5}$. In this way, to determine the slack
receiver in the CS at runtime, the RC with the largest $\sum k_i^{1.5}$ can be directly used,
assuming the cycle budget, $O_i - (o_i + \Delta o_i)$, is sufficient. In the case of insufficient
budget, an efficient method is presented in the following, making use of the sorted
RC list.
Fig. 5.4: An example showing runtime slack time uncertainty for PCS, S = τs.
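The offline enumeration and sorting of RCs can be illustrated with the following Python sketch. It uses a brute-force search over subsets instead of the incremental routine of Algorithm 5.2, which is only practical for small candidate sets; the function name `enumerate_rcs` and the representation of precedence as a set of unordered task pairs (assumed to include transitive relations) are assumptions for illustration.

```python
from itertools import combinations


def enumerate_rcs(cs, related, k):
    """Enumerate maximal sets of mutually independent tasks in a candidate set.

    cs: iterable of task ids in the FCS or PCS.
    related: set of frozenset({a, b}) pairs that have a precedence relation.
    k: dict mapping task id to its QoS gradient.
    Returns the RCs sorted by descending sum of k**1.5.
    """
    tasks = sorted(cs)
    independent = []
    for size in range(1, len(tasks) + 1):
        for combo in combinations(tasks, size):
            # Keep the subset only if no two members are precedence-related.
            if all(frozenset(pair) not in related for pair in combinations(combo, 2)):
                independent.append(frozenset(combo))
    # An RC is maximal if it is not a proper subset of another independent set.
    maximal = [s for s in independent if not any(s < other for other in independent)]
    return sorted(maximal, key=lambda s: sum(k[t] ** 1.5 for t in s), reverse=True)


# FCS of task 2 in the motivational example: 4 precedes both 6 and 7.
rcs = enumerate_rcs(
    {4, 6, 7},
    {frozenset({4, 6}), frozenset({4, 7})},
    {4: 1.7, 6: 1.0, 7: 1.0},
)
```

For the motivational example the sorted list places {④} ahead of {⑥, ⑦}, which agrees with the offline prioritization derived in Section 5.2.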
5.3.3 Online distribution
In the offline stage, the following are prepared: a set of CSs of ts, and
for each CS a list of RCs sorted by $\sum k_i^{1.5}$. For online distribution, several issues
have to be tackled.
First, the slack time has to be determined if it is distributed to a PCS. This
can be seen from the example in Fig. 5.4, which shows that task ⓐ in the PCS
is in general unable to receive the full slack. Actually, if one of the parent tasks
generates slack but another has yet to finish, the actual available slack time is
bounded by both the slack time of the slack-generating parent and the statically
scheduled (worst case) finish time of the unfinished direct parent.
Thus, τPCS = MIN(τs, d), where d is the minimum timing gap between the
PCS root and all its unfinished direct parents.
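The slack time of a PCS can be computed as in the following Python sketch. It assumes that the "timing gap" d is the difference between the statically scheduled start time of the PCS root and the worst-case finish time of each unfinished direct parent; this interpretation, the function name `pcs_slack_time`, and its parameters are assumptions for illustration.

```python
def pcs_slack_time(tau_s, root_static_start, unfinished_parent_wc_finish):
    """Slack time claimable by a PCS root: tau_PCS = MIN(tau_s, d).

    tau_s: slack time produced by the slack generator.
    root_static_start: statically scheduled start time of the PCS root.
    unfinished_parent_wc_finish: worst-case finish times of the root's
        direct parents that have not finished yet.
    """
    if not unfinished_parent_wc_finish:
        return tau_s  # no pending parent: the root can claim the full slack
    d = min(root_static_start - f for f in unfinished_parent_wc_finish)
    return max(0.0, min(tau_s, d))


# Motivational example, assuming tasks 1 and 2 both start at time 0:
# task 3 is scheduled to start at 60 us, task 1 may run until 50 us,
# and task 2 generated 30 us of slack.
tau_pcs = pcs_slack_time(30, 60, [50])
```

With these assumed start times, task ③ can claim 10 μs, matching the value stated in Section 5.1.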
Second, although the RC list has been sorted for the CS, at runtime, it is not
recommended to directly distribute the slack to the RC with the largest $\sum k_i^{1.5}$.
This is because the cycle budget $O_i - (o_i + \Delta o_i)$ at runtime is unpredictable, and
could be depleted for the RCs at the front of the sorted list.
To deal with the cycle depletion situation, we propose a searching heuristic on
the RC list based on the degree of depletion (DoD) of each RC.
Deﬁnition 5.5: In a RC, the DoD is deﬁned as the number of tasks whose cycle
budget is depleted at runtime.
Deﬁnition 5.6: The task ti has its cycle budget depleted if and only if Oi − (oi +
Δoi) < 0, where Δoi is calculated using (5.7) on each slack distribution instance.
With the same amount of slack, the RC with a smaller DoD is preferred. If the
RC with the smaller DoD also possesses the larger $\sum k_i^{1.5}$, the slack should be given
to it with certainty. Otherwise, a comparison of the actual ΔQoS is needed. To
decide which RC should gain the slack, we propose an efficient heuristic approach
as in Algorithm 5.3, which searches the list in descending $\sum k_i^{1.5}$ order for the RC
with the maximum ΔQoS. The search stops once it finds an RC with zero DoD,
since in descending $\sum k_i^{1.5}$ order, no subsequent RC can achieve a larger ΔQoS
than the zero-DoD one. In case there is no zero-DoD RC, the first RC with the
smallest DoD value, e.g. DoD = 1, is used to terminate the comparison,
and all RCs in front of it in the queue (i.e., with larger or equal $\sum k_i^{1.5}$) are compared
to determine the maximal ΔQoS RC.
Algorithm 5.3 generally searches only part of the list, which is more efficient than
the worst case of searching through all RCs. If there is no zero-DoD RC, the list
is searched thoroughly, with time complexity $O((\frac{N_T}{N_P})^{N_P})$.
Algorithm 5.3: Runtime_RC_Search(LRC, τs)
    for each RC in LRC, from the largest $\sum k_i^{1.5}$
        CALC(DoD) as in Def. 5.5 and Def. 5.6
        CALC(ΔQoS) as in (5.8)
        if DoD = 0, break
    if no RC has DoD = 0
        FIND_FIRST(MIN_DoD(RC))
        from LIST_HEAD(LRC) to MIN_DoD(RC)
            FIND_MAX(ΔQoS)
    else FIND_MAX(ΔQoS)
Theorem 5.2: The time complexity of searching through all RCs in a CS is
$O((\frac{N_T}{N_P})^{N_P})$, where $N_T$ is the total number of tasks in the DAG, and $N_P$ is the
number of processors and treated as a constant for the considered architecture.
Proof. The worst-case time complexity occurs when all RCs in a CS are searched,
and the RCs represent all possible combinations of parallel tasks. By parallel tasks
we assume that in the CS only one task is selected on each processor. Thus, assuming
that in the CS there are respectively $a_1, a_2, \ldots, a_{N_P}$ tasks on the $N_P$
processors, the number of possible combinations is at most
$\prod_{j \in [1, N_P]} a_j$, which limits the searching complexity of all RCs in a CS.
Moreover, we have $\sum_{j \in [1, N_P]} a_j \leq N_T$. According to the inequality of arithmetic
and geometric means, it holds that
\[ \prod_{j \in [1, N_P]} a_j \leq \left( \frac{1}{N_P} \sum_{j \in [1, N_P]} a_j \right)^{N_P} \leq \left( \frac{N_T}{N_P} \right)^{N_P}. \]
Third, having found the best RC of each CS, we have to solve the issue of
processor conflict, which is demonstrated in Fig. 5.3. For example, the conflicting
pair (ⓓ, ⓙ) belongs to different CSs of ⓐ. Due to the processor conflict, only one
task can be selected. Assuming the scheduler can retrieve the processor binding
information in O(1) time, the worst case for resolving the conflict is to
traverse through the DAG. The complexity is then $O(N_T)$.
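The runtime search over the sorted RC list (Algorithm 5.3) can be sketched in Python as follows. The sketch makes several simplifying assumptions that are not taken from the original work: all tasks in the CS claim the same slack time `tau`, the per-task request follows the closed-form allocation under the energy model $E = C_{eff} K^2 \Delta o^3 / \tau^2$, a depleted task is simply capped at its budget without redistributing the unused energy, and the RC list is non-empty. The function name `runtime_rc_search` is illustrative.

```python
def runtime_rc_search(rc_list, k, budget, tau, energy, c_eff=1e-9, K=1e-9):
    """Pick the slack receiver set from `rc_list` (sorted by descending sum k**1.5).

    k[t]: QoS gradient of task t; budget[t]: cycles task t can still accept,
    i.e. O_t - o_t. Returns the RC with the largest achievable QoS gain.
    """
    def evaluate(rc):
        weight = sum(tau * k[t] ** 1.5 for t in rc)
        scale = (energy / (c_eff * K ** 2 * weight)) ** (1 / 3)
        dod, gain = 0, 0.0
        for t in rc:
            want = tau * k[t] ** 0.5 * scale   # unconstrained request
            if want > budget[t]:
                dod += 1                        # cycle budget depleted
            gain += k[t] * min(want, budget[t])
        return dod, gain

    searched = []
    for rc in rc_list:
        dod, gain = evaluate(rc)
        searched.append((dod, gain, rc))
        if dod == 0:
            break  # later RCs cannot beat a zero-DoD RC
    else:
        # No zero-DoD RC: compare only up to the first RC with minimal DoD.
        min_dod = min(d for d, _, _ in searched)
        cut = next(i for i, (d, _, _) in enumerate(searched) if d == min_dod)
        searched = searched[:cut + 1]
    return max(searched, key=lambda item: item[1])[2]


rcs = [frozenset({4}), frozenset({6, 7})]
grad = {4: 1.7, 6: 1.0, 7: 1.0}
tight = runtime_rc_search(rcs, grad, {4: 1000, 6: 5000, 7: 5000}, 30e-6, 3e-8)
roomy = runtime_rc_search(rcs, grad, {4: 5000, 6: 5000, 7: 5000}, 30e-6, 3e-8)
```

When task ④ has only 1000 cycles of budget left, the search moves on and selects {⑥, ⑦}; with sufficient budget it stops at the head of the list and selects {④}.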
5.4 Results and Discussion
In this section we demonstrate the performance of our algorithm through simulation
studies. We synthesize the DAGs with the task graph generator TGFF [72]. The
deadline of the tasks is set to 900 time units and each task has an execution time
between 30 and 70 time units. For every task, we randomly assign a QoS function
with gradient k between 0 and 1.
We develop a virtual task execution platform to simulate the execution of
the synthesized tasks, monitor the completion status, and invoke the dynamic
scheduler to regulate task execution using runtime slack. A built-in scheduler
module is developed to implement our algorithm as well as other algorithms for
comparison. The slack time generated by each task is indicated by a slack factor
$SF = \frac{ActualExe.Time}{StaticExe.Time}$. The platform completes each task in advance by
$T_{Static} \times (1 - SF)$ to generate the slack time. For each task graph, we vary SF among 0.1,
0.5, and 0.9. The results are collected within the virtual platform. Moreover, to
examine the timing performance of our algorithm, we use the SESC [73] instruction
set simulator to profile the dynamic scheduler efficiency during each invocation.
We compare our work with two other algorithms: (1) A Modified LSSR
algorithm [47], in which tasks are dispatched at runtime and slack time is shared
by subsequent descendants. Note that here we use the slack time for optional cycle
allocation rather than energy minimization as in the original work. (2) A greedy
method adopted from [44], where the slack is assigned to the first task without















































































Fig. 5.5: QoS increase in percentage compared to static scheduled cycles, with varied slack factors
(SF): (a) SF = 0.1, (b) SF = 0.5, (c) SF = 0.9.
Fig. 5.5 shows the results of running the three algorithms on a 4-processor
system. For each algorithm, we measure the percentage increase in optional cycles
relative to the statically determined task cycles. All data in the figure are
normalized. Across the three slack factors, we observe that a larger runtime slack
generates more extra QoS. Moreover, for a given value of SF, our approach
outperforms the other two algorithms. This is because the greedy approach lacks
global consideration in receiver selection, while the MLSSR approach also selects
the receivers greedily and distributes the slack time with a focus on the deadline
requirement, ignoring optimized slack energy utilization. The average QoS gain
under the three SFs is 54.9% compared to
Fig. 5.6: QoS increase percentage vs. number of processors. Number of tasks = 60, SF = 0.6.
We also vary the number of processors to examine the scalability of our
algorithm. The results in Fig. 5.6 are obtained from a DAG containing 60 tasks,
with slack factor SF = 0.6. The curve for the modified LSSR-N is fairly
consistent with the results in [47]. As the number of processors increases, more
processors become redundant at runtime due to precedence relationships. Whenever
new tasks become ready, they are therefore assigned to idle processors without
shared slack, and less extra QoS is generated. This can be avoided by fixed
processor assignment, where any ready task obtains the slack from its preceding
tasks. Fig. 5.6 also shows that our algorithm and the Greedy approach tend to
converge as the number of processors increases, because of the increased
parallelism resulting from more processors. Nevertheless, our approach in general
gives better results than the Greedy approach.
Fig. 5.7: Algorithm efficiency comparison, our approach vs. MLSSR, measured as the number
of instructions.
To examine the execution overhead of our approach, we use the simulator
to profile the number of processor instructions executed by our scheduling
algorithm. We vary the number of tasks from 20 to 50 in steps of 2 and set the
number of processors to 8, 16, and 32 to demonstrate performance under different
scenarios. As shown in Fig. 5.7, the modified LSSR algorithm is advantageous
in terms of runtime efficiency, due to its simplicity in slack receiver selection
and cycle calculation. Our algorithm executes longer than the LSSR method,
consuming more processor instructions. However, the performance is still acceptable
for a runtime algorithm. Even on a single-issue pipelined processor, the
32-processor scheduling algorithm can be expected to finish within 2000 cycles,
which is still extremely small compared to a typical task at the million-cycle level.
Chapter 6
Scheduling Generic Models on
Multiprocessors with Realistic
Considerations
In this chapter, we extend our study of adaptive application scheduling to a
more advanced level. Firstly, instead of restricting ourselves to the imprecise
computation model, we explore the multiprocessor scheduling algorithm on a
generalized adaptive application model, which exhibits interesting
energy-performance behaviors as illustrated in the following sections. Secondly,
the algorithm design is exposed to realistic considerations: as discussed below, we
integrate leakage power and overhead management into our framework, which leads to
a fundamentally different approach compared to Chapter 5, and we include a
real-life JPEG2000 application for algorithm validation.
With the objective of achieving the maximum additional cycles from the runtime
slack, our algorithm tackles the following fundamental questions: (1) how to
determine the frequencies of a given set of slack receivers, so that slack time and
(dynamic + leakage) energy are fully utilized to generate the maximum extra cycles;
(2) how to select the best subset of receivers that provides the most cycles; (3)
how to improve the algorithm efficiency in the presence of slack variation caused
by inter-processor communication.
The dynamic scheduling algorithm takes the following steps. First, we derive
an online heuristic that performs a guided search for the largest increase in
cycles, by selectively adjusting the receiver frequencies based on the given slacks.
In addition, we select the best receiver candidates based on a graph decomposition
scheme, which can be performed offline to reduce runtime overhead. Moreover, we
propose a local scaling methodology to cope with the effect of slack inaccuracy
caused by transmission variation. We note that in our approach, the scaling
decision is made immediately after the slack is generated. However, before the
slack receiver actually receives it, the slack time can vary due to timing
inaccuracy of in-stream data transmission. The slack mismatch deteriorates the
performance of the guided-search algorithm. Thus, we extend our approach with a
local scaling phase based on the same formulation as the guided-search algorithm.
That is, after the transmission completes, the receiver executes a local scaling
process to make the best use of the local timing/energy resources and compensate
for the quality loss.
6.1 Motivational Example
In this section, we illustrate a slack time-energy transformation phenomenon for
generalized adaptive applications on multiprocessors. While the illustrative
example in Chapter 5 unveils the difficulties of moving from single-processor to
multiprocessor scheduling, the example below provides deeper insight into the
energy-performance properties of adaptive applications, which are less obvious when
considering the imprecise computation model specifically.
Table 6.1: List of frequencies and the corresponding energy-per-cycle
Freq. (MHz)   400   300   200   100
Ecyc (nJ)      15    12    10     5
According to Chapter 5, on a single-processor system, both slack time and
energy can be fully consumed simultaneously if the slack receiver executes the
increased cycles at the same frequency as the slack generator. A naive yet
suboptimal approach for multiprocessor systems is to distribute the slack time
and energy to the direct slack receiver on the same processor, such that system
timing and energy constraints are not violated. However, with frequency scaling
capability, other parallel slack receivers can take the slack time and scale down
their frequency to save energy, while the saved energy can be used to further increase
Fig. 6.1: Illustrative example showing DVS eﬀect to increase extra cycles.
Fig. 6.1 shows a set of three adaptive tasks a, b, and c. Task a has a WCET of
100μs and generates 40μs of slack time (Fig. 6.1(a)). We refer to the successor
tasks that qualify to receive the slack simply as slack receivers. The slack
receivers b and c execute for 60μs and 80μs, respectively, before slack
distribution. Tasks a and b run at 100MHz and c runs at 400MHz. The energy
consumption of each task can be calculated from Table 6.1, which lists the
available processor frequencies and the corresponding per-cycle energy consumption
Ecyc at each frequency. Since task a has a slack time of 40μs and runs at 100MHz,
its slack energy is 100MHz ∗ 40μs ∗ 5nJ = 20μJ. According to Chapter 5, b takes
the full slack energy without violating timing constraints, and generates 4000
extra cycles as shown in the grid area in Fig. 6.1(b). Task c also acquires 40μs
of slack time, but no slack energy is left. We observe that c can use the slack
time to scale its frequency down to 300MHz and save some energy for extra cycle
generation. The execution time of c after scaling down becomes 106.7μs, and the
energy saved is 400MHz ∗ 80μs ∗ (15nJ − 12nJ) = 96μJ. The saved energy is enough
for c to run until its deadline at 180μs, generating 3990 cycles as shown in the
dotted area in Fig. 6.1(b). In total, the slack energy is capable of generating
4000 extra cycles for b, while DVS is capable of generating another 3990 extra
cycles for c.
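The arithmetic of this example can be retraced with the short sketch below. It assumes, as our reading of the timing in Fig. 6.1, that c becomes ready when a finishes (at 60μs instead of 100μs) and that its deadline is 180μs; all variable names are illustrative. With unrounded values the time-limited gain of c is 4000 cycles; the 3990 cycles quoted above follow from rounding the scaled execution time to 106.7μs.

```python
# Per-cycle energy in nJ for each frequency in MHz (Table 6.1).
E_CYC_NJ = {400: 15, 300: 12, 200: 10, 100: 5}


def cycles(freq_mhz, time_us):
    """MHz multiplied by microseconds gives a cycle count."""
    return freq_mhz * time_us


slack_us = 40
# Slack energy left by task a (100 MHz, 40 us early): 4000 cycles * 5 nJ.
slack_energy_nj = cycles(100, slack_us) * E_CYC_NJ[100]
# Task b spends that energy at 100 MHz.
extra_b = slack_energy_nj // E_CYC_NJ[100]

# Task c: 80 us at 400 MHz, rescaled to 300 MHz.
c_cycles = cycles(400, 80)
saved_nj = c_cycles * (E_CYC_NJ[400] - E_CYC_NJ[300])
c_time_us = c_cycles / 300

# Assumed timing: c starts when a finishes and must end by 180 us.
c_start_us = 100 - slack_us
time_left_us = 180 - (c_start_us + c_time_us)
extra_c_by_time = time_left_us * 300
extra_c_by_energy = saved_nj / E_CYC_NJ[300]
extra_c = min(extra_c_by_time, extra_c_by_energy)
```

Here the extra cycles of c are limited by the deadline rather than by the saved energy, since the 96μJ saved would pay for 8000 cycles at 300MHz.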
The above example identifies the multiprocessor challenges that we address in
this thesis. First, slack energy is shared amongst slack receivers but slack time
is duplicated, which creates opportunities at runtime to obtain additional program
execution cycles. This results in a considerable gain in energy savings, especially
when the actual execution time is shorter. Second, DVS can be used to leverage the
available slack time and energy to obtain additional cycles. Thus, a methodology is
desired that fully utilizes the slack time and energy to maximize the number of
cycles, by adjusting the task frequencies whenever there is an imbalance between
the slack time and the slack energy.
6.2 Slack Distribution with Frequency Scaling
To make the scaling problem straightforward, we first assume a given slack receiver
group containing an arbitrary set of slack receivers, $D = \{t_0, ..., t_{n-1}\}$,
while the optimized receiver group selection procedure is presented in the next
section. For a task $t_i \in D$, we denote its execution cycles before slack
distribution as $c_i$, executed at frequency $f^i_k$ (see footnote 1). After the
slack distribution, the total cycles are denoted as $c_i + \Delta c_i$, executed at
frequency $f^i_{k'}$.
Prior to the invocation of the DSA, the slack time τs and energy Es are given as
input. Each ti receives its share of the slack resources, τs,i and Es,i, for extra
cycle generation. Since τs is duplicated in parallel, τs,i = τs for direct
successors, and τs,i < τs if ti is blocked by its other predecessors. τs,i is
unrelated to τs,j of any other tj ∈ D, but Es,i is related to Es,j since together
they constitute Es.
The maximum runtime quality gain, i.e., the maximum number of extra adaptive
cycles, is represented as the sum of all Δci in D. As shown in the illustrative
example, DVS can be used to derive more runtime cycles than the number allowed by
Es alone. The rest of this section shows that optimal DVS under the timing and
energy constraints is NP-hard, and presents our guided-search heuristic that
efficiently finds the largest possible number of extra cycles.
6.2.1 Optimization
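The scheduling gain maximization problem can be formulated as below, with (6.1)
as the objective and $\Delta c_i$, $\beta_{ij}$ as the decision variables to be optimized.
... k) + $\Delta E_{oh}$ (6.3)
$\sum_{j \in F_{p_i}} \beta_{ij} = 1, \quad \beta_{ij} \in \{1, 0\}, \quad \forall t_i \in D$ (6.4)
$c_i + \Delta c_i \le W_{q_{|Q_i|-1}}, \quad \forall t_i \in D$ (6.5)
1 For simplicity of description, we use $f^i_k$ (and the per-cycle energy at the same level $k$) for $t_i$.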
Timing and energy constraints are enforced by (6.2) and (6.3), respectively.
$\Delta t_{oh}$ and $\Delta E_{oh}$ are defined as the overhead differences due to
altered frequencies of $t_i$ and its preceding task on the same processor. We use
the worst-case values of $\Delta t_{oh}$ and $\Delta E_{oh}$ and treat them as
constants. Since the available frequency range is finite and discrete, we use a
boolean variable $\beta_{ij}$ as an indicator of whether frequency level $j$ is
used for $t_i$. Constraints (6.2) and (6.3) sum over all $j$ for each $t_i$, and
require deciding the optimal frequency level $j$ for $t_i$. Thus, the selection of
the frequency is transformed into the determination of the $\beta$ values. Since
$\beta_{ij}$ takes binary integer values, the above formulation is an integer
program. In addition, since the constraints are nonlinear (they contain products of
$\beta_{ij}$ and $\Delta c_i$), the above optimization is a 0/1 integer nonlinear
programming problem, which is NP-hard. This can be shown by relating the
formulation to the 0/1 programming problem, whose decision version is NP-complete
[70]. By assigning $\Delta c_i$ ($\forall t_i \in D$) a set of arbitrary fixed
values, the formulation immediately becomes a 0/1 programming formulation. A black
box that solves the above optimization problem by searching through the binary
values of $\beta_{ij}$ and the range of $\Delta c_i$ can therefore be directly
applied to the 0/1 programming problem, by searching through the binary values of
$\beta_{ij}$ with the values of $\Delta c_i$ fixed.
6.2.2 Guided-Search heuristic
Given the hardness of directly achieving the optimized frequency and cycle
allocation, we propose a guided-search heuristic that efficiently achieves
quality maximization. The heuristic is derived by analyzing the optimization
formulation discussed previously. For clarity of description, let us release the constraints
... k) + $\Delta E_{oh}$ (6.8)
where the variables of interest are $c_i$ and $f^i_j$. Assume we select a specific
$f^i_j$ for every $t_i$ in the above formulation, so that the corresponding
per-cycle energy is fixed. Then, the question becomes a linear programming problem
that derives the maximal $\sum \Delta c_i$ under the specific frequencies $f^i_j$.
However, we still need to decide the best-fit $f^i_j$ for each $i$. A non-ideal
$f^i_j$ fails to fully utilize $E_s$ and $\tau_{s,i}$, and can have either (6.7)
or (6.8) hold with equality, but not both.
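The two-level structure described here (a linear program once the frequencies are fixed, wrapped in a discrete choice of frequencies) can be sketched as follows. The constraint forms are assumptions, since they are only paraphrased in the text: each task must fit a per-task time budget, the total energy must stay within a shared budget, and each task has a cycle cap. Overheads are ignored. Under these assumptions the fixed-frequency problem is a fractional knapsack, so serving the cheapest per-cycle energy first is optimal. The exhaustive enumeration is only a reference baseline with exponential cost, not the guided-search heuristic; the task data reuse tasks b and c of Section 6.1 with hypothetical cycle caps.

```python
from itertools import product

# Frequency (MHz) -> per-cycle energy (nJ), from Table 6.1.
E_CYC = {400: 15, 300: 12, 200: 10, 100: 5}


def best_cycles_for(freqs, tasks, energy_budget):
    """Fixed frequencies: maximise the sum of extra cycles.

    tasks: list of (c_i, time_budget_us, cycle_cap) tuples.
    Returns the list of extra cycles, or None if the choice is infeasible.
    """
    caps, left = [], energy_budget
    for f, (c, t_budget, w_max) in zip(freqs, tasks):
        cap = min(f * t_budget, w_max) - c  # timing bound and cycle cap
        if cap < 0:
            return None
        caps.append(cap)
        left -= c * E_CYC[f]  # energy of the already planned cycles
    if left < 0:
        return None
    dc = [0.0] * len(tasks)
    # Cheapest per-cycle energy first: optimal for this fractional knapsack.
    for i in sorted(range(len(tasks)), key=lambda i: E_CYC[freqs[i]]):
        dc[i] = min(caps[i], left / E_CYC[freqs[i]])
        left -= dc[i] * E_CYC[freqs[i]]
    return dc


def exhaustive_search(tasks, energy_budget):
    """Try every frequency assignment (J^n of them) and keep the best."""
    best = (None, None, -1.0)
    for freqs in product(sorted(E_CYC), repeat=len(tasks)):
        dc = best_cycles_for(freqs, tasks, energy_budget)
        if dc is not None and sum(dc) > best[2]:
            best = (freqs, dc, sum(dc))
    return best


# Tasks b and c of Section 6.1; the caps 10000 and 40000 are hypothetical.
TASKS = [(6000, 100, 10000), (32000, 120, 40000)]
# Slack energy plus the energy originally planned for b and c (nJ).
BUDGET = 20000 + 6000 * 5 + 32000 * 15
best_freqs, best_dc, best_total = exhaustive_search(TASKS, BUDGET)
```

On this instance the enumeration selects 100MHz for b and 300MHz for c, the same assignment as in the motivational example. The number of assignments grows as J^n, which is the cost the guided search is designed to avoid.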
Situation 1: If (6.8) holds with equality, there is not enough energy to make use
of the slack time of $t_i$. Then scaling down $f^i_j$ reduces the corresponding
per-cycle energy, and hence increases $\Delta c_i$ according to the equalized (6.8)
with all other values held constant. According to (6.7), the increased
$\Delta c_i$ and the decreased $f^i_j$ use more slack time, as long as the deadline
is not exceeded.
Situation 2: If the branches in (6.7) hold with equality, there is not enough slack
time to fully consume the slack energy. Then scaling up $f^i_j$ costs more energy
in (6.8) while leaving extra slack time for $\Delta c_i$ to increase.
The above observation reveals the frequency scaling directions that increase
$\Delta c_i$ in both situations. Thus, $\sum \Delta c_i$ can be steadily increased
in a guided search process. We set the starting $\Delta c_i$ to 0, since our
approach only ever increases $\Delta c_i$. Initially, the highest frequency is used
for all $t_i$ in (6.7), such that the largest energy consumption is made in (6.8).
We start from the largest energy consumption because $\tau_s$ is duplicated to the
receivers but $E_s$ is not. Thus, (6.8) is more likely to become equalized, and
using the highest frequency enforces this. Note that the zero $\Delta c_i$ and the
highest $f^i_j$ leave all branches in (6.7) unequal.
As discussed above, it is likely that the l.h.s. of (6.8) is larger than the
r.h.s. after setting all $f^i_j$ to the highest level. Then we slow down the
$f^i_j$ in (6.7) to reduce the l.h.s. of (6.8). Note that in this process all
$\Delta c_i$ remain at their initial value, 0, to avoid increasing the l.h.s.
value. The $t_i$ are chosen in the order of increasing residual cycles
$W_{q_{|Q_i|-1}} - c_i$, such that tasks with larger residuals reserve the
opportunity to slow down from higher frequencies for cycle increase.
This process stops when the l.h.s. of (6.8) becomes smaller than the r.h.s.
Otherwise, the process terminates since no frequency can be scaled down further.
Then we increase $\Delta c_i$ in the l.h.s. of (6.8) to make situation 1 happen.
Tasks with small residuals are selected to increase $\Delta c_i$, because they have
less chance to fully exploit the benefit of frequency scaling due to their limited
residual cycles.
Under situation 1, the frequency of a $t_i$ in (6.7) is scaled down by one level,
while its $\Delta c_i$ is increased to maintain the equality of (6.8). The criteria
to choose $t_i$ depend on three aspects: the more residual cycles
$W_{q_{|Q_i|-1}} - c_i$ are available, the better; the per-cycle energy difference
between the current level $j$ and the next lower level $j^-$ needs to be small to
avoid a radical cycle increase given the residual cycle constraint; and the task
should have the largest laxity with respect to the timing constraint in (6.7).
Combining the factors yields a selection metric, weighted by $\Delta\tau_i$, for
the frequency scale-down target, where $\Delta\tau_i$ is the time left before the
timing constraint in (6.7) is violated. The selection process repeats if (6.8)
remains equalized.
It may happen that after several iterations, all branches in (6.7) are equalized
due to the increased $\Delta c_i$, while there is still unused energy quota in
(6.8). In this case, situation 2 occurs and we choose to increase $f^i_j$. Similar
criteria apply to $t_i$
This frequency scaling process terminates under three conditions:
• $c_i + \Delta c_i = W_{q_{|Q_i|-1}}, \forall t_i \in D$;
• when increasing $f^i_j$ is required, all $f^i_j = f^i_{J-1}$;
• when decreasing $f^i_j$ is required, all $f^i_j = f^i_0$.
Finally, if the l.h.s. of (6.8) is still smaller than the r.h.s. right after
initially setting all $f^i_j$ to the highest level and all $\Delta c_i$ to 0, we
directly increase the $\Delta c_i$ starting from the $t_i$ with the smallest
residual, until (6.7) or (6.8) is equalized and further frequency scaling applies.
The above discussion is summarized in Algorithm 6.1.
Algorithm 6.1: Guided Search($D$, $\tau_{s,i}$, $E_s$)
for each $t_i \in D$ /* Initialization */
    $f^i_j = f^i_{J-1}$
/* Pre-adjust before situations 1 & 2 */
if l.h.s.(6.8) < r.h.s.(6.8)
    Increase $\Delta c_i$ until (6.7) or (6.8) equalized
else if l.h.s.(6.8) ≥ r.h.s.(6.8)
    SCALE DOWN($t_i$) until l.h.s.(6.8) < r.h.s.(6.8)
/* Make situation 1 happen */
while l.h.s.(6.8) < r.h.s.(6.8)
    Increase $\Delta c_i$, choosing smallest $W_{q_{|Q_i|-1}} - c_i$
/* Situation 1 */
while l.h.s.(6.8) ≤ r.h.s.(6.8)
    Select $t_i$ as the scale-down target
    $f^i_j = f^i_{j^-}$
    Increase $\Delta c_i$ and maintain l.h.s.(6.7) < r.h.s.(6.7)
    if CHECK(Situation 2) == TRUE
        GOTO(Situation 2)
    if CHECK(Termination) == TRUE
        EXIT
/* Situation 2 */
while CHECK(Situation 2) == TRUE
    Select $t_i$ as the scale-up target
    $f^i_j = f^i_{j^+}$
    Increase $\Delta c_i$ and maintain l.h.s.(6.8) ≤ r.h.s.(6.8)
    if CHECK(Situation 1) == TRUE
        GOTO(Situation 1)
    if CHECK(Termination) == TRUE
        EXIT
The complexity of our approach is bounded by the number of frequency levels
J and the number of receivers n. Consider the scenario in which, starting from
$f_{J-1}$, all n tasks are scaled down to $f_0$ to reach situation 2, and then all
n tasks are scaled up to $f_{J-1}$ to deal with situation 2. This scenario cannot
repeat: if there were a second round of scaling down from the highest frequency, it
would already have been done in the first round. Hence, a loose upper bound on the
complexity is O(nJ).
6.3 Slack Receiver Selection
A straightforward receiver selection process can be greedy-based, i.e., choosing the
direct descendant tasks of the slack generator. This approach has two limitations.
First, the direct receivers may not fully utilize the slack time, as illustrated in
Fig. 6.2(a), where c may not fully collect the 40 slack time units due to the
unknown execution status of d. Second, the task graph can contain additional
parallel candidates for slack distribution beyond the direct descendant tasks, like
d in Fig. 6.2(b).
Note that in this work we distribute slack to exactly ONE task amongst
precedence-constrained tasks, e.g., in Fig. 6.2(b) slack is given to c, and to
either b or d but not both. This selection strategy includes as many slack
receivers as possible, and respects the duplication effect of slack time for
parallel distribution targets. Choosing multiple slack receivers on one
distribution path may block the selection of other parallel receivers that could
fully adopt τs, due to complex precedence relationships. As evidenced in Fig. 6.3,
choosing {a, b} returns a total of 110 units of effective slack time for
distribution, while removing the constraint of a and choosing the parallel tasks
{b, c, d} generates 150 units of effective slack time.
Our receiver selection process is designed to overcome the disadvantages of the
greedy approach. First, a logical graph decomposition is applied for each task
(slack generator), to identify which groups of receivers can and cannot adopt the
full slack. Second, a selection scheme based on graph coloring is applied to both
receiver groups, with the only difference being whether the amount of τs,i adopted
is full or partial. It is important to mention that these two steps are conducted
statically in the offline stage, given our assumption that the task-processor
binding P, as well as the precedence relationships of tasks, does not alter during
execution. At runtime, whenever slack is generated, a final receiver selection
decision is made based on the availability of the offline-selected receivers, and
the slack times, full or partial, are given to the respective receivers, followed
by invocation of the guided-search algorithm.
Fig. 6.2: (a) Task d prevents c from receiving the full slack. (b) b and d compete for the slack
Fig. 6.3: (a) Total slack time is 110 since a blocks c and d. (b) Total slack time gained is 150.
6.3.1 Graph decomposition
The graph decomposition details are identical to those described in Section
5.3.1. We repeat them here for the sake of completeness.
Graph decomposition is applied for each slack generator ts, based on the
analysis of the degree of the candidate slack receivers.
Definition 6.1: The degree d of a task is the number of its incident edges in the
SPAS Gt.
Definition 6.2: A full candidate set (FCS) of ts is defined as the set of slack
receivers ti that receive the full slack time of the slack generator,
namely τs,i = τs.
Definition 6.3: A partial candidate set (PCS) of ts is defined as the set of slack
receivers ti that may not adopt the full slack time due to precedence constraints,
namely τs,i ≤ τs.
An illustration of FCS and PCS is shown in Fig. 6.4(a), where tasks {c, d, g, h}
belong to the FCS of a, while tasks {i, j, l, m, n} belong to the PCS of a. A
natural but misleading observation is that a candidate receiver ti with
d(ti) ≥ 2 should be grouped into the PCS. However, this depends on whether the
slack generator is the sole slack source, i.e., whether there exist other
independent slack sources at the time of slack generation. For example, in
Fig. 6.4(b), task j should be grouped into the PCS of h, but into the FCS of a.
Thus, the graph decomposition process should be logically applied for each possible
slack generator, to identify and
Fig. 6.4: (a) Graph decomposition illustration for a. Note that the link between d and j is
omitted due to precedence redundancy; the same holds for e and m. (b) A task can belong to
the PCS or the FCS of different slack generators.
The graph decomposition algorithm is shown in Algorithm 6.2. Firstly, all
preceding tasks of ts are removed, since at runtime they should already have
completed. Then, a graph traversal is conducted. For each child ti of ts, we find
its root ancestor. If there is a root other than ts, the child belongs to the PCS,
and all descendant nodes of the child are peeled off from the graph and put into
the PCS. If ts is the only root, then the child belongs to the FCS.
Algorithm 6.2: Graph Decomp(Gt, ts)
G′t = RMOV PARENTS(Gt, ts)
FCS_ts = PCS_ts = ∅
TRAVERSE(G′t), for each ti of ts
    t′s = FIND ROOT(ti)
    if t′s ≠ ts
        PCS_ts = PCS_ts ∪ {ti} ∪ ALL CHILD(ti)
    else
        FCS_ts = FCS_ts ∪ {ti}
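The decomposition can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the thesis implementation: the task graph is given as a successor dictionary, every ancestor of the slack generator counts as completed, and the function name and the example graph are hypothetical. A separate ALL CHILD step is not needed in this form, because every descendant of a task with a foreign root inherits that root and therefore also lands in the PCS.

```python
def graph_decomp(succ, ts):
    """Split the descendants of slack generator ts into (FCS, PCS)."""
    nodes = set(succ) | {v for vs in succ.values() for v in vs}
    pred = {n: set() for n in nodes}
    for u, vs in succ.items():
        for v in vs:
            pred[v].add(u)

    def reach(start, nbrs):
        """All nodes reachable from start by following nbrs (start excluded)."""
        seen, stack = set(), [start]
        while stack:
            for v in nbrs.get(stack.pop(), ()):
                if v not in seen:
                    seen.add(v)
                    stack.append(v)
        return seen

    done = reach(ts, pred)  # ancestors of ts, assumed completed
    live_pred = {n: pred[n] - done for n in nodes - done}
    fcs, pcs = set(), set()
    for ti in reach(ts, succ):
        ancestors = reach(ti, live_pred) | {ti}
        roots = {a for a in ancestors if not live_pred[a]}
        # ts as the only root: full slack; any other root: partial slack.
        (fcs if roots == {ts} else pcs).add(ti)
    return fcs, pcs


# Hypothetical graph: z has completed, b is an independent slack source.
SUCC = {"z": ["a", "c"], "a": ["c", "d"], "b": ["d"], "c": ["e"], "d": ["f"]}
fcs_a, pcs_a = graph_decomp(SUCC, "a")
```

In this example c and e depend only on a and form its FCS, while d and f also depend on b and form its PCS.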
6.3.2 Receiver selection from FCS
Tasks in the FCS can claim the full slack time, but precedence relationships
introduce mutual exclusiveness among the tasks. The selection process has to admit
as many parallel receivers as possible, to create and utilize the maximal
duplicated slack time. Moreover, the cycle increase potential of the parallel
receivers should also be considered, to avoid including too many tasks with an
exhausted cycle budget ($c_i$ approaching $W_{q_{|Q_i|-1}}$).
To be more precise, rather than obtaining the largest independent/parallel
task set from the FCS, the receiver selection process finds a list of mutually
independent sets, so that at runtime the slack can be given to the set that has the
largest cycle increase potential, $\sum (W_{q_{|Q_i|-1}} - c_i)$.
Definition 6.4: A full slack receiving candidate (FC) is defined as a subset of
the FCS containing tasks that can simultaneously adopt the full slack. The
properties of FCs include:
• $FC_0 \cup FC_1 \cup ... \cup FC_{N-1} = FCS$
• $FC_i \cap FC_j = \emptyset, \forall i, j \in [0, N-1], i \ne j$
The problem of finding the mutually independent sets, namely the FCs, can be
naturally converted to the graph coloring problem, which tries to color the graph
with the minimum number of colors (hence the maximum number of vertices in one
color) such that any two connected vertices are colored differently. Analogous to
the coloring problem, for receiver selection we hope to find a minimum list of FCs
to ensure that the maximum number of independent tasks is included in each FC.
The transformation to the coloring problem is illustrated in Fig. 6.5 and detailed
in Algorithm 6.3. Given an FCS, since every pair of precedence-constrained tasks
is not parallel, we put the two tasks in different FCs by adding a link between
them, which ensures that linked tasks are never grouped. This is achieved by
applying graph coloring methodologies to the FCS-transformed graph. Note that
deciding whether a graph is k-colorable is NP-complete in general; however, as
mentioned before, this receiver selection process is conducted in the offline
stage, so efficiency may not be the key requirement. For example, k-colorability
can be decided using an inclusion-exclusion approach in $O(2^n n)$ time [71].
Algorithm 6.3: FCS To Coloring(FCS)
Convert to Undirected FCS
for each ti ∈ FCS
    TRAVERSE(FCS), for each t′i ∈ FCS in precedence relation with ti
        if NO LINK(ti, t′i)
            ADD LINK(ti, t′i)
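A possible rendering of this transformation, followed by a coloring step, is sketched below. It links every pair of precedence-constrained tasks (including indirect precedence) and then applies a simple greedy coloring. The greedy coloring is a stand-in chosen for brevity: it does not guarantee the minimum number of colors, unlike the exact inclusion-exclusion method of [71]. Function names and the example FCS are hypothetical.

```python
def conflict_graph(succ):
    """Link every pair of tasks related by (possibly indirect) precedence."""
    nodes = sorted(set(succ) | {v for vs in succ.values() for v in vs})
    conflict = {n: set() for n in nodes}
    for u in nodes:
        stack, seen = list(succ.get(u, ())), set()
        while stack:  # collect all descendants of u
            v = stack.pop()
            if v not in seen:
                seen.add(v)
                stack.extend(succ.get(v, ()))
        for v in seen:
            conflict[u].add(v)
            conflict[v].add(u)
    return conflict


def color_into_fcs(conflict):
    """Greedy coloring (highest degree first); each color class is one FC."""
    color = {}
    for n in sorted(conflict, key=lambda n: -len(conflict[n])):
        used = {color[m] for m in conflict[n] if m in color}
        color[n] = next(k for k in range(len(conflict)) if k not in used)
    groups = {}
    for n, k in color.items():
        groups.setdefault(k, set()).add(n)
    return [groups[k] for k in sorted(groups)]


# Hypothetical FCS: a precedes b and d, b precedes c, e is unconstrained.
FCS_SUCC = {"a": ["b", "d"], "b": ["c"], "c": [], "d": [], "e": []}
CONFLICT = conflict_graph(FCS_SUCC)
FCS_LIST = color_into_fcs(CONFLICT)
```

The chain a, b, c forces three colors here, so three FCs result; no FC contains two tasks that are related by precedence.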
6.3.3 Receiver selection from PCS
The characteristic of a task ti in the PCS is that its slack time τs,i can only be
determined at runtime. For example, in Fig. 6.6, e receives τs,e = 0 if e
starts following c, and receives τs,e = MIN(τs, tl) if e starts following b. This
complicated precedence relationship poses great limitations for receiver selection
in the PCS. First, there can be processor conflicts between tasks in the PCS and
the FCS, e.g. d and j in Fig. 6.4(b). In this case, we select the task in the FCS,
and the one in the PCS is not considered due to its inability to guarantee full
slack time adoption. Second, some tasks in the PCS are subject to external
precedence limitations. E.g., in Fig. 6.4(b), i receives partial slack time, but n
can adopt even less than that due to the unknown execution status of f. Determining
the slack of n at runtime is time-consuming (processor status update,
inter-processor traffic overhead, etc.) and brings little benefit. Thus, an
efficient way to select tasks in the PCS is to use the FCS of the PCS root. In
Fig. 6.4(a), the root is i, and its FCS is {j, m}, to which the partial slack can
be fully distributed. Moreover, inside this FCS, tasks conflicting with FCS_ts
should also be excluded. E.g., j is avoided due to its processor conflict with d.
Hence, the PCS is defined as
$PCS = ROOT(PCS) \cup FCS_{ROOT(PCS)} \setminus \{\text{Conflict Tasks}\}$.
Fig. 6.5: (a) The FCS that fully adopts τs. (b) The resulting graph after transformation: all
precedence tasks are connected. (c) A coloring example that minimally uses three colors to
Fig. 6.6: The slack received by PCS tasks depends on the online execution status. (a) τs,e = 0.
(b) τs,e = MIN(τs, tl).
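The processed PCS can be computed with plain set operations, as in the sketch below. It reads the definition as removing from the FCS of the root every task bound to a processor that is already used by a task in FCS_ts; the root itself is kept. The processor binding is hypothetical and only mirrors the conflict between d and j mentioned above.

```python
def processed_pcs(root, fcs_of_root, fcs_ts, binding):
    """ROOT(PCS) united with FCS_ROOT(PCS), minus processor-conflicting tasks."""
    taken = {binding[t] for t in fcs_ts}  # processors claimed by FCS_ts
    kept = {t for t in fcs_of_root if binding[t] not in taken}
    return {root} | kept


# Hypothetical binding in which j shares processor P1 with d.
BINDING = {"c": "P0", "d": "P1", "g": "P2", "h": "P3",
           "i": "P4", "j": "P1", "m": "P5"}
pcs = processed_pcs("i", {"j", "m"}, {"c", "d", "g", "h"}, BINDING)
```

With this binding the processed PCS consists of the root i and task m, while j is dropped.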
6.3.4 Runtime receiver selection
The above sections describe the static/offline receiver preparation. Thus far, we
have a list of FCs for full slack time adoption, as well as several processed
PCSs for partial slack adoption. The runtime receiver selection, as part of the
dynamic algorithm, then simply involves the following two parts: selecting one FC
to which τs is distributed, and deciding the τs,r that is given to the
corresponding PCS.
The criteria of FC selection, as mentioned previously, include both the cycle
increase potential $\sum (W_{q_{|Q_i|-1}} - c_i)$ and the degree of parallelism,
namely the number of simultaneous tasks. We address the latter by the conversion to
the graph coloring problem. For the cycle increase potential, a straightforward
solution is to sort the list of FCs and find the one with the largest
$\sum (W_{q_{|Q_i|-1}} - c_i)$. However, a greedy approach built upon the
available FCs can be applied to obtain a task set with a larger cycle increase
potential.
It can be observed that FC selection by graph coloring does not result in a
list of enumerated task combinations, but in one instance of the possible
combinations; otherwise, sorting the FC list online could take exponential time.
However, a combination of tasks other than an FC could possess more cycle increase
potential. Fig. 6.7(a) shows the FC selection based on the task coloring in
Fig. 6.5, where the number beside each task represents its runtime residual
cycles. As can be observed, FC2 in Fig. 6.7(a) contains the largest cycle
potential, but it seems better to include a or e.
Fig. 6.7: (a) An FC selection instance by applying graph coloring, with the runtime residual
cycles of the tasks. (b) The final FC2 optimized by applying Algorithm 6.4.
To obtain as much of the available cycle increase potential as possible without
reducing the degree of parallelism, we propose Algorithm 6.4. For each task $t_i$
in $FC_m$, the FC with the largest $\sum (W_{q_{|Q_i|-1}} - c_i)$, we search for a
non-$FC_m$ task $t'_i$ such that: (1) $t_i$ and $t'_i$ are connected in the
FCS-transformed graph; (2) $t'_i$ is not connected to any other task in $FC_m$. If
$t'_i$ has a greater cycle potential, then we remove $t_i$ from $FC_m$ and replace
it with $t'_i$. In this manner, the cycle potential increases greedily, and
isolating $t'_i$ from the other parallel tasks of $t_i$ avoids excluding more tasks
and retains the maximal number of parallel tasks in $FC_m$. Fig. 6.7(b) shows the
resulting FC after including e and excluding b. Note that a is not considered due
to its precedence constraints with other tasks in FC2.
Algorithm 6.4: Find FCm(FCSclr)
/* FCSclr is the input colored graph */
FCm = LARGEST POTENTIAL(FCSclr)
for each ti ∈ FCm
    for each t′i ∈ FCSclr \ FCm,
      s.t. CON(t′i, ti) ∧ (¬CON(t′i, tj), ∀tj ∈ FCm, tj ≠ ti)
        if $W_{q_{|Q_{i'}|-1}} - c_{i'} > W_{q_{|Q_i|-1}} - c_i$
            REPLACE ti with t′i
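A runnable sketch of this replacement step is given below. It assumes the FCs and the FCS-transformed (conflict) graph are available as Python sets and dictionaries, and that the residual cycles of each task are known at runtime; the names and all numbers are illustrative and are not taken from Fig. 6.7.

```python
def find_fcm(fcs_list, conflict, residual):
    """Pick the FC with the largest potential, then swap in better outsiders."""
    fcm = set(max(fcs_list, key=lambda fc: sum(residual[t] for t in fc)))
    outside = set(conflict) - fcm
    for ti in sorted(fcm):  # iterate over a snapshot of the initial members
        for tc in sorted(outside):
            # tc must conflict with ti and with no other member of FCm.
            if conflict[tc] & fcm == {ti} and residual[tc] > residual[ti]:
                fcm.remove(ti)
                fcm.add(tc)
                outside.remove(tc)
                outside.add(ti)
                break
    return fcm


# Hypothetical conflict graph, coloring result, and residual cycles.
CONFLICT = {"a": {"b", "c", "d"}, "b": {"a", "c"}, "c": {"a", "b"},
            "d": {"a"}, "e": set()}
FCS_LIST = [{"a", "e"}, {"b", "d"}, {"c"}]
RESIDUAL = {"a": 10, "b": 5, "c": 30, "d": 40, "e": 20}
fcm = find_fcm(FCS_LIST, CONFLICT, RESIDUAL)
```

In this example the initial choice {b, d} has a potential of 45; replacing b by c, which conflicts only with b inside the set, raises the potential to 70 while keeping two parallel receivers.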
Note that in this searching process, each task in the FCS can be traversed for
comparison, because if there exists a task that is not connected to any ti in
FCm, then it should already belong to FCm. A theoretical but very loose upper limit
on the time complexity is $O(n^2)$, where each task searches and examines all
other tasks in the FCS.
For PCS slack distribution, the slack time τs,r is first determined as
illustrated in Fig. 6.6. Then, τs,r is given to the FCS of the root task of the
PCS. To select the FC in this FCS, the procedure described above can be directly
applied.
Hence, referring to the formulation (6.6)-(6.8), the FCs determined in this
section from the FCS and the PCS constitute the tasks in D, and the respective τs
and τs,r are fitted into (6.7) as the slack time.
6.3.5 Implication to static scheduling
The actual slack time adopted by a PCS task depends on its latest-finishing
direct predecessor. To minimize this predecessor impact, the favored
static schedule should: (1) Choose a processor with the earliest available time, e.g.
in Fig. 6.8(b), ⑥ prefers P0 to P2. (2) Amongst the processors with identical earliest
available time, allocate the task to the processor that leads to the fewest precedence
constraints, e.g. in Fig. 6.8(b), assuming ④ and ⑤ complete at the same time, ⑧
prefers P4 to P3, since allocating ⑧ to P3 adds a precedence relationship between
⑧ and ④. Compared to Fig. 6.8(b), Fig. 6.8(a) thus provides a more favorable
static schedule for dynamic scheduling.

Fig. 6.8: (a) A static DAG mapping on a 6-processor system in favor of dynamic cycle generation.
(b) A static mapping creating PCS nodes, not preferred for dynamic scheduling.
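The two static-schedule preferences of Section 6.3.5 amount to a lexicographic choice, which
the following Python sketch illustrates. The data layout (`avail`, `last_task`, `preds`), the
function name, and the sample values are assumptions made for this example. A processor is
taken to add a precedence constraint when its last scheduled task is not already a
predecessor of the task being placed.

```python
def pick_processor(task, avail, last_task, preds):
    """Pick the processor with the earliest available time; break ties by
    preferring the one that adds no new precedence constraint.

    avail     -- dict: processor -> earliest available time
    last_task -- dict: processor -> last task scheduled on it (or missing)
    preds     -- dict: task -> set of its direct predecessors
    """
    def added_constraints(p):
        last = last_task.get(p)
        # No new edge if the processor is empty or its last task already precedes `task`.
        return 0 if last is None or last in preds[task] else 1

    return min(avail, key=lambda p: (avail[p], added_constraints(p)))


# Hypothetical values: tasks 4 and 5 finish at the same time on P3 and P4,
# and task 8 depends only on task 5.
avail = {"P3": 10, "P4": 10, "P5": 14}
last_task = {"P3": 4, "P4": 5, "P5": 7}
preds = {8: {5}}
print(pick_processor(8, avail, last_task, preds))
```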
6.4 Slack Distribution Considering Inter-Processor
Communication
In the previous sections, the slack time delivered to successors is assumed to be subject
to negligible transmission variance, i.e. the data transmission time fluctuates negligibly
and the slack time actually passed to receivers is not altered.
However, this assumption is less practical for applications with non-deterministic
communication volume, which leads to significant transmission time variation. Further,
for multiprocessor systems, variations caused by contention delays in the
transmission infrastructure can also be significant but hard to model precisely.
In those cases, the slack time obtained by successors would be either offset or
extended, and scheduling performance would deteriorate if the scheduler were unaware of
the variations.
In this section, we extend our algorithm to address the transmission overheads.
As described above, the variation can arise at both the data level and the
infrastructure level; however, the receiver can simply ignore the source of the
variation and adjust to whatever slack change is present. After the transmission
has completed, the receivers should re-adjust their frequencies to fulfill the new
timing (and associated energy) requirements. Note that each adjustment is local in
scope: upon completion of the transmission and before execution starts, a
receiver ti re-adjusts its frequency without considering the status of other receiver
nodes. This is because if ti waited until all transmissions completed to make a globally
optimized decision, the waiting time would already be wasted.
Hence, our proposed dynamic scheduling procedure, invoked after a task has
finished, should have two stages: 1. Algorithm 6.1 should be called immediately
before data transmission, to make a global decision on all candidate nodes; 2. upon
completion of each data stream transmission, a local re-adjustment process should
be invoked to compensate for the slack time fluctuation and refine the cycle increase.
The tasks called for local adjustment are all immediate descendants of the slack
generator, e.g. ⓒ and ⓓ of ⓐ in Fig. 6.4. The re-adjustment targets maximizing











$$(c_i + \Delta c_i + \Delta c'_i)\,\varepsilon'^{i}_{j} \le E_{tr,i} + (c_i + \Delta c_i)\,\varepsilon^{i}_{j}, \quad (6.11)$$
$$0 \le c_i + \Delta c_i + \Delta c'_i \le W_{q_{|Q_i|-1}}, \quad (6.12)$$
where $\Delta c_i$, $f^{i}_{j}$ are scheduling results from Algorithm 6.1, $\tau_{tr,i}$ is the slack variation
due to transmission time fluctuation, $E_{tr,i}$ is the corresponding energy change, and
$f'^{i}_{j}$ ($\varepsilon'^{i}_{j}$) is the frequency (per-cycle energy) after the local re-adjustment. Note that
$\Delta c'_i$ can be positive or negative depending on whether the transmission time is
deficient ($\tau_{tr,i} > 0$) or excessive ($\tau_{tr,i} < 0$).
The solution to this problem is straightforward when considering only one
receiver ti. Observe that since $\Delta c'_i$ is on the l.h.s. of (6.10) and (6.11), the
maximum $c_i + \Delta c_i + \Delta c'_i$ can be found by iterating $f'^{i}_{j}$ through all J frequency levels
and taking the maximum value that satisfies both the timing and energy constraints, i.e.,
$$\max_{j}\left(\min\left(f'^{i}_{j}\left(\frac{c_i + \Delta c_i}{f^{i}_{j}} + \tau_{tr,i}\right),\ \frac{E_{tr,i} + (c_i + \Delta c_i)\,\varepsilon^{i}_{j}}{\varepsilon'^{i}_{j}}\right)\right).$$
Assuming each comparison takes O(1), the complexity of this approach is O(J).
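The O(J) scan can be sketched in Python as below. This is a hedged illustration rather than
the thesis code: the function name and argument layout are hypothetical, the timing bound is
assumed to take the form (c + Δc + Δc′)/f′ ≤ (c + Δc)/f + τtr, the energy bound follows
(6.11), and the cycle cap follows (6.12). The frequency levels are taken from Table 6.2;
all other numbers are arbitrary.

```python
def max_cycles_exhaustive(c, dc, f, eps, tau_tr, e_tr, levels, w_max):
    """Scan every (frequency, energy-per-cycle) level and return the largest
    feasible total cycle count together with the index of the chosen level.

    c, dc   -- cycles assigned before the transmission (c_i and Delta c_i)
    f, eps  -- frequency and per-cycle energy chosen by the global step
    tau_tr  -- slack change caused by the transmission (may be negative)
    e_tr    -- corresponding energy change
    levels  -- list of (frequency, energy_per_cycle) pairs
    w_max   -- upper bound on the total cycles
    """
    time_budget = (c + dc) / f + tau_tr      # assumed timing constraint
    energy_budget = e_tr + (c + dc) * eps    # right-hand side of (6.11)
    best_total, best_idx = float("-inf"), None
    for idx, (f_new, eps_new) in enumerate(levels):
        total = min(f_new * time_budget, energy_budget / eps_new, w_max)
        if total > best_total:
            best_total, best_idx = total, idx
    return best_total, best_idx


# Levels from Table 6.2 (MHz, nJ per cycle), in ascending frequency order.
LEVELS = [(232.5, 1.2), (312.0, 1.25), (416.0, 1.37), (520.0, 1.43), (624.0, 1.48)]
total, idx = max_cycles_exhaustive(
    c=400_000, dc=16_000, f=416.0, eps=1.37,
    tau_tr=100.0, e_tr=50_000.0, levels=LEVELS, w_max=1_000_000)
print(idx, round(total))
```

With frequencies in MHz and times in microseconds, `f_new * time_budget` is directly a cycle count.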
In the above approach, the actual calculation time depends on the number
of available frequency levels. For an ideal DVS-enabled processor with very
fine-grained frequency levels, this procedure could be costly. In our view, the
efficiency of this local scaling process can be greatly improved. We observe that $f^{i}_{j}$
is already an optimal value after the step-by-step guided search. If there are further
slack changes, i.e. $\tau_{tr,i} \neq 0$ and $E_{tr,i} \neq 0$, this search procedure should simply continue.
The search algorithm is listed in Algorithm 6.5. The starting frequency
$f^{i}_{j}$ is optimal given $\tau_{tr,i} = 0$. With non-zero $\tau_{tr,i}$, the search can be continued in
two directions: scaling up and scaling down. In each scaling step, the algorithm obtains
the maximum possible cycle gain constrained by timing (lines 6, 16), energy (lines
7, 17), and cycle limits (lines 8, 18), then adjusts $f'^{i}_{j}$ if the cycle gain is larger than at the
previous frequency levels (lines 10-12).
It is crucial to note that searching in either direction is greedy: if the next
step leads to a smaller cycle increase, the search stops. That is, there is no
expectation that the cycle increase could become larger again at some frequency beyond the
current one. This can be validated by analyzing (6.10) and (6.11). We still assume
that as $f'^{i}_{j}$ increases (decreases), $\varepsilon'^{i}_{j}$ monotonically increases (decreases). If $f'^{i}_{j}$ is
increased in (6.10), $\Delta c'_i$ can then be increased. If at some point $\Delta c'_i$ cannot be
increased further, it must be constrained by (6.11) or (6.12). Then further scaling
up $f'^{i}_{j}$ would only reduce $\Delta c'_i$. The same analysis applies to scaling down.
Hence in Algorithm 6.5, the search stops when the cycles stop increasing (lines 13,
23). Since the search may not traverse all J levels, the average search
time can be greatly reduced.
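The greedy two-direction search described for Algorithm 6.5 can be sketched in Python as
follows. This is a minimal illustration under assumptions, not the listing itself: the
function name, argument layout, and the returned evaluation counter are hypothetical, and
the timing bound is assumed to be (c + Δc + Δc′)/f′ ≤ (c + Δc)/f + τtr. The levels are
those of Table 6.2; the remaining numbers are arbitrary.

```python
def local_scaling(c, dc, j, levels, tau_tr, e_tr, w_max):
    """Greedy local re-adjustment starting from level index j.

    Returns (extra_cycles, chosen_level_index, levels_evaluated), where
    extra_cycles is negative when the transmission consumed slack.
    """
    f, eps = levels[j]
    time_budget = (c + dc) / f + tau_tr      # assumed timing constraint
    energy_budget = e_tr + (c + dc) * eps    # right-hand side of (6.11)

    def total_cycles(k):
        f_k, eps_k = levels[k]
        # Tightest of the timing, energy, and cycle-limit bounds.
        return min(f_k * time_budget, energy_budget / eps_k, w_max)

    best_k, best, evaluated = j, total_cycles(j), 1
    for step in (+1, -1):                    # try scaling up, then down
        k = j + step
        while 0 <= k < len(levels):
            evaluated += 1
            cand = total_cycles(k)
            if cand <= best:                 # no improvement: stop this direction
                break
            best_k, best = k, cand
            k += step
    return best - (c + dc), best_k, evaluated


# Levels from Table 6.2 (MHz, nJ per cycle), in ascending frequency order.
LEVELS = [(232.5, 1.2), (312.0, 1.25), (416.0, 1.37), (520.0, 1.43), (624.0, 1.48)]
extra, level, evaluated = local_scaling(
    c=400_000, dc=16_000, j=2, levels=LEVELS,
    tau_tr=-200.0, e_tr=0.0, w_max=1_000_000)
print(level, evaluated, round(extra))
```

In this example the transmission took longer than expected, so the search scales up one level to limit the cycle loss and stops without visiting the lowest level.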
6.5 Results and Discussion
In this section we present the experimental results to justify our approach. The
performance of our dynamic algorithm shall be reﬂected in two aspects: scheduling
gain measured by the generated cycles from runtime slack, and runtime eﬃciency
measured by the actual algorithm execution time. Moreover, we design experiments
that run on a Network-on-Chip system to examine the capability of our algorithm
to cope with slack inaccuracy caused by drastically ﬂuctuating transmission time.
6.5.1 Setups
The simulation platform consists of an execution engine we developed and a set
of third-party tools attached as external facilities, as shown in Fig. 6.9. The execution
engine simulates a multi-processor system: it takes adaptive tasks and the
system configuration as input, performs static scheduling, invokes the dynamic scheduler and
traffic simulator at runtime, simulates task executions, and calculates and collects
intermediate results such as energy consumption and runtime cycle gain. We have
implemented several dynamic scheduling algorithms for comparison, the details of
which are described below. To measure the runtime efficiency of those algorithms,
we use the cycle-accurate SESC [73] ISA simulator to collect the runtime properties of
each algorithm. In order to examine the delay caused by the underlying transmission
infrastructure, we utilize the NIRGAM [74] NoC traffic simulator. At runtime,
whenever the engine needs to determine a specific stream delay required by the
local scheduler, it configures NIRGAM with a snapshot of the NoC status: the set of
from/to nodes of all concurrent streams and the length of each stream. Note that the
switching mechanism of NIRGAM is set to wormhole [75], so there is no need
to consider transmissions that emerge after the snapshot: once the connection has
been established, new streams should not interrupt the stream under consideration.
For comparison: (1) We design an altered version of our methodology that
adopts the same slack receiver selection process but evenly distributes energy to
receivers, and individually applies DVS to receivers for cycle generation. (2) To
evaluate the effects of receiver selection, we adopt the greedy dynamic algorithm
from [47], which efficiently decides the frequencies of immediate successors. (3) The
performance is also measured under different static schedule inputs. We implement
two list scheduling algorithms that differ in their processor selection criteria,
one with dynamic scheduling awareness as described in the previous section, and
the other with a randomly chosen processor.
The simulation engine is configured with a varied number of processors, and
each processor is capable of finite discrete frequency scaling. The frequency levels
and associated energy parameters are adopted from the Intel XScale PXA270 CPU
specifications. TABLE 6.2 shows the available CPU frequencies and per-cycle
energy consumption. The applications used in our experiments consist of a JPEG2000
decoder with adaptive features, as well as synthesized task graphs for more extensive
performance tests.

Table 6.2: Frequency and energy-per-cycle relationship of the experimental processor.
Freq. (MHz)   624    520    416    312    232.5
Ecyc (nJ)     1.48   1.43   1.37   1.25   1.2
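As a quick illustration of how the entries of Table 6.2 are used, the Python sketch below
computes the execution time and energy of a fixed cycle count at each level. The cycle
count is arbitrary, and the linear model (time = cycles / frequency, energy = cycles × Ecyc)
is an assumption of this example.

```python
# (frequency in MHz, energy per cycle in nJ), as listed in Table 6.2
LEVELS = [(624.0, 1.48), (520.0, 1.43), (416.0, 1.37), (312.0, 1.25), (232.5, 1.2)]


def time_and_energy(cycles, freq_mhz, ecyc_nj):
    """Return (execution time in microseconds, energy in millijoules)."""
    time_us = cycles / freq_mhz            # 1 MHz = 1 cycle per microsecond
    energy_mj = cycles * ecyc_nj * 1e-6    # nJ -> mJ
    return time_us, energy_mj


for freq, ecyc in LEVELS:
    t, e = time_and_energy(1_000_000, freq, ecyc)
    print(f"{freq:6.1f} MHz: {t:8.1f} us, {e:.2f} mJ")
```

Lower levels take longer but spend less energy per cycle, which is the trade-off the slack distribution exploits.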
6.5.2 Synthesized task simulation
We generate synthesized task graphs containing 100-200 tasks. The tasks are
synthesized by the task graph generator TGFF [72], in which the mean execution
time is set to 15 ms. We test the task graphs on our platform with 8, 32, and 64
processors respectively. The results are shown in Fig. 6.10, in which the average gained
execution cycles of the above three algorithms are normalized. We run
the three algorithms under different static schedules, where the dynamic-scheduling-aware
static schedule gains as much as 27.7% on the 32-processor platform
compared to the list scheduling algorithm based on earliest available processor
time. Under either static schedule, our algorithm achieves at least a 31.2% cycle
increase compared to the even-distribution energy approach, and at least 42.2%
compared to the greedy approach. Interestingly, the greedy approach performs
better with fewer processors, while the even-energy approach
performs better when the number of processors is large. This reflects
the fact that the greedy approach does not scale well with the number of
processors, whereas our algorithm achieves better performance by considering both
DVS scaling and slack receiver selection. Fig. 6.11 shows the execution cycles of
the guided-search algorithm. Compared to the extremely fast greedy approach, our
method takes roughly tens of times more cycles. However, compared
to the typical task size, the runtime overhead is still extremely small, near 0.3%.
6.5.3 The JPEG2000 decoder
We use the JPEG2000 decoder example to show the applicability of our methodology.
The JPEG2000 decoder is a well-known adaptive application that allows
reconstruction of images in a progressive manner. This is made possible by the
Discrete Wavelet Transform (DWT), which encodes the images into multiple
subbands so that the lower frequency subbands contain finer frequency resolution and
coarser time resolution. At the decoder, as more data are received, higher resolution
images can be decoded by making use of the subsequent higher frequency information.
In our experiment, we decode the “Ceveness” sample j2k ﬁle with the size of
14.4MB. The application is divided into three branches each of which decodes a
colour component. Each branch contains a DWT block which is able to decode
in multiple resolution levels. At the encoder side we have encoded six levels of
resolution. In the decoder, we statically set DWT to decode only level 1, leaving all
other ﬁve levels for online decision. We have proﬁled the execution cycles of DWT
to perform the six levels of transformation respectively, as shown in TABLE 6.3.
Table 6.3: DWT cycles to transform different levels of resolution.
No. of resolutions   1         2       3      4      5       6
Cycles (×10^6)       2.67E-3   0.381   1.86   9.74   49.17   134.45
Table 6.4: Performance from scheduling a JPEG2000 decoder.
Method     Avg. cyc. inc.   Impr. (%)   Exe. cyc.
Dyn.DVS    25.8E+3          141.1       1029
Egr.div    10.7E+3          0           897
Greedy     19.6E+3          83.1        200
Note that, since the additional cycles are discrete, we round the derived optimal
cycles to the next largest value as our result. Since there are three application
branches we assume a platform with three processors.
The results are shown in TABLE 6.4, in which the quality gain of our algorithm
is about 2.4 times that of the even-energy approach (a 141.1% improvement), and 31.6%
better than the greedy approach. The greedy approach outperforms the even-energy approach
since only 3 processors are used in this set of experiments, which eliminates the
greedy approach's weakness in receiver candidate selection. Note
that our algorithm runs fast in this example, mainly because of the small number
of processors.
6.5.4 Considering communication variation
We design experiments to evaluate the performance of Algorithm 6.5 under
both data- and infrastructure-level impacts. Synthesized tasks with settings similar
to those in Section 6.5.2 are used. To demonstrate slack variation caused by varying
transmission volumes, we generate a Gaussian-distributed transmission volume, so
the transmission time is Gaussian-distributed as well (assuming constant channel
bandwidth). The mean time is set to 30% of the triggering task execution time, with
varying variances.
Fig. 6.12 illustrates the results of local scaling under different standard deviations
(from 2 to 7) of the Gaussian distribution. The positive values represent
the total local cycle gain when the randomized time is below the mean (slack time
is further extended), calculated as the cycle difference between applying and not
applying local scaling. Correspondingly, negative values represent the total saved
cycles when the randomized time is above the mean (slack time is offset), calculated
in the same way. Five task sets are simulated, with 25, 38,
43, 59, and 69 tasks. As can be observed from the results, at deviation 2 there is nearly
no difference when applying local scaling, due to the very limited timing space and
frequency levels (see TABLE 6.2) for scaling. As the timing difference increases,
the effect of local scaling increases. At deviation 7, the saved cycles are 9394,
i.e. 16.9% of the cycle loss when not applying local scaling. Note that the value
is not related to the number of tasks, e.g. 69 tasks may not necessarily result in
larger differences than 25 tasks, in either relative or absolute terms. This is because distinct
task sets create specific precedence relationships, which may prevent slack intake
by successors.
To demonstrate how our approach copes with slack variation due to transmission
congestion on the NoC, for each data stream we simulate the transmission
time with various routing schemes. The transmission times differ under each routing
scheme, which results in distinct congestion patterns. The task set consists of
40 synthesized tasks. Fig. 6.13 illustrates the performance of Algorithm 6.5 under
four routing schemes, namely XY [87], Odd-Even [88], DyAD [89], and PROM
[90]. The points connected by the dashed line show the accumulated difference of
the transmission times with and without concurrent transmission contention under
one routing method. The corresponding column pairs indicate the cycle loss, in
case of contention, of the receiver tasks with (lower column) and without (higher
column) applying local scaling. Fig. 6.13(a)-(d) show the results under
network sizes of 3 × 4, 4 × 6, 5 × 6, and 6 × 6, respectively. The largest cycle savings
are 12.9% (DyAD), 8.94% (PROM), 9.0% (PROM), and 8.9% (PROM) respectively
in the above settings, compared to not applying the local scaling algorithm. Note
that the results in Fig. 6.13 do not serve as an evaluation of the routing schemes,
but rather test our algorithm under various delay conditions. We can observe
that the columns follow the lines closely, indicating that our algorithm reacts well
to the delays caused by various traffic congestion patterns.
We also compare the efficiency of Algorithm 6.5 to that of a for loop that examines
every frequency level in the range to find the maximal cycle gain (or minimal
cycle loss). Fig. 6.14 shows the execution cycles of the two approaches measured
by the ISA simulator. The results are collected as the averaged instruction counts
of executing ten sets of 20 synthesized tasks. We vary the number of frequency
levels available for scaling by doubling it from 5 to 160. It can be observed
that the for loop approach in general doubles the execution count as the loop size
doubles (the loop size is the number of frequencies to iterate over), while the local scaling
approach reduces the number of instructions executed, by 10.9% at 5 levels and up to 51.5%
at 160 levels relative to the for loop. The extra cycles generated by the two approaches
are almost identical.
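Algorithm 6.5: LOCAL SCALING($t_i$, $\tau_{tr,i}$, $E_{tr,i}$)
1: /* Calculate starting cyc. w/o freq. adj. */
2: $\Delta c'_i = \tau_{tr,i} f^{i}_{j}$;
3: $f'^{i}_{j} = f^{i}_{j}$;
4: /* Try scaling up */
5: while ($f'^{i}_{j} \le f_{J-1}$)
6:     $\Delta c_i^{a} = f'^{i}_{j}\left((c_i + \Delta c_i)/f^{i}_{j} + \tau_{tr,i}\right) - c_i - \Delta c_i$;
7:     $\Delta c_i^{b} = \frac{1}{\varepsilon'^{i}_{j}}\left(E_{tr,i} + (c_i + \Delta c_i)\varepsilon^{i}_{j}\right) - c_i - \Delta c_i$;
8:     $\Delta c_i^{\uparrow} = \mathrm{MIN}(\Delta c_i^{a}, \Delta c_i^{b}, W_{q_{|Q_i|-1}} - c_i - \Delta c_i)$;
9:     /* scale up frequency */
10:    if ($\Delta c_i^{\uparrow}$ is increasing over $\Delta c'_i$)
11:        UPDATE($\Delta c_i^{\uparrow}$);
12:        if ($f'^{i}_{j} \neq f_{J-1}$) $f'^{i}_{j} \nearrow$;
13:    else break;
14: /* Try scaling down */
15: while ($f'^{i}_{j} \ge f_0$)
16:    $\Delta c_i^{a} = f'^{i}_{j}\left((c_i + \Delta c_i)/f^{i}_{j} + \tau_{tr,i}\right) - c_i - \Delta c_i$;
17:    $\Delta c_i^{b} = \frac{1}{\varepsilon'^{i}_{j}}\left(E_{tr,i} + (c_i + \Delta c_i)\varepsilon^{i}_{j}\right) - c_i - \Delta c_i$;
18:    $\Delta c_i^{\downarrow} = \mathrm{MIN}(\Delta c_i^{a}, \Delta c_i^{b}, W_{q_{|Q_i|-1}} - c_i - \Delta c_i)$;
19:    /* scale down frequency */
20:    if ($\Delta c_i^{\downarrow}$ is increasing over $\Delta c'_i$)
21:        UPDATE($\Delta c_i^{\downarrow}$);
22:        if ($f'^{i}_{j} \neq f_0$) $f'^{i}_{j} \searrow$;
23:    else break;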




Fig. 6.10: Normalized cycle gain on (a) 8, (b) 32, (c) 64 processors using three methods.

Fig. 6.12: Cycle difference between w/ and w/o local scaling, vs. the Gaussian distribution
variances used in generating traffic time.

Fig. 6.13: Performance of Algorithm 6.5 under different NoC routing schemes, on various network
sizes. (a) 3 × 4, (b) 4 × 6, (c) 5 × 6, (d) 6 × 6.

CHAPTER 7. Supplement: A Communication-Aware Static Scheduling Approach

In this chapter, we present a static scheduling approach for multiprocessor systems,
to make our framework self-contained. As the dynamic algorithm in Chapter 6
deals with Network-on-Chip related architectural overhead, our static scheduling
specifically involves NoC intercommunication.
NoC-targeted static scheduling algorithms have a great impact on the execution
efficiency of the loaded applications. Both computation and communication elements
need to be carefully allocated to processing cores and on-chip network paths. More
importantly, without coordinated scheduling of both computation and communication
workloads, speculative mapping algorithms may not generate effective runtime
behavior, and may even lead to adverse results. On one hand, aggregated task
allocation and link sharing reduce the data transmission distance, hence the overall
end-to-end application execution time can be reduced. On the other hand, a
compact mapping can cause runtime traffic jams if not properly scheduled. This
may be quite severe for data-intensive applications such as multimedia processing.
The above issues can be tackled with a combined mapping and scheduling
algorithm which specifically determines the task/channel mapping as well as their
runtime behavior. In this chapter, we present an NoC-targeted static mapping and
scheduling algorithm, which efficiently routes and schedules the transmissions
in the process of task mapping with low computational complexity. Our
algorithm uses a routing-aware list-scheduling method to schedule each task onto the
best-fit processor, with the objective of minimizing the overall application
execution time. The "best-fit" processor selection is based on the earliest available
time of the processor, as well as the communication cost to that processor. As a
result, the task node mapping, as well as the transmission route and schedule, can
be determined.
7.1 Preliminaries
In this section we extend the system model of Chapter 3, to ease the elaboration
of the static algorithm.
The target NoC architecture $(N, L)$ is defined by the set $N$ of $N_N$ processors and the
set $L = \{l_{n_1 \to n_2} \mid n_1, n_2 \in N\}$ of all $N_L$ communication channels between two
neighboring processors. In the following, we use "channel" and "link"
interchangeably to refer to the elements of $L$. The processors are assumed to be
homogeneous in this study, with identical frequency and power characteristics, and the
channels are also homogeneous. Almost all contemporary NoC prototypes [76, 77] fit
this homogeneous model. In this work, we adopt wormhole switching [75] (with
no virtual channels) as the network flow control mechanism, yet other switching
methods can also be adopted with little impact on our algorithm.
Multiple applications, modeled by DAGs, are executed on the platform. We
define the total number of tasks of all applications to be $N_A$. We write $j \prec i$ to denote
that task j precedes task i. A task i has a fixed execution time
exet(i), and a ready time ready(i) for its availability before any scheduling. Note
that in the adaptive application context, the execution time should be defined to
fulfill the minimal quality requirement. E.g., for an IC-task, the execution
time could be defined using the mandatory execution volume. Each logical application
communication link $e_{j \to i}$ is associated with a weight $\omega_{j \to i}$, the data volume of the
transmission (number of bits) from j to i.
A schedule in our study is regarded as the process of task-processor mapping,
transmission-channel mapping, and task/transmission starting time decision.
Mapping
The task and communication mappings are represented as $(\phi, \Gamma)$.
• $\phi(i) \in N$ is the processor allocated for i. If i has m (m > 1) predecessors
$\{j_\alpha \mid j_\alpha \prec i, \alpha \in \{1, ..., m\}\}$, we set $\Phi(j) = \{\phi(j_\alpha) \mid \alpha \in \{1, ..., m\}\}$. A precedence
relationship between independent tasks should be added if they are scheduled onto
the same processor, with $\omega = 0$.
• $\Gamma(\Phi(j), \phi(i))$ is the set of transmission routes $\gamma(\Phi(j), \phi(i))$, which represent
different ways to transmit from $\Phi(j)$ to $\phi(i)$.
• Each $\gamma(\Phi(j), \phi(i))$ is a set of paths $p(\phi(j_\alpha), \phi(i))$ between each processor pair
assigned with $j_\alpha$ and i.
• Each $p(\phi(j_\alpha), \phi(i))$ consists of a series of links $l_{p \to q}$, where $p, q \in N$ are
neighboring processors on the path, terminating at the two ends $\phi(j_\alpha)$, $\phi(i)$.
The notions of route set $\Gamma(\Phi(j), \phi(i))$, route $\gamma(\Phi(j), \phi(i))$, path $p(\phi(j_\alpha), \phi(i))$, and
link l are hierarchical. An example is illustrated in Fig. 7.1.

Fig. 7.1: A transmission scenario to illustrate the hierarchical definitions. $\Gamma(\Phi(j), \phi(i)) =
\{\gamma_1(\Phi(j), \phi(i)), \gamma_2(\Phi(j), \phi(i))\}$ is the set of two routes for routing $\{j_1, j_2\}$ to i. The route
$\gamma_1(\Phi(j), \phi(i)) = \{p_{1,1}, p_{1,2}\}$ is one way of routing, using path $p_{1,1}$ to connect $\phi(j_1)$ and $\phi(i)$,
and path $p_{1,2}$ to connect $\phi(j_2)$ and $\phi(i)$. $\gamma_2(\Phi(j), \phi(i)) = \{p_{2,1}, p_{2,2}\}$ represents another
route. Each path $p_{x,y}$ from $\phi(j_\alpha)$, $\alpha \in \{1, 2\}$, to $\phi(i)$ consists of two links.
Timing
With task i scheduled on a processor $n \in N$, we define the starting time of i as
st(i), where st(i) ≥ ready(i), and the finishing time of i as ft(i). eat(n), defined as
the earliest available time of n, is updated by
$$eat(n) = \max(eat(n), st(i)) + exet(i) + tr(j, i). \quad (7.1)$$
How tr(j, i) is calculated is described in the following section. The earliest
available time of any link $l \in p(\phi(j_\alpha), \phi(i))$, namely $\Psi(l)$, is updated by
$\Psi(l) = \max(\Psi(l), ft(j_\alpha)) + tr(j_\alpha, i)$. After assignment, the starting time of i is the eat of
its assigned processor n, i.e., st(i) = eat(n). To evaluate a schedule, we examine
the makespan of each application, which is its end-to-end execution time, including
the task execution time and effective packet transmission time.
Taking application execution times and precedence relationships, as well as
platform properties, as input parameters, our algorithm is a combined process that
decides $\phi(i)$, st(i), and $\gamma(\Phi(j), \phi(i))$ for each task i, and aims at achieving the
shortest makespan of each application in polynomial time.
7.2 Algorithm Description
The algorithm is based on list scheduling to prioritize the tasks, and each task is
then assigned to the processor with the minimum effective data transmission time.
The processor selection is based on the earliest available time of the processor, as
well as the communication cost to that processor. With different ways of routing
and scheduling, the communication cost to a target processor varies considerably. Our
goal is to find the minimum transmission cost to each processor for a task to be
scheduled; the minimum cost among all the processors can then be identified
and the task is assigned to the minimal-cost processor.
We first prioritize the tasks to decide the sequence in which they are
assigned to processors. Task prioritization aims at providing a task selection
ordering which preserves the precedence relationships of the tasks, as well as arranging
for a ready task to start as early as possible in the priority queue. In our algorithm, we
set the priority of a task i as
$$prio(i) = \max\left(ready(i), \max_{\forall j_\alpha \prec i} prio(j_\alpha)\right) + exet(i). \quad (7.2)$$
The precedence relationship is preserved by $\max_{\forall j_\alpha \prec i}(prio(j_\alpha))$, i.e., task i must
have a larger prio value than any immediate preceding task $j_\alpha$, where $j_\alpha \prec i$. In
case of a tie, we give higher priority to the task with the longer execution time, since
shorter tasks can be allocated more flexibly. prio(i) is defined based on the prio values of its
immediate parents $j_\alpha$, thus ensuring the earliest availability of i. The tasks
are then sorted and selected in increasing order of their prio values.
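The prioritization of Eqn. (7.2) can be sketched in Python as below. The function name
`assign_prio`, the dictionary layout, and the sample values are assumptions for illustration
only. Every task is assumed to appear as a key of `preds`, ready times are assumed
non-negative, and the sketch relies on `graphlib` from the standard library (Python 3.9+).

```python
from graphlib import TopologicalSorter


def assign_prio(ready, exet, preds):
    """Compute prio(i) per Eqn. (7.2) and return (queue, prio).

    ready -- dict: task -> ready time
    exet  -- dict: task -> execution time
    preds -- dict: task -> list of immediate predecessors (every task is a key)
    """
    prio = {}
    # static_order() yields each task after all of its predecessors.
    for i in TopologicalSorter(preds).static_order():
        parent = max((prio[j] for j in preds[i]), default=0.0)
        prio[i] = max(ready[i], parent) + exet[i]
    # Increasing prio; on a tie the longer task goes first.
    queue = sorted(prio, key=lambda i: (prio[i], -exet[i]))
    return queue, prio


# Hypothetical three-task graph: c depends on a and b.
ready = {"a": 1, "b": 0, "c": 0}
exet = {"a": 2, "b": 3, "c": 1}
preds = {"a": [], "b": [], "c": ["a", "b"]}
queue, prio = assign_prio(ready, exet, preds)
print(queue, prio)
```

Here a and b tie at prio 3, so the longer task b is queued first, and c follows with prio 4.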
For each ready task, we assign it to the best-fit processor whose eat (defined
in Eqn. (7.1)) is the smallest. tr(j, i) is the effective transmission
time defined below. Most list scheduling algorithms choose the processor
with the minimum eat for task assignment. In our algorithm, we additionally consider
the routing efficiency together with the eat of each processor as the criteria. When assigning
a task i, a processor with the minimum eat may not be close to where the tasks
$j_\alpha$ ($\forall j_\alpha \prec i$) are assigned, in which case the longer route effectively increases the makespan. We
model the routing efficiency of i assigned to each candidate processor $\phi'(i)$ by its
Effective Transmission Time (ETT) from all its predecessors (denoted $\Phi(j)$) to
$\phi'(i)$, namely $ETT(\Phi(j), \phi'(i))$. $ETT(\Phi(j), \phi'(i))$ is defined as the minimum
ETT among all routes (refer to Section 7.1 for the route definition) from $\Phi(j)$ to $\phi'(i)$,
i.e.,
$$ETT(\Phi(j), \phi'(i)) = \min_{\forall \gamma \in \Gamma(\Phi(j), \phi'(i))} ETT_\gamma(\Phi(j), \phi'(i)). \quad (7.3)$$
$ETT_\gamma(\Phi(j), \phi'(i))$ is defined as the maximum transmission
time among all paths $p_\alpha$ within a route $\gamma$. Each $ETT_\gamma(\Phi(j), \phi'(i))$ and the resulting
$ETT(\Phi(j), \phi'(i))$ can be calculated as follows:
• Assume the network is idle, that is, no channel is currently transferring any
data. Also assume task i has only one predecessor $j \prec i$. Then $ETT(\Phi(j), \phi'(i)) =
ETT_\gamma(\Phi(j), \phi'(i)) = \omega_{j \to i} \times t_e \times |\phi(j), \phi'(i)|$ (for simplicity, we ignore the
first-bit propagation time, which is tiny compared to the total transmission time),
where $t_e$ is the time to transmit one bit across a link (assuming a data link
width of 1), and $|\phi(j), \phi'(i)|$ is the shortest distance between the two processors
$\phi(j)$ and $\phi'(i)$. In a tiled NoC architecture, it is the
Manhattan distance. This is the simplest case.
• Continuing with the above case, assume an occupied link l, $l \in p \in \gamma$
(see Section 7.1 for the definitions), is currently transferring another flow of data
with remaining communication time $\tau$. Then $ETT_\gamma(\Phi(j), \phi'(i)) = \omega_{j \to i} \times
t_e \times |\phi(j), \phi'(i)| + \tau$, i.e., the transmission is delayed by $\tau$. This conforms to
the behavior of a wormhole router. To find the minimum $ETT_\gamma(\Phi(j), \phi'(i))$
among all routes $\Gamma(\Phi(j), \phi'(i))$, Dijkstra's algorithm can be applied. The weight
$w_l$ of each link l is defined as the remaining transmission time upon the start
of the $(\Phi(j), \phi'(i))$ transmission, i.e., $w_l = \max(\Psi(l) - ft(j), 0)$. $\Psi(l)$ is the
earliest available link time defined in Section 7.1. An important implication
of Dijkstra's algorithm is that it automatically determines the path when
calculating the minimal transmission time.
• The above two scenarios define the solutions if i has only one input transmission.
Modifications are needed if i has m (m > 1) input transmissions (due
to multiple $j_\alpha$, $\alpha \in \{1, ..., m\}$ and $j_\alpha \prec i$). First sort all $ft(j_\alpha)$ in increasing
order (since an earlier finished $j_\alpha$ implies an earlier started data transmission, and
considering earlier communication first minimizes the channel
impact on later transmissions), and then find $ETT_\alpha$ starting from the
earliest finished $j_\alpha$ as described above. We define $ETT_\alpha$ as the ETT of path
$\alpha$ in route $\gamma$, each path $\alpha$ corresponding to a $j_\alpha$, and all paths $\alpha$ making
up the route $\gamma$. Then update the link weights and find $ETT_{\alpha+1}$. Note that
path $\alpha$ can overlap in certain links with the determined paths $\{1, ..., \alpha - 1\}$,
but $\{ETT_1, ..., ETT_{\alpha-1}\}$ do not have to be updated, since the transmission in
$ETT_\alpha$ happens after them. Then $ETT_\gamma(\Phi(j), \phi'(i)) = \max_{\forall \alpha \in \gamma}(ETT_\alpha)$,
and $ETT(\Phi(j), \phi'(i)) = ETT_\gamma(\Phi(j), \phi'(i))$ since each $ETT_\alpha$ already represents the
shortest path.
• Choose $n = \phi'(i)$ such that $\max(eat(n), st(i)) + ETT(\Phi(j), n)$ is the smallest.
Assign i to n, and update eat(n) as in Eqn. (7.1).
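The single-predecessor ETT search with Dijkstra's algorithm can be sketched in Python as
follows. This is a hedged illustration, not the thesis implementation: the function name,
the adjacency-list layout, and the sample mesh are hypothetical. The cost of traversing a
link is assumed to be the per-hop transmission time (data volume × te) plus the waiting
weight max(Ψ(l) − ft(j), 0), which reduces to the two single-predecessor cases above when
at most one link on the path is occupied. The destination is assumed to be reachable.

```python
import heapq


def ett_single(neighbors, src, dst, volume_bits, t_e, psi, ft_j):
    """Return (ETT, path) from processor src to dst for one predecessor.

    neighbors   -- dict: processor -> list of neighboring processors
    volume_bits -- data volume of the transmission
    t_e         -- time to transmit one bit across one link
    psi         -- dict: (u, v) -> earliest available time of link u->v
    ft_j        -- finishing time of the predecessor task
    """
    per_link = volume_bits * t_e
    dist = {src: 0.0}
    prev = {}
    heap = [(0.0, src)]
    while heap:
        d, u = heapq.heappop(heap)
        if u == dst:
            break
        if d > dist[u]:
            continue  # stale heap entry
        for v in neighbors[u]:
            wait = max(psi.get((u, v), 0.0) - ft_j, 0.0)  # link weight w_l
            nd = d + per_link + wait
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                prev[v] = u
                heapq.heappush(heap, (nd, v))
    # Rebuild the chosen path from the predecessor map.
    path = [dst]
    while path[-1] != src:
        path.append(prev[path[-1]])
    return dist[dst], path[::-1]


# Hypothetical 2x2 mesh: 0-1 on top, 2-3 below; link 0->1 is busy until t = 12.
neighbors = {0: [1, 2], 1: [0, 3], 2: [0, 3], 3: [1, 2]}
psi = {(0, 1): 12.0}
ett, path = ett_single(neighbors, 0, 3, volume_bits=10, t_e=1.0, psi=psi, ft_j=5.0)
print(ett, path)
```

The search avoids the busy link and returns the route through processor 2, so the path is obtained together with the transmission time.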
Note that in our algorithm, the calculation of $ETT(\Phi(j), \phi'(i))$ not only
determines the route (by Dijkstra's algorithm) but also schedules the transmission
by deciding its starting time. This properly regulates the runtime traffic to
avoid an enlarged makespan due to aggregated channel assignments. The complete
primary assignment algorithm is shown in Algorithm 7.1, with details explained
above. The first two scenarios are not listed separately since they are special cases of
the last one. To estimate the complexity of Algorithm 7.1: all $N_A$ tasks are considered,
and each task i is compared by being assigned to all $N_N$ processors. Each
comparison goes through the m predecessors of i, and for each $j_\alpha$ Dijkstra's algorithm
is applied. Given the complexity of Dijkstra's algorithm, $O(|N_L| \log(N_N))$ when
implemented with a priority queue [94], a loose upper bound is $O(|N_A|^2 |N_N| |N_L| \log(N_N))$,
considering that every task can have up to $N_A$ predecessors. On average, however, the
complexity is $O(|N_A| |N_N| |N_L| \log(N_N))$.

Algorithm 7.1: Primary Assignment(G), G = input Apps
1: /* push prioritized tasks to queue Q */
2: Q = ASSIGN PRIO(G)
3: for (each task $i \in Q$)
4:     for (each $\phi'(i) \in N$)
5:         /* sort $j_\alpha$ by increasing $ft(j_\alpha)$ */
6:         SORT $j_\alpha \prec i$, s.t. $ft(j_\alpha) \ge ft(j_{\alpha'})$,
7:             $\forall \alpha \ge \alpha'$ and $\alpha, \alpha' \in [1, m]$
8:         for (each $j_\alpha$, from $j_1$)
9:             /* calculate $ETT_\alpha$ */
10:            $ETT_\alpha \leftarrow$ DIJKSTRA($\phi(j_\alpha)$, $\phi'(i)$)
11:            UPDATE LINK WEIGHT
12:        $ETT_\gamma(\Phi(j), \phi'(i)) = \max_{\forall \alpha \in \gamma}(ETT_\alpha)$
13:        $ETT(\Phi(j), \phi'(i)) = ETT_\gamma(\Phi(j), \phi'(i))$
14:    /* find n with smallest summation */
15:    $eat(n) + ETT(\Phi(j), n)$
16:        $= \min_{\forall \phi'(i) \in N}(eat(\phi'(i)) + ETT(\Phi(j), \phi'(i)))$
17:    $\phi(i) = n$
18:    UPDATE $eat(n)$, $\Psi(l)$, $st(i)$
7.3 Results and Discussion
In this section we present our experimental results, derived from simulations of a
combination of three real-life applications. Our algorithm is compared with two
other NoC-targeted mapping algorithms, and the results are illustrated and analyzed
in detail.
The three applications we use are a JPEG encoder, an Electrocardiography (ECG)
processing core, and an AES Encrypter. The JPEG encoder compresses a 40×30 raw
RGB file to JFIF format output, using baseline DCT with no down-sampling
(4:4:4). It consists of sub-tasks such as FDCT, Quantization, and Huffman
coding. The ECG core processes a 32-point sample with a period of 200 μs and
consists of sub-blocks including FFT, InvFFT, and Filter. The Encrypter operates
on blocks of 4×4 bytes with a period of 30 μs and consists of sub-blocks
including AddRoundKey, MixColumn, and ShiftRow. We choose these three
implementations because their data transmission time is relatively heavy and
comparable with the task execution time. The facts about the applications are
summarized in Table 7.1. Note that, for experimental purposes, we manually set
the period of JPEG to 400 μs.

Table 7.1: Facts about the applications. The critical path is the longest
execution path in the task graph, excluding transmission delay. The level of
parallelism is the maximum level of parallel execution.

App.      Critical path   Period    No. of nodes   Level of parallelism
Encrypt   5,680 ns        30 μs     6              4
ECG       157.92 μs       200 μs    6              2
JPEG      358.96 μs       400 μs    7              3
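As a quick reading aid for Table 7.1, the following sketch computes how much of each period is consumed by the critical path alone. The numbers are copied from the table (with the Encrypter critical path converted from ns to μs); the names headroom_us and load_ratio are hypothetical, and the calculation deliberately ignores transmission delay and contention, which are exactly what the experiments measure.

```python
# Critical path and period per application, in microseconds (Table 7.1).
APPS = {
    "Encrypt": (5.68, 30.0),     # 5,680 ns = 5.68 us
    "ECG": (157.92, 200.0),
    "JPEG": (358.96, 400.0),
}


def headroom_us(app):
    """Time left in one period after the critical path has executed."""
    critical_path, period = APPS[app]
    return period - critical_path


def load_ratio(app):
    """Fraction of the period taken by the critical path."""
    critical_path, period = APPS[app]
    return critical_path / period


for name in APPS:
    print(f"{name}: headroom {headroom_us(name):.2f} us, "
          f"ratio {load_ratio(name):.3f}")
```

Under these figures the Encrypter uses under a fifth of its period on the critical path, while ECG and JPEG use roughly 79% and 90% respectively, so the latter two leave comparatively little room for transmission and contention delays.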
To run the applications, we use the multiprocessor simulator Simics [95] to set
up a Serengeti processor cluster with UltraSparc-III CPUs running at 300 MHz.
Simics is used as a multiprocessor simulator without out-of-order modeling
details. Specifically, it supports multi-threaded implementations of the
applications. Each sub-task in the above applications is rewritten as a single
thread under the POSIX thread (pthread) environment. The thread-processor
affinity is set using the processor_bind() function under the Solaris OS. Since
the number of processors in Simics is configurable, we vary the number of CPUs
among 6, 9, 12, and 16 in order to examine how the algorithms perform as the
number of CPUs scales. The processors are connected in a mesh network
implemented by the Princeton interconnection model Garnet [96]. Each memory
request from the Simics-simulated CPU is then handled by the memory model, and
data transmission is realized as L2 cache data access over the Garnet network.
We modify the router with the wormhole feature, set the link frequency identical
to the processor frequency, and disable virtual channels to avoid complexity in
our experiment.

Fig. 7.2: Simulation results of the average makespan of the three applications
under the three algorithms.
The algorithms we choose for comparison are two NoC-specific mapping
algorithms, namely the compiler-based mapping (abbr. CBM) [79] and the
“minimum-path” bandwidth-constrained mapping (abbr. BCM) [80] algorithms. Both
CBM and BCM are communication-centric, i.e., tasks with high communication
demands are placed in proximity. The major difference between them is that CBM
selects the earliest available processor for mapping, while BCM selects the
most topologically central processor for the initial mapping. We combine the
threads of the three applications and use the algorithms to schedule the
threads onto the processors. Since CBM and BCM do not specify the scheduling
mechanism when two threads are assigned to the same processor, we set the
threads to identical priority and use the round robin scheme to resolve
conflicts.
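To illustrate what equal-priority round robin implies when a short thread shares a processor with a long one, here is a toy sketch. It assumes that each turn runs one job to completion (an assumption about the conflict-resolution scheme, not a detail stated in the text), the durations are made-up illustrative values rather than measurements, and the name turn_taking_finish_times is hypothetical.

```python
from collections import deque


def turn_taking_finish_times(threads):
    """Finish times of jobs when threads take turns on one processor.

    threads maps a thread name to the durations of its successive jobs.
    Threads are served in a fixed cyclic order and each turn runs one job
    to completion (assumed, non-preemptive turns).
    """
    queue = deque(
        (name, deque(jobs)) for name, jobs in threads.items() if jobs
    )
    clock = 0.0
    finish = {name: [] for name in threads}
    while queue:
        name, jobs = queue.popleft()
        clock += jobs.popleft()          # run one job of this thread
        finish[name].append(clock)
        if jobs:                         # thread rejoins the back of the queue
            queue.append((name, jobs))
    return finish


# Illustrative durations only: one long thread and one short thread.
result = turn_taking_finish_times({"long": [100.0, 100.0], "short": [2.0, 2.0]})
print(result)
```

In this toy case each 2-unit job of the short thread completes only after a 100-unit job of the long thread, so its end-to-end time is dominated by waiting rather than by its own execution.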
Fig. 7.2 illustrates the average end-to-end time (makespan) of each application
in the above combined execution under each of the three algorithms. The
makespan of an application is defined as (total execution time) / (number of
executions). The results are obtained from four mesh configurations, namely
2×3, 3×3, 3×4, and 4×4, and raise several interesting points. First of all, all
the plots show a trend of decreasing makespan with an increasing number of
processors. This is understandable since contention is less severe when
resources are plentiful. In particular, on a 4×4 mesh structure, both our
algorithm and CBM obtain results quite near to the critical paths of the
applications. That means transmission time takes up most of the difference
between the makespan and the critical path, and delay due to task contention is
almost entirely avoided by these two algorithms. On the other hand, BCM results
in almost flat curves across different numbers of CPUs in the case of the
Encrypter application (and similarly for the other two). Because there is no
specific rule for independent tasks in BCM, tasks are assigned on a
per-application basis. The most communication-heavy tasks of the three
applications are placed at the center of the mesh. This introduces a large
amount of processor contention among the three tasks. Hence, under round robin,
a task in the lightweight Encrypter has to wait for the huge JPEG task to
finish on the same processor, resulting in an extremely long makespan for the
Encrypter. ECG shows a similar result under BCM. We also note that our
algorithm yields a shorter makespan than the other two algorithms as the number
of processors is reduced, especially in the case of the 3×3 mesh. This reflects
the fact that our algorithm does not always bind a task to one specific
processor, but maps each task instance according to the instantaneous state of
the overall system execution. This results in better performance since both BCM
and CBM implement a fixed assignment, where the contention among tasks from
different applications degrades the performance as the mesh size decreases. As
the mesh size shrinks further (2×3), our algorithm still delivers better
performance but is limited by the scarce resources. The makespans of the
lighter-weight Encrypter and ECG tend to converge to that of the heavyweight
JPEG application.

Fig. 7.3: Simulation results of the average transmission time on a 3×3 mesh
using the three algorithms on the three applications.
We also measure the average transmission time spent per execution of each
application on a 3×3 mesh. The transmission time reflects the combined effect
of transmission distance and runtime network queueing resulting from the three
algorithms. The results are shown in Fig. 7.3: our algorithm achieves at least
38.3% less transmission time (in the case of JPEG) on a resource-scarce 3×3
mesh structure. Note that the JPEG application experiences similar transmission
delays under the three algorithms, mainly due to its long execution time, so it
is less affected by delays from other applications (e.g. the “tiny” Encrypter).
Meanwhile, an Encrypter task can be waiting for an extremely long JPEG task to
finish in a round robin scheme, hence the extremely long transmission time for
the Encrypter shown in Fig. 7.3.
Chapter 8
Conclusions and Future Work
In this thesis, we have systematically presented a dynamic scheduling framework
for adaptive applications on embedded systems, addressing the contemporary
scheduling challenges of workload flexibility, multiprocessing, leakage power,
and platform-induced overheads. Moreover, a NoC-based static scheduling
approach is presented to make our work complete. We describe our methodology in
a logical progression from simple assumptions to more realistic factors in the
process of problem definition, formulation, and solution. To be more specific,
we first present a single-processor imprecise-computation algorithm and
theoretically prove the optimal way of distributing slack to achieve maximal
dynamic quality. The single-processor work is then extended to the
multiprocessor scenario, where the single-processor theorem becomes invalid due
to slack time duplication. In light of that fact, we explore the theoretical
formulation that optimally utilizes slack time and energy for QoS maximization,
and find that the QoS slope is the sole factor that determines the amount of
slack allocated. Having tackled imprecise-computation scheduling on multiple
processors, we take our research to a more challenging level by incorporating
realistic factors into the problem formulation, such as generalized adaptive
application representation, leakage power, and platform-introduced overheads.
Finally, we devise a static scheduling approach that takes network-on-chip
platform communication into account in the overall timing decision. By deciding
the starting time and the processor mapping of the applications, the static
algorithm may serve as the starting point of dynamic approaches, while our
dynamic approaches are able to adopt any static scheduling results.
In Chapter 4, we propose a novel low-complexity single-processor dynamic
scheduling algorithm, named gradient curve shifting (GCS), for tasks modeled by
imprecise computation. Starting from the single-processor scenario, we describe
our framework, in which applications have real-time requirements and need to
strike a trade-off between available energy and QoS demands. We have shown that
our GCS algorithm is able to decide the best allocation of slack cycles and
operating voltages to optional tasks, while its complexity remains low compared
to other dynamic scheduling solutions.
The multiprocessor extension, described in Chapter 5, targets dependent tasks
rather than relying on the simplified assumption of independent tasks. Unlike
most dynamic scheduling algorithms, which are rule-of-thumb based, the
algorithm optimally calculates the optional cycle increase and dedicates the
slack to each task based on global inspection. Simulation results reveal that
our approach outperforms contemporary methods with small execution overhead.
An immediate extension to the work reported here would be to include the
voltage transition delays in the model. One way to model them is to include an
additive parameter that captures all possible voltage transition delays that
exist during the scheduling process. Capturing such delays depends on the
underlying platform, and hence one may adopt an empirical approach to measuring
the voltage transition delays and reflecting them in the model. Also, since
this is an overhead, one way to compensate is to increase the computation
volume of the task node. A voltage transition can take several thousand cycles,
hence the granularity of a task can be chosen to be significantly larger than
the transition overhead. Apart from the overheads caused by voltage
transitions, which have been extensively studied in the literature, we tackle
another type of overhead caused by platforms, which has rarely been studied.
In Chapter 6, we propose a novel heuristic for multiprocessor dynamic
scheduling with generalized adaptive applications, incorporating the leakage
power model and making use of runtime slack to enhance the execution quality
under timing and energy constraints beyond what is statically scheduled. Our
methodology is composed of a heuristic guided-search algorithm that efficiently
decides the maximized cycle increase on a given set of receiver candidates, as
well as a dedicated receiver candidate selection method that boosts the
performance of the guided-search algorithm. Moreover, we improve the
practicability of the algorithm by extending the framework to consider the
quality degradation brought by inter-processor communications, and propose a
local scaling approach that complements the performance of the guided-search
algorithm. We use both synthesized and JPEG2000 applications to validate our
work, and also test the performance of the local scaling approach under
Gaussian-distributed transmission time variation, as well as under various NoC
routing schemes. Results show that the guided-search algorithm, aided by slack
receiver selection, achieves at least 25% cycle gain improvement, and local
scaling contributes as much as 16.9% more cycle gain compared with not applying
local complementary methods.
Our current framework models application adaptiveness with cycle scalability.
However, other adaptive application models, such as multi-version tasks and
imprecise computation, can also be incorporated into our framework to extend
its practicality. In addition, the current framework can be further improved in
terms of the receiver selection process. More efficient heuristic approaches
could be studied, relying on a more detailed study of graph analysis
techniques.
For the static scheduling approach described in Chapter 7, we propose an
NoC-targeted algorithm which determines transmission routing and scheduling in
the process of task mapping and scheduling. We use three real-life applications
to evaluate our algorithm, and the results are significantly better than those
of contemporary NoC-targeted mapping algorithms. However, our work achieves
predictable performance gain only under the prerequisite that the communication
exhibits regular access patterns (so that the data transmission time per
instance does not vary greatly). In the future, we plan to extend the
scheduling algorithm to deal with irregular access patterns and cache
influences.
Bibliography
[1] Jane W. S. Liu et al., “Imprecise Computations,” Proc. of IEEE, vol. 82(1),
pp. 83-94, 1994.
[2] H. Schwarz, D. Marpe, and T. Wiegand, “Overview of the Scalable Video Cod-
ing Extension of the H.264/AVC Standard,” IEEE Trans. on Circuits Syst.
Video Techn., vol. 17(9), pp. 1103-1120, Sept. 2007.
[3] T. Acharya and P. S. Tsai, JPEG2000 Standard for Image Compression: Con-
cepts, Algorithms and VLSI Architectures, Wiley 2004.
[4] T. F. Abdelzaher, E. M. Atkins, and K. G. Shin, “QoS Negotiation in Real-
Time Systems and Its Application to Automated Flight Control,” IEEE Trans.
Computers, vol. 49(11), pp. 1170-1183, 2000.
[5] George Cybenko, “Dynamic Load Balancing for Distributed Memory Multipro-
cessors,” J. Parallel Distrib. Comput., vol. 7(2), pp. 279-301, 1989.
[6] M. H. Willebeek-LeMair and A. P. Reeves, “Strategies for dynamic load bal-
ancing on highly parallel computers,” IEEE Trans. Parallel Distrib. Syst., vol.
4(9), pp. 979-993, 1993.
[7] K. Kennedy et al., “Toward a Framework for Preparing and Executing Adaptive
Grid Programs,” Proc. IPDPS’02, pp. 171-175, Aug. 2002.
[8] X. Chen and A. M. K. Cheng, “An imprecise algorithm for real-time compressed
image and video transmission,” Int’l Conf. Compt. Comm. and Networks (IC-
CCN’97), pp. 390-397, Sept. 1997.
[9] B. Smith and R. Oswald, “Meeting Real-Time Traffic Flow Forecasting
Requirements with Imprecise Computations,” Computer-Aided Civil and
Infrastructure Engineering, vol. 18(3), pp. 201-213, May 2003.
[10] M. Amirijoo, J. Hansson, and S. H. Son, “Speciﬁcation and Management
of QoS in Real-Time Databases Supporting Imprecise Computations,” IEEE
Trans. Computers, vol. 55(3), pp. 304-319, 2006.
[11] E. K. P. Chong and W. Zhao, “Task Scheduling for Imprecise Computer Sys-
tems with User Controlled Optimization,” Proc. Int’l Conf. on Computers and
Information, May 1989.
[12] K. B. Kenny and K.-J. Lin, “Structuring large real-time systems with perfor-
mance polymorphism,” Proc. IEEE Real-Time Systems Symposium, pp. 238-
246, Dec. 1990.
[13] H. Zou and F. Jahanian, “A Real-Time Primary-Backup Replication Service,”
IEEE Trans. Parallel Distrib. Syst., vol. 10(6), pp. 533-548, 1999.
[14] G. Buttazzo, G. Lipari, M. Caccamo, and L. Abeni, “Elastic scheduling for
ﬂexible workload management,” IEEE Trans. Computers, vol. 51(3), pp. 289-
302, Mar. 2002.
[15] M. Hamdaoui and P. Ramanathan, “A dynamic priority assignment technique
for streams with (m,k)-ﬁrm deadlines,” IEEE Trans. Computers, vol. 44(12),
pp. 1443-1451, Dec. 1995.
[16] P. Ramanathan, “Graceful Degradation in Real-Time Control Applications
Using (m, k)-Firm Guarantee,” Proc. IEEE 27th Int’l Symp. Fault-Tolerant
Computing (FTCS), pp. 132-141, June 1997.
[17] C. L. Liu and J. W. Layland, “Scheduling Algorithms for Multiprogramming
in a Hard Real-Time Environment,” Journal of the ACM, vol. 20(1), pp. 46-61,
Jan. 1973.
[18] J. Y. T. Leung and J. Whitehead, “On the complexity of ﬁxed-priority schedul-
ing of periodic, real-time tasks,” Performance Evaluation, vol. 2(4), pp. 237-250,
Dec. 1982.
[19] A. K. Mok, Fundamental Design Problems of Distributed Systems for the Hard
Real-Time Environment, Ph.D. Thesis, Massachusetts Institute of Technology,
1983.
[20] R. Rajkumar, Synchronization in Real-Time Systems: A Priority Inheritance
Approach, Kluwer Academic, 1991.
[21] L. Sha, R. Rajkumar, and J. P. Lehoczky, “Priority Inheritance Protocols: An
Approach to Real-Time Synchronisation,” IEEE Trans. Computers, vol. 39(9),
pp. 1175-1185, 1990.
[22] T. P. Baker, “Stack-Based Scheduling of Real-Time Processes,” Real-Time
Systems, vol. 3(1), pp. 67-100, Mar. 1991.
[23] F. Cottet et al., Scheduling in Real-Time Systems, Wiley, ISBN: 0-470-84766-
2, 2002.
[24] M. G. Harbour, M. H. Klein, and J. P. Lehoczky, “Fixed Priority Scheduling
of Periodic Tasks with Varying Execution Priority,” Proc. IEEE Real-Time
Systems Symposium, pp. 116-128, Dec. 1991.
[25] S. K. Dhall and C. L. Liu, “On a Real-Time Scheduling Problem,” Oper.
Research, vol. 26(1), pp. 127-140, 1978.
[26] M. Garey and D. Johnson, “Two-Processor Scheduling with Start-Times and
Deadlines,” SIAM J. Comput., vol. 6(3), pp. 416-426, 1977.
[27] M. L. Dertouzos and A. K. Mok, “Multiprocessor Online Scheduling of Hard
Real-Time Tasks,” IEEE Trans. Software Engineering, vol. 15(12), pp. 1497-
1505, 1989.
[28] T. L. Adam, K. M. Chandy, and J. R. Dickson, “A Comparison of List Sched-
ules for Parallel Processing Systems,” Comm. ACM, vol. 17(12), pp. 685-690,
Dec. 1974.
[29] T. Yang and A. Gerasoulis, “List Scheduling with and without Communication
Delays,” Parallel Computing, vol. 19(12), pp. 1321-1344, Sept. 1993.
[30] C. V. Ramamoorthy, K. M. Chandy, and M. J. Gonzalez, “Optimal Scheduling
Strategies in a Multiprocessor System,” IEEE Trans. Computers, vol. 21(2), pp.
137-146, Feb. 1972.
[31] J.-J. Hwang et al., “Scheduling precedence graphs in systems with interpro-
cessor communication times,” SIAM J. Comput., vol. 18(2), pp. 244-257, Apr.
1989.
[32] Y.-K. Kwok and I. Ahmad, “Dynamic Critical-Path Scheduling: An Effective
Technique for Allocating Task Graphs to Multiprocessors,” IEEE Trans. Parallel
Distrib. Syst., vol. 7(5), pp. 506-521, May 1996.
[33] H. Topcuoglu, S. Hariri and M.-Y. Wu, “Performance-Eﬀective and Low-
Complexity Task Scheduling for Heterogeneous Computing,” IEEE Trans. Par-
allel Distrib. Syst., vol. 13(3), pp. 260-274, Mar. 2002.
[34] T. Yang and A. Gerasoulis, “DSC: Scheduling Parallel Tasks on an Unbounded




[35] M. Srivastava, A. Chandrakasan, and R. Brodersen, “Predictive System Shut-
down and Other Architectural Techniques for Energy Eﬃcient Programmable
Computation,” IEEE Trans. VLSI Syst., vol. 4(1), pp. 42-55, Mar. 1996.
[36] C.-H. Hwang and A. Wu, “A Predictive System Shutdown Method for En-
ergy Saving of Event-driven Computation,” IEEE Int’l. Conf. Computer-Aided
Design (ICCAD), pp. 28-32, Nov. 1997.
[37] T. D. Burd, T. Pering, A. Stratakos, and R. Brodersen, “A Dynamic Voltage
Scaled Microprocessor System,” IEEE J. Solid-State Circuits, vol. 35(11), pp.
1571-1580, 2000.
[38] T. D. Burd and R. W. Brodersen, “Energy Eﬃcient CMOS Microprocessor
Design,” Proc. Hawaii Int’l. Conf. Syst. Sci., pp. 288-297, Jan. 1995.
[39] R. Ernst and W. Ye, “Embedded Program Timing Analysis based on Path
Clustering and Architecture Classiﬁcation,” IEEE Int’l Conf. Computer-Aided
Design (ICCAD), pp. 598-604, 1997.
[40] F. Yao, A. Demers, and S. Shenker. “A Scheduling Model for Reduced CPU
Energy,” Proc. IEEE Symposium on Foundations of Computer Science, pp.
374-382, Oct. 1995.
[41] F. Gruian, “System-Level Design Methods for Low-energy Architectures Con-
taining Variable Voltage Processors,” Proc. 1st Int’l Workshop on PACS, pp.
1-12, Nov. 2000.
[42] Y. Zhang, X. Hu, and D. Z. Chen, “Task Scheduling and voltage selection for
energy minimization,” Proc. Design Automation Conference, pp. 183-188, June
2002.
[43] L. Goh, B. Veeravalli, and S. Viswanathan, “Design of Fast and Eﬃcient
Energy-aware Gradient-Based Scheduling Algorithms for Heterogeneous Em-
bedded Multiprocessor Systems,” IEEE Trans. Parallel Distrib. Syst. (TPDS),
vol. 20(1), pp. 1-12, Jan. 2009.
[44] R. Mishra, N. Rastogi, D. Zhu, D. Mosse, and R. Melhem, “Energy Aware
Scheduling for Distributed Real-Time Systems,” Proc. Int’l Parallel and Dis-
tributed Processing Symposium (IPDPS’03), 2003.
[45] M. T. Schmitz, and B. M. Al-Hashimi, “Considering Power Variations of DVS
Processing Elements for Energy Minimisation in Distributed Systems,” Proc.
Int’l Symp. Syst. Synthesis, pp. 250-255, 2001.
[46] D. Mossé, H. Aydin, B. Childers, and R. Melhem, “Compiler-Assisted Dynamic
Power-Aware Scheduling for Real-Time Applications,” Workshop on Compiler and OS
for Low Power, Philadelphia, Oct. 2000.
[47] D. Zhu, R. Melhem, and B. Childers, “Scheduling with Dynamic Volt-
age/Speed Adjustment Using Slack Reclamation in Multi-Processor Real-Time
Systems,” IEEE Trans. Parallel Distrib. Syst., vol. 14(7), pp. 686-700, 2003.
[48] J. Luo and N. K. Jha, “Power-Conscious Joint Scheduling of Periodic Task
Graphs and Aperiodic Tasks in Distributed Real-time Embedded Systems,”
IEEE Int’l Conf. Computer-Aided Design (ICCAD), pp. 357-364, Nov. 2000.
[49] D. Shin, J. Kim, and S. Lee, “Intra-Task Voltage Scheduling for Low-Energy
Hard Real-Time Applications,” IEEE Design and Test of Computers, vol. 18(2),
pp. 20-30, 2001.
[50] J. Seo, T. Kim, and N. D. Dutt, “Optimal Integration of Inter-Task and Intra-
Task Dynamic Voltage Scaling Techniques for Hard Real-Time Applications,”
Int’l Conf. Computer-Aided Design (ICCAD), pp. 450-455, 2005.
[51] D. Bergstrom, M. Hattendorf, J. Hicks, J. Jopling, J. Maiz, S. Pae, C. Prasad,
J. Wiedemer, “45nm Transistor Reliability,” Intel Technology J., vol. 12(2),
June 2008.
[52] M. Pedram, “Leakage Power Modeling and Minimization,” Tutorial, IC-
CAD’04, 2004.
[53] S. M. Martin, K. Flautner, T. Mudge, and D. Blaauw, “Combined Dynamic
Voltage Scaling and Adaptive Body Biasing for Low Power Microprocessors under
Dynamic Work Loads,” Int’l Conf. Computer-Aided Design (ICCAD), pp. 721-725,
2002.
[54] W. Zhang et al., “Exploiting VLIW Schedule Slacks for Dynamic and Leakage
Energy Reduction,” IEEE/ACM Int’l Symp. Microarchitecture (MICRO’01),
pp. 102-113, 2001.
[55] S. Irani, S. Shukla, and R. Gupta. “Algorithms for Power Savings,” Proc.
ACM-SIAM Symp. Discrete Algorithms, pp. 37-46, 2003.
[56] J.-J. Chen and T.-W. Kuo, “Procrastination determination for periodic real-




[57] J.-J. Chen, H.-R. Hsu, and T.-W. Kuo, “Leakage-Aware Energy-Eﬃcient
Scheduling of Real-Time Tasks in Multiprocessor Systems,” IEEE Real-time
and Embedded Technology and Applications Symposium (RTAS), pp. 408-417,
2006.
[58] A. Andrei, P. Eles, and Z. Peng, “Energy Optimization of Multiprocessor
Systems on Chip by Voltage Selection,” IEEE Trans. VLSI Syst., vol. 15(3),
pp. 262-275, 2007.
[59] C. Xian, Y.-H. Lu, and Z. Li, “Dynamic Voltage Scaling for Multitasking
Real-Time Systems With Uncertain Execution Time,” IEEE Trans. on CAD
of Integrated Circuits and Systems (TCAD), vol. 27(8), pp. 1467-1478, 2008.
[60] W.-K. Shih, J. W. S. Liu, and J.-Y. Chung, “Fast Algorithms for Scheduling
Imprecise Computations,” Proc. Real-Time Systems Symposium (RTSS), pp.
12-19, 1989.
[61] J. Y. Chung, J. W. S. Liu, and K. J. Lin, “Scheduling Periodic Jobs that Allow
Imprecise Results,” IEEE Trans. Computers, vol. 39(9), pp. 1156-1174, Sept.
1990.
[62] W.-K. Shih, J. W. S. Liu, and J.-Y. Chung. “Algorithms for Scheduling Im-
precise Computations with Timing Constraints,” SIAM Journal of Computing,
1991.
[63] J. Hu and R. Marculescu, “Energy-Aware Communication and Task Scheduling
for Network-on-Chip Architectures under Real-Time Constraints,” Design,
Automation and Testing in Europe (DATE), pp. 234-239, 2004.
[64] G. Varatkar and R. Marculescu, “Communication-Aware Task Scheduling and
Voltage Selection for Total Systems Energy Minimization,” IEEE Int’l Conf.
on Computer-Aided Design (ICCAD), pp. 510-517, 2003.
[65] P. Eles, A. Doboli, P. Pop, and Z. Peng, “Scheduling with bus access
optimization for distributed embedded systems,” IEEE Trans. VLSI Syst.,
vol. 8(5), pp. 472-491, 2000.
[66] J. Y. Chung, J. W. S. Liu, and K. J. Lin, “Scheduling Periodic Jobs that Allow
Imprecise Results,” IEEE Trans. Computers, vol. 39(9), pp. 1156-1174, 1990.
[67] L. A. Cortés, P. Eles, and Z. Peng, “Quasi-Static Assignment of Voltages and
Optional Cycles in Imprecise-Computation Systems with Energy Considerations,”
IEEE Trans. VLSI, vol. 14(10), pp. 1117-1129, 2006.
[68] H. Aydin, R. Melhem, D. Mosse, and P. Mejia-Alvarez, “Optimal Reward-Based
Scheduling for Periodic Real-Time Tasks,” IEEE Trans. Computers, vol. 50(2),
pp. 111-130, Feb. 2001.
[69] C. Rusu, R. Melhem, and D. Mosse, “Maximizing Rewards for Real-Time Ap-
plications with Energy Constraints,” ACM Transactions on Embedded Comput-
ing Systems (TECS), vol. 2(4), pp. 537-559, Nov. 2003.
[70] R. M. Karp, R. E. Miller, J. W. Thatcher, “Reducibility Among Combinatorial
Problems,” The Journal of Symbolic Logic, vol. 40(4), pp.618-619, 1975.
[71] A. Björklund, T. Husfeldt, and M. Koivisto, “Set partitioning via
inclusion-exclusion,” SIAM J. on Computing, vol. 39(2), pp. 546-563, 2009.
[72] R. P. Dick, D. L. Rhodes, and W. Wolf, “TGFF: Task Graphs for Free,”
CODES’98, pp. 97-101, 1998.
[73] http://sesc.sourceforge.net
[74] http://www.nirgam.ecs.soton.ac.uk
[75] L. M. Ni and P. K. McKinley, “A survey of wormhole routing techniques in
direct networks,” Computer, vol. 26(2), pp. 62-76, Feb. 1993.
[76] M. B. Taylor et al., “The RAW microprocessor: a computational fabric for
software circuits and general-purpose programs,” IEEE Micro, vol. 22(2), pp.
25-35, March 2002.
[77] S. Vangal et al., “An 80-Tile 1.28 TFLOPS Network-on-Chip in 65nm CMOS,”
IEEE JSSC, vol. 43(1), pp. 29-41, Jan. 2008.
[78] W. J. Dally and B. Towles, Principles and Practices of Interconnection
Networks, Morgan Kaufmann, 2004.
[79] G. Chen, F. Li, S.W. Son, and M. Kandemir, “Application mapping for chip
multiprocessors,” Proc. Design Automation Conference, pp. 620-625, June 2008.
[80] S. Murali and G. De Micheli, “Bandwidth-constrained mapping of cores onto
NoC architectures,” Proc. Design Automation and Test Europe (DATE), pp.
896-901, Feb. 2004.
[81] J. Hu and R. Marculescu, “Energy-aware communication and task scheduling
for Network-on-Chip Architectures under Real-Time Constraints,” Proc. Design
Automation and Test Europe (DATE), pp. 234-239, Feb. 2004.
[82] G. Varatkar and R. Marculescu, “Communication-Aware Task Scheduling and
Voltage Selection for Total Systems Energy Minimization,” IEEE Int’l Conf.
on Computer-Aided Design (ICCAD), pp. 510-517, 2003.
[83] A. Jantsch and H. Tenhunen, Networks on Chip, Kluwer Academic Publishers
2003.
[84] Z. Lu and A. Jantsch, “Slot Allocation for TDM Virtual-Circuit Conﬁguration
for Network-on-Chip,” Int’l Conf. on Computer-Aided Design (ICCAD), pp.
18-25, 2007.
[85] K. Goossens, J. Dielissen, and A. Radulescu, “Æthereal network on chip: Con-
cepts, architectures and implementations,” IEEE Design Test Comput., vol.
22(5), pp. 414-421, 2005.
[86] M. Millberg, E. Nilsson, R. Thid, and A. Jantsch, “Guaranteed bandwidth
using looped containers in temporally disjoint networks within the nostrum
network-on-chip,” Design, Automation and Testing in Europe (DATE), pp.
890-895, 2004.
[87] J. Duato, S. Yalamanchili, and L.M. Ni, Interconnection Networks: An Engi-
neering Approach, Morgan Kaufmann, 2003.
[88] G.-M. Chiu, “The Odd-Even Turn Model for Adaptive Routing,” IEEE Trans.
Parallel Distrib. Syst., vol. 11(7), pp. 729-738, 2000.
[89] J. Hu, R. Marculescu, “DyAD: smart routing for networks-on-chip,” Proc.
Design Automation Conference, pp. 260-263, 2004.
[90] M. H. Cho et al., “Path-Based, Randomized, Oblivious, Minimal Routing,”
Int’l Workshop on Network on Chip Arch., pp. 23-28, 2009.
[91] E. Bolotin, I. Cidon, R. Ginosar, and A. Kolodny, “QNoC: QoS architecture
and design process for network-on-chip,” Jnl. Syst. Archit., vol. 50(2-3), pp.
105-128, 2004.
[92] D. Andreasson and S. Kumar, “Slack-time aware routing in NoC systems,”
Int’l Symp. on Circuits and Syst. (ISCAS), pp. 2353-2356, 2005.
[93] E. Beigne et al., “An asynchronous NOC architecture providing low latency
service and its multi-level design framework,” Int’l Symp. on Async. Circuits
and Syst. (ASYNC), pp. 54-63, 2005.




[96] N. Agarwal, L.-S. Peh, and N. Jha, “GARNET: A Detailed Interconnection
Network Model inside a Full-system Simulation Framework,” Technical Report
CE-P08-001, 2008.
