Energy-Aware Scheduling for Streaming Applications by Xu, Ruibin
ENERGY-AWARE SCHEDULING FOR
STREAMING APPLICATIONS
by
Ruibin Xu
B.E., Guangdong University of Technology, P.R.China, 1996
M.S., Zhongshan University, P.R.China, 1999
Submitted to the Graduate Faculty of
the Arts and Science in partial fulfillment
of the requirements for the degree of
Doctor of Philosophy
University of Pittsburgh
2010
UNIVERSITY OF PITTSBURGH
DEPARTMENT OF COMPUTER SCIENCE
This dissertation was presented
by
Ruibin Xu
It was defended on
January 4th 2010
and approved by
Dr. Rami Melhem
Dr. Daniel Mossé
Dr. Bruce Childers
Dr. Jun Yang
Dissertation Advisors: Dr. Rami Melhem,
Dr. Daniel Mossé
Copyright © by Ruibin Xu
2010
ENERGY-AWARE SCHEDULING FOR STREAMING APPLICATIONS
Ruibin Xu, PhD
University of Pittsburgh, 2010
Streaming applications have become increasingly important and widespread, with application
domains ranging from embedded devices to server systems. Traditionally, researchers
have focused on improving the performance of streaming applications to achieve high
throughput and low response time. However, attention is increasingly shifting to the
power/performance trade-off because power consumption has become a limiting factor in
system design as integrated circuits enter the realm of nanometer technology.
This work addresses the problem of scheduling a streaming application (represented by
a task graph) with the goal of minimizing its energy consumption while satisfying its two
quality of service (QoS) requirements, namely, throughput and response time. The available
power management mechanisms are dynamic voltage scaling (DVS), which has been shown
to be effective in reducing dynamic power consumption, and vary-on/vary-off, which turns
processors on and off to reduce static power consumption.
Scheduling algorithms are proposed for different computing platforms (uniprocessor and
multiprocessor systems), different characteristics of workload (deterministic and stochastic
workload), and different types of task graphs (singleton and general task graphs). Both
continuous and discrete processor power models are considered. The highlights are a unified
approach for obtaining optimal (or provably close to optimal) uniprocessor DVS schemes for
various DVS strategies and a novel multiprocessor scheduling algorithm that exploits the
difference between the two QoS requirements to perform processor allocation, task mapping,
and task speed scheduling simultaneously.
TABLE OF CONTENTS
ACKNOWLEDGEMENT
1.0 INTRODUCTION
2.0 BACKGROUND AND RELATED WORK
2.1 Streaming Applications
2.2 Real-Time Systems
2.3 Power Management Mechanisms
2.4 Energy-Aware Uniprocessor Scheduling
2.4.1 Inter-task DVS Schemes
2.4.2 Intra-task DVS Schemes
2.4.3 Hybrid DVS Schemes
2.5 Energy-Aware Multiprocessor Scheduling
3.0 MODELS AND PROBLEM DESCRIPTION
3.1 Application Model
3.2 System Model
3.3 Processor Model
3.3.1 Ideal Model
3.3.2 Realistic Model
3.4 Communication Model
3.5 Problem Description
3.5.1 Uniprocessor Scheduling Problems
3.5.1.1 The STREAM-UP-D-ST and STREAM-UP-D-TG Problems
3.5.1.2 The STREAM-UP-S-ST Problem
3.5.1.3 The STREAM-UP-S-TG Problem
3.5.2 Multiprocessor Scheduling Problems
3.5.2.1 The STREAM-MP-D-ST Problem
3.5.2.2 The STREAM-MP-D-TG Problem
3.5.2.3 The STREAM-MP-S-ST and STREAM-MP-S-TG Problems
3.6 Evaluation Methodology
3.6.1 Workload Generation
3.6.2 Processor Models
4.0 SCHEDULING IN UNIPROCESSOR SYSTEMS
4.1 Overview
4.2 Solutions under Ideal Processor Model
4.2.1 Optimal Intra-Task Scheme
4.2.2 Optimal Inter-Task Scheme
4.2.3 Optimal Hybrid Scheme
4.2.4 A Unified View
4.3 Solutions under Realistic Processor Model
4.3.1 The Intra-task Scheme
4.3.1.1 Problem Formulation
4.3.1.2 Patching the Schemes Obtained under the Ideal Processor Model
4.3.1.3 The PPACE Scheme
4.3.1.4 Analysis of PPACE
4.3.2 The Inter-task Scheme
4.3.3 The Hybrid DVS Schemes
4.3.4 Evaluation
4.3.4.1 Evaluation of Intra-Task DVS Schemes
4.3.4.2 Evaluation of the DVS Schemes for General Frame-based Systems
4.4 A Unified Approach
4.4.1 Problem Formulation
4.4.2 The Basic Idea for the Unified Approach
4.4.2.1 The SIDVS Scheme
4.4.2.2 Properties of the SIDVS Scheme
4.4.2.3 Obtaining the SIDVS Scheme
4.4.3 The Details of the Unified Approach
4.4.3.1 On Step Functions
4.4.3.2 The Algorithm for SIDVS
4.4.3.3 The Algorithm for IDVS
4.4.3.4 The Algorithm for HDVS
4.4.4 Evaluation Results
4.4.4.1 Evaluation of Inter-task DVS Schemes
4.4.4.2 Evaluation of Hybrid DVS Schemes
4.5 Summary
5.0 SCHEDULING IN MULTIPROCESSOR SYSTEMS
5.1 Overview
5.2 Scheduling A Single Task
5.2.1 Deterministic Workload
5.2.1.1 Ideal Processor Model
5.2.1.2 Realistic Processor Model
5.2.2 Stochastic Workload
5.2.3 Evaluation
5.3 Scheduling A Task Graph
5.3.1 Scheduling for Linear Task Graphs with Deterministic Workload
5.3.1.1 Y-Oriented Load
5.3.1.2 The Scheduling1D Algorithm
5.3.2 Scheduling for General Task Graphs With Deterministic Workload
5.3.2.1 X-Oriented Load
5.3.2.2 Scheduling Heuristics
5.3.2.3 The Scheduling2D Algorithm
5.3.3 Scheduling General Task Graphs with Stochastic Workload
5.3.3.1 The Offline Part of SScheduling2D
5.3.3.2 The Online Part of SScheduling2D
5.3.4 Experimental Results
5.4 Summary
6.0 CONCLUSIONS
7.0 FUTURE WORK
APPENDIX A. AN ILLUSTRATIVE EXAMPLE OF SPEED ROUNDING EFFECT
APPENDIX B. AN ILLUSTRATIVE EXAMPLE OF DVS SCHEMES
APPENDIX C. PROOF OF LEMMA 2
BIBLIOGRAPHY
INDEX
LIST OF TABLES
1 The eight optimization problems under consideration
2 Synthetic task graphs
3 XScale speed settings and power consumptions
4 PowerPC 405LP speed settings and power consumptions
5 The road map of our investigation; cited work was done by other researchers prior to this dissertation
6 The road map of our investigation
7 Energy savings (%) of Scheduling2D over baseline
8 Energy savings (%) of SScheduling2D over Scheduling2D
9 The parameters for the 3 tasks in the illustrative example
10 The comparison of the DVS schemes for the illustrative example
LIST OF FIGURES
1 Chip multiprocessor model
2 Probability functions: uniform, unimodal1, unimodal2, unimodal3, bimodal1, bimodal2 (from left to right). The Y-axis is probability and the X-axis is number of execution cycles.
3 Task graph of ATR when three targets are detected
4 Approximate analytical power function vs. actual power function
5 Graphical representation of the mathematical program (4.6)-(4.8)
6 Comparing intra-task DVS schemes for bimodal1 distribution (the relative errors are relative to optimal solutions)
7 Efficiency of PPACE (bimodal1 distribution)
8 Effect of speed scaling points (bimodal1 distribution)
9 Effect of speed change overhead (bimodal1 distribution)
10 Comparison of DVS schemes for general frame-based systems (the relative errors are relative to the clairvoyant scheme)
11 Experimental results for ATR
12 The SIDVS Scheme for the ideal model
13 The SIDVS Scheme for the realistic model
14 Function approximation
15 Evaluation of Inter-task DVS Schemes (normalized to IDVS)
16 Evaluation of hybrid DVS Schemes (normalized to HDVS)
17 The master-slave Scheme
18 Applying the master-slave scheme to a streaming application for which D = 2.5T
19 Energy savings for 70nm technology
20 Divisible load
21 An example of scheduling for general task graphs
22 An execution scenario
23 Energy savings for 70nm technology on k series parallel xover
LIST OF ALGORITHMS
4.1 OITDVS-Offline([W1, W2, ..., WN], [P1(x), P2(x), ..., PN(x)])
4.2 GOPDVS-Offline([W1, W2, ..., WN], [P1(x), P2(x), ..., PN(x)])
4.3 TRIM(L = [l1, l2, ..., l|L|], δ)
4.4 PPACE(ε)
4.5 AdjustContinuousSpeed(i, d)
4.6 PITDVS-online(i, d)
4.7 PITDVS2-online(i, d)
4.8 PGOPDVS-online(i, d)
4.9 PIT-PPACE-online(i, d, ε)
4.10 SIDVS(ε)
4.11 TRIM(F = [P1, P2, ..., P|P|], δ)
4.12 IDVS(ε)
4.13 HDVS(ε)
5.1 Computing Ei(t)
5.2 Scheduling1D(ε)
5.3 Scheduling2D(ε)
5.4 XMAP(i, j, d)
ACKNOWLEDGEMENT
First and foremost, I would like to thank my advisers, Dr. Rami Melhem and Dr. Daniel
Mossé, for their guidance, support, and help throughout my Ph.D. study. They were always
there to listen and give advice. They taught me how to do scientific research and tackle hard
problems. No matter what career path I take, industry or academia, they are forever my
role models.
A special thanks to Dr. Bruce Childers, who always asked good questions about my
work, which in turn led to improvement of my work. I would also like to thank Dr. Jun
Yang for agreeing to serve on my Ph.D. committee and for providing helpful comments on
my work.
I am grateful to all members of the collaboration in the Power-Aware Real-Time Systems
(PARTS) research group. Specifically, I want to thank Cosmin Rusu, Dakai Zhu, Navine
AbouGhazaleh, Matt Craven, and Alexandre Ferreira for their kind help. I really enjoyed
working with these wonderful people.
On the personal level, I would like to thank my wife, Yuanbing Du, for her constant love,
encouragement, and support during my educational career. Without her, I would have been
a very different person and would not have finished my Ph.D. study. Last but not least, I
would like to dedicate this work to my mother, Jinlan Huang, who passed away just before
I came to the US to pursue my Ph.D. She had always been telling me that having higher
education would be my only way out. She was so right. I believe my Ph.D. graduation would
have made her the happiest person in the world.
1.0 INTRODUCTION
Streaming applications are those that operate on continuous streams of data. They have
become increasingly important and widespread. Examples include Internet audio and
video streaming. These streaming media applications already consume a significant
portion of Internet bandwidth, and their use continues to grow. Other examples of
streaming applications include automatic target recognition (ATR) found in radar systems
[43], which performs pattern matching of targets in images that are continuously fed from sensors.
Traditionally, researchers have focused on improving the performance of streaming applications.
There are two typical performance metrics for streaming applications: throughput
and response time. The data stream that a streaming application operates on can be abstracted
as a series of requests (e.g., a frame in video streaming is a request), and the streaming
application services the requests (e.g., decoding frames in video streaming) in succession.
Thus, throughput is defined as the number of requests that a streaming application services
in one second, and response time is defined as the maximum time allowed to service one
request. Clearly, high performance for a streaming application is synonymous with high
throughput and low response time.
In general, there are two approaches to improving the performance of streaming applications:
a hardware approach and a software approach. The hardware approach refers to
designing faster computer systems, while the software approach relies on developing smarter
implementations and/or scheduling algorithms. Over the past three decades, the hardware
approach has overshadowed the software approach due to Moore’s Law. Processor frequencies
have gone from 4 KHz to 4 GHz. As a result, the performance of streaming applications
has improved dramatically. However, the price paid is that of drastically increased power
consumption. The increased power consumption results in increased energy consumption,
which is especially detrimental for battery-powered systems, and generates an excessive amount
of heat that calls for expensive, sophisticated packaging and cooling techniques. The generated
heat, if not effectively removed, can also reduce system reliability. Thus, there is a
fundamental trade-off between performance and energy. Since power/energy consumption
can no longer be ignored in modern systems, research attention has increasingly
shifted from performance alone to the energy-performance trade-off, which is the focus of this
dissertation. Because streaming applications are continuous in nature and usually compute-intensive,
most of them are power-demanding and energy-hungry. Thus, there is a great
need to optimize energy consumption for streaming applications while satisfying two typical
quality-of-service (QoS) requirements, namely, throughput and response time.
Modern systems usually provide two power management mechanisms at the operating
system level. The first one is vary-on/vary-off (or on-off for short), which refers to turning
processors off or putting them into sleep mode to reduce static power consumption resulting
from leakage current. The second one is dynamic voltage scaling (DVS), which involves
dynamically adjusting the voltage and frequency/speed¹ of a processor to reduce dynamic
power consumption resulting from switching activities in circuits. When dealing with multiprocessor
systems, another fundamental trade-off, between static and dynamic power
consumption, enters the picture. Assuming perfect parallelism, executing a streaming application
on more processors increases the static power consumption of a multiprocessor system
while decreasing its dynamic power consumption. Hence, when running a streaming application
on a multiprocessor system, it is necessary to combine these two power management
mechanisms to optimize the energy consumption of the streaming application.
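The static/dynamic trade-off can be illustrated with a small numerical sketch. It assumes, purely for illustration, perfect parallelism, a dynamic power equal to the cube of the processor speed, a constant static power per active processor, and no lower bound on speed. The function names and parameter values are hypothetical and are not taken from this dissertation.

```python
def energy_per_request(n, work=1.0, deadline=1.0, p_static=0.2):
    """Energy to service one request on n processors under perfect parallelism.

    Illustrative assumptions: each processor executes work/n cycles at the
    slowest speed that meets the deadline, dynamic power is speed**3, and
    every active processor draws p_static for the whole interval.
    """
    speed = (work / n) / deadline          # slowest feasible speed
    dynamic = n * speed ** 3 * deadline    # shrinks as 1/n**2
    static = n * p_static * deadline       # grows linearly with n
    return dynamic + static


def best_processor_count(max_n, **kwargs):
    """Return the processor count in 1..max_n with the lowest energy."""
    return min(range(1, max_n + 1),
               key=lambda n: energy_per_request(n, **kwargs))
```

Under these assumed values, a static power of 0.2 makes two processors the cheapest choice, while lowering it to 0.01 moves the minimum to six processors, which is why the number of active processors and their speeds have to be chosen together.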
This dissertation assumes a given computer system (either uniprocessor or multiproces-
sor) that provides the aforementioned two power management mechanisms and a streaming
application that is represented by a task graph. The goal is to design scheduling algorithms
to schedule this streaming application to execute on this computer system (i.e., the software
approach) to minimize the energy consumption of the streaming application while satisfying
two QoS requirements, throughput and response time. The scheduling outcome includes the
number of processors assigned to execute the streaming application, the task-to-processor
mapping, and the execution speed of each task. A scheduling algorithm for uniprocessor
systems is also called a DVS scheme since DVS is the main power management mechanism.
Note that in the context of real-time systems, the streaming application is equivalent to a
periodic application, where the response time requirement is the deadline and the reciprocal
of throughput is the period.
¹ In this dissertation, frequency and speed are used interchangeably.
The following requirements should be satisfied by the scheduling algorithms: (i) effectiveness:
the algorithms need to achieve good performance in terms of energy savings; (ii)
efficiency: the offline computation of the scheduling algorithms must be tractable (i.e., at
most polynomial time) and the online computation of the scheduling algorithms must be
done in no more than linear time to minimize the scheduling overhead; (iii) practicality: the
scheduling algorithms must take into consideration practical issues such as the limited number
of discrete speeds available in a processor and speed change overhead, which cannot be
ignored in practice.
Bearing the above requirements in mind, a comprehensive treatment of energy-aware
scheduling for streaming applications is given. Different underlying computing platforms
(uniprocessor and multiprocessor systems), different characteristics of the workload (deterministic
and stochastic workload), and different types of task graphs (single task and general
task graphs) are taken into account. Furthermore, two processor power models are considered.
The first one is an ideal model in which the processor speed can be tuned continuously
and unrestrictedly, and the speed change overhead is ignored. The second one is a realistic
model, in which the processor has a limited number of discrete speeds and the speed change
overhead is considered. The ideal model is of great simplicity and makes it possible to derive
elegant and optimal scheduling algorithms through which insight into the problems under
investigation is gained. However, the ultimate goal is to obtain scheduling algorithms under
the realistic model.
The contributions of this doctoral work are as follows.
1. For uniprocessor scheduling on stochastic workload under the ideal processor model,
an optimal inter-task DVS scheme (for multiple tasks, where speed can be changed only at task
boundaries) is proposed, and a unified view of the optimal intra-task DVS (for a
single task, where speed can be changed inside the task), inter-task DVS, and hybrid DVS
(for multiple tasks, where speed can be changed inside a task) schemes is provided [65, 67].
2. For uniprocessor scheduling on stochastic workload under the realistic processor model:
• A new intra-task DVS scheme called PPACE (Practical Processor Acceleration to
Conserve Energy) is proposed. PPACE is a fully polynomial time approximation
scheme (FPTAS) that gives performance guarantees and achieves energy savings
very close to those of the optimal solution [68].
• A new inter-task DVS scheme called PITDVS2 (Practical Inter-Task DVS using 2
speeds) is proposed, and experimental results show that it outperforms the existing
DVS schemes. It is also shown that simple extensions to optimal DVS schemes
obtained under the ideal processor model do not necessarily generate DVS schemes
that perform well in practice [67].
• A unified practical approach for obtaining optimal (or provably close to optimal)
stochastic inter-task, intra-task, and hybrid scheduling algorithms is proposed. The
approach is based on a function approximation technique that also falls in the category
of fully polynomial time approximation schemes. As a result, tight upper
bounds on energy savings for stochastic DVS schemes are established, and this approach
can be used to evaluate existing DVS schemes [65].
3. For multiprocessor scheduling of a single task, a simple master-slave scheme that executes
different instances of the task on multiple processors is proposed. Based on this scheme,
scheduling algorithms for deterministic and stochastic workload, and for the ideal and
realistic processor model are devised.
4. For multiprocessor scheduling of a task graph under the realistic processor model:
• A novel scheduling algorithm called Scheduling2D for deterministic workload is proposed.
This algorithm exploits the difference between the two QoS requirements and
performs processor allocation, task mapping, and task speed scheduling simultaneously.
The design of this algorithm shows that the static power of processors has an
important impact on the scheduling of streaming applications and that high static power
could lead to servicing requests faster than the response time requirement in order
to save energy. Experimental results show that Scheduling2D achieves significant
energy savings over existing scheduling algorithms that only consider the response
time requirement [66].
• The Scheduling2D algorithm is further extended to the SScheduling2D algorithm to
deal with stochastic workload.
To the best of my knowledge, this dissertation is the first work to address energy-aware
scheduling on multiprocessor systems considering both throughput and response time constraints.
This dissertation is organized as follows. First, the background and related work for this
dissertation are described in Chapter 2. The models, problem description, and evaluation
methodology are presented in Chapter 3. Chapters 4 and 5 present scheduling algorithms
and evaluation results for uniprocessor and multiprocessor systems, respectively. Chapter 6
concludes this dissertation and Chapter 7 elaborates on future research that extends this work.
2.0 BACKGROUND AND RELATED WORK
2.1 STREAMING APPLICATIONS
Continuous processing on a stream of data is the main characteristic of streaming applications.
Stephens et al. [57] gave an excellent survey of stream processing. Thies et al. [58]
pointed out a number of important properties of streaming applications. A number of
programming languages are related to stream processing. In general, task
graphs can be extracted from programs written in these languages.
A recent language designed for stream processing is MIT StreamIt [58]. In StreamIt,
programs are represented as graphs, where nodes represent computation and edges represent
FIFO-ordered communication of data. The basic programmable unit is a filter, and filters can
be composed into stream graphs by using three hierarchical structures: pipelines, split-joins,
and feedback loops.
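The hierarchical composition described above can be sketched as follows. This is a Python stand-in, not StreamIt syntax: the helper names are hypothetical, the split-join is assumed to duplicate its input to every branch and to join the branch outputs round-robin, and feedback loops are omitted for brevity.

```python
def filter_map(fn):
    """A stateless filter that applies fn to each item in FIFO order."""
    return lambda stream: [fn(item) for item in stream]


def pipeline(*stages):
    """Compose stages so that each consumes the previous stage's output."""
    def run(stream):
        for stage in stages:
            stream = stage(stream)
        return stream
    return run


def split_join(*branches):
    """Duplicate the input to every branch, then interleave the outputs."""
    def run(stream):
        items = list(stream)
        outputs = [list(branch(items)) for branch in branches]
        # Round-robin join: one item from each branch in turn.
        return [item for group in zip(*outputs) for item in group]
    return run


# A two-stage graph: an increment filter feeding a two-branch split-join.
graph = pipeline(
    filter_map(lambda x: x + 1),
    split_join(filter_map(lambda x: x * 2), filter_map(lambda x: x * 10)),
)
```

Each filter or composite in such a graph is a candidate node of the task graph, and each FIFO connection is a candidate precedence edge.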
Previous work on scheduling streaming applications for multiprocessor systems focused
on performance issues. Bokhari et al. [12] gave an optimal algorithm for mapping a chain
of tasks to multiprocessor systems. In [26], a stream compiler for StreamIt is presented for
scheduling streaming applications on communication-exposed architectures (e.g., the MIT
RAW). Agarwalla et al. [4] developed a scheduler called Streamline to schedule streaming
applications on Grid systems. Various heuristics [50, 55, 49] were proposed to use parallel
processing and pipelining to maximize throughput or minimize the number of processors
in synthesizing task graphs for multiprocessor systems. However, these works did not take
energy consumption into consideration; this dissertation fills this void.
Streaming applications can be modeled as real-time applications, as described next.
2.2 REAL-TIME SYSTEMS
In real-time systems, the correctness of an application depends not only on the correctness
of its output, but also on the time it takes to produce its output. A real-time application can
only start execution after its release time (defined to be the instant of time at which the
application becomes available for execution) and must finish the execution correctly before
its deadline (defined to be the instant of time by which the application is required to be
completed). The response time of a real-time application is the length of time from its
release time to the instant when it finishes. A real-time system can be either hard real-time
or soft real-time. The former means that the deadline is a strict requirement and deadline
misses may result in system failure, while the latter means that occasional deadline misses
are tolerable and the performance requirement is a statistical one.
A real-time application generally consists of a set of tasks, which, together with the
precedence constraints, constitute a task graph. Each task has a worst-case execution time
(WCET), which can be obtained through profiling or analysis. In the context of energy-aware
real-time scheduling, the task workload is often characterized by worst-case execution cycles
(WCEC). The traditional WCET can be computed by dividing WCEC by the maximum
frequency of the processor. The streaming applications that this dissertation deals with are
real-time applications because they have a response time requirement.
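The relation between WCEC and WCET can be written as a short sketch. The function name and the numbers in the usage note are illustrative assumptions, not values from this dissertation.

```python
def worst_case_time(wcec_cycles, frequency_hz):
    """Worst-case execution time of a task run at one constant frequency.

    With frequency_hz set to the processor's maximum frequency, this is
    the traditional WCET; a lower frequency gives a longer worst case.
    """
    if frequency_hz <= 0:
        raise ValueError("frequency must be positive")
    return wcec_cycles / frequency_hz
```

For example, a task with a WCEC of 3e8 cycles has a WCET of 0.3 s on a processor whose maximum frequency is 1 GHz, and would need up to 0.5 s if it ran entirely at 600 MHz.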
The most common task model in real-time systems is the periodic task model. In this
model, each task is executed repeatedly at regular time intervals in order to provide a function
of the system on a continuing basis. The time between consecutive release times of a task
is called the period of the task. Deadlines are generally relative to the beginning of periods.
If all tasks share a common period and an identical first release time, the system is called
a frame-based system. Each period is called a frame. When tasks in an
application have precedence constraints, a partial order on the execution of tasks is imposed.
In addition, tasks can be preemptive or non-preemptive. While the execution of a preemptive
task can be interrupted by another task and resume later, a non-preemptive task cannot be
interrupted before its completion.
A streaming application can be modeled as tasks executing in a frame-based real-time
system. The throughput, defined as the number of requests per second, is equal to the
reciprocal of the frame length. The response time requirement, defined as the maximum
time allowed to service a single request, is equivalent to the task deadline.
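The mapping from the two QoS requirements to frame-based parameters can be sketched as below. The class and function names are hypothetical, and the numeric values in the usage note are chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FrameParameters:
    frame_length: float  # seconds between consecutive request releases
    deadline: float      # seconds allowed to service one request


def frame_parameters(throughput, response_time):
    """Translate throughput (requests/s) and response time (s) into
    the frame length and task deadline of a frame-based system."""
    if throughput <= 0 or response_time <= 0:
        raise ValueError("both QoS requirements must be positive")
    return FrameParameters(frame_length=1.0 / throughput,
                           deadline=response_time)
```

With an assumed throughput of 25 requests per second and a response time of 0.1 s, the frame length is 0.04 s and the deadline spans 2.5 frames, so the deadline need not coincide with the frame length.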
Scheduling in real-time systems decides which task is executed on which processor
at what time. For energy-aware scheduling, the speed at which a task is executed also needs
to be determined. A schedule is said to be feasible if the precedence constraints, timing
constraints, as well as any other constraints are satisfied.
2.3 POWER MANAGEMENT MECHANISMS
Power consumption in a CMOS processor can be divided into three parts: dynamic, static,
and short-circuit power [44]. Short-circuit power is only consumed during signal transitions
and is generally negligible [44]. Dynamic power is due to switching activities in the circuit
and is roughly proportional to the input voltage squared times the operating frequency [60].
However, input voltage and operating frequency are not independent: there is a minimum
voltage required to drive the circuit at the desired frequency. The minimum voltage is
approximately proportional to the frequency, leading to the conclusion that the dynamic power is
proportional to the frequency cubed [14]. Since the time taken to run a program is inversely
proportional to the operating frequency, quadratic energy savings can be achieved at the
expense of just linear performance loss through DVS [30].
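The cubic power relation and the resulting quadratic energy savings can be checked numerically. The following is a minimal sketch, assuming a normalized dynamic power function p(f) = f^3 with no static power and a CPU-bound program; the function names are illustrative and do not come from the cited work.

```python
def dynamic_power(f, alpha=3.0):
    """Normalized dynamic power at frequency f (assumes p(f) = f**alpha)."""
    return f ** alpha


def run_time(cycles, f):
    """Execution time of a CPU-bound program of `cycles` cycles at frequency f."""
    return cycles / f


def energy(cycles, f, alpha=3.0):
    """Energy = power * time, which reduces to cycles * f**(alpha - 1)."""
    return dynamic_power(f, alpha) * run_time(cycles, f)


if __name__ == "__main__":
    cycles = 1_000_000
    full_speed, half_speed = 1.0, 0.5
    # Linear performance loss: the run takes twice as long at half speed.
    print("slowdown:", run_time(cycles, half_speed) / run_time(cycles, full_speed))
    # Quadratic energy saving: the energy ratio equals 0.5 ** 2.
    print("energy ratio:", energy(cycles, half_speed) / energy(cycles, full_speed))
```

Under these assumptions, halving the frequency doubles the run time while reducing the energy to one quarter.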
Static power consumption stems from leakage current that exists even in the absence of
switching activities in a circuit. In traditional CMOS circuits, static power can be ignored
because dynamic power dominates. However, this is not true any more considering the
trends of CMOS circuit technologies. A five-fold increase in the leakage power is estimated
with each technology generation [13]. Thus, power management schemes must take static
power into consideration. Turning a processor off or putting it in sleep mode are general
mechanisms to save static power.
2.4 ENERGY-AWARE UNIPROCESSOR SCHEDULING
In energy-aware uniprocessor scheduling, DVS is the main power management mechanism
to be considered. This is because the decision of whether to turn off the only processor in
the system is straightforward: we turn off the processor when it has nothing to run. Thus, we
also call the scheduling algorithms for uniprocessor systems DVS schemes. There are three
types of DVS schemes: inter-task, intra-task, and hybrid [34]. Inter-task DVS schemes focus
on allotting system time to multiple tasks and schedule speed changes only at each task
boundary (i.e., the execution speed for a task is constant for each instance executed), while
intra-task DVS schemes focus on how to schedule speed changes within a single task instance
given an allotted amount of time. Hybrid DVS schemes are combinations of intra-task and
inter-task DVS schemes. There has been a large amount of work done in real-time systems
related to DVS schemes. Next, we review the related work for each type of DVS scheme.
2.4.1 Inter-task DVS Schemes
Inter-task DVS schemes differ in the way slack is allotted to tasks in the system. Slack
includes static slack due to system under-utilization when each task is assumed to run for
its worst-case execution cycles (WCEC) and dynamic slack due to early completion of tasks.
The concept of speculative speed reduction was introduced in [48], which proposed three DVS
schemes (i.e., Greedy, Proportional, and Statistical schemes) with different speed reduction
aggressiveness for frame-based real-time systems. The Proportional scheme distributes the
slack proportionally among all unexecuted tasks, while the Greedy scheme is much more
aggressive and gives all the slack to the next ready-to-run task. The Statistical scheme
uses the average-case execution cycles (ACEC) of tasks to predict the future slack. Many
existing DVS schemes proposed for real-time systems can be classified as Proportional or
Greedy. The cycle-conserving scheme and the look-ahead scheme for periodic real-time
systems with variable workloads were proposed in [51]. When used in frame-based systems,
the cycle-conserving scheme is equivalent to the Proportional scheme, while the look-ahead
scheme is equivalent to the Greedy scheme. The scheme for fixed-priority real-time systems
proposed in [56], when used in frame-based systems, is equivalent to the Greedy scheme.
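The difference between the Proportional and Greedy slack allocation policies can be sketched as follows. This is an illustrative reading of the descriptions above, not the algorithms of [48]: it assumes a frame-based system, continuously adjustable speed, tasks listed in execution order, and a feasible frame (all remaining worst-case work fits in the remaining time at the hypothetical maximum frequency `f_max`).

```python
def proportional_speed(remaining_wcec, time_left):
    """Speed for the next task when slack is shared by all unexecuted tasks.

    remaining_wcec: worst-case cycles of the unexecuted tasks, next task first.
    time_left: time remaining until the end of the frame.
    """
    # Every unexecuted task is stretched by the same factor.
    return sum(remaining_wcec) / time_left


def greedy_speed(remaining_wcec, time_left, f_max):
    """Speed for the next task when it receives all of the available slack."""
    # Later tasks are only guaranteed their worst case at full speed.
    reserved = sum(remaining_wcec[1:]) / f_max
    return remaining_wcec[0] / (time_left - reserved)


if __name__ == "__main__":
    wcec = [2_000_000, 2_000_000]   # cycles, hypothetical values
    f_max = 1_000_000               # cycles per second, hypothetical
    print("proportional:", proportional_speed(wcec, time_left=8.0))
    print("greedy:", greedy_speed(wcec, time_left=8.0, f_max=f_max))
```

In this sketch the Greedy policy never selects a higher speed for the next task than the Proportional policy, which reflects its more aggressive speed reduction.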
To be able to navigate the full spectrum of speculative speed reduction, Aydin et al.
proposed a DVS scheme in which system designers can set a parameter to control the degree
of speed reduction aggressiveness [8]. In fact, the optimal speed reduction aggressiveness
depends on the variability of the workloads, as will be shown in Section 4.2.2. The Statis-
tical scheme attempts to capture the variability of the workloads by using the average-case
execution cycles (ACEC) of each task, which does not contain sufficient probabilistic infor-
mation. Our inter-task DVS schemes in Section 4.2.2 automatically choose the degree of
speed reduction aggressiveness to minimize the expected energy consumption, based on the
probability distribution of the tasks’ execution cycles.
2.4.2 Intra-task DVS Schemes
For a single task and a given deadline, Lorch et al. showed that if a task’s computational
requirement is only known probabilistically, there is no constant optimal speed for the task
and the expected energy consumption is minimized by gradually increasing the speed as the
task progresses [40, 42]. They call this approach PACE (Processor Acceleration to Conserve
Energy). DVS schemes similar to PACE have also been proposed in [71, 27]. They differ in
the way of patching the speed schedule obtained under an ideal processor model (i.e., the
speed can be adjusted continuously and there is no speed change overhead) to fit a realistic
processor model (i.e., considering discrete speed and speed change overhead). Among these
intra-task DVS schemes, PACE and GRACE (Global Resource Adaptation through CoopEr-
ation) [71] can be used for soft real-time systems as they have a parameter to control the
percentage of deadlines to be met. When this parameter is set to 100%, they are targeted
at hard real-time systems.
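The idea of gradually increasing the speed can be illustrated with a small numerical sketch. It assumes an ideal processor with power f^alpha (no idle power, no speed change overhead), a histogram description of the execution cycles in which a task that falls into a bin is charged for the whole bin, and a hard deadline. Under these assumptions, minimizing the expected energy with a Lagrange multiplier gives a speed proportional to (1 - cdf)^(-1/alpha). All function and parameter names are hypothetical.

```python
def pace_speeds(bin_widths, probs, deadline, alpha=3.0):
    """Per-bin speeds that minimize expected energy under the stated assumptions.

    bin_widths[k]: number of cycles covered by bin k.
    probs[k]: probability that the task finishes in bin k.
    """
    # tail[k] = probability that the task is still running when bin k starts.
    tail, remaining = [], 1.0
    for p in probs:
        tail.append(remaining)
        remaining -= p
    # Scale the speeds so that the worst-case execution time equals the deadline.
    scale = sum(w * q ** (1.0 / alpha) for w, q in zip(bin_widths, tail)) / deadline
    return [scale / q ** (1.0 / alpha) for q in tail]


def worst_case_time(bin_widths, speeds):
    """Time needed if the task runs through every bin."""
    return sum(w / f for w, f in zip(bin_widths, speeds))


def expected_energy(bin_widths, probs, speeds, alpha=3.0):
    """Expected energy; bin k is paid for only if the task reaches it."""
    total, still_running = 0.0, 1.0
    for w, p, f in zip(bin_widths, probs, speeds):
        total += still_running * w * f ** (alpha - 1)
        still_running -= p
    return total


if __name__ == "__main__":
    widths, probs, deadline = [100, 100, 100, 100], [0.25] * 4, 4.0
    accelerating = pace_speeds(widths, probs, deadline)
    constant = [sum(widths) / deadline] * len(widths)
    print("speeds:", accelerating)
    print("expected energy (accelerating):", expected_energy(widths, probs, accelerating))
    print("expected energy (constant):", expected_energy(widths, probs, constant))
```

Both schedules finish the worst case exactly at the deadline; the accelerating schedule starts slower and speeds up only if the task turns out to be long.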
2.4.3 Hybrid DVS Schemes
A hybrid DVS scheme can be obtained by using an inter-task DVS scheme as the basis and
plugging in an intra-task DVS scheme [34, 71]. However, such hybrid schemes are inherently
suboptimal because they ignore the interaction between inter-task and intra-task DVS. The
optimal hybrid DVS scheme under the ideal processor model extends PACE to multiple
tasks in frame-based real-time systems [72]. However, the authors of [72] did not provide
any solution to patch their scheme to be used in practice.
Another interesting result, applicable to both hybrid and inter-task DVS, is that different
ordering of tasks in real-time systems results in different energy consumption [28, 29], which
led to a number of heuristics to obtain a “good” ordering of tasks. This is complementary
to our work because we focus on finding speed schedules given the ordering of tasks.
2.5 ENERGY-AWARE MULTIPROCESSOR SCHEDULING
Multiprocessor systems are quickly becoming the dominant computer platform as more and
more chip multiprocessors (CMPs) become available in the market. By combining multiple small
processor cores on a single chip, CMPs continue to push the processor performance growth
beyond the clock rate limit. Several chip makers have already released CMPs, such as
IBM/Sony/Toshiba’s 9-core CELL [1] and the 80-core prototype by Intel [2]. The trend is
that more and more processor cores will be seen on a single computer system.
There are four key elements in energy-aware multiprocessor scheduling for streaming
applications: (i) streaming applications are represented by task graphs; (ii) multiprocessors;
(iii) quality of service constraints (i.e., throughput and response time); (iv) energy-aware
scheduling using on-off and DVS. We will review related work that contains these elements.
Much research has been done on energy-aware scheduling of multiple independent tasks
for multiprocessor systems using DVS (e.g., [62, 38, 16]). This dissertation focuses on scheduling
of task graphs. There is also a lot of work on energy-aware scheduling of task graphs using
DVS and assuming equal period and deadline. In [23, 61], scheduling algorithms for simple
special types of task graphs were proposed. For general task graphs, Mishra et al. proposed
several heuristics to obtain the execution speed of each task assuming that the number of
processors and task mapping are given [45]. Andrei et al. proposed a convex programming
based approach, again, assuming that the number of processors and task mapping are given
[5]. Cong et al. proposed a priority-based heuristic to perform task mapping and a mathematical
program based approach to perform speed scheduling [18]. All these algorithms can
be used as a component in our scheduling algorithm for general task graphs in Section 5.3.2.
By simply trying every possible number of processors [37], all of this work can be extended
to consider on-off. In Section 5.3.2, we propose a better approach based on hill-climbing.
Before DVS emerged as an important power management mechanism, various approaches
[50, 55, 49, 10] were proposed to use parallel processing and pipelining to maximize through-
put or minimize the number of processors in scheduling task graphs on multiprocessor sys-
tems. Because energy consumption was not under consideration, straightforward application
of these approaches cannot fulfill the need for energy optimization. On one hand, scheduling
approaches that maximize throughput tend to use more processors, which is not energy
efficient when static power is high, and are not guaranteed to comply with the response time
requirement. On the other hand, scheduling approaches that minimize the number of processors
tend to use fewer processors, which is not energy efficient when dynamic power is high. Our
scheduling algorithms in Section 5.3 also apply parallel processing and pipelining. However, we use them
as energy reduction techniques and focus on finding the appropriate number of processors to
exploit the trade-off between static and dynamic power, and on allotting an appropriate amount
of time to each task to stretch its execution.
Combining on-off and DVS to exploit the trade-off between static and dynamic power
consumption has been used in multiprocessor-like settings by a number of researchers. In
[22], Elnozahy et al. proposed a power management policy to determine the optimal number
of online servers and corresponding operating frequency to minimize the energy consump-
tion of clusters. In [69], Xu et al. tackled a similar problem considering several practical
issues. In [9], Anderson et al. studied energy-efficient synthesis of periodic task systems on
multiprocessor platforms. However, all of the above research dealt with independent tasks
and considered only a deadline constraint. Thus, it cannot be applied straightforwardly to
task graph scheduling and multiple-constraint scheduling.
Kim et al. [33] explored the effectiveness of the simultaneous application of pipelining and
parallel processing as a total power reduction technique in uniprocessor design. Our work in
Section 5.3 differs from theirs in several aspects: (i) they focused on a uniprocessor
under a single voltage domain, while we focus on multiprocessor systems in which processors
can operate at different voltages and frequencies and can be turned on and off;
(ii) they assumed idealized unlimited parallelism in instruction streams, while we focus on
task graph mapping; (iii) they considered only a throughput constraint, while we consider
two QoS requirements; (iv) their parallel processing width (instruction issue width) is fixed
for all stages, while we can use a different number of processors at different stages; (v) their
goal was to minimize the power consumption, while we minimize the energy
consumption.
3.0 MODELS AND PROBLEM DESCRIPTION
3.1 APPLICATION MODEL
A streaming application is modeled as a task graph $G(V, E)$, which is a directed acyclic
graph (DAG), in order to exploit parallelism inside the application. The vertex $\upsilon_i \in V$
represents task $\tau_i$, whose worst-case execution cycles (WCEC) is $W_i$. We assume that the
actual execution cycles of $\tau_i$ follow the probability distribution denoted by function $P_i(\cdot)$.
Specifically, $P_i(X)$ is the probability that task $\tau_i$ executes for $X$ ($1 \le X \le W_i$) cycles.
Obviously, $\sum_{x=1}^{W_i} P_i(x) = 1$ and $P_i(W_i) \neq 0$. The corresponding cumulative distribution
function is $cdf_i(x) = \mathrm{Prob}(X \le x) = \sum_{j=1}^{x} P_i(j)$, with $cdf_i(0) = 0$. In practice, a histogram
is used to represent the probability function considering that a task usually takes millions
of cycles. Let the number of bins in the histogram that represents the probability
function be denoted by $r_i$ and denote the bin boundaries by $B_i(k)$, $k = 1, 2, \ldots, r_i$. In this
case, $P_i(\cdot)$ is a function of the bin number, that is, $P_i(k)$ ($1 \le k \le r_i$) denotes the
probability that task $\tau_i$ executes for $B_i(k)$ cycles.
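The histogram representation can be sketched as a small data structure. This is an illustration only: the class name, its methods, and the numbers in the usage example are hypothetical, and the checks in the constructor simply encode the two conditions stated above (the probabilities sum to one and the worst case has non-zero probability).

```python
from bisect import bisect_right
from itertools import accumulate


class CycleHistogram:
    """Histogram of a task's execution cycles (hypothetical helper class)."""

    def __init__(self, boundaries, probs):
        if len(boundaries) != len(probs):
            raise ValueError("one probability per bin is required")
        if any(a >= b for a, b in zip(boundaries, boundaries[1:])):
            raise ValueError("bin boundaries must be strictly increasing")
        if abs(sum(probs) - 1.0) > 1e-9:
            raise ValueError("probabilities must sum to 1")
        if probs[-1] <= 0:
            raise ValueError("the worst case must have non-zero probability")
        self.boundaries = list(boundaries)   # B(1), ..., B(r)
        self.probs = list(probs)             # P(1), ..., P(r)
        self._cumulative = list(accumulate(self.probs))

    @property
    def wcec(self):
        """Worst-case execution cycles: the last bin boundary."""
        return self.boundaries[-1]

    def cdf(self, cycles):
        """Probability that the task needs at most `cycles` cycles."""
        reached = bisect_right(self.boundaries, cycles)  # bins with B(k) <= cycles
        return 0.0 if reached == 0 else self._cumulative[reached - 1]

    def mean_cycles(self):
        """Average-case execution cycles implied by the histogram."""
        return sum(b * p for b, p in zip(self.boundaries, self.probs))


if __name__ == "__main__":
    hist = CycleHistogram([100, 200, 400], [0.5, 0.3, 0.2])
    print(hist.wcec, hist.cdf(150), hist.mean_cycles())
```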
The directed edge $e_{ij}$ represents the dependency between tasks $\tau_i$ and $\tau_j$, that is, $\tau_j$ is ready
to begin execution only after $\tau_i$ finishes execution ($\tau_i$ is called the predecessor of $\tau_j$ and $\tau_j$
is called the successor of $\tau_i$). A communication volume $v_{ij}$ is associated with edge $e_{ij}$ and
determines the time and energy cost when $\tau_i$ and $\tau_j$ are scheduled on two different processors.
We assume that the communication cost is zero if the communicating tasks are scheduled on the same
processor. In a task graph, the source is the only vertex that has no predecessors and the
sink is the only vertex that has no successors. In streaming applications, the source receives
the requests and the sink emits the output of servicing the requests. The period is $T$ (i.e.,
the streaming application is invoked every $T$ time units and thus must sustain a throughput
of $1/T$), and the deadline for emitting output (i.e., the response time requirement) is $D$.
There are times when a detailed task graph of a streaming application is not available.
In this case, the streaming application is simply represented by a single task, which can be
regarded as a special case of task graph (singleton task graph) that has only a single vertex.
3.2 SYSTEM MODEL
We consider two system models in this dissertation: uniprocessor and multiprocessor. Each
processor in our system models has the ability to dynamically adjust its speed/frequency
and voltage. For multiprocessor systems, we consider a typical homogeneous chip multiprocessor
(CMP) architecture with distributed memory (Figure 1). Each processor core consists
of a processing unit, a local memory, and a switch. We assume that there is an infinite supply
of processor cores on the system. This assumption is based on the trend that more and more
processor cores are available on a single system, as mentioned in Section 2.5.
[Figure 1: Chip multiprocessor model. Processor core 1, processor core 2, and further cores each contain a processing unit (PU), a local memory (LM), and a switch (S), and are connected by an interconnection network.]
We assume that the program for each task in a streaming application is written in
stream programming style [58], that is, the program goes through three steps for servicing
each request: (i) data gathering from communication network to local memory; (ii) data
processing in local memory; (iii) data (i.e., processing results) dissemination from local
memory to communication network.
3.3 PROCESSOR MODEL
We consider two processor models in this dissertation.
3.3.1 Ideal Model
The first model, the ideal processor model, assumes that the processor speed can be adjusted
continuously from zero to infinity and there is no speed change overhead. The processor
power consumption when executing task $\tau_i$ at frequency $f$ is
$$p_i(f) = c_0 + c_i f^{\alpha}$$
where $\alpha$ ($\alpha > 1$) reflects the convex power-frequency relationship, $c_0$ denotes the processor
static power consumption when the processor is idle (i.e., $f = 0$, since the processor is
not executing any task and the overhead is ignored), and $c_i$ reflects the effective switching
capacitance of task $\tau_i$. This form of analytical power function is due to the fact that dynamic
power consumption can be approximately computed by $C_e \times V_{dd}^2 \times f$ ($V_{dd}$ is the supply voltage
and $C_e$ is the effective switching capacitance) and the frequency $f$ is almost linearly related
to the supply voltage [60]. If a processor (either in a uniprocessor or a multiprocessor system) is
on (or alternatively, active), the idle power is always consumed in the system.
Therefore, the power function of the processor can sometimes be simplified as
$$p_i(f) = c_i f^{\alpha}$$
in the analysis without affecting the analysis results. We assume that all tasks are CPU-bound.
Thus, the execution time of a task is inversely proportional to its operating frequency.
Although the ideal processor model is simplistic, it captures the essence of a DVS system
(i.e., the convex power-frequency relationship). Because the simple model is easy to ma-
nipulate mathematically, it is possible to derive optimal DVS schemes, which gives us great
insight into the problem and provides the basis for designing practical DVS schemes.
3.3.2 Realistic Model
The second model, the realistic processor model, considers practical issues. The processor
provides only $M$ discrete operating frequencies, $f_1 < f_2 < \cdots < f_M$. All frequencies are
efficient, which means that using a frequency to execute a task always results in lower
energy consumption than using higher frequencies [46]. The processor power consumption
when idle is $p_{idle}$ (the system is not executing any task and thus consumes constant power,
but not necessarily at $f = 0$ as in the ideal processor model). The processor power consumption
when executing task $\tau_i$ at frequency $f_j$ is $p_i(f_j)$. As in the case of the ideal processor model,
if a processor is active, we sometimes ignore the idle power in deriving DVS schemes and use
$$\hat{p}_i(f_j) = p_i(f_j) - p_{idle}$$
as the power function in the analysis. As in the ideal model, the execution time of a task is
inversely proportional to its operating frequency.
In the realistic processor model, when changing the frequency of the processor from $f_i$
to $f_j$, the time cost is
$$P_T(f_i, f_j) = \xi_1 |f_i - f_j| \qquad (3.1)$$
and the energy cost is
$$P_E(f_i, f_j) = \xi_2 |f_i^2 - f_j^2| \qquad (3.2)$$
where $\xi_1$ and $\xi_2$ are constants determined by the voltage regulators. Equations (3.1) and (3.2)
are taken from [15] and are considered to be an accurate modeling of speed change overheads.
It is common in the literature (e.g., [47]) to simplify Equations (3.1) and (3.2) by considering
the worst-case frequency swing and assuming that $P_T(f_i, f_j) = \xi_1 (f_M - f_1)$ and
$P_E(f_i, f_j) = \xi_2 (f_M^2 - f_1^2)$. Note that $P_T(f, f) = 0$ and $P_E(f, f) = 0$, where $f \in \{f_1, f_2, \ldots, f_M\}$.
3.4 COMMUNICATION MODEL
For multiprocessor systems, we adopt a linear communication cost model, that is, when
transferring B bits of data, the communication delay is tp + λB and the communication
energy is γB, where tp is the propagation delay, λ is the reciprocal of the operating data rate
of the interconnection network, and γ is the energy spent to transfer one bit of data. In this
dissertation, we assume fixed data rate, that is, fixed γ and λ. Although the adopted linear
model does not account for network contention, it was shown to work very well for high-
bandwidth interconnection networks [49], which is typical under current multi-core processor
technology (e.g., CELL [1]).
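The linear communication cost model can be written down directly. The following is a minimal sketch; the function names and the numeric values in the usage example are hypothetical, and the zero cost for tasks on the same processor follows the assumption stated in the application model.

```python
def comm_delay(bits, t_p, lam):
    """Delay for transferring `bits` bits: propagation delay plus lam per bit."""
    return t_p + lam * bits


def comm_energy(bits, gamma):
    """Energy for transferring `bits` bits at gamma energy units per bit."""
    return gamma * bits


def edge_cost(volume_bits, same_processor, t_p, lam, gamma):
    """(delay, energy) of one task graph edge under the linear model."""
    if same_processor:
        # Communication between tasks on the same processor is assumed free.
        return 0.0, 0.0
    return comm_delay(volume_bits, t_p, lam), comm_energy(volume_bits, gamma)


if __name__ == "__main__":
    # Hypothetical parameters: 1 us propagation delay, 1 Gbit/s link, 1 nJ per bit.
    print(edge_cost(8000, False, t_p=1e-6, lam=1e-9, gamma=1e-9))
    print(edge_cost(8000, True, t_p=1e-6, lam=1e-9, gamma=1e-9))
```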
3.5 PROBLEM DESCRIPTION
The problems that this dissertation addresses are essentially constrained optimization prob-
lems. All problems share the same goal, which is to minimize the energy consumption of
a targeted streaming application. They also share the same constraints, namely two QoS
requirements of throughput and response time. Requests arrive at a rate of one
request every T seconds, which requires the streaming application to sustain a
throughput of 1/T. Each request must be processed in a time not greater than D,
that is, the streaming application must respond to a request within time D. The
throughput performance is measured only when the response time limit is met, and the
response time is measured only when the throughput requirement is satisfied.
However, the problems differ in their conditions, which can be categorized along three
dimensions. The first dimension is the underlying computing platform: uniprocessor and
multiprocessor. The second dimension is the characterization of the task workload: deterministic
and stochastic. In the deterministic case, each task consumes its worst-case
execution cycles (WCEC), while in the stochastic case, the number of execution cycles of
each task is variable and the variability can be captured by a probability distribution function.
The third dimension is the type of task graph: general task graphs and singleton task
graphs (only one vertex). Eight problems can be identified through these three dimensions.
Table 1 shows these eight problems and where they are solved. The name of each problem
consists of four parts separated by dashes, with the first part being STREAM and the next
three parts corresponding to the aforementioned three dimensions. Next, we describe these
problems in detail.
Table 1: The eight optimization problems under consideration
STREAM-          Deterministic                     Stochastic
                 Single Tasks     Task Graphs      Single Tasks      Task Graphs
Uniprocessor     UP-D-ST          UP-D-TG          UP-S-ST           UP-S-TG
                 ([7])            ([7])            (Section 4.3.1)   (Sections 4.3.2, 4.3.3, and 4.4)
Multiprocessor   MP-D-ST          MP-D-TG          MP-S-ST           MP-S-TG
                 (Section 5.2)    (Section 5.3)    (Section 5.2)     (Section 5.3)
3.5.1 Uniprocessor Scheduling Problems
In uniprocessor systems, there is no cost associated with communication among tasks. The
main focus is to save processor dynamic energy consumption through DVS. This is why the
scheduling algorithms for uniprocessor systems are also called DVS schemes. Furthermore,
for uniprocessor systems, the two QoS requirements essentially collapse into one in the sense
that we only need to deal with one requirement. This is because if the required response
time D is less than the request interarrival time T , each request must be processed in time
D; if D > T , for a total of N requests, we have a total of N · T + D ≈ N · T (N is a large
number) time to process all these requests, which means each request must be processed in
time T . Thus, the deadline for processing a request is min(D,T ). Without loss of generality,
we assume that D ≤ T so that we can just consider the response time requirement.
3.5.1.1 The STREAM-UP-D-ST and STREAM-UP-D-TG Problems Both the
STREAM-UP-D-ST and STREAM-UP-D-TG problems deal with scheduling a streaming
application whose workload is deterministic on a uniprocessor system. Their difference is
whether or not the streaming application has a detailed task graph. However, because
the workload is deterministic and there is no cost associated with inter-task in-processor
communication, the STREAM-UP-D-TG problem can be reduced to the STREAM-UP-D-
ST problem by transforming the task graph into a single task. The STREAM-UP-D-ST
problem has been well studied. If the workload of a streaming application is deterministic,
one can always use a constant processor speed to execute that streaming application to
achieve the minimum energy consumption without missing the deadline. This is due to
the convexity of the processor power function. In fact, this is a well-established result in
energy-aware scheduling theory. Interested readers can refer to [7] for a formal proof. Thus,
the solution to the problems of STREAM-UP-D-ST and STREAM-UP-D-TG will not be
considered in this dissertation.
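The convexity argument behind the constant-speed result can be sketched in a few lines. This is an informal sketch via Jensen's inequality and not the formal proof given in [7]; it assumes the ideal processor model with the idle power omitted, a workload of $W$ cycles, a deadline $D$, and an arbitrary integrable speed schedule $s(t)$ on $[0, D]$.

```latex
% Assumptions: p(s) = c s^{\alpha} with \alpha > 1 (idle power omitted),
% W cycles to execute, deadline D, speed schedule s(t) >= 0 on [0, D].
\int_0^D s(t)\,dt = W
\quad\Longrightarrow\quad
E = \int_0^D c\, s(t)^{\alpha}\,dt
  \;\ge\; c\, D \left( \frac{1}{D} \int_0^D s(t)\,dt \right)^{\alpha}
  = c\, D \left( \frac{W}{D} \right)^{\alpha}
```

The bound is attained by the constant schedule $s(t) = W/D$, so under these assumptions no time-varying schedule that meets the deadline consumes less energy.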
3.5.1.2 The STREAM-UP-S-ST Problem The STREAM-UP-S-ST problem deals
with scheduling a streaming application represented by a single task whose workload is
stochastic on a uniprocessor system. More often than not, the computational requirement of
a task is variable and not known a priori. One could attempt to predict the computational
requirement of the task. However, for many streaming applications (e.g., MPEG decoder),
the computational requirement cannot be predicted based on the recent history. The vari-
ability and unpredictability of the workloads are mainly caused by different inputs to tasks,
and possibly by randomization inside tasks. For variable workloads, we focus on stochas-
tic DVS schemes that use probability distribution functions to capture the variability of the
workloads. Thus, the goal of such DVS schemes is to minimize the expected energy consump-
tion. The basic question of the STREAM-UP-S-ST problem is: given a task and a deadline,
how to decide the execution speed for the task such that the expected energy consumption
is minimized while meeting the deadline? Note that the solution to the STREAM-UP-S-ST
problem can be applied to frame-based real-time systems with only a single task. This is
because the processing of a request can be viewed as a task and the response time can be
treated as the frame length. In Chapter 5, the solution to the STREAM-UP-S-ST prob-
lem also serves as an important building block for the solutions to scheduling problems in
multiprocessor systems.
3.5.1.3 The STREAM-UP-S-TG Problem The STREAM-UP-S-TG problem deals
with scheduling a streaming application represented by a task graph whose workload is
stochastic on a uniprocessor system. Topological sort can be performed on the task graph
to obtain a chain of tasks to be executed in the system in succession. The basic question of
this problem is: given a chain of tasks and a deadline, how to decide the execution speed
for each task such that the expected energy consumption of all tasks is minimized while all
tasks finish by the deadline? Since the workload of each task is variable, dynamic slack
reclamation plays an important role in this problem. Knowing more information (i.e., the
probability distribution of the computational requirement of each task) is expected to yield
more energy savings than knowing only the probabilistic information of the whole
task graph. The solution to this problem can be applied to general frame-based real-time
systems.
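Obtaining a chain of tasks from a task graph can be sketched with a standard topological sort. The example below uses Kahn's algorithm as one possible choice; the dissertation does not specify which topological order is used, and the dictionary-based graph representation and the function name are hypothetical.

```python
from collections import deque


def topological_chain(successors):
    """Return the tasks of a DAG in an order that respects all dependencies.

    successors: dict mapping each task to an iterable of its successor tasks.
    """
    indegree = {task: 0 for task in successors}
    for task, succs in successors.items():
        for s in succs:
            indegree[s] = indegree.get(s, 0) + 1
    # Start from tasks without predecessors (the source of the task graph).
    ready = deque(sorted(t for t, d in indegree.items() if d == 0))
    chain = []
    while ready:
        task = ready.popleft()
        chain.append(task)
        for s in successors.get(task, ()):
            indegree[s] -= 1
            if indegree[s] == 0:
                ready.append(s)
    if len(chain) != len(indegree):
        raise ValueError("the task graph contains a cycle")
    return chain


if __name__ == "__main__":
    graph = {"A": ["B", "C"], "B": ["D"], "C": ["D"], "D": []}
    print(topological_chain(graph))
```

Any order produced this way can be executed on a single processor in succession, since every task appears after all of its predecessors.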
3.5.2 Multiprocessor Scheduling Problems
Multiprocessor scheduling problems differ from uniprocessor scheduling problems mainly in
two aspects. First, unlike uniprocessor systems, we have the freedom of choosing the number
of processors to execute a streaming application in multiprocessor systems. The two power
management mechanisms (i.e., on-off and DVS) must be used in the scheduling because
of the trade-off between static and dynamic power consumption. Thus, determining the
appropriate number of processors is one of the keys in solving multiprocessor scheduling
problems. Second, because parallel processing and pipelining techniques can be used in
multiprocessor systems, we can take advantage of the case where response time requirement
D is greater than request interarrival time T to save more energy.
3.5.2.1 The STREAM-MP-D-ST Problem The STREAM-MP-D-ST problem deals
with scheduling a streaming application represented by a single task whose workload is
deterministic on a multiprocessor system. Since the streaming application is only represented
by a single task, the whole application must be executed on a single processor. However,
we could potentially execute different instances of the streaming application on different
processors to serve different requests. Specifically, we have a processor that acts as a master
to receive requests and distribute them in a round-robin fashion to other processors, each
acting as a slave and running an instance of the streaming application. The master can
be placed on the administrative processor (e.g., the PPE in CELL [1]), or on the same processor
as a slave because the master has a very light workload. Therefore, we ignore the energy
consumption of the master. Thus, the basic question of the STREAM-MP-D-ST problem is
how to decide the number of processors (slaves) and the speed for each processor.
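The trade-off between static and dynamic power in choosing the number of slaves can be illustrated with a small sketch. It is not the algorithm of Section 5.2. It assumes the ideal processor model with power c0 + c f^alpha, a deterministic workload of W cycles per request, round-robin distribution so that each of m slaves receives one request every m T time units, and that every active slave consumes the idle power c0 all the time. All names and numbers are hypothetical.

```python
import math


def energy_per_request(m, W, T, D, c0, c, alpha=3.0):
    """Energy per request when m slaves share the request stream (illustrative)."""
    budget = min(D, m * T)                  # time a slave may spend on one request
    f = W / budget                          # constant speed for a deterministic workload
    dynamic = c * f ** (alpha - 1) * W      # c * f**alpha * (W / f)
    static = c0 * m * T                     # m active processors for one period
    return static + dynamic


def best_processor_count(W, T, D, c0, c, alpha=3.0):
    """Number of slaves with the lowest energy per request under the assumptions."""
    # Beyond ceil(D / T) slaves the time budget stays at D, so more slaves only
    # add static energy.
    m_max = max(1, math.ceil(D / T))
    return min(range(1, m_max + 1),
               key=lambda m: energy_per_request(m, W, T, D, c0, c, alpha))


if __name__ == "__main__":
    for c0 in (0.0, 0.1, 10.0):
        print("c0 =", c0, "-> slaves:", best_processor_count(1.0, 1.0, 4.0, c0, 1.0))
```

In this sketch, negligible static power favors using as many slaves as the response time allows, while high static power favors a single slave running faster.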
3.5.2.2 The STREAM-MP-D-TG Problem The STREAM-MP-D-TG problem deals
with scheduling a streaming application represented by a task graph whose workload is de-
terministic on a multiprocessor system. Since the streaming application has a detailed task
graph exposing the parallelism inside the application, classic pipelining and parallel process-
ing techniques can be applied in multiprocessor task graph scheduling. Pipelining is used to
exploit the parallelism in time (indicated by predecessor and successor relationship in task
graphs) and parallel processing is employed to take advantage of the parallelism in space
(indicated by sibling relationship in task graphs). The basic questions of the STREAM-MP-
D-TG problem are how to decide: (i) the number of active processors to execute the task
graph; (ii) the mapping from tasks to active processors; (iii) the execution speed of each
task. Note that these three questions are correlated and will be addressed simultaneously.
3.5.2.3 The STREAM-MP-S-ST and STREAM-MP-S-TG Problems These two
problems are similar to their deterministic counterparts except that the computational re-
quirement of each task is variable and unpredictable. As in the uniprocessor case, the
variability of the workload can be captured by a probability distribution function, and the
objective is to minimize the expected energy consumption. Dynamic slack reclamation tech-
nique is indispensable in dealing with a stochastic workload. Thus, the scheduling algorithms
for these two problems must consider reclaiming dynamic slack generated in the system. The
basic question of these two problems is whether we can extend the algorithms for their deterministic
counterparts or whether we need to design the algorithms for these two problems from
scratch.
Table 2: Synthetic task graphs
Task graph                 # of tasks    # of edges
kseries parallel           20 - 62       19 - 61
creds1                     9 - 13        10 - 18
simple                     11 - 24       12 - 32
kbasic tables              19            21
kseries parallel xover     21 - 38       24 - 41
bugtest                    49            60
kbasic task                18 - 64       18 - 79
kextended                  21 - 23       25 - 28
packets                    6 - 8         5 - 8
3.6 EVALUATION METHODOLOGY
The solutions to the problems described in Section 3.5 are evaluated through simulations.
The purpose of the evaluation is two-fold. First, simulation results can provide insight into
the solutions. Second, we want to quantify the gains of our solutions over previously known
solutions.
Next, we describe the workload generation and processor models used in our simulations.
3.6.1 Workload Generation
Both synthetic and real-world workloads are used in the simulations. Synthetic task graphs
(Table 2) are generated by TGFF v3.0 [53] using the sample input files that come with the
software package. For stochastic workloads, we use six representative probability distribu-
tions (Figure 2) to generate the number of execution cycles of a task. The distributions
include one uniform distribution, three unimodal distributions, and two bimodal distribu-
tions.
[Figure 2: Probability functions, panels (a) to (f): uniform, unimodal1, unimodal2, unimodal3, bimodal1,
bimodal2 (from left to right). The Y-axis is probability and the X-axis is the number of execution
cycles.]
For real-world task graphs, we used automatic target recognition (ATR) [43], which is
a streaming application that does pattern matching of targets in images/pictures. In ATR,
the regions of interest (ROI) in one image are detected and each ROI is compared with a
number of stored templates. The number of targets detected in each frame varies from 0
to 8. Image processing time is proportional to the number of detections within an image.
A typical platform for ATR is an unmanned autonomous vehicle (UAV). ATR must sustain a
required incoming rate of images and process each image within a required amount of time
to meet the UAV mission requirement (see footnote 1).
[Figure 3: Task graph of ATR when three targets are detected. Vertices by level: level 1: A; level 2: B, C, D; level 3: E, F, G; level 4: H, I, J; level 5: K, L, M; level 6: N.]
The task graph of ATR differs depending on the number of target detections in an image.
Figure 3 shows the one corresponding to 3 target detections. Each of the three paths,
B → E → H → K, C → F → I → L, and D → G → J → M, corresponds to the
processing for one target detection. Note that these three paths have the same structure and
workload since ATR does the same processing for different target detections.
3.6.2 Processor Models
We use the following three processor models (all belong to the realistic model) in our simu-
lations.
1. Synthetic processor. It strictly conforms to the $p(f) = f^3$ power-frequency relation and
has 10 discrete frequencies ranging from 100 MHz to 1 GHz in 100 MHz steps; its idle
power is zero and there is no speed change overhead.
Footnote 1: The throughput and response time QoS requirements are explicitly specified in the ATR documentation.
2. Intel XScale (Table 3) [64]. The idle power of XScale is 40 mW [19], which is one half
the power consumed at the minimum frequency. Figure 4(a) shows that the approximate
analytical power function obtained by curve fitting is a good approximation of the actual
power function. The time and energy penalties for each speed change are reported as
12µs and 1.2µJ, respectively, in [63]. We assume that these numbers are worst-case speed
change overheads, which are equivalent to the overheads incurred when changing from
the minimum speed to the maximum speed in our power model described in Section 3.
Accordingly, we derived the values of the constants ξ1 and ξ2 in Equations (3.1)-(3.2) to
be used in our experiments.
3. IBM PowerPC 405LP (Table 4) [54]. The approximate analytical power function obtained
by curve fitting is shown in Figure 4(b), which indicates that the approximation
is not as good as it is for XScale. The idle power is assumed to be half the power consumed
at the minimum frequency. The time and energy penalties for each speed change are
reported as 1ms and 750µJ, respectively, in [54]. We also translated these numbers into
our power model as in the case of XScale.
Table 3: XScale speed settings and power consumptions
Speed (MHz)    Idle    150     400     600     800     1000
Voltage (V)    0.75    0.75    1.0     1.3     1.6     1.8
Power (mW)     40      80      170     400     900     1600

Table 4: PowerPC 405LP speed settings and power consumptions
Speed (MHz)    Idle    33      100     266     333
Voltage (V)    1.0     1.0     1.0     1.8     1.9
Power (mW)     9.5     19      72      600     750
[Two plots of power (mW) against frequency (MHz), each comparing the actual power with the
fitted curve: (a) XScale, fitted curve 40 + 1560(f/1000)^3; (b) PowerPC 405LP, fitted curve
9.5 + 740.5(f/333)^3]
Figure 4: Approximate analytical power function vs. actual power function
The power consumptions listed in Tables 3 and 4 are obtained by measuring the pro-
cessor power when running certain benchmarks. In practice, different tasks have different
instruction mixes, thus resulting in different dynamic power consumptions. We associate
each task with a power scaling factor to simulate this reality. For example, if a task’s power
scaling factor is 0.9, it will consume 40 + (400 − 40) × 0.9 = 364 mW when executing at
frequency 600MHz for XScale.
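The scaling rule in the example above can be sketched in a few lines of Python. This is only an illustration: the function name `scaled_power_mw` and the dictionary layout are assumptions made here, while the power values are copied from Table 3.

```python
# Power values (mW) copied from Table 3; names and layout are hypothetical.
XSCALE_POWER_MW = {150: 80, 400: 170, 600: 400, 800: 900, 1000: 1600}
XSCALE_IDLE_MW = 40


def scaled_power_mw(freq_mhz, scaling_factor,
                    table=XSCALE_POWER_MW, idle_mw=XSCALE_IDLE_MW):
    """Idle power plus the dynamic part (table power minus idle power),
    with the dynamic part multiplied by the task's power scaling factor."""
    return idle_mw + (table[freq_mhz] - idle_mw) * scaling_factor


# The case discussed in the text: factor 0.9 at 600 MHz on XScale.
power = scaled_power_mw(600, 0.9)
print(power)
```

A factor of 1.0 reproduces the table entry, and a factor of 0.0 leaves only the idle power.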
4.0 SCHEDULING IN UNIPROCESSOR SYSTEMS
4.1 OVERVIEW
In this chapter, we consider energy-aware uniprocessor scheduling problems for streaming
applications. Since the STREAM-UP-D-ST and STREAM-UP-D-TG problems have been
well studied, we will focus on the problems of STREAM-UP-S-ST and STREAM-UP-S-TG
(i.e., scheduling stochastic workload in uniprocessor systems).
The problems of STREAM-UP-S-ST and STREAM-UP-S-TG are closely related to
frame-based hard real-time systems, which are special cases of periodic real-time systems.
In frame-based real-time systems, all tasks share the same period (also called the frame)
and deadlines are equal to the end of the period. In each frame, tasks are executed in a
predetermined fixed order. Thus, the STREAM-UP-S-ST problem can be modeled as a
frame-based system in which there is only a single task, while the STREAM-UP-S-TG problem
can be modeled as a regular frame-based system in which the task execution order is obtained
by a topological sort of the task graph. For frame-based systems, the problem of designing
a DVS scheme can be reduced to determining the amount of time allotted to a task (and
accordingly, deciding the speed(s) to execute it) before it is dispatched to the system.
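As an illustration of obtaining a fixed execution order by topological sort, the following Python sketch orders the tasks of the ATR graph in Figure 3 with the standard-library `graphlib` module (Python 3.9 or later). The three chains are taken from the text; the fan-out from A and the join at N are assumptions read off the level layout of the figure, and the helper name `frame_order` is hypothetical.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Map each task to the set of tasks that must finish before it starts.
# Chains B-E-H-K, C-F-I-L, D-G-J-M come from the text; the edges out of A
# and into N are assumed from the figure's levels.
atr_predecessors = {
    "B": {"A"}, "C": {"A"}, "D": {"A"},
    "E": {"B"}, "F": {"C"}, "G": {"D"},
    "H": {"E"}, "I": {"F"}, "J": {"G"},
    "K": {"H"}, "L": {"I"}, "M": {"J"},
    "N": {"K", "L", "M"},
}


def frame_order(predecessors):
    """Return one valid execution order for a frame (any topological order)."""
    return list(TopologicalSorter(predecessors).static_order())


order = frame_order(atr_predecessors)
print(order)
```

Any topological order is acceptable for the frame-based model; which one is returned depends on the library's internal tie-breaking.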
In the STREAM-UP-S-ST and STREAM-UP-S-TG problems, the workload of streaming
applications is variable and unpredictable. In these applications, tasks usually run for fewer
cycles than their worst-case execution cycles (WCEC¹), creating the opportunity for dynamic
slack reclamation to slow down future tasks. Furthermore, the actual number of execution
cycles of a task is unknown and cannot be predicted before execution. Because of these
characteristics, no DVS scheme can guarantee minimal energy consumption in the system.
However, if the variability of the workloads can be captured
by the probability distribution of the computational requirement of each task in the system
(e.g., through profiling), it is possible to design DVS schemes that minimize the expected
system energy consumption. This means that over a large number of frames, such DVS schemes
will achieve the greatest total energy savings, even though they do not necessarily result in
lower energy consumption than other DVS schemes for any given frame.
¹ The traditional worst-case execution time (WCET) in real-time systems is defined as the running time
at the maximum processor speed when a task takes its WCEC.
In this chapter, we investigate DVS schemes for frame-based hard real-time systems with
the goal of minimizing the expected energy consumption in the system while meeting the
deadlines, given probabilistic information of the workloads. We carry out the investigation
in two dimensions. The first dimension is the DVS strategy, which can be categorized as
inter-task, intra-task, and hybrid DVS (refer to Section 2.4 for definitions of these strategies).
Intra-task DVS schemes are solutions to the STREAM-UP-S-ST problem, while inter-task
and hybrid DVS schemes are solutions to the STREAM-UP-S-TG problem. The second
dimension is the processor power model, for which we have the ideal and realistic models.
Because of the great simplicity of the ideal model, it is possible to derive elegant and optimal
DVS schemes through which we gain insight into the problems that we are dealing with.
Furthermore, the DVS schemes derived under the ideal processor model could serve as the
basis for deriving DVS schemes under the realistic model. Our ultimate goal is to obtain
good DVS schemes under the realistic model.
Table 5 shows the road map of our investigation in this chapter. We first present optimal
DVS schemes for each DVS strategy under the ideal processor model in Sections 4.2.1, 4.2.2,
and 4.2.3. We then give a unified view of all these optimal schemes in Section 4.2.4. DVS
schemes for each DVS strategy under the realistic processor model are presented in Section
4.3. These schemes are designed using different approaches. In Section 4.4, we propose a
unified approach for obtaining optimal (or provably close to optimal) DVS schemes for all
DVS strategies under the realistic processor model.
Table 5: The road map of our investigation; cited work was done by other researchers prior
to this dissertation
Problem               STREAM-UP-S-ST     STREAM-UP-S-TG
Model/DVS strategy    Intra-task DVS     Inter-task DVS     Hybrid DVS
Ideal model           PACE [40]          OITDVS             GOPDVS [72]
                      (Section 4.2.1)    (Section 4.2.2)    (Section 4.2.3)
Realistic model       PPACE              PITDVS, PITDVS2    PGOPDVS, PIT-PPACE
                      (Section 4.3.1)    (Section 4.3.2)    (Section 4.3.3)
                      A unified approach (Section 4.4)
4.2 SOLUTIONS UNDER IDEAL PROCESSOR MODEL
4.2.1 Optimal Intra-Task Scheme
The ideal processor model assumes the ability to set speed/frequency for each cycle executed
by a task, since there is no speed change overhead. Thus, the problem of deriving an intra-
task DVS scheme is to find the execution speed for each cycle of a task (also called the speed
schedule of the task) to minimize the expected energy consumption while ensuring that the
task will finish by the deadline D. An intra-task DVS scheme is of great importance because
it could be used as a building block for general frame-based real-time systems containing
multiple tasks: a time Di is allotted to each task, and the intra-task DVS scheme is applied
for each task. It can also be used as the DVS scheme for a special case of frame-based
systems, where there is only a single task in a frame (e.g., for specialized embedded devices
like MP3 players).
The optimal intra-task DVS scheme, PACE, was derived in [40]. We briefly review
its derivation for completeness. Since there is only a single task, we will omit the subscripts
of the parameters of the task unless confusion arises. Let the execution speed of the $i$th
cycle of task $\tau$ be denoted by $s_i$. The problem can be formally expressed by the following
mathematical program [40]:

$$\text{Minimize} \quad \sum_{1 \le i \le W} F_i \cdot c \cdot s_i^{\alpha-1} \qquad (4.1)$$
$$\text{Subject to} \quad \sum_{1 \le i \le W} \frac{1}{s_i} \le D \qquad (4.2)$$
where $F_i = 1 - cdf(i-1)$ represents the probability of executing the $i$th cycle of task $\tau$. Note
that $c$ in (4.1) reflects the effective switching capacitance of task $\tau$. By using the Lagrangian
technique [36] or Jensen's inequality [35], the optimal solution to the problem in (4.1)-(4.2)
is

$$s_i = \frac{\sum_{j=1}^{W} F_j^{1/\alpha}}{F_i^{1/\alpha} D} \qquad (4.3)$$

and the optimal expected energy consumption, obtained by substituting (4.3) into (4.1), is

$$e^* = \frac{c \left( \sum_{j=1}^{W} F_j^{1/\alpha} \right)^{\alpha}}{D^{\alpha-1}} \qquad (4.4)$$
Several claims can be made based on (4.3) and (4.4): (i) the optimal expected energy
consumption of a task is inversely proportional to $D^{\alpha-1}$; (ii) $s_1 \le s_2 \le \ldots \le s_W$ [40],
since $cdf(\cdot)$ is a nondecreasing function. Using lower speeds for early cycles makes
sense because early cycles have a higher probability of being executed than later cycles; (iii) when
changing speed during the execution of a task, the optimal speed to be used is always inversely
proportional to the remaining time until the deadline. This claim is not straightforward and
needs a little clarification. Let $\eta_i = F_i^{1/\alpha} / \sum_{j=1}^{W} F_j^{1/\alpha}$, so that (4.3) can be rewritten as
$s_i = \frac{1}{\eta_i D}$. One can view $\eta_i$ as the fraction of the time $D$ to be allotted to the $i$th cycle.
Claim (iii) is obviously true for executing the 1st cycle and we will show that it is also true for
the $i$th cycle, where $i > 1$. Before executing the $i$th cycle, the remaining time until the deadline is
$D - \sum_{j=1}^{i-1} \eta_j D = D(1 - \sum_{j=1}^{i-1} \eta_j)$ and the time allotted to the $i$th cycle is $\eta_i D$. If $\beta_i$ is the
fraction of the remaining time dedicated to the $i$th cycle, that is,

$$\beta_i = \frac{\eta_i}{1 - \sum_{j=1}^{i-1} \eta_j}$$

we have

$$s_i = \frac{1}{\beta_i D \left(1 - \sum_{j=1}^{i-1} \eta_j\right)}$$

Thus claim (iii) above can be made for all cycles of the task.
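The closed form (4.3) is easy to evaluate numerically. The sketch below is an illustration only: the function name `pace_schedule`, the list-based representation of the probability function, and the toy distribution are assumptions made here, and the last entry of the distribution is assumed to be positive so that every $F_i$ is nonzero.

```python
def pace_schedule(pmf, deadline, alpha=3.0, c=1.0):
    """Per-cycle speeds from Equation (4.3) and the expected energy (4.1).

    pmf[x-1] is the probability that the task needs exactly x cycles,
    x = 1..W, with pmf[-1] > 0 (assumption of this sketch).
    """
    # F[i-1] = 1 - cdf(i-1): probability that cycle i is executed.
    F, tail = [], 1.0
    for p in pmf:
        F.append(tail)
        tail -= p
    roots = [f ** (1.0 / alpha) for f in F]
    total = sum(roots)
    speeds = [total / (r * deadline) for r in roots]
    energy = sum(f * c * s ** (alpha - 1) for f, s in zip(F, speeds))
    return speeds, energy


# Toy task: 1 to 4 cycles, equally likely, deadline of 2 time units.
speeds, energy = pace_schedule([0.25, 0.25, 0.25, 0.25], deadline=2.0)
print(speeds, energy)
```

In this toy instance the speeds come out nondecreasing and the worst-case running time, the sum of `1 / s` over all cycles, uses up the whole deadline, in line with claims (ii) and (iii).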
4.2.2 Optimal Inter-Task Scheme
The key for inter-task DVS schemes is how to allot the slack in the system (or equivalently,
the available system time) to the tasks. In doing so, one needs to take into consideration the
variability of the workload and possible dynamic slack reclamation. Next, we will describe
the derivation of our optimal scheme under the ideal processor model.
The optimal DVS scheme is based on an important property: the optimal expected
energy consumption of a sequence of tasks is inversely proportional to $t^{\alpha-1}$, where $t$ is the
amount of allotted time. To introduce this property, we start by establishing a preliminary
lemma on the energy consumption of a single task.
Lemma 1. If a task $\tau$ cannot change speed during the execution, its optimal expected energy
consumption is inversely proportional to $t^{\alpha-1}$, where $t$ is the amount of time allotted to
execute $\tau$.
Proof. Suppose that $W$ is the worst-case number of execution cycles of $\tau$, and $P(x)$ is the
probability that $\tau$ executes for $x$ cycles. Obviously, we should use the lowest possible speed,
$W/t$, such that $\tau$ will finish in time $t$ in the worst case. Therefore, the optimal expected energy
consumption of executing $\tau$ is

$$\sum_{x=1}^{W} P(x)\, p\!\left(\frac{W}{t}\right) \frac{x}{W/t} = \frac{W^{\alpha-1} c \sum_{x=1}^{W} P(x)\, x}{t^{\alpha-1}} \qquad (4.5)$$

which is inversely proportional to $t^{\alpha-1}$ (recall that $p(\cdot)$ is the power function).
Interestingly, the result of Lemma 1 still holds for multiple tasks.
Theorem 1. If tasks cannot change speed during their execution, the optimal expected energy
consumption of executing $N$ tasks $\tau_1, \tau_2, \ldots, \tau_N$ consecutively is inversely proportional to
$t^{\alpha-1}$, where $t$ is the amount of time allotted to execute these tasks.
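Proof. Suppose that the worst-case number of execution cycles of $\tau_i$ is $W_i$, and the probability
that $\tau_i$ executes for $x$ cycles is $P_i(x)$.
Let the optimal expected energy consumption of executing tasks $\tau_i, \tau_{i+1}, \ldots, \tau_N$ consecutively
with allotted time $d$ be denoted by $E(i, d)$. We will prove by induction that $E(1, d)$
is inversely proportional to $d^{\alpha-1}$.
The induction is on $i$. The base case for $E(N, d)$ is obviously true by Lemma 1.
In the induction step, assume that $E(i+1, d)$ is inversely proportional to $d^{\alpha-1}$, that is,
$E(i+1, d) = \frac{C_{i+1}}{d^{\alpha-1}}$, where $C_{i+1}$ only depends on $W_{i+1}, \ldots, W_N$ and $P_{i+1}(x), \ldots, P_N(x)$. To
compute $E(i, d)$, we first compute a helper function, $\tilde{E}(i, d, \beta)$, which denotes the expected
energy consumption of executing tasks $\tau_i, \tau_{i+1}, \ldots, \tau_N$ with allotted time $d$ when allotting a
fraction $\beta$ ($0 < \beta \le 1$) of time $d$ to task $\tau_i$ and allotting the remaining time, $(1 - \beta)d$, to
tasks $\tau_{i+1}, \ldots, \tau_N$. The running speed for $\tau_i$ is obviously $\frac{W_i}{\beta d}$, and the time left for executing
tasks $\tau_{i+1}, \tau_{i+2}, \ldots, \tau_N$ is $d - x / \frac{W_i}{\beta d}$ when $\tau_i$ executes $x$ cycles. Therefore,

$$\begin{aligned}
\tilde{E}(i, d, \beta)
&= \sum_{x=1}^{W_i} P_i(x) \left( p\!\left(\frac{W_i}{\beta d}\right) x \Big/ \frac{W_i}{\beta d} + \frac{C_{i+1}}{\left(d - x / \frac{W_i}{\beta d}\right)^{\alpha-1}} \right) \\
&= \sum_{x=1}^{W_i} P_i(x) \left( x c_i \left(\frac{W_i}{\beta d}\right)^{\alpha-1} + \frac{C_{i+1}}{\left(d - \frac{x \beta d}{W_i}\right)^{\alpha-1}} \right) \\
&= \frac{\sum_{x=1}^{W_i} P_i(x) \left( x c_i \left(\frac{W_i}{\beta}\right)^{\alpha-1} + \frac{C_{i+1}}{\left(1 - \frac{x \beta}{W_i}\right)^{\alpha-1}} \right)}{d^{\alpha-1}} \\
&= \frac{\frac{W_i^{\alpha-1} c_i \sum_{x=1}^{W_i} P_i(x)\, x}{\beta^{\alpha-1}} + C_{i+1} \sum_{x=1}^{W_i} \frac{P_i(x)}{\left(1 - \frac{x \beta}{W_i}\right)^{\alpha-1}}}{d^{\alpha-1}} \\
&= \frac{\phi_i(\beta) + \varphi_i(\beta)}{d^{\alpha-1}}
\end{aligned}$$

where $\phi_i(\beta) = \frac{W_i^{\alpha-1} c_i \sum_{x=1}^{W_i} P_i(x)\, x}{\beta^{\alpha-1}}$ and $\varphi_i(\beta) = C_{i+1} \sum_{x=1}^{W_i} \frac{P_i(x)}{\left(1 - \frac{x \beta}{W_i}\right)^{\alpha-1}}$.
Let

$$C_i = \min_{0 < \beta \le 1} \left( \phi_i(\beta) + \varphi_i(\beta) \right)$$

and

$$\beta_i = \arg\min_{0 < \beta \le 1} \left( \phi_i(\beta) + \varphi_i(\beta) \right)$$

Then, the minimum expected energy consumption, $E(i, d)$, is

$$E(i, d) = \tilde{E}(i, d, \beta_i) = \frac{C_i}{d^{\alpha-1}}$$

Substituting 1 for $i$ and $t$ for $d$ in $E(i, d)$ completes the proof.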
Theorem 1 shows that the optimal expected energy consumption of a sequence of tasks
is of the same form as that of a single task, that is, both are inversely proportional to the
(α − 1)st power of the allotted time. This is a very powerful result because it can be used
to optimize the expected energy consumption for a sequence of tasks. When a sequence of
tasks is to be executed, the tasks are partitioned into two parts: the first task and the rest
of the tasks, which can be treated as one supertask. Thus, the problem of allotting time to
multiple tasks is reduced to allotting time to just two tasks, which can be efficiently solved
thanks to the nice form of the power function in the ideal processor model. In fact, this is
the basic idea of the proof of Theorem 1.
The proof of Theorem 1 indicates that in order to minimize the expected energy con-
sumption of executing a sequence of tasks within a given amount of time t, one should allot
to the first task a fixed fraction of time t and set the speed such that the first task is guar-
anteed to finish within the time allotted to it in the worst case. When the first task finishes,
the same procedure can be applied recursively to the rest of the tasks, with the time left
until the deadline.
The proof of Theorem 1 also leads to the computation of the time allocation fraction
for each task. As in the proof, let Ci denote the constant in the optimal expected energy
consumption of executing τi, τi+1, . . . , τN consecutively and βi denote the optimal time allo-
cation fraction for τi. We compute Ci and βi in the reverse order. That is, first compute
CN , βN , then CN−1, βN−1, . . . , and last C1, β1. The efficiency of the algorithm depends on
how to find the minimum value of φi(β) + ϕi(β). We do not have a closed-form formula
for it. However, by computing the first and second derivatives of φi(β) and ϕi(β), we find
that φi(β) is a convex decreasing function and ϕi(β) is a convex increasing function. It
follows that φi(β) + ϕi(β) is a convex function with a single global minimum on
0 < β ≤ 1. Thus, the minimum of φi(β) + ϕi(β) can be found efficiently, to
any desired precision, by existing numerical optimization methods such as
gradient descent. The algorithm for computing the time allocation fractions, β1, β2, . . . , βN ,
is shown in Algorithm 4.1.
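Algorithm 4.1 OITDVS-Offline($[W_1, W_2, \ldots, W_N]$, $[P_1(x), P_2(x), \ldots, P_N(x)]$)
  $\beta_N := 1$
  $C_N := W_N^{\alpha-1} c_N \sum_{x=1}^{W_N} P_N(x)\, x$
  for $i := N - 1$ downto $1$ do
      $F(\beta) := \left(\frac{W_i}{\beta}\right)^{\alpha-1} c_i \sum_{x=1}^{W_i} P_i(x)\, x + \sum_{x=1}^{W_i} \frac{P_i(x)\, C_{i+1}}{\left(1 - \frac{x \beta}{W_i}\right)^{\alpha-1}}$
      $C_i := \min_{0 < \beta \le 1} F(\beta)$
      $\beta_i := \arg\min_{0 < \beta \le 1} F(\beta)$
  end for
  return $[\beta_1, \beta_2, \ldots, \beta_N]$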
Assuming that the probability distributions are stable, the computation of $\beta_i$ can be
done offline for all values of $i$. The online part, which selects the speed at which a task
executes, is straightforward: when starting the execution of task $\tau_i$ with time $d$ left
for executing $\tau_i, \tau_{i+1}, \ldots, \tau_N$, we allocate time $\beta_i d$ to $\tau_i$ and set the speed to $\frac{W_i}{\beta_i d}$.
The offline and online parts described above form the Optimal Inter-Task DVS scheme
under the ideal processor model, which we call the OITDVS scheme.
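A minimal Python sketch of the offline and online parts follows. It is not the dissertation's implementation: the function names, the 0-based task indices, and the use of a ternary search (chosen here because the objective is convex, in place of gradient descent) are assumptions of this sketch. The search interval stops just short of 0 and 1 to avoid the singularities of the objective.

```python
def _argmin_convex(f, lo, hi, iters=200):
    """Ternary search for the minimizer of a convex function on [lo, hi]."""
    for _ in range(iters):
        m1 = lo + (hi - lo) / 3.0
        m2 = hi - (hi - lo) / 3.0
        if f(m1) <= f(m2):
            hi = m2
        else:
            lo = m1
    return (lo + hi) / 2.0


def oitdvs_offline(W, P, c, alpha=3.0):
    """Time allocation fractions following Algorithm 4.1.

    W[i]: WCEC of task i; P[i][x-1]: probability that task i runs x cycles;
    c[i]: capacitance constant of task i. Returns (betas, C), 0-based.
    """
    n = len(W)
    mean = [sum(p * x for x, p in enumerate(P[i], start=1)) for i in range(n)]
    betas, C = [0.0] * n, [0.0] * n
    betas[n - 1] = 1.0
    C[n - 1] = W[n - 1] ** (alpha - 1) * c[n - 1] * mean[n - 1]
    eps = 1e-9  # keep beta strictly inside (0, 1)
    for i in range(n - 2, -1, -1):
        def F(beta, i=i):
            first = (W[i] / beta) ** (alpha - 1) * c[i] * mean[i]
            rest = sum(p * C[i + 1] / (1.0 - x * beta / W[i]) ** (alpha - 1)
                       for x, p in enumerate(P[i], start=1) if p > 0)
            return first + rest
        betas[i] = _argmin_convex(F, eps, 1.0 - eps)
        C[i] = F(betas[i])
    return betas, C


def oitdvs_online_speed(i, time_left, W, betas):
    """Speed for task i when time_left remains for tasks i..N-1."""
    return W[i] / (betas[i] * time_left)


# Two identical tasks that always take their WCEC of 4 cycles.
W = [4, 4]
P = [[0, 0, 0, 1], [0, 0, 0, 1]]
betas, C = oitdvs_offline(W, P, c=[1.0, 1.0])
print(betas, C, oitdvs_online_speed(0, 2.0, W, betas))
```

With two identical deterministic tasks the search splits the time roughly in half, which matches the intuition that both tasks should then run at the same constant speed.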
4.2.3 Optimal Hybrid Scheme
Since intra-task DVS only focuses on speed scheduling within a single task, one would wonder
whether it could be applied to the case of multiple tasks. A naive extension to an intra-task
DVS scheme (e.g., PACE [40]) would treat all the tasks as a single supertask and derive its
parameters (WCEC and probability function) from those of the original tasks. Unfortunately,
for this supertask, using PACE will generally result in more energy consumption than the
DVS schemes that simply use dynamic slack reclamation (e.g., the Proportional scheme [48]).
Appendix B provides an illustrative example to clarify this point. The reason why this naive
extension fails is that treating all tasks as a single supertask results in loss of information
(e.g., the naive extension cannot determine when tasks terminate), losing the opportunity for
dynamic slack reclamation, which is an indispensable element for DVS schemes for multiple
tasks. Therefore, intra-task DVS needs to be combined with inter-task DVS for possible
further energy savings over DVS schemes that use only intra-task DVS or inter-task DVS
alone.
The optimal hybrid DVS scheme was proposed by Zhang et al. in [72], where it is called the
Global OP-DVS scheme (OP stands for optimal). Throughout this dissertation, we call this
scheme GOPDVS. The derivation of GOPDVS is similar to that of OITDVS described in
Section 4.2.2, as a task can be treated as $W$ one-cycle subtasks, where $W$ is the WCEC of
the task. For the sake of completeness, we rewrite the offline part of
GOPDVS using our notation in Algorithm 4.2. In contrast to our OITDVS scheme, the
output of the procedure GOPDVS-Offline is the time allocation fraction $\beta_{ij}$ for the $j$th cycle
of task $\tau_i$, where $i = 1, 2, \ldots, N$ and $j = 1, 2, \ldots, W_i$. More technical details can be found in
[72].
The online part of GOPDVS goes as follows: when starting the execution of the $j$th cycle
of task $\tau_i$ with time $d$ remaining in the frame, allocate time $\beta_{ij} d$ to the $j$th cycle of
task $\tau_i$ and set the speed to $\frac{1}{\beta_{ij} d}$.
Algorithm 4.2 GOPDVS-Offline($[W_1, W_2, \ldots, W_N]$, $[P_1(x), P_2(x), \ldots, P_N(x)]$)
  $C_{N+1} := 0$
  for $i := N$ downto $1$ do
      $C := C_{i+1}$
      for $j := W_i$ downto $1$ do
          $\beta_{ij} := \frac{\sqrt[\alpha]{c_i}}{\sqrt[\alpha]{c_i} + \sqrt[\alpha]{C}}$
          $q := \frac{1 - cdf_i(j-1)}{1 - cdf_i(j-2)}$   {let $cdf_i(-1) = 0$}
          $C := q \left( \sqrt[\alpha]{c_i} + \sqrt[\alpha]{C} \right)^{\alpha} + (1 - q)\, C_{i+1}$
      end for
      $C_i := C$
  end for
  return $[\beta_{ij}]$, $i = 1, 2, \ldots, N$, $j = 1, 2, \ldots, W_i$
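Algorithm 4.2 translates almost line by line into Python. The sketch below is a transcription under assumptions made here: 0-based task indices, probability functions given as lists, and a last entry `P[i][-1] > 0` so that the denominator of `q` never vanishes. The function name is hypothetical.

```python
def gopdvs_offline(W, P, c, alpha=3.0):
    """Per-cycle time allocation fractions following Algorithm 4.2.

    W[i]: WCEC of task i; P[i][x-1]: probability that task i runs x cycles
    (P[i][-1] > 0 assumed); c[i]: capacitance constant of task i.
    Returns (beta, C): beta[i][j-1] is the fraction for cycle j of task i,
    C[i] the energy constant for tasks i..N-1, with C[N] = 0.
    """
    n = len(W)
    C = [0.0] * (n + 1)
    beta = [[0.0] * W[i] for i in range(n)]
    for i in range(n - 1, -1, -1):
        # cdf[k] = P(task i needs at most k cycles), k = 0..W[i].
        cdf = [0.0]
        for p in P[i]:
            cdf.append(cdf[-1] + p)

        def cdf_at(k):
            return cdf[k] if k >= 0 else 0.0  # cdf(-1) is taken as 0

        root_c = c[i] ** (1.0 / alpha)
        cur = C[i + 1]
        for j in range(W[i], 0, -1):
            root_cur = cur ** (1.0 / alpha)
            beta[i][j - 1] = root_c / (root_c + root_cur)
            q = (1.0 - cdf_at(j - 1)) / (1.0 - cdf_at(j - 2))
            cur = q * (root_c + root_cur) ** alpha + (1.0 - q) * C[i + 1]
        C[i] = cur
    return beta, C


# Single task with 1 to 4 cycles, equally likely.
beta, C = gopdvs_offline([4], [[0.25, 0.25, 0.25, 0.25]], c=[1.0])
print(beta[0], C[0])
```

For a single task the last cycle receives all of the remaining time, and the constant `C[0]` can be compared against the closed form (4.4) as a sanity check on the PACE special case.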
4.2.4 A Unified View
We have presented the optimal DVS schemes for different DVS strategies under the ideal
processor model. All of them are targeted at frame-based real-time systems. PACE works for
frame-based systems with only a single task, and OITDVS and GOPDVS work for general
frame-based systems. In fact, PACE is a special case of GOPDVS.
For frame-based real-time systems, the optimal schemes using different DVS strategies
look surprisingly similar, which can be attributed to the assumption of unrestricted con-
tinuous frequency, zero speed change overhead, and the nice form of the power-frequency
relation. We give a unified view of these optimal DVS schemes in order to appreciate their
commonality.
1. The optimal expected energy consumption of a frame, whether using a fixed speed or
intra-task DVS for each individual task, is inversely proportional to $D^{\alpha-1}$, where $D$ is
the frame length;
2. When scheduling speed changes in a frame, whether at the task boundary in inter-task
DVS or for each cycle in intra-task DVS, the optimal speed to be used is inversely
proportional to the remaining time in the frame;
3. If a task is executed using intra-task DVS, no matter what position it is in the frame,
the optimal speed is always nondecreasing during the execution of the task to minimize
the expected energy consumption.
It is noteworthy that although the optimal hybrid DVS scheme achieves better energy
savings than the optimal inter-task DVS scheme for general frame-based systems, inter-task
DVS schemes are easier to implement than hybrid DVS schemes. In hybrid DVS schemes,
timer-like interrupt mechanisms are needed to perform speed changes during the execution
of a task.
4.3 SOLUTIONS UNDER REALISTIC PROCESSOR MODEL
In the previous section, we have seen that elegant and optimal DVS schemes exist for different
DVS strategies under the ideal processor model. However, the assumptions of the ideal
processor model are oversimplified and do not hold in practice. This implies that those
optimal DVS schemes might be problematic when used in the real world. In this section, we
investigate DVS schemes for different DVS strategies under the realistic processor model.
There are generally two approaches in designing DVS schemes under the realistic pro-
cessor model. The first approach is to patch the DVS schemes obtained under the ideal
processor model, while the second approach is to design DVS schemes that consider the
realistic model from the outset. The advantage of the first approach is its simplicity, but it
could be far from optimal in terms of energy savings due to its post-processing nature. The
second approach is expected to outperform the first approach but is also expected to have
higher complexity because the realistic processor model is more complicated than the ideal
processor model. Both approaches will be discussed below.
4.3.1 The Intra-task Scheme
To derive DVS schemes under the realistic processor model, simply patching the speed
schedule obtained from the ideal processor model may result in a significant deviation from
the optimal solution. Thus, our solution, PPACE (Practical PACE), is to attack the problem
under the realistic processor model directly.
4.3.1.1 Problem Formulation
For the ideal processor model, a speed is computed for every cycle
of a task, which is impractical considering that a task usually
takes millions of cycles. Furthermore, changing speed at any cycle is unreasonable
because of the speed change overhead. Real-world operating systems have a granularity
requirement for changing speeds [3, 41]. Thus, we need a schedule with a limited number
of speed scaling points at which the speed may change. This implies that the speed remains
constant between any two adjacent speed scaling points. As in Section 4.2.1, we will omit
the subscripts of the parameters of the task unless confusion arises. We denote the $i$th speed
scaling point of task $\tau$ by $b_i$. Therefore, given $r$ speed scaling points, we partition the range
$[1, W]$ ($W$ is the WCEC of $\tau$) into $r+1$ phases: Phase 0 $= [b_0, b_1 - 1]$, Phase 1 $= [b_1, b_2 - 1]$, ...,
Phase $r$ $= [b_r, b_{r+1} - 1]$, where $b_0 = 1$ and $b_{r+1} = W + 1$ are used to simplify the formulation.
Let us redefine the speed schedule as the speeds of all phases. Our goal is to find a speed
schedule that minimizes the expected energy consumption while still meeting the deadline
$D$. Let the speed for Phase $i$ ($0 \le i \le r$) be denoted by $s_i$, where $s_i \in \{f_1, \ldots, f_M\}$. Let the
energy consumed by a single cycle in Phase $i$ be denoted by $e(s_i)$, where $e(s_i) = \hat{p}(s_i)/s_i$ (recall
from Section 3.3 that $\hat{p}(\cdot)$ does not account for the idle power). Thus, the expected energy
consumed by Phase $i$ is $F_i e(s_i)$, where $F_i = \sum_{b_i \le j < b_{i+1}} (1 - cdf(j-1))$. Let $PC(i) = 1 - cdf(b_i - 1)$
denote the probability that the execution will reach $b_i$ cycles and hence enter Phase
$i$. Assuming that the processor is idle and operating at the minimum frequency at the
beginning of the frame, we obtain the following mathematical program:
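
$$\text{Minimize} \quad PE(f_1, s_0) + F_0 e(s_0) + \sum_{1 \le i \le r} \left( PC(i)\, PE(s_{i-1}, s_i) + F_i e(s_i) \right) \qquad (4.6)$$
$$\text{Subject to} \quad PT(f_1, s_0) + \frac{w_0}{s_0} + \sum_{1 \le i \le r} \left( PT(s_{i-1}, s_i) + \frac{w_i}{s_i} \right) \le D \qquad (4.7)$$
$$s_i \in \{f_1, f_2, \ldots, f_M\} \qquad (4.8)$$

where $PE(\cdot, \cdot)$ and $PT(\cdot, \cdot)$ are the energy and time costs of speed changes (refer to Section
3.3.2) and $w_i = b_{i+1} - b_i$.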
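The per-phase quantities that enter (4.6)-(4.8) can be computed directly from the probability function and the speed scaling points. The following Python sketch is illustrative only; the function name `phase_parameters` and the list-based inputs are assumptions made here.

```python
def phase_parameters(pmf, breakpoints):
    """Compute w_i, F_i and PC(i) for phases 0..r.

    pmf[x-1] is the probability that the task needs exactly x cycles;
    breakpoints = [b_1, ..., b_r] are 1-based cycle indices in increasing
    order. b_0 = 1 and b_{r+1} = W + 1 are added internally.
    """
    W = len(pmf)
    b = [1] + list(breakpoints) + [W + 1]
    # reach[j] = 1 - cdf(j - 1): probability that cycle j is executed.
    # Index 0 is unused so that cycles can be addressed 1-based.
    reach, tail = [0.0], 1.0
    for p in pmf:
        reach.append(tail)
        tail -= p
    phases = range(len(b) - 1)
    w = [b[i + 1] - b[i] for i in phases]
    F = [sum(reach[j] for j in range(b[i], b[i + 1])) for i in phases]
    PC = [reach[b[i]] for i in phases]
    return w, F, PC


# Task with 1 to 4 cycles, equally likely, and one speed scaling point at cycle 3.
w, F, PC = phase_parameters([0.25, 0.25, 0.25, 0.25], breakpoints=[3])
print(w, F, PC)
```

Note that `PC[0]` is always 1, since Phase 0 is entered with certainty, which is why the Phase 0 terms in (4.6) carry no probability factor.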
4.3.1.2 Patching the Schemes Obtained under the Ideal Processor Model
PACE² [40] and GRACE [71] proposed patching the schemes obtained under the ideal processor
model. Both PACE and GRACE apply the well-known cubic-root rule of the power
functions [14] and hence use $e(f) = c'_0 + c'_1 f^2$ ($c'_0$ and $c'_1$ are constants, $f$ is the running speed)
to approximate the actual energy/cycle function. They then relax the constraint on $s_i$ and
assume that $s_i$ is unrestricted and continuous. They also ignore the speed change overhead.
This is equivalent to using the ideal processor model described in Section 3 with $\alpha$ set
to 3. Thus, the minimization problem becomes

$$\text{Minimize} \quad c'_0 D + \sum_{0 \le i \le r} F_i c'_1 s_i^2 \qquad (4.9)$$
$$\text{Subject to} \quad \sum_{0 \le i \le r} \frac{w_i}{s_i} \le D \qquad (4.10)$$
$$0 \le s_i \le \infty \qquad (4.11)$$

Notice that $c'_0$ and $c'_1$ have no effect on deciding the speed schedule. Using the Lagrangian
technique [36] or Jensen's inequality [35], the solution to (4.9)-(4.11) is

$$s_i = \frac{\sum_{0 \le j \le r} w_j F_j^{1/3}}{D\, F_i^{1/3}}$$

² We use the term PACE to refer to both DVS schemes under the ideal and realistic processor models
proposed in [40].
However, si needs to be changed to some available discrete frequency. This is where
PACE and GRACE differ. GRACE is conservative in the sense that it rounds si up to the
closest higher discrete frequency, whereas PACE rounds si to the closest discrete frequency
(rounds up or down). Both schemes have shortcomings. For GRACE, si could be larger
than the highest discrete frequency if Fi is very small. If this happens, si will have to
be rounded down to the highest discrete frequency fM , and therefore the deadline could
be missed (since most of the si's do not have this problem and are rounded
up, the probability of missing the deadline should be reasonably small). For PACE, the chances
of missing the deadline are higher because PACE may round down. To solve this problem,
PACE proposes to linearly scan all the phases to adjust the speeds [39]. GRACE does
not deal with the speed change overhead, while PACE proposes to subtract the maximum
possible time penalties from the allotted time. Appendix A uses an illustrative example to
demonstrate the impact of speed rounding on the quality of the solution.
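The two rounding policies can be compared on a small made-up instance. The Python sketch below evaluates the continuous solution quoted above and then rounds it both ways; it ignores the speed change overhead and the later repair steps of either scheme. All function names, the frequency set, and the numbers are assumptions chosen for illustration, not data from [40] or [71].

```python
import bisect


def continuous_speeds(w, F, deadline):
    """Continuous per-phase speeds from the closed form quoted in the text."""
    total = sum(wj * Fj ** (1.0 / 3.0) for wj, Fj in zip(w, F))
    return [total / (deadline * Fi ** (1.0 / 3.0)) for Fi in F]


def round_up(s, freqs):
    """GRACE-style rounding: closest higher frequency, clamped to the maximum."""
    k = bisect.bisect_left(freqs, s)
    return freqs[min(k, len(freqs) - 1)]


def round_nearest(s, freqs):
    """PACE-style rounding: closest frequency in either direction."""
    return min(freqs, key=lambda f: abs(f - s))


def worst_case_time(w, speeds):
    """Time needed if every phase runs all of its cycles."""
    return sum(wi / si for wi, si in zip(w, speeds))


freqs = [1.0, 2.0, 3.0, 4.0]           # hypothetical frequencies, ascending
w, F, D = [2, 2], [1.75, 0.75], 1.6    # two phases of two cycles each
cont = continuous_speeds(w, F, D)
up = [round_up(s, freqs) for s in cont]
near = [round_nearest(s, freqs) for s in cont]
print(cont, worst_case_time(w, cont))
print(up, worst_case_time(w, up))
print(near, worst_case_time(w, near))
```

In this instance rounding up stays within the deadline, while rounding to the nearest frequency rounds the first phase down and overshoots it, which is the kind of case the linear scan mentioned above has to repair.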
4.3.1.3 The PPACE Scheme
We now present what we call the PPACE (Practical
PACE) scheme under the realistic processor model. The heart of PPACE is a fully polynomial
time approximation scheme (FPTAS) that obtains $\epsilon$-optimal solutions, that is, solutions within
a factor of $1 + \epsilon$ of the optimal solution, in time polynomial in $1/\epsilon$. To better
understand the problem expressed in (4.6)-(4.8), we give a graphical interpretation of
it. First, we need the following definition:
Definition 1. An energy-time label $l$ is a 2-tuple $(e, t)$, where $e$ and $t$ are nonnegative reals
and denote energy and time, respectively. We write the energy component as $l.e$ and the time
component as $l.t$.
(PE(f1,fM)+F0e(fM), PT(f1,fM)+w0/fM)
(F0e(f1), w0/f1)
(PE(f1,f2)+F0e(f2), PT(f1,f2)+w0/f2)
(PC(1)PE(s0,fM)+F1e(fM), PT(s0,fM)+w1/fM)
(PC(1)PE(s0,f1)+F1e(f1), PT(s0,f1)+w1/f1)
(PC(1)PE(s0,f2)+F1e(f2),PT(s0,f2)+w1/f2)
(PC(r)PE(sr-1,fM)+Fre(fM), PT(sr-1,fM)+wr/fM)
(PC(r)PE(sr-1,f1)+Fre(f1), PT(sr-1,f1)+wr/f1)
(PC(r)PE(sr-1,f2)+Fre(f2), PT(sr-1,f2)+wr/f2)
v0 v1 v2 vr+1vr
Phase 0 Phase 1 Phase r
.
.
.
.
.
.
.
.
.
. . .
. . .
Figure 5: Graphical representation of the mathematical program (4.6)-(4.8)
The mathematical program (4.6)-(4.8) can be expressed as a graph $G = (V, E)$, shown in
Figure 5. The vertex $v_i$ ($1 \le i \le r+1$) represents the point by which the first $i$ phases have
already been executed. The $M$ edges between $v_i$ and $v_{i+1}$ ($0 \le i \le r$) represent the different
speed choices (which frequency to use) for Phase $i$. Each choice is represented by an energy-time
label, indicating the expected energy consumption and the worst-case running time
for that phase. We also associate each path in the graph with an energy-time label, where
the energy of a path is defined as the sum of the energies of all edges on the path and
the time of a path is defined as the sum of the times of all edges on the path. When an
energy-time label $l$ is associated with a path, we denote the most recently used frequency by
$l.f$. Therefore, the problem is reduced to finding a path from $v_0$ to $v_{r+1}$ such that the energy
of the path is minimized while the time of the path is no greater than the deadline $D$.
Since a path can be summarized as an energy-time label, a straightforward approach is
to start from $v_0$ and work all the way from left to right in an iterative manner to generate
all paths. Each vertex $v_i$ is associated with $M$ energy-time label sets $LABEL(i, j)$, where
$j = 1, 2, \ldots, M$. If a label $l \in LABEL(i, j)$, then we have $l.f = f_j$. For succinct presentation,
we will use $LABEL(i, *)$ to denote all label sets in $v_i$ (i.e., $LABEL(i, *)$ is shorthand for
$\bigcup_{j=1}^{M} LABEL(i, j)$) and $LABEL(i, \cdot)$ to denote any label set in $v_i$ (i.e., $LABEL(i, \cdot)$ is
shorthand for $LABEL(i, j)$, $1 \le j \le M$). Initially all label sets are empty except for
$LABEL(0, 1)$, which contains only one energy-time label $(0, 0)$. The whole process consists of
$r+1$ iterations. In the $i$th iteration, we generate all paths from $v_0$ to $v_i$: from $LABEL(i-1, *)$
we add the values of each edge, creating new paths, and store them in $LABEL(i, *)$. At the
end, we select the energy-time label with the minimum energy and with time no greater
than $D$ from $LABEL(r+1, *)$ as the solution to the problem.
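The naive procedure just described can be sketched in Python as follows. This is an illustration under assumptions made here: the per-cycle energy `e` and the overhead functions `PE` and `PT` are passed in as callables because their exact forms are defined elsewhere in the dissertation, labels are stored as `(energy, time, path)` tuples, and the toy instance uses zero overhead.

```python
def enumerate_labels(freqs, w, F, PC, e, PE, PT, deadline):
    """Generate every path of the graph in Figure 5 and return the best one.

    labels[j] plays the role of LABEL(i, j+1): tuples (energy, time, path)
    whose most recently used frequency is freqs[j]. Returns the label with
    minimum energy among those meeting the deadline, or None.
    """
    M = len(freqs)
    labels = [[] for _ in range(M)]
    labels[0].append((0.0, 0.0, ()))  # start idle at the minimum frequency
    for i in range(len(w)):           # one iteration per phase 0..r
        nxt = [[] for _ in range(M)]
        for j_prev in range(M):
            for energy, time, path in labels[j_prev]:
                for j, f in enumerate(freqs):
                    new_e = energy + PC[i] * PE(freqs[j_prev], f) + F[i] * e(f)
                    new_t = time + PT(freqs[j_prev], f) + w[i] / f
                    nxt[j].append((new_e, new_t, path + (f,)))
        labels = nxt
    feasible = [l for group in labels for l in group if l[1] <= deadline]
    return min(feasible, key=lambda l: l[0]) if feasible else None


# Toy instance: three frequencies, two phases, no speed change overhead.
best = enumerate_labels(
    freqs=[1.0, 2.0, 3.0], w=[2, 2], F=[1.75, 0.75], PC=[1.0, 0.5],
    e=lambda f: f * f,            # hypothetical energy per cycle
    PE=lambda a, b: 0.0, PT=lambda a, b: 0.0,
    deadline=2.0,
)
print(best)
```

The number of labels grows as $M^{r+1}$ here, which is the exponential growth that motivates the eliminations discussed next.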
Unfortunately, the size of LABEL(i, ·) may suffer from exponential growth in this
naive approach. To prevent this from happening, the key idea is to reduce and limit the
size of LABEL(i, ·) after each iteration by eliminating some of the energy-time labels in
LABEL(i, ·). We devise two types of eliminations: one that does not affect the optimality
of the solution and one that may affect optimality but still allows for a performance guarantee.
There are three eliminations that do not affect the optimality of the solution:
1. We can eliminate any energy-time label $l$ in $LABEL(i, \cdot)$ if $l$ cannot lead to a feasible
solution (i.e., a solution that is guaranteed to meet the deadline, although its energy
consumption is not necessarily optimized) by using a single frequency that is no less than
the current frequency after Phase $(i-1)$. This can be easily proved by contradiction.
Suppose that $l$ leads to a feasible solution $l'$ for which the maximum frequency of Phases
$i, i+1, \ldots, r$ is $f_{max}$ ($f_{max}$ is not necessarily $f_M$ because, if the deadline is large enough,
$f_{max} < f_M$ would be sufficient to meet the deadline). Then we can replace the frequencies
of Phases $i, i+1, \ldots, r$ with $f_{max}$ and obtain another feasible solution, which leads to a
contradiction. Formally, the necessary condition for label $l$ being able to lead to a
feasible solution is that there exists a frequency $s$, where $l.f \le s \le f_M$, such that

$$l.t + PT(l.f, s) + \frac{\sum_{i \le j \le r} w_j}{s} \le D \qquad (4.12)$$

By manipulating Inequality (4.12) into a quadratic inequality and solving it for $s$, we
transform the necessary condition into

$$(D - l.t + \xi_1\, l.f)^2 - 4 \xi_1 \sum_{i \le j \le r} w_j \ge 0 \qquad (4.13)$$
$$\frac{D - l.t + \xi_1\, l.f - \sqrt{(D - l.t + \xi_1\, l.f)^2 - 4 \xi_1 \sum_{i \le j \le r} w_j}}{2 \xi_1} \le f_M \qquad (4.14)$$
$$\frac{D - l.t + \xi_1\, l.f + \sqrt{(D - l.t + \xi_1\, l.f)^2 - 4 \xi_1 \sum_{i \le j \le r} w_j}}{2 \xi_1} \ge l.f \qquad (4.15)$$

where $\xi_1$ is from (3.1).
2. For the second optimality-preserving elimination, we need to compare two energy-time
labels.
Definition 2. Let l1 = (e1, t1) and l2 = (e2, t2) be two energy-time labels from LABEL(i, ·).
We say that l1 dominates l2, denoted by l2 ≺ l1, if e1 ≤ e2 and t1 ≤ t2.
The dominance relation on a set of energy-time labels is clearly a partial ordering on the
set. If l2 ≺ l1, this means l2 will not lead to any solution better than the best solution
that l1 leads to. Therefore, we eliminate all energy-time labels that are dominated by
some other energy-time label in the same label set. Note that energy-time labels from
different label sets cannot be compared using the dominance relation of Definition 2 because of
speed change overhead. This is also the reason why we have M label sets in each vertex.
On the other hand, if speed change overhead can be ignored, we can combine M label
sets into one.
Performing the second elimination can be reduced to the maxima problem in compu-
tational geometry [52]. Specifically, the energy-time labels in label sets are stored in
decreasing order of the energy component, breaking ties with smaller time component
coming ahead. Thus, the elimination for a label set can be accomplished by using the
dimension-sweep technique in time O(n ln n) [52], where n is the number of labels in the
set.
3. For any energy-time label l in LABEL(i, ∗) surviving the previous two eliminations,
we compute a lower bound of the feasible solutions that l leads to (denoted by l.LB)
and an upper bound of the best solution that l leads to (denoted by l.UB). Let
$U = \min_{l' \in LABEL(i,*)} l'.UB$. Then U is an upper bound of the optimal solution. For an
energy-time label l, if l.LB > U, that means l will not lead to the optimal solution and thus
can be eliminated. A simple method to compute l.LB for a label l is to compute the
expected energy consumption assuming that the minimum frequency is used from Phase
i on. Similarly, a simple method to compute l.UB is to first find the optimal continuous
frequency assuming that the task will run for the worst-case cycles, round it up and then
compute the expected energy consumption assuming that this frequency is used from
Phase i on. Note that these two methods only take constant time.
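The first two eliminations can be sketched in Python as follows. This is only an illustration under stated assumptions, not the implementation used in this work: it assumes the time penalty has the linear form PT(f, s) = ξ1(s − f) for s ≥ f (the form implied by turning (4.12) into (4.13)-(4.15)), it represents labels as plain (energy, time) tuples, and all function and parameter names are hypothetical.

```python
import math

def can_meet_deadline(t, f, remaining_cycles, deadline, xi1, f_max):
    """Elimination 1: test conditions (4.13)-(4.15) for a label with time t and frequency f.

    Assumes PT(f, s) = xi1 * (s - f) for s >= f (an assumption of this sketch).
    """
    if xi1 == 0.0:
        # Without time overhead the best choice is simply the maximum frequency.
        return t + remaining_cycles / f_max <= deadline
    b = deadline - t + xi1 * f
    disc = b * b - 4.0 * xi1 * remaining_cycles
    if disc < 0.0:                            # (4.13) violated
        return False
    root = math.sqrt(disc)
    s_low = (b - root) / (2.0 * xi1)
    s_high = (b + root) / (2.0 * xi1)
    return s_low <= f_max and s_high >= f     # (4.14) and (4.15)

def prune_dominated(labels):
    """Elimination 2: keep only non-dominated (energy, time) labels.

    Sorting costs O(n log n); the sweep itself is linear. The result is in
    increasing time order, which is also decreasing energy order.
    """
    kept = []
    best_energy = math.inf
    for energy, time in sorted(labels, key=lambda lab: (lab[1], lab[0])):
        if energy < best_energy:              # strictly better energy than all faster labels
            kept.append((energy, time))
            best_energy = energy
    return kept
```

For example, with ξ1 = 0.5, a label at time 0 and frequency 1 with 10 cycles left can meet a deadline of 6 when fM = 4 (running at s = 2 takes 0.5 + 5 = 5.5), but not when fM = 1.5.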
With the above eliminations, the size of LABEL(i, ·) decreases substantially. Note that
at this point the optimal solution is guaranteed to be found. However, the running time of
the algorithm still has no polynomial time bound guarantee. Inspired by the fully polynomial
time approximation scheme (FPTAS) for the subset-sum problem [20], we obtain an FPTAS
for our problem, further reducing the size of LABEL(i, ·).
The intuition for the FPTAS is that we need to further trim each LABEL(i, ·) at the end
of each iteration. A trimming parameter δ (0 < δ < 1) will be used to direct the trimming.
To trim an energy-time label set L by δ means to remove as many energy-time labels as
possible, in such a way that if $\hat{L}$ is the result of trimming L, then for every energy-time
label l that was removed from L, there is an energy-time label $\hat{l}$ in $\hat{L}$ such that
$l.t > \hat{l}.t$ and $\frac{\hat{l}.e - l.e}{l.e} \le \delta$ (or, equivalently, $\hat{l}.e \le (1+\delta)\, l.e$).
Such an $\hat{l}$ can be thought of as “representing” l in the new energy-time label set $\hat{L}$.
Note that $\hat{L} \subseteq L$. Let the performance guarantee be ε (0 < ε < 1), which means that
the solution will be within a factor of 1 + ε of the optimal solution. After the first type of
eliminations (i.e., the optimality-preserving eliminations), LABEL(i, ·) is trimmed using a
parameter $\delta = (1+\epsilon)^{\frac{1}{r+1}} - 1$. The choice of δ shall be clear
later in the proof of Theorem 2.
The procedure TRIM (shown in Algorithm 4.3) performs the second type of elimination
for label set L. Note that the energy-time labels in label sets are stored in decreasing order
on the energy component (or equivalently, in increasing order on the time component).
The PPACE scheme is shown in Algorithm 4.4.
Algorithm 4.3 TRIM(L = [l1, l2, . . . , l|L|], δ)
1: $\hat{L}$ := [l1]
2: last := l1
3: for i := 2 to |L| do
4:   if last.e > (1 + δ) li.e then
5:     append li onto the end of $\hat{L}$
6:     last := li
7:   end if
8: end for
9: return $\hat{L}$
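Algorithm 4.3 translates almost line for line into Python. The sketch below is illustrative only: labels are assumed to be (energy, time) tuples already sorted by decreasing energy, and the helper trimming_parameter (a hypothetical name) computes δ from ε and r as described above.

```python
def trim(labels, delta):
    """TRIM: keep a label only if the last kept label has energy
    more than (1 + delta) times larger."""
    if not labels:
        return []
    trimmed = [labels[0]]
    last_energy = labels[0][0]
    for energy, time in labels[1:]:
        if last_energy > (1.0 + delta) * energy:
            trimmed.append((energy, time))
            last_energy = energy
    return trimmed

def trimming_parameter(eps, r):
    """delta = (1 + eps)^(1 / (r + 1)) - 1, so that (1 + delta)^(r + 1) = 1 + eps."""
    return (1.0 + eps) ** (1.0 / (r + 1)) - 1.0
```

With δ = 0.1, the list [(100, 1), (99, 2), (95, 3), (80, 4), (79, 5)] is trimmed to [(100, 1), (80, 4)]: the removed labels (99, 2) and (95, 3) are represented by (100, 1), which is faster and uses at most 10% more energy.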
4.3.1.4 Analysis of PPACE
We now analyze the procedure PPACE(ε) in Algorithm 4.4. First, notice that lines 18-20 in
Algorithm 4.4 correspond to the TRIM procedure, that is, the second type of elimination,
which may affect optimality. Let LABEL′(i, ∗) be the label sets obtained if lines 18-20 in
Algorithm 4.4 are omitted. Note that LABEL(i, j) ⊆ LABEL′(i, j),
where 1 ≤ j ≤ M and the optimal solution is in LABEL′(r + 1, ∗). By comparing
LABEL(i, ∗) and LABEL′(i, ∗), we have the following lemma:
Lemma 2. For every energy-time label l′ ∈ LABEL′(i, j), where 1 ≤ j ≤ M, there exists a
label l ∈ LABEL(i, j) such that $l'.e \le l.e \le (1+\delta)^i\, l'.e$ and $l'.t \ge l.t$.
Lemma 2 shows how the error accumulates after each iteration when comparing label
sets obtained with the second type of elimination and label sets obtained without the second
type of elimination. The details of the proof are presented in Appendix C.
Theorem 2. The procedure PPACE(ε) is a fully polynomial-time approximation scheme,
that is, the solution that PPACE(ε) returns is within a factor of 1 + ε of the optimal solution
and the running time is polynomial in 1/ε.
Proof. Let l∗ denote the optimal solution. Obviously l∗ ∈ LABEL′(r + 1, j) for some j, where
1 ≤ j ≤ M. Then, by Lemma 2 there is an l ∈ LABEL(r + 1, j) such that
\[ l^*.e \le l.e \le (1+\delta)^{r+1}\, l^*.e \]
Algorithm 4.4 PPACE(ε)
1: for i := 1 to r + 1 do
2:   for j := 1 to M do
3:     LABEL(i, j) := ∅
4:   end for
5: end for
6: LABEL(0, 1) := {(0, 0)}
7: for i := 1 to r + 1 do
8:   for each label l ∈ LABEL(i − 1, ∗) do
9:     for j := 1 to M do
10:      LABEL(i, j) := LABEL(i, j) ∪ {(l.e + PC(i − 1) · PE(l.f, f_j) + F_{i−1} · e(f_j), l.t + PT(l.f, f_j) + w_{i−1}/f_j)}
11:    end for
12:  end for
13:  remove all l ∈ LABEL(i, ∗) such that l does not satisfy (4.13)-(4.15)
14:  for j := 1 to M do
15:    remove all l ∈ LABEL(i, j) such that l ≺ l′, where l′ ≠ l and l′ ∈ LABEL(i, j)
16:  end for
17:  compute l.LB and l.UB for all l ∈ LABEL(i, ∗) and then remove all l ∈ LABEL(i, ∗) such that l.LB > min_{l′ ∈ LABEL(i,∗)} l′.UB
18:  for j := 1 to M do
19:    LABEL(i, j) := TRIM(LABEL(i, j), (1 + ε)^{1/(r+1)} − 1)
20:  end for
21: end for
22: if no label in LABEL(r + 1, ∗) then
23:   return no solution
24: else
25:   return the label l ∈ LABEL(r + 1, ∗) with the minimum energy component
26: end if
Because we chose $\delta = (1+\epsilon)^{\frac{1}{r+1}} - 1$,
\[ (1+\delta)^{r+1} = \left(1 + (1+\epsilon)^{\frac{1}{r+1}} - 1\right)^{r+1} = 1 + \epsilon \]
then
\[ l.e \le (1+\epsilon)\, l^*.e \]
Therefore, the energy returned by PPACE(ε) is not greater than 1 + ε times the optimal
solution, satisfying the first part of the theorem.
To show that its running time is polynomial in 1/ε, we first need to derive the upper
bound on the size of LABEL(i, j), where 1 ≤ j ≤ M. Let LABEL(i, j) = [l1, l2, . . . , lk] after
trimming. We observe that the energies of any two successive energy-time labels differ by a
factor of more than (1 + δ) (otherwise, one of them would have already been eliminated). In
particular,
\[ l_1.e > (1+\delta)\, l_2.e > (1+\delta)^2\, l_3.e > \cdots > (1+\delta)^{k-1}\, l_k.e \]
Moreover, clearly $l_1.e \le e(f_M) \sum_{0 \le j \le i} s_j F_j$ and $l_k.e \ge e(f_1) \sum_{0 \le j \le i} s_j F_j$.
Let $\lambda = \frac{e(f_M)}{e(f_1)}$ (i.e., the ratio of the energies when running with the highest
and lowest frequencies), then
\[ (1+\delta)^{k-1} < \frac{l_1.e}{l_k.e} \le \lambda \]
or, equivalently
\[ k < 1 + \frac{\ln\lambda}{\ln(1+\delta)} = 1 + \frac{(r+1)\ln\lambda}{\ln(1+\epsilon)} = O\!\left(\frac{r\ln\lambda}{\epsilon}\right) \tag{4.16} \]
Now we can derive the running time of PPACE(ε) in Algorithm 4.4. There are r + 1
iterations (line 7). In each iteration, the processing time is dominated by the dimension-sweep
used for the second optimality-preserving elimination (line 15). Since the number of
energy-time labels generated for each label set is $O\!\left(\frac{r\ln\lambda}{\epsilon}\right)$, the processing time for
each label set is $O\!\left(\frac{r\ln\lambda\,\ln(r\ln\lambda)}{\epsilon}\right)$. Because there are M label sets in each vertex
and there are r + 1 iterations, the total running time is $O\!\left(\frac{r^2 M\ln\lambda\,\ln(r\ln\lambda)}{\epsilon}\right)$.
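To get a feel for the bound (4.16), the short Python sketch below evaluates it numerically. The function name and the sample values (r = 100, λ = 10, ε = 0.05) are chosen purely for illustration and are not taken from the experiments.

```python
import math

def label_set_bound(r, lam, eps):
    """Upper bound on k from (4.16): k < 1 + ln(lam) / ln(1 + delta),
    where delta = (1 + eps)^(1 / (r + 1)) - 1."""
    delta = (1.0 + eps) ** (1.0 / (r + 1)) - 1.0
    return 1.0 + math.log(lam) / math.log(1.0 + delta)

def label_set_bound_closed_form(r, lam, eps):
    """Same bound written as 1 + (r + 1) ln(lam) / ln(1 + eps)."""
    return 1.0 + (r + 1) * math.log(lam) / math.log(1.0 + eps)
```

For r = 100, λ = 10 and ε = 0.05 both forms give a bound of roughly 4.8 × 10^3 labels per label set, which is a worst-case figure; the label sets observed in practice can be much smaller.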
In fact, the running time of PPACE depends entirely on the total number of energy-time
labels stored in all the vertices, which is $\sum_{i=0}^{r+1} \sum_{j=1}^{M} |LABEL(i,j)|$. The FPTAS is
conservative, that is, the approximation guarantee reflects the performance of the algorithm
only on the most pathological instances. In practice, using ε = 5% usually gives a solution
that is very close to the optimal solution (see experimental results in Section 4.3.4).
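The overall label-propagation loop of Algorithm 4.4 can be sketched compactly in Python. This is a simplified illustration rather than the implementation evaluated in this work: it replaces the first elimination by a plain deadline check, omits the third (bound-based) elimination, and carries the chosen frequencies along with each label. The inputs reach, fexp, e_cycle, pt and pe are hypothetical stand-ins for PC(·), F, e(·), PT and PE, and the phase list w has r + 1 entries.

```python
def prune(labels):
    """Drop dominated labels; the result has increasing time and decreasing energy."""
    kept, best = [], float("inf")
    for lab in sorted(labels, key=lambda l: (l[1], l[0])):
        if lab[0] < best:
            kept.append(lab)
            best = lab[0]
    return kept

def trim(labels, delta):
    """TRIM (Algorithm 4.3) on labels sorted by decreasing energy."""
    kept = labels[:1]
    for lab in labels[1:]:
        if kept[-1][0] > (1.0 + delta) * lab[0]:
            kept.append(lab)
    return kept

def ppace(w, reach, fexp, freqs, e_cycle, pt, pe, deadline, eps):
    """Return (energy, time, schedule) within a factor 1 + eps of the best
    deadline-feasible schedule, or None if no schedule is feasible."""
    phases = len(w)                                  # r + 1 in the text
    delta = (1.0 + eps) ** (1.0 / phases) - 1.0
    sets = [[] for _ in freqs]                       # one label set per frequency
    sets[0] = [(0.0, 0.0, ())]                       # start at the lowest frequency
    for i in range(phases):
        new = [[] for _ in freqs]
        for a, fa in enumerate(freqs):
            for e, t, sched in sets[a]:
                for b, fb in enumerate(freqs):
                    t2 = t + pt(fa, fb) + w[i] / fb
                    if t2 > deadline:                # cannot be feasible any more
                        continue
                    e2 = e + reach[i] * pe(fa, fb) + fexp[i] * e_cycle(fb)
                    new[b].append((e2, t2, sched + (fb,)))
        sets = [trim(prune(s), delta) for s in new]
    final = [lab for s in sets for lab in s]
    return min(final, key=lambda l: l[0]) if final else None
```

Keeping one label set per frequency mirrors the M label sets per vertex: labels that end a phase at different frequencies are never compared with each other, because the next speed change would cost them different overheads.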
4.3.2 The Inter-task Scheme
For inter-task DVS under the realistic processor model, we propose a scheme called PITDVS
(Practical Inter-Task DVS) that uses the optimal inter-task DVS scheme (OITDVS) from
the ideal processor model as the basis and patches it to comply with the realistic processor
model. Although PITDVS is not optimal, we hope that the advantage of using probabilistic
information about the workload will outweigh the disadvantage of patching so that we can
achieve better energy savings over the existing schemes.
Like OITDVS, the offline part of PITDVS is to compute the time allocation fraction βi
for each task τi. For the realistic processor model, a probability function is represented by
a histogram. By treating the bins of a histogram as supercycles, it is trivial to transform
the procedure OITDVS-offline in Algorithm 4.1 to become the offline part of PITDVS. The
online part of PITDVS takes into account the issues of the realistic model; below we discuss
these issues and provide corresponding solutions.
Patch 1 (Speed Change Overhead) Available processors have speed change overhead,
including a time penalty and an energy penalty. When there are N tasks in the frame, the number
of speed changes is at most N because the processor is expected to change speed only before
the execution of each task. Also, the time penalty of each speed change is at most PT(f1, fM).
Thus, we take a conservative approach. Before computing the speed for task τi to be executed, we
subtract the maximum possible time penalty, (N − i + 1)PT(f1, fM) (recall that N is the
number of tasks in the system), from the remaining time in the frame.
Patch 2 (Maximum and Minimum Speeds) The processor speed in the ideal processor
model can be tuned from zero to infinity. In reality, however, every processor has a maximum
speed fM and a minimum speed f1. The speed that is used to execute a task cannot
violate these constraints, and thus we adjust the allotted time for τi at dispatch time as
follows. Before starting to execute task τi and having time d left, if there is more time than
needed to execute with the lowest speed ($\beta_i d > W_i/f_1$), then we allot time $W_i/f_1$ to τi
(equivalent to using minimum speed f1 to execute τi). Similarly, if there is less time than
needed with the maximum speed ($\beta_i d < W_i/f_M$), then we allot time $W_i/f_M$ to τi (equivalent
to using maximum speed fM to execute τi). Also, we need to compute the greedy-derived time
tgreedy, that is, the conservative time obtained using the Greedy scheme [48] to allow the rest
of the tasks to finish by the deadline using the maximum speed. If the allotted time exceeds
tgreedy, then we allot time tgreedy to τi (equivalent to executing τi at speed $W_i/t_{greedy}$);
this is because the Greedy scheme guarantees deadlines in the most aggressive form [48].
It is easy to see that the resulting schedule is still valid as long as the tasks can be scheduled
using the maximum speed when all the tasks take their WCECs.
Patch 3 (Discrete Speeds) The processor speed in the ideal processor model can be tuned
continuously. But real-world processors only provide a finite set of discrete speeds. Since we
will use a constant speed to execute a task, the most straightforward way to fix this problem
is to round the continuous speed up to the closest higher discrete speed [32].
To summarize the above patches, we show the online part of PITDVS in Algorithm 4.6,
which calls Algorithm 4.5. The procedure PITDVS-online(i, d) is called before task τi is
executed and there is time d remaining in the frame. This procedure can be regarded as
running in constant time because the complexity of lines 2-4 in Algorithm 4.5 can be reduced
to constant time by using an extra array to store the partial sums of the Wi values. Thus, Patches
1 and 2 can be done in constant time. Patch 3 takes O(log M) time, where M is the number of
available discrete speeds. Because M is usually small in practice, we can treat it as constant
time.
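Algorithms 4.5 and 4.6 can be sketched in Python as below. The sketch is illustrative and rests on a few assumptions of its own: tasks are indexed from 0 (so the factor N − i + 1 becomes N − i), the partial sums of the Wi values are precomputed as suffix sums, the discrete speeds are given in increasing order, and the guard for a non-positive remaining time is an addition that is not part of Algorithm 4.5. All names are hypothetical.

```python
from bisect import bisect_left

def suffix_sums(wcec):
    """suffix[i] = sum(wcec[i:]); the cycles remaining after task i are suffix[i + 1]."""
    suffix = [0.0] * (len(wcec) + 1)
    for i in range(len(wcec) - 1, -1, -1):
        suffix[i] = suffix[i + 1] + wcec[i]
    return suffix

def adjust_continuous_speed(i, d, wcec, suffix, beta, speeds, max_penalty):
    """Patches 1 and 2 for task i (0-based) with time d left in the frame."""
    n = len(wcec)
    f_min, f_max = speeds[0], speeds[-1]
    w_i, w_rest = wcec[i], suffix[i + 1]
    d_eff = d - (n - i) * max_penalty              # Patch 1
    if d_eff <= 0 or (w_i + w_rest) / d_eff >= f_max:
        return f_max                               # extra guard: no usable time left
    t = d_eff * beta[i]
    t = min(t, w_i / f_min)                        # Patch 2: not slower than f_min
    t = max(t, w_i / f_max)                        # Patch 2: not faster than f_max
    t = min(t, d_eff - w_rest / f_max)             # greedy-derived time
    return w_i / t

def pitdvs_online(i, d, wcec, suffix, beta, speeds, max_penalty):
    """Patch 3: round the continuous speed up with a binary search, O(log M)."""
    s = adjust_continuous_speed(i, d, wcec, suffix, beta, speeds, max_penalty)
    k = bisect_left(speeds, s)                     # first discrete speed >= s
    return speeds[min(k, len(speeds) - 1)]
```

Because the suffix sums are computed once per task set, each online call does a constant amount of arithmetic plus one binary search over the M discrete speeds.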
We also propose a variant of the PITDVS scheme, PITDVS2, that uses up to two speeds
to execute a task within the allotted time. This is due to the fact that any continuous speed
can be emulated by using its two adjacent discrete speeds [32]. Intuitively, using two adjacent
discrete speeds to emulate a continuous speed is expected to do better than rounding up the
speed. For a continuous speed s, let the closest higher and lower discrete speeds be denoted
by ⌈s⌉ and ⌊s⌋, respectively. To emulate s for time t, we let the system operate at ⌊s⌋ for
time t1 and at ⌈s⌉ for time t − t1 − PT(⌊s⌋, ⌈s⌉). Thus, we need to satisfy
\[ \lfloor s \rfloor\, t_1 + \lceil s \rceil \left(t - t_1 - PT(\lfloor s \rfloor, \lceil s \rceil)\right) = s \cdot t \tag{4.17} \]
Algorithm 4.5 AdjustContinuousSpeed(i, d)
1: W′ := 0 {W′ is the remaining cycles after τi}
2: for j := i + 1 to N do
3:   W′ := W′ + Wj
4: end for
5: d′ := d − (N − i + 1)PT(f1, fM) {Patch 1}
6: if (Wi + W′)/d′ ≥ fM then
7:   return fM
8: end if
9: t := d′βi
10: {start of Patch 2}
11: if t > Wi/f1 then
12:   t := Wi/f1
13: end if
14: if t < Wi/fM then
15:   t := Wi/fM
16: end if
17: tgreedy := d′ − W′/fM
18: if t > tgreedy then
19:   t := tgreedy
20: end if
21: return Wi/t
Algorithm 4.6 PITDVS-online(i, d)
1: s := AdjustContinuousSpeed(i, d)
2: round s to the next available higher speed {Patch 3}
3: return s
Solving Equation (4.17) will give the speed schedule for PITDVS2, taking into account the
speed change overhead. Algorithm 4.7 shows the online part of PITDVS2. Note that,
although PITDVS2 changes speed during the execution of a task, we still categorize it as
an inter-task DVS scheme because its main idea is to emulate the optimal inter-task DVS scheme
from the ideal processor model.
Algorithm 4.7 PITDVS2-online(i, d)
1: s := AdjustContinuousSpeed(i, d)
2: t := Wi/s
3: s1 := the closest discrete speed lower than s
4: s2 := the closest discrete speed higher than s
5: t1 := (s2(t − PT(s1, s2)) − s · t)/(s2 − s1)
6: t2 := t − t1 − PT(s1, s2)
7: return [[s1, s1t1], [s2, s2t2]]
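Lines 5-7 of Algorithm 4.7 amount to solving Equation (4.17) for t1. A minimal Python sketch is shown below; the function name is hypothetical, the two neighbouring discrete speeds are passed in directly, and raising an error when the switching penalty leaves no room for the lower speed is a choice made for this sketch rather than something specified by the algorithm.

```python
def two_speed_schedule(s, t, s_lo, s_hi, switch_penalty):
    """Emulate continuous speed s for time t with speeds s_lo <= s <= s_hi.

    Returns a list of (speed, cycles) pairs, as in Algorithm 4.7.
    """
    if s_lo == s_hi:
        return [(s_lo, s * t)]                     # s is itself a discrete speed
    t1 = (s_hi * (t - switch_penalty) - s * t) / (s_hi - s_lo)
    if t1 < 0:
        # Even s_hi alone cannot deliver s * t cycles once the switch is paid for.
        raise ValueError("switching penalty too large for two-speed emulation")
    t2 = t - t1 - switch_penalty
    return [(s_lo, s_lo * t1), (s_hi, s_hi * t2)]
```

For example, emulating s = 2.5 for t = 40 with discrete speeds 2 and 4 and a switching penalty of 1 gives t1 = 28 and t2 = 11, that is, 56 cycles at speed 2 and 44 cycles at speed 4, which together cover the required 100 cycles.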
4.3.3 The Hybrid DVS Schemes
For hybrid DVS schemes under the realistic processor model, we can either patch the optimal
hybrid DVS scheme obtained under the ideal processor model (GOPDVS), or combine the
PITDVS and PPACE schemes, as described below.
The first new scheme is called PGOPDVS (Practical GOPDVS). The offline part of
PGOPDVS assumes the ideal processor model. Its first step is to compute the time allocation
fraction βij for each Phase j of task τi. By treating the phases of a task as supercycles, the first
step can be done by the procedure GOPDVS-offline in Algorithm 4.2 with slight modification.
The second step of the offline part of PGOPDVS is to compute the time allocation fraction
$\hat{\beta}_i$ for the whole task τi. The latter can be computed from all the allocated times for each
phase of task τi, as follows. When task τi is ready to execute and there is time d remaining
in the frame, the time allocated for the first phase is $d_1 = \beta_{i1} d$; the time allocated for the
second phase is $d_2 = (d - \beta_{i1} d)\beta_{i2} = (1 - \beta_{i1})\beta_{i2} d$. We can see that the time allocated to the
jth phase is $d_j = d\,\beta_{ij} \prod_{k=1}^{j-1}(1 - \beta_{ik})$. Thus, the time allocation fraction for the whole
task τi is
\[ \hat{\beta}_i = \frac{\sum_{j=1}^{r_i} d_j}{d} = \sum_{j=1}^{r_i} \beta_{ij} \prod_{k=1}^{j-1} (1 - \beta_{ik}) \]
where ri is the number of phases of task τi. Once we have derived $\hat{\beta}_i$, we know the optimal
(under the ideal processor model) time to be allocated to task τi before τi starts executing.
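The whole-task fraction can be accumulated in a single pass over the phase fractions, as in the following Python sketch (the function name is hypothetical). The running product tracks the share of d that is still unallocated, so the result also equals one minus the product of all (1 − βik), which follows from telescoping the sum.

```python
def task_time_fraction(phase_fractions):
    """Whole-task fraction from the per-phase fractions beta_i1, ..., beta_i,ri."""
    total = 0.0
    unallocated = 1.0                  # product of (1 - beta_ik) over earlier phases
    for beta in phase_fractions:
        total += beta * unallocated    # d_j / d for this phase
        unallocated *= 1.0 - beta
    return total
```

For instance, phase fractions (0.2, 0.5) give 0.2 + 0.8 × 0.5 = 0.6, and any list whose last fraction is 1 yields a whole-task fraction of 1, because the last phase receives all the time that remains.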
The online part of PGOPDVS performs the patches to fit the realistic processor model.
The patches are similar to those of PITDVS described in Section 4.3.2. We first perform
the same patches as Patches 1 and 2 of PITDVS. Then we follow the approach of PACE,
that is, first round each speed to the closest available discrete speed, and then perform a
linear scan and adjustment to make sure τi will finish in the allotted time in the worst case.
Algorithm 4.8 shows the online part of PGOPDVS. The procedure PGOPDVS-online(i, d)
is called before task τi is executed and there is time d remaining in the frame.
The second new scheme can be regarded as a variant of PITDVS. It differs from PITDVS
only in that the PPACE speed schedule is used to execute a task. Specifically, after the allotted
time for a task is decided, procedure PPACE(ε) in Algorithm 4.4 is called to compute
the speed schedule because different allotted times correspond to different speed schedules.
Thus we call this scheme PIT-PPACE. Algorithm 4.9 shows the online part of PIT-PPACE.
Note that we are calling the offline part of the PPACE scheme (which has high time complexity)
online in PIT-PPACE, which is by no means practical. This can be remedied by
precomputing the PPACE speed schedules for all possible allotted times d for each task and
storing them in memory for online lookup. We need to discretize the allotted time d because
it is continuous. The degree of discretization, nd, depends on the characteristics of the task. In
general, a larger nd results in a better speed schedule at the expense of higher space overhead.
The space overhead is proportional to nd because for each task we need to precompute nd
speed schedules and store them for online use.
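One possible shape for the precomputed table is sketched below in Python. The details are assumptions made for illustration: the nd allotted times are spaced evenly between a smallest and a largest value, the callback solve stands in for a call to PPACE(ε) with the given deadline, and the online lookup rounds the actual allotted time down to the nearest precomputed value, so that the schedule returned was computed for a deadline that is no longer than the time actually available.

```python
from bisect import bisect_right

def build_schedule_table(d_min, d_max, n_d, solve):
    """Precompute solve(d) on n_d >= 2 evenly spaced allotted times in [d_min, d_max]."""
    step = (d_max - d_min) / (n_d - 1)
    grid = [d_min + k * step for k in range(n_d)]
    return grid, [solve(d) for d in grid]

def lookup_schedule(grid, schedules, d):
    """Return the schedule precomputed for the largest grid point not exceeding d."""
    k = bisect_right(grid, d) - 1
    if k < 0:
        raise ValueError("allotted time is below the smallest precomputed value")
    return schedules[k]
```

The lookup is a binary search over nd entries, and the table stores nd schedules per task, which matches the space overhead discussed above.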
4.3.4 Evaluation
For the ideal processor model, we have described DVS schemes that are provably optimal
for different DVS strategies. However, this is not the case for the realistic processor model.
We have presented the PPACE scheme for frame-based systems with a single task, and the
PITDVS, PITDVS2, PGOPDVS, and PIT-PPACE schemes for general frame-based systems.
Algorithm 4.8 PGOPDVS-online(i, d)
1: W′ := 0 {W′ is the remaining cycles after τi}
2: for j := i + 1 to N do
3:   W′ := W′ + Wj
4: end for
5: d′ := d − (N − i + 1)PT(f1, fM) {Patch 1}
6: if (Wi + W′)/d′ ≥ fM then
7:   return [sj = fM], j = 1, 2, . . . , ri
8: end if
9: t := d′ $\hat{\beta}_i$
10: {start of Patch 2}
11: if t > Wi/f1 then
12:   t := Wi/f1
13: end if
14: if t < Wi/fM then
15:   t := Wi/fM
16: end if
17: tgreedy := d′ − W′/fM
18: if t > tgreedy then
19:   t := tgreedy
20: end if
21: for j := 1 to ri do
22:   sj := tβij
23:   round sj to the closest available discrete speed
24: end for
25: {linear scan}
26: j := ri
27: while τi cannot finish by the deadline do
28:   increase sj to the next higher discrete speed
29:   j := j − 1
30: end while
31: return [sj], j = 1, 2, . . . , ri
Algorithm 4.9 PIT-PPACE-online(i, d, ε)
1: s := AdjustContinuousSpeed(i, d)
2: t := Wi/s
3: set the deadline of τi to t
4: return PPACE(ε)
For these schemes, two major questions remain unanswered: (i) since the worst-case
performance of PPACE is dependent on the value of ε, what is the appropriate ε value that
should be used and how well does it perform compared to the existing schemes? (ii) because
PITDVS, PITDVS2, PGOPDVS, and PIT-PPACE are approximations of the optimal
schemes obtained under the ideal processor model and they do not have theoretical perfor-
mance guarantees as PPACE does, how well do they perform in practice? To answer these
questions, we conducted extensive simulations for different processor models and workloads
described in Section 3.6.
4.3.4.1 Evaluation of Intra-Task DVS Schemes
We used the six distributions (Figure 2) described in Section 3.6 to generate synthetic tasks.
The power scaling factor is 1. The WCEC is 500,000,000 and the minimum number of cycles
is 5,000,000. This corresponds to the ratio of the WCET to the best-case execution time
being 100, as reported in [54]. The default number of speed scaling points is r = 100 and
they are placed evenly in
the range of [1,WCEC]. We also evaluate the effect of parameter r on the performance of
the schemes.
We define the relative error for any scheme that returns the expected energy consumption
E to be (E − OPT)/OPT, where OPT is the optimal solution. We compute OPT using the PPACE
scheme without doing the trimming operation at the expense of much longer running time.
As usual with FPTAS algorithms, we set ε = 0.05 for PPACE when comparing it with other
schemes. For all experiments, we varied the slack available for power management. The
slack is changed by varying the deadline from WCEC/fM to WCEC/f1 (increasing the deadline
will increase the slack, that is, will increase the allotted time for the task, thus resulting in less
energy consumption).
Under the above setup, we performed three experiments.
Experiment 1: Comparing All Schemes. In this experiment, we compare all intra-
task DVS schemes including a variant of the PACE scheme called PACE2, which differs
from PACE in that it uses two adjacent discrete speeds to emulate any continuous speed [32]
obtained from the solution to the mathematical program (4.9)-(4.11). Since all schemes ex-
cept for PPACE are based on the ideal processor model, which does not model speed change
overheads, we assume that they fix this problem by subtracting the maximum possible time
penalties from the allotted time. We only show the results for bimodal1 distribution (Figure
2(e)) because results for all other distributions are similar. As shown in Figure 6 (note the
different Y-axes scales), PPACE is very close to optimal and outperforms all other schemes
in all cases. On the other hand, the relative errors of GRACE, PACE, and PACE2 depend
on the power model, the distribution of number of execution cycles, and the deadline. For
the Synthetic processor (Figure 6(a)) for which the relative errors of GRACE, PACE and
PACE2 are only due to rounding, PACE and PACE2 perform reasonably well, but still have
relative errors up to 22% and 19% respectively. Since PACE2 uses two adjacent discrete
speeds to emulate any continuous speed, it performs very well for processors that have good
approximate analytical power functions (for XScale in Figure 6(c), PACE2’s relative errors
are less than 8%). However, PACE2 performs relatively poorly for the PowerPC 405LP,
which does not have a good approximate analytical power function (the relative error is up to
66%). Still, for XScale (Figure 6(c)) and PowerPC 405LP (Figure 6(e)), neither PACE nor
GRACE is a clear winner. In summary, the effects of the approximations (i.e., using analytical
functions to approximate the actual power function, and rounding or using two speeds to
emulate any continuous speed) and of the patching for dealing with the speed change overhead
could compound or cancel each other out. Thus, the solutions returned by GRACE, PACE, and
PACE2 are unpredictable and unstable, while PPACE can provide a performance guarantee.
Figures 6(b), 6(d), and 6(f) show the results of sensitivity analysis on ε for PPACE. We
can see that the relative errors are generally below 0.1%, 0.18%, 0.33% for ε = 5%, 10%, 15%,
respectively. This means that, in practice, PPACE performs much better (around 2 orders
of magnitude) than the performance guarantees it offers. This knowledge allows system
designers to set the parameter ε higher than required, in order to speed up the algorithm
execution, if the worst-case performance guarantee is not needed.
[Figure 6: Comparing intra-task DVS schemes for bimodal1 distribution (the relative errors
are relative to optimal solutions). Panels: (a) Synthetic processor; (b) Effect of ε on PPACE
for Synthetic processor; (c) XScale; (d) Effect of ε on PPACE for XScale; (e) PowerPC 405LP;
(f) Effect of ε on PPACE for PowerPC 405LP. All panels plot relative error (%) versus
deadline (sec); panels (a), (c), (e) show GRACE, PACE, PACE2, and PPACE(ε = 0.05), and
panels (b), (d), (f) show ε = 0.05, 0.10, 0.15.]
The time complexity of PPACE is greater than those of PACE and GRACE, which are
O(r log M). In practice, the running time of PPACE depends on the implementation and
the platform (see footnote 3). However, it is roughly proportional to the number of energy-time
labels generated during the execution of PPACE(ε). Figure 7 compares the average number of
energy-time labels in all vertices for different versions of PPACE. The curves on the top in
Figure 7(a) and 7(b) are for PPACE without doing the trimming operation. We can see
that the elimination of energy-time labels that affects optimality significantly reduces the
size of the label set in each vertex but still allows for a performance guarantee. It can also
be observed from Figure 7 that the average number of energy-time labels increases when ε
decreases, but they are all relatively small (especially for PowerPC 405LP shown in Figure
7(b)). It is also true that when M, the number of discrete speeds, increases, the average
number of labels increases; this can be seen by comparing Figure 7(a) and 7(b).
[Figure 7: Efficiency of PPACE (bimodal1 distribution). Panels: (a) XScale; (b) PowerPC 405LP.
Each panel plots the average label set size versus deadline (sec) for PPACE without trimming
and for ε = 0.05, 0.10, 0.15.]
Footnote 3: In this work, we use Java to implement PPACE and our hardware setting is a Pentium 4
3.2 GHz with 1 GB memory. The running time of PPACE(0.05) for all processor models is always
within 1 second.
Experiment 2: Effect of Speed Scaling Points. Intra-task DVS schemes assume a pre-
defined set of speed scaling points in tasks. However, it is not clear how to choose an optimal
sequence of speed scaling points, especially in the presence of speed change overheads. This
is still an open problem and it is beyond the scope of this dissertation. Intuitively, having
more speed scaling points should result in a better speed schedule at the expense of longer
time to find the speed schedule. Thus, we perform an experiment to find out how the number
of speed scaling points affects the speed schedule. For the tasks in Experiment 1, we set
the number of speed scaling points to be 2, 3, . . ., 100 (the positions of the speed scaling
points are evenly distributed) and obtained the speed schedule returned by PPACE with
ε = 0.05. Figure 8 shows the energy consumption (normalized to the energy consumption
for the number of speed scaling points equal to 100) versus number of speed scaling points
for three values of the deadline (for other values of the deadline, the curve is similar; also
we do not show results for the Synthetic processor because they are similar). The results
agree with our intuition. However, Figure 8 also shows diminishing returns from
increasing the number of speed scaling points. In practice, this fact will help system designers
find a good number of speed scaling points more quickly; from our experiments, it seems
that r = 20 or r = 30 is a good number for these parameters.
[Figure 8: Effect of speed scaling points (bimodal1 distribution). Panels: (a) XScale;
(b) PowerPC 405LP. Each panel plots the normalized energy versus the number of transition
points for three deadlines (deadline 1, deadline 2, deadline 3).]
Experiment 3: Effect of Speed Change Overhead. We also performed simulations
to evaluate the effect of speed change overhead. In practice, processors that can adjust
the voltage internally have low overheads (in the range of microseconds), but systems that
require changing the voltage externally experience high overheads in the milliseconds range.
Using the same task sets used in Experiment 1, we varied the worst-case time penalty of
XScale and PowerPC 405LP for speed changes from 50 µs to 50 ms. The energy penalty is
changed proportionally to the time penalty. For example, the energy penalty of XScale
for a time penalty of t µs is 1.2 µJ × t/12. Figure 9 shows the relative errors for PACE and
PPACE (we do not show results for GRACE and PACE2 because similar conclusion can be
reached). We can see that the relative errors for PPACE are all below 0.06%. For PACE,
which has the smallest error, the relative error increases as the time penalty increases; when
the slack is small and time penalty is large, the relative error can be up to 94%.
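The proportional scaling of the energy penalty is a one-line computation; the Python sketch below (with a hypothetical function name) evaluates the XScale example at the two ends of the sweep.

```python
def xscale_energy_penalty_uj(time_penalty_us):
    """Energy penalty in microjoules for a time penalty given in microseconds,
    scaled as 1.2 uJ * t / 12."""
    return 1.2 * time_penalty_us / 12.0
```

Under this scaling, the smallest time penalty in the sweep (50 µs) corresponds to an energy penalty of about 5 µJ, and the largest (50 ms = 50,000 µs) to about 5,000 µJ, that is, 5 mJ.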
4.3.4.2 Evaluation of the DVS Schemes for General Frame-based Systems
Evaluation on Synthetic Workloads. A frame-based real-time system is characterized
by the number of tasks, the power scaling factor for each task, the WCEC of each task,
the probability distribution of the number of execution cycles of each task, and the frame
length. We simulated systems consisting of 5 and 10 tasks. We only show the results for the
systems with 5 tasks because the results for systems with 10 tasks are similar. The power
scaling factor was randomly chosen uniformly from 0.8 to 1.2. As with the simulations in
Section 4.3.4.1, the WCEC of each task is 500,000,000 and the minimum number of cycles is
5,000,000. The probability function of each task’s actual execution cycles is randomly chosen
from the 6 representative distributions shown in Figure 2. The bin width of the histograms
denoting the probability functions is 5,000,000 cycles. We experimented with 20 frame
lengths chosen evenly from 5 × WCEC/fM to 5 × WCEC/f1. For each simulated system, we evaluated
7 DVS schemes: Proportional2, Greedy2, Statistical2, PITDVS, PITDVS2, PGOPDVS and
PIT-PPACE. The Proportional2, Greedy2, and Statistical2 schemes are extensions to the
original Proportional, Greedy, and Statistical schemes [48] described in Section 2.4.1 by using
up to two speeds to execute a task [32]. Note that these 7 schemes include inter-task and
hybrid DVS schemes. We also evaluated a clairvoyant scheme, which is aware of the actual
execution cycles of each task and uses the optimal frequency to execute each task. The
clairvoyant scheme is used as the baseline to compare all other schemes.
[Figure 9: Effect of speed change overhead (bimodal1 distribution). Panels: (a) PACE, XScale;
(b) PPACE(ε = 0.05), XScale; (c) PACE, PowerPC 405LP; (d) PPACE(ε = 0.05), PowerPC 405LP.
Each panel plots relative error (%) against deadline (sec) and time penalty (sec).]
In evaluating a DVS scheme on a system, we computed the average energy consumption
per frame as the energy consumption for that scheme on that system. For the same system,
we compute the relative error of a scheme whose energy consumption is E as (E − OPT)/OPT, where
OPT is the energy consumption of the clairvoyant scheme. For each DVS scheme, we
averaged the relative errors for all systems with the same frame length because we consider
slack to be the most influential factor for energy consumption. Under the aforementioned
setup, we simulated a total of 20 billion frames. The comparisons of the DVS schemes are
shown in Figure 10.
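The aggregation described above can be sketched in a few lines of Python. The record layout (frame length, scheme energy, clairvoyant energy) and the function names are assumptions made for illustration; they do not describe the simulator used in this work.

```python
from collections import defaultdict

def relative_error(energy, clairvoyant_energy):
    """(E - OPT) / OPT, with OPT the energy of the clairvoyant scheme."""
    return (energy - clairvoyant_energy) / clairvoyant_energy

def average_error_by_frame_length(results):
    """results: iterable of (frame_length, scheme_energy, clairvoyant_energy),
    one entry per simulated system. Returns {frame_length: mean relative error}."""
    buckets = defaultdict(list)
    for frame_length, energy, opt in results:
        buckets[frame_length].append(relative_error(energy, opt))
    return {fl: sum(errs) / len(errs) for fl, errs in buckets.items()}
```

For example, two systems with the same frame length and relative errors of 20% and 10% contribute a single averaged point of 15% to the curve of that scheme.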
For most of the simulations, the best scheme is either PITDVS2 or PIT-PPACE. For the
simulations in which the best scheme is neither PITDVS2 nor PIT-PPACE, the minimum
energy consumption of these two schemes is just off by less than 0.1% compared to the energy
consumption of the best scheme. Thus, we take a closer look at PITDVS2 and PIT-PPACE
on the plots in the right column of Figure 10. PITDVS2 performs no worse than PIT-PPACE
in most cases except for a small range of frame lengths for PowerPC 405LP. Considering the
high memory overhead or high run-time overhead of PIT-PPACE (as described in Section
4.3.3), we regard PITDVS2 as the better scheme.
From Figure 10, we can see that for most cases the worst scheme is PGOPDVS, which is
a surprising result considering that it is based on the best scheme under the ideal processor
model. This is because the excessive rounding of speeds in the PGOPDVS scheme makes it
drift far away from the optimal solution. The PITDVS scheme is not necessarily better than
the DVS schemes that do not use probabilistic information of the workload. This is because
the rounding-up effect offsets the advantage of using probabilistic information. Considering
the difference between PITDVS and PITDVS2, we can see that using two adjacent discrete
speeds to emulate a continuous speed [32] plays an important role in PITDVS2. Among the
Proportional, Greedy, and Statistical schemes, none is a clear winner. This is because all
of them are just based on heuristics and will only perform well in a subset of the problem
space.
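The idea of emulating a continuous speed with two adjacent discrete speeds can be sketched as follows. This is a minimal illustration, assuming no speed change overhead and a workload that can be split at any cycle; the function name, the units, and the numbers in the example are hypothetical.

```python
def emulate_speed(cycles, time_budget, speeds):
    """Split `cycles` over two adjacent discrete speeds so that they finish in `time_budget`.

    Returns a list of (speed, duration) pairs whose durations sum to at most time_budget.
    """
    speeds = sorted(speeds)
    target = cycles / time_budget                  # ideal continuous speed
    if target > speeds[-1]:
        raise ValueError("workload is infeasible even at the maximum speed")
    if target <= speeds[0]:
        # The minimum speed is already fast enough; the leftover time is idle.
        return [(speeds[0], cycles / speeds[0])]
    for low, high in zip(speeds, speeds[1:]):
        if low < target <= high:
            # Solve low * t_low + high * t_high = cycles with t_low + t_high = time_budget.
            t_high = (cycles - low * time_budget) / (high - low)
            plan = [(low, time_budget - t_high), (high, t_high)]
            return [(f, d) for f, d in plan if d > 0]

# Hypothetical example: 600 units of work in 1 time unit with speeds 400 and 800.
plan = emulate_speed(600.0, 1.0, [400.0, 800.0])
```

In this example the ideal speed of 600 lies halfway between the two available speeds, so each speed is used for half of the time budget.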
The two key factors that affect the energy savings of the PITDVS2 scheme over other
schemes (by comparing relative errors) are the minimum speed of the processor and the
number of speeds available in the processor. In computing speed schedules, the PITDVS2
scheme is based on the solution under the ideal processor model in which the frequency of
the processor is unrestricted and continuous. Because of the convexity of the power function,
high speeds are not usually obtained by the PITDVS2 scheme. But low speeds are desired
because the PITDVS2 scheme can navigate the full spectrum of available speeds and can
find the best speed that minimizes the expected energy consumption. The importance of
the number of speeds available in the processor is obvious given that we need to convert
the continuous speeds to discrete speeds. For example, because the minimum speed of the
Synthetic processor is less than that of XScale, and the number of speeds of that processor
is greater than that of XScale, the energy saving for the Synthetic processor is greater than
that for XScale.
We also performed simulations to evaluate the effect of speed change overhead. We varied
the time penalty of XScale and PowerPC 405LP in the same way as in Section 4.3.4.1. The
results are very similar to those in Figure 10. This is because the derivations of all schemes
are based on the ideal processor model and they all deal with the speed change overhead
in a similar way. That is, they ignore the energy penalty and subtract the maximum pos-
sible time penalty from the available time when a task is being scheduled to execute. As
the time penalty increases, the energy consumption of all schemes increases, but their dif-
ferences in terms of energy consumption remain roughly the same when averaging the results.
Evaluation on Real-World Workload. We evaluated the DVS schemes on automatic
target recognition (ATR) described in Section 3.6. An example embedded system that uses
ATR is an unmanned autonomous vehicle (UAV) with two cameras installed. Each camera
will take 1 to 3 pictures every 100 ms and send them to a back-end for target recognition.
The back-end is required to finish processing each batch of pictures in a timely fashion.
Thus, the back-end can be modeled as a frame-based system whose frame length is 100 ms.
A task in this system is responsible for processing a picture. For any given frame, there
could be 2 to 6 tasks, each corresponding to a picture. The number of execution cycles of a
task depends on the number of ROIs in the picture that the task is processing. The number
of ROIs in a picture cannot be predicted before processing the picture. We assume that the
back-end is equipped with an Intel XScale processor.
[Figure 10: Comparison of DVS schemes for general frame-based systems (the relative errors are relative to the clairvoyant scheme). Panels: (a) Synthetic processor, all schemes; (b) Synthetic processor, 2 schemes; (c) XScale, all schemes; (d) XScale, 2 schemes; (e) PowerPC 405LP, all schemes; (f) PowerPC 405LP, 2 schemes. Axes: frame length (sec) vs. relative error (%). Schemes plotted: PGOPDVS, Greedy, Statistical, Proportional, PITDVS, PIT-PPACE, PITDVS2.]
[Figure 11: Experimental results for ATR. Panels: (a) Histogram of execution cycles for ATR (axes: cycle vs. count); (b) Comparison of DVS schemes (axes: number of images vs. energy normalized to PITDVS2; schemes plotted: PGOPDVS, Greedy, Statistical, PITDVS, Proportional, PIT-PPACE).]
We obtained the probability distribution of cycle demand of the task by profiling on a
training image set using Simplescalar [6]. Figure 11(a) shows the histogram of the execution
cycles. We then precomputed the schedule for having 2, 3, 4, 5, and 6 images to be processed
in one frame. The five schedules are stored in the back-end. When a period begins, the back-
end counts the number of images received and applies the corresponding schedule. Note
that a smaller number of images corresponds to more slack in the frame. Figure 11(b) shows
the energy consumption of each scheme normalized to that of the PITDVS2 scheme when
the back-end has 2, 3, 4, 5, and 6 images to process. From the figure we can see that the
PITDVS2 scheme can achieve significant energy savings over the previously existing schemes.
For example, the PITDVS2 scheme can achieve an average of 11% energy saving over the
Proportional scheme.
4.4 A UNIFIED APPROACH
In the previous section, we have investigated DVS schemes under the realistic processor
model. For inter-task and hybrid DVS schemes, we patched the optimal DVS schemes under
the ideal model (e.g., rounding continuous speed to available discrete speed) in order to
comply with the realistic model. As a result, the optimal speeds that were derived based on
the ideal model are no longer optimal for the realistic model.
Experiments in Section 4.3.4 reveal an anomaly in the patched DVS schemes based on
the ideal model. For example, the best of all DVS schemes for the ideal model is the optimal
stochastic hybrid DVS scheme called GOPDVS [72] (refer to Section 4.2.3). However, we
have seen in Section 4.3.4 that the patched GOPDVS performs even worse than certain
schemes that do not use any stochastic information of workloads (e.g., the Proportional
scheme). This is discouraging since using more information is supposed to lead to better
results. Even for the stochastic schemes that were shown experimentally to outperform non-
stochastic schemes, it is not clear how well those stochastic schemes perform when compared
to the optimal stochastic scheme under the realistic model, which is yet to be found.
In this section, we provide a step function based approach for obtaining the optimal
stochastic DVS schemes under the realistic model. To control the computational complexity,
we use a function approximation technique to obtain DVS schemes whose resulting expected
energy consumption is guaranteed to be within a factor of 1 + ε of the optimal solution and
whose time complexity is polynomial in 1/ε, where ε is a parameter of the DVS schemes. As
with PPACE, our approximation technique falls in the category of fully polynomial time
approximation schemes (FPTAS) [20]. Our approach is unified in the sense that it can be
used to obtain all three types of DVS schemes (i.e., inter-task DVS, intra-task DVS, and
hybrid DVS).
4.4.1 Problem Formulation
To facilitate the presentation, we formally define DVS schemes. A DVS scheme consists of
N speed schedule functions Si(·) (i = 1, 2, . . . , N). Si(t) denotes the speed schedule for task
τi, when τi is ready to execute and there is time t remaining in the frame. A speed schedule
for a task dictates what speed(s) to be used for executing this task. For inter-task DVS, a
speed schedule is a single speed; for intra-task and hybrid DVS, each speed schedule contains
a set of speeds and the corresponding speed scaling points.
Let $e'_i(\varsigma, x)$ and $t'_i(\varsigma, x)$ denote the energy consumption and time for executing τi using
speed schedule ς when the actual number of execution cycles of τi is x. The expected energy
consumption for executing τi, τi+1, . . . , τN using time t can be computed recursively as
\[ E_i(t) = \sum_{k=1}^{r_i} P_i(k) \Big( e'_i(S_i(t), B_i(k)) + E_{i+1}\big(t - t'_i(S_i(t), B_i(k))\big) \Big) \]
and $E_{N+1}(t) = 0$. Thus, the goal is to find DVS schemes that minimize E1(D).
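The recursion can be evaluated directly for a given inter-task scheme. The Python sketch below is illustrative only: it assumes that a speed schedule is a single speed f, so that the time for x cycles is x/f and the energy is p(f)·x/f, and it plugs in a hypothetical schedule that always covers the remaining worst-case cycles. The function names and the sample workload are not from the original text.

```python
def expected_energy(tasks, schedule, power, t, i=0):
    """E_i(t) from the recursion; tasks[i] is a list of (P_i(k), B_i(k)) pairs."""
    if i == len(tasks):
        return 0.0                          # E_{N+1}(t) = 0
    f = schedule(i, t)                      # S_i(t): one speed for inter-task DVS
    total = 0.0
    for prob, cycles in tasks[i]:
        run_time = cycles / f               # t'_i(S_i(t), B_i(k))
        energy = power(f) * run_time        # e'_i(S_i(t), B_i(k))
        total += prob * (energy + expected_energy(tasks, schedule, power,
                                                  t - run_time, i + 1))
    return total

def worst_case_schedule(tasks):
    """Hypothetical schedule: speed = remaining worst-case cycles / remaining time."""
    def schedule(i, t):
        remaining = sum(max(b for _, b in bins) for bins in tasks[i:])
        return remaining / t
    return schedule

# Hypothetical two-task workload with a cubic power function.
tasks = [[(0.5, 1.0), (0.5, 2.0)], [(1.0, 2.0)]]
e1 = expected_energy(tasks, worst_case_schedule(tasks), lambda f: f ** 3, t=4.0)
```

The recursion visits every combination of bins, so its cost grows exponentially with the number of tasks; it is meant only to make the definition concrete.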
We present three DVS schemes in this section: (1) SIDVS, which stands for the Simple
Inter-task DVS scheme that employs inter-task DVS strategy in the absence of speed change
overhead; (2) IDVS, which is a generalization of the SIDVS scheme that considers speed
change overhead; (3) HDVS, which stands for hybrid DVS schemes (also considering speed
change overhead). Obviously, SIDVS is the simplest scheme. The differences between SIDVS
under the realistic model and that under the ideal model are discrete speeds vs. continuous
speeds, and arbitrary power function vs. well-defined power function. We present the SIDVS
scheme because its derivation contains all the essential ingredients of our approach and can
be easily extended to form the other schemes.
4.4.2 The Basic Idea for the Unified Approach
In this section, we describe the basic idea behind our approach through the discussion of
the main idea of the SIDVS scheme. The purpose of this section is to illustrate all the key
elements in our approach without delving into too much mathematical detail. We start by
describing the SIDVS scheme, followed by the properties of this scheme and how to obtain
such a scheme.
4.4.2.1 The SIDVS Scheme The SIDVS scheme (simple inter-task DVS) contains 2N
functions, two for each task in the system. Specifically, each task τi corresponds to two
functions: Ei(·) and Si(·). These two functions denote that when task τi is ready to ex-
ecute and there is time t remaining in the frame, if we use speed Si(t) to execute τi, the
minimum expected energy consumption, Ei(t), of executing τi, τi+1, . . . , τN will be achieved.
Computing functions Ei(·) and Si(·) (i.e., finding all the mappings in the functions) is done
offline, which we will discuss in Section 4.4.2.3. During the operation of the system, the OS
scheduler will consult functions Si(·) to determine the speed of each task. Specifically, at
the beginning of a frame when there is time D available, the OS scheduler will use the speed
S1(D) to execute τ1. After τ1 finishes, having taken time t′, there is time D − t′ remaining
in the frame and the OS scheduler will use the speed S2(D − t′) to execute τ2. The same
process will be applied to the rest of the tasks.
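The run-time behavior described above can be sketched as a simple loop over precomputed tables. The sketch assumes that each Si(·) is stored as a list of (speed, start time) pairs sorted by start time; this storage layout, the function names, and the sample tables are hypothetical.

```python
from bisect import bisect_right

def lookup(table, t):
    """table: list of (speed, t_start) turning points sorted by t_start."""
    starts = [t_start for _, t_start in table]
    idx = bisect_right(starts, t) - 1       # last turning point with t_start <= t
    if idx < 0:
        raise ValueError("remaining time is below the minimum allotted time")
    return table[idx][0]

def run_frame(tables, actual_cycles, deadline):
    """Replay one frame: tables[i] encodes S_{i+1}, actual_cycles[i] the demand of task i+1."""
    remaining = deadline
    speeds_used = []
    for table, cycles in zip(tables, actual_cycles):
        speed = lookup(table, remaining)    # S_i(remaining time)
        speeds_used.append(speed)
        remaining -= cycles / speed         # time t' consumed by the task
    return speeds_used, remaining

# Hypothetical tables for two tasks (speed, t_start).
S1 = [(2.0, 4.0), (1.0, 8.0)]
S2 = [(2.0, 2.0), (1.0, 4.0)]
relaxed = run_frame([S1, S2], [3.0, 4.0], deadline=8.0)
tight = run_frame([S1, S2], [3.0, 4.0], deadline=6.0)
```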
4.4.2.2 Properties of the SIDVS Scheme Before discussing how to obtain the SIDVS
scheme, we examine the properties of functions Ei(·) and Si(·), which will determine their
representation. During the examination, we also consider the scheme under the ideal model,
which will help us understand the motivation behind our approach.
[Figure 12: The SIDVS Scheme for the ideal model. Panels: (a) Ei(·) (1 ≤ i ≤ N), a curve of the form $E_i(t) = C_i / t^2$; (b) Si(·) (1 ≤ i ≤ N), a curve of the form $S_i(t) = C'_i / t$.]
We first examine functions EN(·) and SN(·) because they only involve a single task τN .
For the ideal model (assuming cubic power/frequency relationship), we have $E_N(t) = C_N / t^2$
and $S_N(t) = C'_N / t$, where neither $C_N$ nor $C'_N$ depends on t. Thus, both EN(·) and SN(·) (see
Figure 12) can be represented by just a constant (CN for EN(·) and C′N for SN(·)). This is
due to the simplicity of the ideal model.
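One way to see where these constants come from is the following sketch. It assumes a power function $p(f) = c f^3$, a worst-case cycle count $W_N$, and an expected cycle count $A_N$ for τN; the symbols $c$ and $W_N$ are introduced here only for illustration, and the argument is a plausible reading rather than the original derivation.

```latex
\[
S_N(t) = \frac{W_N}{t}, \qquad
E_N(t) = p\big(S_N(t)\big) \cdot \frac{A_N}{S_N(t)}
       = c \, A_N \, S_N(t)^2
       = \frac{c \, A_N \, W_N^2}{t^2}
\]
```

Under these assumptions $C'_N = W_N$ and $C_N = c A_N W_N^2$. The slowest speed that still finishes the worst case within t is chosen because $c A_N f^2$ increases with f.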
[Figure 13: The SIDVS Scheme for the realistic model. Panels: (a) EN(·); (b) SN(·); (c) Ei(·), i < N; (d) Si(·), i < N. The horizontal axis is t, starting at the minimum allotted time; the speed axes in (b) and (d) are marked with the discrete speeds f1, f2, . . . , fM.]
For the realistic model, however, both EN(·) and SN(·) are step functions (piece-wise con-
stant functions). Figure 13(b) shows the function of SN(·) for the realistic model, which can
be obtained by rounding its counterpart for the ideal model (Figure 12(b)) to the available
discrete speeds. We can see that there are M (M is the number of available discrete speeds)
half-open line segments in the graph, each corresponding to an available discrete speed. Each
line segment can be represented by its left end point because its right end point is the left
end point of the line segment to its immediate right or infinity when it is the rightmost line
segment of the graph. We call the left end point of a line segment a turning point. Thus,
SN(·) can be represented by 2M numbers because each turning point can be represented by
its two coordinates. Figure 13(a) shows the function of EN(·), which also has M half-open
line segments. Counting from left to right, the kth (1 ≤ k ≤ M) line segment of EN(·) shares
the same starting t coordinate and ending t coordinate with the kth line segment of SN(·).
Thus, EN(·) can also be represented by 2M numbers, as can SN(·). Note that EN(·) does
not include the idle energy consumption, as explained in Section 3.3. Computing EN(t) and
SN(t) can be turned into a table lookup, which takes O(log M) time if binary search is used.
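The turning points of SN(·) and EN(·) can be generated with a short sketch such as the one below. It assumes that each discrete speed becomes usable once the worst-case cycles of τN fit in the remaining time, and that the energy per cycle p(f)/f grows with f, so that the slowest feasible speed is the best choice. The function name, the (value, t) storage order, and the numbers are hypothetical.

```python
def last_task_tables(speeds, power, worst_case_cycles, mean_cycles):
    """Turning points of S_N and E_N as (value, t) pairs, sorted by increasing t."""
    s_points, e_points = [], []
    for f in sorted(speeds, reverse=True):       # the fastest speed needs the least time
        t_min = worst_case_cycles / f            # smallest t for which f meets the deadline
        s_points.append((f, t_min))
        e_points.append((power(f) * mean_cycles / f, t_min))
    return s_points, e_points

# Hypothetical processor with three speeds and a cubic power function.
s_n, e_n = last_task_tables([1.0, 2.0, 4.0], lambda f: f ** 3,
                            worst_case_cycles=8.0, mean_cycles=4.0)
```

Both tables have one turning point per discrete speed and share the same t coordinates, which matches the 2M-number representation described above.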
Now we examine functions Ei(·) and Si(·) (1 ≤ i < N), which involve multiple tasks.
For the ideal model, Ei(·) (1 ≤ i < N) is of the same form as EN(·), that is, $E_i(t) = C_i / t^2$,
where Ci does not depend on t. The same holds for Si(·) (1 ≤ i < N). This elegant result,
which was proved in Section 4.2.2, is again due to the simplicity of the ideal model. Thus,
both Ei(·) and Si(·) (1 ≤ i < N , see Figure 12) can still be represented by a single constant.
This means that the complexity of the representation for the ideal model does not depend
on i. However, this is not the case for the realistic model.
For the realistic model, both Ei(·) and Si(·) (1 ≤ i < N , see Figures 13(c) and 13(d)) are
still step functions. But there are more turning points in Ei(·) (1 ≤ i < N) than in EN(·).
This is because Ei(·) is the expected energy consumption for multiple tasks and different
combinations of speeds from these tasks usually result in different energy consumption. In
fact, the number of turning points of Ei(·) may suffer from exponential growth as i decreases,
which will be clear in Section 4.4.3.2. As in the case for EN(·) and SN(·), each line segment
of Ei(·) can be translated into one in Si(·). However, the number of possible values of Si(·)
is only M . If two adjacent line segments in Si(·) share the same Si coordinate, they can be
combined into one line segment. Thus, the number of turning points of Si(·) is usually much
smaller than that of Ei(·). As for the shape of the function, Ei(·) (Figure 13(c)) is still a
non-increasing function, while Si(t) (Figure 13(d)) may go up and down as t increases.
The latter claim is counter-intuitive, especially for the speed going up when t increases
(i.e., if there is more slack, the speed increases to yield lower energy consumption). This is
due to the nature of discrete speeds. We describe a scenario where this will happen. Suppose
that for some workload it is beneficial to use low speed to execute the tasks following τi.
For a given available time t, increasing the speed for τi will not give enough room to drop
the speed for the following tasks to the next lower discrete speed. However, as the available
time t increases, increasing the speed for τi will eventually be rewarded.
Similar to computing EN(t) or SN(t), computing Ei(t) or Si(t) (1 ≤ i < N) is a table
lookup, which takes O(logK), where K is the number of turning points in the function.
4.4.2.3 Obtaining the SIDVS Scheme From Section 4.4.2.2, we can see that comput-
ing Ei(·) and Si(·) is equivalent to identifying all the turning points in Ei(·) and Si(·). From
the recursive description of the problem in Section 4.4.1 it is natural to compute Ei(·) and
Si(·) in reverse order, that is, first compute EN(·) and SN(·), then EN−1(·) and SN−1(·), and
so on. The computation of Ei(·) and Si(·) only depends on Ei+1(·), as Ei+1(·) has already
“summarized” functions Ej(·) and Sj(·), where j = i + 2, . . . , N . When the computation
is done, all Ei(·) (i = 1, 2, . . . , N) can be discarded because they are not needed for the
operation of the system.
As mentioned in Section 4.4.2.2, the number of turning points in the functions may
suffer exponential growth. Thus, we propose a function approximation technique to limit
the number of points. We use an example to illustrate the technique. Consider the two
turning points, (e1, t1) and (e2, t2), inside the circle in Figure 14. Obviously, e1 > e2 and
t1 < t2. If the difference between e1 and e2 is small (formally, if $\frac{e_1 - e_2}{e_2} < \delta$, where δ is a
parameter to quantify the difference), we eliminate the point (e2, t2). Through this kind of
elimination, the number of turning points is reduced and upper bounded by a polynomial
in 1/δ. However, the resulting function is only an approximation of the original function (i.e.,
the elimination induces error). This is because when time t, where t2 ≤ t < t3, is available,
we cannot use the speed schedule corresponding to (e2, t2) since it was eliminated. We will
have to use the speed schedule corresponding to (e1, t1) and thus result in expected energy e1
that is greater than e2. Because of the way we eliminate the points, the difference between
the resulting function and the original function is guaranteed to be no more than a factor δ
of the original function.
[Figure 14: Function approximation. The plot shows Ei against t, starting at the minimum allotted time, with labeled turning points (e1, t1), (e2, t2), and (e3, t3).]
We apply the above function approximation technique to function Ei(·) before computing
Ei−1(·). Thus, the error accumulates and increases as i decreases. Let the error of E1(·) be
denoted by ε. The expected energy consumption of the system, E1(D), is within a factor of
1 + ε of the optimal expected energy consumption. If we let ε be a parameter set by system
designers, we can derive the value of δ to be used for each function approximation. The
technical detail can be found in Section 4.4.3.2.
4.4.3 The Details of the Unified Approach
Having described the basic idea behind our approach in the previous section, we provide the
technical details in this section. We first formally define step function and related operations
in Section 4.4.3.1. Then we give the algorithm to obtain the SIDVS scheme and present its
analysis in Section 4.4.3.2. Finally, we extend the algorithm for the SIDVS scheme to obtain
the IDVS and HDVS schemes in Sections 4.4.3.3 and 4.4.3.4, respectively.
4.4.3.1 On Step Functions From Section 4.4.2 we can see that step functions play an
important role in our approach. Thus, being able to represent step functions effectively and
manipulate step functions efficiently is crucial for the viability of our approach.
We first formally define step function through the following two definitions.
Definition 3. A point P is a 2-tuple (e, t), where e and t are nonnegative reals and denote
energy and time respectively. We write the energy component as P.e and the time component
as P.t.
Definition 4. A step function (piece-wise constant function) F(·) is defined as a point sequence
S = [P1, P2, . . . , Pm], where P1.t < P2.t < · · · < Pm.t. F(t) is undefined when t < P1.t;
otherwise F(t) = Pi.e, where $i = \max_{j=1,2,\ldots,m} \{ j \mid t \ge P_j.t \}$. The cardinality of S is also called the
size of function F(·).
Having formally defined step function, we will use F to denote a step function F(·) unless
confusion arises. Let |F| denote the number of points in F. Obviously, computing F(t) can
be done in time O(log |F|) by using binary search.
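Definitions 3 and 4 map naturally onto a small data structure. The Python sketch below is one possible representation, not the original implementation; the class name and the choice to return None for undefined values are assumptions made for illustration.

```python
from bisect import bisect_right

class StepFunction:
    """Step function stored as (e, t) points with strictly increasing t (Definition 4)."""

    def __init__(self, points):
        self.points = sorted(points, key=lambda p: p[1])
        self.times = [t for _, t in self.points]
        if any(a >= b for a, b in zip(self.times, self.times[1:])):
            raise ValueError("time components must be strictly increasing")

    def __len__(self):
        return len(self.points)             # |F|, the size of the function

    def __call__(self, t):
        i = bisect_right(self.times, t) - 1 # largest index j with P_j.t <= t
        if i < 0:
            return None                     # F(t) is undefined for t < P_1.t
        return self.points[i][0]

F = StepFunction([(9.0, 1.0), (4.0, 2.0), (1.0, 5.0)])
```

The lookup uses binary search over the sorted time components, which gives the O(log |F|) evaluation time mentioned above.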
We now look at three operations between a number and a step function.
Definition 5. The operator +e is defined between a real x and a step function F such that
x+e F = [(x+ P1.e,P1.t), (x+ P2.e,P2.t), . . .] (i.e., the result is still a step function). Other
operators, ×e and +t can be defined similarly.
Obviously, the operators defined in Definition 5 can be performed in time O(|F|).
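The three operators of Definition 5 each touch every point once. A minimal Python sketch, assuming a step function is stored as a list of (e, t) pairs sorted by t (the function names are hypothetical):

```python
def add_e(x, F):
    """x +e F: add x to the energy component of every point."""
    return [(x + e, t) for e, t in F]

def mul_e(x, F):
    """x ×e F: scale the energy component of every point by x."""
    return [(x * e, t) for e, t in F]

def add_t(x, F):
    """x +t F: shift the time component of every point by x."""
    return [(e, x + t) for e, t in F]

F = [(8.0, 1.0), (2.0, 3.0)]
# A term of the shape P ×e (B/f +t F), with hypothetical P = 0.5 and B/f = 2.0.
term = mul_e(0.5, add_t(2.0, F))
```

Shifting by +t preserves the ordering of the points, so the result of each operator is again a valid step function.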
Finally, we describe two operations between step functions.
Definition 6. The sum operator +F is defined between two step functions, F1 and F2, such
that F1 +F F2 = F and F(t) = F1(t) + F2(t). The merge operator ∪ is defined between two
step functions, F1 and F2 such that F1 ∪ F2 = F and F(t) = min (F1(t),F2(t)).
The resulting step function F by either the sum or the merge operators over n step
functions Fi (i = 1, 2, . . . , n) could have as many as $\sum_{i=1}^{n} |F_i|$ points. The time component
of each point in F comes from one of the Fi's. Because the points in Fi are already sorted,
the time components of all points in F can be obtained by a procedure similar to merge sort
in time $O((\sum_{i=1}^{n} |F_i|) \log n)$. To compute the energy component of each point in F, the sum
operator takes constant time and the merge operator takes O(log n) time by using a priority
queue. Thus, computing $(+_F)_{i=1}^{n} F_i$ takes $O((\sum_{i=1}^{n} |F_i|) \log n)$ time and computing $\cup_{i=1}^{n} F_i$ takes
$O((\sum_{i=1}^{n} |F_i|) \log^2 n)$ time.
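The sum and merge operators can be sketched as below. This is a deliberately simple variant, not the procedure analyzed above: it merges the sorted time components with `heapq.merge` and then re-evaluates every input function at each candidate time by binary search, which is easier to read but slower than the priority-queue approach. It also assumes that a sum is defined only where all operands are defined, while a merge takes the minimum over the operands that are defined; these conventions are assumptions of the sketch.

```python
from bisect import bisect_right
import heapq

def evaluate(F, t):
    """F(t) for a step function stored as (e, t) points sorted by t; None if undefined."""
    i = bisect_right([p[1] for p in F], t) - 1
    return F[i][0] if i >= 0 else None

def combine(functions, reducer, need_all):
    """Combine step functions pointwise at the union of their turning points."""
    result = []
    for t in heapq.merge(*[[p[1] for p in F] for F in functions]):
        values = [evaluate(F, t) for F in functions]
        defined = [v for v in values if v is not None]
        if not defined or (need_all and len(defined) < len(values)):
            continue                         # the result is undefined at this t
        e = reducer(defined)
        if not result or result[-1][0] != e: # keep only real turning points
            result.append((e, t))
    return result

def sum_f(functions):
    return combine(functions, sum, need_all=True)

def merge_f(functions):
    return combine(functions, min, need_all=False)

F1 = [(10.0, 1.0), (6.0, 3.0)]
F2 = [(5.0, 2.0), (1.0, 4.0)]
```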
4.4.3.2 The Algorithm for SIDVS Recall from Section 4.4.2.3 that we compute func-
tions Ei and Si in reverse order. For succinct presentation, we do not show the computation
of functions Si because it can be easily performed as a by-product of computing Ei.
To compute function Ei, we first consider M helper functions $\hat{E}_{i,j}$ (j = 1, 2, . . . , M),
where M is the number of available discrete speeds. $\hat{E}_{i,j}$ denotes the expected energy function
when frequency fj is used to execute task τi. A single value of $\hat{E}_{i,j}$ can be computed as
\[ \hat{E}_{i,j}(t) = \sum_{k=1}^{r_i} P_i(k) \left( \hat{p}_i(f_j) \frac{B_i(k)}{f_j} + E_{i+1}\left(t - \frac{B_i(k)}{f_j}\right) \right) = \hat{p}_i(f_j) \frac{A_i}{f_j} + \sum_{k=1}^{r_i} P_i(k) E_{i+1}\left(t - \frac{B_i(k)}{f_j}\right) \]
In the above equations, $\hat{p}_i(f_j) \frac{B_i(k)}{f_j}$ is the energy consumption of executing the first k bins of τi
and $E_{i+1}(t - \frac{B_i(k)}{f_j})$ is the expected energy consumption of executing τi+1, . . . , τN. Computing
the whole function $\hat{E}_{i,j}$ can be expressed using our notations about step functions as Line
6 in Algorithm 4.10. Function Ei is just the result of merging M $\hat{E}_{i,j}$ functions. During
the merging process, the optimal speed corresponding to each point is also determined. The
optimal algorithm to solve SIDVS is shown at Lines 1-8 in Algorithm 4.10. Line 9 is used
for the function approximation technique mentioned in Section 4.4.2.3 and will be explained
at the end of this section.
We now analyze the time complexity and space complexity of computing Ei. The key
operation in computing $\hat{E}_{i,j}$ is the sum operation over ri step functions, each of size |Ei+1|.
Thus, the time to compute $\hat{E}_{i,j}$ is $O(r_i |E_{i+1}| \log r_i)$ and the number of points in $\hat{E}_{i,j}$ is
$O(r_i |E_{i+1}|)$. The key operation in computing Ei is the merge operation over M step functions,
each of size $O(r_i |E_{i+1}|)$. Thus, the time to compute Ei is $O(M r_i |E_{i+1}| \log^2 M)$ and the
number of points in Ei is $O(M r_i |E_{i+1}|)$. Since the base case is |EN+1| = 1, we can obtain
the closed forms of the time complexity and space complexity to be $O((M r_i)^{N-i+1} \log^2 M)$
and $O((M r_i)^{N-i+1})$, respectively.
Algorithm 4.10 SIDVS(ε)
1: $E_{N+1}$ := {(0, 0)}
2: for i := N downto 1 do
3:   {compute $E_i$}
4:   for j := 1 to M do
5:     {the case where $f_j$ is used to execute $\tau_i$}
6:     $\hat{E}_{i,j} := \hat{p}_i(f_j) \frac{A_i}{f_j} +_e (+_F)_{k=1}^{r_i} P_i(k) \times_e \left( \frac{B_i(k)}{f_j} +_t E_{i+1} \right)$
7:   end for
8:   $E_i := \cup_{j=1}^{M} \hat{E}_{i,j}$
9:   $E_i := \mathrm{TRIM}(E_i, (1 + \epsilon)^{1/N} - 1)$
10: end for
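Algorithm 4.10 can be prototyped in a few dozen lines of Python. The sketch below follows the structure of the pseudocode but is not the original implementation: step functions are plain lists of (e, t) pairs, the sum and merge are computed by re-evaluating the operands at every candidate turning point (simple but slower than the procedure of Section 4.4.3.1), a single power function is used for all tasks, and the speed functions Si are not recorded. All names and the sample workload are hypothetical.

```python
from bisect import bisect_right
import heapq

def evaluate(F, t):
    """F(t) for a step function stored as (e, t) points sorted by t; None if undefined."""
    i = bisect_right([p[1] for p in F], t) - 1
    return F[i][0] if i >= 0 else None

def combine(functions, reducer, need_all):
    """Pointwise sum (need_all=True) or merge (need_all=False) of step functions."""
    result = []
    for t in heapq.merge(*[[p[1] for p in F] for F in functions]):
        values = [evaluate(F, t) for F in functions]
        defined = [v for v in values if v is not None]
        if not defined or (need_all and len(defined) < len(values)):
            continue
        e = reducer(defined)
        if not result or result[-1][0] != e:
            result.append((e, t))
    return result

def trim(F, delta):
    """Algorithm 4.11: keep a point only if the last kept energy exceeds (1 + delta) times its energy."""
    kept, last = [F[0]], F[0]
    for point in F[1:]:
        if last[0] > (1 + delta) * point[0]:
            kept.append(point)
            last = point
    return kept

def sidvs(tasks, speeds, power, eps):
    """tasks[i] = list of (P_i(k), B_i(k)); returns E_1 as a list of (e, t) points."""
    delta = (1 + eps) ** (1 / len(tasks)) - 1
    E_next = [(0.0, 0.0)]                                   # Line 1: E_{N+1}
    for bins in reversed(tasks):                            # Line 2
        mean_cycles = sum(p * b for p, b in bins)           # A_i
        candidates = []
        for f in speeds:                                    # Lines 4-7
            shifted = [[(p * e, t + b / f) for e, t in E_next] for p, b in bins]
            total = combine(shifted, sum, need_all=True)
            base = power(f) * mean_cycles / f
            candidates.append([(base + e, t) for e, t in total])
        E_next = trim(combine(candidates, min, need_all=False), delta)  # Lines 8-9
    return E_next

# Hypothetical workload: two tasks, two speeds, cubic power function.
tasks = [[(0.5, 2.0), (0.5, 4.0)], [(1.0, 4.0)]]
exact = sidvs(tasks, [1.0, 2.0], lambda f: f ** 3, eps=0.0)
```

With eps = 0 the trimming step removes nothing that lowers the energy, so `exact` is the untrimmed function for this small example; a positive eps trades accuracy for fewer points.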
The time complexity of the optimal algorithm for the SIDVS scheme depends greatly
on the size of functions Ei. As we can see from the analysis of the optimal algorithm, the
size of Ei may grow exponentially as i goes from N to 1. Thus, we need to control the
size of function Ei within some polynomial bound. To do that, we trim (i.e., remove some
points) function Ei after it is computed at Line 8 in Algorithm 4.10. A trimming parameter
δ (0 < δ < 1) is used to direct the trimming. After function Ei is trimmed, the energy
components of any adjacent points (recall from Definition 4 that the points are stored in the
order of increasing t component) differ by more than a factor of 1 + δ. The choice of $\delta = (1 + \epsilon)^{1/N} - 1$
(ε is a parameter of the SIDVS scheme) at Line 9 in Algorithm 4.10 will be clear at the end
of this section. Algorithm 4.11 shows the trimming procedure.
Algorithm 4.11 TRIM(F = [P1, P2, . . . , P|F|], δ)
1: Fˆ := {P1}
2: l := P1
3: for i := 2 to |F| do
4: if l.e > (1 + δ)Pi.e then
5: append Pi onto the end of Fˆ
6: l := Pi
7: end if
8: end for
9: return Fˆ
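A direct Python transcription of the trimming procedure is shown below as an illustration, assuming a step function is stored as a list of (e, t) pairs sorted by increasing t. The sample points are hypothetical.

```python
def trim(F, delta):
    """Keep P_1, then keep a point only if the last kept energy exceeds (1 + delta) times its energy."""
    if not F:
        return []
    kept = [F[0]]
    last = F[0]
    for point in F[1:]:
        if last[0] > (1 + delta) * point[0]:
            kept.append(point)
            last = point
    return kept

F = [(100.0, 1.0), (98.0, 2.0), (80.0, 3.0), (79.0, 4.0), (50.0, 5.0)]
trimmed = trim(F, 0.1)
```

Every dropped point has an energy within a factor of 1 + δ of the point kept before it, which is the source of the approximation guarantee analyzed next.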
The function approximation achieved by the trimming procedure is inspired by [20] and
is similar to the label elimination technique used in Section 4.3.1. Thus, we only sketch its
analysis for the sake of completeness.
Before computing the number of points in Ei after trimming, we prove an important
lemma. Let E′i (i = 1, 2, . . . , N) be the step functions obtained if Line 9 in Algorithm 4.10 is
omitted. That is, E′i is the set of functions returned by the optimal algorithm. By comparing
E′i and Ei, we have the following lemma:
Lemma 3. For every point P′ ∈ E′i where 1 ≤ i ≤ N + 1, there exists a point P ∈ Ei such
that $P'.e \le P.e \le (1 + \delta)^{N+1-i} P'.e$ and P′.t ≥ P.t.
Proof. This lemma is equivalent to $E_i(t) \le (1 + \delta)^{N+1-i} E'_i(t)$ for any value of t. The proof is
by induction on i and the base case for i = N + 1 obviously holds from Line 1 in Algorithm
4.10. In the induction step for Ei, we inspect Line 6 in Algorithm 4.10. From the hypothesis,
Ei+1(t) is within a factor of $(1 + \delta)^{N-i}$ of E′i+1(t). All the operations at Line 6 will preserve
this property. After the trimming operation, the factor will be only increased by (1 + δ),
which will make Ei(t) within a factor of $(1 + \delta)^{N-i+1}$ of E′i(t).
Using functions E′i will lead to expected energy consumption of E′1(D) and using functions
Ei will lead to expected energy consumption of E1(D). From Lemma 3, we have
$E_1(D) \le (1 + \delta)^N E'_1(D)$. Since we choose δ to be $(1 + \epsilon)^{1/N} - 1$, we have $E_1(D) \le (1 + \epsilon) E'_1(D)$.
To compute the upper bound of the number of points in Ei, we note that after the
trimming procedure, the energy components of any adjacent points differ by more than a factor
of 1 + δ. Let the leftmost point in Ei be denoted by Pl (whose energy is upper bounded by the energy
consumption when all tasks use the maximum speed) and the rightmost point in Ei be
denoted by Pr (whose energy is lower bounded by the energy consumption when all tasks use the
minimum speed). Thus, we have
\[ P_l.e > (1 + \delta)^{|E_i| - 1} P_r.e \]
By plugging in $\delta = (1 + \epsilon)^{1/N} - 1$ and some algebraic manipulations, we will obtain
$|E_i| = O\left(\frac{N \log \lambda}{\epsilon}\right)$, where $\lambda = \frac{P_l.e}{P_r.e}$. Thus, the number of points in Ei is upper bounded by a
polynomial in 1/ε.
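The algebraic manipulation can be sketched as follows, assuming 0 < ε ≤ 1 so that $\ln(1 + \epsilon) \ge \epsilon / 2$ (a standard inequality, used here as an assumption of the sketch rather than a step taken from the original analysis).

```latex
\[
(1+\delta)^{|E_i|-1} < \lambda
\;\Longrightarrow\;
|E_i| - 1 < \frac{\ln \lambda}{\ln(1+\delta)}
          = \frac{N \ln \lambda}{\ln(1+\epsilon)}
          \le \frac{2 N \ln \lambda}{\epsilon}
\]
```

The equality in the middle uses $\ln(1 + \delta) = \frac{1}{N} \ln(1 + \epsilon)$, which follows directly from the choice of δ.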
4.4.3.3 The Algorithm for IDVS Recall from Section 4.4.1 that the IDVS scheme is
a generalization of the SIDVS scheme that considers speed change overhead. The SIDVS
scheme can be easily extended to form the IDVS scheme. Instead of computing only one
expected energy function Ei for each task τi as in SIDVS, we compute M expected energy
functions Ei,s (s = 1, 2, . . . ,M). Ei,s denotes the expected energy consumption of executing
tasks τi, τi+1, . . . , τN when the current speed is fs, that is, when the speed before the execution
of τi starts is fs. Computing each Ei,s in IDVS is similar to computing Ei in SIDVS. The
only difference is that computing Ei,s takes into consideration the energy penalty and time
penalty associated with the speed change. Thus, computing each Ei,s in IDVS has the same
time and space complexity as computing Ei in SIDVS. Algorithm 4.12 shows the details of
the IDVS scheme.
In the IDVS scheme, N×M speed schedule functions are computed. During the operation
of the system, when task τi is ready to execute and there is time t remaining in the frame,
the OS scheduler will detect the current speed s of the processor and use speed Si,s(t) to
execute τi.
Algorithm 4.12 IDVS(ε)
1: for s := 1 to M do
2:   $E_{N+1,s}$ := {(0, 0)}
3: end for
4: for i := N downto 1 do
5:   for s := 1 to M do
6:     {compute $E_{i,s}$}
7:     for j := 1 to M do
8:       {the case where $f_j$ is used to execute $\tau_i$}
9:       $\hat{E}_{i,s,j} := PE(f_s, f_j) + \hat{p}_i(f_j) \frac{A_i}{f_j} +_e (+_F)_{k=1}^{r_i} P_i(k) \times_e \left( \frac{B_i(k)}{f_j} + PT(f_s, f_j) +_t E_{i+1,j} \right)$
10:     end for
11:     $E_{i,s} := \cup_{j=1}^{M} \hat{E}_{i,s,j}$
12:     $E_{i,s} := \mathrm{TRIM}(E_{i,s}, (1 + \epsilon)^{1/N} - 1)$
13:   end for
14: end for
4.4.3.4 The Algorithm for HDVS In the HDVS (Hybrid DVS) scheme, a task is
allowed to change speed during its execution. Because of the speed change overhead, a
limited number of speed scaling points at which speed may change are predefined for a
task [40]. This is similar to predefining the quantum size for an OS due to the context switch
overhead. The speed remains constant between any two adjacent speed scaling points of a
task. For ease of presentation, we choose the bin boundary of the histogram representing
the probability distribution of a task as the speed scaling points for the task. By treating
each bin of a task as a subtask, the IDVS scheme can be easily extended to form the HDVS
scheme. We add one more dimension to the expected energy functions for each task τi. That
is, we compute function Ei,s,b (s = 1, 2, . . . ,M and b = 1, 2, . . . , ri) that denotes the expected
energy consumption of executing bin b, b + 1, . . . , ri of task τi, and tasks τi+1, . . . , τN when
the current speed is fs. There is a catch, however. Ei,s,b is not only dependent on Ei,·,b+1,
but also Ei+1,·,1. This is because task τi may finish at bin b and the rest of the bins will
not be executed. Let X be the number of cycles that τi executes. We compute $\hat{P}_i(b)$, the
probability that bin b of task τi will be executed given that the previous b − 1 bins have been
executed:
\[ \hat{P}_i(b) = \mathrm{Prob}(X \ge B_i(b) \mid X \ge B_i(b-1)) = \frac{\mathrm{Prob}(X \ge B_i(b) \wedge X \ge B_i(b-1))}{\mathrm{Prob}(X \ge B_i(b-1))} = \frac{1 - \mathrm{cdf}_i(b-1)}{1 - \mathrm{cdf}_i(b-2)} \]
where cdfi(0) = cdfi(−1) = 0. Similar to the IDVS scheme, the helper function $\hat{E}_{i,s,b,j}$
denotes the expected energy consumption of executing bin b, b + 1, . . . , ri of task τi, and
tasks τi+1, . . . , τN when the current speed is fs and speed fj is used to execute bin b. Thus,
a single value of $\hat{E}_{i,s,b,j}$ can be computed as
\[ \hat{E}_{i,s,b,j}(t) = \hat{P}_i(b) \left( PE(f_s, f_j) + \hat{p}(f_j) \frac{w_i(b)}{f_j} + E_{i,j,b+1}\left(t - PT(f_s, f_j) - \frac{w_i(b)}{f_j}\right) \right) + (1 - \hat{P}_i(b)) E_{i+1,s,1} \]
where $w_i(b) = B_i(b) - B_i(b-1)$. Algorithm 4.13 shows the details of the HDVS scheme.
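The conditional probabilities can be computed from the histogram in one pass. The Python sketch below evaluates the closed form $(1 - \mathrm{cdf}_i(b-1)) / (1 - \mathrm{cdf}_i(b-2))$; it assumes that cdfi(b) is the cumulative probability of the first b bins and that every bin has positive probability. The function name and the sample histogram are hypothetical.

```python
from itertools import accumulate

def continuation_probabilities(bin_probs):
    """bin_probs[k-1] = P_i(k); returns [P̂_i(1), ..., P̂_i(r_i)]."""
    # cdf[x + 1] holds cdf_i(x), with cdf_i(-1) = cdf_i(0) = 0 as in the text.
    cdf = [0.0, 0.0] + list(accumulate(bin_probs))
    return [(1 - cdf[b]) / (1 - cdf[b - 1]) for b in range(1, len(bin_probs) + 1)]

# Hypothetical three-bin histogram.
p_hat = continuation_probabilities([0.5, 0.25, 0.25])
```

Multiplying the first b entries gives the unconditional probability that bin b is reached, here 1, 0.5, and 0.25.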
In the HDVS scheme, the speed schedule functions for each bin of a task are computed.
That is, there are a total of $M \times \sum_{i=1}^{N} r_i$ speed schedule functions. However, we do not need
this many speed schedule functions for the operation of the system. This is because for a
single task, the execution speed is always non-decreasing [40]. This indicates that the speed
schedule of a task for a given amount of time has at most M speeds and M speed scaling
points. Thus, we compute new speed schedule functions Sˆi,s (the number of speed schedules
in Sˆi,s is |Si,s,1|), which denote the speed schedules for the whole task τi when the current
speed is fs, from Si,·,b (b = 1, 2, . . . , ri) and use them during the operation of the system.
Note that when there is only one task in the system, the HDVS scheme essentially
becomes an intra-task DVS scheme. In this case, the HDVS scheme is very similar to the
PPACE scheme from a user's standpoint. This is because both schemes are fully polynomial
time approximation schemes, that is, both schemes can give performance guarantees and
achieve solutions very close to optimal. Their main difference is that PPACE computes the
speed schedule from the first bin of the task to the last bin, while HDVS proceeds in the opposite
order. In fact, our experiments show that the performance of the PPACE scheme is almost
identical to that of the HDVS scheme for one task in the sense that when setting ε to 5%,
the relative errors compared to the optimal solution for both schemes are all below 0.1%.
Algorithm 4.13 HDVS(ε)
1: for s := 1 to M do
2:   $E_{N+1,s,1}$ := {(0, 0)}
3: end for
4: for i := N downto 1 do
5:   for b := $r_i$ downto 1 do
6:     for s := 1 to M do
7:       {compute $E_{i,s,b}$}
8:       for j := 1 to M do
9:         {the case where $f_j$ is used to run bin b of $\tau_i$}
10:         $\hat{E}_{i,s,b,j} := \hat{P}_i(b) \times_e \left( PE(f_s, f_j) + \hat{p}_i(f_j) \frac{w_i(b)}{f_j} +_e \left( PT(f_s, f_j) + \frac{w_i(b)}{f_j} +_t E_{i,j,b+1} \right) \right) + (1 - \hat{P}_i(b)) \times_e E_{i+1,s,1}$
11:       end for
12:       $E_{i,s,b} := \cup_{j=1}^{M} \hat{E}_{i,s,b,j}$
13:       $E_{i,s,b} := \mathrm{TRIM}(E_{i,s,b}, (1 + \epsilon)^{1 / \sum_{k=1}^{N} r_k} - 1)$
14:     end for
15:   end for
16: end for
4.4.4 Evaluation Results
In this section, we use the IDVS and HDVS schemes as the baselines to experimentally
evaluate the existing inter-task and hybrid DVS schemes, respectively, for general frame-
based systems. All existing DVS schemes can be regarded as heuristic solutions because
they do not have any performance guarantee under the realistic model. However, history
has shown that for certain hard problems, there exist heuristic solutions that work very well
in practice, even close to the optimal. The purpose of the evaluation in this section is to
identify those good heuristic schemes. Note that in general hybrid schemes are better than
inter-task DVS schemes. However, inter-task DVS schemes are easier to implement than
hybrid schemes and sometimes are preferred by system designers. Thus, identifying good
inter-task DVS schemes is also important.
The simulation setup is the same as that for evaluation of DVS schemes for general
frame-based systems described in Section 4.3.4.2.
4.4.4.1 Evaluation of Inter-task DVS Schemes We evaluated four inter-task DVS
schemes: Proportional, Greedy, Statistical, and PITDVS. The first three schemes [48] are
non-stochastic schemes that do not use stochastic information of the workloads. They are
all based on the ideal model and need to be patched to fit the realistic model. The PITDVS
scheme is obtained by patching the optimal stochastic inter-task DVS scheme under the
ideal model (refer to Section 4.3.4). The patches for all schemes are similar, including
rounding continuous speed up to the lowest feasible discrete speed (i.e., guaranteed to meet
deadlines) and subtracting the maximum possible time penalty from the available system
time. The energy consumption of all schemes is normalized to that of the IDVS scheme with
ε = 0.05 (i.e., the energy consumption is guaranteed to be within 5% of the optimal). For
all experiments, the number of points in Si,s of the IDVS scheme is at most 97. Recall from
Section 4.4.2.3 that we only need functions Si,s during the operation of the system. Thus,
the space overhead of the IDVS scheme is very small.
[Figure 15: two panels, (a) XScale and (b) PowerPC 405LP, plotting normalized energy against frame length (seconds) for the Greedy, Statistical, Proportional, and PITDVS schemes]
Figure 15: Evaluation of Inter-task DVS Schemes (normalized to IDVS)
Figure 15 shows the evaluation results. The Statistical scheme is very close to IDVS,
which is guaranteed to be within a factor of (1 + 5%) of the optimal and in practice is often
better than this guarantee. Thus, we conclude that Statistical is very close to the
optimal even though it is not provably optimal for either the ideal or the realistic model. The
Greedy scheme also performs relatively well in most cases compared with the IDVS scheme.
PITDVS is outperformed by the Statistical scheme in most cases, which is more evident for
the PowerPC 405LP model. Even the Greedy scheme beats PITDVS in some cases. This
anomaly arises because the rounding-up effect offsets the advantage of using stochastic information.
4.4.4.2 Evaluation of Hybrid DVS Schemes We evaluated five hybrid DVS schemes:
Proportional2, Greedy2, Statistical2, PITDVS2, and PGOPDVS. The first four schemes are
different from their inter-task DVS counterparts only in that up to two speeds are used to
execute a task. This is due to the fact that any continuous speed can be emulated by using
its two adjacent discrete speeds [32]. In essence, these schemes attempt to emulate inter-
task DVS schemes under the ideal model. However, they belong to hybrid DVS schemes
technically because the speed may change during the execution of a task (thus they need
interrupt support). The PGOPDVS scheme is obtained by patching the optimal stochastic
hybrid DVS scheme GOPDVS [72] under the ideal model (refer to Section 4.3.3 for the
patching). The energy consumption of all schemes is normalized to that of the HDVS
scheme with ε = 0.05. For all experiments, the number of speed schedules in Sˆi,s of the
HDVS scheme is at most 1013. Thus, the space overhead of the HDVS scheme is reasonably
small.
[Figure 16: two panels, (a) XScale and (b) PowerPC 405LP, plotting normalized energy against frame length (seconds) for the PGOPDVS, Greedy2, Statistical2, Proportional2, and PITDVS2 schemes]
Figure 16: Evaluation of hybrid DVS Schemes (normalized to HDVS)
Figure 16 shows the evaluation results. We note several quantitative differences from the
results for inter-task DVS schemes. First, PGOPDVS performs poorly, which is a surprising
result considering that it is based on GOPDVS, the best of all schemes under the ideal model. This is
because the excessive rounding of speeds in the PGOPDVS scheme makes it drift far away from
the optimal solution. Second, relative to the other schemes, Statistical2 performs much worse
than its inter-task DVS counterpart. Third, the Greedy2 scheme is the worst of all non-stochastic
DVS schemes. Finally, the good performance of the PITDVS2 scheme is also a surprising result,
since its main idea is only to emulate the optimal stochastic inter-task DVS scheme under
the ideal model.
4.5 SUMMARY
In this chapter, we investigated energy-aware uniprocessor scheduling problems for streaming
applications. Since the problems related to deterministic workloads have been well studied,
we focused on problems of scheduling stochastic workloads, namely, STREAM-UP-S-ST and
STREAM-UP-S-TG. Solving these two problems is equivalent to finding DVS schemes for
frame-based hard real-time systems that execute stochastic workloads.
We started out by investigating DVS schemes under the ideal processor model. Because of
the simplicity of the model, provably optimal DVS schemes can be obtained and the optimal
schemes for different DVS strategies share great similarities. Our main contributions are to
propose OITDVS, the Optimal Inter-Task DVS scheme under the ideal processor model, and
to provide a unified view of the optimal DVS schemes for all DVS strategies under the ideal
processor model.
We then turned to DVS schemes for the realistic processor model. We presented PPACE
(Practical PACE), which is a new DVS scheme for frame-based systems with a single task
that takes into consideration discrete speeds and speed change overhead. PPACE can give
performance guarantees and achieve energy savings very close to the optimal solution. For
frame-based systems with multiple tasks, we proposed PITDVS2 (Practical Inter-Task DVS,
using 2 speeds) and showed that it outperforms the existing DVS schemes in our experiments.
We also showed that simple patches to optimal DVS schemes obtained under the ideal
processor model do not necessarily generate DVS schemes that perform well in practice.
However, in investigating DVS schemes for the realistic model, we used different
approaches for different DVS strategies. Furthermore, the PITDVS2 scheme is based on
heuristics. Although it performs well experimentally, it is not clear whether better
schemes exist. Driven by this motivation, we proposed a unified approach for obtaining optimal
(or provably close to optimal) stochastic inter-task, intra-task, and hybrid DVS schemes
under the realistic processor model. As a result, optimal DVS schemes for all DVS strategies
under the realistic model also share great similarity, as in the case for the ideal model. We
used the optimal schemes to establish tight upper bounds on energy savings for stochastic
DVS schemes and were able to identify good DVS schemes that are based on heuristics.
5.0 SCHEDULING IN MULTIPROCESSOR SYSTEMS
5.1 OVERVIEW
In this chapter, we consider energy-aware multiprocessor scheduling problems for streaming
applications, that is, the STREAM-MP-D-ST, STREAM-MP-D-TG, STREAM-MP-S-ST,
and STREAM-MP-S-TG problems that are described in Section 3.5.
As chip multiprocessors (CMPs) are quickly becoming the dominant computer architecture,
scheduling streaming applications on multiprocessor systems has become increasingly
important. CMPs are the main solution for continuing to improve computing performance
beyond Moore's law. However, as the number of cores on a chip increases, so does the power
density, making power management a major concern for CMPs. Compared to energy-aware
scheduling of a streaming application on uniprocessor systems, multiprocessor scheduling
poses more challenges, as follows.
1. Static-dynamic power trade-off: there is a fundamental trade-off between static and
dynamic power consumption for multiprocessor systems. Assuming perfect parallelism
for a given workload, as the number of active processors on a system increases, the
static power consumption of the system increases, while the dynamic power consumption
decreases since the load on each processor is smaller. As long as neither dynamic nor
static power accounts for most of the total power, which is true for current technology, the
two power management mechanisms, DVS and on-off, must be combined to optimize the
power consumption that leads to minimum energy consumption for applications. Thus,
we need to decide the number of active processors to execute a streaming application in
addition to determining the execution speed of each task.
2. Task mapping: we need to decide the mapping of tasks to active processors in mul-
tiprocessor scheduling. In general, task mapping for multiprocessor systems without
consideration of energy is already a hard problem [25]. Thus, energy-aware task map-
ping will only add more complexity. Even for streaming applications that have only
singleton task graph representation, multiprocessor systems open up the opportunity to
execute different instances of a task on multiple processors and we need to decide the
number of instances in order to optimize the energy consumption.
3. Two QoS requirements: the QoS constraints, throughput and response time, can
no longer be collapsed into one constraint in multiprocessor scheduling. Energy-aware
scheduling of streaming applications on multiprocessor systems, assuming that the number
of active processors is given and the deadline is equal to the period, has been
extensively studied (for example, [5]). However, scheduling under both throughput and
response time constraints deserves research attention for the following reasons. First, the
period is shorter than the deadline in many situations, for example when applying automatic
target recognition (ATR) in unmanned autonomous vehicles (UAVs). Scheduling
algorithms that assume equal deadline and period force streaming applications to service
requests faster than required, which goes against the common DVS wisdom of slowing
down processors for just-in-time completion. Second, finding the appropriate number of
active processors to execute an application is crucial for saving energy. We will show
that high static power may force streaming applications into servicing requests faster
than required (in spite of period being less than deadline) in order to save energy, which
is counterintuitive for DVS. Third, even when the optimal number of active processors is
known, the task mapping and task speed scheduling (i.e., deciding task speeds) become
more complex for multiprocessor scheduling problems because of the interplay of the two
QoS requirements.
In this chapter, we investigate energy-aware scheduling algorithms for multiprocessor
systems. The outputs of the scheduling algorithms are: (i) the number of active processors
to execute the task graph; (ii) the mapping of tasks to active processors; and (iii) the
execution speed of each task (also called the speed schedule). As in the case of uniprocessor
systems, the ultimate goal is to obtain scheduling algorithms under the realistic processor
model. Thus, we only use the ideal processor model as a stepping stone in the investigation
and do not necessarily devise a scheduling algorithm under the ideal processor model for
each problem under investigation.
Table 6 shows the road map of our investigation in this chapter. We first consider
scheduling a single task on multiprocessor systems. Our solution is a master-slave scheme
that executes different instances of the task on multiple processors to satisfy the QoS
requirements while trying to minimize the energy consumption. The proposed schemes for
deterministic and stochastic workloads are presented in Sections 5.2.1 and 5.2.2, respectively. We
then investigate how to schedule a task graph on multiprocessor systems. For deterministic
workloads, a fully polynomial time approximation scheme called Scheduling1D is proposed
for scheduling linear task graphs in Section 5.3.1. In Section 5.3.2, we propose heuristics
to reduce the complexity of scheduling general task graphs and extend Scheduling1D to a
heuristic algorithm called Scheduling2D for scheduling general task graphs. Scheduling2D is
in turn extended to deal with stochastic workloads in Section 5.3.3.
Table 6: The road map of our investigation

                 Single Task              Task Graphs
Deterministic    STREAM-MP-D-ST           STREAM-MP-D-TG
                 MS                       Scheduling2D
                 (Section 5.2.1)          (Section 5.3.2)
Stochastic       STREAM-MP-S-ST           STREAM-MP-S-TG
                 SMS                      SScheduling2D
                 (Section 5.2.2)          (Section 5.3.3)
5.2 SCHEDULING A SINGLE TASK
There are times when a streaming application cannot be parallelized or simply does not
provide a task-graph representation to a system. In this section, we consider scheduling
a single task on a multiprocessor system to attack the problems of STREAM-MP-D-ST
and STREAM-MP-S-ST. The difference between these two problems is whether the task
workload is deterministic or stochastic. Since there is only a single task, we will omit the
subscripts of the parameters of the task unless confusion arises.
In scheduling a streaming application represented by a single task, there is no parallelism
inside the application that we can exploit. This implies that each instance of the application
has to be executed entirely on one processor. Therefore, the maximum time to process a request
in the worst case using a single processor at the maximum speed must not exceed the response
time requirement (i.e., W/f_M ≤ D). Otherwise, the application is not schedulable under the QoS
requirements. Obviously, if the period is no less than the response time requirement (i.e.,
T ≥ D), we need only one processor to execute this application and there is no point in
using more than one processor. The problem of STREAM-MP-D-ST (STREAM-MP-S-ST)
is then essentially reduced to the problem of STREAM-UP-D-ST (STREAM-UP-S-ST).
It is more interesting to deal with the cases where the period is less than the response
time requirement (i.e., T < D). If we still use only one processor, the application has to finish
processing a request in time T. That is, the application is forced to process requests faster
than the response time requirement, which results in increased dynamic energy consumption.
Even worse, if T < W/f_M, the application is not schedulable under the QoS requirements. To
resolve this problem, we can take advantage of the availability of multiple processors. The
resulting scheme is a master-slave scheme, the basic idea of which is to execute different
instances of the task on multiple processors. Using more processors results in increased static
energy consumption, but it decreases the dynamic energy consumption. To optimize the
total energy, a balance must be struck between static and dynamic energy. Note that
traditionally task duplication has been used as a technique to minimize the execution time of
a task graph (e.g., [31]). However, we use it in this dissertation mainly as an energy-reduction
approach.
In the master-slave scheme (Figure 17), we use a processor that acts as the master to
receive requests and distribute them in a round-robin fashion to other processors, each acting
as a slave and running a copy of the task. The master can be placed on the administrative
processor in the system (e.g., the PPE in CELL [1]); or, because of the light workload of
the master, it can in general be placed with one slave on a processor. Therefore, we ignore
the energy consumption of the master in the analysis. We assume that the time to send a
request from the master to a slave is constant and we incorporate it into the response time
requirement.
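The round-robin distribution performed by the master can be sketched in Python as follows. This is only an illustration under simplifying assumptions: requests are numbered from 0, they arrive exactly every T time units, and the function names are hypothetical.

```python
def dispatch_round_robin(num_requests, n, T):
    """Return (request index, slave index, arrival time) for each request.

    Assumes request k arrives at time k*T and is sent to slave k mod n.
    """
    return [(k, k % n, k * T) for k in range(num_requests)]


def arrivals_per_slave(schedule, n):
    """Group the arrival times of the dispatched requests by slave."""
    per_slave = {s: [] for s in range(n)}
    for _, slave, arrival in schedule:
        per_slave[slave].append(arrival)
    return per_slave
```

With n slaves, consecutive requests handed to the same slave are nT apart, which is why each slave can be treated as a uniprocessor system with a longer effective period.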
[Figure 17: diagram of a master distributing incoming requests to several slave processors; requests arrive every T and must be served within D]
Figure 17: The master-slave Scheme
All active processors are symmetric in the sense that each processor employs the same
speed schedule. Once the number of active processors is determined, the problem of
STREAM-MP-D-ST is reduced to STREAM-UP-D-ST and the problem of STREAM-MP-S-ST is
reduced to STREAM-UP-S-ST. To see why this is the case, let us look at a simplified example
under the ideal processor model. Figure 18 shows scenarios with different numbers of
active processors for a streaming application whose response time requirement is 2.5 times
the request interarrival time (i.e., D = 2.5T). Suppose that each request takes exactly W
cycles to process. If we use 2 slaves for this application (Figure 18(a)), the processing time
for each request is 2T (note that 2T < D) and the operating frequency of each processor is
W/(2T). If we use 3 slaves (Figure 18(b)), the processing time for each request is min(D, 3T) = D
(each processor will have 0.5T of idle time between consecutive requests) and the operating
frequency of each processor is W/D.
[Figure 18: timelines of requests arriving every T; (a) using 2 slaves, each request is processed in 2T; (b) using 3 slaves, each request is processed in D]
Figure 18: Applying the master-slave scheme to a streaming application for which D = 2.5T
From the above example, we can see that the key question for the master-slave scheme is
how to determine the optimal number of active processors to minimize the energy consumption
of all the slaves. If the number of active processors is n, each active processor
acts as a frame-based real-time system of frame length min(nT, D). This also implies that
the maximum number of active processors is ⌈D/T⌉. If the number of active processors increases
beyond ⌈D/T⌉, the frame length will stay at D. As a result, the dynamic power consumption
of each processor stays the same while the static power consumption increases. Thus, the
number of active processors ranges from 1 to ⌈D/T⌉. To obtain the optimal number of active
processors, we also need to apply existing uniprocessor scheduling algorithms to compute
the dynamic energy consumption. Next, we describe how to determine the optimal number
of active processors under different scenarios.
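The frame length and the candidate range for the number of active processors can be written down directly. The following Python sketch uses hypothetical function names and arbitrary time units; it reproduces the D = 2.5T example discussed with Figure 18.

```python
import math


def max_active_processors(T, D):
    """Largest number of slaves worth considering: ceil(D / T)."""
    return math.ceil(D / T)


def frame_length(n, T, D):
    """Frame length seen by each of n active processors: min(nT, D)."""
    return min(n * T, D)


# Example from the text: D = 2.5T (here T = 1.0 in arbitrary time units).
T, D = 1.0, 2.5
frames = [frame_length(n, T, D) for n in range(1, max_active_processors(T, D) + 1)]
```

Here frames lists the frame lengths for 1, 2, and 3 slaves; with exactly W cycles per request, the corresponding ideal operating frequencies are W divided by each frame length.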
5.2.1 Deterministic Workload
5.2.1.1 Ideal Processor Model Suppose that there are a total of N requests for the
streaming application. Let the number of active processors to service the requests be denoted
by n, where 1 ≤ n ≤ ⌈D/T⌉.
The time allotted for servicing each request is t = min(D, nT), and the time to service
N requests is (N − 1) · T + t. Thus, the static energy consumption is
$$e_s(n) = n \cdot c_0 \cdot ((N-1) \cdot T + t) \approx n \cdot c_0 \cdot (N-1) \cdot T \approx n c_0 N T$$
and the dynamic energy consumption is
$$e_d(n) = N c_1 \left(\frac{W}{t}\right)^{\alpha} \times t$$
where W/t is the operating frequency of the processors. The total energy consumption is
$$e(n) = e_s(n) + e_d(n) = N\left(n c_0 T + c_1\left(\frac{W}{t}\right)^{\alpha} t\right)$$
To obtain the optimal number of active processors n∗, we first relax two constraints. That
is, we ignore the response time requirement and allow a fractional number of processors. Thus
the total energy consumption becomes
$$e(n) = N\left(n c_0 T + c_1\left(\frac{W}{nT}\right)^{\alpha} nT\right)$$
Since nc0T is an increasing function of n and c1 (W/(nT))^α nT is a decreasing, convex
function of n (for α > 1), e(n) has only one global minimum. Setting the first derivative
of e(n) to zero, we obtain the optimal (fractional) value of n
$$\tilde{n} = \frac{W}{T}\sqrt[\alpha]{\frac{c_1}{c_0}(\alpha-1)} \tag{5.1}$$
Based on the property of e(n), we have the following rules to decide the optimal actual number
of active processors n∗.
1. If ñ ≤ 1, then n∗ = 1.
2. If ñ ≥ ⌈D/T⌉, then n∗ = ⌈D/T⌉.
3. Otherwise, the optimal number of active processors is either ⌊ñ⌋ or ⌈ñ⌉, and can be
simply determined by comparing e(⌊ñ⌋) and e(⌈ñ⌉).
Thus, determining the optimal number of active processors under the ideal processor model
can be done in constant time. All active processors will use the same speed to execute the
application, which is W/min(D, n∗T).
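The three rules can be sketched in Python as follows. The function names and the parameter values used in the example are hypothetical; the code simply evaluates Formula (5.1) and applies the rules as stated, using the total energy e(n) with t = min(D, nT).

```python
import math


def total_energy(n, N, T, D, W, c0, c1, alpha):
    """e(n) under the ideal model: static plus dynamic energy of N requests."""
    t = min(D, n * T)  # time allotted for servicing each request
    return N * (n * c0 * T + c1 * (W / t) ** alpha * t)


def optimal_processors_ideal(N, T, D, W, c0, c1, alpha):
    """Apply rules 1-3 to the fractional optimum of Formula (5.1)."""
    n_max = math.ceil(D / T)
    n_tilde = (W / T) * (c1 * (alpha - 1) / c0) ** (1.0 / alpha)
    if n_tilde <= 1:
        return 1
    if n_tilde >= n_max:
        return n_max
    # Rule 3: compare the two integer neighbours of the fractional optimum.
    candidates = (math.floor(n_tilde), math.ceil(n_tilde))
    return min(candidates,
               key=lambda n: total_energy(n, N, T, D, W, c0, c1, alpha))


# Illustrative (made-up) parameters: the fractional optimum is 2.5.
n_star = optimal_processors_ideal(N=10, T=1.0, D=4.0, W=2.5,
                                  c0=2.0, c1=1.0, alpha=3)
```

For these parameters the per-request energy is 7.90625 with 2 processors and about 7.74 with 3, so rule 3 selects 3 processors.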
5.2.1.2 Realistic Processor Model The master-slave scheme under the realistic model
is called the MS scheme, which is our solution to the problem of STREAM-MP-D-ST. To
obtain the MS scheme, we first compute the total energy consumption e(n) of processing N
requests for each number of active processors n. The derivation of e(n) is similar to that for
the ideal processor model. The time allotted for servicing each request is t = min(D, nT).
The static energy consumption is
$$e_s(n) = n \cdot p_{idle} \cdot ((N-1) \cdot T + t) \approx n \cdot p_{idle} N T$$
With a slight abuse of notation, we use ⌈s⌉ to denote the closest discrete speed higher than
s. Thus, the dynamic energy consumption is
$$e_d(n) = N \cdot p\left(\left\lceil \frac{W}{t} \right\rceil\right) \frac{W}{\left\lceil \frac{W}{t} \right\rceil}$$
The optimal number of active processors n∗ is
$$n^* = \arg\min_{1 \le n \le \lceil D/T \rceil} \left(e_s(n) + e_d(n)\right)$$
Unlike the case for the ideal model, we do not have a closed-form solution for n∗
under the realistic model. However, we can simply try all possible numbers of active processors
ranging from 1 to ⌈D/T⌉ to find n∗. Obviously, the time complexity of this approach is O(⌈D/T⌉).
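The brute-force search can be sketched in Python as follows. This is a minimal illustration rather than the dissertation's implementation: the function name is hypothetical, the processor is described by an assumed list of (speed, active power) pairs, ⌈s⌉ is interpreted as the lowest discrete speed that is at least s, and the approximate static energy n · p_idle · N · T is used.

```python
import math


def ms_optimal_processors(W, T, D, N, levels, p_idle):
    """Try every n in 1..ceil(D/T) and return (n*, total energy).

    levels: list of (speed, active_power) pairs for the discrete speeds
    (hypothetical values supplied by the caller).
    """
    best_n, best_energy = None, float("inf")
    for n in range(1, math.ceil(D / T) + 1):
        t = min(D, n * T)  # time allotted for servicing each request
        feasible = [(f, p) for f, p in levels if f >= W / t]
        if not feasible:
            continue  # even the highest speed cannot finish W cycles in t
        f, p = min(feasible)  # lowest feasible discrete speed
        energy = n * p_idle * N * T + N * p * W / f
        if energy < best_energy:
            best_n, best_energy = n, energy
    if best_n is None:
        raise ValueError("task is not schedulable under the QoS requirements")
    return best_n, best_energy


# Made-up two-level processor whose active power is the cube of its speed.
levels = [(1.0, 1.0), (2.0, 8.0)]
low_static = ms_optimal_processors(2.0, 1.0, 2.0, 10, levels, p_idle=0.5)
high_static = ms_optimal_processors(2.0, 1.0, 2.0, 10, levels, p_idle=100.0)
```

With the low idle power, two processors running at the lower speed are cheaper; with the high idle power, a single processor at the higher speed wins, which mirrors the static-dynamic trade-off described in Section 5.1.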
5.2.2 Stochastic Workload
Determining the optimal number of active processors for stochastic workload is very similar
to that for deterministic workload. As in the case of deterministic workload, the number
of active processors decides the static energy consumption and the deadline for processing
each request. Once the deadline is known, we can apply the PACE scheme [40] for the ideal
processor model, or the PPACE scheme described in Section 4.3.1 for the realistic processor
model, to compute the dynamic energy for processing each request.
Thus, for the ideal processor model, we follow a similar derivation as in Section 5.2.1 and
apply Formula (4.4) to obtain the optimal (fractional) number of active processors
$$\tilde{n} = \frac{1}{T}\sqrt[\alpha]{\frac{c_1 \sum_{j=1}^{W} F_j^{1/\alpha}}{c_0}(\alpha-1)}$$
We can follow the same rules as in Section 5.2.1 to decide the optimal number of active
processors n∗, which also takes constant time.
For the realistic processor model, we can use the same brute force approach as in Section
5.2.1. The resulting scheme is called the SMS (Stochastic MS) scheme, which is our solution
to the problem of STREAM-MP-S-ST. In the SMS scheme, the PPACE scheme is applied
in computing dynamic energy consumption. The time complexity of the SMS scheme is
O((D/T) · r^2 M ln λ ln(r ln λ) / ε) according to the time complexity of PPACE in Section 4.3.1.
5.2.3 Evaluation
When scheduling a streaming application represented by a single task, making use of multi-
ple processors serves two purposes. One is to satisfy the throughput requirement when the
time to process a request using maximum speed is greater than the period. In this case, we
are forced to use multiple processors. The other purpose is to reduce dynamic energy (while
increasing static energy) to minimize the total energy consumption. We are more interested
in the latter. In this section, we quantify the impact of using multiple processors in energy
reduction in our proposed schemes through experiments, as follows.
Power models. For the processor power model in the experiments, we used Intel XScale
[64]. The static power of the processor was varied to reflect the percentages of the static
power in total power being 22%, 44%, and 67% for the 70nm, 50nm, and 35nm technologies,
respectively [21].
Experiments on deterministic workload. We first compare the MS scheme and the
uniprocessor scheme [7] for a single task. Note that if the MS scheme turns out to use only
one processor, it is essentially equivalent to the uniprocessor scheme for a single task. The
number of execution cycles for the task ranges from 1 million to 100 million. For the period
T of the task, we chose 20 values distributed evenly between the minimum possible execution
time (using the maximum speed) and half of the maximum possible execution time (using
the minimum speed). For the deadline D of the task, we chose 20 values distributed evenly
between twice the minimum possible execution time and the maximum possible execution
time. Thus, we have 20× 20 = 400 combinations of T and D. However, we only experiment
with those combinations for which 2T ≤ D since otherwise the MS scheme will use one
processor and be equivalent to the uniprocessor scheme.
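The construction of the (T, D) combinations can be sketched in Python as follows. The helper names are hypothetical, "distributed evenly" is assumed to include both endpoints, and the minimum and maximum frequencies in the example (150 MHz and 1000 MHz) are assumed illustrative values, not figures taken from the text.

```python
def linspace(lo, hi, count):
    """count evenly spaced values from lo to hi, endpoints included."""
    step = (hi - lo) / (count - 1)
    return [lo + k * step for k in range(count)]


def qos_grid(W, f_min, f_max, count=20):
    """Enumerate (T, D) pairs as described above, keeping those with 2T <= D."""
    t_fast = W / f_max  # minimum possible execution time
    t_slow = W / f_min  # maximum possible execution time
    periods = linspace(t_fast, 0.5 * t_slow, count)
    deadlines = linspace(2 * t_fast, t_slow, count)
    return [(T, D) for T in periods for D in deadlines if 2 * T <= D]


# 1 million cycles; the two frequencies below are assumed values.
grid = qos_grid(W=1e6, f_min=150e6, f_max=1000e6)
```

Only the pairs kept in grid are used in the comparison, since for the others the MS scheme would use a single processor.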
We find that the experimental results for different number of execution cycles are very
similar. This can be mostly explained by Formula (5.1) and the way we chose the value of T .
That is, the chosen values of T are all proportional to W . The optimal (fractional) numbers
of processors are the same for different number of execution cycles. Thus, it is sufficient to
show the results for the number of execution cycles being 1 million.
For 50nm and 35nm technologies, the MS scheme always uses only one processor because
of the large static power. Thus, the MS scheme provides no savings over the uniprocessor
scheme. For 70nm technology, the MS scheme could use up to 2 processors and provides up
to 23.5% energy savings over the uniprocessor scheme when the period is small (Figure 19).
However, when the period becomes larger and so does the deadline, the MS scheme will use
only 1 processor again. The experimental results verify that the smaller the static
power, the more processors the MS scheme could end up using (also depending on the two
QoS requirements).
[Figure 19: surface plot of energy saving as a function of period (msec) and deadline (msec)]
Figure 19: Energy savings for 70nm technology
Experiments on stochastic workload. We also compare the SMS scheme with the
uniprocessor scheme (PPACE) for a single task. We use the six distributions described in Section
3.6 for the task cycle distributions. The values of T and D are generated in the same way
as in the case of deterministic workload.
We find that in all experiments, the SMS scheme uses only one processor and thus there
are no savings over the uniprocessor scheme. The result is consistent with the previous
result on deterministic workload. This is because if we compare two tasks that have the
same number of worst-case execution cycles, but one has deterministic workload and the
other has stochastic workload, the dynamic energy consumption of the former is greater
than that of the latter when the deadlines are the same. Thus, the SMS scheme will tend to use
fewer processors than the MS scheme.
5.3 SCHEDULING A TASK GRAPH
The previous section assumes that a streaming application is represented by a single task,
and thus there is no parallelism that we can exploit inside the application. In this section,
we consider scheduling a streaming application represented by a task graph to attack the
problems of STREAM-MP-D-TG and STREAM-MP-S-TG. The structure of a task graph
exposes the parallelism in time (indicated by predecessor and successor relationship in the
task graph) and the parallelism in space (indicated by sibling relationship in the task graph).
We apply pipelining to exploit the parallelism in time and parallel processing
to exploit the parallelism in space, for the purpose of saving energy. We first
consider task graphs with deterministic workload and propose a scheme called Scheduling2D
to solve the problem of STREAM-MP-D-TG. We then extend the Scheduling2D scheme to
deal with task graphs with stochastic workload to attack the problem of STREAM-MP-S-
TG.
5.3.1 Scheduling for Linear Task Graphs with Deterministic Workload
In this section, we present the scheduling algorithm for a special type of task graphs, linear
task graphs. We start with scheduling for linear task graphs because it serves as the basis
for our scheduling algorithm for general task graphs.
In a linear task graph, task τ1 is the source, τn is the sink, and the only predecessor of
task τi (1 < i ≤ n) is task τi−1 (that is, tasks can be arranged to form a straight line and a
total order can be established on all tasks). Thus, only pipelining can be explored for linear
task graphs.
Although optimal scheduling of linear task graphs is NP-hard [5], it can be approximately
solved by a fully polynomial time approximation scheme. That is, the solution returned by
the scheduling algorithm is guaranteed to be within ε (ε is a user-defined parameter) of the
optimal solution and the scheduling algorithm runs in time polynomial in 1/ε.
5.3.1.1 Y-Oriented Load To gain some insight, we simplify the problem of scheduling a
linear task graph by relaxing the application model and applying the ideal processor model.
We relax the application model by assuming that a streaming application is represented
by a single task τ , following the divisible load model [11] (Figure 20(a)). In other words,
although all W cycles of τ can only be executed consecutively, the task can be arbitrarily
partitioned into any number of load fractions that have precedence relations. We call this
Y-oriented load (Figure 20(b)) because it is depicted in the direction of Y-axis. There is also
a notion of X-oriented load, which will be described in Section 5.3.2.1. Moreover, we ignore
all communication cost.
Recall from Section 3.3.1 that the processor power function in the ideal model is
$$p(f) = c_0 + c_1 f^{\alpha} \tag{5.2}$$
where f is the operating frequency, constant c0 represents the static power, and the term
c1 f^α represents the dynamic power.
[Figure 20: three panels drawn against X and Y axes, (a) load, (b) Y-oriented, (c) X-oriented]
Figure 20: Divisible load
To schedule τ, we first ignore the deadline constraint and consider the problem of energy
minimization of the load τ subject to only the throughput requirement, 1/T. Pipelining is
a natural approach to satisfying the throughput requirement, and load balancing is desired
due to the convexity of the power function in Equation (5.2). Suppose that y processors
are used to execute this load. Each processor (corresponding to a pipeline stage) is then
assigned W/y cycles and required to execute these cycles in time T. Thus, the speed for each
processor given only the throughput requirement is (W/y)/T = W/(yT). In servicing a single request,
the static energy consumption is y c0 T and the dynamic energy consumption is y c1 (W/(yT))^α T.
Therefore, the total energy consumption for servicing a request is
$$e_Y(y) = y c_0 T + \frac{c_1 W^{\alpha}}{y^{\alpha-1} T^{\alpha-1}} \tag{5.3}$$
which is a unimodal function and has a global minimum. This shows that starting from
a single stage, deepening the pipeline will reduce the energy consumption while satisfying
the throughput requirement 1/T, until the number of pipeline stages increases past a certain
value y∗, which is the optimal number of pipeline stages for load τ. The optimal number of
pipeline stages strikes a balance between static and dynamic power. In fact, by obtaining
the first derivative of eY (y) and equating it to zero, we have the optimal number of pipeline
stages for executing τ , which is given by
y∗ = α
√
c1(α− 1)
c0
· W
T
(5.4)
Note that for the purpose of describing the basic idea succinctly, we allowed ourselves not
to be rigorous, that is, we can use fractional number of processors and do not consider the
boundary conditions.
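Equations (5.3) and (5.4) can be checked numerically. The following is a minimal sketch, not part of the original analysis; the function names and the parameter values (W, T, c0, c1, and α = 3) are illustrative assumptions chosen so that the arithmetic is easy to follow.

```python
def energy_per_request(y, W, T, c0, c1, alpha=3.0):
    """Equation (5.3): energy of a y-stage pipeline servicing one request."""
    static = y * c0 * T
    dynamic = c1 * W**alpha / (y * T) ** (alpha - 1)
    return static + dynamic


def optimal_stages(W, T, c0, c1, alpha=3.0):
    """Equation (5.4): real-valued optimal number of pipeline stages."""
    return (c1 * (alpha - 1) / c0) ** (1.0 / alpha) * W / T


# Hypothetical parameters (not taken from the dissertation).
W, T, c0, c1 = 8.0, 1.0, 1.0, 0.5

y_star = optimal_stages(W, T, c0, c1)

# Brute-force check over integer pipeline depths 1..32.
best_integer = min(range(1, 33),
                   key=lambda y: energy_per_request(y, W, T, c0, c1))

print(y_star, best_integer, energy_per_request(best_integer, W, T, c0, c1))
```

With these values the closed form and the brute-force search agree on eight stages; in general the real-valued optimum has to be rounded and both neighbouring integers compared.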
Suppose that we now impose the deadline constraint D on τ. If D = T, we can use only 1 pipeline stage; if D = 2T, we can use 2 pipeline stages and the energy consumption is reduced. This shows that the difference between T and D can have an impact on energy reduction. Ideally, to minimize the energy consumption of τ, the deadline should satisfy D > y∗T. In this case, the response time of τ is y∗T, which is less than the deadline constraint D. This shows that the static power affects the upper bound of the response time when the goal is to save energy, and that sometimes the application needs to service requests faster than the response time requirement to save energy. This is counter-intuitive because common wisdom on DVS scheduling says that the execution of tasks should be stretched as much as possible as long as the deadline constraint is not violated.
5.3.1.2 The Scheduling1D Algorithm We now return to the realistic processor model to design an algorithm for scheduling a linear task graph. Since linear task graphs are analogous to Y-oriented load and there is only parallelism in time, we call our scheduling algorithm for linear task graphs Scheduling1D.
The Structure of the Optimal Solution The basic scheduling strategy for linear task graphs is pipelining. Three questions need to be answered in order to schedule a linear task graph.
1. How many pipeline stages should be used to execute this task graph? Each pipeline stage corresponds to a processor. From the analysis in Section 5.3.1.1, we can see that the static power consumption has a significant impact on the optimal number of pipeline stages.
2. How should the tasks in the task graph be mapped to processors? Obviously, only consecutive tasks in the task graph can be mapped to the same processor (i.e., a pipeline stage). Note that the mapping granularity is now tasks rather than cycles as in Section 5.3.1.1 and that, due to the communication energy and delay, load balancing is not necessarily desired.
3. What speed should be used for each processor such that the delay of each stage is no more than the period and the total delay of all stages is no more than the deadline? Note that, due to the convexity of the power function and the fact that there is no communication cost among the tasks on the same processor (a pipeline stage), we make the simplifying assumption that all tasks on the same processor use the same speed.
The above three questions are correlated and should not be considered separately. Thus, the scheduling algorithm needs to determine the optimal number of stages, the mapping, and the speed schedule simultaneously. We first present an optimal scheduling algorithm for linear task graphs. This optimal algorithm has worst-case exponential time complexity. Below, we propose an approximation algorithm that is based on the optimal algorithm.
The optimal scheduling algorithm for linear task graphs is based on the recursive structure of the optimal solution. Let the optimal scheduling of the tasks $\tau_i$ through $\tau_n$ when the end-to-end delay from task $\tau_i$ to task $\tau_n$ is $t$ be denoted by the vector-valued function $E_i(t) = [e, q, j, d]$, where $e$ denotes the minimum energy consumption of executing the tasks $\tau_i$ through $\tau_n$ when servicing a single request, $q$ denotes the optimal number of stages for the tasks $\tau_i$ through $\tau_n$, $j$ indicates that tasks $\tau_i, \tau_{i+1}, \ldots, \tau_j$ are mapped to the first stage of the $q$ stages, and $d$ is the time used for the first stage plus the communication delay from the first stage to the next stage. With a slight abuse of notation, we use $E_i(t).e$ to denote the $e$ component of $E_i(t)$ (similar notations are used for the other components of $E_i(t)$). Suppose that we are given the functions $E_{i+1}(\cdot)$ through $E_n(\cdot)$. We can then compute $E_i(t)$ using the pseudo-code in Algorithm 5.1 (note that $v_{n,n+1} = 0$, that is, there is no communication after the sink).
Algorithm 5.1 Computing $E_i(t)$
 1: $E_i(t).e := \infty$
 2: {n is # of tasks}
 3: for j := i to n do
 4:   $W := \sum_{k=i}^{j} W_k$
 5:   {M is # of frequencies}
 6:   for s := 1 to M do
 7:     $d := \frac{W}{f_s} + t_p + \lambda v_{j,j+1}$
 8:     if $d \le T$ then
 9:       {$e_1$ is the energy for the 1st stage}
10:       $e_1 := (p(f_s) - p_{idle}) \frac{W}{f_s} + p_{idle} T$
11:       $e := e_1 + \gamma v_{j,j+1} + E_{j+1}(t - d).e$
12:       if $e < E_i(t).e$ then
13:         $E_i(t).e := e$
14:         $E_i(t).q := E_{j+1}(t - d).q + 1$
15:         $E_i(t).j := j$
16:         $E_i(t).d := d$
17:       end if
18:     end if
19:   end for
20: end for
The computation of the energy for the first stage at Line 10 in Algorithm 5.1 needs further clarification. The first term, $(p(f_s) - p_{idle}) \frac{W}{f_s}$, is the dynamic energy consumption of the first stage for servicing a single request. The second term, $p_{idle} T$, is the static energy consumption, because the request lasts for the whole period $T$ and $p_{idle}$ is the power always consumed in a processor when it is on.
The optimal energy consumption and scheduling information of the whole linear task graph will be obtained from $E_1(D)$. We can see from Algorithm 5.1 that, in order to compute $E_1(D)$, we need to compute $E_2(\cdot), E_3(\cdot), \ldots, E_n(\cdot)$. In general, $E_i(\cdot)$ depends on $E_k(\cdot)$ ($k = i+1, \ldots, n$). Thus, the base case is $E_n(\cdot)$, which denotes the scheduling information for a single task, $\tau_n$. It is not difficult to see that the base case is a step function (piece-wise constant function) because there is only a limited set of discrete speeds available in processors. By induction, we can show that all functions $E_i(\cdot)$ are step functions. A step function can be represented by the end points of its intervals. Once all end points of a step function are identified, we can obtain any value of that step function. Because of the discrete nature and recursive structure of $E_i(\cdot)$, we can apply dynamic programming to compute these functions. We compute the functions $E_i(\cdot)$ ($i = 1, 2, \ldots, n$) in reverse order. That is, we first compute $E_n(\cdot)$, then $E_{n-1}(\cdot)$, and so on, and finally $E_1(\cdot)$. Note that computing the function $E_i(\cdot)$ means identifying all its end points, rather than computing a single value as in Algorithm 5.1.
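The recursion of Algorithm 5.1 can be sketched as a memoized function that evaluates single values of E_i(t). This is only an illustration under stated assumptions: tasks are 0-indexed, `make_solver` and all parameter values are hypothetical, and the extra test `d <= t` stands in for treating E_{j+1} as infinite at negative time, which the pseudo-code leaves implicit. It does not build the step-function representation used by the actual algorithm.

```python
import math
from functools import lru_cache


def make_solver(W, v, freqs, power, p_idle, T, t_p, lam, gamma):
    """W[k]: cycles of task k; v[k]: data sent from task k to task k+1
    (v[-1] must be 0); freqs: discrete speeds; power(f): power at speed f."""
    n = len(W)

    @lru_cache(maxsize=None)
    def E(i, t):
        # Returns (e, q, j, d) for tasks i..n-1 given delay budget t.
        if i == n:  # nothing left to schedule
            return (0.0, 0, None, 0.0)
        best = (math.inf, 0, None, 0.0)
        cycles = 0.0
        for j in range(i, n):          # tasks i..j form the first stage
            cycles += W[j]
            for f in freqs:
                d = cycles / f + t_p + lam * v[j]
                if d <= T and d <= t:  # period and remaining-budget checks
                    e1 = (power(f) - p_idle) * cycles / f + p_idle * T
                    rest = E(j + 1, t - d)
                    e = e1 + gamma * v[j] + rest[0]
                    if e < best[0]:
                        best = (e, rest[1] + 1, j, d)
        return best

    return E


# Hypothetical two-task example with no communication cost.
E = make_solver(W=(2.0, 2.0), v=(0.0, 0.0), freqs=(1, 2),
                power=lambda f: 0.5 + f**3, p_idle=0.5,
                T=2.0, t_p=0.0, lam=0.0, gamma=0.0)
relaxed = E(0, 4.0)   # deadline D = 2T: two slow stages
tight = E(0, 2.0)     # deadline D = T: one fast stage
print(relaxed, tight)
```

In this toy instance the relaxed deadline allows two stages at the low speed, while the tight deadline forces a single stage at the high speed and roughly triples the energy.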
The Optimal Scheduling Algorithm Before we present the algorithm to compute Ei(·),
we refer readers to Section 4.4.3.2 for the representation of step functions and the associated
operations. For succinct presentation, we do not show the computation of functions Ei(·).q,
Ei(·).j and Ei(·).d because they can be easily performed as a by-product of computing Ei(·).e.
We will also write Ei(·).e as Ei(·).
In Algorithm 5.1, we computed a single value of $E_i(\cdot)$. Now we make use of step functions to compute the whole function $E_i(\cdot)$ (i.e., identify all the end points in $E_i(\cdot)$). To do that, we first consider $(n-i+1) \times M$ helper functions $\hat{E}_{i,j,s}$ ($j = i, i+1, \ldots, n$ and $s = 1, 2, \ldots, M$), where $M$ is the number of available discrete speeds. $\hat{E}_{i,j,s}$ denotes the energy function when tasks $\tau_i, \tau_{i+1}, \ldots, \tau_j$ are mapped to the first stage and $f_s$ is the speed used in the first stage. A single value of $\hat{E}_{i,j,s}$ can be computed as
$$\hat{E}_{i,j,s}(t) = (p(f_s) - p_{idle}) \frac{\sum_{k=i}^{j} W_k}{f_s} + p_{idle} T + \gamma v_{j,j+1} + E_{j+1}\!\left(t - \frac{\sum_{k=i}^{j} W_k}{f_s} - t_p - \lambda v_{j,j+1}\right) \qquad (5.5)$$
Computing the whole function $\hat{E}_{i,j,s}$ can be expressed using the step function operations described in Section 4.4.3.2, as in Line 9 of Algorithm 5.2. The desired function $E_i$ is obtained by merging the $(n-i+1) \times M$ functions $\hat{E}_{i,j,s}$. During the merging process, the optimal values of $E_i(\cdot).q$, $E_i(\cdot).j$ and $E_i(\cdot).d$ corresponding to each point are also determined. The optimal scheduling algorithm for linear task graphs consists of all lines of Algorithm 5.2 except Line 16, which is used only by the approximation algorithm.
We now analyze the time complexity and space complexity of computing $E_i$. Computing the helper function $\hat{E}_{i,j,s}$ takes time $O(|E_{j+1}|)$ and the number of points in $\hat{E}_{i,j,s}$ is also $O(|E_{j+1}|)$. The key operation in computing $E_i$ is the merge operation over $(n-i+1) \times M$ step functions, each of size $O(|E_{i+1}|)$. Thus, the time to compute $E_i$ is $O(nM|E_{i+1}| \log_2 nM)$ and the number of points in $E_i$ is $O(nM|E_{i+1}|)$. Since the base case is $|E_{n+1}| = 1$, we can obtain the closed forms of the time complexity and space complexity of computing $E_i$ to be $O((nM)^{n-i+1} \log_2 nM)$ and $O((nM)^{n-i+1})$, respectively. Since the optimal solution is in $E_1(\cdot)$, the time complexity and space complexity of the optimal scheduling algorithm for linear task graphs are $O((nM)^n \log_2 nM)$ and $O((nM)^n)$, respectively.
Approximation Algorithm The time complexity of the optimal scheduling algorithm for linear task graphs depends greatly on the size of the functions $E_i$. As we can see from the analysis of the optimal scheduling algorithm, the size of $E_i$ may grow exponentially as $i$ goes from $n$ to 1. Thus, we need to keep the size of function $E_i$ within some polynomial bound. To do that, we apply an approach similar to that in Section 4.4.3.2. That is, we trim function $E_i$ using the TRIM procedure in Algorithm 4.3 with the value of the parameter $\delta = (1+\epsilon)^{1/n} - 1$. By using the same reasoning and derivation, we obtain $|E_i| = O\!\left(\frac{n \log \lambda}{\epsilon}\right)$, where $\lambda = \frac{P_l.e}{P_r.e}$. Thus, the number of points in $E_i$ is upper bounded by a polynomial in $\frac{1}{\epsilon}$.
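The choice of δ can be read as splitting the error budget evenly over the n trimming rounds. The derivation below is a sketch under two assumptions that are not restated in this section and should be checked against Algorithm 4.3 in Section 4.4.3.2: each call to TRIM inflates the stored energy values by at most a factor of (1 + δ), and consecutive retained points differ in energy by at least a factor of (1 + δ).

```latex
\begin{align*}
(1+\delta)^n &= \bigl((1+\epsilon)^{1/n}\bigr)^n = 1+\epsilon, \\
|E_i| &\le 1 + \log_{1+\delta}\lambda
       = 1 + \frac{\ln\lambda}{\ln(1+\delta)}
       = 1 + \frac{n\ln\lambda}{\ln(1+\epsilon)}
       \le 1 + \frac{2n\ln\lambda}{\epsilon}
       \qquad (0 < \epsilon \le 1),
\end{align*}
```

where the last step uses $\ln(1+\epsilon) \ge \epsilon/2$ on $(0, 1]$. Under these assumptions the compounded error after n rounds is exactly the target factor 1 + ε, and the size bound matches the stated $O(n \log \lambda / \epsilon)$.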
Algorithm 5.2 Scheduling1D($\epsilon$)
 1: $E_{n+1} := \{(0, 0)\}$
 2: for i := n downto 1 do
 3:   {compute $E_i$}
 4:   for j := i to n do
 5:     $c := \sum_{k=i}^{j} c_k$
 6:     for s := 1 to M do
 7:       $d := \frac{c}{f_s} + t_p + \lambda v_{j,j+1}$
 8:       if $d \le T$ then
 9:         $\hat{E}_{i,j,s} := (P(f_s) - P_{idle}) \frac{c}{f_s} + P_{idle} T + \gamma v_{j,j+1} +_e (d +_t E_{j+1})$
10:       else
11:         $\hat{E}_{i,j,s} := \emptyset$
12:       end if
13:     end for
14:   end for
15:   $E_i := \bigcup_{j=i,\ldots,n,\; s=1,\ldots,M} \hat{E}_{i,j,s}$
16:   $E_i := \mathrm{TRIM}(E_i, (1+\epsilon)^{1/n} - 1)$
17: end for
5.3.2 Scheduling for General Task Graphs With Deterministic Workload
In this section, we present the scheduling algorithm for general task graphs. For general task graphs, there exists not only temporal parallelism (Y-oriented load), but also spatial parallelism (which we call X-oriented load). Thus, we first study X-oriented load by relaxing the application model and applying the ideal processor model, as in Section 5.3.1.1, to gain insight into the problem. Then we provide two heuristics to reduce the complexity of scheduling general task graphs. Finally, we present the complete scheduling algorithm.
5.3.2.1 X-Oriented Load We relax the application and power models in the same way as in Section 5.3.1.1, except that we assume that all $W$ cycles of τ can be executed in parallel (X-oriented load, Figure 20(c)). Now we look at the problem of energy minimization of the load τ subject to the deadline constraint, $D$. Parallel processing is a natural approach to satisfying the deadline constraint, and load balancing is desired due to the convexity of the power function. Suppose that $x$ ($x \le N$) processors are used to execute this load. Each processor is then assigned $W/x$ cycles and its corresponding speed is $\frac{W/x}{D} = \frac{W}{xD}$. The static energy consumption is $x c_0 D$ and, taking $\alpha = 3$, the dynamic energy consumption is $x c_1 \left(\frac{W}{xD}\right)^3 D$. Therefore, the total energy consumption is $e_X(x) = x c_0 D + \frac{c_1 W^3}{x^2 D^2}$, which is very similar to Equation (5.3). The optimal number of active processors for executing τ is
$$x^* = \sqrt[3]{\frac{2 c_1}{c_0}} \cdot \frac{W}{D} \qquad (5.6)$$
Analogously to Y-oriented load, starting from a single processor, increasing the degree of parallel processing initially reduces the energy consumption by reducing the dynamic energy of the processors while satisfying the deadline constraint D. However, the energy consumption starts to increase once the degree of parallelism increases past a certain value, due to the static energy used by too many processors. The optimal degree of parallelism strikes a balance between static and dynamic power.
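For reference, the optimum can be derived in the same way as for Y-oriented load. The following worked derivation keeps a general exponent α (an extension made here for comparison with Section 5.3.1.1; the text above fixes α = 3) and, like the analysis there, ignores integrality of x and the bound x ≤ N.

```latex
\begin{align*}
e_X(x) &= x c_0 D + \frac{c_1 W^{\alpha}}{x^{\alpha-1} D^{\alpha-1}}, \\
\frac{d e_X}{d x} &= c_0 D - (\alpha-1)\,\frac{c_1 W^{\alpha}}{x^{\alpha} D^{\alpha-1}} = 0
\;\Longrightarrow\;
x^{\alpha} = \frac{(\alpha-1)\,c_1}{c_0}\cdot\frac{W^{\alpha}}{D^{\alpha}}, \\
x^* &= \sqrt[\alpha]{\frac{(\alpha-1)\,c_1}{c_0}}\cdot\frac{W}{D}
\;\;\overset{\alpha=3}{=}\;\;
\sqrt[3]{\frac{2 c_1}{c_0}}\cdot\frac{W}{D}.
\end{align*}
```

The second derivative of $e_X$ is positive for $x > 0$ and $\alpha > 1$, so the stationary point is a minimum. The result has the same form as the pipeline-depth optimum for Y-oriented load, with the period T replaced by the deadline D.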
5.3.2.2 Scheduling Heuristics A general task graph can be roughly regarded as a mixture of X-oriented load and Y-oriented load. For X-oriented load, we use parallel processing to reduce energy consumption while satisfying the deadline constraint; for Y-oriented load, we use pipelining to reduce energy consumption while satisfying the throughput requirement. Thus, the first step of our scheduling algorithm is to identify the X-oriented load and Y-oriented load of the task graph. To this end, our first heuristic is to use the classical topological sort to assign a level to each node in the task graph. The level of a node (task) is equal to 1 plus the length of the longest path (see footnote 1) from the source to this node. By assigning a level to each node in the task graph, we essentially morph the task graph into a two-dimensional structure. Figures 21(a) and 21(b) show an example of task graph morphing. The tasks on the same level represent the X-oriented load and the tasks across different levels represent the Y-oriented load.
To match the two-dimensional structure of the morphed task graph, we conceptually consider the processors to form a two-dimensional logical structure (Figure 21(d)). In mapping the task graph onto the processors, the X-dimension is used to map the X-oriented load and the Y-dimension is used to map the Y-oriented load, while considering energy consumption in both dimensions. The logical arrangement of processors makes the underlying logical tiled structure the same as the structure of the task graph, which makes the mapping process computationally tractable. Our second heuristic is to let a row of processors in the logical tiled structure correspond to a pipeline stage in the mapping process, and we only allow contiguous levels of the morphed task graph to be mapped to the same pipeline stage. Figure 21(c) shows a possible mapping from levels to pipeline stages, and Figure 21(d) shows a possible mapping from tasks to processors on each pipeline stage.
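The level assignment of the first heuristic can be sketched with a standard topological traversal (Kahn's algorithm). This is an illustrative sketch: the function names and the small example graph are hypothetical and are not the graph of Figure 21.

```python
from collections import defaultdict, deque


def assign_levels(nodes, edges):
    """Level of a node = 1 + length (in edges) of the longest path
    from a source node to it, computed in topological order."""
    succ = defaultdict(list)
    indeg = {u: 0 for u in nodes}
    for u, w in edges:
        succ[u].append(w)
        indeg[w] += 1

    level = {u: 1 for u in nodes}
    ready = deque(u for u in nodes if indeg[u] == 0)
    while ready:
        u = ready.popleft()
        for w in succ[u]:
            # A successor must sit below its deepest predecessor.
            level[w] = max(level[w], level[u] + 1)
            indeg[w] -= 1
            if indeg[w] == 0:
                ready.append(w)
    return level


def group_by_level(level):
    """Tasks on the same level form the X-oriented load of that level."""
    rows = defaultdict(list)
    for node, lv in sorted(level.items()):
        rows[lv].append(node)
    return dict(rows)


# Hypothetical task graph.
nodes = ["A", "B", "C", "D", "E"]
edges = [("A", "B"), ("A", "C"), ("B", "D"),
         ("C", "E"), ("D", "E"), ("A", "E")]
levels = assign_levels(nodes, edges)
rows = group_by_level(levels)
print(rows)
```

Note that E is placed on level 4 although it has a direct edge from A; using the longest path keeps every task strictly below all of its predecessors.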
Footnote 1: We consider the longest path due to the precedence constraints.
Figure 21: An example of scheduling for general task graphs. Panels: (a) A task graph; (b) Level assignment to tasks; (c) Pipeline stage formation; (d) Final task mapping.
5.3.2.3 The Scheduling2D Algorithm Unlike previous work on energy-aware task graph scheduling, which separated task mapping and speed scheduling, we interweave these two closely correlated components. There are two types of mappings, each corresponding to a dimension. The first type of mapping is called Y-mapping, which is performed along the Y-dimension (pipelining dimension). Performing Y-mapping includes: (i) determining the optimal number of pipeline stages; (ii) allotting time to each stage while guaranteeing that the allotted time for each stage is no greater than T and that the sum of the allotted times for all stages is no greater than D; (iii) mapping levels to pipeline stages. The second type of mapping is called X-mapping, which is performed along the X-dimension (parallel processing dimension) for each stage. Performing X-mapping for each stage includes: (i) determining the optimal number of active processors for the stage; (ii) mapping tasks to active processors; (iii) deciding the execution speed for each task while guaranteeing that all tasks finish executing and transferring data to their successors within the allotted time for the stage. Note that task speed scheduling occurs during X-mapping. Because of the mapping along two dimensions, we call the scheduling algorithm Scheduling2D.
We first explain X-mapping since it is less involved. X-mapping is essentially the classical multiprocessor scheduling problem (which is NP-hard): if the number of active processors for a stage is known, we can apply the classical list scheduling algorithm to approximate the load-balancing mapping, and then apply the techniques in [24, 45, 5] (see footnote 2) to obtain the execution speed for each task. To determine the optimal number of active processors in a pipeline stage, it is straightforward to use a brute-force approach that checks every possible number of active processors and picks the number resulting in the minimum energy consumption. However, this approach is not suitable for a large number of tasks. Instead, we apply hill-climbing to search for the best solution, using Formula (5.6) as the starting estimate of the optimal number of active processors. This approach has much lower time complexity than the brute-force approach, and our experiments show that the solutions obtained by this approach are very close to those obtained by the brute-force approach.
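The hill-climbing search seeded by Formula (5.6) can be sketched as follows. This is an illustration only: `measured_energy` is a hypothetical stand-in for the energy that the real mapping and speed-scheduling step would report (here the ideal model plus a made-up per-processor overhead), and all parameter values are assumptions.

```python
def ideal_energy(x, W, D, c0, c1):
    """e_X(x) under the ideal processor model with alpha = 3."""
    return x * c0 * D + c1 * W**3 / (x**2 * D**2)


def hill_climb(energy, start, lo, hi):
    """Move to the better neighbour until no neighbour improves."""
    x = min(max(start, lo), hi)
    while True:
        neighbours = [n for n in (x - 1, x + 1) if lo <= n <= hi]
        best = min(neighbours, key=energy, default=x)
        if energy(best) >= energy(x):
            return x
        x = best


# Hypothetical parameters: N processors available.
W, D, c0, c1, N = 10.0, 1.0, 1.0, 0.5, 16


def measured_energy(x):
    # Stand-in for the schedule's real energy: the ideal model plus an
    # assumed overhead of 2.0 energy units per active processor.
    return ideal_energy(x, W, D, c0, c1) + 2.0 * x


start = round((2 * c1 / c0) ** (1 / 3) * W / D)   # Formula (5.6)
found = hill_climb(measured_energy, start, 1, N)
brute = min(range(1, N + 1), key=measured_energy)
print(start, found, brute)
```

In this toy setting the search starts at the estimate of 10 processors and stops at 7, which is also what the exhaustive search finds. Hill-climbing can stop at a local minimum when the energy curve is not unimodal, which is consistent with the text reporting results close to, rather than identical to, brute force.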
Y-mapping is very similar to scheduling linear task graphs if we treat each level in general
task graphs as a task in linear task graphs. However, Y-mapping cannot be performed alone
because it requires knowledge from performing X-mapping. Next, we will describe the details
of Y-mapping.
Suppose that the total number of levels is $L$. Let the vector-valued function $E_i(t) = [e, q, j, d]$ ($i \le j \le L$ and $0 < d \le t$) denote the scheduling of the tasks on level $i$ to level $L$ given an allotted time $t$ (end-to-end delay from level $i$ to level $L$). In this scheduling, $e$ denotes the energy consumption of the tasks on level $i$ to level $L$, $q$ denotes the number of stages needed for the tasks on level $i$ to level $L$, $j$ indicates that levels $i, i+1, \ldots, j$ are mapped to the first stage of the $q$ stages, and $d$ is the time allotted to that stage (including the delay resulting from communication between that stage and the next stage), while level $j+1$ to level $L$ are mapped to the subsequent $q-1$ stages and the mapping information can be recursively obtained from $E_{j+1}(t-d)$. We can see that the definition of $E_i(\cdot)$ is very similar to that in Section 5.3.1.2 if we associate levels for general task graphs with tasks for linear task graphs. The scheduling algorithm, Scheduling2D, for general task graphs is shown in Algorithm 5.3. Scheduling2D can be regarded as an extension of Scheduling1D (Algorithm 5.2). The main difference between these two algorithms is the scheduling for the first stage of the $q$ stages.
Footnote 2: All these techniques take into consideration communication among tasks.
Algorithm 5.3 Scheduling2D($\epsilon$)
 1: use topological sort to assign a level to each task
 2: {L is the total number of levels}
 3: $E_{L+1} := \{(0, 0)\}$
 4: for i := L downto 1 do
 5:   {compute $E_i$}
 6:   for j := i to L do
 7:     compute W as the sum of cycles of the tasks on level i through level j
 8:     compute m as the maximum degree of parallelism on level i through level j
 9:     $I := \frac{W}{m f_M}$
10:     for k := 1 to $\lfloor T/I \rfloor$ do
11:       $d := kI$
12:       $\hat{E}_{i,j,k} := \mathrm{XMAP}(i, j, d) +_e (d +_t E_{j+1})$
13:     end for
14:   end for
15:   $E_i := \bigcup_{j=i,\ldots,L,\; k=1,\ldots,\lfloor T/I \rfloor} \hat{E}_{i,j,k}$
16:   $E_i := \mathrm{TRIM}(E_i, (1+\epsilon)^{1/L} - 1)$
17: end for
Algorithm 5.4 XMAP(i, j, d)
1: use formula (5.6) to estimate number of processors
2: use an algorithm from [24, 45, 5] to perform mapping and speed scheduling for the tasks
on level i to level j subject to real-time constraint d
3: use hill-climbing to search for a better solution
4: return the minimum energy consumption
For linear task graphs, because a stage corresponds to a single processor and all tasks in the same stage have the same speed, the time allotted to the first stage only has M possibilities, each corresponding to one of the available M discrete speeds. However, for general task graphs, the scheduling for the first stage is a case of X-mapping in which multiple processors may be used and different tasks may have different speeds. Enumerating every possible number of processors and every possible speed for each task would result in an exponential number of schedules for the first stage. Note that not all such schedules are useful: if a schedule consumes more energy and uses more time than another schedule, then the former is useless.
Our approach is to use the allotted time for the first stage directly to decide the schedule. For any given allotted time, X-mapping will attempt to find the schedule with minimum energy consumption. We need to discretize the allotted time to make the scheduling for the first stage tractable. We use a heuristic to choose the discretization interval. For a given stage, let $W$ be the sum of the cycles of all tasks in this stage and $m$ be the maximum degree of parallelism for this stage (equal to the maximum number of tasks on any level that is mapped to this stage). We use the discretization interval $\frac{W}{m f_M}$, which can be regarded as the minimum possible allotted time for this stage.
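The discretization heuristic corresponds to Lines 9 to 11 of Algorithm 5.3 and can be sketched as below. The function name and the numeric values are illustrative assumptions.

```python
import math


def candidate_allotments(W, m, f_max, T):
    """Candidate allotted times d = k * I for one stage, where
    I = W / (m * f_max) and k = 1 .. floor(T / I)."""
    interval = W / (m * f_max)
    count = math.floor(T / interval)
    return [k * interval for k in range(1, count + 1)]


# Hypothetical stage: 12 cycles in total, at most 3 tasks in parallel,
# maximum frequency 2 cycles per time unit, period T = 7 time units.
candidates = candidate_allotments(W=12.0, m=3, f_max=2.0, T=7.0)
print(candidates)
```

Each candidate d is then passed to XMAP, so the number of XMAP calls per (i, j) pair is bounded by floor(T / I) rather than by the exponential number of per-task speed combinations.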
5.3.3 Scheduling General Task Graphs with Stochastic Workload
In this section, we consider the problem of STREAM-MP-S-TG, that is, scheduling general task graphs with stochastic workload. In the presence of stochastic workload, the objective becomes minimizing the expected energy consumption. Dynamic slack reclamation has been shown to be indispensable in dealing with stochastic workload. For uniprocessor systems, dynamic slack is reclaimed across tasks on the same processor. In the case of multiprocessor systems, dynamic slack is also reclaimed across processors. Our solution to STREAM-MP-S-TG is an extension of the Scheduling2D algorithm, which we call the SScheduling2D (Stochastic Scheduling2D) algorithm. Next, we describe the offline part and the online part of SScheduling2D in succession.
5.3.3.1 The Offline Part of SScheduling2D As in the case of uniprocessor systems, one is tempted to extend the offline part of Scheduling2D by incorporating dynamic slack reclamation and comparing expected energy among different mappings and speed schedules. More specifically, the function XMAP in the Scheduling2D algorithm would compute the expected energy consumption of a particular stage and the function Ei would denote the optimal expected energy consumption of the tasks from level i to level L. An important assumption of Scheduling2D is that Ei is agnostic of tasks on levels before level i and its value is only dependent on the time allotted to level i to level L. Unfortunately, dynamic slack reclamation across processors makes this assumption invalid. This is best demonstrated by the following example.
Suppose that we have a linear task graph that consists of only two tasks, τ1 and τ2. The communication cost is ignored. The requests arrive every 2 time units (i.e., T = 2) and the response time requirement is 4 time units (i.e., D = 4). Two stages (i.e., two processors) are used to execute this task graph. Task τ1 is executed on Processor 1 and may execute for either 1 or 2 time units. Task τ2 is executed on Processor 2 and has a constant workload. A possible execution scenario (Figure 22) is as follows.
1. At time 0, the first request arrives. Task τ1 starts to execute on Processor 1.
2. At time 1, Task τ1 finishes and produces 1 time unit of slack, which is reclaimed by Processor 2 for Task τ2. Task τ2 starts to execute on Processor 2 using a speed that makes it finish in 3 time units.
3. At time 2, the second request arrives. Task τ1 starts to execute on Processor 1.
4. At time 3, Task τ1 finishes and again produces 1 time unit of slack. However, Task τ2 is still processing the first request and Processor 2 cannot reclaim this slack.
5. At time 4, Task τ2 finishes processing the first request and starts to process the second request, which it must complete in 2 time units.
Figure 22: An execution scenario (timeline from time 0 to 6 showing the 1st and 2nd requests on Processor 1 and Processor 2).
In the above scenario, it takes the same amount of time (1 time unit) for Task τ1 to
process both requests, leaving the same amount of time (3 time units) for Task τ2 to process
both requests. However, Task τ2 uses different energy for these two requests because it uses
3 time units to process the first request and only 2 time units to process the second.
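The scenario can be replayed with a few lines of code. The sketch below assumes, as in the example, that τ2 greedily stretches each request to its deadline; the energy comparison additionally assumes a purely dynamic cubic power function and a hypothetical constant workload of 2 cycles for τ2, neither of which is specified in the example.

```python
def tau2_windows(tau1_times, T, D):
    """Time available to tau2 for each request when tau2 greedily
    stretches every request up to its deadline."""
    windows = []
    p2_free = 0
    for k, exec1 in enumerate(tau1_times):
        arrival = k * T
        ready = arrival + exec1        # tau1's output becomes available
        start = max(ready, p2_free)    # Processor 2 may still be busy
        finish = arrival + D           # stretch to the request's deadline
        windows.append(finish - start)
        p2_free = finish
    return windows


def dynamic_energy(cycles, time, c1=1.0):
    """Energy of running `cycles` at constant speed cycles/time with
    dynamic power c1 * f**3 (assumed model, static power ignored)."""
    return c1 * cycles**3 / time**2


windows = tau2_windows([1, 1], T=2, D=4)
energies = [dynamic_energy(2.0, w) for w in windows]
print(windows, energies)
```

Both requests leave τ1 after 1 time unit, yet τ2 gets 3 time units for the first request and only 2 for the second, so its energy differs between the two requests. This is the dependence on earlier stages that breaks the assumption behind Ei.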
Another difficulty of the aforementioned extension is computing the expected energy consumption in function XMAP. Because XMAP is in the inner loop of Scheduling2D, it requires low time complexity. However, other than enumerating all possible combinations of speeds for all the tasks in a particular stage, we do not know of any good low-complexity algorithm to accomplish this task.
Based on the above analysis, we can see that it is difficult, and not clear how, to incorporate dynamic slack reclamation into the offline part of Scheduling2D to compute the expected energy consumption. Thus, we choose to keep the offline part of Scheduling2D intact in SScheduling2D except for two simple extensions.
1. We use the average number of execution cycles instead of the worst-case number of execution cycles in the mapping phase of XMAP. The idea is to do the load balancing in a stage based on the average load of tasks, in the hope of minimizing the average (expected) energy consumption.
2. We store the start time (relative to the arrival time of a request) of each task for the online part of SScheduling2D to compute dynamic slack. Note that the start time of each task is also the latest start time because we still use the worst-case number of execution cycles to do the speed scheduling.
These two simple extensions do not add any time complexity, and we can still guarantee that the application will satisfy the throughput and response time requirements in the worst case.
5.3.3.2 The Online Part of SScheduling2D In the online part of SScheduling2D, we employ dynamic slack reclamation and distribute the slack in a greedy fashion, that is, we give all the slack to the tasks available to execute. More specifically, when a task is ready to execute, we take the difference between the latest start time computed in the offline part and the current time as the dynamic slack given to this task. We apply the best intra-task DVS scheme, PPACE, to execute a task.
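The greedy slack computation can be sketched as follows. This is an illustration under stated assumptions: the function names and numbers are hypothetical, and the speed choice shown (the lowest discrete frequency that still fits the worst case in the enlarged budget) is a simple stand-in, not the PPACE intra-task scheme, which is not reproduced here.

```python
def dynamic_slack(now, arrival, latest_start_rel):
    """Slack = latest start time (stored offline, relative to the
    request's arrival) minus the time the task actually becomes ready."""
    return max(0.0, arrival + latest_start_rel - now)


def pick_speed(wc_cycles, offline_time, slack, freqs):
    """Stand-in for the intra-task DVS scheme: lowest frequency that
    finishes the worst-case cycles within offline_time + slack."""
    budget = offline_time + slack
    for f in sorted(freqs):
        if wc_cycles / f <= budget:
            return f
    return max(freqs)


# Hypothetical task: request arrived at time 10, latest start is 3 time
# units after arrival, but the task is already ready at time 11.
slack = dynamic_slack(now=11.0, arrival=10.0, latest_start_rel=3.0)
speed = pick_speed(wc_cycles=4.0, offline_time=2.0, slack=slack,
                   freqs=[1.0, 2.0, 4.0])
print(slack, speed)
```

Because the task never starts later than its stored latest start time and its budget is extended only by slack that has already materialized, the worst-case finish time computed offline is preserved.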
5.3.4 Experimental Results
In this section, we evaluate our proposed scheduling algorithms, Scheduling2D and SSchedul-
ing2D, through simulations. We first compare Scheduling2D against the previously existing
algorithms for the problems similar to STREAM-MP-D-TG since no other work has proposed
a solution to STREAM-MP-D-TG. We then compare SScheduling2D against Scheduling2D
under stochastic workload. Multiple task graphs and different power models were employed
in the evaluation, as follows.
Power models: For the processor model in the experiments, we used Intel XScale [64]. As in the evaluation of the MS and SMS schemes, the static power of the processor was varied to reflect the percentages of the static power in total power being 22%, 44%, and 67% for the 70nm, 50nm, and 35nm technologies, respectively [21]. Different values of static power result in different processor power models. For communication cost, we used a transmission rate of 20 Gbytes/s, and the transmission power is set to 20% of the maximum processor power when the communication link is fully utilized [59].
Evaluation of Scheduling2D Both synthetic and real-world task graphs were used in the experiments. The synthetic task graphs are from TGFF and the real-world task graph is ATR, as described in Section 3.6. The X-mapping in our Scheduling2D algorithm is based on the S-SPM algorithm [45] because of its low time complexity and reasonably good performance. Also, we set the parameter ε to 0.05. We chose the latest work [5] on a subproblem of STREAM-MP-D-TG, which assumes that the period is equal to the deadline, as the baseline against which Scheduling2D is compared. A convex programming based approach (see footnote 3) was used in [5] to obtain the execution speed for each task given the task mapping. Since it does not consider turning processors on and off, we enhanced it by trying every possible number of processors to find the minimum energy consumption. We used the classical earliest task first (ETF) list scheduling heuristic [70] to perform the task mapping.
We compare the two algorithms, baseline as described above and Scheduling2D, for dif-
ferent power models, different deadline constraints, and different throughput requirements.
The values of period T and deadline D are generated similarly as in the case of evaluating
the MS and SMS schemes. For the period T of synthetic task graphs, we chose 20 values
distributed evenly between the shortest possible execution time (time to execute the critical
path of the task graph using the maximum speed) and half of the time to execute the critical
path of the task graph using the minimum speed. For the deadline D of the task graphs,
we chose 20 values distributed evenly between twice the minimum possible execution time
and the maximum possible execution time. Thus, we have 20 × 20 = 400 combinations of
T and D. However, we only experimented with those combinations for which 2T ≤ D since
otherwise Scheduling2D is the same as the baseline. For ATR, the QoS requirements in its
documentation are 2-33ms for the period and 2-1000ms for the deadline. In our experiments,
we used values guided by this range and the processor model used: the period range is ap-
proximately 3ms-9ms and the deadline goes up to 18ms. Note that the baseline algorithm
takes T as its deadline constraint because T < D.
Footnote 3: In some of the experiments, the convex program solver (called ipopt) that we used could not produce a solution due to its iteration limit. When that happened, we simply used S-SPM [45] in its place.
Table 7: Energy savings (%) of Scheduling2D over baseline

                        |            70nm             |            50nm             |            35nm
Task graph              | #CPUs  #stages  avg.   max  | #CPUs  #stages  avg.   max  | #CPUs  #stages  avg.   max
k series parallel       | 2-8    1-5      18.44  46.62 | 1-5    1-3      9.05   43.47 | 1-5    1-3      6.42   49.74
creds1                  | 1-4    1-4      23.12  53.23 | 1-3    1-3      14.06  55.16 | 1-3    1-3      12.99  60.16
simple                  | 1-5    1-4      16.26  41.34 | 1-3    1-3      6.93   32.47 | 1-3    1-3      4.62   32.9
kbasic tables           | 1-6    1-4      17.89  37.86 | 1-4    1-3      10.42  26.19 | 1-3    1-3      7.56   25.61
k series parallel xover | 2-6    1-4      18.5   46.54 | 1-4    1-3      9.38   39.94 | 1-3    1-3      7.39   39.97
bugtest                 | 3-11   1-4      18.87  38.22 | 2-8    1-3      9.1    23.36 | 2-6    1-3      6.74   27.79
kbasic task             | 2-10   1-5      20.29  43.22 | 1-7    1-5      11.36  37.25 | 1-6    1-4      9.37   38.62
kextended               | 1-10   1-5      17.74  35.32 | 1-6    1-3      8.91   27.44 | 1-5    1-3      6.86   29.3
packets                 | 1-4    1-2      18.6   40.38 | 1-4    1-2      8.55   30.65 | 1-3    1-2      6.1    30.98
ATR                     | 2-7    1-3      14.98  35.66 | 1-4    1-2      6.28   17.95 | 1-3    1-3      3.42   9.7
Table 7 shows the energy savings of Scheduling2D over the baseline for all experiments. Scheduling2D achieves up to 53.2%, 55.1%, and 60% savings for the 70nm, 50nm, and 35nm technologies, respectively. On average, Scheduling2D saves 18.5%, 9.4%, and 7.1% for the 70nm, 50nm, and 35nm technologies, respectively. In general, it can be observed that as the static power increases, the average energy saving obtained by Scheduling2D decreases. This is because high static power forces both algorithms to use fewer processors (for Scheduling2D, this translates to fewer pipeline stages), and thus the room for optimization is reduced.
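The summary figures quoted above can be recomputed from the per-graph columns of Table 7. The sketch below uses the average and maximum savings as read from the table, in row order; the variable names are illustrative.

```python
from statistics import mean

# (avg, max) energy savings in % per task graph, transcribed from Table 7.
table7 = {
    "70nm": [(18.44, 46.62), (23.12, 53.23), (16.26, 41.34), (17.89, 37.86),
             (18.5, 46.54), (18.87, 38.22), (20.29, 43.22), (17.74, 35.32),
             (18.6, 40.38), (14.98, 35.66)],
    "50nm": [(9.05, 43.47), (14.06, 55.16), (6.93, 32.47), (10.42, 26.19),
             (9.38, 39.94), (9.1, 23.36), (11.36, 37.25), (8.91, 27.44),
             (8.55, 30.65), (6.28, 17.95)],
    "35nm": [(6.42, 49.74), (12.99, 60.16), (4.62, 32.9), (7.56, 25.61),
             (7.39, 39.97), (6.74, 27.79), (9.37, 38.62), (6.86, 29.3),
             (6.1, 30.98), (3.42, 9.7)],
}

# Mean of the per-graph averages and overall maximum, per technology.
summary = {
    tech: (round(mean(avg for avg, _ in rows), 1),
           max(mx for _, mx in rows))
    for tech, rows in table7.items()
}
print(summary)
```

The means of the per-graph averages round to 18.5%, 9.4%, and 7.1%, and the maxima are 53.23%, 55.16%, and 60.16%, in line with the figures quoted in the text.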
Figure 23: Energy savings for 70nm technology on k series parallel xover (axes: period (msec), deadline (msec), energy saving (%)).
We show the effect of throughput and response time requirements on the energy savings of Scheduling2D over the baseline using the task graph k series parallel xover from TGFF (the results for other task graphs are similar). From Figure 23, we can see that as the period increases, the energy saving tends to decrease (although some local increases appear along the way because of the discrete speeds). An increasing period means a smaller workload per time unit. Thus, both algorithms tend to use lower speeds and there is less room for optimization. When the period increases to the point where pipelining is not needed (i.e., Scheduling2D uses only one stage), Scheduling2D essentially acts as the baseline and the energy saving is reduced to zero. We can also see that, for a given period, increasing the deadline initially results in increased energy savings. This is because a longer deadline allows Scheduling2D to use more pipeline stages to lower the energy consumption of the streaming application. However, as stated in Section 5.3.1.1, static power affects the upper bound of the response time and thus the upper bound of the number of pipeline stages. Therefore, we can observe from Figure 23 that continuing to increase the deadline beyond a certain point does not increase the energy saving, since the number of pipeline stages stays unchanged.
Evaluation of SScheduling2D. We evaluated SScheduling2D through comparison against
Scheduling2D on stochastic workload. Note that Scheduling2D is designed for deterministic
workload. Thus, when it is applied to stochastic workload, the speed of each task is based
on the assumption that each task runs for the worst-case number of execution cycles. For
the stochastic workload, we used only synthetic task graphs in the experiments, since the
real-world task graph ATR has only deterministic workload. The task
cycle distribution is the uniform distribution described in Section 3.5.
Table 8 shows the energy savings of SScheduling2D over Scheduling2D for all exper-
iments. SScheduling2D achieves up to 36.67%, 37.64%, 19.17% savings for 70nm, 50nm,
35nm technologies, respectively. On average, SScheduling2D saves 9.03%, 12.38%, 6.75% for
70nm, 50nm, 35nm technologies, respectively. An interesting result is that energy savings do
not necessarily decrease as the static power increases. This is because these two algorithms
use the same number of processors in most test cases and thus result in the same static
energy consumption. The main source of the energy savings of SScheduling2D is dynamic
energy consumption.
Table 8: Energy savings (%) of SScheduling2D over Scheduling2D
                                      70nm                         50nm                         35nm
Task graph               #CPUs #stages  avg.   max    #CPUs #stages  avg.   max    #CPUs #stages  avg.   max
k series parallel        2-8    1-4    9.3   24.47    1-5    1-4   12.15  30.2     1-4    1-3    6.97  15.41
creds1                   1-6    1-4    8.95  36.67    1-3    1-3   12.34  37.64    1-3    1-3    5.74  17.84
simple                   1-5    1-4    8.19  36.23    1-3    1-3   11.43  30.63    1-3    1-3    5.86  19.17
kbasic tables            2-6    1-4    9.26  27.71    1-4    1-3   13.11  27.98    1-3    1-3    6.86  13.81
k series parallel xover  2-6    1-4    8.65  24.29    1-4    1-3   12.99  28.19    1-3    1-3    6.76  15.8
bugtest                  4-11   2-4    9.95  16.95    2-9    1-4   13.19  23.18    2-6    1-3    8.78  14.94
kbasic task              2-12   1-5    8.95  34.29    1-7    1-4   12.57  29.28    1-6    1-4    7.06  16.75
kextended                2-9    1-4    8.32  24.44    1-6    1-3   12.02  26.77    1-4    1-3    6.94  18.34
packets                  1-6    1-2    9.75  33.83    1-4    1-2   11.67  34.13    1-4    1-2    5.8   17.58
5.4 SUMMARY
In this chapter, we investigated energy-aware multiprocessor scheduling problems for stream-
ing applications. In multiprocessor scheduling, we need to consider two issues that do not
exist in uniprocessor scheduling. The first issue is how to find a balance between static and
dynamic energy consumption because of the availability of multiple processors. The second
issue is how to exploit the difference between the two QoS requirements, namely, throughput
and response time. If we collapsed the two QoS requirements into one, as in the case for
uniprocessor scheduling, we would lose the opportunity for energy optimization. In fact,
addressing these two issues is the key to energy-aware multiprocessor scheduling problems.
We started out by investigating scheduling a streaming application represented by a
single task. Although there is no parallelism that can be exploited inside the application, we
proposed a master-slave scheme that executes different instances of the streaming application
on different processors to satisfy the two QoS requirements while attempting to minimize
the total energy consumption. The key to the master-slave scheme is how to find the optimal
number of active processors to execute the streaming application. We derived the formula
and algorithm under different workloads and different processor models.
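The trade-off behind choosing the number of active processors can be illustrated with a toy model. The sketch below is not the formula derived in this chapter. It assumes round-robin dispatch over n always-on processors, a cubic dynamic power model, a constant static power per processor, and no upper limit on speed. All names (`energy_per_instance`, `best_processor_count`, `p_static`) are hypothetical.

```python
def energy_per_instance(n, cycles, period, deadline, p_static):
    """Energy charged to one instance in a toy master-slave model.

    Assumptions (not from the dissertation): instances are dispatched
    round-robin to n processors, so each instance may occupy its processor
    for min(deadline, n * period) time units; dynamic power is f**3; every
    one of the n processors draws p_static for the whole period.
    """
    budget = min(deadline, n * period)
    speed = cycles / budget              # slowest speed meeting both QoS bounds
    dynamic = cycles * speed ** 2        # power f^3 means energy f^2 per cycle
    static = n * p_static * period       # all n processors stay on
    return dynamic + static


def best_processor_count(cycles, period, deadline, p_static, n_max=64):
    """Exhaustively pick the processor count with the lowest toy energy."""
    return min(
        range(1, n_max + 1),
        key=lambda n: energy_per_instance(n, cycles, period, deadline, p_static),
    )


if __name__ == "__main__":
    for p_static in (0.5, 10.0, 50.0):
        n = best_processor_count(cycles=4, period=1.0, deadline=4.0, p_static=p_static)
        print(f"p_static={p_static}: {n} processor(s)")
```

In this toy setting the preferred processor count shrinks as the static power grows, which mirrors the qualitative behavior reported in the experiments. The exhaustive search is used only because the model is tiny.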
We then turned to scheduling a streaming application represented by a task graph. For
deterministic workload, we proposed an algorithm called Scheduling2D that exploits the
difference of the two QoS requirements to perform processor allocation, task mapping, and
task speed scheduling simultaneously. Scheduling2D uses parallel processing and pipelining
in the task mapping, as traditional algorithms for maximizing throughput or minimizing
latency do. However, Scheduling2D focuses on finding the appropriate number of processors
and allotting the optimal amount of time to each pipeline stage and each task in order to
save energy. For stochastic workload, we extended Scheduling2D and proposed an algorithm
called SScheduling2D that makes use of the dynamic slack reclamation technique to minimize
the expected energy consumption.
6.0 CONCLUSIONS
While traditionally performance has been the major concern for streaming applications, we
are now witnessing a focus shift from pursuing maximum performance to energy-performance
trade-off. This is because for battery-powered systems, increased energy consumption shortens
operation time, and for high-end servers, increased energy consumption generates an excessive
amount of heat and reduces system reliability. Since streaming applications usually
operate for long periods of time, energy optimization is especially important. Even a small
improvement in scheduling algorithms can translate into significant energy savings.
This dissertation addresses the problem of scheduling a streaming application with the goal
of minimizing its energy consumption while satisfying two typical QoS requirements, namely,
throughput and response time. An important feature of this dissertation is a complete treatment
of the problem by taking into account different underlying platforms, different characteristics
of workload, and different types of task graphs. Furthermore, both an ideal and a realistic
processor power model are considered. Although the ultimate goal is to obtain scheduling
algorithms under the realistic model, considering the ideal model has proven to be very help-
ful (e.g., in the derivation of the PITDVS2 scheme). The easy mathematical manipulation
of the ideal model often leads to elegant and optimal scheduling algorithms, which give great
insight into the problem and provide the basis for designing practical scheduling algorithms.
In energy-aware scheduling of streaming applications on uniprocessor systems, the two
QoS requirements essentially collapse into one and the problem is very closely related to
energy-aware scheduling for frame-based hard real-time systems. One of the contributions
of this dissertation to the state of the art in energy aware real-time scheduling theory is to
show that simple patches to optimal scheduling algorithms obtained under simple models
(e.g., the ideal model) do not necessarily generate scheduling algorithms that perform well
in practice. This is demonstrated multiple times throughout Chapter 4.
1. For frame-based systems with a single task, previously existing DVS schemes, such as
PACE and GRACE, patch the solution obtained under the ideal model to comply with
the realistic model. I propose a new DVS scheme called PPACE that is based directly on
the realistic model. PPACE can give performance guarantees and achieve energy savings
very close to the optimal solution. Experimental results show that PPACE outperforms
the existing schemes significantly.
2. For frame-based systems with multiple tasks, the PGOPDVS scheme, which is based
on the optimal hybrid DVS scheme under the ideal model, achieves worse performance
than all other DVS schemes under consideration in most of the experiments. Another
example is the PITDVS scheme, which is based on the optimal inter-task DVS scheme
under the ideal model. PITDVS is not necessarily better than the DVS schemes that do
not use probabilistic information of the workload (e.g., the Greedy scheme). However,
the PITDVS2 scheme, which is based on PITDVS and uses two speeds to emulate a
continuous speed, is shown to outperform the existing DVS schemes in the experiments.
The moral of the story is that any DVS scheme obtained through patching the solution under the
ideal model needs to be justified by comparing against the optimal schemes, which can be
obtained using our proposed unified approach. The unified approach is the culmination of the
investigation on uniprocessor scheduling. It derives optimal (or provably close to optimal)
stochastic inter-task, intra-task, and hybrid DVS schemes under the realistic model.
In energy-aware scheduling of streaming applications on multiprocessor systems, the
two QoS requirements need to be distinguished in order to be satisfied as well as to save
more energy. For scheduling a singleton task graph on multiprocessor systems, I propose a
master-slave scheme that executes different instances of the streaming application on different
processors. The purpose of using multiple processors in the master-slave scheme is twofold.
One is to satisfy the throughput requirement and the other is to reduce energy consumption.
Our experimental results show that beyond 50 nm technology, using multiple processors will
not save energy compared with using a single processor because of the high static power. This
implies that a streaming application needs to expose its parallelism to scheduling algorithms
to exploit the opportunity to save energy, which leads to our investigation of scheduling
a task graph on multiprocessor systems.
To the best of my knowledge, this dissertation is the first work to consider both throughput
and response time constraints in energy-aware scheduling of task graphs, while there
has been much research considering only the response time constraint prior to this work. My
main contribution is a novel scheduling algorithm called Scheduling2D that exploits the dif-
ference of the two QoS requirements to perform processor allocation, task mapping, and
task speed scheduling simultaneously. The design of Scheduling2D emphasizes the use of
multiple processors as an energy reduction technique because of the fundamental trade-off
between dynamic and static energy consumption. The derivation of Scheduling2D shows
that the static power of processors has an important impact on the scheduling of streaming
applications. Specifically, the static power imposes an upper bound on the response time
of the streaming application to be scheduled and high static power could lead to servicing
requests faster than the response time requirement in order to save energy. This is contrary
to the common DVS wisdom of slowing down task execution as much as possible for just-
in-time completion. This important insight is also confirmed by the experimental results on
Scheduling2D.
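The effect of static power on the useful range of speeds can be made concrete with a short derivation. The following is a sketch under an assumed power model, $p(f) = f^3 + P_s$. The constant static power $P_s$ and the cycle count $W$ are symbols introduced here for illustration only, and this simplified model is not the realistic processor model used in the dissertation.

```latex
\begin{align*}
E(f) &= \frac{W}{f}\,\bigl(f^3 + P_s\bigr) = W\Bigl(f^2 + \frac{P_s}{f}\Bigr), \\
\frac{dE}{df} &= W\Bigl(2f - \frac{P_s}{f^2}\Bigr) = 0
  \;\Longrightarrow\; f^{*} = \Bigl(\frac{P_s}{2}\Bigr)^{1/3}.
\end{align*}
```

Under this assumption, running below $f^{*}$ increases the energy per request, so the execution time worth using is at most $W / f^{*}$, and this bound shrinks as $P_s$ grows. This is consistent with the observation that high static power can make it worthwhile to finish earlier than the response time requirement.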
7.0 FUTURE WORK
Energy-aware scheduling for streaming applications is an exciting research area and the
problems we considered in this dissertation can be extended in various directions. Next, I
elaborate on the research avenues that I consider promising for future work in this area.
I obtained scheduling algorithms under the realistic processor model for all the problems
considered. I believe that the power portion in the realistic model is very realistic in the
sense that each task has individual power consumption for each discrete frequency. However,
for the timing portion in the realistic model, I made an assumption that the data processing
of a task is performed on fast local memory. Thus, each task is CPU-bound and its execution
time is inversely proportional to its operating frequency. This greatly simplifies computing
the execution time of a task in deriving scheduling algorithms. It is worthwhile to do research
on the case where a task is I/O-bound (for example, the task needs to reference a lot of data
residing on disk). To do that, one needs to strengthen the timing portion of the realistic
model and revisit all the problems.
Energy-aware scheduling of streaming applications on uniprocessor systems is relatively
well understood. However, there is much room left for research in scheduling on multipro-
cessor systems. There are a number of directions for future work.
1. Morphing task graphs into a two-dimensional structure is very important for the Schedul-
ing2D algorithm. Currently, a simple topological sort is employed for this step. Better
heuristics are expected to be obtained by considering the computational requirement
(i.e., cycle count) of each task.
2. For stochastic workload, I have shown in Section 5.3.3 the difficulty of incorporating
dynamic slack reclamation into the offline part of Scheduling2D to compute the expected
energy consumption. The resulting solution for stochastic workload, SScheduling2D, is
only a simple extension of Scheduling2D. Although the SScheduling2D algorithm can still
guarantee the QoS requirements, it is not clear how far the resulting energy consumption
is from the optimal solution. Whether or not one can directly optimize the expected
energy consumption in the offline part of the algorithm needs further research.
3. Because this dissertation focuses on energy reduction, I only consider memory power
consumption and implicitly assume unlimited memory size, which does not hold, especially
for on-chip local memory or embedded systems that have stringent memory constraints.
Adding memory size as an additional constraint to the scheduling problems is of significant
interest. I envision that the X-mapping of the Scheduling2D algorithm is the very place
to deal with memory constraints.
4. All multiprocessor scheduling algorithms in this dissertation assume an unlimited number
of processors in the system based on the trend that more and more processor cores
are available on chip multiprocessors. However, if multiple streaming applications are
executed in the same system, we may be constrained by the number of processors. Given
the already complex nature of multiprocessor scheduling problems, dealing with this
constraint is expected to be a challenging problem.
Finally, the problems considered in this dissertation can be expanded by taking into
account Dynamic Power Management (DPM), which attempts to put idle system components
into low-power states (e.g., turning them off) whenever possible. This dissertation only considers
Static Power Management, which means that once a scheduling algorithm decides that a processor
should be on, it will never be turned off while executing streaming applications. Combining
DVS and DPM has been shown to achieve greater energy savings than DVS alone [73, 17].
How to incorporate DPM into the scheduling algorithms in this dissertation (especially for
multiprocessor systems) is an interesting problem for future research.
APPENDIX A
AN ILLUSTRATIVE EXAMPLE OF SPEED ROUNDING EFFECT
In Section 4.3.1, we described how the GRACE and PACE schemes round the continuous
speed obtained from the ideal processor model to comply with the realistic model. In this
appendix, we demonstrate through an illustrative example that speed rounding can have a
significant impact on the quality of the solution. In this example, there is a single task $\tau$
that has 3 cycles and a deadline of 1.84 time units. The processor has 3 discrete frequencies,
namely 1 Hz, 2 Hz, and 3 Hz. The processor power model is $p(f) = f^3$ and the speed change
overhead is zero.
Suppose that the probability function of the execution cycles is $P(1) = 0.83$, $P(2) = 0.05$,
$P(3) = 0.12$. We can compute the cumulative distribution function and obtain $cdf(1) = 0.83$,
$cdf(2) = 0.88$, $cdf(3) = 1$. The expected energy consumption, according to Equation (4.1), is
\[ s_1^2 + 0.17 \times s_2^2 + 0.12 \times s_3^2 \]
and the optimal continuous speed schedule, according to Equation (4.3), is $s_1 = 1.1126$,
$s_2 = 2.0084$, $s_3 = 2.2557$. As described in Section 4.3.1, GRACE rounds a continuous speed
up to the closest higher discrete frequency, while PACE rounds a continuous speed up or
down to the closest discrete frequency. Thus, GRACE will use the speed schedule $s_1 = 2$,
$s_2 = 3$, $s_3 = 3$ to execute the task, resulting in an expected energy consumption of
6.61. However, using the speed schedule $s_1 = 1$, $s_2 = 2$, $s_3 = 3$ gives the optimal
expected energy consumption, which is 2.76. Therefore, GRACE has a relative error
(defined in Section 4.3.4.1) of $\frac{6.61 - 2.76}{2.76} = 139\%$. For PACE, the original speed
schedule is $s_1 = 1$, $s_2 = 2$, $s_3 = 2$, which makes the task miss the deadline, since in the
worst case the task takes time $\frac{1}{1} + \frac{1}{2} + \frac{1}{2} = 2$, which is greater than 1.84.
PACE will perform a linear scan and adjust the speed schedule. The new speed schedule is
$s_1 = 1$, $s_2 = 2$, $s_3 = 3$, which is in fact the optimal speed schedule. Thus, the relative
error of PACE is 0 in this case.
Suppose that the probability function of the execution cycles is changed to $P(1) = 0.96$,
$P(2) = 0.02$, $P(3) = 0.02$. We compute the cumulative distribution function and obtain
$cdf(1) = 0.96$, $cdf(2) = 0.98$, $cdf(3) = 1$. Thus, the expected energy consumption according
to Equation (4.1) is
\[ s_1^2 + 0.04 \times s_2^2 + 0.02 \times s_3^2 \]
and the optimal continuous speed schedule, according to Equation (4.3), is $s_1 = 0.8768$,
$s_2 = 2.5639$, $s_3 = 3.2304$. For PACE, the speed schedule is $s_1 = 1$, $s_2 = 3$, $s_3 = 3$ and the
resulting expected energy consumption is 1.54. However, the optimal speed schedule is still
$s_1 = 1$, $s_2 = 2$, $s_3 = 3$ and the optimal expected energy consumption is 1.34. Therefore,
PACE has a relative error of 15%. For GRACE, the speed schedule is the same as
that of PACE and thus GRACE also has a relative error of 15%.
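The numbers in this appendix can be checked with a short script. The sketch below is a minimal Python illustration, not code from the dissertation. It assumes that Equation (4.1) reduces to the sum of tail probabilities times squared speeds displayed in this appendix, and that the continuous optimum of Equation (4.3) has the closed form $s_i \propto F_i^{-1/3}$ (obtained here from a Lagrange multiplier argument, an assumption that happens to reproduce the quoted speeds). All function names are hypothetical, and PACE's linear-scan repair step is not modeled.

```python
from itertools import product

FREQS = (1.0, 2.0, 3.0)   # discrete frequencies in Hz
DEADLINE = 1.84           # time units


def tail_probs(pmf):
    """F_i = probability that cycle i is executed = 1 - cdf(i - 1)."""
    tails, remaining = [], 1.0
    for p in pmf:
        tails.append(remaining)
        remaining -= p
    return tails


def expected_energy(speeds, pmf):
    """Expected energy with p(f) = f^3, i.e. f^2 energy per executed cycle."""
    return sum(F * s ** 2 for F, s in zip(tail_probs(pmf), speeds))


def continuous_optimum(pmf, deadline=DEADLINE):
    """Assumed closed form: s_i = sum_j F_j^(1/3) / (deadline * F_i^(1/3))."""
    roots = [F ** (1 / 3) for F in tail_probs(pmf)]
    return [sum(roots) / (deadline * r) for r in roots]


def grace_round(speeds):
    """Round each speed up to the next discrete frequency, capped at the top."""
    return [min((f for f in FREQS if f >= s), default=FREQS[-1]) for s in speeds]


def nearest_round(speeds):
    """Round each speed to the closest discrete frequency (PACE's first step)."""
    return [min(FREQS, key=lambda f: abs(f - s)) for s in speeds]


def worst_case_time(speeds):
    """Time needed when every cycle is executed (one cycle per speed)."""
    return sum(1 / s for s in speeds)


def best_discrete(pmf, deadline=DEADLINE):
    """Brute-force the feasible discrete schedule with the least expected energy."""
    feasible = (c for c in product(FREQS, repeat=len(pmf))
                if worst_case_time(c) <= deadline)
    return min(feasible, key=lambda c: expected_energy(c, pmf))


if __name__ == "__main__":
    for pmf in ([0.83, 0.05, 0.12], [0.96, 0.02, 0.02]):
        cont = continuous_optimum(pmf)
        grace, best = grace_round(cont), best_discrete(pmf)
        error = expected_energy(grace, pmf) / expected_energy(best, pmf) - 1
        print(cont, grace, nearest_round(cont), best, f"{error:.0%}")
```

Brute force is feasible here only because there are 27 candidate schedules. It serves as a reference point for the rounded schedules rather than as a practical algorithm.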
APPENDIX B
AN ILLUSTRATIVE EXAMPLE OF DVS SCHEMES
In this appendix, we demonstrate and compare several DVS schemes for general frame-based
systems under the ideal processor model through an illustrative example. In this example,
there are 3 tasks in a frame-based real-time system with a frame length of 14 time units.
The parameters for the 3 tasks are shown in Table 9. The tasks are required to be executed
in the order of $\tau_1$, $\tau_2$, and $\tau_3$. We can also treat the 3 tasks as three sequential
sections of a single task $\hat{\tau}$ whose parameters are computed from those of the 3 tasks.
$\hat{\tau}$ is used for a naive extension of PACE shown at the end of this appendix. The
processor power model is $p(f) = f^3$.
We first look at a simple scheme called the Proportional scheme [48], which distributes the
system slack proportionally among all unexecuted tasks. In this example, the Proportional
scheme will start executing $\tau_1$ using speed $\frac{2+4+2}{14} = 0.5714$. After $\tau_1$ finishes,
the system reclaims the slack created by $\tau_1$ if it runs for less than its WCEC, and computes
the speed of the next task recursively. Suppose that $\tau_1$ only runs for 1 cycle. Then the time
left for executing $\tau_2$ and $\tau_3$ is $14 - \frac{1}{0.5714} = 12.2499$, and speed
$\frac{4+2}{12.2499} = 0.4898$ will be used to execute $\tau_2$, and so forth.
For the OITDVS scheme, the time allocation fractions of $\tau_1$, $\tau_2$, $\tau_3$ are
$\beta_1 = 0.3938$, $\beta_2 = 0.7619$, and $\beta_3 = 1.0$, respectively (refer to Algorithm 4.1 for
how to compute the $\beta$ values). Thus, OITDVS will use speed $\frac{2}{0.3938 \times 14} = 0.3628$
to execute $\tau_1$. If $\tau_1$ runs for 1 cycle, then the time left for executing $\tau_2$ and $\tau_3$
is $14 - \frac{1}{0.3628} = 11.2437$. Then OITDVS will use speed
$\frac{4}{0.7619 \times 11.2437} = 0.4669$ to execute $\tau_2$.
Table 9: The parameters for the 3 tasks in the illustrative example
Task          W    P(1), P(2), ..., P(W)
$\tau_1$      2    .9, .1
$\tau_2$      4    .9, 0, 0, .1
$\tau_3$      2    .5, .5
$\hat{\tau}$  8    0, 0, .405, .45, .045, .045, .05, .005
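The last row of Table 9 can be reproduced from the first three rows. The sketch below is a minimal Python illustration that assumes the cycle counts of the three tasks are statistically independent, an assumption that matches the tabulated values. The names `TASK_PMFS` and `supertask_pmf` are hypothetical.

```python
from collections import defaultdict
from itertools import product

# P(1), P(2), ..., P(W) for each task, copied from Table 9.
TASK_PMFS = [
    [0.9, 0.1],            # tau_1
    [0.9, 0.0, 0.0, 0.1],  # tau_2
    [0.5, 0.5],            # tau_3
]


def supertask_pmf(pmfs):
    """Distribution of the total cycle count, assuming independent tasks."""
    total = defaultdict(float)
    # Each task contributes (cycle_count, probability) pairs.
    for combo in product(*(list(enumerate(p, start=1)) for p in pmfs)):
        cycles = sum(c for c, _ in combo)
        prob = 1.0
        for _, p in combo:
            prob *= p
        total[cycles] += prob
    wcec = sum(len(p) for p in pmfs)  # worst-case cycles of the supertask
    return [total[c] for c in range(1, wcec + 1)]


if __name__ == "__main__":
    print([round(p, 3) for p in supertask_pmf(TASK_PMFS)])
```

Enumerating all combinations is adequate for three small tasks. For longer task chains, convolving the distributions pairwise would avoid the exponential enumeration.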
For the GOPDVS scheme, the time allocation fractions of $\tau_1$, $\tau_2$, $\tau_3$ are
$\beta_{11} = 0.2147$, $\beta_{12} = 0.2207$, $\beta_{21} = 0.2832$, $\beta_{22} = 0.2086$,
$\beta_{23} = 0.2636$, $\beta_{24} = 0.3579$, $\beta_{31} = 0.5575$, and $\beta_{32} = 1.0$,
respectively (refer to Algorithm 4.2 for how to compute the $\beta$ values). Thus, GOPDVS
will use speed $\frac{1}{0.2147 \times 14} = 0.3327$ to execute the 1st cycle of $\tau_1$. If $\tau_1$
has a 2nd cycle, the GOPDVS scheme will use speed
$\frac{1}{0.2207 \times (14 - 14 \times 0.2147)} = 0.4121$ to execute it. Table 10 shows the
expected energy consumption per frame for the DVS schemes.
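The speed computations walked through in this appendix all follow one pattern: a piece of work receives a fraction of the time that is still left in the frame. The sketch below is a minimal Python illustration of that pattern and not the dissertation's implementation. The function names are hypothetical, the $\beta$ values are copied from the text rather than computed by Algorithms 4.1 and 4.2, and writing the Proportional scheme in terms of a $\beta$ fraction is a restatement made here for uniformity.

```python
FRAME = 14.0
WCEC = [2, 4, 2]  # worst-case execution cycles of tau_1, tau_2, tau_3


def speed(cycles, beta, time_left):
    """Speed that spends the fraction beta of time_left on the given cycles."""
    return cycles / (beta * time_left)


def first_two_speeds(beta1, beta2, actual_cycles_tau1=1):
    """Inter-task walk-through: speed of tau_1, time left afterwards, speed of tau_2."""
    s1 = speed(WCEC[0], beta1, FRAME)
    time_left = FRAME - actual_cycles_tau1 / s1
    s2 = speed(WCEC[1], beta2, time_left)
    return s1, time_left, s2


# Proportional: a task's share of the remaining time is its share of remaining WCEC.
proportional = first_two_speeds(WCEC[0] / sum(WCEC), WCEC[1] / sum(WCEC[1:]))

# OITDVS: beta_1 and beta_2 as quoted in the text.
oitdvs = first_two_speeds(0.3938, 0.7619)

# GOPDVS: one beta per cycle; shown for the two cycles of tau_1.
gop_cycle1 = speed(1, 0.2147, FRAME)
gop_cycle2 = speed(1, 0.2207, FRAME - 1 / gop_cycle1)

if __name__ == "__main__":
    print("Proportional:", [round(x, 4) for x in proportional])
    print("OITDVS:      ", [round(x, 4) for x in oitdvs])
    print("GOPDVS:      ", round(gop_cycle1, 4), round(gop_cycle2, 4))
```

Because the script keeps full precision between steps, its intermediate values can differ in the last digit from the figures in the text, which are computed from rounded speeds.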
Table 10: The comparison of the DVS schemes for the illustrative example
Scheme Expected energy consumption per frame
naive PACE 0.7953
Proportional 0.7733
OITDVS 0.6097
GOPDVS 0.5154
Finally, we show through this simple example that a naive extension of PACE (or naive
PACE for short) cannot even obtain energy savings over the DVS schemes that do not use
intra-task DVS. Since PACE has only been studied for a single task, naive PACE is applied
to the supertask $\hat{\tau}$ in Table 9. From Table 10, we can see that naive PACE [40] results in
an expected energy consumption per frame of 0.7953, which is even worse than that of the
Proportional scheme.
APPENDIX C
PROOF OF LEMMA 2
In this appendix, we provide the proof of Lemma 2, which states that for every energy-time
label $l' \in \mathrm{LABEL}'(i, j)$, where $1 \le j \le M$, there exists a label
$l \in \mathrm{LABEL}(i, j)$ such that $l'.e \le l.e \le (1+\delta)^i\, l'.e$ and $l'.t \ge l.t$.

Proof. We prove the claim by induction on $i$. The base case ($i = 0$) is trivially true.

In the inductive step, we assume that the claim holds for $i$. The goal is to find an
energy-time label $\zeta \in \mathrm{LABEL}(i+1, j)$ ($1 \le j \le M$) for every energy-time label
$l'_{i+1} \in \mathrm{LABEL}'(i+1, j)$ such that
\[ l'_{i+1}.e \le \zeta.e \le (1+\delta)^{i+1}\, l'_{i+1}.e \]
and
\[ l'_{i+1}.t \ge \zeta.t . \]

Let $l'_i \in \mathrm{LABEL}'(i, k)$ ($1 \le k \le M$, and $k$ is not necessarily equal to $j$) be the
energy-time label that generates $l'_{i+1}$. Then we have
\[ l'_{i+1} = \Bigl( l'_i.e + PC(i)\,PE(f_k, f_j) + F_i e(f_j),\;\; l'_i.t + PT(f_k, f_j) + \frac{w_i}{f_j} \Bigr). \]

By the induction hypothesis, there is an energy-time label $l_i \in \mathrm{LABEL}(i, k)$ such that
\[ l'_i.e \le l_i.e \le (1+\delta)^i\, l'_i.e \]
and
\[ l'_i.t \ge l_i.t . \]

Let $l_{i+1} = \bigl( l_i.e + PC(i)\,PE(f_k, f_j) + F_i e(f_j),\; l_i.t + PT(f_k, f_j) + \frac{w_i}{f_j} \bigr)$.
Thus, we have
\[ l_{i+1}.e < (1+\delta)^i\, l'_i.e + (1+\delta)^i \bigl( PC(i)\,PE(f_k, f_j) + F_i e(f_j) \bigr) = (1+\delta)^i\, l'_{i+1}.e . \]

There are two possibilities regarding $l_{i+1}$:

1. $l_{i+1} \in \mathrm{LABEL}(i+1, j)$. Then we have
\[ l'_{i+1}.e = l'_i.e + PC(i)\,PE(f_k, f_j) + F_i e(f_j) \le l_i.e + PC(i)\,PE(f_k, f_j) + F_i e(f_j) = l_{i+1}.e \]
and
\begin{align*}
l_{i+1}.e &= l_i.e + PC(i)\,PE(f_k, f_j) + F_i e(f_j) \\
          &\le (1+\delta)^i\, l'_i.e + PC(i)\,PE(f_k, f_j) + F_i e(f_j) \\
          &\le (1+\delta)^i \bigl( l'_i.e + PC(i)\,PE(f_k, f_j) + F_i e(f_j) \bigr) \\
          &= (1+\delta)^i\, l'_{i+1}.e \\
          &< (1+\delta)^{i+1}\, l'_{i+1}.e
\end{align*}
and
\[ l'_{i+1}.t = l'_i.t + PT(f_k, f_j) + \frac{w_i}{f_j} \ge l_i.t + PT(f_k, f_j) + \frac{w_i}{f_j} = l_{i+1}.t . \]
Therefore we let $\zeta = l_{i+1}$.

2. $l_{i+1} \notin \mathrm{LABEL}(i+1, j)$ and it was removed as a result of trimming. Thus there is an
energy-time label $\hat{l} \in \mathrm{LABEL}(i+1, j)$ such that
\[ l_{i+1}.e \le \hat{l}.e \le (1+\delta)\, l_{i+1}.e \]
and
\[ l_{i+1}.t > \hat{l}.t . \]
Therefore we have
\[ l'_{i+1}.e = l'_i.e + PC(i)\,PE(f_k, f_j) + F_i e(f_j) \le l_i.e + PC(i)\,PE(f_k, f_j) + F_i e(f_j) = l_{i+1}.e \le \hat{l}.e \]
and
\[ \hat{l}.e \le (1+\delta)\, l_{i+1}.e < (1+\delta)(1+\delta)^i\, l'_{i+1}.e \le (1+\delta)^{i+1}\, l'_{i+1}.e \]
and
\[ l'_{i+1}.t \ge l_{i+1}.t > \hat{l}.t . \]
Therefore we let $\zeta = \hat{l}$.

Together, the two cases prove the claim.
BIBLIOGRAPHY
[1] Cell broadband engine architecture documentation, 2005.
[2] Intel developer forum, 2006. http://www.intel.com/pressroom/kits/events/idffall 2006
/pdf/idf 09-26-06 paul otellini keynote transcript.pdf.
[3] N. AbouGhazaleh, D. Mossé, B. Childers, R. Melhem, and Matthew Craven. Collaborative
operating system and compiler power management for real-time applications.
In Proc. IEEE Real-Time Embedded Technology and Applications Symposium (RTAS),
May 2003.
[4] B. Agarwalla, N. Ahmed, D. Hilley, and U. Ramachandran. Streamline: A scheduling
heuristic for streaming applications on the grid, 2005.
[5] A. Andrei, M. Schmitz, P. Eles, Z. Peng, and B. Al-Hashimi. Simultaneous communication
and processor voltage scaling for dynamic and leakage energy reduction in time-constrained
systems. In Proc. IEEE International Conference on Computer-Aided Design (ICCAD),
San Jose, CA, 2004.
[6] T. Austin, E. Larson, and D. Ernst. Simplescalar: An infrastructure for computer
system modeling. IEEE Computer, 35(2):59–67, 2002.
[7] H. Aydin. Enhancing Performance and Fault Tolerance in Reward-Based Scheduling.
PhD thesis, University of Pittsburgh, 2001.
[8] H. Aydin, R. Melhem, D. Mossé, and P. Mejia-Alvarez. Dynamic and Aggressive
Scheduling Techniques for Power-Aware Real-Time Systems. In Proc. IEEE Real-Time
Systems Symposium (RTSS), pages 95–105, December 2001.
[9] S. Baruah and J. Anderson. Energy-efficient synthesis of periodic task systems upon
identical multiprocessor platforms. In Proc. IEEE International Conference on Distributed
Computing Systems (ICDCS), pages 428–435, Tokyo, Japan, 2004.
[10] Anne Benoit, Harald Kosch, Veronika Rehn-Sonigo, and Yves Robert. Bi-criteria
pipeline mappings for parallel image processing. In Proc. 8th International Conference
on Computational Science (ICCS), LNCS. Springer Verlag, 2008.
[11] V. Bharadwaj, D. Ghose, and T. G. Robertazzi. Divisible load theory: A new paradigm
for load scheduling in distributed systems. Cluster Computing, 6:7–18, 2003.
[12] Shahid H. Bokhari. Partitioning problems in parallel, pipelined and distributed com-
puting. IEEE Transactions on Computers, 1987.
[13] Shekhar Borkar. Design challenges of technology scaling. IEEE Micro, 19(4), 1999.
[14] D. Brooks, P. Bose, S. Schuster, H. Jacobson, P. Kudva, A. Buyuktosunoglu, J. Well-
man, V. Zyuban, M. Gupta, and P. Cook. Power-aware Microarchitecture: Design and
Modeling Challenges for Next Generation Microprocessors. IEEE Micro, 20(6), 2000.
[15] T. Burd and R. Brodersen. Design issues for Dynamic Voltage Scaling. In Proc. Inter-
national Symposium on Low Power Electronics and Design (ISLPED), June 2000.
[16] J.J. Chen, H.R. Hsu, and T.W. Kuo. Leakage-aware energy-efficient scheduling of real-
time tasks in multiprocessor systems. In Proc. 12th IEEE Real-Time and Embedded
Technology and Applications Symposium, San Jose, CA, 2006.
[17] J.J. Chen and L. Thiele. Energy-efficient scheduling on homogeneous multiprocessor
platforms. In Proc. ACM Symposium on Applied Computing, Sierre, Switzerland, 2010.
[18] J. Cong and K. Gururaj. Energy efficient multiprocessor task scheduling under input-
dependent variation. In Proc. Design, Automation and Test in Europe, Dresden, Ger-
many, 2009.
[19] G. Contreras and M. Martonosi. Power Prediction for Intel XScale Processors Using
Performance Monitoring Unit Events. In Proc. International Symposium on Low Power
Electronics and Design (ISLPED), August 2005.
[20] T. Cormen, C. Leiserson, and R. Rivest. Introduction to Algorithms. The MIT Press,
Cambridge, 1990.
[21] D. Duarte and N. Vijaykrishnan and M. J. Irwin and H-S Kim and G. McFarland.
Impact of scaling on the effectiveness of dynamic power reduction schemes. In Proc.
International Conference on Computer Design (ICCD), 2002.
[22] E.N. Elnozahy, M. Kistler, and R. Rajamony. Energy-efficient server clusters. In Proc.
Workshop on Power-Aware Computer Systems (PACS), 2002.
[23] A. Elyada, R. Ginosar, and U. Weiser. Low-complexity policies for energy-performance
tradeoff in chip-multi-processors. IEEE Transactions on Very Large Scale integration
systems, 16(9), 2008.
[24] F. Gruian and K. Kuchcinski. Lenes: Task scheduling for low-energy systems using
variable supply voltage processors. In Proc. Asia and South Pacific Design Automation
Conference (ASP-DAC), 2001.
[25] M. R. Garey and David S. Johnson. Computers and Intractability: A Guide to the
Theory of NP-Completeness. W. H. Freeman, 1979.
[26] Michael Gordon, William Thies, Michal Karczmarek, Jasper Lin, Ali S. Meli, Christo-
pher Leger, Andrew A. Lamb, Jeremy Wong, Henry Hoffman, David Z. Maze, and
Saman Amarasinghe. A stream compiler for communication-exposed architectures. In
International Conference on Architectural Support for Programming Languages and Op-
erating Systems, San Jose, CA, 2002.
[27] F. Gruian. Hard Real-Time Scheduling for Low-Energy Using Stochastic Data and DVS
Processors. In Proc. International Symposium on Low Power Electronics and Design
(ISLPED), August 2001.
[28] F. Gruian. On Energy Reduction in Hard Real-Time Systems Containing Tasks with
Stochastic Execution Times. In Proc. IEEE Workshop on Power Management for Real-
Time and Embedded Systems, Taipei, Taiwan, May 2001.
[29] F. Gruian and K. Kuchcinski. Uncertainty-Based Scheduling: Energy Efficient Ordering
for Tasks with Variable Execution Time. In Proc. International Symposium on Low
Power Electronics and Design (ISLPED), Seoul, Korea, August 2003.
[30] I. Hong, G. Qu, M. Potkonjak, and M. Srivastava. Synthesis Techniques for Low-Power
Hard Real-Time Systems on Variable Voltage Processors. In Proc. IEEE Real-Time
Systems Symposium (RTSS), Madrid, Spain, December 1998.
[31] I. Ahmad and Y.K. Kwok. On exploiting task duplication in parallel program scheduling.
IEEE Trans. Parallel and Distributed Systems, 9(9), 1998.
[32] T. Ishihara and H. Yasuura. Voltage Scheduling Problem for Dynamically Variable
Voltage Processors. In Proc. International Symposium on Low Power Electronics and
Design (ISLPED), pages 197–202, August 1998.
[33] N. S. Kim, T. Kgil, K. Bowman, V. De, and T. Mudge. Towards power-optimal pipelin-
ing and parallel processing under process variations in nanometer technology. In Proc.
IEEE International Conference on Computer-Aided Design (ICCAD), San Jose, CA,
2005.
[34] W. Kim, D. Shin, H. Yun, J. Kim, and S. Min. Performance Comparison of Dynamic
Voltage Scaling Algorithms for Hard Real-Time Systems. In Proc. IEEE Real-Time
Embedded Technology and Applications Symposium (RTAS), 2002.
[35] S. Krantz, S. Kress, and R. Kress. Jensen’s Inequality. Birkhauser, 1999.
[36] S. Lang. Calculus of Several Variables. Addison-Wesley, 1973.
[37] P.D. Langen and B. Juurlink. Trade-offs between voltage scaling and processor shutdown
for low-energy embedded multiprocessors. Lecture Notes in Computer Science, 4599,
2007.
[38] W.Y. Lee. Energy-saving DVFS scheduling of multiple periodic real-time tasks on
multi-core processors. In Proc. 13th IEEE/ACM International Symposium on Distributed
Simulation and Real Time Applications, Singapore, 2009.
[39] J. Lorch. Operating Systems Techniques for Reducing Processor Energy Consumption.
PhD thesis, University of California at Berkeley, 2001.
[40] J. Lorch and A. Smith. Improving Dynamic Voltage Scaling Algorithms with PACE. In
Proc. ACM SIGMETRICS, June 2001.
[41] J. Lorch and A. Smith. Operating system modifications for task-based speed and voltage
scheduling. In Proc. International Conference on Mobile Systems, Applications, and
Services (MobiSys), May 2003.
[42] J. Lorch and A. Smith. PACE: a New Approach to Dynamic Voltage Scaling. IEEE
Transactions on Computers, 53(7):856–869, 2004.
[43] A. Mahalanobis, B. V. K. Vijaya Kumar, and S.R.F. Sims. Distance-classifier Correla-
tion Filters for Multiclass Target Recognition. Applied Optics, 35, 1996.
[44] Steven Martin, Krisztian Flautner, Trevor Mudge, and David Blaauw. Combined dy-
namic voltage scaling and adaptive body biasing for lower power microprocessors under
dynamic workloads. In Proc. IEEE International Conference on Computer-Aided Design
(ICCAD), 2002.
[45] R. Mishra, N. Rastogi, D. Zhu, D. Mossé, and R. Melhem. Energy aware scheduling
for distributed real-time systems. In Proc. IEEE International Parallel and Distributed
Processing Symposium (IPDPS), Nice, France, 2003.
[46] A. Miyoshi, C. Lefurgy, E. V. Hensbergen, R. Rajamony, and R. Rajkumar. Critical
Power Slope: Understanding the runtime effects of Frequency Scaling. In Proc. ACM
International Conference on Supercomputing, June 2002.
[47] B. Mochocki, X. Hu, and G. Quan. A Unified Approach to Variable Voltage Schedul-
ing for Nonideal DVS Processors. IEEE Transactions on Computer-Aided Design of
Integrated Circuits and Systems, 23(9):1370–1377, September 2004.
[48] D. Mossé, H. Aydin, B. Childers, and R. Melhem. Compiler-Assisted Dynamic
Power-aware Scheduling for Real-Time Applications. In Proc. Workshop on Compiler and OS
for Low Power (COLP), October 2000.
[49] M.T. Yang, R. Kasturi, and A. Sivasubramaniam. A pipeline-based approach for
scheduling video processing algorithms on NOW. IEEE Trans. Parallel and Distributed
Systems, 14(2), 2003.
[50] P. D. Hoang and Jan M. Rabaey. Scheduling of DSP programs onto multiprocessors for
maximum throughput. IEEE Trans. Signal Processing, 41(6), 1993.
[51] P. Pillai and K. G. Shin. Real-time Dynamic Voltage Scaling for Low-Power Embedded
Operating Systems. In Proc. ACM Symposium on Operating Systems Principles (SOSP),
pages 89–102, October 2001.
[52] F. Preparata and M. Shamos. Computational Geometry: An Introduction. Springer,
1993.
[53] R. P. Dick, D. L. Rhodes, and W. Wolf. TGFF: Task graphs for free. In Proc. Sixth
International Workshop on Hardware/Software Codesign, 1998.
[54] C. Rusu, R. Xu, R. Melhem, and D. Mossé. Energy-Efficient Policies for
Request-Driven Soft Real-Time Systems. In Proc. Euromicro Conference on Real-Time Systems
(ECRTS), Catania, Italy, July 2004.
[55] S. Banerjee, T. Hamada, P. M. Chau, and R. D. Fellman. Macro pipelining
based scheduling on high-performance heterogeneous multiprocessor systems. IEEE
Trans. Signal Processing, 43(6), 1995.
[56] S. Saewong and R. Rajkumar. Practical Voltage-Scaling for Fixed-Priority RT-Systems.
In Proc. IEEE Real-Time Embedded Technology and Applications Symposium (RTAS),
May 2003.
[57] R. Stephens. A survey of stream processing. Acta Informatica, 34(7):491–541, 1997.
[58] W. Thies, M. Karczmarek, and S. Amarasinghe. StreamIt: A language for streaming
applications. In Proc. International Conference on Compiler Construction, Grenoble,
France, 2002.
[59] H. S. Wang, X. Zhu, L. S. Peh, and S. Malik. Orion: A power-performance simulator
for interconnection networks. In Proc. International Symposium on Microarchitecture,
2002.
[60] N.H.E. Weste and K. Eshraghian. Principles of CMOS VLSI Design. Addison-Wesley,
Reading, MA, 1993.
[61] W.Y. Lee, Y.W. Ko, H. Lee, and H. Kim. Energy-efficient scheduling of a real-time task
on DVFS-enabled multi-cores. In Proc. International Conference on Hybrid Information
Technology, Seoul, Korea, 2009.
[62] C. Xian, Y.H. Lu, and Z. Li. Energy-aware scheduling for real-time multiprocessor
systems with uncertain task. In Proc. Design Automation Conference (DAC), San Diego,
CA, 2007.
[63] F. Xie, M. Martonosi, and S. Malik. Compile-Time Dynamic Voltage Scaling Settings:
Opportunities and Limits. In Proc. Programming Language Design and Implementation
(PLDI), June 2003.
[64] Intel XScale Microarchitecture: Benchmarks, 2005.
http://web.archive.org/web/20050326232506/http://developer.intel.com/design/intelxscale/benchmarks.htm.
[65] R. Xu, R. Melhem, and D. Mossé. A Unified Approach to Stochastic DVS Scheduling. In
Proc. of ACM International Conference on Embedded Software (EMSOFT), Salzburg,
Austria, October 2007.
[66] R. Xu, R. Melhem, and D. Mossé. Energy-Aware Scheduling for Streaming
Applications on Chip Multiprocessors. In Proc. IEEE Real-Time Systems Symposium (RTSS),
Tucson, AZ, December 2007.
[67] R. Xu, D. Mossé, and R. Melhem. Minimizing expected energy consumption in
real-time systems through dynamic voltage scaling. ACM Transactions on Computer Systems
(TOCS), 25(4), 2007.
[68] R. Xu, C. Xi, R. Melhem, and D. Mossé. Practical PACE for Embedded Systems. In
Proc. ACM International Conference on Embedded Software (EMSOFT), Pisa, Italy,
September 2004.
[69] R. Xu, D. Zhu, C. Rusu, R. Melhem, and D. Mossé. Energy-efficient policies for
embedded clusters. In Proc. ACM SIGPLAN/SIGBED Conference on Languages, Compilers,
and Tools for Embedded Systems (LCTES), June 2005.
[70] Y. Chow, F. Anger, and C. Lee. Scheduling precedence graphs in systems with
interprocessor communication times. SIAM J. Computing, 18(2), 1989.
[71] W. Yuan and K. Nahrstedt. Energy-Efficient Soft Real-Time CPU Scheduling for Mo-
bile Multimedia Systems. In Proc. ACM Symposium on Operating Systems Principles
(SOSP), October 2003.
[72] Y. Zhang, Z. Lu, J. Lach, K. Skadron, and M. Stan. Optimal Procrastinating Volt-
age Scheduling for Hard Real-Time Systems. In Proc. Design Automation Conference
(DAC), June 2005.
[73] B. Zhao and H. Aydin. Minimizing expected energy consumption through optimal
integration of DVS and DPM. In Proc. International Conference on Computer-Aided
Design, San Jose, CA, 2009.
INDEX
ATR, see automatic target recognition
automatic target recognition, 1, 24
DVS, see dynamic voltage scaling
DVS scheme, 3, 9
  hybrid DVS, 3, 9
  inter-task DVS, 3, 9
  intra-task DVS, 3, 9
dynamic power, 8
dynamic voltage scaling, 2, 8
FPTAS, 4
frame, 7
frame-based system, 7
GOPDVS, 36
Greedy2, 59
hard real-time, 7
HDVS, 76
IBM PowerPC 405LP, 26
ideal processor model, 16
IDVS, 75
Intel XScale, 26
master-slave scheme, 86
MS scheme, 90
OITDVS, 35
on-off, see vary-on/vary-off
period, 7
PGOPDVS, 51
PIT-PPACE, 52
PITDVS, 48
PITDVS2, 4, 49
PPACE, see Practical PACE
Practical PACE, 4, 38
Proportional2, 59
real-time system, 7
realistic processor model, 17
relative error, 54
request, 1
Scheduling1D, 97
Scheduling2D, 4, 105
SIDVS, 66
SMS scheme, 91
soft real-time, 7
speed scaling point, 38
speed schedule, 30
SScheduling2D, 5, 108
static power, 8
Statistical2, 59
step function, 71
stochastic DVS scheme, 20
STREAM-MP-D-ST, 21, 86
STREAM-MP-D-TG, 22, 94
STREAM-MP-S-ST, 22, 86
STREAM-MP-S-TG, 22, 94
STREAM-UP-S-ST, 20, 28
STREAM-UP-S-TG, 21, 28
streaming application, 1, 6, 14
  response time, 1
  throughput, 1
synthetic processor, 25
vary-on/vary-off, 2
WCEC, see worst-case execution cycles
WCET, see worst-case execution time
worst-case execution cycles, 7
worst-case execution time, 7
x-oriented load, 102
y-oriented load, 95
