Unified Framework for Energy-proportional Computing in Multicore Processors: Novel Algorithms and Practical Implementation by Hanumaiah, Vinay (Author) et al.
Unified Framework for Energy-proportional Computing in Multicore Processors: Novel
Algorithms and Practical Implementation
by
Vinay Hanumaiah
A Dissertation Presented in Partial Fulfillment
of the Requirements for the Degree
Doctor of Philosophy
Approved May 2013 by the
Graduate Supervisory Committee:
Sarma Vrudhula, Chair
Karamvir Chatha
Chaitali Chakrabarti
Armando Rodriguez
Ronald Askin
ARIZONA STATE UNIVERSITY
August 2013
ABSTRACT
Multicore processors have proliferated in nearly all forms of computing, from servers and
desktops to smartphones. The primary reason for this wide adoption is their ability to
overcome the power wall by providing higher performance at lower power consumption.
With multicores, there is an increased need for dynamic energy management (DEM), much
more than for single-core processors, because DEM for multicores is no longer just a
mechanism to keep a processor under specified temperature limits, but a set of techniques
that manage various processor controls, such as dynamic voltage and frequency scaling
(DVFS), task migration, and fan speed, to achieve a stated objective. The objectives span
a wide range, including maximizing throughput, minimizing power consumption, reducing
peak temperature, maximizing energy efficiency, and maximizing processor reliability,
subject to an equally wide range of constraints on temperature, power, timing, and
reliability. Thus DEM can be very complex and challenging to achieve. Since many DEM
techniques often operate together on a single processor, there is a need to unify them.
This dissertation addresses that need.
In this work, a framework for DEM is proposed that provides a unifying processor
model, comprising power, thermal, timing, and reliability models, and that supports
various DEM control mechanisms and many different objective functions along with equally
diverse constraint specifications. Using the framework, a range of novel solutions is
derived for instances of DEM problems, including maximizing processor performance or
energy efficiency, and minimizing power consumption or peak temperature, under
constraints on maximum temperature, memory reliability, and task deadlines.
Finally, a robust closed-loop controller that implements the above solutions on a real
processor platform with very low operational overhead is proposed. Along with the
controller design, a model identification methodology for obtaining the power and
thermal models required by the controller is also discussed. The controller is
architecture independent and hence easily portable across many platforms. The controller
has been successfully deployed on an Intel Sandy Bridge processor, where it increased
the energy efficiency of the processor by over 30%.
To my mother
Without your unconditional love and support, all of this could never have happened
ACKNOWLEDGEMENTS
I would like to thank my advisor, Prof. Sarma Vrudhula, for his unwavering faith in
me, and for all his guidance and advice through the years. This work would not have been
possible without him. My sincere thanks to my committee members for taking the time to
review my work, attending my presentations, and offering many helpful suggestions.
I am deeply indebted to Ravishankar Rao, whose work laid the foundation for mine and
without which this work would not have progressed to its current state.
I have been fortunate to share my workplace with many colleagues who became
my friends. My thanks to them for being there for me when I needed them and making
the workplace livelier. My special thanks to Digant Desai and Benjamin Gaudette; without
their collaboration, this work would be far from complete.
I am indebted to Chakravarthy Akella and Christopher Lucero of the Intelligent
Systems Group (ISG) at Intel in Chandler, Arizona for their technical and material support
and their valuable guidance. The important closed-loop implementation work would not
have been possible without their assistance and cooperation. I would also like to thank
Srikanth Sridharan for helping me verify controller designs and suggesting improvements.
My parents have been a constant source of encouragement and support through
these long years. I am grateful for their trust in me, and for everything they have done for
me.
Finally, my thanks to the many sponsoring agencies: the National Science Foundation
(grants CSR-EHS 0509540 and NeTS 0905035), the Center for Embedded Systems (grant
DWS-0086), Science Foundation Arizona (grant SRG 0211-07), and the Stardust Foundation.
I would also like to thank the Department of Computer Science for granting me a research
assistantship, and the Graduate College for awarding me the Dissertation Fellowship.
TABLE OF CONTENTS
LIST OF TABLES
LIST OF FIGURES
LIST OF SYMBOLS/NOMENCLATURE
CHAPTER
1 Introduction
  1.1 Shift Toward Multicore Processors
  1.2 Need for Advanced Thermal Management in Multi-core Processors
  1.3 State of the Art in Dynamic Energy Management
    Classification of Related Work
  1.4 Research Contribution and Thesis Outline
2 Processor System Models
  2.1 Performance Models
  2.2 Power and Thermal Models
    Thermo-electrical models
    HotSpot Thermal Model
    Reduced Thermal Model
  2.3 Effect of Temperature and Voltage on Delay
  2.4 Fan Model
  2.5 Summary
3 Performance Optimal Dynamic Voltage and Frequency Scaling
  3.1 Introduction
  3.2 General Problem Formulation Involving DVFS and Migration
  3.3 Performance Optimal DVFS Policy
  3.4 Convex Optimization Formulation and its Solution
  3.5 Fast Computational Procedure
  3.6 Experimental Results
    Experimental Setup
    The Optimal Makespan Minimization Policy
    Comparison of the Approximate Solution with the Convex Optimization Solution for Makespan Minimization
    Discrete Voltage-speed Implementation
4 Performance Optimal Task-to-core Allocation
  4.1 Problem Definition
  4.2 Structure of the HotSpot Conductance Matrix
  4.3 Simplified Temperature Computation
  4.4 Linear Assignment Problem
  4.5 Experimental Results
    Performance Improvement through Task Migration
    Performance Comparison of the Optimal Task-to-core Allocation with the Power-based Thread Migration
    Effect of Cores and Tasks on the Performance of Task-to-core Allocation
    Computation Times
5 Performance Optimal Dynamic Voltage and Frequency Scaling with Hard Deadlines
  5.1 Introduction
  5.2 Problem Statement
  5.3 Optimal Solution for Minimum Makespan without Deadlines
    Determination of $s^{\max}_{h,i}$
  5.4 Minimum Makespan with Deadlines
    Critical Task Voltage-speed Determination
    Speeds of non-critical tasks
  5.5 Experimental results
    Makespan Minimization for Tasks without Deadlines
    Makespan Minimization for Tasks with Deadlines
6 Minimizing Peak Temperature using Dynamic Voltage and Frequency Scaling
  6.1 Introduction and Background
  6.2 Reliability Model
  6.3 Problem Statement and Approach
    Problem Description
    Solution Outline
    Quasiconvex Programming Solution for Minimizing Peak Temperature
  6.4 Experimental Results
    Optimal Policy vs Max-throughput Policy
7 Energy-efficient DTM using DVFS, Task Migration and Active Cooling
  7.1 Introduction
  7.2 Problem Description and Optimal Solution
    Characteristics of PPW metric
    Voltage-Speed Control
    Task-to-Core Allocation
    Fan Speed Control
  7.3 Results
    Comparison of Voltage-speed Scaling Schemes
    Effect of cores on PPW
    Comparison of Task migration schemes
    Improvement in PPW through Fan Speed Scaling
8 Exploiting Reliability and Bandwidth Slacks for Improved Memory Energy Management
  8.1 Introduction
    Memory reliability models
  8.2 Opportunities for exploiting memory reliability and bandwidth slack
    Limitations of pre-determined voltage-frequency pairs
    Inherent error resiliency of multimedia
    Under utilized last-level cache
    Near-Memory architecture for LLC
  8.3 Problem Statement and Approach
    Implementation Details
  8.4 Experimental Analysis
    Experimental setup
    Solution process
    Energy savings from the proposed DVFS for LLC
    Energy savings by lowering BER
9 Temperature-aware Robust Controller: Accurate Modeling and Prediction for Multi-core Processors
  9.1 Introduction
  9.2 Power and Thermal Models
  9.3 Model Identification
    Power Model
    Thermal Model
  9.4 Minimization of Power and Temperature Prediction Errors through Kalman Filter
  9.5 DEM Controller
  9.6 Experimental Results
    Experimental Setup
    Noise Analysis of Power and Thermal Sensors
    Comparison of Tracking Ability of the Proposed Controller with Existing Approaches
    Maximizing Performance Subject to Thermal Constraints
    Maximizing Energy efficiency
10 Conclusions and Open Problems
REFERENCES
A PROOFS
  A.1 Proof of Optimal Control Policy
    Hamiltonian Setup
    Optimal Voltage-Speed Profile
  A.2 Proof of Convexity of Voltage-speed Scaling
  A.3 Proof of Quasiconcavity of PPW
  A.4 Proof of Quasiconvexity of Temperature w.r.t. Fan Speed
  A.5 Proof of Quasiconvexity of BER w.r.t. $v_m$
LIST OF TABLES
1.1 List of processor thermal and power models
3.1 Characteristics of benchmarks used in the experiment.
3.2 Comparison of the proposed approximate method with the accurate convex optimization method.
4.1 Computation times per core (in milliseconds) for DVFS, task-to-core allocation and fan speed scaling
5.1 Instruction length, IPC, average dynamic power and deadlines of the tasks used in the experiment.
5.2 Comparison of makespan for the optimal MMS and the modified algorithm, the associated deadline violations of the optimal MMS, and the computation times.
6.1 Characteristics of Tasks used in the experiments.
7.1 Task completion times of benchmarks used in the experiment
7.2 Comparison of overall delay, energy and PPW for schemes shown in Figure ??.
8.1 Characteristics of Parsec benchmarks used in the experiments
9.1 Comparison of overall delay, energy and PPW of the proposed energy-efficient DVFS policy with the Linux DVFS policies shown in Figure 9.8.
LIST OF FIGURES
1.1 Dual versus Single Core [1]
1.2 Hierarchical organization of the survey
1.3 Outline of topics covered in this dissertation
2.1 Electrical analogy for thermal conduction [2]
2.2 HotSpot-4 thermal model for a four core processor.
2.3 Piecewise-linear approximation of leakage power
2.4 Reduced multi-core thermal model
2.5 Typical plot of convection resistance $R_{conv}$ and fan power $P_{fan}$ with the speed of a fan
3.1 Time scales for optimization
3.2 Figure showing the optimal voltage-speed policy of a core as described by (3.17) and (3.18).
3.3 Plot of values of matrix R for the dual-core Alpha processor with scheduling interval $t_s$ = 10 ms [3].
3.4 Dual-core floorplan of Alpha 21264 processor [4].
3.5 Experimental setup showing various components
3.6 Performance of the optimal DVFS scheme in minimizing the overall makespan
3.7 Discretization of the optimal makespan algorithm with ten speed states
4.1 Response of the die and the package temperatures to changes in speed of a core [5].
4.2 Plot of the conductance matrix showing the sparsity of the dual-core Alpha processor floor plan shown in Figure 3.4 [4].
4.3 Components of the thermal conductance matrix G of the dual-core Alpha processor floor plan shown in Figure 3.4 [4].
4.4 Comparison of throughput with and without task migration.
4.5 Comparison of temporal performance of the optimal allocation scheme with the P.TM scheme.
4.6 Plot of throughput improvement of the proposed optimal algorithm against the P.TM technique for various ratios of tasks and cores
5.1 Minimizing makespan may lead to deadline violations.
5.2 (a) An example showing a deadline violation for zero-slack execution of four tasks on a four-core processor. (b) A possible deadline feasible solution for the example in (a).
5.3 Plot of speeds, voltages and temperature of the hottest block of cores when executing under zero-slack policy
5.4 Plot of speeds, voltages and temperature of the hottest block of cores when executing under optimal MMS policy
6.1 Creation of slots based on start and end times of tasks.
6.2 Two task example to demonstrate the quasiconvexity of (6.8).
6.3 Core speeds and temperatures of hottest blocks for Scenario 1.
6.4 Optimal speeds and temperatures for Scenario 2.
7.1 Energy-delay curve.
7.2 Quasiconcave nature of PPW metric
7.3 Effect of temperature on PPW
7.4 Effect of fan speed on PPW
7.5 Effect of leakage power on PPW
7.6 Comparison of PPWs of the brute-force TCA with the proposed TCA (migration interval - 100 ms)
7.7 Energy delay curve
7.8 Comparison of speed, temperature and power of execution for various objectives.
7.7 Plot of MIPS/Watt against the number of cores for various factors of power reduction per core
7.8 Comparison of throughput, total power and PPW for various αs
7.9 Comparison of throughput, total power consumption and PPW of DVFS with and without fan speed scaling
8.1 Figure illustrating the complex interaction between various physical quantities in a processor
8.2 Effect of process variations and in timing related memory errors
8.3 Memory BER as a function of memory voltage and frequency
8.4 Examples of restoring images corrupted by various levels of noise, and the possible energy savings that is obtained by operating a memory at a supply voltage that satisfies the corresponding BER.
8.5 Plot of bandwidth requirement for LLC for three different Parsec benchmarks
8.6 Proposed near-memory architecture for LLC
8.7 Experimental setup
8.8 Energy savings from the proposed BER and bandwidth aware DVFS for LLC
8.9 Energy savings increase proportionally with increasing BER tolerance
9.1 Structure of the closed-loop controller with its various components
9.2 Piece-wise linearized surface of leakage power w.r.t. core speed and temperature
9.3 Comparison of MSR power readings with the measurements from TI microcontroller
9.4 Plot of standard deviation of noise from temperature sensors
9.5 Plot showing the tracking of power consumption of the proposed model vs. Wang et al. model
9.6 Plot depicting the tracking ability of the proposed closed-loop controller
9.7 Plot of speed, core temperatures and power while maximizing total performance under thermal constraints
9.8 Comparison of speed, maximum temperature, power and PPW using various policies on Intel Sandy Bridge processor
9.9 Pictorial comparison of existing DEM policies on Linux with the proposed policy. The dotted line shows the energy-delay pattern.
LIST OF SYMBOLS
$n$: No. of cores
$q$: No. of tasks
$m$: No. of functional units in a core
$N$: Total no. of thermal blocks in a processor
$s_c$: Normalized clock speed of core c
$v_c$: Normalized voltage of core c
$IPC_j$: Instruction per cycle of task j
$x_c(t)$: Number of instructions completed by core c at time t
$I_c$: Total number of instructions of task j
$v_{th}$: Threshold voltage of circuit
$s_{fan}$: Normalized angular velocity of fan
$P_{c,b}$: Total power consumption of block b in core c
$P_{dyn,c,b}$: Dynamic power of block b in core c
$P^{\max}_{dyn,c,b}$: Profile of maximum $P_{dyn,c,b}$
$P_{lkg,c,b}$: Leakage power consumption of block b in core c
$T_{c,b}$: Temperature of block b in core c
$T_{\max}$: Maximum allowed temperature
$G$: Conductance matrix of entire processor
$G_{T,c,b}$: Temperature co-eff. of lkg. power of block b in core c
$P_v$: Leakage power component due to voltage
$M$: Task-to-core allocation matrix
$G_{die,c}$: Conductance matrix of the die layer of core c
$G_{tim,c}$: Conductance matrix of the TIM layer of core c
$G_{pkg}$: Package conductance matrix
$T_{spr}$: Temperature of the spreader center
Chapter 1
Introduction
Until recently, the processor industry was able to achieve the performance increase
predicted by Moore’s law by scaling down the size of transistors in every generation.
However, this performance increase was not sustainable because of the power wall, that
is, the inability to increase performance without increasing power consumption, and the
associated high temperatures caused by continued transistor scaling. Additionally,
increasing power consumption is detrimental to battery life, and higher temperatures
drastically reduce circuit reliability. Consequently, significant research in circuits,
architecture, and system-level software and control-theoretic methods is directed toward
increasing processor energy efficiency, reducing temperatures, and increasing
reliability. The collection of such techniques is termed processor thermal management
(PTM), and the branch of PTM that deals with real-time solutions for changing workloads
is termed dynamic energy management, or simply DEM.
This work addresses a subset of processor thermal management techniques, both
online and offline, that are software controlled at the operating-system level and
designed to achieve objectives such as maximizing throughput and energy efficiency and
minimizing power consumption, under constraints on reliability, task deadlines, and peak
temperature.
1.1 Shift Toward Multicore Processors
Multi-core processors emerged as a response to the “power wall”, which limited the
scaling of clock frequencies in single-core processors [1, 6]. Although transistor scaling
reduced dynamic power consumption, allowing clock frequencies to increase, the leakage
power contribution to total power increased exponentially, thus limiting the growth of
clock frequencies. Further, a processor's maximum heat dissipation is limited by the
packaging and cooling solution, which enforces a safe operating temperature for the
electronic circuits on the silicon die. Violating this maximum temperature constraint
over any part of the die can reduce chip reliability and even cause permanent failure [7].
Moreover, higher power consumption requires advanced cooling technologies such as liquid
cooling, which are not viable because of their higher cost, and it reduces battery life
(for portable electronics).
With multiple small, low-power cores, each running at a lower clock frequency, it is
possible to attain higher processor throughput than a fast single-core processor within
the same power/thermal budget. This can be explained by an empirically observed
characteristic of throughput known as Pollack’s rule [6], which states that throughput
increases only as the square root of the number of transistors. Consider a single
large-core processor with $N_t$ transistors versus a processor with $n$ smaller (and less
powerful) cores, each with $N_t/n$ transistors. Thus, both processors have the same
number of transistors. Let $S_n$ denote the throughput of an $n$-core processor. Then

$$\frac{S_n}{S_1} = \frac{n \cdot \sqrt{N_t/n} \cdot f_{clk}}{\sqrt{N_t} \cdot f_{clk}} = \sqrt{n}.$$

That is, assuming that both processors are operated at the same clock frequency, their
power consumption will be roughly the same, whereas the throughput of the multi-core
processor will be $\sqrt{n}$ times that of the single-core processor. For example, the
throughput of a dual-core processor would be about 40% higher than that of a single-core
processor, even though the throughput of each individual core of the dual-core processor
will be only about 70% of that of the single-core processor. This is borne out in
Figure 1.1 [1], which compares results of benchmark programs running on single- and
dual-core Intel processors. For a single-core processor, increasing the clock frequency by
20% returns a performance improvement of just 13% but an increase in power consumption
of nearly 74%. In contrast, a dual-core processor with a 20% reduction in clock frequency
yields a 73% improvement in performance at nearly the same power consumption as a
single-core processor running at maximum frequency. Thus, the multi-core approach
exploits thread-level parallelism to improve performance while keeping power consumption
nearly constant. Hence, increasing the number of cores per die has become the new
“scaling” strategy, with massively multi-core processors (hundreds of cores) expected in
the near future [6].

Figure 1.1: Dual versus Single Core [1]. Relative performance (Perf) and power for a single core over-clocked at 1.2 x f (1.13X Perf, 1.73X Power), a single core at f (1.00X), and a dual core under-clocked at 0.8 x f (1.73X Perf, 1.02X Power).
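The Pollack's-rule argument can be checked numerically. The following is a minimal sketch, not part of the original work: the function names and the default transistor count are illustrative assumptions, and the per-core throughput model is simply the square-root relation stated above.

```python
import math


def pollack_throughput(transistors: float, f_clk: float = 1.0) -> float:
    """Throughput of one core under Pollack's rule (assumed model):
    proportional to sqrt(transistor count) times clock frequency."""
    return math.sqrt(transistors) * f_clk


def multicore_speedup(n_cores: int, total_transistors: float = 1.0e9,
                      f_clk: float = 1.0) -> float:
    """Ratio S_n / S_1 for n equal cores sharing the same transistor budget
    and running at the same clock frequency."""
    single = pollack_throughput(total_transistors, f_clk)
    multi = n_cores * pollack_throughput(total_transistors / n_cores, f_clk)
    return multi / single


def per_core_fraction(n_cores: int, total_transistors: float = 1.0e9) -> float:
    """Throughput of one small core relative to the single large core."""
    return (pollack_throughput(total_transistors / n_cores)
            / pollack_throughput(total_transistors))


if __name__ == "__main__":
    for n in (1, 2, 4, 16):
        print(f"n={n:2d}  S_n/S_1={multicore_speedup(n):.3f}  "
              f"per-core={per_core_fraction(n):.3f}")
```

For n = 2 the ratio evaluates to the square root of 2, which corresponds to the roughly 40% gain and 70% per-core throughput quoted in the text.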
1.2 Need for Advanced Thermal Management in Multi-core Processors
With single-core processors, and even with dual- and quad-core processors, dynamic
thermal management (DTM) techniques were intended only for emergency use. The thermal
package and cooling system were designed to handle the worst-case power dissipation.
However, with many-core processors, there is a corresponding increase in both the temporal
and the spatial variance of the chip's power dissipation. This is due to differences in
the power consumption of the tasks executing on the cores, and also to changing phases
(workload utilization) during the execution of a task.
With this significant variation in the power and temperature distribution of a processor,
processors can no longer be designed to handle worst-case dissipation; instead, their
thermal design power must be closer to the average power consumption. Such a design has
to rely on DTM techniques to reduce the spatial and temporal variation in power
consumption and to prevent thermal emergencies without compromising the desired
throughput.
Multicore processors also exhibit interesting scenarios that are unseen in single-core
processors. A multicore processor can have different maximum operable clock frequencies
depending on the number of cores that are in use. For example, an eight-core processor
can have a maximum operable frequency of 2.66 GHz when all cores are used, but a higher
operating frequency of 3.5 GHz when only six cores are used. This is possible because, in
these two scenarios, the combination of operating frequency and number of cores produces
similar power consumption.
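As a rough check of the example above, the following sketch uses a first-order proxy in which total dynamic power is proportional to the number of active cores times the clock frequency. The fixed-voltage assumption and the helper name relative_power are illustrative only; in practice a higher frequency usually also requires a higher voltage, so this is an approximation rather than a processor specification.

```python
def relative_power(active_cores, freq_ghz):
    """First-order proxy (assumed): dynamic power ~ active cores x clock frequency,
    with the supply voltage held fixed."""
    return active_cores * freq_ghz


# The two operating points quoted in the text.
all_cores = relative_power(8, 2.66)   # eight cores at 2.66 GHz
six_cores = relative_power(6, 3.5)    # six cores at 3.5 GHz
ratio = six_cores / all_cores

print(f"8 cores @ 2.66 GHz: {all_cores:.2f} core-GHz")
print(f"6 cores @ 3.50 GHz: {six_cores:.2f} core-GHz")
print(f"ratio: {ratio:.3f}")
```

Under this proxy the two configurations differ by less than 2%, which is consistent with the statement that they produce similar power consumption.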
In addition to the above, for multicores a DEM solution cannot be derived with just a
simple binary search or a proportional-integral-derivative (PID) controller, as is the case
for single-core processors. Multicore processors have also led to new objectives and
constraints for DEM. Hence DEM is of utmost importance for multicore processors.
1.3 State of the Art in Dynamic Energy Management
Energy-efficient computing was first introduced in the seminal work on low-power
computing [8]. Low-power computing included circuit-, architectural- and algorithmic-level
techniques to reduce power wastage, e.g. stand-by mode, clock gating and a reduction in
the functional units required for computation. However, in the recent decade, power
density, along with power consumption, has become a critical issue because it leads to
higher processor temperatures. Higher temperatures are a major cause of device failures,
lower device reliability, performance throttling and increased leakage power consumption.
Because of the cyclic relationship between power consumption and temperature, lowering
power consumption is necessary for lowering temperatures. Unlike energy consumption,
temperature is not an aggregate quantity, but a time-specific one. As such, static
techniques like microarchitectural, floorplan and circuit techniques are not sufficient
to constrain a time-dependent processor temperature to a specified tolerance limit. Thus
DTM is a must for regulating processor temperatures within a safe limit. Also, as
mentioned previously, software handling of DEM provides greater leverage and flexibility
in the control and optimization of processor energy efficiency.
In the early years, the main focus of DEM was to ensure that a processor's operating
temperature does not cross the maximum temperature. One of the simplest solutions to
this problem is the stop-and-go policy [9]: whenever the processor temperature reaches the
stipulated maximum temperature, the processor is shut off, allowed to cool to a certain
pre-determined temperature, and then restarted. While this scheme does safeguard the
processor, the performance penalties are huge. The DVFS technique was developed to
reduce these performance penalties. Unlike the stop-and-go policy, DVFS allows many
levels of frequency and voltage combinations, and the cubic savings in power consumption
for a linear reduction in performance is the key factor by which DVFS reduces
performance penalties.
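The cubic savings mentioned above can be illustrated with a short sketch. It assumes the usual dynamic power relation P ~ s v^2 with normalized speed s and voltage v, and idealizes DVFS as scaling the voltage in direct proportion to the speed. The function names and the linear voltage-speed relation are assumptions made for illustration; real voltage-frequency curves are not exactly linear.

```python
def dfs_power(speed):
    """Frequency scaling only (voltage fixed at its nominal value of 1): P ~ s."""
    return speed


def dvfs_power(speed):
    """Voltage and frequency scaling, assuming v tracks s linearly: P ~ s * v^2 = s^3."""
    voltage = speed  # idealized assumption, not a measured V-f curve
    return speed * voltage ** 2


for s in (1.0, 0.9, 0.8, 0.5):
    print(f"speed {s:.1f}: DFS power {dfs_power(s):.3f}, DVFS power {dvfs_power(s):.3f}")
```

For a 20% reduction in speed, the sketch gives a 20% power reduction under frequency scaling alone, but nearly 49% under the idealized DVFS assumption.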
The scope of DEM was very restricted for single-core processors; it mainly consisted of
performance maximization under thermal/power constraints, or minimization of power
consumption or peak temperature under guaranteed performance. With the introduction of
multicores, the scope of DEM has greatly increased, and most research in low-power
techniques invariably has to address multicore architectures. Multicore processors add
new control variables to DEM, e.g. multiple voltage and frequency controls and task
migration, which were not present in single-core processors. Single-core optimization
problems usually have simpler solutions, owing to the single control. In most cases, the
solution can be obtained by a simple binary search. With additional controls, the earlier
problems no longer have simple solutions and, in most cases, cannot be readily framed
in known optimization frameworks. Apart from increasing computational complexity,
multicore DEM has opened some very interesting new problems. Some of them are: finding
feasible DVFS schedules to satisfy task deadlines for all cores, DVFS for minimization
of temperature gradients among cores, minimum makespan (task completion time), resource
allocation to tasks according to user priority while maximizing energy efficiency,
and others. DEM optimization solutions need to ensure that the performance benefits
obtained from multicores are not negated by the additional expense of computing the
solution and by the implementation overhead. One of the major bottlenecks in multicore
DEM is the practical implementation of the DEM solution. This is because a multicore
processor is basically a multi-input-multi-output (MIMO) system, and there is no known
standard optimal technique for MIMO control implementation; as such, any implementation
of multicore DEM has to keep the computational overhead low.
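To illustrate why a single control makes the single-core problem simple, the sketch below finds the highest speed that respects a temperature limit by binary search. The lumped steady-state model T = T_amb + R_th * P_max * s^3 and all parameter values are hypothetical; the only property the search relies on is that temperature increases monotonically with speed.

```python
def steady_state_temp(speed, t_ambient=45.0, r_thermal=0.5, p_max=90.0):
    """Hypothetical lumped model (deg C): T = T_amb + R_th * P_max * s^3."""
    return t_ambient + r_thermal * p_max * speed ** 3


def max_safe_speed(t_max, temp_of=steady_state_temp, tol=1e-6):
    """Largest normalized speed in [0, 1] whose steady-state temperature is <= t_max.

    Assumes temp_of is monotonically increasing in speed.
    """
    if temp_of(0.0) > t_max:
        raise ValueError("temperature limit is below the idle temperature")
    if temp_of(1.0) <= t_max:
        return 1.0
    lo, hi = 0.0, 1.0  # invariant: temp_of(lo) <= t_max < temp_of(hi)
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if temp_of(mid) <= t_max:
            lo = mid
        else:
            hi = mid
    return lo


speed = max_safe_speed(80.0)
print(f"max safe speed: {speed:.4f}, temperature: {steady_state_temp(speed):.3f} C")
```

With several cores that heat each other and several speeds to choose, there is no single monotone variable to bisect on, which is why the multicore problems discussed above need different solution methods.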
Classification of Related Work
This section is primarily focused on thermal management techniques that operate at the
operating system level. These techniques primarily take a software-based approach that
uses built-in hardware support to alter the power and thermal characteristics of
processors in order to optimize a given objective while ensuring that all constraints are
satisfied. The related techniques are categorized broadly into four main categories:
optimization goals and constraints, power and temperature controlling mechanisms,
modeling and prediction of processor power consumption and temperature, and solution
methodology and implementation schemes.
Every DEM problem starts with a specification of the optimization goals and the
constraints that need to be met. The goals can vary widely, from maximizing performance
or maximizing energy efficiency subject to thermal and power constraints, to minimizing
peak temperature or maximizing reliability while meeting performance guarantees or task
deadlines.
Once the optimization goals and the constraints to be met are decided, the next thing
to decide is which control mechanism to use to achieve the stated goals. The available
control knobs are as follows. Dynamic voltage and frequency scaling alters the voltage
and frequency of each core, thereby affecting the performance, power consumption and
temperature of the core. Task migration allows better matching of workload to cores to
enhance performance at lower power consumption. Task sequencing involves deciding the
sequence of tasks so as to match the final temperature of the previous task to the
initial temperature of the current task, which minimizes the number of DVFS switches and
thereby reduces performance overhead. Task sequencing is mostly applicable to periodic
tasks, where the execution time and the power consumption of tasks are known a priori.
Fan control is not frequently used in DEM, as it is believed to have less impact on
power consumption and performance than the other control variables discussed above.
However, it was shown recently that optimizing fan speed can have a substantial impact
on energy savings.
Solving a DEM optimization problem requires models describing the effect of changing
the control knobs on the given objectives and constraints. These models can be equations
provided a priori by manufacturers, or can be derived by correlating the output power
and temperature measurements with the input control values. Modeling and prediction
mechanisms play a crucial role in determining the quality of the solutions and the
complexity of the resulting DEM implementation, and for this reason, choosing the right
power and thermal models is the most challenging aspect of a DEM optimization.
The final step in DEM involves the solution method and the implementation scheme
for the chosen DEM problem. The power and temperature models that are chosen play
an important role in deciding the solution method, since model complexity translates
directly into solution complexity. A poorly chosen model may even result in an infeasible
solution. A good solution method should be computationally inexpensive and easy to
implement in practice. The implementation scheme depends on the target platform. Some of
the early works implemented the DEM solution on a simulator, where it can be tested and
improved more easily and in a shorter time than on a real platform. It is also easier
to understand the complexities of a real implementation by working on its simulator.
However, simulators are never perfect and do not accurately reproduce the various events
and measurements of a real platform. There are several reasons for the inaccuracies.
First, there are no known perfect models for power consumption and temperature
estimation. Second, even if the models are assumed correct, the measurements from sensors
are almost always corrupted by substantial noise. It is very important to consider these
factors when designing a closed-loop controller that provides stable, robust control with
good estimates of power consumption and temperature. The designs vary from single-core
to multicore processors. Designing a robust MIMO controller for multicore processors is
very challenging.
Figure 1.2: Hierarchical organization of the survey
Energy-aware computing
    Optimization goals: performance maximization, maximizing energy-efficiency,
        minimizing peak temperature, satisfying task deadline, enhancing reliability
    Controlling mechanisms: DVFS, task migration, task sequencing, fan control
    Modeling and prediction: compact thermal models, ARMA, Kalman filter
    Implementation schemes: simulator, open-loop, PID, MPC
A summary of the classification of DEM techniques is shown in Figure 1.2. The
following sections briefly compare the existing techniques in each of the above DEM cat-
egories.
Optimization goals
Over the last decade, researchers have expanded the scope of DEM to many objective
functions, as discussed in the previous section. One of the first objectives, and still
the one that attracts the most research, is maximizing processor throughput or
performance. There is a lot of literature on DEM for maximizing throughput, under
energy/power constraints [10–13], thermal constraints [3, 5, 9, 14–20], or task deadline
constraints [21]. Other DEM objectives that can be found in the literature are
minimizing peak temperature [22–24], maximizing energy efficiency [25–29], and enhancing
processor reliability [30]. Of late, maximizing energy efficiency has gathered a lot of
attention with the emergence of power-hungry mobile devices with limited battery capacity.
In [22], the authors present a heuristic procedure aimed at minimizing the peak
temperature of a single core processor by sequencing tasks. Ref. [23] formulates a mixed-
integer linear program to determine the schedule of tasks to minimize peak temperature
while satisfying timing constraints. Ref. [24] describes a control-theoretic solution for
determining the transient core speeds to minimize peak temperature while satisfying task
deadlines.
The performance-per-Watt (PPW) metric has been studied by several researchers.
Architectural enhancements to improve PPW are described in [25]. Methods to reduce the
cost of data center operation through energy-efficient computing are presented in [27].
Ref. [28] shows how to minimize the power consumption of a homogeneous multi-core
processor under a throughput constraint through a hierarchical framework of core
consolidation, DVFS and task allocation. However, it neglects constraints on temperature
and the dependence of leakage power on temperature. All these play a significant role in
determining the energy-efficient operation.
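A small numerical sketch can show why PPW leads to a non-trivial optimization. It assumes that performance is proportional to the normalized speed s and that power is a constant term plus a cubic dynamic term, P = P_static + P_dyn_max * s^3. This power model and its parameter values are hypothetical and much simpler than the models used later in this work.

```python
def ppw(speed, p_static=10.0, p_dyn_max=40.0):
    """Performance per watt under an assumed model:
    performance ~ s, power = p_static + p_dyn_max * s^3."""
    return speed / (p_static + p_dyn_max * speed ** 3)


# Coarse grid search over normalized speeds 0.01, 0.02, ..., 1.00.
candidates = [i / 100 for i in range(1, 101)]
best_speed = max(candidates, key=ppw)

print(f"best speed: {best_speed:.2f}, PPW there: {ppw(best_speed):.4f}")
print(f"PPW at full speed: {ppw(1.0):.4f}")
```

Under these assumptions the PPW peaks at an intermediate speed rather than at the maximum: running faster raises power more quickly than performance, while running slower lets the constant power term dominate.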
Control mechanisms
Of all the control mechanisms that exist for DEM, DVFS is the most common method used
for controlling processors to achieve a given objective. In the early days, voltage control
was not widely available on processors, and hence most of the literature deals mainly with
dynamic frequency control (DFS). DFS also includes techniques like clock gating, fetch
throttling, and any technique that throttles the execution rate in a linear fashion, similar
to clock frequency, resulting in a linear reduction in power consumption. DVFS provides a
cubic reduction in power consumption for a linear reduction in performance, the highest
among all control mechanisms. Also, DVFS has the fastest response, and hence the lowest
performance overhead penalty, thus making it an attractive control mechanism.
Methods for optimal continuous and discrete DVFS for single-core processors are
described in [20] and [31], respectively. Similar problems employing DFS (fixed voltage)
for multiple cores are considered in [5, 18]. Ref. [18] formulates the problem as a
constrained convex optimization problem. Due to the complexity of the formulation, the
solution presented is feasible only for off-line computations. Ref. [5] provides an
analytical control-theoretic solution for single-core processors, and derives a speed
control policy that maximizes performance. More recently, Ref. [17] addresses the problem
of minimizing the makespan (latest completion time) of a set of tasks running on a
multi-core processor using DVFS and task migration. The optimal zero-slack control policy
derived in [5] for single-core processors was extended to multi-cores, and based on this
control policy, an efficient online solution for minimizing the makespan was presented.
The next most popular control mechanism is task migration, also called task-to-core
allocation or task scheduling. This technique emerged with the introduction of multicores,
as there is no migration involved in single-core processors. Task scheduling is mainly
used to reduce the temperature gradient across the die of a processor. Reducing the
temperature gradient achieves the dual purpose of enhancing reliability and improving the
performance of processors. Unlike DVFS, task scheduling has less impact on the power
reduction of a processor, but can improve its performance. However, a badly designed task
scheduler can severely degrade performance due to the high overhead of context switching.
A comprehensive summary of thread migration techniques is provided in [32, 33].
The authors of [34] proposed a scheme called heat-and-run, which moves threads from
overheated SMT cores to cooler cores to maximize performance. While this technique
works when the number of tasks is less than the number of cores and for processors that
have a temperature slack, it may not be optimal for high-performance processors where
most of the cores operate close to the thermal maximum. In [9], various thermal
management techniques are studied, including OS-based migration controllers. They make
use of multi-loop control, wherein thread migration forms the outer loop and DVFS forms
the inner loop. The authors of [35] describe the implementation of a load balancing
procedure within Linux that is aimed at reducing the temperature gradients among cores in
an MPSoC. Ref. [36] presents an integer linear programming formulation for task allocation
in multicore processors for several objective functions, including minimizing temperature
gradients, balancing energy consumption and minimizing total energy among the cores.
Previous studies on task migration methods [32–34] used detailed cycle accurate
simulators to determine the migration policies. The simulation times can be as large as tens
of hours. These are not suitable for thread assignment in real time. In [17, 37], an optimal
algorithm combined with DVFS is proposed to perform online task migration, which
was shown to improve performance by over 20% compared with the existing techniques.
Fan control is not very common among DEM mechanisms, partly because it has
less impact on both power reduction and performance. However, an extension of fan
control in data centers, i.e. controlling the cooling temperature, is very important, as
data centers spend nearly 50% of their energy on cooling alone. The work in [29] accounts
for the fan speed and aims to maximize PPW by controlling clock and fan speed. However,
the proposed method is applicable only to single-core processors, uses a simplified
thermal model, and relies on the α-power law model for voltage scaling, which neglects
the effect of temperature on delay.
Table 1.1: List of processor thermal and power models
Method                       Example tools                           Temporal resolution                    Power/thermal?
Circuit-level simulation     SPICE, Synopsys Power Compiler          Fraction of clock cycle (1 ps - 5 ns)  Power
Cycle-accurate simulation    Wattch [39], HotSpot [38]               Clock cycle (0.3 ns - 5 ns)            Both power and thermal
Finite-element method (FEM)  Ansys, Icepak [40]                      100 µs                                 Thermal
Compact thermal model        HotSpot [38], Integral Transforms [41]  10 µs                                  Thermal
Modeling and prediction
Table 1.1 summarizes the methods used to obtain estimates of the power and thermal
behavior of processors. The methods differ in accuracy, which in turn determines the
model computation time. For example, finite-element methods (FEM) provide the most
accurate thermal models, but may take several hours to produce them. Since this work
deals with high-level algorithms for DPTM, and given that the smallest thermal time
constant in a processor is of the order of a few milliseconds, compact thermal models
(CTM) are more than sufficient to accurately model processor thermal behavior. The CTM
that is most often used is the HotSpot thermal model [38], which provides reasonably
accurate analytical linear models for computing temperature from a power consumption
trace and drastically reduces the computation time.
Most of the recent literature uses mainly CTMs, as they are easier to model and use
in optimizations, and have also been verified more thoroughly than other modeling
techniques. However, owing to the difficulty of incorporating time-dependent thermal
constraints, much of the existing work on DTM has resorted to the use of simpler thermal
models [3, 16, 18–20]. Some of these approximations include: (i) the use of a simple
lumped thermal RC model [19, 20, 42], which ignores the spatial thermal distribution and
the differences between the die and the package thermal time constants; (ii) neglecting
the dependence of leakage on temperature [18–20] (at high temperatures, leakage power can
increase power consumption ten-fold); (iii) underrating the importance of voltage
scaling [16, 18, 43] (DVFS provides cubic power reduction). These assumptions may severely
underestimate the throughput of processors.
Implementation methods
The design of DTM controllers for single-core processors is often straightforward and
usually involves a simple control implementation such as a PID controller [44, 45];
however, there are no standard techniques available for multi-core DTM implementations.
The reason is that a multi-core processor is basically a multi-input-multi-output (MIMO)
device; MIMO controllers are inherently complicated to design, and assuring the stability
and robustness of those controllers is very challenging.
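To make the single-core case concrete, the sketch below closes a PID loop around a lumped thermal RC model and regulates the core temperature to a set-point by adjusting the normalized clock speed. It is a generic textbook PID with conditional integration as anti-windup, not the controller of [44, 45]; the plant model, the gains and all parameter values are assumptions chosen only so that the loop is well damped.

```python
class PID:
    """Textbook PID with output clamping and conditional integration (anti-windup)."""

    def __init__(self, kp, ki, kd, out_min=0.0, out_max=1.0):
        self.kp, self.ki, self.kd = kp, ki, kd
        self.out_min, self.out_max = out_min, out_max
        self.integral = 0.0
        self.prev_error = None

    def update(self, error, dt):
        derivative = 0.0 if self.prev_error is None else (error - self.prev_error) / dt
        self.prev_error = error
        trial_integral = self.integral + error * dt
        output = self.kp * error + self.ki * trial_integral + self.kd * derivative
        if self.out_min <= output <= self.out_max:
            self.integral = trial_integral  # integrate only while unsaturated
            return output
        return min(max(output, self.out_min), self.out_max)


def simulate(t_set=80.0, t_amb=40.0, r_th=0.5, c_th=20.0, p_max=100.0,
             dt=0.01, duration=60.0):
    """Forward-Euler simulation of a hypothetical lumped RC core under PID speed control.

    Power is taken as linear in speed (frequency scaling only, an assumption).
    Returns a list of (temperature in deg C, normalized speed) samples.
    """
    pid = PID(kp=0.2, ki=0.05, kd=0.02)
    temp = t_amb
    history = []
    for _ in range(int(duration / dt)):
        speed = pid.update(t_set - temp, dt)
        power = p_max * speed
        temp += dt * ((t_amb - temp) / (r_th * c_th) + power / c_th)
        history.append((temp, speed))
    return history


history = simulate()
final_temp, final_speed = history[-1]
print(f"final temperature: {final_temp:.2f} C, final speed: {final_speed:.3f}")
```

One scalar error drives one scalar control here, which is what makes the tuning simple. With several cores, each speed affects every temperature through lateral heat flow, and independent loops of this kind no longer come with stability or optimality guarantees.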
Several new controller designs have been proposed in recent years for multi-core
DTM [46–49]. These controllers can be broadly classified, based on their approach, as
statistical or control-theoretic. Jung and Pedram [47] propose a statistical technique
based on partially observable Markov chains to predict an optimal processor frequency
setting. While their approach avoids the need for a priori processor power or thermal
models, their algorithm requires complex computations, and the number of DVFS state
searches grows exponentially with the number of processor cores. Moreover, such
statistical approaches are not practical for online DTM.
The control-theoretic techniques [46, 48, 49] for multi-core processors usually involve
a model-predictive controller (MPC) that determines optimal control states for one or
more control time steps in the future by solving a constrained optimization problem. Some
of the limitations of the above works are: (i) ignoring workload characteristics; (ii)
using power and thermal models that are either too simplistic, leading to sub-optimal
controller actions, or of higher order than necessary, which increases computational
complexity; (iii) neglecting the leakage power dependence on temperature; (iv) lack of
flexibility in handling objective functions and constraints of various kinds, e.g. a highly
non-linear objective function like performance/Watt, where the control variables are
present in both the numerator and the denominator; (v) high computational complexity of
determining the controller action, which increases exponentially with the number of cores.
Bartolini et al. [46] addressed the last limitation by developing a distributed system,
where each core or group of cores determines the control action for its own group, with
minimal communication with other cores. However, their policy is mainly a heuristic and
guarantees neither the optimality of the controller action nor bounds on its optimality.
By exploiting the workload characteristics, Isci et al. [50] were able to improve en-
ergy efficiency by applying simple heuristics to select an optimal frequency setting based
on the degree of memory-boundedness of workloads. Similarly, Dhiman et al. [51] pro-
posed an online learning model to take into account IPC and cache misses for determining
an optimal frequency setting. While both techniques consider workload characteristics,
the simple heuristics can be significantly improved with detailed power and temperature
models. To do so, Cochran et al. [52] also took advantage of the DVFS controls available
in most modern processors to optimize for performance and energy efficiency by comput-
ing an energy model offline, which is used to compute the necessary DVFS states during
the online phase. However, since there is no feedback mechanism and no accounting of
processor temperatures and leakage power, it is possible that the system can quickly
deviate from optimality.
1.4 Research Contribution and Thesis Outline
In this dissertation, a unified framework for DEM in heterogeneous multicore processors
is developed, which can address various objectives of maximum performance, maximum
energy efficiency, and minimum power consumption, subject to constraints on task
deadlines, maximum temperature, minimum reliability, and circuit delay. Computationally
efficient solutions are also presented. Finally, an architecture-independent closed-loop
controller for practical implementation of the proposed techniques is derived. The
following outlines the topics covered in this dissertation (see also Figure 1.3).
• Chapter 2 presents the system models used in building the unified framework, namely
accurate power [53], thermal [54], voltage-speed-temperature [53], processor memory
reliability, and fan [29] models, which enable one to optimize a processor as one
complete system for various objectives and constraints. Chapter 2 also addresses the
circular dependency between leakage power and temperature, which complicates solution
exploration in computationally hard non-linear DEM problems, through piece-wise
linearization of the leakage power dependence on temperature and voltage.
• The formulation and solution of performance-optimal DVFS computation are presented in
Chapter 3. Minimizing makespan is considered as the performance measure in the case
of multicore processors. The solution to minimizing makespan is provided by the
“zero-slack” policy, which states that either the speed of a core has to be at the
maximum or the temperature of the core has to be at the maximum.
The implementation of the above zero-slack policy is possible through convex
optimization. The chapter also provides proofs of the convexity of temperature w.r.t.
clock speed and supply voltage, and of the convexity of supply voltage w.r.t. clock speed
for a given temperature. The convexity ensures the existence of a unique optimal solution
and also helps in a fast solution search.
• Chapter 4 extends the above performance optimization by including a task-to-core
allocation mechanism. Task-to-core allocation is originally a complex mixed-integer
non-linear program. Using novel observations, the problem is simplified to a
linear assignment problem, which is of polynomial complexity.
• Chapter 5 explores minimizing makespan subject to task deadlines, which is a harder
problem compared to minimizing makespan without deadline constraints. The solu-
tion process consists of two steps: In the first step, the optimal solution for makespan
without deadlines is derived and in the second step, this solution is modified to satisfy
the task deadlines.
• A similar problem to the above is minimizing peak temperature while satisfying both
start and end times. Since the start times are an additional constraint, the problem is
solved as a quasiconvex optimization, where task completion times are quasiconvex
functions of core speeds. Chapter 6 discusses the solution to the above problem.
• Chapter 7 introduces the concept of energy efficiency, which is measured using the
performance/Watt (PPW) metric. This metric is a highly nonlinear function. By proving
that the objective is a quasiconcave function of core frequencies, an optimal solution
is derived which maximizes the overall energy efficiency of a processor as measured
by the PPW.
• Chapter 8 explores the idea of reducing power consumption of memory by optimally
controlling memory voltage, while ensuring bit-error rate (BER) and temperature
constraints are not violated. The problem is formulated as a quasiconvex optimization
problem. The problem also includes scaling frequency in proportion to the bandwidth
demand of memory.
• In the final chapter, Chapter 9, a robust, architecture-independent closed-loop
controller that is capable of handling various objectives and constraints is proposed. In
addition to the controller, the chapter describes methods to derive power and thermal models.
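The linear assignment form mentioned for Chapter 4 can be illustrated on a toy instance. The sketch below is not the algorithm of Chapter 4: the cost matrix is hypothetical, and the exhaustive search over permutations is only practical for a handful of cores. Polynomial-time methods such as the Hungarian algorithm solve the same problem for larger instances.

```python
from itertools import permutations


def best_assignment(cost):
    """Exhaustively search one-to-one core-to-task assignments for the minimum total cost.

    cost[core][task] is the (hypothetical) cost of running `task` on `core`.
    Returns (assignment, total), where assignment[core] is the task given to that core.
    """
    n = len(cost)
    best_perm, best_total = None, float("inf")
    for perm in permutations(range(n)):
        total = sum(cost[core][task] for core, task in enumerate(perm))
        if total < best_total:
            best_perm, best_total = perm, total
    return best_perm, best_total


# Hypothetical 3-core, 3-task cost matrix (rows: cores, columns: tasks).
cost = [
    [9, 2, 7],
    [6, 4, 3],
    [5, 8, 1],
]
assignment, total = best_assignment(cost)
print(f"core -> task: {assignment}, total cost: {total}")
```

The key property of a linear assignment problem is that the objective is a sum of independent per-pair costs, which is what allows polynomial-time solution instead of a search over all n! permutations.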
Figure 1.3: Outline of topics covered in this dissertation. (Flow diagram: full thermal
model (Chapter 2.1) and reduced thermal model (Chapter 2.2); performance optimal DVFS,
zero-slack policy (Chapter 3); performance optimal migration, linear assignment problem
(Chapter 4); DVFS to meet deadlines, two step solution (Chapter 5); minimize peak
temperature, quasiconvex optimization of package temperature (Chapter 6); maximize
energy-efficiency, quasiconcave optimization, fan control incl. (Chapter 7); minimize
power s.t. BER constraint, near-memory architecture, quasiconcave solution (Chapter 8);
closed-loop controller, flexible objective, architecture independent, no a priori models
needed (Chapter 9).)
Chapter 2
Processor System Models
2.1 Performance Models
Consider a multicore processor with n cores with q tasks in the queue. Each core is capable
of executing a single task and each task is assumed to execute independently of other tasks,
which means that there is no inter-task communication. This assumption is certainly valid
for a (large) class of stream processors, where each core is designed to be simple enough
to run one task at a time and requires every thread to have independent execution [55].
The processor core frequencies/speeds s and the voltages v are assumed to be continuous
functions of time, normalized over [0,1]. The instructions per cycle of a task j are denoted
by $\mathrm{IPC}_j$. Note that the symbols in bold denote vectors or matrices, and all vectors are
considered as column vectors.
The throughput of a multicore processor is defined by
\[ S = \sum_{c=1}^{n} w_c s_c, \]
where $w_c$ is the weight associated with the task executing on core c. This weight can be
the priority given to a user task or just the IPC of the task, in which case the throughput
is given in instructions per second.
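The throughput definition can be evaluated directly, as in the following sketch. The IPC values and core speeds are made up for illustration, and because the speeds are normalized to [0, 1], the result would have to be multiplied by the maximum clock frequency to obtain instructions per second.

```python
def throughput(weights, speeds):
    """Weighted throughput S = sum over cores c of w_c * s_c."""
    if len(weights) != len(speeds):
        raise ValueError("one weight and one speed are needed per core")
    return sum(w * s for w, s in zip(weights, speeds))


# Hypothetical four-core example: weights are the IPCs of the running tasks,
# speeds are normalized clock frequencies.
ipc = [1.5, 0.8, 2.0, 1.2]
speeds = [1.0, 0.5, 0.75, 1.0]
total = throughput(ipc, speeds)
print(f"weighted throughput: {total:.2f}")
```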
2.2 Power and Thermal Models
Thermo-electrical models
Figure 2.1 depicts the well-known thermal-electrical analogy, in which heat spreading
and heat storage are modeled with resistors and capacitors, respectively. Consider a slab
of thickness t and area A, which is heated on one side with a uniform heat flux P/A (total
power P), as shown in the figure. Then the temperature difference across the sides of the
slab due to heat conduction is given by $\Delta T = Pt/(kA)$, where k is the thermal
conductivity of the slab. This equation is similar to Ohm's law for electrical conduction,
$V = IR$. Thus voltage and current are analogous to temperature difference and power,
respectively, and the thermal resistance is given by $R_{th} = t/(kA)$.
Figure 2.1: Electrical analogy for thermal conduction [2]
Further, the energy stored in the slab is given by $E = \int P\,d\tau = (\rho A t c)\,\Delta T$,
where $\tau$ denotes time, and $\rho$ and c denote the density and the specific heat capacity
of the material of the slab, respectively. The above relation is similar to the electrical
charge storage equation $Q = \int I\,d\tau = CV$. By this analogy, the thermal capacitance is
given by $C_{th} = \rho A t c$.
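The two expressions for thermal resistance and capacitance can be evaluated for a die-sized slab, as sketched below. The die dimensions are hypothetical, and the silicon material constants are approximate room-temperature values assumed for illustration, not values taken from this work.

```python
def thermal_resistance(thickness_m, conductivity_w_per_mk, area_m2):
    """R_th = t / (k A), in K/W."""
    return thickness_m / (conductivity_w_per_mk * area_m2)


def thermal_capacitance(density_kg_per_m3, area_m2, thickness_m, specific_heat_j_per_kgk):
    """C_th = rho A t c, in J/K."""
    return density_kg_per_m3 * area_m2 * thickness_m * specific_heat_j_per_kgk


# Hypothetical 10 mm x 10 mm silicon die, 0.5 mm thick.
area = 0.01 * 0.01        # m^2
thickness = 0.5e-3        # m
k_si = 150.0              # W/(m K), approximate, assumed
rho_si = 2330.0           # kg/m^3, approximate, assumed
c_si = 700.0              # J/(kg K), approximate, assumed

r_th = thermal_resistance(thickness, k_si, area)
c_th = thermal_capacitance(rho_si, area, thickness, c_si)
tau = r_th * c_th         # thermal time constant of the slab, in seconds

print(f"R_th = {r_th:.4f} K/W, C_th = {c_th:.4f} J/K, tau = {tau * 1e3:.2f} ms")
```

With these assumed values the time constant R_th C_th comes out at a few milliseconds, the same order of magnitude as the die thermal time constants mentioned in Chapter 1.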
A processor consists of a silicon die, which contains all the circuitry required
for computation. Heat is generated by the power flow between the various components of
this circuitry, as circuits can basically be viewed as a network of electrical resistors
and capacitors. This heat needs to be dissipated to the ambient environment to prevent
a buildup of heat that would destroy the silicon die. To this end, the silicon die is
packaged with various materials that aid faster heat dissipation. The package consists of
a thermal interface material with high thermal conductivity, followed by a heat spreader
layer for uniform heat distribution. The last layer is the heat sink, which is
manufactured from metals with very high thermal conductivity, such as copper or aluminum,
and is designed to be large so as to have a high capacity to store heat, which reduces
spikes in temperature. Heat removal is usually enhanced further by air or liquid cooling.
HotSpot Thermal Model
The HotSpot thermal model [54] is a widely used compact thermal model (CTM) for
characterizing the thermal behavior of processors. It is based on the thermal-electrical
analogy described in the previous section. The granularity of the thermal model used in
this work is at the level of functional blocks. Figure 2.2 shows the HotSpot thermal model
for a typical four-core processor.
Figure 2.2: HotSpot-4 thermal model for a four core processor. (RC network spanning the
Chip, TIM, Heat Spreader and Heat Sink layers, with Cooling to Ambient; l = die + TIM
blocks = 8.)
Each core is divided into m thermal blocks on the die and the thermal interface material
(TIM) layers. The package, which includes the heat spreader and the heat sink, is modeled
with 5 and 9 thermal blocks, respectively. Together, the total number of thermal blocks
for an n-core processor is N = 2nm + 14.
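The block count can be checked with a short helper that also lists the index range of each group of blocks. The ordering used here (die blocks of all cores, then TIM blocks of all cores, then heat spreader, then heat sink) follows the layout of the P and T vectors given later in this section; the function name and the dictionary keys are illustrative.

```python
def thermal_block_layout(n_cores, m_blocks, spreader_blocks=5, sink_blocks=9):
    """Return (N, layout) for an n-core processor with m blocks per core.

    layout maps a group name to its 1-based, inclusive (first, last) block indices.
    """
    layout = {}
    start = 1
    for layer in ("die", "tim"):
        for core in range(1, n_cores + 1):
            layout[f"{layer}:core{core}"] = (start, start + m_blocks - 1)
            start += m_blocks
    layout["spreader"] = (start, start + spreader_blocks - 1)
    start += spreader_blocks
    layout["heat_sink"] = (start, start + sink_blocks - 1)
    total = start + sink_blocks - 1
    return total, layout


total, layout = thermal_block_layout(n_cores=4, m_blocks=20)
print(f"N = {total}")
for name, (first, last) in layout.items():
    print(f"{name}: {first}-{last}")
```

For n = 4 and m = 20, the helper returns N = 174, in agreement with N = 2nm + 14.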
Power dissipation of each block on the die depends on the speed $s_c$ and the voltage
$v_c$ of the core c to which the block belongs. It also depends on the tasks that are
executed on these cores. Given this, the thermal model can be expressed using state-space
models [44] as:
\[ \frac{d\mathbf{T}(t)}{dt} = -\mathbf{C}^{-1}\mathbf{G}\,\mathbf{T}(t) + \mathbf{C}^{-1}\mathbf{P}(\mathbf{s},\mathbf{v},\mathbf{T},t). \tag{2.1} \]
G and C are N×N conductance and capacitance (diagonal) matrices, respectively. The
elements of the conductance matrix represent heat spreading between a pair of nodes, and
the capacitance matrix elements denote the heat storage of an individual node.
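By substituting
\[ \mathbf{B} = \mathbf{C}^{-1}, \quad \mathbf{A} = -\mathbf{B}\mathbf{G}, \tag{2.2} \]
Equation (2.1) can be re-written as
\[ \frac{d\mathbf{T}(t)}{dt} = \mathbf{A}\mathbf{T}(t) + \mathbf{B}\mathbf{P}(\mathbf{s},\mathbf{v},\mathbf{T},t) \tag{2.3} \]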
where T and P are temperature and power vectors of dimension N×1, respectively. Since
only the die units of the chip generate heat, only the first nm units of P are non-zero.
The dimensions of s and v are n×1, where n is the number of cores. A and B are constant
matrices of size N×N, which makes the thermal system a linear time-invariant system.
An example of the positions of the thermal blocks in the P and T vectors for a 4-core
processor with m = 20 blocks in each core is shown below:
\[
\mathbf{P} \text{ or } \mathbf{T} =
\begin{bmatrix}
\text{Die: Core 1 } (1\text{–}20) \\
\vdots \\
\text{Die: Core 4 } (61\text{–}80) \\
\text{TIM: Core 1 } (81\text{–}100) \\
\vdots \\
\text{TIM: Core 4 } (141\text{–}160) \\
\text{Spreader } (161\text{–}165) \\
\text{Package } (166\text{–}174)
\end{bmatrix}
\]
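The layout above can be reproduced with a short sketch. The function names and the label strings are illustrative choices, not part of HotSpot; only the block counts (m blocks per core on the die and TIM layers, 5 spreader blocks, 9 heat-sink blocks) come from the text.

```python
def block_ranges(n_cores, m_blocks, spreader_blocks=5, sink_blocks=9):
    """Return 1-based (label, first, last) index ranges in the P and T vectors."""
    ranges = []
    start = 1
    # Die blocks of all cores come first, followed by the TIM blocks of all cores.
    for layer in ("Die", "TIM"):
        for core in range(1, n_cores + 1):
            ranges.append((f"{layer}: Core {core}", start, start + m_blocks - 1))
            start += m_blocks
    ranges.append(("Spreader", start, start + spreader_blocks - 1))
    start += spreader_blocks
    ranges.append(("Package", start, start + sink_blocks - 1))
    return ranges


def total_blocks(n_cores, m_blocks):
    """N = 2nm + 14 (5 spreader blocks plus 9 heat-sink blocks)."""
    return 2 * n_cores * m_blocks + 14
```

Calling `block_ranges(4, 20)` yields the eight labelled ranges of the 4-core example plus the intermediate cores, and `total_blocks(4, 20)` gives the vector length.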
P represents the total power, which is the sum of the dynamic power Pdyn and the
leakage power Plkg. The dynamic power varies linearly with the clock frequency, as a
circuit is operated only when the clock is high, and quadratically with the voltage, as the
power of a transistor is the product of its current and voltage, and the current of a
transistor is itself a function of voltage. The components of the dynamic power vector are
expressed as
\[ P_{dyn,c,b}(t) = P^{max}_{dyn,c,b}(t)\, s_c(t)\, v_c^2(t), \quad \forall c, b, t \tag{2.4} \]
where P^{max}_{dyn,c,b} is the dynamic power dissipated by block b of core c when the core is at
the maximum speed and voltage. P^{max}_{dyn,c,b} is obtained by profiling the time-varying power
consumption of the task to be run on core c.
The leakage power is known to have an exponential dependence on the die temperature
and the supply voltage. The exact equation is hard to derive analytically. Hence it is usually
obtained by fitting simulated power values for various circuits like adders, multipliers,
memories, etc. An example empirical equation for leakage power in 65 nm technology is
given in [53]:
\[ P_{lkg,c,b}(t) = k^1_{c,b}\, v_c(t)\, T_{c,b}^2(t)\, e^{\frac{\alpha_{c,b} v_c(t) + \beta_{c,b}}{T_{c,b}(t)}} + k^2_{c,b}\, e^{(\gamma_{c,b} v_c(t) + \delta_{c,b})}, \quad \forall c, b, t. \tag{2.5} \]
k^1_{c,b}, k^2_{c,b}, α_{c,b}, β_{c,b}, γ_{c,b} and δ_{c,b} are parameters that depend on circuit
topology, size, technology and design. Note that in the rest of this work, the above equation
is used as a representative leakage power model, as it is well established in the literature.
The model has been verified to have an error of less than 0.7% for logic circuits and less
than 3.6% for memory power consumption. However, this does not imply that the derived
algorithms or solutions will not work for different power models. This is because the
solutions derived are based on the convexity property of the above equation, i.e., power
consumption is a convex function of processor speed, voltage and temperature. This
property is not specific to a processor or a technology, but follows from the physical
properties used in the construction of a processor.
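As a minimal sketch of how (2.4) and (2.5) are evaluated for one block, the functions below implement the two expressions directly. The function names are illustrative, the temperature is assumed to be in kelvin, and any parameter values passed in are placeholders rather than fitted values from [53].

```python
import math


def dynamic_power(p_max_dyn, s, v):
    """Eq. (2.4): P_dyn = P_max_dyn * s * v^2, with s and v normalized to [0, 1]."""
    return p_max_dyn * s * v ** 2


def leakage_power(v, temp_k, k1, k2, alpha, beta, gamma, delta):
    """Eq. (2.5): k1*v*T^2*exp((alpha*v + beta)/T) + k2*exp(gamma*v + delta)."""
    return (k1 * v * temp_k ** 2 * math.exp((alpha * v + beta) / temp_k)
            + k2 * math.exp(gamma * v + delta))


def total_power(p_max_dyn, s, v, temp_k, **leak_params):
    """Total block power: dynamic part plus temperature-dependent leakage."""
    return dynamic_power(p_max_dyn, s, v) + leakage_power(v, temp_k, **leak_params)
```

With, for example, the hypothetical parameters `k1=2e-4, k2=0.1, alpha=500.0, beta=-3000.0, gamma=1.0, delta=-2.0`, the leakage term grows with both temperature and voltage, which is the qualitative behaviour the text relies on.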
The non-linear leakage power dependence on temperature and voltage (LDTV), as
well as the cyclic dependency between the leakage power and the temperature, complicate
the analysis, as the temperature that needs to be computed appears on both the left-hand
and the right-hand sides of (2.3), and the two are non-separable. Without decoupling the
cyclic dependency, the only way to determine the temperature is to iterate on the
temperature values until they converge, which is not practical. Without any further
simplification, one can only resort to numerical solutions for general non-linear analysis.
To make any further progress and to develop computationally efficient solutions, this
relation needs to be approximated by linear models.
Piece-wise Linear Approximation (PWL) to LDTV
The leakage power described in (2.5) can be represented as a three dimensional surface with
voltage and temperature axes as shown in Figure 2.3. This surface can be linearized w.r.t.
temperature and voltage to any desired accuracy. The leakage power for an approximated
linear section is expressed by the following equation:
\[ P_{lkg,c,b}(t) = P_{lkg0,c,b} + G^{T}_{c,b}\, T_{c,b}(t) + k^{v}_{c,b}\, v_c(t), \quad \forall c, b, t. \tag{2.6} \]
G^T_{c,b} and k^v_{c,b} represent the temperature and the voltage coefficients. P_{lkg0,c,b}
represents the leakage power for block b in core c corresponding to T_{c,b} = 0 and vc = 0.
Note that T_{c,b} = 0 and vc = 0 refer to the ambient temperature and the minimum voltage.
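One simple way to obtain the coefficients of a single linear section is to pass a plane through three corners of a voltage-temperature patch. The sketch below does this for a stand-in leakage function with the form of (2.5); the function `leakage`, its parameter values, and the three-corner fit are assumptions made for illustration, not the fitting procedure used in this work. It also uses absolute voltage and temperature rather than the offset coordinates described above.

```python
import math


def leakage(v, t):
    # Hypothetical stand-in with the form of (2.5); values are illustrative only.
    return 2e-4 * v * t ** 2 * math.exp((500.0 * v - 3000.0) / t) + 0.1 * math.exp(v - 2.0)


def linearize_patch(f, v0, v1, t0, t1):
    """Fit P ~ p0 + g_t*T + k_v*v through three corners of [v0, v1] x [t0, t1]."""
    base = f(v0, t0)
    g_t = (f(v0, t1) - base) / (t1 - t0)   # temperature coefficient G^T
    k_v = (f(v1, t0) - base) / (v1 - v0)   # voltage coefficient k^v
    p0 = base - g_t * t0 - k_v * v0        # constant term P_lkg0
    return p0, g_t, k_v
```

Splitting the operating range into smaller patches and calling `linearize_patch` on each gives the piecewise-linear surface; narrower patches reduce the approximation error.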
Let Y be defined as
\[
Y(i,j) =
\begin{cases}
1, & \text{if block } i \text{ belongs to core } j \text{ and } i \le nm \text{ (total no. of blocks in the die)}, \\
0, & \text{otherwise.}
\end{cases}
\]
Figure 2.3: Piecewise-linear approximation of leakage power (axes: voltage (V),
temperature (°C), leakage power (W)).
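Then the dynamic and the leakage power vectors are given by
\[ \mathbf{P}_{dyn}(\mathbf{s},\mathbf{v},t) = \mathrm{diag}(\mathbf{P}^{max}_{dyn}(t))\, \mathbf{Y}\, \mathrm{diag}(\mathbf{v}(t))^2\, \mathbf{s}(t), \tag{2.7} \]
\[ \mathbf{P}_{lkg}(\mathbf{v},\mathbf{T},t) = \mathbf{P}_{lkg0} + \mathbf{G}_T \mathbf{T}(t) + \mathbf{P}_{lkg,v}(\mathbf{v},t) \tag{2.8} \]
where P_{lkg,v}(v,t) = diag(k^v) Y v(t). G_T is the diagonal matrix of the coefficients
G^T_{c,b}. diag creates a diagonal matrix from a vector.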
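Substituting (2.7) and (2.8) in (2.3),
\[ \frac{d\mathbf{T}(t)}{dt} = \hat{\mathbf{A}}\mathbf{T}(t) + \mathbf{B}\hat{\mathbf{P}}(t), \tag{2.9} \]
where
\[ \hat{\mathbf{A}} = \mathbf{A} + \mathbf{B}\mathbf{G}_T, \tag{2.10} \]
\[ \hat{\mathbf{P}}(\mathbf{s},\mathbf{v},t) = \mathbf{P}_{dyn}(\mathbf{s},\mathbf{v},t) + \mathbf{P}_{lkg,v}(\mathbf{v},t) + \mathbf{P}_{lkg0}, \tag{2.11} \]
\[ \mathbf{P}(\mathbf{s},\mathbf{v},\mathbf{T},t) = \hat{\mathbf{P}}(\mathbf{s},\mathbf{v},t) + \mathbf{G}_T \mathbf{T}(t). \tag{2.12} \]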
Thus the cyclical dependency between the temperature and the power in (2.3) is removed
with the use of PWL. Note that B and A can also be represented in terms of the conductance
G and capacitance C matrices as follows:
\[ \mathbf{B} = \mathbf{C}^{-1}, \qquad \mathbf{A} = -\mathbf{C}^{-1}\mathbf{G}. \tag{2.13} \]
Figure 2.4: Reduced multi-core thermal model (labels: per-block Pi, Ti, Ri for cores
1 to n; package Tp, Rp, Cp; internal ambient).
The conductance matrix element Gij denotes the conductance connecting functional units i
and j, while Gii denotes the vertical conductance between functional unit i and the layer
below. On the other hand, C is a diagonal matrix of the capacitances of the functional units.
Although the above models are sufficient in most cases, especially for online scheduling,
there is a need for simpler thermal models in some special cases, e.g. DEP involving
task deadlines. This need is mainly due to the requirement to compute analytically the
number of instructions completed at any time. This model, called the reduced thermal
model, reduces the order of the system to a first-order system without any significant loss
in accuracy. The model is described in detail in the next section.
Reduced Thermal Model
The reduced thermal model is shown in Figure 2.4. In the figure, Rc,b denotes the sum of
vertical resistances from the die to the package, while Rp and Cp denote the package
resistance and capacitance, respectively. This model was obtained after analyzing the
thermal model for an Alpha 21264 processor. The following observations [4, 5] allowed us to
make useful and practical simplifications to the HotSpot model presented in Section 2.2:
1. The lateral resistances between the functional blocks on the die and the TIM blocks
are at least 4–6 times higher than the corresponding vertical resistances. On the Alpha
21264, the maximum die vertical resistance is 2 Ω, while the smallest die lateral
resistance is 12 Ω. Thus most of the heat flows through the vertical resistances, and hence
the lateral resistances can be ignored. Note that this does not eliminate thermal hotspots,
as they are created by differences in the power densities among functional blocks.
2. The thermal time constant of the package is three orders of magnitude (≈ 49 s for
Alpha 21264) higher than the thermal time constant of the die (max. 20 ms for Alpha
21264). The representative tasks considered in this work have execution times that
are much longer than the die thermal time constant, and are comparable to the package
thermal time constant. Consequently, this saturates the die thermal capacitances,
and hence these capacitances can be ignored.
It was shown in [56] that the error in accuracy from using this model instead of the full
HotSpot model is less than 6%. Note that the above observations are not limited to a
particular processor; they hold for any silicon-based 2D processor.
The following section describes the method to compute the temperature of a func-
tional block analytically.
Computation of On-chip Temperatures
Since all the capacitances apart from that of the package are now ignored, (2.9) reduces to
a scalar differential equation, where the scalar state is the package temperature Tp. The
scalar differential equation is given below:
\[ \frac{dT_p(t)}{dt} = -\frac{T_p(t)}{R_p C_p} + \frac{1}{C_p} \sum_{c=1}^{n} \sum_{b=1}^{m} P_{c,b}(s_c, v_c, T_{c,b}, t). \tag{2.14} \]
The coefficients for the package temperature and the total power are computed using (2.13)
and substituting scalar values for G and C.
For computing the temperatures of the rest of the functional units, (2.13) is not
useful, as these units are not states, but depend directly on the package
temperature as shown in Figure 2.4. From the figure,
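\[ T_{c,b}(t) = T_p(t) + R_{c,b}\, P_{c,b}(s_c, v_c, t) \tag{2.15} \]
Substituting the value of Pc,b from (2.4) and (2.6),
\[ T_{c,b}(t) = T_p(t) + R_{c,b}\left[ P_{lkg0,c,b} + k^{v}_{c,b} v_c(t) + G^{T}_{c,b} T_{c,b}(t) + s_c(t) v_c^2(t) P^{max}_{dyn,c,b}(t) \right] \tag{2.16} \]
Let ζ_{c,b} = (1 − G^T_{c,b} R_{c,b})^{−1}; then collecting the terms in T_{c,b}, the above
equation can be written as
\[ T_{c,b}(t) = \zeta_{c,b} T_p(t) + \zeta_{c,b} R_{c,b}\left[ P_{lkg0,c,b} + k^{v}_{c,b} v_c(t) + s_c(t) v_c^2(t) P^{max}_{dyn,c,b}(t) \right] \tag{2.17} \]
Letting
\[ P'_{c,b}(s_c, v_c, t) = \zeta_{c,b}\left[ P_{lkg0,c,b} + k^{v}_{c,b} v_c(t) + s_c(t) v_c^2(t) P^{max}_{dyn,c,b}(t) \right], \tag{2.18} \]
the final transformation of Equation (2.15) is given by
\[ T_{c,b}(t) = \zeta_{c,b} T_p(t) + R_{c,b}\, P'_{c,b}(s_c, v_c, t) \tag{2.19} \]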
P'_{c,b}(sc,vc,t) is called the apparent power because it has the same form as the
power Pc,b and is used in its place in subsequent equations. Note that even though Pc,b
depends only on the package temperature Tp, it is still affected by the temperatures of all
cores, but only through the package.
The total power consumption is given by
\[
\begin{aligned}
P_{c,b}(s_c, v_c, t) &= P_{lkg,c,b}(v_c, T_{c,b}, t) + P_{dyn,c,b}(s_c, v_c, t) \\
&= P'_{c,b}(s_c, v_c, t) + (\zeta_{c,b} - 1)\, T_p(t) / R_{c,b}.
\end{aligned}
\tag{2.20}
\]
In the absence of LDT, G^T_{c,b} ≡ 0, ζ_{c,b} ≡ 1, and P'_{c,b} ≡ Pc,b.
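The closed form (2.19) can be checked against a direct fixed-point iteration of (2.16). In the sketch below the function names are illustrative, `p_base` stands for the temperature-independent part of the block power, and the iteration is assumed to converge, which requires G^T_{c,b} R_{c,b} < 1.

```python
def block_temperature(t_pkg, r_cb, g_t, p_base):
    """Eqs. (2.18)-(2.19): T = zeta*T_p + R*P', with P' = zeta*p_base.

    p_base is P_lkg0 + k_v*v + s*v^2*P_max_dyn for the block.
    """
    zeta = 1.0 / (1.0 - g_t * r_cb)
    p_apparent = zeta * p_base            # apparent power P'
    return zeta * t_pkg + r_cb * p_apparent


def block_temperature_iterative(t_pkg, r_cb, g_t, p_base, iters=200):
    """Fixed-point iteration of (2.16): T <- T_p + R*(p_base + g_t*T)."""
    t = t_pkg
    for _ in range(iters):
        t = t_pkg + r_cb * (p_base + g_t * t)
    return t
```

Both functions return the same temperature, but the closed form needs no iteration, which is the point of introducing ζ and the apparent power.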
Computation of Package Temperature
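Let
\[ P'(\mathbf{s},\mathbf{v},t) \triangleq \sum_{c=1}^{n} \sum_{b=1}^{m} P'_{c,b}(s_c, v_c, t), \qquad G \triangleq \sum_{c=1}^{n} \sum_{b=1}^{m} (\zeta_{c,b} - 1)/R_{c,b}. \]
Then substituting for Pc,b from (2.20) into (2.14),
\[
\begin{aligned}
\frac{dT_p(t)}{dt} &= -\frac{T_p(t)}{R_p C_p} + \frac{1}{C_p} \sum_{c=1}^{n} \sum_{b=1}^{m} \left[ P'_{c,b}(s_c, v_c, t) + \frac{(\zeta_{c,b} - 1)\, T_p(t)}{R_{c,b}} \right] \\
&= -\frac{T_p(t)}{R_p C_p} + \frac{P'(\mathbf{s},\mathbf{v},t) + G\, T_p(t)}{C_p} \\
&= -\frac{T_p(t)}{R'_p C_p} + \frac{P'(\mathbf{s},\mathbf{v},t)}{C_p},
\end{aligned}
\tag{2.21}
\]
where R'_p ≜ R_p/(1 − G R_p). Tp(t) can be obtained by solving the above first-order ODE.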
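For a constant apparent power over an interval, the first-order ODE (2.21) has the familiar exponential solution. The sketch below evaluates it and, for comparison, integrates the middle line of (2.21) with forward Euler. It assumes a constant P' and temperatures measured relative to ambient; the function names and the numerical values in the usage note are illustrative.

```python
import math


def package_temperature(t, tp0, p_app, r_p, c_p, g_sum=0.0):
    """Closed-form solution of (2.21) for a constant apparent power p_app."""
    r_eff = r_p / (1.0 - g_sum * r_p)     # R'_p
    tau = r_eff * c_p                     # effective package time constant
    steady = r_eff * p_app                # steady-state temperature rise
    return steady + (tp0 - steady) * math.exp(-t / tau)


def package_temperature_euler(t, tp0, p_app, r_p, c_p, g_sum=0.0, steps=20000):
    """Forward-Euler integration of dTp/dt = -Tp/(Rp*Cp) + (P' + G*Tp)/Cp."""
    dt = t / steps
    tp = tp0
    for _ in range(steps):
        tp += dt * (-tp / (r_p * c_p) + (p_app + g_sum * tp) / c_p)
    return tp
```

For instance, with `r_p=0.35` and `c_p=140.0` the uncorrected time constant is 49 s, in line with the package time constant quoted earlier for the Alpha 21264; the other values are placeholders.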
2.3 Effect of Temperature and Voltage on Delay
It is well known that temperature, supply and threshold voltages affect the delay of a
circuit. The following model [53] incorporates the above factors to express the maximum
frequency of operation s^max_c as a function of the core voltage and its temperature for a
65 nm technology.
\[ s^{max}_c(t) = k^{v}_c\, \frac{(v_c(t) - v_{th})^{1.2}}{v_c(t)\, \max_b (T_c(t))^{1.19}}, \tag{2.22} \]
where v_th is the threshold voltage and k^v_c is the constant of proportionality. The max in
(2.22) is taken over all the blocks in core c. Like the models discussed previously, the model
described above and the rest of this section are used only to understand their properties,
and not the actual values of the model parameters. The property that is interesting in these
models is the convexity of the output w.r.t. the input variables. In the model, voltage vc is
convex w.r.t. speed sc. The proof is provided in Appendix A.2.
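A direct evaluation of (2.22) can be sketched as follows. The function name is illustrative, the temperatures are assumed to be in kelvin, and the constant `k` is a placeholder for the proportionality constant of the core.

```python
def max_speed(v, block_temps, k, v_th):
    """Eq. (2.22): k * (v - v_th)^1.2 / (v * max_b(T)^1.19)."""
    if v <= v_th:
        return 0.0  # below the threshold voltage the circuit does not switch
    return k * (v - v_th) ** 1.2 / (v * max(block_temps) ** 1.19)
```

The hottest block of the core sets the limit: raising the voltage raises the attainable speed, while a hotter block lowers it.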
2.4 Fan Model
Active cooling, or forced convection cooling, increases the heat removal from processors.
The most commonly used methods are air cooling and liquid cooling. In this work,
the discussion is restricted to air cooling, but similar relationships hold for liquid
cooling as well. In air cooling, a fan blows air at a given speed over a heat sink or an
extended piece of metal connected to the heat sink. Two factors that affect the rate of heat
removal are the speed of the fan and the temperature difference between the heat sink and
the ambient air. The convection resistance Rconv of forced convection cooling is modeled
as [57]:
\[ R_{conv}(s_{fan}) = \left( \dot{m} c_p \left( 1 - e^{-\frac{h A_e}{\dot{m} c_p}} \right) \right)^{-1} \tag{2.23} \]
where
• \dot{m} = ρ s_fan A_c is the mass flow rate, where ρ is the density of the air, s_fan is the
velocity of the air (a function of the angular velocity and the geometry of the fan), and
A_c is the cross-sectional area of the air channel.
• cp is the specific heat capacity of the air.
• Ae is the effective area of the heat sink.
• h = kt Nu/Dh is the heat transfer coefficient of the heat sink, where kt is the thermal
conductivity of the heat sink, Dh is the hydraulic diameter of the air channel and
Nu is the Nusselt number. The Nusselt number is the ratio of convective to conductive heat
transfer and is a function of the Reynolds number Re = s_fan Dh/ν (the ratio of inertial to
viscous forces), where ν is the kinematic viscosity of the air.
Substituting the above quantities in (2.23) results in an empirical equation given
by [29]:
\[ R_{conv}(s_{fan}) = \left( h_1 s_{fan} \left( 1 - \exp\left( -\frac{h_2 s_{fan}^{h_3} + h_4}{h_1 s_{fan}} \right) \right) \right)^{-1}. \tag{2.24} \]
Since the cooling due to the fan is in addition to the normal heat dissipation by the
heat sink, the cooling resistance Rconv appears in parallel with the package resistances, i.e.
\[ \mathbf{G}_{pkg}(s_{fan}) = \mathbf{G}_{pkg} + R^{-1}_{conv}(s_{fan}). \tag{2.25} \]
Figure 2.5: Typical plot of convection resistance Rconv (°C/W) and fan power Pfan (W) with
the speed of a fan (rpm).
G is the thermal conductance matrix of the processor, where each element represents the
conductance connecting two thermal blocks. An example of G is shown in Figure 4.3.
Gpkg is the package thermal conductance matrix.
Thus s_fan changes the rate of heat dissipation by altering the package thermal resistance,
and thereby the temperatures of all other thermal blocks, as follows:
\[ \mathbf{A}(s_{fan}) = -\mathbf{B}\mathbf{G}(s_{fan}). \tag{2.26} \]
Fan power consumption Pfan is given by
\[ P_{fan}(s_{fan}) \propto s_{fan} P^{tot}_r, \tag{2.27} \]
where P^tot_r is the total pressure (static + dynamic) at the output. Since P^tot_r ∝ s_fan^2,
Pfan is given by [29]:
\[ P_{fan}(s_{fan}) = k_f s^3_{fan} \tag{2.28} \]
where kf is the constant of proportionality. Figure 2.5 shows the non-linear relationship
of the convection resistance and the fan power with the fan speed.
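The trade-off between the two fan relations, (2.24) and (2.28), can be sketched as below. The function names are illustrative, and the constants h1 to h4 and kf are empirical fitting parameters whose values are not given here, so any numbers supplied are placeholders.

```python
import math


def convection_resistance(s_fan, h1, h2, h3, h4):
    """Eq. (2.24): R_conv = 1 / (h1*s*(1 - exp(-(h2*s^h3 + h4)/(h1*s))))."""
    flow = h1 * s_fan
    return 1.0 / (flow * (1.0 - math.exp(-(h2 * s_fan ** h3 + h4) / flow)))


def fan_power(s_fan, k_f):
    """Eq. (2.28): P_fan = k_f * s_fan^3."""
    return k_f * s_fan ** 3
```

With hypothetical constants such as `h1=0.01, h2=0.02, h3=0.8, h4=0.5`, the resistance falls as the fan speeds up while the fan power grows with the cube of the speed, which is the trade-off a fan-speed controller has to balance.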
2.5 Summary
The performance, power, thermal, timing, and fan models used in the derivation of novel
solutions in this thesis were presented in this chapter. The full HotSpot thermal model
presented in Section 2.2 will be used in Chapters 3, 4, and 7. The reduced thermal
model will be used in Chapters 5 and 6, because any tractable solution in DEM
involving task deadlines requires a reduced thermal model, and in Chapters 8 and 9, because
implementation on real systems, or on systems with limited information on power and
temperature behavior, requires a simplified thermal model. The fan model is
used only in Chapter 7, and the timing model is used in all chapters except Chapter 9.
Below are the assumptions about the models discussed in this chapter that are central
to the validity of the solutions derived in the rest of the chapters. As long as the following
assumptions are justifiable in real systems, the solutions presented in this thesis can be used
in real system implementations with a slight modification of the models, as demonstrated in
the final chapter of this dissertation.
Key assumptions:
1. The key characteristic assumed of all the models presented in this chapter is
convexity or quasiconvexity (monotonicity) of the dependent variables, power and
temperature, w.r.t. the input variables: processor frequency, voltage, and fan speed.
2. Every task is assumed to have non-significant interaction with other tasks. Although
such interaction does not affect the power and thermal models, it does affect the
performance computation.
3. Multicore floorplans are designed to ensure that two hot processing units of different
cores are not close. Usually this is avoided by facing cores such that hot units are
sided with cooler units that can be memory arrays like L1 or L2 caches.
Chapter 3
Performance Optimal Dynamic Voltage and Frequency Scaling
3.1 Introduction
Throughput optimization is one of the most important DEM problems. Performance is
typically constrained by the maximum power delivery capacity and the thermal design of
the processor. It is important and necessary to (i) estimate the maximum possible
performance that can be extracted; (ii) devise mechanisms that allow the extraction of
maximum throughput without violating power and temperature constraints. This chapter
addresses this issue by presenting online control algorithms for optimizing the performance
of a heterogeneous multi-core processor subject to thermal constraints when executing a set
of tasks. The objective is to minimize the latest completion time of all tasks (makespan).
In a multicore environment without many tasks executing simultaneously, throughput
optimization can be defined as either (i) maximizing the instantaneous throughput of all
cores, or (ii) minimizing the completion time of all tasks. In this chapter we prove that these
two definitions are equivalent and derive the resulting solution. The decision variables are
the task-to-core mapping, and the individual core speeds and voltages. The relation between
the temperature of each core and its power dissipation is expressed through a set of
linear differential equations.
The optimization process for obtaining the solution for minimizing makespan is
typically done in three steps to reflect the fact that the dynamic control takes place on three
different time scales (see Figure 3.1). The time scales are a result of (i) the overhead of a
given control mechanism and (ii) how fast a control can be switched. Fan speed scaling
is not considered in this chapter, but in Chapter 7. The allocation of tasks is determined at
the beginning of the migration interval, which is typically between 50 ms and 100 ms (context
switching overhead is on the order of a few ms). Once the tasks are assigned to the cores at
the beginning of the migration interval, the core speeds and voltages are adjusted every
5 to 10 ms (≈ τdie) within the migration interval. The core speeds and voltages are held
fixed within a scheduling interval.

Figure 3.1: Time scales for optimization (1–3 s: set fan speed; 50–100 ms: chip-wide
migration, set task-to-core assignment; 5–10 ms: chip-wide voltage and speed schedule,
schedule core speeds and voltages).
The problem of optimal thermal-aware task-to-core allocation turns out to be a
computationally difficult non-linear optimization problem. A simplification of the original
problem is presented that reduces it to a linear assignment problem (LAP), which
is polynomially solvable. Solving the task-to-core allocation problem as an LAP results in a
significant improvement in throughput over the power-based thread migration method [14].
It is shown in this chapter that the thermally constrained DVFS policy (also called
the voltage-frequency or the voltage-speed policy) over a given interval that minimizes the
makespan of a set of tasks is equivalent to maximizing their instantaneous throughput.
Throughput is defined as the aggregate of speeds of all cores over a duration of time. Fur-
ther, this is achieved through the zero-slack policy, which either sets the speed of a core to
its maximum value if the temperature of its hottest block is below the temperature upper
bound Tmax, or sets the speed to a value that keeps the temperature of the core at Tmax.
The zero-slack policy is just that: a policy. It expresses the form of the globally
optimal speed function. It will be shown first that implementation of that policy requires
repeatedly solving a convex optimization problem over short intervals. Next, the structure
of certain matrices that appear in the constraints of the convex optimization formulation is
exploited to develop a fast computational procedure which, compared to the convex
optimization approach, is 20 times faster with about 0.4% error in the predicted
throughput.
The proposed solutions on the task-to-core allocation and thermally constrained
DVFS are demonstrated by numerical results for a many-core processor. The proposed
method for the task-to-core allocation achieves 20.2% improvement over corresponding
power based thread migration scheme [14]. Finally, the potential of real-time executions of
both online DVFS and task-to-core allocation are also demonstrated through experiments.
3.2 General Problem Formulation Involving DVFS and Migration
Consider q tasks (not necessarily identical). The problem is to determine the assignment
of tasks to cores and the transient voltages and speeds of cores, such that the latest task
completion time or the makespan of all tasks is minimized, subject to constraints on tem-
perature, speeds and voltages.
Let M be an n×q matrix that represents the assignment of q tasks to n cores. It is
defined by
\[
M_{ij} =
\begin{cases}
1, & \text{if task } j \text{ is assigned to core } i, \\
0, & \text{otherwise.}
\end{cases}
\]
The following equations ensure that each task is mapped to only one core and vice-versa.
\[ \sum_{i=1}^{n} M_{ij} = 1, \qquad \sum_{j=1}^{q} M_{ij} = 1, \quad \forall i, j. \tag{3.1} \]
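The constraint (3.1) is easy to check programmatically. The helper below is an illustrative sketch (its name is not from this work) that takes M as a list of rows; note that both sums can equal one only when the number of tasks equals the number of cores.

```python
def is_valid_assignment(m):
    """Check (3.1): every row and every column of the 0/1 matrix sums to 1."""
    entries_ok = all(x in (0, 1) for row in m for x in row)
    rows_ok = all(sum(row) == 1 for row in m)          # one task per core
    cols_ok = all(sum(col) == 1 for col in zip(*m))    # one core per task
    return entries_ok and rows_ok and cols_ok
```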
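Then the general problem can be stated as follows.
\begin{align}
\min_{\mathbf{s}(t),\,\mathbf{v}(t),\,\mathbf{M}(t)} \quad & t_f = \max_{1 \le c \le n} t_{f,c}, \tag{3.2} \\
\text{s.t.} \quad & \frac{d\mathbf{x}(t)}{dt} = \mathbf{s}(t)\,\mathbf{M}(t)\,\mathbf{IPC}(t), \tag{3.3} \\
& \mathbf{x}(0) = \mathbf{0}, \quad \mathbf{x}(t_f) = \mathbf{I}, \tag{3.4} \\
& \frac{d\mathbf{T}(t)}{dt} = \hat{\mathbf{A}}\mathbf{T}(t) + \mathbf{B}\left[ \mathbf{P}_{dyn}(\mathbf{M},\mathbf{s},\mathbf{v},t) + \mathbf{P}_{lkg,v}(\mathbf{v},t) + \mathbf{P}_{lkg0} \right], \quad \forall t, \tag{3.5} \\
& \mathbf{T}(0) = \mathbf{T}_0, \quad \mathbf{T}(t) \le T_{max}, \quad \forall t, \tag{3.6} \\
& 0 \le s_c(t) \le k^{s}_c\, \frac{(v_c(t) - v_{th})^{1.2}}{v_c(t)\, \max_b (T_c(t))^{1.19}}, \quad \forall t, c, \tag{3.7} \\
& \mathbf{0}_{n \times 1} \le \mathbf{v}(t) \le \mathbf{1}_{n \times 1}, \quad \forall t. \tag{3.8}
\end{align}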
t_{f,c} refers to the completion time of the task executing on core c and t_f denotes
the final completion time or the makespan of all tasks. Here xi(t) is the number of
completed instructions of the task running on core i by time t. The total number of
instructions is denoted by I. Equation (3.3) relates the speed of a core to its execution rate.
Each task starts at time 0 and finishes by time t_f as described in (3.4).
For the purpose of modeling the power and thermal behavior of tasks, we use the HotSpot
thermal model described in Section 2.2. The term P(s,v,M,T,t) in (3.5) is the same as
P(s,v,T,t) in (2.12), except that now Pdyn is a function of M given by
\[ \mathbf{P}_{dyn}(\mathbf{M},\mathbf{s},\mathbf{v},t) = \mathbf{P}^{max}_{dyn,q}(t)\, \mathbf{M}^{\mathsf{T}}\, \mathrm{diag}(\mathbf{v}(t))^2\, \mathbf{s}(t). \tag{3.9} \]
Pdyn(M,s,v,t) is an N×1 vector representing the dynamic power associated with each
of the N thermal blocks. The term M^T diag(v(t))^2 s(t) is a q×1 vector in which the kth
element is of the form s_{c_k} v^2_{c_k}, where c_k is the core to which task k is assigned
according to M. P^{max}_{dyn,q}(t) is an N×q matrix, in which the entry in row i and column j is
the portion of the maximum dynamic power that flows into or out of thermal block i as a
result of running task j. This maximum dynamic power is computed by profiling the tasks on
each core, running at the maximum speed and the maximum voltage (s = 1, v = 1).
Multiplying the ith row of P^{max}_{dyn,q}(t) with the vector M^T diag(v(t))^2 s(t) yields the ith
element of the vector Pdyn(M,s,v,t). This is given by
\[ P_{dyn}(i) = \left[ P^{max}_{dyn,q}(i,1), \ldots, P^{max}_{dyn,q}(i,q) \right] \times \left[ s_{c_1} v^2_{c_1}, \ldots, s_{c_q} v^2_{c_q} \right]^{\mathsf{T}}. \]
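The product in (3.9) can be spelled out with plain lists to make the dimensions explicit. The function name and the numbers in the usage note are illustrative; only the structure of the product comes from the equation.

```python
def dynamic_power_vector(p_max_dyn_q, m, s, v):
    """Eq. (3.9): P_dyn = P_max_dyn_q * M^T * diag(v)^2 * s.

    p_max_dyn_q: N x q list of lists, m: n x q 0/1 assignment matrix,
    s, v: length-n lists of core speeds and voltages.
    """
    n, q = len(m), len(m[0])
    core_term = [s[i] * v[i] ** 2 for i in range(n)]             # diag(v)^2 s
    task_term = [sum(m[i][k] * core_term[i] for i in range(n))   # M^T (...)
                 for k in range(q)]
    return [sum(row[k] * task_term[k] for k in range(q)) for row in p_max_dyn_q]
```

For example, with two cores, `m = [[0, 1], [1, 0]]` places task 2 on core 1 and task 1 on core 2, so each task's profiled power is scaled by the speed and squared voltage of the core it actually runs on.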
Task allocation or migration incurs a high performance penalty compared with
voltage-speed scaling. Hence the migration interval is typically chosen to be around
50–100 ms, much larger than the die thermal time constant (5–10 ms). Once the tasks are
allocated, the voltage-speed scaling for each core is performed on the order of the die
thermal time constant. These two problems are addressed separately in the following
sections.
3.3 Performance Optimal DVFS Policy
The formulation of the transient voltage-speed control to minimize the makespan of tasks
within a migration interval is presented here. Assuming that the task allocation is known
at the beginning of the migration interval, without loss of generality, the task and the core
number are the same and used interchangeably. Ic denotes the number of instructions of
the task assigned to core c. The control variables, viz., the speed and the voltage are
continuous, hence the problem is formulated as an optimal control problem.
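\begin{align}
\min_{\mathbf{s}(t),\,\mathbf{v}(t)} \quad & t_f = \max_{1 \le c \le n} t_{f,c}, \tag{3.10} \\
\text{s.t.} \quad & \frac{dx_c(t)}{dt} = IPC_c\, s_c(t), \quad \forall c, \tag{3.11} \\
& \mathbf{x}(0) = \mathbf{0}, \quad \mathbf{x}(t_f) = \mathbf{I}, \tag{3.12} \\
& \frac{d\mathbf{T}(t)}{dt} = \hat{\mathbf{A}}\mathbf{T}(t) + \mathbf{B}\left[ \mathbf{P}_{dyn}(\mathbf{s},\mathbf{v},t) + \mathbf{P}_{lkg,v}(\mathbf{v},t) + \mathbf{P}_{lkg0} \right], \quad \forall t, \tag{3.13} \\
& \mathbf{T}(0) = \mathbf{T}_0, \quad \mathbf{T}(t) \le T_{max}, \quad \forall t, \tag{3.14} \\
& 0 \le s_c(t) \le k^{s}_c\, \frac{(v_c(t) - v_{th})^{1.2}}{v_c(t)\, \max_b (T_c(t))^{1.19}}, \quad \forall t, c, \tag{3.15} \\
& \mathbf{0}_{n \times 1} \le \mathbf{v}(t) \le \mathbf{1}_{n \times 1}, \quad \forall t. \tag{3.16}
\end{align}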
The above formulation is a minimum-time problem in optimal control theory [58].
Since the task allocation is fixed for the duration of the voltage-speed scaling, the
formulation does not contain M. x and T are two state variables with fixed initial conditions.
x has a variable endpoint t_f. The mixed control-state point-wise inequalities (3.14) and
(3.15) complicate the solution process. The solution is obtained through the use of the direct
adjoining approach [59]. Only the final solution is presented below; the details of the
derivation can be found in Appendix A.1.
Let s^max_{h,c} be the speed which holds the temperature of the hottest block of core c at
the maximum. Then the optimal speed policy for core c is given by:
\[
s^{*}_c(t) =
\begin{cases}
1, & \max(T_c(t)) < T_{max}, \\
0, & \max(T_c(t)) > T_{max}, \\
s^{max}_{h,c}(t), & \max(T_c(t)) = T_{max}.
\end{cases}
\tag{3.17}
\]
The corresponding voltage is calculated by solving the following equation numerically.
\[ s_c(t) = k^{s}_c\, \frac{(v_c(t) - v_{th})^{1.2}}{v_c(t)\, \max_b (T_c(t))^{1.19}}, \quad \forall t, c. \tag{3.18} \]
The above policy, referred to as the zero-slack policy, states that in order to minimize
the overall makespan, either the speed of a core is set to the maximum
when the temperatures of all thermal blocks in that core are less than the maximum, or
the speed is set such that at least one of the thermal blocks in that core is at the
maximum specified temperature.
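The policy (3.17) and the numerical solution of (3.18) can be sketched as follows. This is an illustration under stated assumptions rather than the implementation used in this work: the function names, the tolerance `eps` used to test equality with Tmax, and the choice of bisection as the numerical method are all assumptions. Bisection is valid here because the right-hand side of (3.18) increases with voltage for v > v_th.

```python
def zero_slack_speed(block_temps, t_max, s_hold, eps=1e-6):
    """Eq. (3.17): full speed below T_max, zero above it, s_hold at T_max.

    s_hold is the speed that keeps the hottest block at T_max.
    """
    hottest = max(block_temps)
    if hottest < t_max - eps:
        return 1.0
    if hottest > t_max + eps:
        return 0.0
    return s_hold


def voltage_for_speed(s, block_temps, k, v_th, tol=1e-9):
    """Solve (3.18) for v in [v_th, 1] by bisection."""
    t_hot = max(block_temps)

    def speed(v):
        return k * (v - v_th) ** 1.2 / (v * t_hot ** 1.19)

    lo, hi = v_th, 1.0
    if s > speed(hi):
        raise ValueError("requested speed is not reachable at the maximum voltage")
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if speed(mid) < s:
            lo = mid
        else:
            hi = mid
    return hi
```

In use, the speed is chosen first with `zero_slack_speed`, and `voltage_for_speed` then returns the lowest voltage that supports it at the current temperature.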
Figure 3.2 shows the typical optimal voltage-speed profile for a core. The speed is
deduced from (3.17) and the corresponding voltage from (3.18). Initially the temperature
of the hottest block is less than the maximum, hence the speed of the core is set to the
maximum. Once the temperature of the hottest block reaches the maximum at time t_{m,i},
the speed is decreased according to s^max_{h,i} to maintain the temperature of the hottest
block at the maximum. Finally the task completes at time t_{e,i}. This policy can also be
stated as follows: for the minimum makespan, there should not be slack in both the speed of
a core and the temperature of its hottest block at the same time during the execution of a
task. The solution corresponding to the zero-slack policy will be referred to as the minimum
makespan solution (MMS).

Figure 3.2: Optimal voltage-speed policy of a core as described by (3.17) and (3.18)
(curves: si(t), vi(t) and Ti,h(t) versus t, with Tmax, t_{m,i} and t_{e,i} marked).
The above zero-slack policy is just a policy or a guideline for setting core speeds and
voltages to minimize the makespan. It does not provide a mechanism to implement the
policy. In the process of finding an implementation, it is important to note that the core
speeds and voltages determined through the zero-slack policy depend only on the current
temperature of the cores and are independent of the core speeds, voltages and temperatures
at any other time instants. Thus global minimization of makespan can be achieved through
local maximization of instantaneous throughput. For an online implementation of the
zero-slack policy, the execution time is discretized into short time intervals, referred to as
scheduling intervals, during which the core speeds and voltages are held fixed. With this
setup, the zero-slack policy can be implemented as the solution of a convex optimization
problem, discussed in the following section.
3.4 Convex Optimization Formulation and its Solution
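In order to implement the zero-slack policy at every scheduling interval, the temperature
of all thermal blocks at any scheduling interval should be computable. Let ts be the length
of the scheduling interval; then the temperature at the kth interval can be computed as
(obtained by discretizing (2.9)) [60],
\[ \mathbf{T}(k t_s) = e^{\hat{\mathbf{A}} t_s}\, \mathbf{T}((k-1) t_s) + \hat{\mathbf{A}}^{-1} \left( e^{\hat{\mathbf{A}} t_s} - \mathbf{1}_{N \times N} \right) \mathbf{B}\, \hat{\mathbf{P}}(\mathbf{s},\mathbf{v},k t_s), \tag{3.19} \]
where
\[ \hat{\mathbf{P}}(\mathbf{s},\mathbf{v},k t_s) = \mathbf{P}_{dyn}(\mathbf{s},\mathbf{v},k t_s) + \mathbf{P}_{lkg,v}(\mathbf{v},k t_s) + \mathbf{P}_{lkg0} \tag{3.20} \]
and 1 is the identity matrix. For simplicity of notation, the use of ts is omitted and T(kts)
is simply written as T(k). Let
\[ \mathbf{E} = e^{\hat{\mathbf{A}} t_s}, \tag{3.21} \]
\[ \mathbf{R} = \hat{\mathbf{A}}^{-1} \left( e^{\hat{\mathbf{A}} t_s} - \mathbf{1}_{N \times N} \right) \mathbf{B}. \tag{3.22} \]
Then
\[ \mathbf{T}(k) = \mathbf{E}\,\mathbf{T}(k-1) + \mathbf{R}\,\hat{\mathbf{P}}(\mathbf{s},\mathbf{v},k). \tag{3.23} \]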
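To make the discretization concrete, the sketch below applies (3.21) to (3.23) to the scalar case N = 1, where the matrix exponential reduces to an ordinary exponential. The full model would need a matrix exponential routine instead; the function names and the values in the usage note are illustrative.

```python
import math


def discretize_scalar(a_hat, b, ts):
    """Scalar analogue of (3.21)-(3.22): E = exp(a*ts), R = (E - 1)*b/a."""
    e = math.exp(a_hat * ts)
    r = (e - 1.0) * b / a_hat
    return e, r


def step(t_prev, p_hat, e, r):
    """Eq. (3.23): T(k) = E*T(k-1) + R*P_hat(k), with P_hat held over the interval."""
    return e * t_prev + r * p_hat
```

For example, `e, r = discretize_scalar(-1.0 / 49.0, 1.0 / 140.0, 0.01)` discretizes a first-order system with a 49 s time constant at a 10 ms scheduling interval; repeated calls to `step` then track the exact continuous-time response for a piecewise-constant power input.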
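With the above discretization, the problem of instantaneous throughput maximization
for every scheduling interval is formulated as a convex optimization problem.
\begin{align}
\max_{\mathbf{s},\,\mathbf{v}} \quad & \sum_{c} s_c(k), \tag{3.24} \\
\text{s.t.} \quad & \mathbf{T}(k) = \mathbf{E}\,\mathbf{T}(k-1) + \mathbf{R}\,\hat{\mathbf{P}}(\mathbf{s},\mathbf{v},k), \tag{3.25} \\
& \mathbf{T}(k) \le T_{max}, \tag{3.26} \\
& 0 \le s_c(k) \le k^{s}_c\, \frac{(v_c(k) - v_{th})^{1.2}}{v_c(k)\, \max_b (T_c(k))^{1.19}}, \quad \forall c, \tag{3.27} \\
& \mathbf{0}_{n \times 1} \le \mathbf{v}(k) \le \mathbf{1}_{n \times 1}. \tag{3.28}
\end{align}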
Figure 3.3: Plot of values of matrix R for the dual-core Alpha processor with scheduling
interval ts = 10 ms [3] (rows and columns ordered as die of core 1, die of core 2, TIM,
spreader, package).
The proof of convexity of the above optimization problem is derived in Appendix A.2.
The worst-case complexity of a convex optimization process is NP-hard, hence
the computational time of the convex optimization approach does not scale
well with the problem size, which makes it less attractive for an online implementation.
With the aid of realistic assumptions, a fast computational procedure is derived in the
following section.
3.5 Fast Computational Procedure
In this section, a fast computational method, which avoids the complexity of convex op-
timization is proposed. By making use of certain useful properties that matrix R exhibits
for scheduling intervals on the order of the die thermal time constant, an efficient com-
putational procedure is developed, whose computational time complexity is linear in the
number of cores and logarithmic in the number of discrete states of speeds and voltages.
Figure 3.4: Dual-core floorplan of Alpha 21264 processor [4] (each of Core 1 and Core 2
contains the blocks L2 Cache, ILC, DLC, branch, DTLB, ITLB, RAT, IntReg, FPReg, decode,
FPAdd, FPMul, RUU, LSQ and ALU1–ALU4).
In order to understand the structure of R, we need to know the typical multi-core
floorplan. Figure 3.4 shows the floorplan of a dual-core Alpha 21264 processor as an
example of a typical multi-core floorplan. This floorplan is constructed by replicating the
single-core floorplan, such that the processing units of the cores are separated by L2 caches
or similar memory areas which do not create temperature hotspots. Floorplans similar to the
above are typically used because they have been shown to produce fewer thermal
hotspots than other floorplans, due to reduced lateral heat flow [61].
Next, the properties of R are studied for scheduling intervals on the order of the
die thermal time constant. Figure 3.3 shows a typical plot representing values of R for the
dual-core Alpha 21264 processor with the scheduling interval set to the die thermal time
constant, which is typically 10 ms for the Alpha processor. The arrangement of die, TIM,
spreader and package blocks of the matrix are shown in the figure. Note that the Alpha
21264 processor has 20 functional blocks in both the die and the TIM layers.
From (3.23), we see that the temperature T is obtained by multiplying R with the
power vector P̂. From Figure 3.3, we see that the power dissipation of the die section of a
certain core affects mainly the temperature of the die section of the same core. For example,
we see that rows 1–20 are populated mostly between columns 1–20 (die) and 41–60 (TIM).
Noting that the power vector is non-zero only for the die blocks, we see that the power
dissipation of the die section of core 1 is mostly responsible for the temperature of the die
section of core 1. This observation can be explained by the presence of large and cooler
inter-core caches (see Figure 3.4), which reduce the inter-core lateral heat transfer. Since
the scheduling intervals are chosen on the order of the die thermal time constant, which is
orders of magnitude smaller than the package thermal time constant, the heat generated by
the power dissipation of a core is localized within the core for these short intervals of
time. Thus the temperature of a core is mainly determined by the power of that core for time
durations on the order of the die thermal time constant.
Algorithm 1: Fast computational procedure to determine the voltage-speed scaling to
minimize makespan.
Input: Power profile and instruction length of tasks; initial temperature;
available speed and voltage states.
Output: Speed and voltage profile.
for each scheduling interval kts do
    for each core c do
        Run binary search on sc such that Tdie,c(kts) = Tmax (see (3.29));
        Let sc(kts) be the corresponding speed;
        sc(kts) = min(sc(kts), 1);
        Solve (3.18) to find the corresponding vc(kts);
    end
end
The above observation helps in the local determination of speeds and voltages of
cores based on their own power dissipation and independent of the speeds and voltages of
other cores. The simplified temperature computation of the die section of core c is then
given by
\[ T_{\mathrm{die},c}(k) = E\,T_{\mathrm{die},c}(k-1) + R_{\mathrm{die},c}\,\hat{P}_{\mathrm{die},c}(k). \tag{3.29} \]
The matrix Rdie,c corresponds to the die section of core c in R. For the dual-core processor
shown in Figure 3.4, Rdie,1 corresponds to rows 1–20 and columns 1–20 of R. Note that
the matrix Rdie,c is of constant size (number of functional units) irrespective of the number
of cores. Thus the temperature computation of an n core processor is linear in the number
of cores.
This decoupling of a core's temperature from the power consumption of other cores
over short durations enables the use of binary search to determine the speed and the
voltage of each core such that they satisfy the optimal policy (3.17). The time complexity
of the binary search is logarithmic in the number of discrete speed and voltage
states. The above procedure is summarized in Algorithm 1.
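The per-core search in Algorithm 1 can be sketched as follows. This is a minimal illustration, not the MAGMA implementation: `toy_die_temp` is a hypothetical scalar stand-in for (3.29), its constants are made up, and the voltage computation of (3.18) is omitted. The only property the search relies on is that the predicted die temperature does not decrease as the speed increases.

```python
from typing import Callable, Sequence


def highest_feasible_speed(
    speeds: Sequence[float],
    die_temp: Callable[[float], float],
    t_max: float,
) -> float:
    """Binary search over an ascending list of discrete speed states.

    Returns the largest speed whose predicted die temperature does not
    exceed t_max. If no state is feasible, the lowest state is returned
    (an assumption of this sketch).
    """
    lo, hi = 0, len(speeds) - 1
    best = speeds[0]
    while lo <= hi:
        mid = (lo + hi) // 2
        if die_temp(speeds[mid]) <= t_max:
            best = speeds[mid]   # feasible: try a faster state
            lo = mid + 1
        else:
            hi = mid - 1         # too hot: try a slower state
    return best


def toy_die_temp(s: float, t_prev: float = 100.0, e: float = 0.6,
                 r: float = 0.8, p_max: float = 60.0,
                 p_leak: float = 10.0) -> float:
    """Hypothetical scalar version of (3.29): T(k) = E*T(k-1) + R*P(k).

    Dynamic power is modelled as s**3 * p_max, i.e. voltage is assumed to
    scale with speed; all constants are illustrative only.
    """
    return e * t_prev + r * (p_leak + s ** 3 * p_max)


speeds = [0.1 * i for i in range(1, 11)]   # ten normalized speed states
s_star = highest_feasible_speed(speeds, toy_die_temp, t_max=110.0)
```

With ten states the loop evaluates the temperature model at most four times per core, which is the logarithmic cost referred to above.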
3.6 Experimental Results
Experimental Setup
For simulation purposes, the power and thermal models of the Alpha 21264 microprocessor
are used. The Alpha 21264 is commonly used for simulating and verifying DTM policies [53]
due to the availability of detailed power and thermal models. Since the Alpha 21264 is a
single-core processor, a hypothetical multi-core version is created by scaling down
a single core and replicating it such that the resultant cores fit the
size of a single Alpha core, as shown in Figure 3.4. The HotSpot-4 [54] and PTScalar [53]
simulators were used to derive the thermal and the power models, respectively.
The above simulation setup and the policies obtained from this work are implemented
in a thermal management simulator called MAGMA [62], which is capable of design
space exploration and of performing several offline and online thermal management techniques,
including the algorithms developed in this work. The simulator includes the leakage
dependence on temperature and allows the user to trade off model accuracy against simulation
time. The simulator is open source and is available for public download. A schematic of
the MAGMA setup is shown in Figure 3.5.
[Figure 3.5 (block diagram). Blocks: workload characteristics (e.g. ALU accesses, cache misses); dynamic and static power models (PTScalar); thermal conduction and convection equations (HotSpot); DTM controller. Signals: speed of all cores, power of all FUs in cores, temperature of all FUs in cores, max. die temperature, max. core speed. FU: functional unit block of a CPU core.]
Figure 3.5: Experimental setup showing various components
Table 3.1: Characteristics of benchmarks used in the experiment.
Benchmark                        basicmath   qsort   IFFT    GSM dec.
Avg. dyn. power (W)              60.13       56.75   55.41   57.54
IPC                              1.8         1.85    1.95    2.45
Instructions (billion)           700         900     500     300
Completion time (s), optimal     277         132     287     342
Completion time (s), discrete    276         134     287     343
The benchmarks for the simulation were obtained from the MiBench benchmark suite [63].
The benchmarks used in the experiments, their instruction lengths, IPC and average dynamic
power consumption values are listed in Table 3.1. The maximum speed was fixed at 2 GHz,
while the voltages were allowed to vary from 0.3 V (threshold voltage) to 1.2 V. The total
leakage power of the processor was limited to 60 W. The maximum temperature was fixed
at 110 °C [53]. The convective thermal resistance in the HotSpot thermal model was set at
0.35 °C/W. The voltage-speed scheduling interval was set at 10 ms, while the migration and
the fan speed scheduling intervals were set at 100 ms and 3 s, respectively.
[Figure 3.6 plots: three panels versus time (s) for basicmath, qsort, IFFT and GSM dec.: temperature (°C, axis 80 to 110), speed (GHz, axis 0 to 2) and voltage (V, axis 0.3 to 1.2). Time-axis ticks mark the task completion times 132, 277, 287 and 342 s.]
Figure 3.6: Performance of the optimal DVFS scheme in minimizing the overall makespan
The Optimal Makespan Minimization Policy
Figure 3.6 shows the results of execution of the optimal makespan policy for four tasks
executing on four cores. Table 3.1 lists the task completion times.
Since no equivalent work considers both voltage and frequency scaling with
time-varying power profiles to minimize the makespan, the developed optimal
makespan policy cannot be compared with other works.
Table 3.2: Comparison of the proposed approximate method with the accurate convex
optimization method.
No. of cores    4       8      16     32      64
Speed-up (X)    3.27    4.06   5.59   10.52   24.13
Error (%)       0.023   0.05   0.07   0.15    0.35
In Figure 3.6, the temperatures of all cores are initially below the maximum specified
temperature of 110 °C. Hence the speeds are set to the maximum. As the temperatures
increase, the speeds of the cores are throttled exponentially to maintain the core
temperatures at the maximum. We also see that once a task completes, the other tasks
can execute at a faster pace because less power is consumed. For example, after qsort
(132 s) and basicmath (277 s) finish their executions, we see an increase in the speeds of
the other tasks. The optimal DVFS scheduling resulted in a makespan of 342 s.
Comparison of the Approximate Solution with the Convex Optimization Solution for
Makespan Minimization
The results of using the faster but approximate method described in Section 3.5, compared
with the convex optimization method (Section 3.4), are shown in Table 3.2. The error
reported is the relative difference between the speeds obtained from the convex
optimization approach and the approximate method. The approximate procedure reduces
the computation time significantly (3X to 24X), with the speed-up growing with the number
of cores, while the error is less than 0.4%.
Discrete Voltage-speed Implementation
To demonstrate that the developed procedure is suitable for practical implementation, an
approximate discrete voltage-speed policy using ten speed and voltage states (current Intel
processors support up to ten voltage-speed states) is computed as shown in Figure 3.7. The
task conditions are the same as in Table 3.1. At each scheduling interval, the highest
discrete speed that satisfies the thermal constraint is chosen. This may lead to frequent
toggling between neighboring speed states, as seen in the figure, to maintain the temperature
at the maximum according to the zero-slack policy. Since the discrete implementation is
an approximation to the continuous optimal policy, it results in a slightly higher makespan
(343 s) and longer completion times of tasks compared with the optimal policy as seen
from Table 3.1.
[Figure 3.7 plots: three panels versus time (s) for basicmath, qsort, IFFT and GSM dec.: temperature (°C, axis 80 to 110), speed (GHz, axis 0 to 2) and voltage (V, axis 0.3 to 1.2). Time-axis ticks mark the task completion times 134, 276, 287 and 343 s.]
Figure 3.7: Discretization of the optimal makespan algorithm with ten speed states
Chapter 4
Performance Optimal Task-to-core Allocation
4.1 Problem Definition
The goal of the task-to-core allocation is to determine an optimal allocation for every mi-
gration interval, which when combined with the optimal voltage-speed scaling policy (de-
rived in Section 3.3) within a migration interval minimizes the overall makespan.
Recall that the optimal speed policy is given by (3.17). In [16], the authors have
shown for dynamic frequency scaling (DFS) (no voltage scaling and neglecting the die
capacitances) that the optimal speed curve is an exponential function whose time constant
is the package thermal time constant. It can be shown that the optimal speed policy
derived in Section 3.3 follows this exponential curve in the general sense, with variations
along the path.
The above inference is useful in comparing two speed profiles. It can be deduced
that a speed function with higher steady-state speed has higher overall throughput (integral
of speed over the duration of execution) than a speed function with lower steady-state
speed. Hence the steady-state speed can be used as a metric for throughput comparison
of different speed functions. It was previously mentioned that the tasks are allocated at
intervals much longer than the die thermal time constant. At these intervals, the die and the
thermal interface material (TIM) capacitances saturate. Thus we can conduct the steady-
state thermal analysis for the die and the TIM. Note that the package temperature can be
assumed to remain constant (see Figure 4.1) during this interval as its thermal time constant
(1–2 min) is orders of magnitude higher than the migration interval.
From the zero-slack policy stated in Section 3.3, we know that minimizing the makespan
can be achieved by maximizing the instantaneous throughput. Hence the objective in the
formulation in Section 3.2 is replaced with maximizing the throughput (sss) for the purpose of
determining the task-to-core allocation, as shown below.
\[ \max_{s_{ss},\, v_{ss},\, M} \; \sum_{c} s_{c,ss}, \quad \forall c, \tag{4.1} \]
\[ \text{s.t.}\quad T = -\hat{A}^{-1} B \left[ P_{\mathrm{dyn}}(M, s_{ss}, v_{ss}) + P_{\mathrm{lkg},v}(v_{ss}) + P_{\mathrm{lkg0}} \right], \tag{4.2} \]
\[ T \le T_{\max}, \quad T_{\mathrm{pkg}} = T_{\mathrm{pkg0}}, \tag{4.3} \]
\[ 0 \le s_{c,ss} \le k_v \frac{(v_{c,ss} - v_{th})^{1.2}}{v_{c,ss}\, \max_{b} (T_{c0})^{1.19}}, \quad \forall c, \tag{4.4} \]
\[ 0_{n\times 1} \le v_{ss} \le 1_{n\times 1}. \tag{4.5} \]
sc,ss and vc,ss are the steady-state speed and voltage of core c. Tpkg is the package
temperature vector. Equation (4.2) is the steady-state temperature derived by setting dT/dt = 0 in
(2.9).
[Figure 4.1 plot: core speed sc(t), die block temperature Tc,b(t) and package temperature Tp(t) versus time t, with the die thermal time constant τdie marked.]
Figure 4.1: Response of the die and the package temperatures to changes in speed of a
core [5].
The solution to the above optimization problem is complicated by the fact that
Pdyn in (4.2) is a mixed-integer non-linear function of sss, vss and M.
Performing a mixed-integer non-linear optimization in real time is computationally hard.
A naive optimal solution would be a brute-force enumeration of all possible task-to-core
assignments, whose number is \( {}^{q}P_{n} = q(q-1)(q-2)\cdots(q-n+1) \). This number grows
too quickly with q and n for such an approach to be practical online. However, with the aid of
a few realistic simplifications, this problem
can be transformed to a linear assignment problem (LAP) [64]. These simplifications are
detailed in the following section.
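To give a feel for the size of the brute-force search space, the count qPn can be evaluated directly. The snippet below is only an illustration; the (q, n) pairs are examples, not a reproduction of any experiment. For eight tasks on four cores there are already 1680 ordered assignments to evaluate at every migration interval.

```python
import math


def num_assignments(q: int, n: int) -> int:
    """Number of ordered ways to place n of q tasks on n distinct cores,
    i.e. qPn = q * (q - 1) * ... * (q - n + 1)."""
    return math.perm(q, n)


# Illustrative (tasks, cores) pairs.
for q, n in [(8, 4), (16, 4), (128, 4), (64, 16)]:
    print(f"q={q:3d}, n={n:2d}: {num_assignments(q, n)} assignments")
```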
4.2 Structure of the HotSpot Conductance Matrix
Given a floorplan, it is easy to construct the conductance matrix G using the HotSpot
thermal circuit model [54]. Figure 4.2 shows the sparsity plot of the conductance matrix
for the dual-core Alpha processor shown in Figure 3.4. A non-zero entry in the matrix
is represented by a dot in the plot. It is seen from the figure that most of the entries are
concentrated in the diagonal and off-diagonal blocks of the die and the TIM
sections, and in the package and the spreader sections. The square diagonal blocks represent the
lateral resistances of the die and the TIM of the cores. The vertical resistances connecting
the die and the TIM layers of the chip are shown by the off-diagonal elements. The vertical
and the horizontal strips of dots in the figure represent the vertical resistances between the
TIM and the spreader.
As seen from Figure 3.4, each core is surrounded by a large cache. These caches
act as cooler regions of the chip and help in reducing the lateral heat flow between the
cores. Hence fewer lateral resistances between the cores are seen in Figure 4.2. Based on
this inference, the sparse blocks of the matrix corresponding to the inter-core die and TIM
resistances can be neglected and the resultant conductance matrix is labeled as shown in
Figure 4.3.
The diagonal components of the conductance matrix in Figure 4.3 are named Gdie
and Gtim, corresponding to the die and the TIM layers of the chip respectively. Gdie-tim
and Gtim-die denote the same vertical conductances connecting the die and the TIM layers.
Similarly, Gtim-spr and Gspr-tim denote the conductances connecting the spreader layer to
the TIM layer. Finally, the package conductances are denoted by Gpkg.
[Figure 4.2 sparsity plot (nz = 688): rows and columns are grouped into die (core 1, core 2), TIM (core 1, core 2), spreader and heatsink. Annotated regions: lateral resistances within die; lateral resistance between cores; lateral resistances within TIM, core 1; vertical resistances between die and TIM; vertical resistances between TIM and spreader; vertical and lateral resistances in spreader and heatsink layers.]
Figure 4.2: Plot of the conductance matrix showing the sparsity of the dual-core Alpha
processor floor plan shown in Figure 3.4 [4].
4.3 Simplified Temperature Computation
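The total power dissipation of a processor for a given task allocation is given by:
\[ G\,T = P_{\mathrm{dyn}}(s, v, t) + G_T\,T + P_{\mathrm{lkg},v}(v, t) + P_{\mathrm{lkg0}}. \tag{4.6} \]
Applying Kirchhoff's current law for the die and the TIM layers, the following equations
are derived for task j executing on core i:
\[ G_{\mathrm{die},i}\,T_{\mathrm{die},ij} + G_{\mathrm{die\text{-}tim},i}\,T_{\mathrm{tim},i} = s_{ij} v_{ij}^{2} P^{\max}_{d,j} + G_{T,i}\,T_{\mathrm{die},ij} + P_{v,i}\,v_{ij} + P_{l0,i}, \tag{4.7} \]
\[ G_{\mathrm{die\text{-}tim},i}\,T_{\mathrm{die},ij} + G_{\mathrm{tim},i}\,T_{\mathrm{tim},i} + G_{\mathrm{tim\text{-}spr},i}\,T_{\mathrm{spr}} = 0. \tag{4.8} \]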
[Figure 4.3 block layout of G: rows and columns are grouped into die (core 1, core 2), TIM (core 1, core 2), spreader and heatsink. Diagonal blocks: Gdie1, Gdie2, Gtim1, Gtim2, Gpkg. Coupling blocks: Gdie1-tim1, Gdie2-tim2, Gtim1-die1, Gtim2-die2, Gtim1-spr, Gtim2-spr, Gspr-tim1, Gspr-tim2. Inter-core blocks are marked ≈ 0; the remaining blocks are 0.]
Figure 4.3: Components of the thermal conductance matrix G of the dual-core Alpha
processor floor plan shown in Figure 3.4 [4].
Tdie,i and Ttim,i denote the temperature vectors of the die and the TIM units of core i
respectively. Tspr is the scalar temperature of the spreader center. Tspr can be computed uniquely
for a given Tpkg [4]. sij, vij and Tdie,ij are the steady-state speed, voltage and temperature
of the die units of core i when executing task j. Note that the subscript "ss" is dropped to keep
the notation clean.
To simplify the notation, the following are defined.
\[ G_{\mathrm{app},i} \triangleq G_{\mathrm{die},i} - G_{\mathrm{die\text{-}tim},i}\,G^{-1}_{\mathrm{tim},i}\,G_{\mathrm{die\text{-}tim},i} - G_{T,i}, \tag{4.9} \]
\[ P_{\mathrm{app},i} \triangleq G_{\mathrm{die\text{-}tim},i}\,G^{-1}_{\mathrm{tim},i}\,G_{\mathrm{tim\text{-}spr},i}\,T_{\mathrm{spr}} + P_{l0,i}. \tag{4.10} \]
With the above definitions, Tdie,ij is calculated from (4.7) and (4.8) as
\[ T_{\mathrm{die},ij} = G^{-1}_{\mathrm{app},i} \left[ P_{\mathrm{app},i} + s_{ij} v_{ij}^{2} P^{\max}_{d,j} + P_{v,i}\,v_{ij} \right]. \tag{4.11} \]
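The elimination of the TIM temperature that leads to (4.9) to (4.11) can be checked numerically. The sketch below is a scalar simplification: each layer of a core is reduced to a single node, so every conductance block becomes a plain number. All numerical values are hypothetical and chosen only so that the arithmetic is easy to follow; the function and parameter names are not from the original work.

```python
def die_temperature(g_die, g_dt, g_tim, g_ts, g_t, t_spr,
                    p_l0, p_v, p_dmax, s, v):
    """Scalar form of (4.9)-(4.11) for one core running one task."""
    g_app = g_die - g_dt * g_dt / g_tim - g_t        # (4.9)
    p_app = g_dt * g_ts * t_spr / g_tim + p_l0       # (4.10)
    return (p_app + s * v ** 2 * p_dmax + p_v * v) / g_app   # (4.11)


def tim_temperature(g_dt, g_tim, g_ts, t_die, t_spr):
    """TIM temperature obtained by solving (4.8) for Ttim."""
    return -(g_dt * t_die + g_ts * t_spr) / g_tim


# Hypothetical values; off-diagonal conductances are negative, as in a
# nodal conductance matrix.
params = dict(g_die=2.0, g_dt=-2.0, g_tim=3.0, g_ts=-1.0, g_t=0.05,
              t_spr=60.0, p_l0=5.0, p_v=3.0, p_dmax=20.0, s=0.8, v=0.9)

t_die = die_temperature(**params)
t_tim = tim_temperature(params["g_dt"], params["g_tim"], params["g_ts"],
                        t_die, params["t_spr"])

# Residual of the die-layer balance (4.7); it should vanish up to round-off.
lhs = params["g_die"] * t_die + params["g_dt"] * t_tim
rhs = (params["s"] * params["v"] ** 2 * params["p_dmax"]
       + params["g_t"] * t_die + params["p_v"] * params["v"] + params["p_l0"])
residual_47 = lhs - rhs
```

In the matrix case the divisions by `g_tim` and `g_app` become linear solves with Gtim,i and Gapp,i, but the order of operations is the same.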
The problem stated in Section 4.1 is re-formulated using the above simplifications.
\[ \max_{s,\, M} \; M' s, \tag{4.12} \]
\[ \text{s.t.}\quad T_{\mathrm{die},i} = G^{-1}_{\mathrm{app},i} \left[ P_{\mathrm{app},i} + s_i v_i^{2} P^{\max}_{\mathrm{dyn},q} M_i^{T} + P_v(v_i) \right], \quad \forall i, j, \tag{4.13} \]
\[ T \le T_{\max}, \quad T_{\mathrm{spr}} = T_{\mathrm{spr0}}, \tag{4.14} \]
\[ 0 \le s_i \le k_v \frac{(v_i - v_{th})^{1.2}}{v_i\, \max_{b} (T_c)^{1.19}}, \quad \forall i, \tag{4.15} \]
\[ 0_{n\times 1} \le v \le 1_{n\times 1}. \tag{4.16} \]
Mi refers to the ith row of matrix M. This formulation is converted to a linear assignment
problem as illustrated in the next section. Note that the initial condition on Tpkg in (4.3)
has been replaced with the initial condition on Tspr in (4.14), as Tpkg can be derived
exactly knowing Tspr [4].
4.4 Linear Assignment Problem
The optimal speeds and voltages of cores for a given allocation can be obtained by setting
Tdie,ij = Tmax and solving (4.11) and (3.18) simultaneously (the corresponding si j ≤ 1).
Thus the computation of speed for a core executing a given task is independent of the
speed computations of other cores. Let S be a matrix, whose elements are si j, where i, j
refer to the core and the task number respectively. This matrix is referred to as the speed
matrix.
Given this speed matrix S, the problem of optimal task-to-core allocation is formulated
as
\[ \max_{M} \; \sum_{i} \sum_{j} M_{ij} S_{ij} \tag{4.17} \]
\[ \text{s.t.}\quad \sum_{i=1}^{n} M_{ij} = 1, \; j \in \{1, \dots, q\}, \qquad \sum_{j=1}^{q} M_{ij} = 1, \; i \in \{1, \dots, n\}, \tag{4.18} \]
\[ M_{ij} \in \{0, 1\}, \quad i \in \{1, \dots, n\}, \; j \in \{1, \dots, q\}. \tag{4.19} \]
[Figure 4.4 plot: throughput versus time (0 to 200 s), with migration and without migration.]
Figure 4.4: Comparison of throughput with and without task migration.
This is a linear assignment problem and has an efficient polynomial-time solution, O((nq)^3),
using the Munkres algorithm [65]. Once the allocation is determined using this algorithm, the
speeds and the voltages within the migration interval are determined for every scheduling
interval as described in Section 3.3.
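A compact way to solve (4.17) to (4.19) is sketched below. It uses a textbook potentials-based variant of the Hungarian (Kuhn-Munkres) method and is not taken from the original implementation. The sketch assumes n ≤ q, gives every core exactly one task and every task at most one core, and maximizes the summed speed by minimizing its negative. The speed matrix `S` is made up, and `brute_force_best` is included only to cross-check the result on this small instance.

```python
from itertools import permutations
from typing import List, Sequence


def min_cost_assignment(cost: Sequence[Sequence[float]]) -> List[int]:
    """Hungarian method for an n x q cost matrix with n <= q.
    Returns, for each row (core), the column (task) assigned to it."""
    n, q = len(cost), len(cost[0])
    inf = float("inf")
    u = [0.0] * (n + 1)     # row potentials (1-based)
    v = [0.0] * (q + 1)     # column potentials (1-based)
    match = [0] * (q + 1)   # match[j] = row matched to column j (0 = free)
    way = [0] * (q + 1)     # back-pointers along the augmenting path
    for i in range(1, n + 1):
        match[0] = i
        j0 = 0
        minv = [inf] * (q + 1)
        used = [False] * (q + 1)
        while True:
            used[j0] = True
            i0, delta, j1 = match[j0], inf, 0
            for j in range(1, q + 1):
                if not used[j]:
                    cur = cost[i0 - 1][j - 1] - u[i0] - v[j]
                    if cur < minv[j]:
                        minv[j], way[j] = cur, j0
                    if minv[j] < delta:
                        delta, j1 = minv[j], j
            for j in range(q + 1):
                if used[j]:
                    u[match[j]] += delta
                    v[j] -= delta
                else:
                    minv[j] -= delta
            j0 = j1
            if match[j0] == 0:
                break
        while j0:           # flip the matching along the augmenting path
            j1 = way[j0]
            match[j0] = match[j1]
            j0 = j1
    rows = [-1] * n
    for j in range(1, q + 1):
        if match[j]:
            rows[match[j] - 1] = j - 1
    return rows


def allocate_tasks(speed: Sequence[Sequence[float]]) -> List[int]:
    """Maximize the summed speed by minimizing its negative."""
    return min_cost_assignment([[-s for s in row] for row in speed])


def brute_force_best(speed: Sequence[Sequence[float]]) -> float:
    """Exhaustive optimum, usable only for tiny instances."""
    n, q = len(speed), len(speed[0])
    return max(sum(speed[i][p[i]] for i in range(n))
               for p in permutations(range(q), n))


# Hypothetical 3-core, 4-task speed matrix (normalized speeds).
S = [
    [0.90, 0.75, 0.60, 0.80],
    [0.85, 0.70, 0.65, 0.95],
    [0.80, 0.72, 0.55, 0.90],
]
alloc = allocate_tasks(S)
total = sum(S[i][j] for i, j in enumerate(alloc))
```

Each row of `S` would in practice be filled by the per-core speed computation described above, so the assignment step is the only part that needs global information.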
4.5 Experimental Results
Performance Improvement through Task Migration
This section demonstrates the advantage of task migration by comparing the throughput
of a set of tasks executed with and
without task migration. Figure 4.4 shows the throughput of a four-core processor
executing four tasks chosen out of a set of eight tasks for a duration of 200 s, with the
initial package temperature set at 35 °C. Throughput is measured as the sum of core speeds
weighted by their respective IPCs.
First, tasks are executed according to the optimal voltage-speed scaling alone, as
derived in Chapter 3, for a fixed allocation of tasks to cores. Next, the same set of tasks is
executed using both task migration (according to Chapter 4) and the optimal
voltage-speed scaling. We see a 32.3% improvement in throughput with task migration,
which demonstrates the need for task migration to improve performance.
[Figure 4.5 plot: throughput versus time (0 to 100 s) for the optimal allocation and P.TM.]
Figure 4.5: Comparison of temporal performance of the optimal allocation scheme with
the P.TM scheme.
Performance Comparison of the Optimal Task-to-core Allocation with the Power-based
Thread Migration
The performance of the proposed optimal task-to-core allocation is compared here with the
power-based thread migration (P.TM) [14] method. The mapping
algorithms were required to choose four tasks among eight tasks to execute on a four-core
processor for every migration interval. Note that each core still executes only one task.
In P.TM, the cores are sorted by their current temperatures (increasing) and tasks
are sorted by their power dissipation numbers (decreasing). At the beginning of every
migration interval, task i is mapped to core i according to their respective lists, i.e. the
highest power dissipating task is assigned to the coldest core and the least power dissipating
task to the hottest core.
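The P.TM mapping rule described above amounts to pairing two sorted lists. The sketch below assumes equal numbers of tasks and cores; the temperatures and power values are illustrative, and how P.TM selects tasks when more tasks than cores are available is not specified here, so that case is left out.

```python
from typing import Dict, Sequence


def ptm_mapping(core_temps: Sequence[float],
                task_powers: Sequence[float]) -> Dict[int, int]:
    """Map the highest-power task to the coldest core, the next
    highest-power task to the next coldest core, and so on.
    Returns {core index: task index}."""
    cores = sorted(range(len(core_temps)), key=lambda c: core_temps[c])
    tasks = sorted(range(len(task_powers)), key=lambda j: task_powers[j],
                   reverse=True)
    return dict(zip(cores, tasks))


# Illustrative values: core temperatures in degrees C, task powers in W.
core_temps = [72.0, 65.0, 80.0, 68.0]
task_powers = [55.4, 60.1, 56.8, 57.5]
mapping = ptm_mapping(core_temps, task_powers)
```

Because the rule looks only at the current ordering, two cores with equal temperatures are ordered arbitrarily (here by index), which is the ambiguity discussed later in this section.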
Figure 4.5 plots the throughputs (sum of core speeds weighted by their
IPCs) of the proposed optimal allocation and the P.TM technique for a four-core processor
with eight available tasks. The simulation is performed for a duration of 100 s with the
initial package temperature set at 35 °C. The throughput within a migration interval is
obtained using the optimal voltage-speed scaling as described in Chapter 3.
[Figure 4.6 plot: throughput improvement (%) versus ratio of tasks to cores (2, 4, 8, 16, 32, 64). Configurations (cores/tasks): 4/8, 4/16, 4/32, 4/64, 4/128, 2/128. Vertical-axis ticks: 0, 100, 200, 250, 318.]
Figure 4.6: Plot of throughput improvement of the proposed optimal algorithm against the
P.TM technique for various ratios of tasks and cores
Since the P.TM method does not predict the temperatures for all combinations of
tasks and cores, and considers only the current temperature, it cannot take into account
the heterogeneity of cores and the spatial variation of leakage power. Also, if some cores
have the same temperature, P.TM cannot decide on the mapping of tasks to those cores.
Hence P.TM results in lower throughput compared with the proposed allocation technique.
The spatial variation of leakage power was set at 30% of the maximum leakage power. The
plots demonstrate that the proposed optimal allocation scheme has a throughput improvement
of 20.2% over the P.TM technique at steady-state conditions; that is, using P.TM is equivalent
to losing the throughput of one core among five cores.
Effect of Cores and Tasks on the Performance of Task-to-core Allocation
Figure 4.6 shows the performance of the proposed optimal task allocation for various
combinations of numbers of tasks and cores against the P.TM scheme for the same configurations.
The plot shows the percentage improvement in throughput. A higher task-to-core ratio
provides more flexibility for the proposed algorithm to extract the maximum throughput in
comparison with the P.TM scheme. This can be observed from the figure, with
improvements as high as 3.2X.
Table 4.1: Computation times per core (in milliseconds) for DVFS, task-to-core allocation
and fan speed scaling
No. of cores                         4      8      16     32     64
Voltage-speed scaling                17.4   18.3   16.7   19.1   18.7
Task-to-core allocation (optimal)    138    167    126    157    142
Task-to-core allocation (P.TM)       3.44   3.41   3.51   3.47   3.43
Fan speed scaling                    104    113    97     101    107
Computation Times
Table 4.1 shows the computation times of the proposed task-to-core allocation and
voltage-speed scaling methods for various numbers of cores under DVFS. The table also
includes the simulation time for the P.TM method. The simulation was conducted in Matlab,
run on an Intel Core 2 Duo processor at 1.67 GHz with 2 GB of RAM. The computation time
for the voltage-speed scaling method is the time taken by each core to compute its speed and
voltage for one scheduling interval.
Although the P.TM method consumes negligible time to compute the mapping of
tasks to cores, it is worth considering the trade-off between the loss in throughput and the
simulation time.
As mentioned previously, each core can compute its own voltage and speed. Although
solving the linear assignment problem to compute the task allocation cannot be
completely parallelized, the speed matrix can be computed in parallel (as each core
computes its own speed and voltage). Moreover, solving the linear assignment
problem through the Munkres algorithm [65] takes negligible time compared to the time
to compute the speed matrix. Hence only the computation time per core is shown in the
table. Note that the computation time per core in both the voltage-speed scaling and the
task-to-core allocation methods is almost the same across the number of cores, which
indicates that the complexity of our algorithms is linear in the number of cores.
If the algorithms are compiled into executables, a modest 30X improvement is expected
(verified through execution of a few matrix operations implemented in Matlab and
C). With this, it can be verified that the computation times shown in the table for both the
voltage-speed scaling and the task-to-core allocation schemes have small overhead (less
than 6%) compared to the scheduling and the migration intervals used in the experiments
respectively.
Chapter 5
Performance Optimal Dynamic Voltage and Frequency Scaling with Hard Deadlines
5.1 Introduction
This chapter addresses the problem of determining the feasible speeds and voltages of
multi-core processors with hard real-time and temperature constraints. This is an important
problem, which has applications in time-critical execution of programs like audio and video
encoding on application-specific embedded processors. Two problems are solved. The
first is the computation of the optimal time-varying voltages and speeds of each core in a
heterogeneous multi-core processor, that minimize the makespan – the latest completion
time of all tasks, while satisfying timing and temperature constraints. The solution to
the makespan minimization problem is then extended to the problem of determining the
feasible speeds and voltages that satisfy task deadlines. The methods presented in this chapter
also provide a theoretical basis and analytical relations between speed, voltage, power and
temperature, which provide greater insight into the early-phase design of processors and
are also useful for online dynamic thermal management.
5.2 Problem Statement
Given an n-core processor executing n tasks, let Itot,i be the number of instructions to be
executed and IPCi be the corresponding instructions per cycle of task i. Let xi(t) denote
the number of instructions of task i completed by time t. The problem is to determine
the speeds and voltages of cores such that all tasks complete their execution within their
deadlines td, while satisfying the thermal constraints.
Further, among the set of deadline feasible solutions, a solution with the minimum
makespan is sought. Note that this is a natural extension to the problem under consideration
as cores need to execute as fast as possible in order to satisfy the deadlines, especially when
constrained by a maximum temperature. This problem is formulated as follows.
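\[ \min_{s(t),\, v(t)} \; t_f = \max_{1 \le i \le n} t_{f,i}, \tag{5.1} \]
\[ \text{s.t.}\quad \frac{dx_i(t)}{dt} = \mathrm{IPC}_i(t)\, s_i(t), \quad \forall t, i, \tag{5.2} \]
\[ x_i(0) = 0, \quad x_i(t_{d,i}) = I_{\mathrm{tot},i}, \quad \forall i, \tag{5.3} \]
\[ T_{i,b}(t) = f\big(P(T(t), s(t), v(t))\big), \quad \forall t, i, b, \tag{5.4} \]
\[ T_{i,b}(0) = T_{ib0}, \quad \forall i, b, \tag{5.5} \]
\[ T_{i,h}(t) = \max_{b} \big(T_{i,b}(t)\big) \le T_{\max}, \quad \forall t, i, \tag{5.6} \]
\[ s_i(t) \le k_{s_i} \frac{(v_i(t) - v_{th})^{1.2}}{v_i(t)\, T_{i,h}(t)^{1.19}}, \tag{5.7} \]
\[ 0_{n\times 1} \le v(t) \le 1_{n\times 1}, \quad \forall t. \tag{5.8} \]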
In the above formulation, tf represents the final completion time or the makespan of
all tasks, whose completion times are given by tf,i. Each task starts at time 0 and should
finish its execution by its deadline td,i as stated in (5.3). The rate of execution is related to the
speed of a core through (5.2). Each core has to be operated at a speed such that the
temperature of its hottest block Ti,h does not exceed the maximum temperature Tmax. In this work, Tmax
is the maximum specified junction temperature TJmax of a chip, which is the temperature
limit for safe and reliable operation of a processor as set by the manufacturer [66].
Equation (5.4) shows a key difficulty in the temperature computation of each block
in a core. f is some yet unknown function. It captures the cyclic dependency between the
temperature and the leakage power of a core. This cyclic dependency, the leakage dependence
on temperature (LDT), is decoupled as shown in Section 2.2.
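With decoupling of the LDT, the original optimization problem is re-formulated as
follows:
\[ \min_{s(t),\, v(t)} \; t_f = \int_{0}^{t_f} 1 \, dt, \tag{5.9} \]
\[ \text{s.t.}\quad \frac{dx_i(t)}{dt} = \mathrm{IPC}_i(t)\, s_i(t), \quad \forall t, i, \tag{5.10} \]
\[ x_i(0) = 0, \quad x_i(t_{d,i}) = I_{\mathrm{tot},i}, \quad \forall i, \tag{5.11} \]
\[ T_{i,h}(t) = \max_{b} \big(T_{i,b}(t)\big), \quad \forall t, i, \tag{5.12} \]
\[ \zeta_{i,h} R_{i,h} \big[ T_p(t) + P'_{i,h}(s_i, v_i, t) \big] \le T_{\max}, \quad \forall t, i, \tag{5.13} \]
\[ P'_{i,b}(s_i, v_i, t) = P_{\mathrm{lkg0},i,b} + k^{v}_{i}\, v_i(t) + s_i(t)\, v_i^{2}(t)\, P^{\max}_{\mathrm{dyn},i,b}(t), \quad \forall t, i, b, \tag{5.14} \]
\[ \frac{dT_p(t)}{dt} = -\frac{T_p(t)}{R'_p C_p} + \frac{P'(s, v, t)}{C_p}, \quad \forall t, \tag{5.15} \]
\[ P'(s, v, t) \triangleq \sum_{i=1}^{n} \sum_{b=1}^{m} \zeta_{i,b}\, P'_{i,b}(s_i, v_i, t), \tag{5.16} \]
\[ T_p(0) = T_{p0}, \tag{5.17} \]
\[ s_i(t) \le k_{s_i} \frac{(v_i(t) - v_{th})^{1.2}}{v_i(t)\, T_{i,h}(t)^{1.19}}, \quad \forall i, \tag{5.18} \]
\[ 0_{n\times 1} \le v(t) \le 1_{n\times 1}, \quad \forall t. \tag{5.19} \]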
Ri,h, P′i,h and ζi,h correspond to the hottest block h of core i as determined in (5.12).
The die temperature (5.13) is the same as (2.19), but for the hottest block h and
replaces (5.6). Equation (5.15) represents the fact that the die temperature is expressed in
terms of the package temperature. The initial conditions on the die temperatures (5.5) are
changed in terms of the initial condition on the package temperature (5.17).
The above formulation falls under minimum-time problems in optimal control the-
ory [58]. N and Tp are state variables with fixed boundary conditions. The mixed control-
state point-wise inequality (5.13) complicates the derivation of the optimal solution. The
solution is obtained in two steps. In the first step, task deadlines are ignored, and the op-
timal speeds and voltages that minimize the makespan subject to a maximum temperature
constraint are derived (see Section 5.3). Next in Section 5.4 a solution that meets the dead-
lines is constructed which achieves the minimum makespan among all deadline feasible
solutions.
5.3 Optimal Solution for Minimum Makespan without Deadlines
If the deadlines are ignored then the constraint on completion times in (5.11) changes to
xi(t f ) = Itot,i,∀i. Then the formulation is similar to the one presented in Section 3.3. The
optimal solution is also the same as in Section 3.3. The optimal speed of core i is given by
\[ s_i^{*}(t) = \begin{cases} 1, & T_{i,h}(t) < T_{\max}, \\ 0, & T_{i,h}(t) > T_{\max}, \\ s_{\mathrm{maxh},i}(t), & T_{i,h}(t) = T_{\max}. \end{cases} \tag{5.20} \]
The quantity smaxh,i is the speed of core i which maintains the hottest block of core i at the
maximum temperature. The corresponding optimal voltage is given by (3.18).
To compute s∗i, several quantities need to be determined. These are: (1) Tp; (2) h,
the identity of the hottest block in core i, and its temperature Ti,h; (3) tm,i, the time at
which the temperature of core i reaches Tmax; and (4) smaxh,i. These quantities are all
interdependent. Consequently, they must be computed iteratively at discrete points in time. The
time interval between two successive computations, denoted by ts, is called the scheduling
interval, which is on the order of the die thermal time constant (i.e. a few milliseconds).
Let tk = kts denote the kth time point at which the speed si and voltage vi are to be
updated. The initial package temperature Tp(0) is known. The following is the sequence
of steps involved in computing (5.20).
1. Using (2.21), compute \( T_p(t_k) = T_p(t_{k-1}) + \frac{dT_p(t_{k-1})}{dt}\, t_s \).
2. Identify the hottest block h = hi(tk) in core i, and compute Ti,h(tk). Note that the
identity of the hottest block changes with time, as does its temperature. However, at
a fixed t, the identity does not change with speed and voltage because all blocks are
affected in the same way by a core's speed and voltage. Therefore, P′i,b(si = 1, vi =
1, tk), ∀b is computed using (2.18). This is then substituted into (2.19) to compute
Ti,b(tk), ∀b. Then \( T_{i,h}(t_k) = \max_{b} \{ T_{i,b}(t_k) \} \).
3. If Ti,h(tk) < Tmax then set s∗i(tk) = 1, and compute the corresponding v∗i(tk) using
(3.18). Otherwise, if Ti,h(tk) ≥ Tmax and Ti,h(tk−1) < Tmax, set tm,i = tk, and si(tk) =
smaxh,i(tk). The computation of smaxh,i(tk) is described next.
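Steps 1 and 3 above can be sketched in a few lines. The package update below is a forward-Euler step of (5.15), used here on the assumption that (2.21) has the same form; the block-level temperature model of (2.18) and (2.19) and the voltage relation (3.18) are not reproduced, so the hottest-block temperature and smaxh,i are passed in as plain numbers. All constants are hypothetical, and temperatures are taken relative to ambient because (5.15) carries no ambient term.

```python
def package_temperature_step(t_p: float, p_total: float,
                             r_p: float, c_p: float, t_s: float) -> float:
    """One forward-Euler step of (5.15):
    dTp/dt = -Tp / (R'p * Cp) + P' / Cp."""
    dtp_dt = -t_p / (r_p * c_p) + p_total / c_p
    return t_p + dtp_dt * t_s


def choose_speed(t_hot_full_speed: float, t_max: float,
                 s_maxh: float) -> float:
    """Step 3: run at full speed (1.0) while the hottest block, evaluated
    at s = 1 and v = 1, stays below t_max; otherwise throttle to s_maxh."""
    return 1.0 if t_hot_full_speed < t_max else s_maxh


# Hypothetical constants: R'p in degC/W, Cp in J/degC, ts in seconds.
R_P, C_P, T_S = 0.5, 100.0, 0.01

t_p_next = package_temperature_step(t_p=35.0, p_total=80.0,
                                    r_p=R_P, c_p=C_P, t_s=T_S)
speed_cool = choose_speed(t_hot_full_speed=100.0, t_max=110.0, s_maxh=0.8)
speed_hot = choose_speed(t_hot_full_speed=112.0, t_max=110.0, s_maxh=0.8)
```

With these constants the package settles where Tp = R'p * P', i.e. 40 degrees above ambient for 80 W, which is a quick sanity check on the sign conventions.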
Determination of smaxh,i
smaxh,i is defined as the speed necessary to maintain the temperature of the hottest block at
the maximum temperature Tmax. Equating Ti,b to Tmax for the hottest block h in (2.19), the
package temperature is given by
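\[ T_p(t) = \big[ T_{\max} - P'_{i,h}(s_{\mathrm{maxh},i}, v_{\mathrm{maxh},i}, t)\, R_{i,h} \big] / \zeta_{i,h}, \quad \forall i. \tag{5.21} \]
P′i,h is the apparent power dissipation of block h in core i that corresponds to the speed
smaxh,i and the voltage vmaxh,i.
Substituting the above equation in (2.21), P′i,h is expressed as
\[ \frac{dP'_{i,h}(s_{\mathrm{maxh},i}, v_{\mathrm{maxh},i}, t)}{dt} = \gamma_i - \alpha_i P'_{i,h}(s_{\mathrm{maxh},i}, v_{\mathrm{maxh},i}, t) - \sum_{\substack{c=1 \\ c \ne i}}^{n} \sum_{\substack{b=1 \\ b \ne h}}^{m} \beta_{c,b}\, P'_{i,b}(s_{\mathrm{maxh},i}, v_{\mathrm{maxh},i}, t), \tag{5.22} \]
where
\[ \alpha_i \triangleq \frac{R_{i,h} + \zeta_{i,h} R'_p}{R'_p C_p R_{i,h}}, \qquad \beta_{c,b} \triangleq \frac{\zeta_{c,b}}{C_p R_{c,b}}, \qquad \gamma_i \triangleq \frac{T_{\max}}{R'_p C_p R_{i,h}}. \]
The solution to (5.22) is
\[ P'_{i,h}(s_{\mathrm{maxh},i}, v_{\mathrm{maxh},i}, t) = P'_{i,h,0}\, e^{-t/\tau_i} + P'_{i,h,ss} \big( 1 - e^{-t/\tau_i} \big), \quad \forall t > t_{m,i}, \tag{5.23} \]
where P′i,h,0 is the apparent initial power consumption, P′i,h,ss is the apparent steady-state
power consumption of the hottest block of core i, and τi is the time constant of the power
curve of the hottest block. These quantities are computed as follows.
\[ P'_{i,h,0} = \big[ T_{\max} - \zeta_{i,h} T_{p0} \big] / R_{i,h}, \tag{5.24} \]
\[ \tau_i = \Big( \alpha_i + \sum_{c \in n_a,\, c \ne i} \beta_c \Big)^{-1}, \tag{5.25} \]
\[ P'_{i,h,ss} = \gamma_i \tau_i. \tag{5.26} \]
In (5.25), na is the set of active cores, whose operating frequencies are greater than zero.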
All the terms on the R.H.S of (5.23) are known. The L.H.S of (5.23) is an expression
in the unknowns smaxh,i and vmaxh,i. The expression is obtained by substituting h for b,
smaxh,i for si and vmaxh,i for vi in (2.18). This results in one equation in the unknowns smaxh,i
and vmaxh,i. The second equation that relates these two quantities is (3.18). Solving these
two simultaneously yields the values of smaxh,i and vmaxh,i.
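The right-hand side of (5.23) is a first-order exponential that moves from the initial apparent power to the steady-state apparent power with time constant τi. The sketch below evaluates it from (5.25) and (5.26). The numbers are hypothetical and the time argument is used exactly as it appears in (5.23); turning the resulting power into smaxh,i and vmaxh,i additionally requires (2.18) and (3.18), which are not reproduced here.

```python
import math
from typing import Sequence


def throttled_apparent_power(t: float, p0: float, alpha: float,
                             betas: Sequence[float], gamma: float) -> float:
    """Evaluate the right-hand side of (5.23).

    p0    -- apparent initial power P'_{i,h,0} from (5.24)
    alpha -- alpha_i of the core under consideration
    betas -- beta terms of the other active cores, summed in (5.25)
    gamma -- gamma_i
    """
    tau = 1.0 / (alpha + sum(betas))      # (5.25)
    p_ss = gamma * tau                    # (5.26)
    decay = math.exp(-t / tau)
    return p0 * decay + p_ss * (1.0 - decay)


# Hypothetical values chosen so that tau = 1 s and P'_ss = 20 W.
p_start = throttled_apparent_power(0.0, p0=30.0, alpha=0.5,
                                   betas=[0.25, 0.25], gamma=20.0)
p_late = throttled_apparent_power(50.0, p0=30.0, alpha=0.5,
                                  betas=[0.25, 0.25], gamma=20.0)
```

When a core finishes and leaves the active set, its β term drops out of the sum, which lengthens τi and raises the steady-state power available to the remaining cores.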
5.4 Minimum Makespan with Deadlines
Minimizing makespan does not guarantee that task deadlines will be satisfied. Figure 5.1
shows a hypothetical example of a two-core processor executing two tasks with deadlines.
Figure 5.1(a) shows the minimum makespan solution, that violates the task deadlines. Fig-
ure 5.1(b) shows a desired execution of the same tasks that respects both the task deadlines
and the thermal constraints. Since minimum makespan is a global objective, over some
interval it may execute a task at slower speed than the minimum speed required to meet the
task deadline.
This section addresses the problem of determining the speed si(t) and the voltage
vi(t) of each core so that the given task deadlines are satisfied.
The proposed method constructs a deadline feasible solution, if one exists. More-
over, such a solution will have the minimum makespan.
[Figure 5.1 plots: speed si(t) versus time t for Task 1 and Task 2. (a) Minimum makespan solution violating deadline constraints, with the deadline td,2, the deadline violation and the makespan marked. (b) Desired execution satisfying the deadline constraints, with td,2 and the makespan marked.]
Figure 5.1: Minimizing makespan may lead to deadline violations.
Without loss of generality it is assumed that the n tasks are numbered and ordered
by their deadlines such that td,1 < td,2 < · · ·< td,n.
If the MMS satisfies the deadlines, then it is the solution sought. Therefore it is
assumed that the MMS has at least one deadline violation.
In general, if in a MMS the completion time te,n of the task with the latest deadline exceeds td,n, there can be no deadline feasible solution. This is because, by the definition of a MMS, [0, te,n] is the shortest interval within which all instructions of all tasks are completed, and hence te,n cannot be reduced. Therefore it is assumed that task i, for some i ≤ n−1, is the first task that violates its deadline. Such a task is referred to as a critical task. The intervals in which the speed of task i can be modified are denoted by I1 = [0, te,1], I2 = [te,1, te,2], . . . , Ii = [te,i−1, td,i] (see Figure 5.2(a)).
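The identification of the critical task and of its candidate intervals can be sketched as follows. This is only an illustration under the stated ordering assumption (tasks sorted by deadline); the function names and the example completion times and deadlines are hypothetical.

```python
def first_critical_task(completion, deadline):
    """Return the 0-based index of the first task, in deadline order,
    whose completion time exceeds its deadline, or None if none does."""
    for i, (te, td) in enumerate(zip(completion, deadline)):
        if te > td:
            return i
    return None


def candidate_intervals(completion, deadline, i):
    """Intervals I_1 .. I_i for critical task i (0-based index):
    [0, te_1], [te_1, te_2], ..., [te_{i-1}, td_i]."""
    bounds = [0.0] + list(completion[:i]) + [deadline[i]]
    return list(zip(bounds[:-1], bounds[1:]))


# Assumed example: the third task (index 2) is the first to miss its deadline.
completion = [10.0, 20.0, 35.0, 40.0]
deadline = [15.0, 25.0, 30.0, 45.0]
critical = first_critical_task(completion, deadline)
intervals = candidate_intervals(completion, deadline, critical)
```

The intervals are examined from the last one backwards, so that the speed profile of the critical task is disturbed over as short a window as possible.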
Figure 5.2: (a) An example showing a deadline violation for zero-slack execution of four tasks on a four-core processor. (b) A possible deadline feasible solution for the example in (a). [Plots of s(t) versus t for Tasks 1 to 4, with the deadlines td,1, td,2, td,3, the completion times te,1 to te,4 and the intervals I1, I2, I3 marked.]
Speeding up task i to reduce its completion time must be done so that te,i = td,i. If
task i is sped up so that te,i < td,i, then this could lead to 1) deadline violations of earlier
tasks, or 2) as a consequence of the zero-slack policy, reduction in throughput which cannot
be regained, and can cause deadline violations of later tasks. In other words a deadline
feasible solution with te,i < td,i would have a makespan no smaller than one with te,i = td,i.
Critical Task Voltage-speed Determination
The method to adjust the speed profile of a critical task is explained with an example.
Figure 5.2(a) shows a MMS with four tasks, in which Task 3 is the critical task. Fig-
ure 5.2(b) shows the desired deadline feasible solution. The modification of the speed of
Task 3 is done by successively examining the intervals I3 = [te,2, td,3], I2 ∪ I3 = [te,1, td,3]
and I1∪ I2∪ I3 = [0, td,3].
Consider the interval I3 = [te,2, td,3]. The objective is to determine s3(t) and v3(t)
in I3. The initial package temperature in I3 is Tp(te,2), and is retained from the original
makespan solution. The final package temperature in I3, viz. Tp(td,3) needs to be reduced
to increase s3(t).
Recall that in a MMS, the speed s3(t) = 1 until the time tm,3, which is the time when
T3,h = Tmax. Therefore if s3(te,2)< 1, then tm,3 < te,2. If s3(te,2) = 1, then tm,3 ∈ [te,2, td,3].
Suppose s3(te,2)< 1. Then s3(t) and v3(t) for t ∈ [te,2, td,3] are the same as smaxh,3(t)
and vmaxh,3(t). They are determined as follows.
1. Since te,2 is the left endpoint of the interval, the apparent initial power P′i,h,0 in (5.24) refers to P′3,h(te,2). Therefore P′3,h(te,2) = [Tmax−ζ3,hTp(te,2)]/R3,h.
2. Let P′3,h,lb(td,3) and P′3,h,ub(td,3) be the lower and the upper bounds on P′3,h(td,3) corresponding to speeds s3(td,3) = 0 and s3(td,3) = 1 respectively. Note that the corresponding v3(td,3) is derived from (3.18) with T3,h(td,3) = Tmax. Let Tp,lb(td,3) and Tp,ub(td,3) be the corresponding lower and upper bounds on the package temperature Tp(td,3) obtained by substituting P′3,h,lb(td,3) and P′3,h,ub(td,3) respectively for t = td,3 in (5.21).
3. Perform a binary search on Tp(td,3) ∈ [Tp,lb(td,3), Tp,ub(td,3)] and find the corresponding P′3,h(td,3) = [Tmax−ζ3,hTp(td,3)]/R3,h. Substituting P′3,h(td,3) in (5.23) for t = td,3 and using (5.25), compute P′3,h,ss and τ3 numerically.
4. Now the R.H.S of (5.23) is known for all t ∈ [te,2, td,3]. This is equated to the R.H.S of (2.18) for P′3,h, which yields one equation in smaxh,3 and vmaxh,3. These are solved numerically along with (3.18).
5. Let Irem,3 = Itot,3 − ∫_0^{td,3} IPC3(t) s3(t) dt be the remaining number of unexecuted instructions in [0, td,3]. If Irem,3 ≠ 0, Steps 3, 4 and 5 are repeated until Irem,3 = 0. This is done by repeatedly choosing Tp(td,3) ∈ [Tp,lb(td,3), Tp,ub(td,3)].
If s3(te,2) = 1, then tm,3 ∈ [te,2, td,3]. In this case tm,3 is determined through a binary search. For every candidate value of tm,3 in this search, Steps 1 – 5 above are repeated, replacing te,2 by tm,3.
If the above steps do not yield a deadline feasible solution for Task 3, then the search interval is expanded to I2 ∪ I3 = [te,1, td,3] by setting Tp(td,3) = Tp,lb(td,3). The above steps are repeated to search for the appropriate package temperature at te,2, i.e. Tp(te,2) (instead of Tp(td,3)), such that Irem,3 = Itot,3 − ∫_0^{td,3} IPC3(t) s3(t) dt = 0.
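The binary search on the package temperature described in the steps above can be sketched as a standard bisection. The sketch assumes that the remaining instruction count is a continuous function of the chosen Tp(td,3) that changes sign over the bracket; the callback `remaining_instructions` stands in for Steps 3 to 5 and its toy form below is purely an assumption for illustration.

```python
def bisect_package_temperature(remaining_instructions, tp_lb, tp_ub,
                               tol=1e-9, max_iter=200):
    """Find Tp in [tp_lb, tp_ub] for which remaining_instructions(Tp) is ~0.

    Returns None when there is no sign change over the bracket, which here
    corresponds to "no solution in this interval, expand the interval".
    """
    f_lo = remaining_instructions(tp_lb)
    f_hi = remaining_instructions(tp_ub)
    if f_lo == 0.0:
        return tp_lb
    if f_hi == 0.0:
        return tp_ub
    if f_lo * f_hi > 0.0:
        return None
    lo, hi = tp_lb, tp_ub
    for _ in range(max_iter):
        mid = 0.5 * (lo + hi)
        f_mid = remaining_instructions(mid)
        if abs(f_mid) <= tol or (hi - lo) <= tol:
            return mid
        if f_lo * f_mid < 0.0:
            hi = mid                # root lies in the lower half
        else:
            lo, f_lo = mid, f_mid   # root lies in the upper half
    return 0.5 * (lo + hi)


# Toy stand-in for Steps 3-5 (assumed, not the thermal model of the text).
toy_remaining = lambda tp: 0.5 * (tp - 70.0)
tp_star = bisect_package_temperature(toy_remaining, 60.0, 85.0)
```

Each evaluation of the callback corresponds to one numerical solution of Steps 3 and 4, so the number of bisection iterations dominates the cost of handling one critical task.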
Speeds of non-critical tasks
Let [tsc, tec] be the interval in which the critical task speed was altered from its original MMS. The speeds of the non-critical tasks need to be determined only in this interval. First, the package temperature Tp(t), t ∈ [tsc, tec], that corresponds to the power profile P′i,h of the hottest block of the critical task is obtained from (5.21). Then dTp(tk)/dt (tk = k ts) is computed for every scheduling interval ts as
\[ \frac{dT_p(t_k)}{dt} \approx \frac{T_p(t_k) - T_p(t_{k-1})}{t_s}. \qquad (5.27) \]
Since both dTp(tk)/dt and Tp(tk) are known, the total apparent power consumption for all cores P′(tk) in the kth scheduling interval can be determined from (2.21). Let P̄′(tk) = P′(tk) − P′i(tk) (critical task is i) be the remaining total apparent power budget that has to be allocated to the remaining tasks at time tk. This allocation needs to ensure
Procedure: non_critical(critical_task, interval, first_act_task);
Input: Critical task: i, Interval: [tsc, tec]
Output: sc(t), vc(t), ∀c = 1, · · · , n, c ≠ i, t ∈ [tsc, tec]
Find Tp(t), t ∈ [tsc, tec] using P′i,h in (5.21);
Run zero-slack policy for tasks i+1, · · · , n in [td,i, td,n] with Tp0 = Tp(td,i) and
    I′tot,c = Itot,c − ∫_0^{tsc} IPCc(t) sc(t) dt, ∀c = i+1, · · · , n;
Let D be an ordered set of tasks that violated deadlines;
I′rem,c = Itot,c − ∫_0^{td,c} IPCc(t) sc(t) dt, ∀c = i+1, · · · , n;
for tk ∈ [tsc, tec] do
    Find dTp(tk)/dt (5.27) and P′(tk) (2.21);
    P̄′(tk) = P′(tk) − P′i(tk);
    for task c from first_act_task to n do
        Find P′c,max(tk) from (2.20) (Tc,h = Tmax);
        if (c < i and ∫_{tsc}^{tk} IPCc(t) sc(t) dt < I′rem,c) or (c > i) then
            if P′c,max(tk) ≤ P̄′(tk) then
                P′c(tk) = P′c,max(tk);
            else
                P′c(tk) = P̄′(tk);
            end
        else
            P′c(tk) = 0;
        end
        Find sc(tk) and vc(tk) by solving (2.18), (2.20) and (3.18) simultaneously;
        P̄′(tk) = P̄′(tk) − P′c(tk);
        if P̄′(tk) = 0 then
            break;
        end
    end
end
Algorithm 2: Procedure for speed determination of non-critical tasks.
1. that the previous tasks 1, · · · , i−1 satisfy their deadlines, and
2. that the number of deadline violations in future tasks i+1, · · · , n is minimized.
Since ensuring that the previous tasks satisfy their deadlines is important, they get higher priority, in the order of their deadlines, while allocating the power budget. Note that the previous tasks are not allocated any more power than is required to satisfy their deadlines. If any of these deadline constraints cannot be satisfied, then there is no deadline feasible solution that satisfies the deadlines of all tasks.
To determine the priority of allocation of power budget to the rest of the tasks
i+1, · · · ,n, the following procedure is followed.
1. Run the zero-slack policy for the interval [td,i, td,n], for tasks i+1, · · · , n, with I′tot,c = Itot,c − ∫_0^{tsc} IPCc(t) sc(t) dt instructions, ∀c.
2. Let D denote the temporal order of tasks whose deadlines have been violated un-
der the zero-slack execution. Let I′rem be the corresponding number of unexecuted
instructions for tasks D .
3. The tasks in D are processed first, followed by the remaining tasks in the order of
their deadlines.
The above steps are necessary to determine the number of instructions that are critical to avoid future deadline violations. Hence those tasks that have deadline violations are given higher priority. During the allocation of the power budget in the interval [tsc, tec], if any of the tasks from D execute their I′rem instructions, then those tasks are removed from D and are assigned normal priority as they are no longer critical. Note that during the allocation of power, the maximum power that can be allocated to a task in a scheduling interval is fixed. The power consumption of the hottest block is bounded by Tp and Tmax as seen from (2.19). This in turn limits the maximum speed of that core and its total power consumption.
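The priority-ordered allocation of the remaining apparent power budget within one scheduling interval can be sketched as a greedy loop. This is a simplified illustration: the conversion from allocated power to speed and voltage (solving (2.18), (2.20) and (3.18)) is left out, and the function name, the task labels and the power values are assumptions made for the example.

```python
def allocate_power(budget, p_max, priority_order, needs_power=None):
    """Greedily hand out the remaining apparent power budget.

    budget         -- remaining budget after the critical task's share
    p_max          -- dict: task -> maximum power it may receive this interval
    priority_order -- tasks in decreasing priority
    needs_power    -- optional dict: task -> False if it needs no more power
    Returns (allocation dict, unallocated budget).
    """
    needs_power = needs_power or {}
    alloc = {task: 0.0 for task in p_max}
    remaining = budget
    for task in priority_order:
        if remaining <= 0.0:
            break                          # budget exhausted
        if not needs_power.get(task, True):
            continue                       # task already has what it needs
        alloc[task] = min(p_max[task], remaining)
        remaining -= alloc[task]
    return alloc, remaining


# Assumed numbers: three non-critical tasks ordered by deadline.
alloc, left = allocate_power(
    5.0,
    {"IFFT": 4.0, "basicmath": 3.0, "GSM dec.": 3.0},
    ["IFFT", "basicmath", "GSM dec."],
)
```

With a tight budget the lowest-priority tasks receive no power at all in the interval, which is the situation in which a task is held at zero speed until an earlier deadline has passed.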
Procedure: find_spd_deadline(Tp(0), Tmax);
Let L = 0;
Run the zero-slack policy for all cores (Section 5.3);
while any task violates deadlines do
    Let task i first violate (critical task);
    Let R = i, tsc = te,R−1 and tec = td,i;
    while task i violates deadline do
        Determine (si, vi) profile for task i for interval [tsc, tec] (Section 5.4);
        if ∫_0^{td,i} IPCi(t) si(t) dt ≠ Itot,i then
            if R = L then
                No deadline feasible solution exists;
                return;
            else
                R = R−1, tsc = te,R−1 and tec = te,R;
            end
        else
            non_critical(si, [tsc, tec], i);
            L = i;
            Run zero-slack policy for cores i+1, · · · , n;
        end
    end
end
Algorithm 3: Procedure for feasible speed determination of tasks with deadlines under thermal constraints.
The details of the voltage-speed determination of the non-critical tasks are summarized in Algorithm 2 and the overall procedure of deadline feasible speed determination in Algorithm 3.
The above procedure is applicable even in the case of multiple tasks sharing the same deadline. One among the multiple tasks is identified as the critical task and the rest of the tasks, including tasks that share the critical task deadline, are grouped as non-critical tasks.
The time-complexity of the deadline-feasible speed determination algorithm is O(n (log n) td,n/ts), where td,n/ts is the total number of scheduling intervals, and log n is due to the binary search on Tp, to determine the speeds of critical tasks.
Figure 5.3: Plot of speeds, voltages and temperature of the hottest block of cores when executing under zero-slack policy. [Three panels versus Time (s): Temperature (◦C), Speed (GHz) and Voltage (V), for tasks basicmath, qsort, IFFT and GSM dec.; portion ‘A’ is annotated in the speed panel.]
5.5 Experimental results
Makespan Minimization for Tasks without Deadlines
Figure 5.3 shows the plot of speeds, voltages and temperatures of the hottest blocks of cores under the optimal MMS for a four core processor executing four tasks. The details of the tasks along with their instruction length are shown in Table 5.1. Since the instruction lengths of the original benchmarks were too small to observe an appreciable influence of temperature on the execution of tasks, we repeatedly executed the benchmarks for the duration of the package time constant. The corresponding numbers of instructions are noted in Table 5.1. The deadlines were chosen such that an individual benchmark when executed alone on a multi-core processor at the maximum speed should complete its execution by its deadline, but not when executed with other tasks at the maximum feasible speeds. When multiple tasks execute on a processor, the power dissipation from all the tasks increases the package temperature, thus reducing the operational speed of cores.
Figure 5.4: Plot of speeds, voltages and temperature of the hottest block of cores when executing under optimal MMS policy. [Three panels versus Time (s): Temperature (◦C), Speed (GHz) and Voltage (V), for tasks basicmath, qsort, IFFT and GSM dec.; portion ‘A’ is annotated in the speed panel.]
As mentioned in Section 5.3, the MMS determines the core speeds and voltages
such that the temperature of the hottest block is maintained at the maximum temperature
as seen in Figure 5.3. Note that the fluctuations in the core speeds and voltages are due to
the time-varying power profiles.
Tasks                 basicmath   qsort   IFFT   GSM dec.
Deadlines (seconds)   60          25      40     75
Table 5.1: Instruction length, IPC, average dynamic power and deadlines of the tasks used in the experiment.
For the purpose of demonstrating the MMS, we ignore the deadlines in this section.
Note that the completion of a task allows other tasks to execute at a higher speed due to
reduced power dissipation. This we observe after the completion of task qsort in Figure 5.3.
Since we did not find equivalent policies that determine time-varying DVFS considering time-varying power profiles for multi-core processors, we could not provide a comparison of our policy with existing work.
Makespan Minimization for Tasks with Deadlines
We add the deadline constraints as shown in Table 5.1 to demonstrate that the optimal
MMS only ensures achieving the minimum makespan but does not guarantee satisfying the
deadlines. The deadline constraints are taken into consideration by our modified algorithm
as demonstrated in this section.
Tasks                   4      8      16     32     64
MMS (s)                 36.4   45.3   32.9   48.8   47.6
Modified algo. (s)      73.1   86.5   77.3   88.9   86.4
# violations            1      2      3      2      4
Comp. time/core (min)   2.1    2.15   2.03   1.97   2.18
Table 5.2: Comparison of makespan for the optimal MMS and the modified algorithm, the associated deadline violations of the optimal MMS, and the computation times.
Figure 5.4 shows the plot of speeds, voltages and temperatures of cores when sched-
uled according to the modified makespan minimization algorithm which satisfies the dead-
lines. Unlike the MMS which violated the deadlines (Figure 5.3), the modified algorithm
resulted in a higher makespan (73.1 s) to ensure that both the thermal and the deadline
constraints are satisfied for all tasks as seen from Figure 5.4.
Annotated portion ‘A’ in both Figs. 5.3 and 5.4 illustrates the differences in the speed control between the optimal MMS and the modified algorithm. Task qsort could not meet its deadline (25 s) when scheduled according to the makespan minimization procedure. This is because the speeds of all cores were kept at their thermally maximum speeds to minimize the overall makespan. This resulted in constraining the speed of task qsort,
which had the earliest deadline. On the other hand the modified procedure ensured that
the deadline constraint of task qsort is met by increasing its speed and constraining the
speeds of other non-critical tasks optimally. Among the non-critical tasks, IFFT had the
earliest deadline at 40 s. Hence it was allocated the maximum power out of the remaining
total power budget. This ensured that it executed at a higher speed compared to other non-
critical tasks as seen in Figure 5.4. Note that the speeds of tasks basicmath and GSM dec.
are zero in the first interval [0–25] as the remaining power budget was not sufficient to
allocate for these tasks.
Table 5.2 shows the results of additional experiments with the number of tasks (or
cores) ranging from 4 to 64. The number of instructions and the corresponding deadlines
for tasks were chosen randomly. It also lists the number of deadline violations, when
scheduled according to the MMS and the computation time for the modified algorithm.
Our policies are implemented in Matlab and run on an Intel Core 2 Duo processor at 1.67 GHz with 2 GB RAM under the Linux operating system. Although the computation time is high
due to the large number of numerical simulations required by our algorithm, it is to be
noted that the algorithm is run offline. Hence the computation time is not critical. Also
note that our algorithms can be parallelized to reduce the computation time.
Chapter 6
Minimizing Peak Temperature using Dynamic Voltage and Frequency Scaling
6.1 Introduction and Background
We have seen DTM being used for maximizing performance under thermal constraints with and without deadline constraints on tasks. However, a more critical issue with high performance multi-cores will be reliability due to increasing die temperatures. A 10◦C-15◦C increase in temperature can reduce the lifespan of a device by half [67]. The ITRS has predicted a rapid onset of significant lifetime reliability problems. It is expected that, in the future, processor cost and performance specifications will be significantly affected by lifetime reliability, which will take over as the primary factor in processor design, superseding performance requirements.
Several researchers have recently focused on DTM targeting reliability. Refer-
ences [68, 69] proposed models for reliability estimation at the chip-level and showed that
the reliability can be traded with performance. The authors of [70] provided an analysis
on the trade-off of power consumption with performance and reliability through the use
of various power management techniques. In [23], a mixed-integer linear program was
used to determine the optimal task schedule to reduce the peak temperature with task dead-
lines. A heuristic was developed in [22] to sequence the tasks on a single-core processor
to minimize the peak temperature with timing constraints. The authors of [71] evaluated
a large number of techniques to study the effect of job scheduling and power management
techniques on the system reliability.
The above works attempted to address the issue of system reliability from the DTM perspective and proposed techniques for optimal task scheduling and sequencing. However, these previous works do not address the important transient optimal control of the frequencies of the processor. The determination of these frequencies is complicated by the non-linear cyclic relation between the leakage power and the temperature. It gets even more complicated with the addition of task deadlines as seen in the previous chapters. In this chapter, we present the first quasiconvex programming solution for determining the speeds of a multi-core processor to ensure that all tasks meet their deadlines with the minimum peak temperature. Experimental results demonstrate that our approach helps in improving the lifetime of devices by reducing the peak temperature by 8◦C for a sample execution of tasks from SPEC benchmarks. We also show a practical implementation of our approach with discrete speed states.
6.2 Reliability Model
Reliability of a core is given by the reliability of the weakest interconnect in the core. The
granularity of our thermal and reliability analysis is at the level of functional blocks of
cores. Every core has a block which remains the hottest irrespective of the frequency of
operation, called the hottest block. Hence the reliability of a core can be computed by
calculating the reliability of the hottest block alone. Electromigration (EM) plays a major
role in the interconnect breakdown, thereby reducing the lifetime of a processor. The mean
time to failure (MTTF) t f ,i of core i caused due to EM is given by Black’s equation [68].
\[ t_{f,i} = \frac{A}{j^{n}_{i,h}}\, e^{\frac{Q}{k T_{i,h}}}, \qquad (6.1) \]
where A is a constant based on the interconnect structure, Q is the activation energy (0.6 eV for aluminum), ji,h and Ti,h are the current density and the temperature of the hottest functional block h in core i, k is Boltzmann’s constant, and n = 1 or n = 2 depending on the failure mode. The only aspect of (6.1) relevant here is that the MTTF decreases exponentially with increase in the temperature. Thus to maximize the lifetime of a device, the peak temperature of operation has to be minimized.
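The temperature sensitivity of (6.1) can be illustrated by comparing the MTTF at two operating temperatures, since the constant A cancels in the ratio. The sketch below assumes an unchanged current density unless a ratio is supplied, uses Q = 0.6 eV from the text, and treats the two temperatures as illustrative inputs; the function name is hypothetical.

```python
import math

BOLTZMANN_EV_PER_K = 8.617333262e-5  # Boltzmann's constant in eV/K


def mttf_ratio(t_new_c, t_ref_c, q_ev=0.6, j_ratio=1.0, n=2):
    """MTTF(new) / MTTF(ref) from Black's equation (6.1).

    t_new_c, t_ref_c -- hottest-block temperatures in degrees Celsius
    j_ratio          -- j_new / j_ref, ratio of current densities (assumed 1)
    n                -- current-density exponent (1 or 2)
    """
    t_new = t_new_c + 273.15  # convert to kelvin
    t_ref = t_ref_c + 273.15
    thermal = math.exp(q_ev / BOLTZMANN_EV_PER_K * (1.0 / t_new - 1.0 / t_ref))
    return thermal / (j_ratio ** n)


# Lowering the peak temperature from 110 C to 102 C at equal current density.
gain = mttf_ratio(102.0, 110.0)
```

Under these assumptions an 8◦C reduction from 110◦C lengthens the electromigration-limited lifetime by roughly one and a half times, which is why the peak temperature is taken as the quantity to minimize.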
6.3 Problem Statement and Approach
Problem Description
Consider an n core processor with each core i executing a task starting at time ts,i with Ni instructions to be executed within deadline td,i. Figure 6.1 shows an example of a four-core processor executing four tasks starting and ending at various times. The slots are numbered in an increasing fashion.
Figure 6.1: Creation of slots based on start and end times of tasks. [Timeline of Tasks 1 to 4 with start times ts,2, ts,3, ts,4, deadlines td,1, td,4, td,2, td,3, and the resulting slots numbered 1 to 7.]
Given this, the problem is to determine the optimal transient speeds of cores such that the deadline constraints for the tasks are satisfied while minimizing the peak temperature of operation Tmax. The problem is formulated as
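\[ \min_{s(t),\, T_{max}} \; T_{max}, \qquad (6.2) \]
\[ \text{s.t.} \quad \frac{dx(t)}{dt} = s(t), \quad \forall t, \qquad (6.3) \]
\[ x(t_s) = 0, \quad x(t_d) = I, \qquad (6.4) \]
\[ T_{i,b}(t) = \zeta_{i,b}\left[ T_p(t) + \left( P_{l,i,b} + s_i(t) P_{dyn,i,b} \right) R_{i,b} \right], \quad \forall t, \qquad (6.5) \]
\[ T(t) \le T_{max}, \quad \forall t, \quad T(0) = T_0. \qquad (6.6) \]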
In the above formulation, the objective is to minimize the peak temperature of exe-
cution while satisfying the deadline constraints specified by (6.4). ts and td are the vector
notations of the start and the end times respectively. The core speeds s and the peak tem-
perature Tmax are the variables of optimization.
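The slot construction of Figure 6.1 amounts to sorting the distinct start times and deadlines and pairing consecutive boundaries. The sketch below uses the start times and deadlines of Scenario 1 in Table 6.1 as example input; the function names are hypothetical.

```python
def build_slots(start_times, deadlines):
    """Split the execution horizon into slots at every start time and deadline."""
    points = sorted(set(start_times) | set(deadlines))
    return list(zip(points[:-1], points[1:]))


def active_tasks(slot, start_times, deadlines):
    """Indices of tasks whose [start, deadline] window covers the whole slot."""
    lo, hi = slot
    return [i for i, (ts, td) in enumerate(zip(start_times, deadlines))
            if ts <= lo and hi <= td]


# Scenario 1 of Table 6.1: bzip, gap, mcf, twolf.
starts = [0, 10, 24, 33]
ends = [60, 45, 124, 130]
slots = build_slots(starts, ends)
```

Four tasks with distinct start times and deadlines give eight boundaries and hence seven slots, matching the numbering in Figure 6.1.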
In the previous chapter, we solved a version of this problem, where we minimized makespan under a thermal constraint given task deadlines. Using this solution, an outline of the solution of the above problem is presented in the following section.
Solution Outline
The entire duration of execution is partitioned into several time slots based on the task
deadlines (see Figure 6.1), and the above optimization problem is solved for each time slot
in three steps.
1. Determine the optimal speed profiles of the multi-core processor to maximize the
throughput (sum of speeds) for given tasks under a fixed maximum temperature and
no deadline constraints.
2. Solve the unknown parameters of the above speed profiles to satisfy the boundary
conditions imposed by the start and the end times of tasks.
3. Find the minimum peak temperature under which Step 2 is still satisfied.
The first two steps were solved in the previous chapters. It will also be shown later in Section 6.3 that Step 2 is a quasiconvex function of the initial package temperature of the slot and the peak temperature Tmax. By quasiconvex, we mean that the function is unimodal and has a unique minimum value. Thus a quasiconvex optimization problem can be solved to determine the minimum Tmax. The quasiconvex optimization is elaborated in the following section.
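Step 3 can be pictured as a one-dimensional search over Tmax with a feasibility oracle for Step 2. The sketch below is a deliberate simplification: it assumes that feasibility is monotone in Tmax (any temperature above a feasible one is also feasible) and it ignores the joint search over the initial package temperatures of the slots. The oracle `is_feasible` is a hypothetical stand-in for the deadline-aware speed determination.

```python
def min_feasible_peak_temperature(is_feasible, t_lo, t_hi, tol=0.01):
    """Bisection for the smallest Tmax in [t_lo, t_hi] that is feasible.

    Returns None if even t_hi is infeasible. The returned value is feasible
    and within tol of the true threshold under the monotonicity assumption.
    """
    if not is_feasible(t_hi):
        return None
    if is_feasible(t_lo):
        return t_lo
    while t_hi - t_lo > tol:
        mid = 0.5 * (t_lo + t_hi)
        if is_feasible(mid):
            t_hi = mid   # deadlines can still be met at a lower peak
        else:
            t_lo = mid   # too cold: some deadline is violated
    return t_hi


# Toy oracle (assumed): deadlines can be met only at 102 C or above.
t_star = min_feasible_peak_temperature(lambda t: t >= 102.0, 60.0, 110.0)
```

Each oracle call hides a full deadline-aware speed computation for all slots, so the tolerance on Tmax directly trades accuracy against computation time.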
Quasiconvex Programming Solution for Minimizing Peak Temperature
From the previous chapter, we know how to determine the optimal core speeds to satisfy
task deadlines within a slot. We now need to determine the global minimum peak tem-
perature such that all tasks satisfy their deadlines. We note that in order to determine the
minimum peak temperature, the initial package temperature of each slot has to be deter-
mined optimally. This is necessary as the optimal core speeds in each slot are guided by
the initial package temperatures.
Figure 6.2: Two task example to demonstrate the quasiconvexity of (6.8). [Timeline of Task 1 and Task 2 with ts,2, td,1 and td,2 marked, slots 1 to 3, and the package temperature Tp1.]
We formulate the problem of determining the optimal initial package temperatures
of slots and the minimum peak temperature as a quasiconvex optimization problem.
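\[ \min_{T_p,\, T_{max}} \; T_{max}, \qquad (6.7) \]
\[ \text{s.t.} \quad s = \text{find\_spd\_deadline}(T_{p,l}, T_{max}), \quad \forall l, \qquad (6.8) \]
\[ T_p \le T_{max}, \quad T_p(0) = T_{p0}. \qquad (6.9) \]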
Here Tp,l is the initial package temperature of slot l. In the interest of clarity and for lack of space, the proof of the quasiconvexity of the above formulation is omitted. It is trivial to show that the objective (6.7) is quasiconvex. Procedure find_spd_deadline (see Algorithm 3) finds a feasible solution for Formulation (6.3) – (6.6) for a slot l. We give an intuitive explanation for the quasiconvexity of Procedure find_spd_deadline in (6.8) by considering a two task scenario with start and end times as shown in Figure 6.2. In this case, there is only one package temperature Tp1 to be determined. Let the optimal T∗p1 ∈ [Tp1,min, Tp1,max]. Suppose that Tp1 = Tp1,max + ε, where ε is any small positive value, violates the deadlines in either slot 2 or 3. This can be due to the lower initial speed in slots 2 and 3 caused by the higher Tp1, which affects the completion times of tasks 1 and 2. Increasing ε to any higher value will only reduce the initial speed in slots 2 and 3 further. Similarly, suppose that Tp1 = Tp1,min − ε leads to a deadline violation of task 1 due to the lower final speed of task 1 in slot 1. Increasing ε to any higher value will only decrease the speed of task 1 in slot 1. Thus there is a single continuous range of satisfiable values for Tp1. Hence (6.8) is quasiconvex over Tp1.
Scenario   Parameters                          bzip   gap    mcf     twolf
1          Start times (s)                     0      10     24      33
1          End times (s)                       60     45     124     130
1          Act. end times (s), Opt. policy     58.7   34.6   122.2   130
1          Act. end times (s), max-tput.       51.4   31.7   111.3   119.3
2          Start times (s)                     0      10     24      33
2          End times (s)                       45     30     104     112
2          Act. end times (s), Opt. policy     45     29.4   101.7   109.8
2          Act. end times (s), max-tput.       51.4   31.7   111.3   119.3
Table 6.1: Characteristics of Tasks used in the experiments.
6.4 Experimental Results
Optimal Policy vs Max-throughput Policy
Here we compare our proposed optimal policy with the max-throughput policy [16] for
two scenarios listed in Table 6.1. The tasks, their start times, deadlines and the actual
end times under both the policies are also listed in the table. For the sake of clarity, only
four tasks are used in the experiments, although our method works for any number of
cores and tasks. Consider the execution of Scenario 1 under both the optimal procedure and the max-throughput method as shown in Figs. 6.3a and 6.3b respectively. We find that both the policies satisfy the deadlines, but our optimal policy executes the tasks at a lower peak temperature of 102◦C, while the max-throughput policy executes the tasks under the nominal temperature of 110◦C until completion. The peak temperature can be reduced even further if the task deadlines are relaxed. It is interesting to note that the task gap finishes its execution early at 34.6 s under the optimal policy when its actual deadline is 45 s. This is because finishing the task gap any later would lead to a deadline violation of at least one of the tasks with future deadlines with 102◦C as the maximum temperature.
Next we execute tasks from Scenario 2 under both the optimal policy and the max-throughput policy as shown in Figure 6.4a and Figure 6.3b respectively. Since the max-throughput policy only tries to maximize the overall completion of instructions, it results in the same core speeds for a given set of instructions irrespective of any deadlines (for both Scenarios 1 and 2) and hence the same plots. From the figures, we see that the max-throughput policy fails to satisfy the deadlines for tasks bzip, gap and mcf as it is not able to increase the peak execution temperature beyond 110◦C, while our optimal policy executes the tasks optimally at the peak temperature of 118◦C. The method also selectively, but optimally, increases the speeds of the critical tasks to ensure that their deadlines are met. This demonstrates the flexibility and also the optimality of our proposed method in adapting to different task scenarios.
Figure 6.4b shows the practical speed implementation of the optimal scheduling
policy for the workload specified in scenario 2 with eight speed states. The optimal speeds
are discretized by selecting the highest discrete speed state that is feasible under the tem-
perature constraint at every instant. Note that due to the discretization of speeds, which is
an approximation of the optimal solution, the peak temperature increases by 4◦C.
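The discretization step can be sketched as a lookup of the highest available speed state that does not exceed a given cap. The sketch assumes that the cap passed in is the highest speed considered feasible at that instant, and the eight evenly spaced states are hypothetical values chosen only for illustration.

```python
import bisect


def discretize_speed(speed_cap, states):
    """Return the highest discrete speed state not exceeding speed_cap.

    states must be sorted in increasing order. If every state exceeds the
    cap, 0.0 is returned, i.e. the core is idled for this interval.
    """
    idx = bisect.bisect_right(states, speed_cap)
    return states[idx - 1] if idx > 0 else 0.0


# Eight assumed speed states in GHz.
states = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0]
profile = [discretize_speed(s, states) for s in (0.3, 2.7, 4.0)]
```

Because the discrete profile only approximates the continuous optimum, the resulting temperature trajectory differs from the continuous one, which is consistent with the change in peak temperature reported for Figure 6.4b.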
Figure 6.3: Core speeds and temperatures of hottest blocks for Scenario 1. (a) Optimal. (b) Max-throughput. [Each panel plots Speed (GHz) and Temperature (◦C) versus Time (s) for tasks bzip, gap, mcf and twolf.]
Figure 6.4: Optimal speeds and temperatures for Scenario 2. (a) Continuous. (b) Discrete. [Each panel plots Speed (GHz) and Temperature (◦C) versus Time (s) for tasks bzip, gap, mcf and twolf.]
Chapter 7
Energy-efficient DTM using DVFS, Task Migration and Active Cooling
7.1 Introduction
In the previous chapters, we saw that performance cannot be maximized without increasing
temperature. In general, performance and energy consumption are opposing metrics - im-
proving one is often achieved at the expense of the other. Figure 7.1 shows the results of
simulating the execution of a set of benchmark programs on a processor, using simulator
MAGMA [62], and monitoring the total energy consumed and the latest completion time of
the tasks (makespan), while varying the clock frequency. The plot shows that as the delay
(makespan) increases, the energy consumption decreases monotonically, whereas the ratio
of the throughput to energy has a unique maximum value. This ratio is a measure of the
energy efficiency and is referred to as performance-per-watt [72] (PPW). It is equivalent to
the number of instructions executed per Joule of energy.
Often, the primary goal in the design of any computing system is to maximize
its performance under various constraints, and the real price is the total energy expended
by the system. This energy cost should not only include the energy for performing the
computation but also the cost of cooling the system. For this reason, the PPW metric is
the most suitable, whether it is a laptop computer, a desktop workstation or servers in a
datacenter for which cooling costs amount to almost half the total energy [27].
Figure 7.1: Energy-delay curve. [Total energy (kJ) and PPW (MIPS/Watt) plotted against Delay or Makespan (s); the optimal PPW point is marked.]
The three different controls, namely, the fan speed, the task-to-core assignment and
the voltage-speed of each core operate on different time scales (see Figure 3.1). The fan
is relatively sluggish, and its speed can be changed every 1-3 seconds. The chip-wide
migration interval for assigning tasks to cores is between 50 ms and 100 ms, and the core speed and voltage can be changed every 5 ms to 10 ms, as mentioned in Section 3.1.
The use of a detailed thermal model that accounts for the various blocks on the die and package, the accounting of the key relationships among the various quantities, and the unified control scheme make this problem a constrained multi-dimensional optimization problem. The solution presented in this chapter is efficient so that it can be used as an on-line DTM scheme. This is possible due to a useful property that is demonstrated in this chapter, namely, that the PPW is a quasiconcave (unimodal) function of the core and fan speeds. This property guarantees a unique optimal solution.
This chapter also considers a simple and useful extension of the PPW metric, performance^α/Watt (PαPW). The relative importance of performance or energy is controlled by α. Setting α > 1 gives greater weight to performance, whereas 0 < α < 1 emphasizes the importance of energy reduction over performance. This metric is useful in scenarios where certain operations demand higher performance, while other applications can be slowed down to lower energy consumption. For instance, in a smartphone, operations such as channel estimation, demodulation, audio and video modulation-demodulation are a high priority and demand higher performance, whereas applications such as web browsers, email clients, rasterization, pixel blending can afford a slower execution and can thus contribute towards energy savings.
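The effect of α on the preferred operating point can be illustrated with a toy single-core model. The power model below (a constant static term plus a dynamic term cubic in speed) and all of its coefficients are assumptions made only for this illustration; they are not the power model of this work.

```python
def papw(speed, alpha, p_static=1.0, c_dyn=0.5):
    """performance^alpha / Watt for a toy power model P = p_static + c_dyn * s^3."""
    power = p_static + c_dyn * speed ** 3
    return speed ** alpha / power


def best_speed(alpha, speeds):
    """Speed on the given grid that maximizes the toy PaPW metric."""
    return max(speeds, key=lambda s: papw(s, alpha))


grid = [0.1 * k for k in range(1, 31)]  # candidate speeds, 0.1 to 3.0
low_power = best_speed(0.5, grid)       # energy-leaning choice
balanced = best_speed(1.0, grid)        # plain PPW
performance = best_speed(2.0, grid)     # performance-leaning choice
```

In this toy model the maximizing speed grows with α, which mirrors the intended use of the metric: a larger α pushes the operating point towards the performance-optimal solution and a smaller α towards the low power solution.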
The effectiveness of the proposed solution is demonstrated through simulations.
The proposed solution executes 3.2X more instructions per Joule than the performance-
optimal solution for a four core processor. Included in the experiments is an evaluation
of the advantages of task migration and fan speed scaling on improving the PPW. Using
PαPW, and varying α , a comparison is made of the maximum energy-efficient solution,
the performance-optimal solution and the low power solution. Finally, the behavior of
PPW as a function of the number of cores is examined. It is shown that the PPW need not
necessarily increase monotonically with the number of cores, but an optimal configuration
of cores exists, which is determined by the factor by which scaling reduces the power per
core.
7.2 Problem Description and Optimal Solution
Consider q tasks (not necessarily identical). The problem is to determine the optimal con-
figuration of fan speed, assignment of tasks to cores and the transient voltages and speeds
of cores, such that the PPW of the system is maximized, subject to the constraints on the
maximum temperature T max and the voltage-speed relationship given in (2.22).
Let w be the q×1 vector of task weights or priorities. Then PPW is defined as the
ratio of the total performance to the total power:
\[
\mathrm{PPW}(s, v, M, s_{fan}, t) = \frac{s^{T}(t)\, M\, w}{P^{T}(s, v, M, T, t)\, 1_{N\times 1} + P_{fan}(s_{fan})}. \tag{7.1}
\]
The numerator is the sum of core speeds weighted by the weights of the tasks allocated to
those cores. $P^{T}(s, v, M, T, t)\, 1_{N\times 1}$ is the sum of all elements in
$P(s, v, M, T, t)$. Note that the term $P(s, v, M, T, t)$ in (7.1) is the same as defined
in Section 3.2.
Figure 7.2: Quasiconcave nature of PPW metric (surface plot of PPW in MIPS/Watt against the Core 1 and Core 2 speeds in GHz)

With this, the problem of determining the optimal functions of core speeds, voltages
and fan speed to maximize the overall PPW is formulated as follows.
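\begin{align}
\max_{s(t),\, v(t),\, M,\, s_{fan}(t)} \quad & \frac{s^{T}(t)\, M\, w}{P^{T}(s, v, M, T, t)\, 1_{N\times 1} + P_{fan}(s_{fan}(t))} \tag{7.2} \\
\text{s.t.} \quad & \frac{dT(t)}{dt} = \hat{A}(s_{fan})\, T(t) + B\, \hat{P}(s, v, M, T, t), \quad \forall t \tag{7.3} \\
& T(t) \le T_{max}, \quad T(0) = T_{0}, \quad \forall t \tag{7.4} \\
& s_{c}(t) \le k_{s_c}\, \frac{(v_{c}(t) - v_{th})^{1.2}}{v_{c}(t)\, \max(T_{c}(t))^{1.19}}, \quad \forall t, c \tag{7.5} \\
& 0_{n\times 1} \le v(t) \le 1_{n\times 1}, \quad \forall t. \tag{7.6}
\end{align}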
As mentioned in the introduction, the voltage-speed control, the task migration and
the fan speed scaling happen on different time scales due to the high migration overhead and
the sluggishness of the fan. Hence, the optimization of PPW for each of these controls is
handled separately, as described in the following sections.
Characteristics of PPW metric
Before proceeding to the solution of the above optimization problem, it will be worthwhile
to examine the objective function as a function of its parameters. For a description of the
setup of the experiments, refer to Section 3.6.
Figure 7.3: Effect of temperature on PPW (PPW in MIPS/Watt against core speed in GHz, for temperatures of 40 °C, 60 °C, 80 °C and 100 °C)
Figure 7.2 shows the plot of PPW against the core speeds for a dual-core processor.
Note the concave nature of the surface. We prove that the surface is, in fact, quasiconcave
(a weak form of concavity). The proof is given in Appendix A.3. This key property ensures
that the optimization problem (7.2)–(7.6) has a unique solution of core speeds.
Next, the effect of the temperature on the PPW metric is examined in Figure 7.3.
As expected, PPW decreases as temperature increases, but the optimal core speeds increase
with temperature. An increase in temperature causes an increase in leakage power, making
the denominator of the PPW larger. Therefore, the core speed (performance) has to
increase to counteract the increased contribution of the leakage power. The fact that a
higher temperature requires a higher core speed, which in turn causes a higher temperature,
seems to indicate that this will lead to thermal runaway. This cannot happen, however,
because a thermal runaway would yield a very low PPW (close to zero), as the denominator
would increase without bound. A higher PPW can then easily be achieved by decreasing the
core speed, which decreases the overall temperature and leakage power significantly.
The next two plots illustrate the intricate relationship between the core speed, the fan
speed and the ratio of leakage power to the total power. The figures show that the optimal
PPW depends not only on the core speeds, but also strongly on the fan speed (Figure 7.4)
and on the ratio of leakage power to total power (Figure 7.5).

Figure 7.4: Effect of fan speed on PPW (PPW in MIPS/Watt against core speed in GHz, for no fan, 500 rpm, 1500 rpm and 3000 rpm)
Figure 7.5: Effect of leakage power on PPW (PPW in MIPS/Watt against leakage power/total power in %, for no fan, 500 rpm, 1500 rpm and 3000 rpm)
Voltage-Speed Control
We use the same technique as in Section 3.4 for determining the optimal DVFS for the
formulation (7.2) – (7.6) using convex optimization. The resultant formulation is given
below:
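\begin{align}
\max_{s(k)} \quad & \mathrm{PPW}(s(k)) = \frac{\sum_{c=0}^{n} w_{c}\, s_{c}(k)}{P^{T}(s, v, T, k)\, 1_{N\times 1}} \tag{7.7} \\
\text{s.t.} \quad & T(k) = E\, T(k-1) + R\, \hat{P}(s, v, k) \tag{7.8} \\
& T(k) \le T_{max} \tag{7.9} \\
& s_{c}(k) = k_{s_c}\, \frac{(v_{c}(k) - v_{th})^{1.2}}{v_{c}(k)\, \max(T_{c}(k))^{1.19}}, \quad \forall c \tag{7.10} \\
& 0_{n\times 1} \le v(k) \le 1_{n\times 1}. \tag{7.11}
\end{align}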
$w_c$ is the weight of the task executed on core c. Note that the term $P_{fan}(s_{fan}(t))$
present in (7.2) is not shown in (7.7), as the fan speed is fixed. Further, the inequality
in (7.5) has been modified to an equality in (7.10). By forcing the voltage to its lower
bound, power consumption is lowered without sacrificing performance, thereby increasing
energy efficiency. In addition, the use of the equality halves the number of variables to be
solved for, thus greatly reducing the computational complexity.
The above optimization problem is solved at the start of every scheduling interval.
The following properties allow for an efficient solution of the above optimization problem
(proofs are in Appendices A.3 and A.2, respectively).
Theorem 1. $\mathrm{PPW}(s) = \dfrac{\sum_{c=0}^{n} w_{c}\, s_{c}}{P^{T}(s, v, T)\, 1_{N\times 1}}$
is a quasiconcave function of core speeds s.
A function is called quasiconcave [73] if all of its superlevel sets are convex;
informally, it is unimodal, i.e., it has at most one maximum. Quasiconcavity is a
weak form of concavity. This theorem guarantees a unique solution of core speeds that
maximizes the PPW. This property was earlier illustrated in Figure 7.2.
Theorem 2. The core temperatures T and core voltages v are convex functions of core
speeds s.
The above properties allow the use of standard gradient search techniques [73] or
any one of several other methods, such as the ellipsoid and interior-point methods, to find
the optimal solution. For the experimental results, we used the fmincon function in MATLAB
to solve the optimization problem.
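Why unimodality makes the search cheap can be seen in a minimal one-dimensional sketch. It uses golden-section search, swapped in here for the gradient-based and fmincon solvers named above, on an assumed toy PPW of a single core (speed divided by a constant plus cubic power). The constants and function names are hypothetical and are not the model used in the experiments.

```python
import math

# Assumed toy model (not from the dissertation): one core, power = static + cubic dynamic.
P_STATIC = 2.0   # W, hypothetical static power
C_DYN = 1.0      # W/GHz^3, hypothetical dynamic-power coefficient


def ppw(speed):
    """Toy performance-per-watt: speed divided by total power."""
    return speed / (P_STATIC + C_DYN * speed ** 3)


def golden_section_max(f, lo, hi, tol=1e-6):
    """Maximize a unimodal function f on [lo, hi] by golden-section search."""
    inv_phi = (math.sqrt(5.0) - 1.0) / 2.0  # about 0.618
    a, b = lo, hi
    x1 = b - inv_phi * (b - a)
    x2 = a + inv_phi * (b - a)
    f1, f2 = f(x1), f(x2)
    while b - a > tol:
        if f1 < f2:          # the maximum lies in [x1, b]
            a, x1, f1 = x1, x2, f2
            x2 = a + inv_phi * (b - a)
            f2 = f(x2)
        else:                # the maximum lies in [a, x2]
            b, x2, f2 = x2, x1, f1
            x1 = b - inv_phi * (b - a)
            f1 = f(x1)
    return (a + b) / 2.0


if __name__ == "__main__":
    print(f"PPW-optimal speed ~ {golden_section_max(ppw, 0.1, 2.0):.3f} GHz")
```

Because the toy objective has a single peak, each iteration can safely discard part of the interval without missing the maximum. This is the property that Theorem 1 provides for the multi-core case.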
Task-to-Core Allocation
For task-to-core allocation (TCA), we use the same linear assignment problem (LAP)
formulation used in Chapter 4. However, unlike the case of maximizing throughput,
the objective in this case, PPW, is not a linear sum of core speeds. Hence the definition of
PPW is modified from the ratio of the total throughput to the total power to the sum of the
PPWs of individual cores, as shown below:
\[
\mathrm{PPW} = \sum_{c} \frac{w_{c}\, s_{c}}{P_{c}^{T}\, 1_{n\times 1}}. \tag{7.12}
\]
We claim that this modification of the definition of PPW does not cause a significant
loss in accuracy, as illustrated in Figure 7.6. The figure compares the PPW obtained by
solving the TCA problem using the LAP formulation (i.e., with the simplifying
assumptions) with that of the brute-force TCA, which enumerates all possible assignments
of 8 tasks on 4 cores (1680 assignments in total). The TCA was solved repeatedly every
100 ms over a period of 100 s. Further details of the experimental setup are described in
Section 3.6. The following observations were made from the plot:
1. The maximum error of the proposed TCA relative to the brute-force TCA is less than 8%.
From the plot, we see an initial jump in error, which decreases over time. With a set
initial temperature, the proposed and the brute-force allocations deviate in the final
optimum temperature. However, as time progresses, the proposed allocation continually
adjusts its final temperature to maximize PPW, and this temperature approaches
the optimal temperature during the steady-state durations, as seen in the figure. Note
that a constant error between the proposed TCA and the brute-force TCA remains, as the
two methods solve different versions of the PPW objective.
2. The computation time of the brute-force TCA is much larger than that of the proposed
TCA. The brute-force method required 304 s for computing the allocation at every migration
interval, whereas the proposed TCA required just 3.8 s, which is an 80X speedup.
Note that the complexity of the brute-force TCA increases exponentially
with the number of cores and tasks, while it is linear for the proposed TCA.

Figure 7.6: Comparison of PPWs of the brute-force TCA with the proposed TCA (migration interval: 100 ms). The plot shows the PPW (MIPS/Watt) of the proposed and brute-force TCA, together with the PPW error (%), against time (s).
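The count of 1680 assignments and the cost of enumerating them can be reproduced with a short sketch. It assumes that one task is placed on each of the 4 cores, so an assignment is an ordered choice of 4 of the 8 tasks. The `gain` table is made-up data standing in for the per-core terms of (7.12), and `brute_force_tca` is a hypothetical name.

```python
import itertools
import math

N_TASKS, N_CORES = 8, 4

# Number of ways to place 4 of the 8 tasks on 4 distinct cores, one task per core.
n_assignments = math.perm(N_TASKS, N_CORES)

# Hypothetical per-core PPW table: gain[t][c] stands in for w_c * s_c / P_c in (7.12)
# when task t runs on core c. The values are made up for illustration.
gain = [[(3 * t + 5 * c) % 7 + 1 for c in range(N_CORES)] for t in range(N_TASKS)]


def brute_force_tca():
    """Enumerate every assignment and return (best value, tasks placed on cores 0..3)."""
    best_value, best_tasks = -1, None
    for tasks in itertools.permutations(range(N_TASKS), N_CORES):
        value = sum(gain[t][c] for c, t in enumerate(tasks))
        if value > best_value:
            best_value, best_tasks = value, tasks
    return best_value, best_tasks


if __name__ == "__main__":
    print(n_assignments, brute_force_tca())
```

Because the modified objective is a sum of independent per-core terms, the same optimum can be found by a linear assignment solver instead of this enumeration, which is the simplification the LAP formulation relies on.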
Fan Speed Control
The fan speed is changed over relatively long time intervals, typically every few
seconds. This is an order of magnitude smaller than the package thermal time constant (20 s
– 30 s [54]). As such, a significant change in the package temperature is not seen in the
interval during which the fan speed is held constant. Once the fan speed is determined
based on the current power dissipation and TCA, it remains fixed for the interval. Note
that the above assumption is used only in the computation of the fan speed and not in the
actual DTM control of the processor, which happens according to Figure 3.1: fan speed
scaling is done in intervals of 1 s–3 s, TCA happens in intervals of 50 ms–100 ms between
changes in the fan speed (using the computed fan speed), and DVFS occurs at intervals of
5 ms–10 ms between the migrations (using the computed fan speed and TCA).
The problem of determining the optimal fan speed for maximizing PPW also requires
the computation of optimal core speeds. Thus the required formulation for optimal
fan speed scaling at the kth interval is given by
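\begin{align}
\max_{s(k),\, s_{fan}} \quad & \frac{\sum_{c=0}^{n} w_{c}\, s_{c}(k)}{P^{T}(s, v, T, k)\, 1_{N\times 1} + P_{fan}(s_{fan})} \tag{7.13} \\
\text{s.t.} \quad & T(s, s_{fan}, k) = E(s_{fan})\, T(k-1) + R(s_{fan})\, \hat{P}(s, v, k) \tag{7.14} \\
& T(s, s_{fan}, k) \le T_{max} \tag{7.15} \\
& s_{c}(k) = k_{s_c}\, \frac{(v_{c}(k) - v_{th})^{1.2}}{v_{c}(k)\, \max(T_{c}(k))^{1.19}}, \quad \forall c \tag{7.16} \\
& 0_{n\times 1} \le v(k) \le 1_{n\times 1}. \tag{7.17}
\end{align}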
$w_c$ is the weight of the task assigned to core c. $R_{conv}$ is given by (2.24). E and R
are defined in (3.21) and (3.22), respectively. Note that E and R are now functions of
$s_{fan}$, because A is a function of $s_{fan}$, as seen from (2.26).
From Section 7.2, we know that the above formulation is quasiconcave w.r.t core
speeds. The following theorem states that the above formulation is also a quasiconcave
function of s f an. Hence, the same solution techniques discussed in Section 7.2 may be
used to determine the optimal fan speed. Proof of the following theorem is given in Ap-
pendix A.4.
Theorem 3. T is a convex function of the fan speed $s_{fan}$, and
\[
\mathrm{PPW}(s, s_{fan}) = \frac{\sum_{c=0}^{n} w_{c}\, s_{c}}{P^{T}(s, v, T)\, 1_{N\times 1} + P_{fan}(s_{fan})}
\]
is a quasiconcave function of $s_{fan}$.
The overall procedure of DTM control for maximizing PPW is summarized in Al-
gorithm 4.
Input: P(s, v, T, M, t), T(0), A(s_fan), B, t ∈ [0, t_end]
Output: s(t), v(t), M, s_fan, T(t), t ∈ [0, t_end]
k_f, k_m, k_d ∈ Z;
for t = k_f t_fan, t_fan ∈ [1 s, 3 s] do
    Find s_fan that maximizes PPW (Section 7.2);
    for t = k_m t_mig, t_mig ∈ [50 ms, 100 ms] do
        Find M that maximizes PPW (Section 7.2);
        for t = k_d t_dvfs, t_dvfs ∈ [5 ms, 10 ms] do
            Find s that maximizes PPW (Section 7.2);
        end
    end
end
Algorithm 4: Overall procedure of DTM control for maximizing PPW
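The nesting of the three control time scales in Algorithm 4 can be sketched as follows. This is only a skeleton under assumed interval lengths (1 s, 100 ms and 10 ms, chosen from the ranges above). The function `run_dtm` and its three callbacks are hypothetical placeholders for the fan, TCA and DVFS optimizations, not an implementation of them.

```python
def run_dtm(t_end_ms, t_fan_ms=1000, t_mig_ms=100, t_dvfs_ms=10,
            set_fan=None, migrate=None, scale_dvfs=None):
    """Walk the three nested control time scales and count each invocation.

    The three callbacks are hypothetical placeholders for the optimizations
    in Algorithm 4; each receives the current time in milliseconds.
    """
    counts = {"fan": 0, "tca": 0, "dvfs": 0}
    for t_fan in range(0, t_end_ms, t_fan_ms):
        if set_fan:
            set_fan(t_fan)                      # find s_fan maximizing PPW
        counts["fan"] += 1
        fan_end = min(t_fan + t_fan_ms, t_end_ms)
        for t_mig in range(t_fan, fan_end, t_mig_ms):
            if migrate:
                migrate(t_mig)                  # find M maximizing PPW
            counts["tca"] += 1
            mig_end = min(t_mig + t_mig_ms, fan_end)
            for t_dvfs in range(t_mig, mig_end, t_dvfs_ms):
                if scale_dvfs:
                    scale_dvfs(t_dvfs)          # find s maximizing PPW
                counts["dvfs"] += 1
    return counts


if __name__ == "__main__":
    print(run_dtm(2000))
```

Times are kept as integer milliseconds so that the interval boundaries line up exactly. With these assumed intervals, each fan update spans 10 migrations and each migration spans 10 DVFS updates.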
7.3 Results
Comparison of Voltage-speed Scaling Schemes
In this section, performance, power and temperature are compared under three
objectives: maximum performance (α = 3), maximum PPW (α = 1) and minimum energy
(α = 0.25). The values α = 3 and α = 0.25 were chosen to represent the maximum-performance
and minimum-energy solutions, respectively, since there was no significant improvement in
performance for α > 3 and, similarly, no significant energy savings for α < 0.25.
This is observed in Figure 7.7, which shows the plot of total energy consumption versus the
task completion times for various values of α.
Table 7.1: Task completion times (in seconds) of benchmarks used in the experiment

Benchmark    basicmath    qsort    IFFT    GSM dec.
α = 3        184          260      142     106
α = 1        389          512      306     211
α = 0.25     1299         1617     1083    565

The comparison of core speeds and voltages, temperature of the hottest blocks, total
power consumption and the corresponding PPW for a four-core processor executing tasks
from Table 8.1 is shown in Figure ??. The following key observations can be made from
the plots:
1. For maximum performance, either the speed of the cores has to be at the maximum
or the temperature of the hottest blocks has to be at the maximum, as seen from
Figure 7.8a. This is consistent with the performance-optimal policy derived in [17]. For
instance, over the interval [0 s, 106 s], all four tasks run at 2 GHz, as the
die temperatures are below the maximum temperature. Near t = 106 s, the temperature
reaches the upper limit of 110 °C. Task GSM dec. ends at t = 106 s. During
(106 s, 142 s], the speeds of the three remaining cores are lowered to maintain the die
temperature at 110 °C. Task IFFT ends at t = 142 s. This allows the speeds of the two
remaining cores to increase toward 2 GHz, while maintaining the die temperature at
110 °C. Finally, during (184 s, 260 s], task qsort is sped up to 2 GHz, while the die
temperature starts to decrease as a result of the other three cores becoming idle. The
completion times of the tasks are noted in Table 7.1.
2. Regardless of the objective function, the energy efficiency (PPW) decreases as
the number of active tasks decreases. This is because, as the number of active tasks
decreases, performance as measured by the total throughput decreases faster than the
power does.
3. Table 7.2 summarizes the results of Figure ??, showing the overall delay, energy
consumption and the PPW. As seen from the plots in Figure ??, the max-performance
objective (α = 3) executes the tasks with the least possible delay, but consumes
the maximum energy, while the min-energy scheme (α = 0.25) consumes the least
energy, but with a large execution delay. The max-PPW objective (α = 1) offers a
trade-off between both extreme objectives with a reasonably higher delay, but with
large savings in energy, thereby maximizing energy efficiency. In fact, the PPW of
the max-PPW objective is 3.2X (550.9/170.5) that of the max-performance scheme.

Figure 7.7: Energy delay curve (total energy in kJ against delay or makespan in s, for α = 3, 2, 1.5, 1, 0.75, 0.5 and 0.25)

Table 7.2: Comparison of overall delay, energy and PPW for schemes shown in Figure ??.

                     α = 3    α = 1    α = 0.25
Delay (s)            260      512      1617
Total energy (kJ)    14.08    4.44     3.65
PPW (MIPS/Watt)      170.5    550.9    360.9
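The trade-off in Table 7.2 can be made concrete with a few ratios. The sketch below only re-derives numbers from the table values. The variable names are hypothetical, and the average power is a derived quantity (total energy divided by makespan), not a figure reported in the text.

```python
# Values copied from Table 7.2.
delay_s = {"alpha=3": 260, "alpha=1": 512, "alpha=0.25": 1617}
energy_kj = {"alpha=3": 14.08, "alpha=1": 4.44, "alpha=0.25": 3.65}
ppw = {"alpha=3": 170.5, "alpha=1": 550.9, "alpha=0.25": 360.9}

# Max-PPW objective compared against the max-performance objective.
ppw_gain = ppw["alpha=1"] / ppw["alpha=3"]
energy_saving = 1.0 - energy_kj["alpha=1"] / energy_kj["alpha=3"]
slowdown = delay_s["alpha=1"] / delay_s["alpha=3"]

# Average power over the whole run: energy (J) divided by makespan (s).
avg_power_w = {k: 1000.0 * energy_kj[k] / delay_s[k] for k in delay_s}

if __name__ == "__main__":
    print(f"PPW gain: {ppw_gain:.2f}X, energy saving: {energy_saving:.1%}, "
          f"slowdown: {slowdown:.2f}X")
```

Read this way, the max-PPW objective uses roughly 68% less energy than the max-performance objective at roughly 1.97X the makespan.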
Effect of cores on PPW
In this section, we investigate the effect of increasing the number of cores on energy
efficiency. We assume Pollack’s Rule [74] for scaling the performance with doubling of
cores, i.e., every doubling of the number of cores increases the overall throughput by 40%,
reduces the maximum frequency of each core by 30%, and reduces the power consumption
of each core by some p%. In this experiment, three values of p are considered: 35%, 40%
and 45%. The results are shown in Figure 7.7. The plots show that a greater number
of cores does not always translate to higher energy efficiency. The figure shows that there
is an optimal configuration of cores which maximizes the PPW. This optimal configuration
depends on the factor of the power reduction for every doubling of cores. This implies
that there is an upper bound on the power consumption of a core, which limits the number
of cores that can be integrated on a die.
Figure 7.8: Comparison of speed, temperature and power of execution for various objectives: (a) maximum performance (α = 3), (b) maximum PPW (α = 1), (c) minimum energy (α = 0.25). Each panel plots temperature (°C), speed (GHz), voltage (V), and PPW (MIPS/Watt) together with total power (W) against time (s) for basicmath, qsort, IFFT and GSM dec.
Figure 7.7: Plot of MIPS/Watt against the number of cores (1 to 32) for various factors of power reduction per core (45%, 40% and 35%)
Figure 7.8: Comparison of throughput, total power and PPW for various αs
Comparison of Task migration schemes
Here, we compare the throughput, power consumption and the PPW of execution of eight
tasks on a four-core processor. The task migration was conducted for the three different
policies mentioned earlier, i.e., max-performance (α = 3), max-PPW (α = 1) and
min-energy (α = 0.25). The results are plotted in Figure 7.8. The results are consistent
with those from the voltage-speed scaling, i.e., max-performance achieves the best
throughput but the worst PPW due to high power consumption, while the min-energy policy
achieves the least power consumption but, due to low throughput, has a low PPW. Only the
max-PPW policy achieves the maximum PPW, as seen from the figure.
Figure 7.9: Comparison of throughput, total power consumption and PPW of DVFS with and without fan speed scaling (total power in W, throughput in GIPS and PPW in MIPS/Watt against time in s)
Improvement in PPW through Fan Speed Scaling
Figure 7.9 shows the comparison of throughput, total power consumption and PPW of
DVFS with and without the use of variable fan speed. The fan speed at every instant is
obtained by solving the optimization problem proposed in Section 7.2. We see a PPW
improvement of 35.8% over the baseline approach of using no fan, which demonstrates
the need for fan speed scaling for improving the energy efficiency of processors.
Chapter 8
Exploiting Reliability and Bandwidth Slacks for Improved Memory Energy Management
8.1 Introduction
Advancements in scaling technology have made it possible to achieve many-fold
improvements in microprocessor performance, but at the same time have made thermal-related
reliability and increased power consumption major concerns in integrated circuit design.
This is especially true of on-die memory subsystems, e.g., caches, scratchpad memories, etc.,
which dominate the processor in terms of area and power consumption. High temperatures
in memories cause timing-related memory failures. To combat the increased
memory bit-error rates (BER), designers are often forced to add extra margins to their
designs. Supply voltage, being one of these margins, results in higher power consumption and
thereby higher temperatures, which further increase the BERs, resulting in a vicious cycle.
Fortunately, unlike the logic cores of a processor, memories are designed to be error
resilient. Moreover, certain applications such as multimedia have inherent error resiliency,
and can tolerate a certain BER depending on the application. This fact can be exploited
to reduce supply voltages to remove any additional slack in memory reliability towards
reducing power consumption. An important application of this concept is in extending
battery life while running multimedia. The quality of a multimedia object is directly related
to BER of its data. An example of this can be seen in battery-limited systems. In a low
battery situation, a user might be willing to sacrifice the quality of a video, if it helps in
extending the watching time of the video. This can be achieved by artificially increasing
the accepted BER tolerance of applications, which allows the supply voltage to be lowered,
thereby increasing battery life.
Scaling memory clock rate is another mechanism to reduce energy consumption.
On-die memory systems are usually clocked at the same frequency as the CPU to prevent CPU
stalls due to waiting on memory. However, the memory bandwidth is underutilized on
average, thus wasting energy. Instead of clocking memory at the same rate as the CPU,
the memory clock rate should be based on the memory access rate and the misses requested.
With multicores running different applications with differing bandwidth requirements, the
need for separate voltage planes for multiple banks of memory and individual DVFS is
very high.

Figure 8.1: Illustration of the complex interaction between various physical quantities in a processor (nodes: dynamic power, leakage power, memory temperature, CPU temperature, memory BER, memory frequency and memory Vdd; annotated regions A–D; arrows denote linear increase, superlinear increase or superlinear decrease)
Unfortunately, the solution to the above problems is not easy, as there is no simple
relation connecting supply voltages, power consumption, and memory BER. Figure 8.1
illustrates the complex cause-and-effect relationships that exist between supply voltage,
frequency, temperature, power consumption and error rates. A ‘+’ symbol along an arrow
in the figure indicates a positive effect on the quantity (power/temperature/BER) pointed to
by the arrow, by the quantity at the tail of the arrow, whereas a ‘–’ symbol denotes a negative
effect. For example, annotated portion A in the figure shows that increasing frequency and
temperature increases the memory BER, while increasing the supply voltage decreases
the BER. However, increasing voltage also increases both dynamic and leakage power
(see B in the figure), and due to the positive feedback between the leakage power and
the temperature, the temperature increases quickly (see C in the figure). In addition to
this, there is a strong coupling between temperatures of cores and memory (see D in the
figure), which adds additional complexity. The situation becomes much more complex for
a many-core and multi-memory system. Thus optimizing for minimum energy operation
of a processor-memory system with constraints on temperature and application BER poses
very challenging problems.
Memory reliability models
One cause of memory read/write errors is the processor reading or writing data faster
than the delay supported by the memory. Memory access time
is a function of the architecture, the circuits, the process technology, the supply voltage and
the temperature. For optimization purposes, we need a model that relates the delay δ of a
circuit block to some of the above key parameters. Such a model can serve as a basis for
setting bounds on the clock speed. There has been a large body of work on modeling of
timing failure in memory cells due to process variations [75–79]. One such model [53],
constructed from experiments using 65 nm technology, is shown below.
\[
\delta_{i}^{m} = k_{i}\, \frac{v_{i}(t)\, \max(T_{i}^{m}(t))^{1.19}}{(v_{i}(t) - v_{t})^{1.2}} \tag{8.1}
\]
$\delta_{i}^{m}$ signifies the delay associated with the memory of core i. $T_{i}^{m}$ is the
temperature of the thermal blocks in the memory associated with core i. $v_{t}$ is the
threshold voltage and $k_{i}$ is the constant of proportionality. The maximum operational
frequency for a memory i is $s_{i} = 1/\delta_{i}^{m}$.
It is important to note that the delay increases with temperature and threshold voltage, but
decreases with the supply voltage.

Figure 8.2: Effect of process variations on timing-related memory errors (probability density function of the memory access delay relative to the clock period, with labels for vdd, memory errors and decreased error rates)
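The direction of each dependence in (8.1) can be checked numerically with a minimal sketch. The values of `k` and `v_t`, the temperatures in kelvin, and both function names are assumptions made only for illustration; they are not the fitted 65 nm parameters.

```python
def memory_delay(v, temps, k=1.0, v_t=0.3):
    """Delay from (8.1): k * v * max(T)^1.19 / (v - v_t)^1.2.

    v      -- supply voltage (hypothetical normalized units, must exceed v_t)
    temps  -- temperatures of the memory's thermal blocks (kelvin assumed)
    k, v_t -- hypothetical proportionality constant and threshold voltage
    """
    if v <= v_t:
        raise ValueError("supply voltage must exceed the threshold voltage")
    return k * v * max(temps) ** 1.19 / (v - v_t) ** 1.2


def max_frequency(v, temps, k=1.0, v_t=0.3):
    """Maximum operating speed s = 1 / delay."""
    return 1.0 / memory_delay(v, temps, k, v_t)


if __name__ == "__main__":
    print(max_frequency(1.0, [340.0, 350.0]), max_frequency(0.8, [340.0, 350.0]))
```

Only the hottest block enters the model, so a single hot spot limits the speed of the whole memory.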
$v_{t}$ is not a single value but a distribution of values, as it is affected by random
dopant fluctuations (RDF) [79, 80], which are dominant in sub-100 nm designs. $v_{t}$ is
typically modeled as a Gaussian random variable:
\[
v_{t} \sim \mathcal{N}(\mu_{v_t}, \sigma_{v_t}) \tag{8.2}
\]
where $\mu_{v_t}$ is the mean and $\sigma_{v_t}$ is the standard deviation, which depend on
the doping profile, circuit size and the manufacturing process.
It has been shown that the memory access delay $\delta_{i}^{m}$ can itself be considered a
Gaussian random variable [30], which depends on the statistics of $v_{t}$:
\[
\delta_{i}^{m} \sim \mathcal{N}(\mu_{\delta_i^m}, \sigma_{\delta_i^m}). \tag{8.3}
\]
When the memory access delay $\delta_{i}^{m}$ (a distribution) exceeds the memory read/write
clock period $1/s_{i}$ (fixed), there will be memory read/write errors, as illustrated in
Figure 8.2. Note that increasing the clock frequency leads to a higher probability of memory
errors, as it enforces tighter bounds on the memory access times, whereas increasing the
supply voltage decreases the error rate, as it reduces the access times of memory cells.
Since the delay is a Gaussian random variable, the probability of bit errors can be
computed as follows:
\[
P_{e}(s_{i}, v_{i}^{m}, T_{i}^{m}) = P[\delta_{i}^{m} > 1/s_{i}] = Q(\rho_{i}). \tag{8.4}
\]
Q is the tail of the Gaussian distribution, and is given by
\[
Q(\rho_{i}) = \frac{1}{\sqrt{2\pi}} \int_{\rho_{i}}^{+\infty} e^{-x^{2}/2}\, dx. \tag{8.5}
\]
Standard tables are available to compute Q functions for any $\rho_{i}$. $\rho_{i}$ in (8.4) is defined as
\[
\rho_{i} = \frac{1/s_{i} - \mu_{\delta_i^m}}{\sigma_{\delta_i^m}} = \frac{v_{t,i}^{min} - \mu_{v_t}}{\sigma_{v_t}}. \tag{8.6}
\]
$v_{t,i}^{min}$ is obtained from (8.1) by substituting $\delta_{i}^{m} = 1/s_{i}$.
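Instead of table lookups, the Q function in (8.5) can be evaluated with the complementary error function through the standard identity Q(ρ) = erfc(ρ/√2)/2. The sketch below applies it to (8.4) and (8.6). The delay statistics used in the example call are made-up numbers, and the function names are hypothetical.

```python
import math


def q_function(rho):
    """Tail probability of the standard normal distribution, as in (8.5)."""
    return 0.5 * math.erfc(rho / math.sqrt(2.0))


def bit_error_probability(speed, mean_delay, std_delay):
    """P[delay > 1/speed] for a Gaussian delay, following (8.4) and (8.6)."""
    rho = (1.0 / speed - mean_delay) / std_delay
    return q_function(rho)


if __name__ == "__main__":
    # Hypothetical numbers: clock period 1.0, mean delay 0.8, standard deviation 0.1.
    print(bit_error_probability(1.0, 0.8, 0.1))
```

A faster clock shortens the period 1/s, lowers ρ and therefore raises the error probability, consistent with the discussion of Figure 8.2.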
Cores, unlike memories, cannot tolerate any timing errors. Hence the
voltage-speed-temperature relationship for cores is given by (the reciprocal of (8.1))
\[
s_{i}(t) \le k_{i}^{c}\, \frac{(v_{i}^{c}(t) - v_{t})^{1.2}}{v_{i}^{c}(t)\, \max(T_{i}(t))^{1.19}}, \quad \forall i \in \text{cores}. \tag{8.7}
\]
Figure 8.3 illustrates the relation between memory BER, memory frequency and
voltage. As seen from the figure, for every frequency, increasing the voltage initially
decreases the BER. However, increasing the voltage beyond some point results in an increase
in BER due to the higher temperature and the cyclic dependency on the leakage power
consumption. There is an optimal voltage that corresponds to a minimum BER, and this
voltage depends on the assigned frequency.
8.2 Opportunities for exploiting memory reliability and bandwidth slack
Limitations of pre-determined voltage-frequency pairs
Most of today’s processors provide support for dynamic voltage and frequency scaling
(DVFS) in the form of Power(P)-states. These P-states are fixed pairs of voltage-frequency
states derived to satisfy timing requirements under highest allowable temperature and a
very low BER. Under normal workload conditions, the temperature of an on-die memory
rarely reaches the stipulated maximum temperature. Thus, there is a slack in the voltage
created due to lower temperatures and lower application BER, which can be exploited to
reduce the supply voltage to save memory energy consumption.
Figure 8.3: Memory BER as a function of memory voltage and frequency
Inherent error resiliency of multimedia
Multimedia content like images, video, and audio contains elements which are very highly
correlated, e.g., pixels in an image. This property helps in restoring images which are
corrupted by uncorrelated noise like channel noise, sensor noise, etc. Figure 8.4 illustrates
the extent of image restoration possible under various BERs. The restoration was done
using a median filter.
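How a median filter exploits this correlation can be shown on a single row of pixels. The sketch below is a one-dimensional simplification with a window of 3 and replicated edges. The restoration in Figure 8.4 was applied to images, so this is only an illustration of the principle, with made-up pixel values and a hypothetical function name.

```python
from statistics import median


def median_filter_1d(samples, window=3):
    """Replace each sample by the median of its neighborhood (edges are replicated)."""
    half = window // 2
    padded = [samples[0]] * half + list(samples) + [samples[-1]] * half
    return [median(padded[i:i + window]) for i in range(len(samples))]


# A smooth ramp in which 255 and 0 are isolated, uncorrelated errors.
row = [10, 11, 12, 255, 13, 14, 0, 15]
restored = median_filter_1d(row)

if __name__ == "__main__":
    print(restored)
```

The isolated outliers are replaced by values close to their neighbors, while samples that already agree with their neighbors change little.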
Under utilized last-level cache
With shrinking feature sizes, chip designers are able to pack more and more hardware
features on the chip. Advanced out-of-order cores and massive graphics engines are
fundamental requirements of any modern high-end processor. The ‘memory wall’ presents a
challenge to all such advancements in delivering higher performance. The last-level cache
(LLC) plays a crucial role in lowering the memory wall. Increasing LLC capacity is the
option favored to increase the memory bandwidth. However, a large LLC is mostly
underutilized, leading to wastage of cache power. Figure 8.5 shows the bandwidth demand
from L1 and L2 caches from memory for three different Parsec benchmarks. Note that the
bandwidth requirement of every application varies and the average bandwidth requirement is
far below the peak demand. Added to this, in today’s processors the LLC contributes a
significant amount of processor power, sometimes on par with the CPU. Thus the LLC can
greatly benefit from DVFS that includes adjusting LLC clock rates based on bandwidth
requirement and adjusting LLC supply voltage based on BER constraints.

Figure 8.4: Examples of restoring images corrupted by various levels of noise, and the possible energy savings obtained by operating a memory at a supply voltage that satisfies the corresponding BER (noisy and restored images for BER = 10^-3, 10^-2 and 10^-1).
Unlike adjusting the LLC clock rate based on bandwidth requirement, which is simpler,
the supply voltage of an LLC cannot be lowered to allow any increase in BER that cannot
be corrected by error correction control (ECC) mechanisms. This is because a cache
contains data that is required by various programs, and some of this content cannot
tolerate any error, e.g., program instructions, operating system calls, data structures, etc.
Any error will cause either the program to crash or the entire system to restart. Thus
lowering the supply voltage in order to alter BER to save power cannot be directly applied
to caches. This requires an architecture modification that provides software support that
allows a program to selectively store required data in a specialized memory called
‘Near-memory’.

Figure 8.5: Plot of bandwidth requirement for LLC for three different Parsec benchmarks
Figure 8.6: Proposed near-memory architecture for LLC. (a) Near-memory allows each bank of cache to be configured as either cache or Near-Memory (NM) with individual voltage and frequency control; (b) in traditional LLC all banks act as cache; (c) banks 1 & 3 configured as Near Memory.

Near-Memory architecture for LLC
Figure 8.6 shows a conceptualization of a near-memory architecture. In this proposed
architecture, a memory bank of the LLC is dynamically configured as a near-memory.
Near-memory means that the content residing inside this near-memory is managed by an
application program, as opposed to the hardware-managed caches. This near-memory does
not generate any misses and the data in the memory is assumed to be available for the
application just like in the main memory. In essence, near-memory is a software-controlled
memory, as against the hardware-managed caches. However, the latency of this
near-memory is that of the LLC, which is significantly less than that of the off-chip
DRAM-based main memory, which gives it the name ‘Near memory’.
Near-memory architecture is specifically beneficial to multimedia applications, where
the application can store multimedia specific data, which can tolerate high BER. Depending
on the required BER, the voltage of a near-memory is selectively reduced below the normal
voltage for LLC. This voltage reduction of a Near-memory does not affect the operation or
the supply voltage of other parts of the LLC, which is used as a cache, thereby avoiding
any situation leading to a program crash. The BER increase due to lowering the voltage of
the near-memory is within the acceptable range specified by the application. Furthermore,
with L1 and L2 private caches still acting as regular caches, this near-memory is accessed
at the same bandwidth and latency as the LLC. It is possible to create a special paging
scheme in the near-memory for different applications to share the data in near-memory.
8.3 Problem Statement and Approach
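Using the above system models, the goal is to determine the transient speeds and voltages
of cores and memories to minimize the total power consumption, subject to the constraints
on the maximum temperature $T_{max}$ and the target BER $P_{e}^{max}$. The formulation is
given below:
\begin{align}
\min_{s,\, v^{c},\, v^{m}} \quad & P(s, v^{c}, v^{m}, T, t)^{T}\, 1_{N\times 1} \tag{8.8} \\
\text{s.t.} \quad & \frac{dT(t)}{dt} = \hat{A}\, T(t) + B\, \hat{P}(s, v^{c}, v^{m}, t), \quad \forall t \tag{8.9} \\
& T(t) \le T_{max}, \quad T(0) = T_{0}, \quad \forall t \tag{8.10} \\
& f(Q(\rho)) \le P_{e,i}^{max}, \quad \rho_{i} = \frac{v_{t,i}^{min} - \mu_{v_t}}{\sigma_{v_t}}, \quad \forall i \tag{8.11} \\
& s_{i}(t) = k_{i}^{m}\, \frac{(v_{i}^{m}(t) - v_{t,i}^{min})^{1.2}}{v_{i}^{m}(t)\, \max(T_{i}(t))^{1.19}}, \quad \forall t,\ \forall i \in \text{mem} \tag{8.12} \\
& s_{i}(t) \le k_{i}^{c}\, \frac{(v_{i}^{c}(t) - v_{t})^{1.2}}{v_{i}^{c}(t)\, \max(T_{i}(t))^{1.19}}, \quad \forall t,\ \forall i \in \text{cores} \tag{8.13} \\
& s_{min} \le s(t) \le 1_{n\times 1}, \quad \forall t \tag{8.14} \\
& 0_{n\times 1} \le v^{c}(t),\, v^{m}(t) \le 1_{n\times 1}, \quad \forall t. \tag{8.15}
\end{align}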
The objective in (8.8) denotes the sum of the power consumption of all thermal blocks.
Equation (8.9) is the same as (2.9). The temperature constraint on the cores and the memories is
described in (8.10). Equation (8.12) is the reciprocal of (8.1), and (8.13) is the same as (8.7). The
data rates of tasks determine the minimum speed of cores and their associated memories, as
specified in (8.14).
f in (8.11) refers to the function that computes the target BER from the BERs of
individual cores. It is important to distinguish between the target BER, which is set by the
application needs, and the memory BER. For example, in the case of a multimedia
application, the target BER can be translated to peak signal-to-noise ratio (PSNR), which is
the ratio of the maximum value of all pixels to the mean square error (MSE) of the noisy
image relative to the original image. This function is specific to the applications processed [81].
In general, f can be obtained through system simulation at the functional level. In some
cases, though, analytical derivations are possible [81].
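As a rough illustration of how the constraint in (8.11) might be checked numerically, the sketch below assumes that Q is the standard Gaussian tail probability (the usual Q-function; this section does not restate its definition) and that f is the identity, which is only a placeholder for the application-specific mapping discussed above. The function names and the numeric values are hypothetical.

```python
import math


def q_function(rho: float) -> float:
    """Gaussian tail probability Q(rho) = P(Z > rho) for Z ~ N(0, 1)."""
    return 0.5 * math.erfc(rho / math.sqrt(2.0))


def memory_ber(v_t_min: float, mu_vt: float, sigma_vt: float) -> float:
    """Memory BER Q(rho) with rho = (v_t_min - mu_vt) / sigma_vt, as read from (8.11)."""
    rho = (v_t_min - mu_vt) / sigma_vt
    return q_function(rho)


def meets_target_ber(v_t_min, mu_vt, sigma_vt, p_e_max, f=lambda ber: ber):
    """Check f(Q(rho)) <= p_e_max; f defaults to the identity (an assumption)."""
    return f(memory_ber(v_t_min, mu_vt, sigma_vt)) <= p_e_max


if __name__ == "__main__":
    # Hypothetical numbers: threshold-voltage mean 0.30 V, standard deviation 0.03 V.
    for v_t_min in (0.40, 0.44, 0.48):
        ber = memory_ber(v_t_min, 0.30, 0.03)
        print(v_t_min, ber, meets_target_ber(v_t_min, 0.30, 0.03, 1e-6))
```

In a real flow, f would be replaced by the mapping from memory BER to the application-level quality metric (for example PSNR) obtained from simulation or analysis.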
Since the core and the memory speeds and voltages can be changed only at discrete
time intervals, the above formulation needs to be discretized in time. These fixed
time intervals, called the scheduling intervals $t_s$, are usually chosen to be on the order of
the die thermal time constants of the processor (a few ms [43]), as significant temperature
evolution is not observed over shorter intervals of time. The solution of the above
formulation in discrete time is presented in the following section.
Implementation Details
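The discrete version of (8.8) – (8.15) is obtained for the kth scheduling interval by substituting
$t = k t_s$. Note that core speeds are held constant for the duration of $t_s$. The temperature at the
end of time $k t_s$ is obtained by discretizing (8.9) and is expressed using matrix exponentials
as shown below:
\[
\mathbf{T}(k) = e^{\hat{\mathbf{A}} t_s}\,\mathbf{T}(k-1) + \hat{\mathbf{A}}^{-1}\left(e^{\hat{\mathbf{A}} t_s} - \mathbf{I}_{N \times N}\right)\mathbf{B}\,\hat{\mathbf{P}}(\mathbf{s}, \mathbf{v}^c, \mathbf{v}^m, k). \tag{8.16}
\]
For simplicity, $k t_s$ is denoted by just k. The above equation is rewritten with $\mathbf{E} = e^{\hat{\mathbf{A}} t_s}$ and
$\mathbf{R} = \hat{\mathbf{A}}^{-1}(e^{\hat{\mathbf{A}} t_s} - \mathbf{I}_{N \times N})\mathbf{B}$ as
\[
\mathbf{T}(k) = \mathbf{E}\,\mathbf{T}(k-1) + \mathbf{R}\,\hat{\mathbf{P}}(\mathbf{s}, \mathbf{v}^c, \mathbf{v}^m, k). \tag{8.17}
\]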
With this the discrete version of (8.8) – (8.15) is given below:
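\begin{align}
\min_{\mathbf{v}^c,\,\mathbf{v}^m} \quad & \mathbf{P}(\mathbf{s}^{\min}, \mathbf{v}^c, \mathbf{v}^m, \mathbf{T}, k)^T \mathbf{1}_{N \times 1} \tag{8.18} \\
\text{s.t.} \quad & \mathbf{T}(k) = \mathbf{E}\,\mathbf{T}(k-1) + \mathbf{R}\,\hat{\mathbf{P}}(\mathbf{s}^{\min}, \mathbf{v}^c, \mathbf{v}^m, k) \tag{8.19} \\
& \mathbf{T}(k) \le \mathbf{T}_{\max}, \quad \mathbf{T}(0) = \mathbf{T}_0 \tag{8.20} \\
& f(Q(\rho)) \le P^{\max}_{e,i}, \quad \rho_i = \frac{v^{\min}_{t,i} - \mu_{v_t}}{\sigma_{v_t}}, \quad \forall i \in \text{mem} \tag{8.21} \\
& s^{\min}_i(k) = k^m_i \, \frac{\left(v^m_i(k) - v^{\min}_{t,i}\right)^{1.2}}{v^m_i(k)\, \max(T_i(k))^{1.19}}, \quad \forall i \in \text{mem} \tag{8.22} \\
& s^{\min}_i(k) = k^c_i \, \frac{\left(v^c_i(k) - v_t\right)^{1.2}}{v^c_i(k)\, \max(T_i(k))^{1.19}}, \quad \forall i \in \text{cores} \tag{8.23} \\
& \mathbf{0}_{n \times 1} \le \mathbf{v}^c(k),\, \mathbf{v}^m(k) \le \mathbf{1}_{n \times 1}, \quad \forall k. \tag{8.24}
\end{align}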
Note the following major changes in the above formulation compared with the
formulation (8.8)–(8.15):
• The speeds of cores and memories are no longer variables; they are replaced by their
respective minimum values, i.e., s = smin. This is because minimizing total power
consumption implies executing cores and memories at lower speeds and voltages. Since using
the minimum voltage may violate BER constraints, only the minimum speed is used. Note
that using the minimum speed does not violate the constraints on maximum temperature
or BER. Additionally, using the minimum speed reduces the number of control
variables by half, thus improving the computation speed significantly. The control
variables are now just the voltages of cores and their respective memories.
• Since lowering power consumption demands minimum voltages, the inequalities in
(8.11) and (8.13) are transformed to equalities (8.21) and (8.23), respectively.
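Once the speed is pinned to its minimum, the equality form of the speed-voltage relation can be inverted numerically to obtain the lowest voltage that still delivers that speed. The sketch below is only an illustration under stated assumptions: it reads the relation as k (v - v_t)^1.2 / (v T^1.19), which increases monotonically in v for v > v_t, and uses bisection. The constants `K`, `V_T`, and `TEMP` and both function names are hypothetical and not taken from the dissertation.

```python
def speed(v, k, v_t, temp):
    """Normalized speed k * (v - v_t)^1.2 / (v * temp^1.19); zero at or below v_t."""
    if v <= v_t:
        return 0.0
    return k * (v - v_t) ** 1.2 / (v * temp ** 1.19)


def min_voltage_for_speed(s_min, k, v_t, temp, v_max=1.0, tol=1e-9):
    """Smallest voltage in (v_t, v_max] whose speed reaches s_min, found by bisection."""
    if speed(v_max, k, v_t, temp) < s_min:
        raise ValueError("s_min is not reachable at v_max")
    lo, hi = v_t, v_max
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if speed(mid, k, v_t, temp) >= s_min:
            hi = mid  # mid is fast enough, try a lower voltage
        else:
            lo = mid  # mid is too slow, raise the voltage
    return hi


# Hypothetical constants: normalized threshold voltage and temperature in kelvin.
V_T = 0.3
TEMP = 350.0
K = TEMP ** 1.19 / (1.0 - V_T) ** 1.2  # scales speed(1.0) to about 1

if __name__ == "__main__":
    v = min_voltage_for_speed(0.5, K, V_T, TEMP)
    print(v, speed(v, K, V_T, TEMP))
```

Bisection is used here only because the relation is monotonic in the voltage; the dissertation itself solves the full problem with a convex (quasiconvex) optimizer.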
Theorem 2 and the following theorem (for proof see Appendix A.5) make the above
formulation a quasiconvex optimization problem [73].
Theorem 4. Q is a quasiconvex function of vm.
8.4 Experimental Analysis
The reliability and bandwidth aware DVFS was evaluated for general-purpose computing
multi-threaded workloads.
Experimental setup
A set of programs (detailed in Table 8.1) from the Parsec [82] benchmark suite was simulated
on the Sniper [83] simulator, configured for the specification of a 45 nm quad-core Xeon
processor based on Intel’s Nehalem microarchitecture, with 8 MB of LLC consisting of four
banks. This architecture represents a widely used, first-class server machine for high
performance computing. Detailed trace files containing performance events from Sniper were
fed to McPAT [84] to obtain power statistics. The proposed optimizer determines the optimal
supply voltage for the LLC based on application BER tolerance, and the optimal memory clock
rate based on bandwidth demand from the L1 and L2 caches. The resultant power is fed through
the HotSpot [54] thermal simulator to ensure that the maximum temperature constraint of 100°C
is not violated.
Figure 8.7: Experimental setup (blocks shown: Parsec benchmark suite; Intel Nehalem architecture with four cores, L2, and LLC; Sniper; McPAT; HotSpot; controller providing voltage and frequency controls for individual cores and the last level cache)
Table 8.1: Characteristics of Parsec benchmarks used in the experiments
Benchmark        Avg. power (W)    Avg. bandwidth (GB/s)    No. of threads
blackscholes     33.59             0.04                     3
bodycluster      34.71             0.75                     2
canneal          28.3              7.97                     3
fluidanimate     31.39             0.73                     2
freqmine         34.74             0.56                     1
raytrace         37.94             0.58                     3
streamcluster    33.68             8.1                      3
swaptions        34.09             0.001                    3
vips             39.45             2.26                     1
Solution process
The solution to the problem of BER- and bandwidth-based optimal DVFS for the LLC is
obtained through a convex optimization process. This is possible because the power and thermal
models adopted in this work are convex w.r.t. the frequencies and voltages of cores and
cache [17]. More importantly, the relation between the supply voltage of the LLC and the
application BER is monotonic. Together, these properties allow the problem to be solved as a
convex optimization problem. A convex optimization problem can be solved by one of several
gradient search methods or interior-point techniques [73]. The solution time varies depending on
the dimension of the problem and the smoothness of the objective and constraints of the
problem.
Figure 8.8: Energy savings from the proposed BER and bandwidth aware DVFS for LLC (x-axis: Parsec benchmarks; y-axis: energy improvement over baseline, 0 to 100; series: BW only voltage scaling, BER only voltage scaling)
Energy savings from the proposed DVFS for LLC
Figure 8.8 shows the improvement in memory energy savings as a result of optimal DVFS
of the LLC based on application BER and bandwidth demand. For this experiment, the
application BER constraint was set at $10^{-6}$. The plot also shows the ratio of energy savings
from pure BER-based voltage scaling to the energy savings from frequency scaling of the LLC
based on bandwidth demand from the L1 and L2 caches. It is important to note that the
energy savings do not fluctuate much among the benchmarks, i.e., they are largely unaffected
by whether the application is CPU bound or memory bound. This is mainly because leakage
power dominates the power consumption in an LLC, and because the quadratic effect of voltage
scaling is more prominent than the linear effect of frequency scaling. Note that the
above results did not change significantly whether or not DVFS was applied to the cores,
and hence those results are not reported.
Figure 8.9: Energy savings increase proportionally with increasing BER tolerance
Energy savings by lowering BER
As discussed earlier, multimedia applications have very high error resiliency. This experiment
was conducted to see the effect of lowering the BER requirement (i.e., tolerating a higher BER)
on energy savings. Figure 8.9 shows a super-linear increase in energy savings as the BER
requirement is lowered. This fact can be exploited in a low-battery scenario to improve the
battery life.
Chapter 9
Temperature-aware Robust Controller: Accurate Modeling and Prediction for Multi-core
Processors
9.1 Introduction
Unlike for single-core processors, where most DEM policies can be easily realized on a
processor through the use of simple control implementations like PID controllers [44, 45],
there are no such straightforward techniques available for multi-core DEM implementations.
The reason is that a multi-core processor is essentially a multi-input-multi-output
(MIMO) device; MIMO controllers are inherently complicated to design, and
assuring the stability and robustness of those controllers is very challenging.
Several new controller designs have been suggested in recent years for multi-core
DEM [46–49]. Some of them are based on statistical techniques like partially observable
Markov chains [47], while others are based on control-theoretic techniques [46,48,49].
The statistical approach has the advantage of not requiring a priori knowledge of power and
thermal models in order to determine the optimal processor frequency and voltage for achieving
a given objective; however, these statistical approaches operate on discrete values, and
the complexity of computing the discrete optimal DVFS states grows exponentially with
the number of cores, which is not practical for online DEM.
The recent works on control-theoretic techniques are targeted mainly at maintaining
the processor temperature and power consumption below a specified maximum [46,48,49].
Refs. [48] and [49] use model-predictive control (MPC) to determine optimal control states
for one or more control time steps in the future, by solving a constrained optimization
problem. Some of the limitations of the above works are: (i) ignoring workload characteristics;
(ii) including power and thermal models that are either simplistic, leading to sub-optimal
controller actions, or of higher order than necessary, which increases computational
complexity; (iii) neglecting the dependence of leakage power on temperature; (iv) lack
of flexibility in handling objective functions and constraints of various kinds, e.g. a highly
non-linear objective function like performance/Watt, where the control variables are present
both in the numerator and the denominator; (v) high computational complexity of determining
the controller action, which increases exponentially with the number of cores. Ref. [46]
addressed the last limitation by developing a distributed system, where each core or group of
cores determines the control action for its own group, with minimal communication with
other cores. However, their policy is mainly heuristic and guarantees neither the
optimality nor bounds on the optimality of the controller action.
Figure 9.1: Structure of the closed-loop controller with its various components (blocks: DTM controller, processor, Kalman state estimator; signals: core speeds, core utilization, power and temperatures of cores, predicted power and temperatures of cores, updated package temperature, process noise, sensor noise, various non-linear objectives)
The above limitations are overcome with a system consisting of online power and
thermal model estimation, and minimization of the prediction error based on a Kalman filter.
Figure 9.1 depicts the structure of the proposed closed-loop controller. The structure
consists of a DEM controller that generates optimal DVFS states for the next time interval based
on the computed power and thermal models, the current package temperature, and core utilization.
A Kalman filter is used to reduce the prediction errors caused either by model inaccuracies or
by sensor noise. The Kalman filter utilizes the current prediction errors and the model to
compute the correction to the state variables. An outline of the procedure of the proposed
closed-loop controller is described in Algorithm 5.
Algorithm 5: Overall procedure of closed-loop control for multi-core DEM
Input: Number of cores n; sensor measurements: processor power $P_m$, core temperatures
$\mathbf{T}_m$ and core utilizations $\mathbf{u}$; objective to maximize
Output: Core speeds $\mathbf{s}$
Build power and thermal models by executing benchmarks and measuring $P_m$, $\mathbf{T}_m$ and $\mathbf{u}$
(Section 9.3);
Let package temperature $T_p(0) = T_{ambient}(0)$;
for every $k t_s$, $k \in \mathbb{N}$, $t_s \in [1\ \text{ms}, 10\ \text{ms}]$ do
    Using models, compute $\mathbf{s}(k)$ that maximizes the objective while ensuring predicted
    $\mathbf{T}(k) \le \mathbf{T}_{\max}$ (Section 9.5);
    Compute $T_p(k+1)$ using $T_p(k)$ and $\mathbf{P}(k)$ (9.3);
    Error $\mathbf{e}(k) = [\mathbf{T}^T(k)\ P(k)]^T - [\mathbf{T}_m^T(k)\ P_m(k)]^T$;
    Using $\mathbf{e}(k)$, adjust $T_p(k+1)$ to minimize future prediction error (Section 9.4);
end
The salient features of the proposed approach are (i) estimation of the dynamic and
leakage power of cores using just the total power of all cores and the core temperatures; (ii)
inclusion of workload characteristics to improve power and temperature prediction; (iii) handling
of objective functions of various kinds; (iv) minimal error in the prediction of core temperatures
and power consumption; (v) fast computation time. We did not perform stability or
robustness analysis for the controller because (i) the open-loop controller is inherently stable; (ii)
modern processors have a built-in safety mechanism that shuts down the processor at high
temperatures, thus protecting it [85]. We did not encounter any robustness issues during
our experimentation (see below) with our implementation of the proposed controller on the
Intel Sandy Bridge processor.
Experimental results from the implementation of the proposed closed-loop
controller on an Intel Sandy Bridge processor show that the controller is able to track
temperatures with 0.1°C mean error and 2.4°C standard deviation, while power consumption
is tracked with a mean error of 0.06 W and a standard deviation of 3.04 W. In comparison,
the standard deviation of noise is 2.05°C for the temperature sensors and 2.45 W for the
power sensor. Since the prediction error is close to the sensor noise range, we claim that
the proposed controller has excellent tracking ability, which has been demonstrated on a real
processor, while most of the existing mechanisms have demonstrated their tracking ability
only on a simulator, which may not carry over to a real system.
The experiments on regulating core temperatures below a specified maximum have
shown that the core temperatures can deviate by 3°C from the specified maximum when
using the proposed controller. Another set of experiments on maximizing energy efficiency
has demonstrated that the proposed controller achieves a 32% improvement in energy
efficiency when compared with the existing techniques for DEM provided by Intel on Linux.
9.2 Power and Thermal Models
Characterizing the thermal and power behavior of a processor is necessary for the estimation of
core temperatures and power consumption for given core speeds. Among the existing models
for characterizing processor temperature, compact thermal models [38] are the simplest,
yet provide accurate modeling of the power-temperature relationship. The thermal model used
in this chapter is the same as in Section 2.2. The power-temperature relationship is summarized
below:
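\begin{align}
\frac{dT_p(t)}{dt} &= -\frac{1}{R_p C_p}\, T_p(t) + \sum_{c=1}^{n} \frac{P_c(s_c, T_c, t)}{C_p} \tag{9.1} \\
\mathbf{T}(t) &= T_p(t) + \mathbf{R}\,\mathbf{P}(\mathbf{s}, \mathbf{T}, t). \tag{9.2}
\end{align}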
sc is the normalized ([0,1]) frequency/speed of core c. n is the total number of cores. In the
above equations, the power consumption depends on both core speed (dynamic power) and
core temperature (leakage power), but does not explicitly depend on core voltages. This is
because in most processors, changing a core speed automatically adjusts the core voltage
appropriately. Note that vectors and matrices are denoted in bold, e.g. P denotes a vector
of Pc of all cores.
Since the power and the temperature measurements are obtained in discrete time
steps, the above equations are discretized and the corresponding equations are given below.
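\begin{align}
T_p(k+1) &= a\,T_p(k) + b \sum_{c=1}^{n} P_c(s_c, T_c, k) \tag{9.3} \\
\mathbf{T}(k) &= T_p(k) + \mathbf{R}\,\mathbf{P}(\mathbf{s}, \mathbf{T}, k) \tag{9.4}
\end{align}
where $a \triangleq 1 - \frac{\Delta t}{R_p C_p}$ and $b \triangleq \frac{\Delta t}{C_p}$. $\Delta t$ is the length of the discrete time step and k refers to
time $k\Delta t$.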
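The coefficients a and b above correspond to a forward-Euler discretization of (9.1). The following sketch, with hypothetical parameter values and function names, computes them and compares them with the exact zero-order-hold coefficients of the same scalar model (the scalar analogue of the matrix-exponential form in (8.16)); for time steps much shorter than the thermal time constant $R_p C_p$ the two nearly coincide.

```python
import math


def euler_coefficients(r_p, c_p, dt):
    """Forward-Euler coefficients: a = 1 - dt/(R_p C_p), b = dt/C_p."""
    return 1.0 - dt / (r_p * c_p), dt / c_p


def exact_coefficients(r_p, c_p, dt):
    """Exact zero-order-hold coefficients of the scalar model in (9.1)."""
    a = math.exp(-dt / (r_p * c_p))
    return a, r_p * (1.0 - a)


def step_package_temperature(t_p, total_power, a, b):
    """One step of (9.3): T_p(k+1) = a T_p(k) + b * sum_c P_c(k)."""
    return a * t_p + b * total_power


if __name__ == "__main__":
    # Hypothetical package: R_p = 0.5 K/W, C_p = 100 J/K, 10 ms time step.
    r_p, c_p, dt = 0.5, 100.0, 0.01
    print(euler_coefficients(r_p, c_p, dt))
    print(exact_coefficients(r_p, c_p, dt))
```

Both variants share the same steady state, $T_p = R_p \sum_c P_c$, which is a quick sanity check on identified values of a and b.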
The power consumption of a core is given by the sum of dynamic power and leakage
power as shown below
\[
P_c(s_c, T_c, k) = P_{d,c}\,u_c(k)\,s_c^3(k) + P_{l0,c,m} + g_{l,c,m}\,T_c(k) \tag{9.5}
\]
The first term denotes the dynamic power consumption, where $P_{d,c}$ is the maximum
dynamic power consumption of core c at the maximum core utilization $u_c = 1$ and maximum
core speed $s_c = 1$. Since there is no explicit voltage control available, the core voltages are
assumed to scale linearly with the core speeds; because dynamic power is proportional to the
square of the voltage times the frequency, this gives the $s_c^3$ dependence in the first term of
the above equation.
The last two terms represent the contribution of leakage power to the total power.
The leakage power has an exponential dependence on core voltage and temperature. This
explains the cyclical dependence of power consumption on temperature as seen in (9.4).
In order to have an accurate yet simple representation of the above dependence, we have
used piece-wise linear (PWL) modeling of the exponential dependence. In the above
equation, $P_{l0,c,m}$ and $g_{l,c,m}$ are the offset and the slope of the leakage power of core c in the
mth piece-wise linear segment. An example of a two-dimensional PWL decomposition of
leakage power w.r.t. core speed and temperature is shown in Figure 9.2. The procedure
to determine the mth PWL segment is described with an example. Consider a processor
which supports discrete speeds from 0.8 GHz to 2.1 GHz, equally spaced every 100 MHz.
For the sake of PWL, let us discretize the temperature range from 30°C to 90°C, equally
spaced every 5°C. Then for an example frequency of 1.5 GHz and temperature of 55°C, m
is computed as m = 8×14+6 = 118, as 1.5 GHz corresponds to the 8th discrete speed level,
55°C corresponds to the 6th discrete temperature level, and the total number of discrete speed
settings is 14.
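The index arithmetic of this worked example can be written as a small helper. This is only a sketch that follows the example literally (1-based speed and temperature levels combined as speed level × number of speed settings + temperature level); the function name and default parameters are hypothetical and mirror the numbers quoted above.

```python
def pwl_segment_index(freq_mhz, temp_c, f_min_mhz=800, f_step_mhz=100,
                      n_speeds=14, t_min_c=30, t_step_c=5):
    """PWL segment index m computed as in the worked example in the text."""
    speed_level = int((freq_mhz - f_min_mhz) // f_step_mhz) + 1  # 1-based level
    temp_level = int((temp_c - t_min_c) // t_step_c) + 1         # 1-based level
    return speed_level * n_speeds + temp_level


if __name__ == "__main__":
    print(pwl_segment_index(1500, 55))  # the example from the text
```

Frequencies are passed in MHz as integers so that the level computation avoids floating-point rounding.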
Figure 9.2: Piece-wise linearized surface of leakage power w.r.t. core speed and temperature (axes: speed, 0.8 to 2.1 GHz; temperature, 30 to 90°C; leakage power, 0 to 30 W)
9.3 Model Identification
The model identification process consists of identifying parameters for both the power and
the thermal models, viz. a, b and $\mathbf{R}$ of (9.3) and (9.4), and $P_{d,c}$, $P_{l0,c}$ and $g_{l,c}$ of (9.5). For
the model identification, a set of benchmarks is chosen and allocated randomly to the various
cores. The cores are then executed with the core speeds varied every few milliseconds,
and this is continued for several minutes. During this time, the core speeds, core utilization,
power and temperature measurements are recorded. These measurements are used in identifying
the above-mentioned power and thermal model parameters. Since power is an input to the
thermal model, it is convenient to determine the parameters of the power model first.
Power Model
The parameters to be determined in (9.5), viz. $P_{d,c}$, $P_{l0,c}$ and $g_{l,c}$, combine linearly to give
the power consumption of a core $P_c$. Therefore, the above parameters can be determined
using the following linear least squares (LLS) formulation:
\[
\mathbf{Y}\,\mathbf{p}_y = \mathbf{P}_{tot} \tag{9.6}
\]
where $P_{tot}(k) \triangleq \sum_{c=1}^{n} P_c(k)$ and k is the row index, which is also the time instant of the
measurement. In the above equation, the total power consumption is used instead of the power
consumption of individual cores because most current processors can only provide a total
power measurement for the cores. The vector $\mathbf{p}_y$ consists of the power parameters (refer to (9.5))
that are to be determined and is given by
\[
\mathbf{p}_y \triangleq \left[\mathbf{P}_d^T \;\; P_{l0,1,1} \;\; g_{l,1,1} \;\cdots\; P_{l0,n,1} \;\; g_{l,n,1} \;\cdots\; P_{l0,1,M} \;\; g_{l,1,M} \;\cdots\; P_{l0,n,M} \;\; g_{l,n,M}\right]^T, \tag{9.7}
\]
where M is the total number of PWL segments. $\mathbf{Y}$ is the matrix of input clusters defined by
\[
\mathbf{Y} \triangleq
\begin{bmatrix}
\mathbf{Y}_d^T(1) & \mathbf{Y}_{l,1}^T(1) & \cdots & \mathbf{Y}_{l,M}^T(1) \\
\mathbf{Y}_d^T(2) & \mathbf{Y}_{l,1}^T(2) & \cdots & \mathbf{Y}_{l,M}^T(2) \\
\vdots & \vdots & \ddots & \vdots \\
\mathbf{Y}_d^T(K) & \mathbf{Y}_{l,1}^T(K) & \cdots & \mathbf{Y}_{l,M}^T(K)
\end{bmatrix}, \tag{9.8}
\]
where $Y_{d,c}(k) \triangleq u_c(k)\,s_c^3(k)$, $\forall c \in \{1, \ldots, n\}$ corresponds to the dynamic power coefficient
$P_{d,c}$, while $\mathbf{Y}_{l,m}$ corresponds to the leakage power coefficients $P_{l0,c,m}$ and $g_{l,c,m}$, $\forall c$. It is
given by
\[
\mathbf{Y}_{l,m}(k) \triangleq
\begin{cases}
[1,\; T_c(k)], & s_c(k), T_c(k) \in m\text{th PWL segment} \\
[0,\; 0], & \text{otherwise.}
\end{cases} \tag{9.9}
\]
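To make the LLS step concrete, the sketch below fits the three parameters of (9.5) for the simplest possible case: a single core and a single PWL segment, so that each row of the regressor matrix is [u s^3, 1, T]. It solves the normal equations with a small Gaussian-elimination routine written here for self-containment; this is an illustrative assumption, not the solver used in the dissertation, and all names are hypothetical.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):  # back substitution
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def fit_power_model(utilization, speed, temperature, power):
    """Least-squares estimate of (P_d, P_l0, g) from rows [u s^3, 1, T]."""
    rows = [[u * s ** 3, 1.0, t]
            for u, s, t in zip(utilization, speed, temperature)]
    k = 3
    ata = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    atp = [sum(r[i] * p for r, p in zip(rows, power)) for i in range(k)]
    return solve_linear_system(ata, atp)


if __name__ == "__main__":
    # Synthetic, noise-free measurements generated from known parameters.
    u = [0.2, 0.9, 0.5, 1.0, 0.7, 0.3]
    s = [0.4, 1.0, 0.6, 0.8, 0.5, 0.9]
    t = [40.0, 65.0, 50.0, 70.0, 45.0, 60.0]
    p = [20.0 * ui * si ** 3 + 2.0 + 0.05 * ti for ui, si, ti in zip(u, s, t)]
    print(fit_power_model(u, s, t, p))
```

The full formulation in (9.7) to (9.9) stacks one such [1, T] pair per core and per PWL segment and uses the total measured power on the right-hand side.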
Thermal Model
A single equation for the temperature $\mathbf{T}$ can be obtained by substituting $T_p(k) = \mathbf{T}(k) - \mathbf{R}\mathbf{P}(k)$
from (9.4) into (9.3). The resultant equation is
\[
\mathbf{T}(k+1) = a\,\mathbf{T}(k) + \mathbf{R}\,\mathbf{P}(k+1) + b \sum_{c=1}^{n} P_c(k) - a\,\mathbf{R}\,\mathbf{P}(k). \tag{9.10}
\]
The above equation has an autoregressive (AR) term $\mathbf{T}$, and also a moving average (MA)
noise component e from the previous time interval. Hence the necessary model that
describes the above equation is the autoregressive moving average with exogenous input
(ARMAX) model. $\mathbf{P}$ is the exogenous input in the above equation. Several approaches are
available for identifying the parameters of ARMA models [86]. We use the iterative linear
least squares (ILLS) method to identify the above ARMAX model.
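Since the value of the noise component is not known a priori, we first have to determine
approximate values of a, b and $\mathbf{R}$ by solving the following LLS problem (obtained
by reorganizing (9.10)), which minimizes the noise error $e(k) + q\,e(k-1)$ (current noise
+ some combination of the noise from the previous time interval).
\[
\begin{bmatrix}
\mathbf{T}(1) & \mathbf{P}_D(2) & \mathbf{P}_D(1) \\
\mathbf{T}(2) & \mathbf{P}_D(3) & \mathbf{P}_D(2) \\
\vdots & \vdots & \vdots \\
\mathbf{T}(K-1) & \mathbf{P}_D(K) & \mathbf{P}_D(K-1)
\end{bmatrix}
\begin{bmatrix}
a \\ R_{11} \\ R_{12} \\ \vdots \\ R_{nn} \\ b - aR_{11} \\ b - aR_{12} \\ \vdots \\ b - aR_{nn}
\end{bmatrix}
=
\begin{bmatrix}
\mathbf{T}(2) \\ \mathbf{T}(3) \\ \vdots \\ \mathbf{T}(K)
\end{bmatrix}. \tag{9.11}
\]
$\mathbf{P}_D$ is defined as follows:
\[
\mathbf{P}_D(k) \triangleq
\begin{bmatrix}
\mathbf{P}^T(k) & \mathbf{0}_{1 \times n} & \cdots & \mathbf{0}_{1 \times n} \\
\mathbf{0}_{1 \times n} & \mathbf{P}^T(k) & \cdots & \mathbf{0}_{1 \times n} \\
\vdots & \vdots & \ddots & \vdots \\
\mathbf{0}_{1 \times n} & \mathbf{0}_{1 \times n} & \cdots & \mathbf{P}^T(k)
\end{bmatrix}_{n \times n^2}. \tag{9.12}
\]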
Using the solution of (9.11), an estimate for the error e is computed (e = R.H.S. − L.H.S.
of (9.11)). With the estimated e, the moving average component q is obtained by including
the error term in the above equation as shown in (9.13) (see the last column of the first
matrix on the left) and solving (9.13) iteratively until the difference in the value of a between
two consecutive iterations is less than $10^{-5}$. Note that the first element of the error column
contains the average error. This is because there is no prior knowledge of the error at time
k = 0. The above procedure provides accurate values of a, b and $\mathbf{R}$.
\[
\begin{bmatrix}
\mathbf{T}(1) & \mathbf{P}_D(2) & \mathbf{P}_D(1) & \frac{1}{K-1}\sum_{k=1}^{K} e(k) \\
\mathbf{T}(2) & \mathbf{P}_D(3) & \mathbf{P}_D(2) & e(1) \\
\vdots & \vdots & \vdots & \vdots \\
\mathbf{T}(K-1) & \mathbf{P}_D(K) & \mathbf{P}_D(K-1) & e(K-2)
\end{bmatrix}
\begin{bmatrix}
a \\ R_{11} \\ R_{12} \\ \vdots \\ R_{nn} \\ b - aR_{11} \\ b - aR_{12} \\ \vdots \\ b - aR_{nn} \\ q
\end{bmatrix}
=
\begin{bmatrix}
\mathbf{T}(2) \\ \mathbf{T}(3) \\ \vdots \\ \mathbf{T}(K)
\end{bmatrix}. \tag{9.13}
\]
9.4 Minimization of Power and Temperature Prediction Errors through Kalman Filter
The model parameters derived in the previous section are valid only for the measurements
that were used in the derivation. Moreover, the models proposed in Section 9.2 are not
exact, but a simplification of a much higher order model. As such, the predictions can
be inaccurate, and this requires feedback of the past measurements to correct the future
predictions. Also note that the measurements are almost always corrupted by noise.
The Kalman filter [87] is one of the well-known methods used for prediction in linear
systems corrupted by noise. The Kalman filter uses the prediction error to update the state of
the system such that the accuracy of future predictions is improved. In this way, model
inaccuracies are also taken care of. The Kalman filter in its original form cannot be used
with the proposed thermal model, because the proposed thermal model is not
linear, due to the cyclical dependence of core power consumption on core temperature.
Hence, we use the Extended Kalman filter technique [88], designed for estimation of non-linear
state-space systems. Below, we describe the Extended Kalman filter and the procedure for
state estimation and noise removal.
Consider the following example of a non-linear state space system:
x(k+1) = f (x(k), i(k))+w(k) (9.14)
z(k) = h(x(k), i(k))+v(k). (9.15)
x is the state variable, i and z are the input and the output variables, respectively. w and
v are the process and the observation noises, both of which are assumed to be zero mean
multivariate Gaussian noises with covariance Cw and Cv, respectively.
The predict and the update equations corresponding to the Extended Kalman filter
are given below:
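Predict Equations
\begin{align}
\hat{\mathbf{x}}(k+1|k) &= f(\hat{\mathbf{x}}(k|k), \mathbf{i}(k)) \tag{9.16} \\
\mathbf{C}_e(k+1|k) &= \mathbf{F}(k)\,\mathbf{C}_e(k|k)\,\mathbf{F}^T(k) + \mathbf{C}_w(k). \tag{9.17}
\end{align}
Equations (9.16) and (9.17) correspond to computing the predicted state estimate and the
predicted estimate covariance, respectively. The estimate of a variable y is denoted by $\hat{y}$. $\mathbf{C}_e$
is called the error covariance matrix. The state transition matrix $\mathbf{F}$ is given by
$\mathbf{F}(k) = \left.\frac{\partial f}{\partial \mathbf{x}}\right|_{\hat{\mathbf{x}}(k|k),\,\mathbf{i}(k)}$.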
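Update Equations
\begin{align}
\mathbf{e}(k+1) &= \mathbf{z}(k+1) - h(\hat{\mathbf{x}}(k+1|k), \mathbf{i}(k+1)) \tag{9.18} \\
\mathbf{C}_r(k+1) &= \mathbf{H}(k+1)\,\mathbf{C}_e(k+1|k)\,\mathbf{H}^T(k+1) + \mathbf{C}_v(k+1) \tag{9.19} \\
\mathbf{G}(k+1) &= \mathbf{C}_e(k+1|k)\,\mathbf{H}^T(k+1)\,\mathbf{C}_r^{-1}(k+1) \tag{9.20} \\
\hat{\mathbf{x}}(k+1|k+1) &= \hat{\mathbf{x}}(k+1|k) + \mathbf{G}(k+1)\,\mathbf{e}(k+1) \tag{9.21} \\
\mathbf{C}_e(k+1|k+1) &= (\mathbf{I} - \mathbf{G}(k+1)\,\mathbf{H}(k+1))\,\mathbf{C}_e(k+1|k). \tag{9.22}
\end{align}
Equations (9.18) and (9.19) compute the measurement residual and the residual covariance,
respectively. Equation (9.20) gives the Kalman gain $\mathbf{G}$, which is used to correct the estimated
state and the covariance as shown in (9.21) and (9.22), respectively, thereby minimizing
future prediction errors. In this way, the model errors are taken care of through adjusting the
state variables. $\mathbf{I}_n$ is the identity matrix of size n. The observation matrix $\mathbf{H}$ is given by
$\mathbf{H}(k) = \left.\frac{\partial h}{\partial \mathbf{x}}\right|_{\hat{\mathbf{x}}(k|k),\,\mathbf{i}(k)}$.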
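The predict and update equations can be exercised with a scalar state and a scalar measurement, where every matrix reduces to a number. The sketch below is an assumption-laden illustration: `ekf_step` and its argument names are hypothetical, the observation Jacobian is evaluated at the predicted state (the common convention in EKF implementations), and the demonstration uses a linear model, for which the EKF coincides with the ordinary Kalman filter.

```python
def ekf_step(x_est, c_est, inp, z_next, inp_next, f, h, df_dx, dh_dx, c_w, c_v):
    """One predict/update cycle of a scalar Extended Kalman filter."""
    # Predict: (9.16) and (9.17)
    big_f = df_dx(x_est, inp)
    x_pred = f(x_est, inp)
    c_pred = big_f * c_est * big_f + c_w
    # Update: (9.18) to (9.22)
    big_h = dh_dx(x_pred, inp_next)
    residual = z_next - h(x_pred, inp_next)
    c_r = big_h * c_pred * big_h + c_v
    gain = c_pred * big_h / c_r
    x_new = x_pred + gain * residual
    c_new = (1.0 - gain * big_h) * c_pred
    return x_new, c_new


if __name__ == "__main__":
    # Hypothetical linear example: x(k+1) = 0.9 x(k) + 0.1 i(k), z(k) = x(k).
    x, c = ekf_step(
        x_est=50.0, c_est=1.0, inp=10.0, z_next=47.0, inp_next=10.0,
        f=lambda x, i: 0.9 * x + 0.1 * i,
        h=lambda x, i: x,
        df_dx=lambda x, i: 0.9,
        dh_dx=lambda x, i: 1.0,
        c_w=0.01, c_v=0.18,
    )
    print(x, c)
```

For the controller in this chapter the state is the package temperature, while the measurement is a vector of core temperatures and total power, so the scalar gain above becomes a row vector.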
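Now, we will compute the above quantities specific to our controller problem. The
state and the output equations of our controller are:
\begin{align}
T_p(k+1) &= a\,T_p(k) + b \sum_{c=1}^{n} P_c(k) \tag{9.23} \\
\begin{bmatrix} \mathbf{T}(k) \\ P_{tot} \end{bmatrix}
&= \begin{bmatrix} T_p(k)\,\mathbf{1}_{n \times 1} \\ 0 \end{bmatrix}
+ \begin{bmatrix} \mathbf{R} \\ \mathbf{1}_{1 \times n} \end{bmatrix} \mathbf{P}(k) \tag{9.24} \\
P_c(k) &= P_{d,c}\,u_c(k)\,s_c^3(k) + P_{l0,c,m} + g_{l,c,m}\,T_c(k), \quad \forall c,\ s_c(k), T_c(k) \in m\text{th PWL segment}. \tag{9.25}
\end{align}
$\mathbf{1}_{n \times 1}$ is a column vector of ones of length n. The corresponding state transition and
observation matrices are given by
\[
F(k) = a + b \sum_{c} \frac{\partial P_c}{\partial T_p}, \qquad
\mathbf{H}_c(k) = \begin{bmatrix} 1 + R\,\dfrac{\partial P_c}{\partial T_p} \\[2ex] \displaystyle\sum_{c} \dfrac{\partial P_c}{\partial T_p} \end{bmatrix}. \tag{9.26}
\]
From (9.4) and (9.5),
\[
\frac{\partial P_c}{\partial T_p} = g_{l,c,m}\,\frac{\partial T_c}{\partial T_p} = \frac{g_{l,c,m}}{1 - R_{cc}\,g_{l,c,m}}. \tag{9.27}
\]
Therefore,
\begin{align}
F(k) &= a + b \sum_{c} \frac{g_{l,c,m}}{1 - R_{cc}\,g_{l,c,m}} \tag{9.28} \\
\mathbf{H}_c(k) &= \begin{bmatrix} 1 + R\,\dfrac{g_{l,c,m}}{1 - R_{cc}\,g_{l,c,m}} \\[2ex] \displaystyle\sum_{c} \dfrac{g_{l,c,m}}{1 - R_{cc}\,g_{l,c,m}} \end{bmatrix}. \tag{9.29}
\end{align}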
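The sensitivity in (9.27) can be checked numerically. Under the simplifying assumption that only the self-resistance $R_{cc}$ couples a core's power to its own temperature (the same term that appears in (9.27)), combining (9.4) and (9.5) gives a closed form for the core temperature, $T_c = (T_p + R_{cc}(P_{d,c} u_c s_c^3 + P_{l0,c,m})) / (1 - R_{cc} g_{l,c,m})$. The sketch below, with hypothetical names and values, compares this closed form with a fixed-point iteration of the cyclic power-temperature dependence.

```python
def core_temperature(t_p, r_cc, p_d, u, s, p_l0, g):
    """Closed-form T_c from T_c = T_p + R_cc * P_c and P_c = P_d u s^3 + P_l0 + g T_c."""
    denom = 1.0 - r_cc * g
    if denom <= 0.0:
        raise ValueError("R_cc * g must be below 1 for a stable operating point")
    return (t_p + r_cc * (p_d * u * s ** 3 + p_l0)) / denom


def core_temperature_fixed_point(t_p, r_cc, p_d, u, s, p_l0, g, iterations=200):
    """Resolve the cyclic power/temperature dependence by repeated substitution."""
    t_c = t_p
    for _ in range(iterations):
        p_c = p_d * u * s ** 3 + p_l0 + g * t_c
        t_c = t_p + r_cc * p_c
    return t_c


def power_sensitivity(r_cc, g):
    """dP_c/dT_p = g / (1 - R_cc g), as in (9.27)."""
    return g / (1.0 - r_cc * g)


if __name__ == "__main__":
    args = dict(r_cc=0.8, p_d=15.0, u=1.0, s=1.0, p_l0=2.0, g=0.05)
    print(core_temperature(40.0, **args))
    print(core_temperature_fixed_point(40.0, **args))
    print(power_sensitivity(0.8, 0.05))
```

The iteration converges only when $R_{cc}\,g_{l,c,m} < 1$, which is also the condition for the denominator in (9.27) to stay positive.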
9.5 DEM Controller
The role of a DEM controller is to determine the appropriate core speeds to achieve a given
objective using the power and the thermal models, while ensuring that all the specified
constraints are satisfied. The advantage of the proposed closed-loop controller design (see
Figure 9.1) is that the operation of the DEM controller is not tied to the design of the
feedback loop. Thus the DEM controller is not bound to any specific type of objective or
constraint functions. Also, there is no restriction on the DEM controller regarding the objective
type (linear/non-linear) or the complexity of the solution used to achieve the objective.
The only input that the DEM controller needs is the updated package temperature provided by
the Kalman filter. The following example of maximizing overall throughput illustrates the
operation of the DEM controller.
\begin{align}
\max_{\mathbf{s}(k)} \quad & \sum_{1 \le c \le n} s_c(k) \tag{9.30} \\
\text{s.t.} \quad & \mathbf{T}(k+1) = T_p(k+1) + \mathbf{R}\,\mathbf{P}(k) \tag{9.31} \\
& \mathbf{T}(k+1) \le \mathbf{T}_{\max} \tag{9.32} \\
& P_c(k) = P_{d,c}\,u_c(k)\,s_c^3(k) + P_{l0,c,m} + g_{l,c,m}\,T_c(k), \quad \forall c \tag{9.33} \\
& \mathbf{0}_{n \times 1} \le \mathbf{s}(k) \le \mathbf{1}_{n \times 1}. \tag{9.34}
\end{align}
Equation (9.30) denotes the objective of maximizing performance. The computation
of core temperatures for the next time interval is given in Equation (9.31). Note that in (9.31),
to compute the temperature in the next time interval, the power consumption of the current
time interval is used, as it is not possible to estimate the future core utilization. The
constraint on the maximum temperature is given in (9.32). Equation (9.33) is the same as (9.5).
The range of allowable core speeds is specified in (9.34). It can be easily shown that the
above formulation is convex w.r.t. the core speeds s. Therefore, the solution to the above
formulation can be obtained using any of the standard convex optimizers. The processor
core speeds are set according to the solution of the above optimization. The experimental
results for the above optimization can be found in Section 9.6.
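For a processor that forces all cores to one common speed and exposes only a handful of discrete levels, the throughput problem above collapses to picking the highest level whose predicted temperatures stay within the limit. The sketch below illustrates that special case only; it is not the convex solver referred to in the text. It assumes a diagonal thermal resistance matrix (`r_diag`), uses the current core temperatures in the leakage term, and all names and numbers are hypothetical.

```python
def predicted_max_temperature(s, t_p, t_cores, util, a, b, r_diag, p_d, p_l0, g):
    """Hottest predicted core temperature for a common speed s, per (9.31) and (9.33)."""
    n = len(t_cores)
    powers = [p_d[c] * util[c] * s ** 3 + p_l0[c] + g[c] * t_cores[c]
              for c in range(n)]
    t_p_next = a * t_p + b * sum(powers)  # package update as in (9.3)
    return max(t_p_next + r_diag[c] * powers[c] for c in range(n))


def max_common_speed(levels, t_max, **state):
    """Highest speed level meeting the temperature limit; lowest level if none does."""
    feasible = [s for s in levels
                if predicted_max_temperature(s, **state) <= t_max]
    return max(feasible) if feasible else min(levels)


# Hypothetical two-core state and model parameters.
STATE = dict(
    t_p=40.0, t_cores=[55.0, 55.0], util=[1.0, 1.0],
    a=0.99, b=0.005, r_diag=[1.0, 1.0],
    p_d=[20.0, 20.0], p_l0=[2.0, 2.0], g=[0.05, 0.05],
)
LEVELS = [0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]

if __name__ == "__main__":
    print(max_common_speed(LEVELS, 55.0, **STATE))
```

An exhaustive scan is adequate here because the number of levels is small; with per-core speeds the search space grows quickly and a proper optimizer becomes necessary.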
9.6 Experimental Results
Experimental Setup
The proposed controller was implemented on a quad-core Intel Sandy Bridge processor [89]
to measure the controller’s efficiency and accuracy. The processor was instrumented to
measure the power consumption of the various supply rails that feed the processor. Since
DVFS controls the power consumption of the cores only, only the power consumption of the
three rails that feed the cores was measured. The power consumption was computed by
measuring the supply voltages and the currents flowing through the power rails (very low
resistances were connected in series with the power rails). The voltages and the currents were
sampled using a TI MSP430 microcontroller [90], while the temperatures of the cores were
obtained by reading the corresponding model-specific registers (MSRs) [85]. The readings from
the TI microcontroller were smoothed to remove noise. The Sandy Bridge processor also has an
energy counter that estimates the number of joules consumed over a period of time.
Section 9.6 compares the accuracy of the power consumption values obtained from the MSRs
with those from the TI microcontroller. Unless otherwise specified, all our reported power
measurements are obtained through MSRs.
The proposed controller was written in C++ and deployed on Ubuntu Linux. ACPI [91]
APIs provided by Intel under Linux were used to modify the processor speed. The Sandy
Bridge processor supports 15 different DVFS states, also called P-states. However, the Sandy
Bridge processor requires all cores to operate at the same frequency and the same voltage.
The workload for the experiment consisted of benchmarks from MiBench [63]. Though
the temperature and power measurements could be obtained as often as every 1 ms, due to the
lag in setting P-states, our scheduling interval for DVFS optimization was set at 10 ms. For all
our experiments, the processor fan was disabled, as the fan was not user controllable and
would thereby affect our prediction accuracy.
Figure 9.3: Comparison of MSR power readings with the measurements from TI microcontroller. (a) Mean: power (W) vs. time (minutes). (b) Variance: std. dev. (W) vs. power (W). Series: MSR, TI-MSP430.
The average time for computing the controller action was 21 µs. This confirms that
the proposed controller is computationally efficient enough to be included
as part of an operating system scheduler.
In the next sections, we perform noise analysis of the sensors; compare the
accuracy of the energy counter values from MSRs with the actual power measurements from the
TI microcontroller; validate the accuracy of the proposed controller in tracking processor
power consumption and temperature; and demonstrate the improvements in throughput and
energy efficiency with the proposed controller.
Figure 9.4: Plot of standard deviation of noise from temperature sensors (x-axis: temperature, 35 to 70°C; y-axis: std. dev., 1 to 4°C)
Noise Analysis of Power and Thermal Sensors
In this section we analyze the noise statistics of the power and the thermal sensors from both
the MSRs and the TI microcontroller to determine the best choice for measurements. Figure 9.3a
shows the mean power consumption obtained from 9 identical executions of MiBench
benchmarks, as measured from both the processor MSR readings and the TI microcontroller.
The MSR readings show a higher mean power consumption, as they account for the power
consumption of all components of the processor and not just the cores, as the TI microcontroller
does. Figure 9.3b shows the corresponding standard deviation of sensor noise from the MSRs
and the TI microcontroller. The MSR readings show a very large variation in noise at higher
power consumption; however, since both the MSRs and the TI microcontroller have comparable
sensor noise variation in the lower ranges, we prefer to rely on the MSR values, as they provide
the power consumption of the entire processor. For the sake of completeness, the standard
deviation of the temperature sensor noise is shown in Figure 9.4. Note that the mean of both
the power sensor noise and the temperature sensor noise is zero.
Comparison of Tracking Ability of the Proposed Controller with Existing Approaches
In this section, we compare the prediction accuracy of our proposed closed-loop
controller with current state-of-the-art controllers. We choose the controller of Wang et al. [48]
for comparison against our proposed controller, as it is more recent and widely known in the
literature. That controller uses model predictive control (MPC) to decide the
next frequency assignments of a processor. Before comparing the performance
and the effectiveness of the two controllers, it is necessary to validate the models
the controllers use. Wang et al. [48] use a simple linear power model that relates power
consumption to the operational frequency. There are several limitations of such a model.
It does not account for the leakage power consumption that depends on temperature, which is
significant for modern processors, and it does not consider workload activity when computing
power consumption. As such, the prediction of power consumption may not be accurate
and, in some cases, can lead to instability, as will be shown next.
Figure 9.5: Plot showing the tracking of power consumption of the proposed model vs. Wang et al. model (x-axis: time, 0 to 40 s; y-axis: power, 0 to 600 W; series: Wang et al., Proposed)
Figure 9.5 shows how well the Wang et al. model tracks processor power consumption. We
have used the procedure outlined in [48] for making corrections to the power model. In the
plot, we observe that the power model fails to keep track of the power consumption between
3 s and 23 s. This can happen if the correction procedure does not account for occasional
large noise. Moreover, the model cannot make use of knowledge of the current temperature
and workload to minimize the error. The authors demonstrated the efficiency of their
models and controller through simulations, not through implementation on real processors.
Experiments with real processors can have large noise spikes that need to be handled.
Since the Wang et al. models failed the stability test, we could not compare their
controller accuracy with that of our proposed controller.
[Figure omitted. Five panels over Time (s), 0 to 180, each with Predicted and Measured traces: Core 1, Core 2, Core 3 and Core 4 Temperature (°C), 35 to 60, and Power (W), 0 to 15.]
Figure 9.6: Plot depicting the tracking ability of the proposed closed-loop controller
Figure 9.6 shows the temperature and power consumption predicted by the proposed
controller, together with the corresponding actual measurements, during the execution of
MiBench workloads for 180 s. The mean tracking error was 0.05 W for power consumption and
0.59°C for core temperatures, while the standard deviation of the error was 3.05 W and
2.37°C, respectively. Since these values are close to the standard deviation of the noise
from the power (2.45 W) and thermal (2.05°C) sensors, we conclude that our proposed
controller provides excellent tracking of the power consumption and temperature of the cores.
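The comparison between tracking error and sensor noise can be made concrete with a short sketch. The four standard deviations are the values quoted above. The helper `tracking_error` and the demo traces are hypothetical and only show how such statistics might be computed from predicted and measured samples.

```python
from statistics import mean, pstdev


def tracking_error(predicted, measured):
    """Mean and standard deviation of the error (predicted - measured)."""
    errors = [p - m for p, m in zip(predicted, measured)]
    return mean(errors), pstdev(errors)


# Standard deviations quoted in the text above.
error_std = {"power_W": 3.05, "temperature_C": 2.37}
noise_std = {"power_W": 2.45, "temperature_C": 2.05}

# Ratio of tracking-error spread to sensor-noise spread. A value near 1
# means the error is about as large as the measurement noise itself.
ratios = {key: error_std[key] / noise_std[key] for key in error_std}

# Hypothetical traces, only to show how tracking_error is called.
demo_mean, demo_std = tracking_error([10.0, 11.0, 12.0], [10.5, 10.5, 12.0])
```

Under these numbers both ratios stay below 1.3, which is the sense in which the tracking error is comparable to the sensor noise.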
Maximizing Performance Subject to Thermal Constraints
This section demonstrates the application of the proposed closed-loop controller to
maximizing performance. In essence, the controller solves the formulation (9.30)–(9.34) at
every sampling time. The plots of processor speed, power consumption and core temperatures
are shown in Figure 9.7. The maximum temperature was set at 60°C. This low temperature
ceiling was chosen deliberately, as the execution of MiBench workloads is very unlikely to
produce temperatures over 70°C. The lower limit therefore serves to demonstrate the
effectiveness of the proposed controller in adjusting the processor speed to keep the core
temperatures below the specified maximum, while achieving the maximum possible performance.
From the plots, it can be observed that, according to the measured data, the maximum
temperature is violated a few times. This is due to either (i) sensor noise, which can lead
to wrong temperature measurements and also hamper modeling accuracy; or (ii) limitations
of the proposed models in accounting for transients whose time constants are shorter than
the package thermal time constants. Even with these limitations, the maximum observed
thermal violation was 3°C, which is 1.4 times the standard deviation of the thermal sensor
noise. Notice how the controller adjusts the processor speed to match the workload
requirements for maximum performance, while lowering the processor speed when necessary
to avoid thermal violations.
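The behavior described above, raising the speed when there is thermal headroom and lowering it near the limit, can be illustrated with a deliberately simplified sketch. It is not the formulation (9.30)–(9.34). It assumes a hypothetical one-step thermal predictor, hypothetical coefficients, and a greedy choice among discrete speed levels (the levels are merely the axis ticks of the speed plot in Figure 9.7).

```python
def predict_temperature(temp_c, speed_ghz, ambient_c=35.0, a=0.9, heat_per_ghz3=2.0):
    """Hypothetical one-step thermal predictor (not the dissertation's model).

    The next temperature relaxes toward ambient and rises with a cubic
    function of speed, a rough proxy for dynamic power under DVFS.
    """
    return ambient_c + a * (temp_c - ambient_c) + heat_per_ghz3 * speed_ghz ** 3


def pick_speed(temp_c, levels_ghz, t_max_c=60.0):
    """Highest speed whose predicted temperature stays at or below t_max_c."""
    feasible = [f for f in levels_ghz if predict_temperature(temp_c, f) <= t_max_c]
    # Fall back to the slowest level if even that one is predicted to violate.
    return max(feasible) if feasible else min(levels_ghz)


levels = [0.8, 1.2, 1.6, 2.1]            # assumed discrete speed levels (GHz)
cool_choice = pick_speed(40.0, levels)   # far from the limit
hot_choice = pick_speed(59.0, levels)    # close to the limit
```

With these assumed coefficients the sketch selects the highest level at 40°C and the lowest level at 59°C. A predictor of this kind cannot see sensor noise or fast transients, which is consistent with the two sources of violation listed above.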
[Figure omitted. Three panels over Time (s), 0 to 144: Temperature (°C), 45 to 65, for Core 1 to Core 4; Speed (GHz), 0.8 to 2.1; and Power (W), 0 to 12.]
Figure 9.7: Plot of speed, core temperatures and power while maximizing total performance
under thermal constraints
Maximizing Energy efficiency
For this experiment, energy efficiency was measured using the performance-per-watt (PPW)
metric, which is commonly used in servers. PPW is defined as
\[ \mathrm{PPW}(t) = \frac{\sum_{c=1}^{n} s_c(t)}{\sum_{c=1}^{n} P_c(t)}, \]
which is the ratio of total performance to total power. The optimization problem solved
for this experiment is the same as the formulation (9.30)–(9.34), except that the objective
is replaced by the PPW metric. A set of benchmarks from MiBench was executed using the
proposed controller to maximize PPW, and the results were compared with the existing
policies available on Linux, viz. performance, powersave and ondemand [92]. As the names
imply, the performance policy tries to achieve maximum performance, powersave minimizes
the overall energy consumption, and ondemand reacts to the workload activity to maximize
performance while minimizing energy consumption.
[Figure omitted. Four panels over Time (s), with ticks at the completion times 97.9, 98.4, 124.9 and 191.1 s: Temperature (°C), 45 to 75; Speed (GHz), 0.8 to 2.1; Power (W), 0 to 40; and PPW (MIPS/Watt), 0 to 2,000. Legend: performance, ondemand, proposed, powersave.]
Figure 9.8: Comparison of speed, maximum temperature, power and PPW using various
policies on Intel Sandy Bridge processor
Table 9.1: Comparison of overall delay, energy and PPW of the proposed energy-efficient
DVFS policy with the Linux DVFS policies shown in Figure 9.8.
              performance  ondemand  proposed  powersave
Delay (s)            98.4      97.9     124.9      191.1
Energy (kJ)          21.9      21.1      14.5       13.7
MIPS/Watt           419.8     501.7     662.5      467.1
The plots of processor speed, the maximum of all core temperatures, total power
consumption and the corresponding PPW are shown in Figure 9.8. Notice that the set of
workloads executed on each core ends at a different time under each policy. For example,
under the powersave policy, the workload ends the latest, at 191.1 s, because powersave
uses only the minimum possible processor speed. On the other hand, under the performance
policy, the workload ends at 98.4 s, as it uses the maximum possible frequency. The results
for all policies are summarized in Table 9.1, and a pictorial representation is presented
in Figure 9.9. The size of the circles in the figure represents the relative MIPS/Watt. The
table and the figure clearly show that the proposed policy outperforms all of the existing
Linux policies in MIPS/Watt by at least 32%. The black dotted curve in the figure shows the
energy-delay pattern, and where a new policy might possibly fit in. Some suggestions for
further improving the energy efficiency are: (i) per-core DVFS; (ii) prior information from
manufacturers on thermal and power models; (iii) a higher sampling rate of sensors with
reduced sensor noise.
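The "at least 32%" figure can be reproduced directly from the MIPS/Watt row of Table 9.1, as the following sketch shows. The function `ppw` mirrors the ratio of total performance to total power used as the metric. The demo speeds and powers passed to it are hypothetical.

```python
def ppw(speeds, powers):
    """Performance per watt: total performance divided by total power."""
    return sum(speeds) / sum(powers)


# MIPS/Watt values copied from Table 9.1.
mips_per_watt = {
    "performance": 419.8,
    "ondemand": 501.7,
    "proposed": 662.5,
    "powersave": 467.1,
}

# Relative gain of the proposed policy over each Linux policy, in percent.
gains = {
    name: 100.0 * (mips_per_watt["proposed"] / value - 1.0)
    for name, value in mips_per_watt.items()
    if name != "proposed"
}
smallest_gain = min(gains.values())   # gain over the best Linux policy

# Hypothetical per-core speeds (MIPS) and powers (W) to show ppw() itself.
demo = ppw([1000.0, 1200.0], [2.0, 2.4])
```

The smallest gain is the one over ondemand, the most efficient of the three Linux policies in the table, at roughly 32%. The gains over powersave and performance are larger.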
[Figure omitted. Axes: Completion time (s), 60 to 200, vs. Energy (J), 0 to 3. Circles labeled Performance, Ondemand, Proposed and Powersave.]
Figure 9.9: Pictorial comparison of existing DEM policies on Linux with the proposed
policy. The dotted line shows the energy-delay pattern.
Chapter 10
Conclusions and Open Problems
Multicores have proliferated in nearly all forms of computing, and the scope of DEM has
changed along with them. This dissertation investigated not only typical DEM problems,
such as maximizing throughput under thermal constraints on multicore processors using the
controls of DVFS, task migration and fan speed, but also new problems on multicore
processors, such as minimum makespan completion, minimizing peak temperature under
constraints on the start and end times of tasks, and the design of a MIMO multicore DEM
controller. In many cases, the solutions to typical DEM problems for single-core
processors are no longer valid for multicore processors and the problems need to be solved
anew; often the solutions are complicated and have non-polynomial complexity. Task
migration introduces a new DEM control that did not exist with single-core processors.
This dissertation addresses the above problems and presents novel, optimal, yet practical
solutions.
This dissertation addresses only a small subset of the large set of DEM problems. There
remain many DEM problems that either are not addressed in the literature or do not have
efficient solutions. Some of these challenging problems, which could benefit from this
dissertation, are outlined below:
1. [Network of heterogeneous components] The mobile market is growing larger than the PC
and server markets. The demand for many features on a smartphone requires the integration
of many heterogeneous components, such as GPS, Wi-Fi, a graphics unit and on-die memory,
along with multiple cores on a single die, called a system-on-chip (SoC). These components
are connected through an interconnect bus. The activity rate of one unit depends on the
operation of the other units, and as such the components cannot be optimized individually.
For example, the memory and CPU processing rates are tightly dependent on the application
that is executed. Increasing the CPU clock frequency need not increase performance for
memory-intensive applications, as the CPU may issue memory requests faster than the memory
can satisfy them, which results in CPU stalls. Hence it is important to model the
interaction of the different components on an SoC in order to determine the optimal
operating configuration of the heterogeneous components.
2. [Data center] Like an SoC, a data center is composed of many different computing
resources, but unlike an SoC, a data center has many variables of optimization. For
example, a data center has inlet air that flows from the bottom of the aisles and is drawn
in by processor fans, while the hot air is collected by vents at the top of the data
center. Since cooling is responsible for up to about 50% of data center cost, controlling
the inlet air temperature and velocity can greatly reduce data center costs. Another area
of optimization is workload allocation on servers.
3. [Resource fairness and Quality of service (QoS)] Fair allocation of resources is
important in all forms of computing. Fairness is usually dictated by the user priority of
an application. Applications with the same priority are expected to receive equal shares of
resources. Usually fairness of resource allocation is not an issue, but under energy
minimization and QoS constraints, applications can sometimes be deprived of resources.
Different applications have different acceptable QoS constraints. For example, video and
audio have higher QoS requirements than browsers and word processors. If the QoS
constraints are not hard constraints, then the problem is to minimize energy consumption
with the least violation of QoS constraints, while ensuring fair allocation of resources
according to user priority.
REFERENCES
[1] R. M. Ramanathan, Intel multicore processors: Making the move to quad-core and beyond,
White Paper, Intel Corp., 2006. [Online]. Available:
http://www.intel.com/technology/architecture/downloads/quad-core-06.pdf
[2] R. Rao, “Fast and accurate techniques for early design space exploration and dy-
namic thermal management of multi-core processors,” Ph.D. dissertation, Arizona
State University, 2008.
[3] V. Hanumaiah, S. Vrudhula, and K. S. Chatha, “Maximizing Performance of Ther-
mally Constrained Multi-core Processors by Dynamic Voltage and Frequency Con-
trol,” in Proc. ICCAD, 2009, pp. 310–313.
[4] R. Rao and S. Vrudhula, “Fast and Accurate Prediction of the Steady State Through-
put of Multi-core Processors under Thermal Constraints,” IEEE Trans. Computer-
Aided Design, vol. 28, pp. 1559–1572, 2009.
[5] ——, “Performance Optimal Processor Throttling under Thermal Constraints,” in
Proc. CASES, 2007, pp. 257–266.
[6] S. Borkar, “Thousand Core Chips – A Technology Perspective,” in Proc. DAC, 2007,
pp. 746–749.
[7] K. Skadron, M. R. Stan, K. Sankaranarayanan, W. Huang, S. Velusamy, and D. Tarjan,
“Temperature-aware Microarchitecture: Modeling and Implementation,” ACM Trans.
Arch. Code Opt., vol. 1, pp. 94–125, 2004.
[8] A. Chandrakasan, S. Sheng, and R. Brodersen, “Low-power cmos digital design,”
Solid-State Circuits, IEEE Journal of, vol. 27, no. 4, pp. 473–484, 1992.
[9] J. Donald and M. Martonosi, “Techniques for Multicore Thermal Management: Clas-
sification and New Exploration,” in Proc. ISCA, 2006, pp. 78–88.
[10] J. Li and J. F. Martínez, “Power-performance Considerations of Parallel Computing
on Chip Multiprocessors,” ACM Trans. Archit. Code Optim., vol. 2, pp. 397–422,
2005.
[11] C. Isci, A. Buyuktosunoglu, C.-Y. Cher, P. Bose, and M. Martonosi, “An analysis of
efficient multi-core global power management policies: Maximizing performance for
a given power budget,” in Proc. Intl’ Symp. Microarch. (MICRO), 2006, pp. 347–358.
[12] M. Curtis-Maury, J. Dzierwa, C. D. Antonopoulos, and D. S. Nikolopoulos, “Online
strategies for high-performance power-aware thread execution on emerging multi-
processors,” in Proc. Workshop on High-Performance Power-aware Computing (HP-
PAC), 2006.
[13] D. Brooks and M. Martonosi, “Dynamic Thermal Management for High-performance
Microprocessors,” in Proc. HPCA, 2001, pp. 171–182.
[14] P. Chaparro, J. González, G. Magklis, Q. Cai, and A. González, “Understanding the
Thermal Implications of Multicore Architectures,” IEEE Trans. Parallel and Dis-
tributed Sys., vol. 18, pp. 1055–1065, 2007.
[15] R. Rao, S. Vrudhula, and C. Chakrabarti, “Throughput of Multi-core Processors under
Thermal Constraints,” in Proc. ISLPED, 2007, pp. 201–206.
[16] V. Hanumaiah, S. Vrudhula, and K. S. Chatha, “Performance Optimal Speed Control
of Multi-Core Processors under Thermal Constraints,” in Proc. DATE, 2009, pp. 288–
293.
[17] ——, “Performance Optimal Online DVFS and Task Migration Techniques for Ther-
mally Constrained Multi-core Processors,” IEEE Transactions on Computer-Aided
Design, vol. 30, no. 11, pp. 1677–1690, November 2011.
[18] S. Murali, A. Mutapcic, D. Atienza, R. Gupta, S. Boyd, and G. D. Micheli,
“Temperature-aware Processor Frequency Assignment for MPSoCs using Convex
Optimization,” in Proc. CODES, 2007, pp. 111–116.
[19] S. Zhang and K. S. Chatha, “Approximation Algorithm for the Temperature-Aware
Scheduling Problem,” in Proc. ICCAD, 2007, pp. 281–288.
[20] A. Cohen, F. Finkelstein, A. Mendelson, R. Ronen, and D. Rudoy, “On Estimating
Optimal Performance of CPU Dynamic Thermal Management,” IEEE Computer Ar-
chitecture Letters, vol. 2, pp. 6–6, 2003.
[21] V. Hanumaiah and S. Vrudhula, “Temperature-aware DVFS for Hard Real-time Ap-
plications on Multi-core Processors,” IEEE Transactions on Computers, vol. 61,
no. 10, pp. 1484–1494, October 2012.
[22] R. Jayaseelan and T. Mitra, “Temperature Aware Task Sequencing and Voltage Scal-
ing,” in Proc. ICCAD, 2008, pp. 618–623.
[23] T. Chantem, R. P. Dick, and X. S. Hu, “Temperature-aware Scheduling and Assign-
ment for Hard Real-time Applications on MPSoCs,” in Proc. DATE, 2008, pp. 288–
293.
[24] V. Hanumaiah and S. Vrudhula, “Reliability-aware thermal management for hard
real-time applications on multi-core processors,” in Design, Automation Test in Eu-
rope Conference Exhibition (DATE), 2011, 2011, pp. 1–6.
[25] V. Srinivasan, D. Brooks, M. Gschwind, P. Bose, V. Zyuban, P. N. Strenski, and P. G.
Emma, “Optimizing Pipelines for Power and Performance,” in Proc. MICRO, 2002,
pp. 333–344.
[26] V. Hanumaiah and S. Vrudhula, “Energy-efficient operation of multi-core proces-
sors by dvfs, task migration and active cooling,” Computers, IEEE Transactions on,
vol. PP, no. 99, pp. 1–1, 2012.
[27] “HP Research Smart Cooling,” http://www.hpl.hp.com/research/smart cooling/index.
html.
[28] M. Ghasemazar, E. Pakbaznia, and M. Pedram, “Minimizing the Power Consumption
of a Chip Multiprocessor under an Average Throughput Constraint,” in Proc. ISQED,
2010, pp. 362–371.
[29] D. Shin, S. W. Chung, E.-Y. Chung, and N. Chang, “Energy-Optimal Dynamic Ther-
mal Management: Computation and Cooling Power Co-Optimization,” IEEE Trans-
actions on Industrial Informatics, vol. 6, no. 3, pp. 340–351, August 2010.
[30] A. Khajeh, A. Gupta, N. Dutt, F. Kurdahi, A. Eltawil, K. Khouri, and M. Abadir,
“Tram: A tool for temperature and reliability aware memory design,” in Proc. DATE,
2009, pp. 340–345.
[31] T. Chantem, X. S. Hu, and R. P. Dick, “Online Work Maximization under a Peak
Temperature Constraint,” in Proc. ISLPED, 2009, pp. 105–110.
[32] T. Constantinou, Y. Sazeides, P. Michaud, D. Fetis, and A. Seznec, “Performance Im-
plications of Single Thread Migration on a chip Multi-core,” ACM SIGARCH Comp.
Arch. News, vol. 33, pp. 80–91, 2005.
[33] P. Michaud, A. Seznec, D. Fetis, Y. Sazeides, and T. Constantinou, “A Study of
Thread Migration in Temperature-constrained Multicores,” ACM Trans. Arch. Code
Opt., vol. 4, pp. 9–1–9–28, 2007.
[34] M. D. Powell, M. Gomaa, and T. N. Vijaykumar, “Heat-and-run: Leveraging SMT
and CMP to Manage Power Density through the Operating System,” SIGOPS Oper.
Syst. Rev., vol. 38, pp. 260–270, 2004.
[35] F. Mulas, M. Pittau, M. Buttu, S. Carta, A. Acquaviva, L. Benini, and D. Atienza,
“Thermal Balancing Policy for Streaming Computing on Multiprocessor Architec-
tures,” in Proc. DATE, 2008, pp. 734–739.
[36] A. Coskun, T. Rosing, K. Whisnant, and K. Gross, “Static and Dynamic Temperature-
Aware Scheduling for Multiprocessor SoCs,” IEEE Transactions on Very Large Scale
Integration (VLSI) Systems, vol. 16, pp. 1127–1140, 2008.
[37] V. Hanumaiah, R. Rao, S. Vrudhula, and K. S. Chatha, “Throughput Optimal Task Al-
location under Thermal Constraints for Multi-core Processors,” in Proc. DAC, 2009,
pp. 1548–1551.
[38] W. Huang, M. R. Stan, K. Skadron, K. Sankaranarayanan, and S. Ghosh, “HotSpot:
A Compact Thermal Modeling Method for CMOS VLSI Systems,” IEEE Trans. VLSI
Sys., vol. 14, pp. 501–513, 2006.
[39] D. Brooks, V. Tiwari, and M. Martonosi, “Wattch: A Framework for Architectural-
level Power Analysis and Optimizations,” in Proc. ISCA, 2000, pp. 83–94.
[40] “Ansys Icepak features: Cooling software for the electronics industry,” Ansys Inc.
[Online]. Available: http://www.ansys.com/products/icepak/features.asp?name=p1
[41] P.-Y. Huang and Y.-M. Lee, “Full-chip thermal analysis for the early design stage via
generalized integral transforms,” IEEE Transactions on VLSI Systems, vol. 17, no. 5,
pp. 613 –626, may 2009.
[42] S. Heo, K. Barr, and K. Asanovic, “Reducing Power Density through Activity Migra-
tion,” in Proc. ISLPED, 2003, pp. 217–222.
[43] R. Rao and S. Vrudhula, “Efficient Online Computation of Core Speeds to Maximize
the Throughput of Thermally Constrained Multi-core Processors,” in Proc. ICCAD,
2008, pp. 537–542.
[44] K. Skadron, T. Abdelzaher, and M. R. Stan, “Control-theoretic Techniques and
Thermal-RC Modeling for Accurate and Localized Dynamic Thermal Management,”
in Proc. HPCA, 2002, pp. 17–28.
[45] Y. Fu, N. Kottenstette, Y. Chen, C. Lu, X. Koutsoukos, and H. Wang, “Feedback
thermal control for real-time systems,” in Real-Time and Embedded Technology and
Applications Symposium (RTAS), 2010 16th IEEE, april 2010, pp. 111–120.
[46] A. Bartolini, M. Cacciari, A. Tilli, and L. Benini, “A distributed and self-calibrating
model-predictive controller for energy and thermal management of high-performance
multicores,” in Design, Automation Test in Europe Conference Exhibition (DATE),
2011, march 2011, pp. 1–6.
[47] H. Jung and M. Pedram, “Stochastic Dynamic Thermal Management: A Markovian
Decision-based Approach,” in Proc. ICCD, 2006, pp. 452–457.
[48] Y. Wang, K. Ma, and X. Wang, “Temperature-constrained Power Control for chip
Multiprocessors with Online Model Estimation,” SIGARCH Comput. Archit. News,
vol. 37, pp. 314–324, 2009.
[49] F. Zanini, D. Atienza, L. Benini, and G. De Micheli, “Multicore thermal management
with model predictive control,” in Circuit Theory and Design, 2009. ECCTD 2009.
European Conference on, aug. 2009, pp. 711–714.
[50] C. Isci, G. Contreras, and M. Martonosi, “Live, runtime phase monitoring and predic-
tion on real systems with application to dynamic power management,” in Proc. Intl’
Symp. Microarch. (MICRO), 2006.
[51] G. Dhiman and T. S. Rosing, “Dynamic voltage frequency scaling for
multi-tasking systems using online learning,” in Proceedings of the 2007
international symposium on Low power electronics and design, ser. ISLPED
’07. New York, NY, USA: ACM, 2007, pp. 207–212. [Online]. Available:
http://doi.acm.org/10.1145/1283780.1283825
[52] R. Cochran, C. Hankendi, A. K. Coskun, and S. Reda, “Pack & cap: adaptive
dvfs and thread packing under power caps,” in Proceedings of the 44th
Annual IEEE/ACM International Symposium on Microarchitecture, ser. MICRO-44
’11. New York, NY, USA: ACM, 2011, pp. 175–185. [Online]. Available:
http://doi.acm.org/10.1145/2155620.2155641
[53] W. Liao, L. He, and K. M. Lepak, “Temperature and Supply Voltage Aware Per-
formance and Power Modeling at Microarchitecture Level,” IEEE Trans. Computer-
Aided Design, vol. 24, pp. 1042–1053, 2005.
[54] W. Huang, K. Sankaranarayanan, R. J. Ribando, M. R. Stan, and K. Skadron, “An Im-
proved Block-based Thermal Model in Hotspot 4.0 with Granularity Considerations,”
in Proc. WDDD, 2007.
[55] NVIDIA Compute Unified Device Architecture: Programming Guide, June 2008.
[56] R. Rao, S. Vrudhula, and K. Berezowski, “Analytical Results for Design Space Ex-
ploration of Multi-core Processors Employing Thread Migration,” in Proc. ISLPED,
2008, pp. 229–232.
[57] R. J. Moffat, “Modeling air-cooled heat sinks as heat exchangers,” in Proc. Semi-
Therm, 2007, pp. 200–207.
[58] D. E. Kirk, Optimal Control Theory. Prentice-Hall, 1970.
[59] R. F. Hartl, S. P. Sethi, and R. G. Vickson, “A Survey of the Maximum Principles for
Optimal Control Problems with State Constraints,” SIAM Rev., vol. 37, pp. 181–218,
1995.
[60] P. M. DeRusso, R. J. Roy, C. M. Close, and A. A. Desrochers, State Variables for
Engineers. Wiley-Interscience, 1997.
[61] M. Monchiero, R. Canal, and A. González, “Power/performance/thermal Design
Space Exploration for Multicore Architectures,” IEEE Trans. Parallel and Distributed
Sys., 2008.
[62] V. Hanumaiah, R. Rao, and S. Vrudhula, “The MAGMA Thermal Simulator,”
http://vrudhula.lab.asu.edu/magma, Arizona State University.
[63] M. R. Guthaus, J. S. Ringenberg, D. Ernst, T. M. Austin, T. Mudge, and R. B. Brown,
“MiBench: A free, commercially representative embedded benchmark suite,” in Proc.
WWC, 2001, pp. 3–14.
[64] C. H. Papadimitriou and K. Steiglitz, Combinatorial Optimization: Algorithms and
Complexity. Dover Publications, 1998.
[65] J. Munkres, “Algorithms for the Assignment and Transportation Problems,” Journal
of the Society for Industrial and Applied Mathematics, vol. 5, pp. 32–38, 1957.
[66] 2nd Generation Intel® Core™ Processor Family Desktop and Intel® Pentium® Processor
Family Desktop, and LGA1155 Socket, Intel Corp., May 2011.
[67] R. Viswanath, V. Wakharkar, A. Watwe, and V. Lebonheur, “Thermal Performance
Challenges from Silicon to Systems,” Intel Technology Journal, vol. 4, pp. 1–16,
2000.
[68] J. Srinivasan, S. V. Adve, P. Bose, and J. A. Rivers, “The Case for Lifetime
Reliability-Aware Microprocessors,” SIGARCH Comput. Archit. News, vol. 32, p.
276, 2004.
[69] E. Karl, D. Blaauw, D. Sylvester, and T. Mudge, “Reliability Modeling and Manage-
ment in Dynamic Microprocessor-based Systems,” in Proc. DAC, 2006, pp. 1057–
1060.
[70] K. Waldschmidt, J. Haase, A. Hofmann, M. Damm, and D. Hauser, “Reliability-
Aware Power Management Of Multi-Core Systems (MPSoCs),” in Proc. Dynamically
Reconfigurable Architectures, 2006.
[71] A. K. Coskun, R. Strong, D. M. Tullsen, and T. S. Rosing, “Evaluating the Impact of
Job Scheduling and Power Management on Processor Lifetime for Chip Multiproces-
sors,” in Proc. SIGMETRICS, 2009, pp. 169–180.
[72] J. Laudon, “Performance/Watt: The New Server Focus,” SIGARCH Comput. Archit.
News, vol. 33, pp. 5–13, 2005.
[73] S. P. Boyd and L. Vandenberghe, Convex Optimization. Cambridge University Press,
2004.
[74] G. Lowney, “Why Intel is designing multi-core processors,” in Proc. SPAA, 2006, pp.
113–113.
[75] S. Borkar, T. Karnik, S. Narendra, J. Tschanz, A. Keshavarzi, and V. De, “Parameter
variations and impact on circuits and microarchitecture,” in Proc. DAC, 2003, pp.
338–342.
[76] S. Kulkarni, D. Sylvester, and D. Blaauw, “A statistical framework for post-silicon
tuning through body bias clustering,” in Proc. ICCAD, 2006, pp. 39–46.
[77] K. Agarwal and S. Nassif, “The impact of random device variation on sram cell sta-
bility in sub-90-nm cmos technologies,” Very Large Scale Integration (VLSI) Systems,
IEEE Transactions on, vol. 16, no. 1, pp. 86–97, Jan. 2008.
[78] S. R. Sarangi, B. Greskamp, R. Teodorescu, J. Nakano, A. Tiwari, and J. Torrellas,
“Varius: A model of process variation and resulting timing errors for microarchitects,”
Semiconductor Manufacturing, IEEE Transactions on, vol. 21, no. 1, pp. 3–13, Feb.
2008.
[79] H. Mahmoodi, S. Mukhopadhyay, and K. Roy, “Estimation of delay variations due to
random-dopant fluctuations in nanoscale cmos circuits,” Solid-State Circuits, IEEE
Journal of, vol. 40, no. 9, pp. 1787–1796, sept. 2005.
[80] H.-S. Wong, Y. Taur, and D. Frank, “Discrete random dopant distribution effects in
nanometer-scale MOSFETs,” Microelectronics Reliability, vol. 38, no. 9, pp. 1447–
1456, 1998.
[81] A. Khajeh, K. Amiri, M. Khairy, A. Eltawil, and F. Kurdahi, “A unified hardware and
channel noise model for communication systems,” in Proc. GLOBECOM, 2010, pp.
1–5.
[82] C. Bienia, S. Kumar, J. P. Singh, and K. Li, “The PARSEC benchmark suite: Charac-
terization and architectural implications,” in Proc. PACT, 2008, pp. 72–81.
[83] T. E. Carlson, W. Heirman, and L. Eeckhout, “Sniper: Exploring the Level of Ab-
straction for Scalable and Accurate Parallel Multi-Core Simulations,” in Proc. High
Performance Computing, Networking, Storage and Analysis, Nov. 2011.
[84] S. Li, J. H. Ahn, R. D. Strong, J. B. Brockman, D. M. Tullsen, and N. P. Jouppi,
“McPAT: An integrated power, area, and timing modeling framework for multicore
and manycore architectures,” in Proc. IEEE Micro, 2009, pp. 469–480.
[85] Intel 64 and IA-32 Architectures Software Developers Manual, Intel Corporation, Au-
gust 2012.
[86] “Autoregressive–moving-average model,”
http://en.wikipedia.org/wiki/Autoregressive%E2%80%93moving-average_model.
[87] R. Kalman, “A new approach to linear filtering and prediction problems,” in Trans. of
the ASME, 1960.
[88] P. S. Maybeck, Stochastic Models, Estimation, and Control. Mathematics in Science
and Engineering. Academic Press, 1979.
[89] E. Rotem, A. Naveh, D. Rajwan, A. Ananthakrishnan, and E. Weissmann, “Power-
management architecture of the intel microarchitecture code-named sandy bridge,”
Micro, IEEE, vol. 32, no. 2, pp. 20–27, march-april 2012.
[90] “EZ430-RF2500T – MSP430 2.4-GHz Wireless Target Board,” https://estore.ti.com/
EZ430-RF2500T-MSP430-24-GHz-Wireless-Target-Board-P1295.aspx.
[91] “Advanced configuration and power interface specification,” http://www.acpi.info/
spec.htm.
[92] V. Pallipadi and A. Starikovskiy, “The ondemand governor - past, present, and future,”
in Proc. Linux Symposium, vol. 2, 2006, pp. 223–238.
APPENDIX A
PROOFS
A.1 Proof of Optimal Control Policy
Hamiltonian Setup
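Consider the optimal control formulation in Section 3.3. First we construct the
Hamiltonian [58] from the performance index (3.10), which can be written as
$\int_0^{t_f} 1\,dt$, and the state equality constraints (3.11) and (3.13), as shown below:
\[
\mathcal{H}(\mathbf{x},\mathbf{T},\mathbf{s},\mathbf{v},\boldsymbol{\lambda}_x,\boldsymbol{\lambda}_T,\boldsymbol{\mu},t)
= 1 + \sum_c \lambda_{x,c}(t)\,\mathrm{IPC}_c(t)\,s_c(t)
+ \boldsymbol{\lambda}_T' \left[\hat{\mathbf{A}}\mathbf{T}(t) + \mathbf{B}\left\{\operatorname{diag}(\mathbf{P}_d^{\max})\,\mathbf{X}\,\operatorname{diag}(\mathbf{v}(t))^2\,\mathbf{s}(t) + \operatorname{diag}(\mathbf{P}_v)\,\mathbf{X}\,\mathbf{v}(t) + \mathbf{P}_{l0}\right\}\right]. \tag{A.1}
\]
$\boldsymbol{\lambda}_x$ and $\boldsymbol{\lambda}_T$ are the co-state variables that act as Lagrange multipliers for
(3.11) and (3.13) respectively.
The states ($\mathbf{x}$, $\mathbf{T}$) and the co-states ($\boldsymbol{\lambda}_x$, $\boldsymbol{\lambda}_T$) in the Hamiltonian (A.1)
are required to satisfy the following:
\[
\frac{d\mathbf{x}^*(t)}{dt} = \mathcal{H}^*_{\boldsymbol{\lambda}_x}(t), \qquad
\frac{d\mathbf{T}^*(t)}{dt} = \mathcal{H}^*_{\boldsymbol{\lambda}_T}(t), \tag{A.2}
\]
\[
\frac{d\boldsymbol{\lambda}^*_x(t)}{dt} = -\mathcal{H}^*_{\mathbf{x}}(t), \qquad
\frac{d\boldsymbol{\lambda}^*_T(t)}{dt} = -\mathcal{H}^*_{\mathbf{T}}(t). \tag{A.3}
\]
Solving for the co-state variables,
\[
\boldsymbol{\lambda}^*_x(t) = \boldsymbol{\lambda} \ \text{(a constant)}, \tag{A.4}
\]
\[
\frac{d\boldsymbol{\lambda}^*_T(t)}{dt} = -\hat{\mathbf{A}}'\boldsymbol{\lambda}^*_T(t). \tag{A.5}
\]
From state space methods [60], $\boldsymbol{\lambda}^*_T$ can be calculated as
\[
\boldsymbol{\lambda}^*_T(t) = e^{-\hat{\mathbf{A}}t}\boldsymbol{\lambda}^*_T(0). \tag{A.6}
\]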
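The step from (A.3) to (A.4) and (A.5) can be spelled out as follows. This is a sketch under the assumption, suggested by the form of (A.1), that $\mathbf{x}$ does not appear explicitly in the Hamiltonian and that any temperature dependence of power is already absorbed in $\hat{\mathbf{A}}$, so that the term in braces does not depend on $\mathbf{T}$.

```latex
\begin{align*}
\frac{\partial \mathcal{H}}{\partial x_c} &= 0
  &&\Rightarrow\quad \frac{d\lambda^*_{x,c}(t)}{dt} = 0
  \quad\Rightarrow\quad \lambda^*_{x,c}(t) = \text{const.},\\
\frac{\partial \mathcal{H}}{\partial \mathbf{T}}
  &= \frac{\partial}{\partial \mathbf{T}}
     \bigl(\boldsymbol{\lambda}_T'\hat{\mathbf{A}}\mathbf{T}\bigr)
   = \hat{\mathbf{A}}'\boldsymbol{\lambda}_T
  &&\Rightarrow\quad \frac{d\boldsymbol{\lambda}^*_T(t)}{dt}
   = -\hat{\mathbf{A}}'\boldsymbol{\lambda}^*_T(t).
\end{align*}
```

If the original formulation contains an explicit temperature-dependent power term inside the braces, the second line would carry an additional term.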
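Let $\boldsymbol{\mu}$ and $\boldsymbol{\nu}$ be the vectors of constraint qualifiers [59] for (3.15) and (3.14)
respectively, such that
\[ \mu_c(t)\,g_c(t) = 0, \quad \mu_c(t) \ge 0, \quad \forall t, c, \tag{A.7} \]
\[ \boldsymbol{\nu}'(t)\,\mathbf{h}(t) = 0, \quad \boldsymbol{\nu}(t) \ge 0, \quad \forall t, \tag{A.8} \]
where
\[
g_c(t) = s_c(t) - \frac{k_v\,(v_c(t) - v_{th})^{1.2}}{v_c(t)\,\max_b\,(T_c(t))^{1.19}}
\quad \text{and} \quad \mathbf{h}(t) = \mathbf{T}(t) - \mathbf{T}_{\max}.
\]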
Optimal Voltage-Speed Profile
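Pontryagin's minimum principle [58] states that the optimal controls $\mathbf{s}^*$, $\mathbf{v}^*$ need to
satisfy the following inequality:
\[
\mathcal{H}(\mathbf{x}^*,\mathbf{T}^*,\mathbf{s}^*,\mathbf{v}^*,\boldsymbol{\lambda}^*_x,\boldsymbol{\lambda}^*_T,\boldsymbol{\mu}^*,t)
\le \mathcal{H}(\mathbf{x}^*,\mathbf{T}^*,\mathbf{s},\mathbf{v},\boldsymbol{\lambda}^*_x,\boldsymbol{\lambda}^*_T,\boldsymbol{\mu}^*,t). \tag{A.9}
\]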
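From the direct adjoining approach [59],
\[ \mathcal{L}^*_{s_c}(t) = \mathcal{H}^*_{s_c}(t) + \mu_c\,g_{s_c}(t) = 0, \quad \forall t, c, \tag{A.10} \]
\[ \mathcal{L}^*_{v_c}(t) = \mathcal{H}^*_{v_c}(t) + \mu_c\,g_{v_c}(t) = 0, \quad \forall t, c. \tag{A.11} \]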
Replacing the Hamiltonian in (A.9), (A.10) and (A.11) with (A.1) and separating the
controls for every core results in
\[
\min_{s_c(t),\,v_c(t)} \left[\sum_{j=1}^{m}\left(P^{\max}_{d,j}(t)\,v_c^2(t)\sum_{i=1}^{N} B_{i,j}\lambda_{T,i}(t) + \lambda_{x,c}\,\mathrm{IPC}_c(t)\right)\right] s_c(t) + \sum_{j=1}^{m} P^{\max}_{v,j}(t)\,v_c(t), \tag{A.12}
\]
\[
\left[\lambda_{x,c}\,\mathrm{IPC}_c(t) + \sum_{j=1}^{m}\left(P^{\max}_{d,j}(t)\,v_c^2(t)\sum_{i=1}^{N} B_{i,j}\lambda_{T,i}(t)\right)\right] = -\mu_c \le 0, \quad \forall t, c, \text{ and} \tag{A.13}
\]
\[
\sum_{j=1}^{m}\left(P^{\max}_{v,j}(t)\,v_c(t) + P^{\max}_{d,j}(t)\,v_c^2(t)\sum_{i=1}^{N} B_{i,j}\lambda_{T,i}(t)\right)
= \frac{k_v}{v_c^2(t)\,\max_b\,(T_c(t))^{1.19}}\,(v_c(t)-v_t)^{0.2}\,(0.2\,v_c(t)+v_t), \quad \forall t, c, \tag{A.14}
\]
respectively. Note that all the terms on the L.H.S. and the R.H.S. of (A.14) except
$\lambda_{T,i}$ are non-negative by definition. This forces $\lambda_{T,i} \ge 0$. By a similar argument,
examining (A.13) gives $\lambda_{x,c} \le 0$.
Hence, to minimize (A.12), $s_c(t) = 1$ (see (A.13)). Since $\lambda_{T,i} \ge 0$ and $\lambda_{x,c} \le 0$,
(A.12) is minimized by $v_c(t) = 0$. However, $v_c(t)$ is bounded from below by (3.15), which
turns the voltage-speed inequality (3.15) into the equality (3.18). Similarly, $s_c(t)$ has
an upper-bound constraint (3.14). Thus the resulting optimal $s_c(t)$ is given by (3.17).
A.2 Proof of Convexity of Voltage-speed Scaling
In order for the formulation in Section 3.4 to be a convex optimization problem, the
objective and all the inequality constraints need to satisfy either the first-order or the
second-order necessary and sufficient convexity conditions [73]. Here we use the
second-order conditions to prove the convexity of the problem.
The second-order necessary and sufficient conditions state that a function $f$ that is
twice differentiable on its domain ($\operatorname{dom} f$) is convex iff $\operatorname{dom} f$ is convex and its
Hessian, or second derivative, is positive semi-definite, i.e.
\[ \nabla^2 f(\mathbf{x}) \succeq 0, \quad \forall \mathbf{x} \in \operatorname{dom} f. \tag{A.15} \]
The above second-order conditions can be easily verified for (3.24), (3.26) and (3.28).
The proofs of convexity for the rest of the constraints, viz. (3.25) and (3.27), are shown
below.
Since T is linearly dependent on Pˆ (as T(k− 1) is a constant in kth interval) ac-
cording to (3.29), proving the convexity of Pˆ is sufficient to prove the convexity of T.
Note that the time indicator k or kts is dropped in the subsequent derivations as the convex
formulation (3.24) – (3.28) remains the same for every scheduling interval.
Pˆ is defined in (3.20). The first and second order derivatives of Pˆ w.r.t. s are given
by
∇Pˆ= [2diag(Pmaxd )diag(v)+diag(Pv)]X∇v, (A.16)
∇2Pˆ(t) = 2diag(Pmaxd )X[diag(∇v)+∇
2v]+diag(Pv)X∇2v. (A.17)
Note that $v_c$ depends only on $s_c$ from (3.18). Hence, finding the derivative of $v_c$ w.r.t. $s_c$,
\[ \frac{dv_c}{ds_c} = \frac{v_c^2 \, \max_b (T_c)^{1.19}}{k_v} \cdot \frac{1}{(v_c - v_t)^{0.2}\,(0.2 v_c + v_t)}. \tag{A.18} \]
Similarly, the second derivative is computed as
\[ \frac{d^2 v_c}{ds_c^2} = \frac{2 \left( \frac{dv_c}{ds_c} \right)^2}{v_c} \left[ \frac{0.2 v_c^2 + 0.8 v_t v_c - v_t^2}{(0.2 v_c + v_t)(v_c - v_t)} \right]. \tag{A.19} \]
For $v_c \ge v_t$, it can be easily shown that $\frac{d^2 v_c}{ds_c^2} \ge 0$. Hence the voltage $v_c$ is convex w.r.t. the speed $s_c$, and thereby $\hat{P}$ and $T$ are also convex w.r.t. $s$ from (A.17).
Since all the constraints in the formulation are proved to be convex, the formulation
turns out to be a convex optimization problem.
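The closed form in (A.18) can be spot-checked numerically. The sketch below assumes a voltage-speed relation of the form $s = k_v (v - v_t)^{1.2} / (v\,T^{1.19})$; this form is an assumption made for illustration, since (3.18) is not reproduced in this appendix. A single temperature value `T` stands in for the maximum in (A.18), and all parameter values are hypothetical. The reciprocal of a central-difference estimate of $ds/dv$ is compared with the right-hand side of (A.18).

```python
def speed(v, vt=0.3, T=350.0, kv=1.0):
    """Assumed voltage-speed relation: s = kv * (v - vt)^1.2 / (v * T^1.19)."""
    return kv * (v - vt) ** 1.2 / (v * T ** 1.19)

def dv_ds_closed_form(v, vt=0.3, T=350.0, kv=1.0):
    """Right-hand side of (A.18), evaluated for one temperature value T."""
    return (v ** 2 * T ** 1.19 / kv) / ((v - vt) ** 0.2 * (0.2 * v + vt))

def dv_ds_numeric(v, h=1e-6, **params):
    """Reciprocal of a central-difference estimate of ds/dv at voltage v."""
    ds_dv = (speed(v + h, **params) - speed(v - h, **params)) / (2 * h)
    return 1.0 / ds_dv

# Hypothetical operating voltages, all above the threshold vt = 0.3.
voltages = [0.6, 0.8, 1.0, 1.2]
rel_errors = [abs(dv_ds_numeric(v) / dv_ds_closed_form(v) - 1.0) for v in voltages]
print(max(rel_errors))
```

Under the assumed relation, the two estimates agree up to finite-difference error, which suggests that (A.18) is the derivative of the inverse of that relation.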
A.3 Proof of Quasiconcavity of PPW
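Let
\[ f(s) = \frac{\sum_{c=0}^{n} w_c s_c(t)}{P^T(s,v,T)\,\mathbf{1}_{N \times 1}} = \frac{S}{P} \ \text{(say)} \]
be the PPW objective at some scheduling interval. $f$ is said to be quasiconcave iff its domain (dom) and all its superlevel sets $S_\alpha = \{ s \in \operatorname{dom} f \mid f(s) \ge \alpha \}$ are convex [73]. Consider an ordinate $\alpha$, which intersects the function $f$ at two points $(s_1, f_1)$ and $(s_2, f_2)$. We need to prove that the function between these points is greater than or equal to $\alpha$, i.e.
\[ f(\lambda s_1 + (1-\lambda) s_2) \ge \alpha \tag{A.20} \]
where $\lambda \in [0,1]$. $\alpha$ can also be written as $\lambda \frac{S_1}{P_1} + (1-\lambda) \frac{S_2}{P_2}$, where $S_i = \sum_c w_{i,c} s_{i,c}$ is the weighted sum of the core speeds $s_i$ and $P_i = P_i^T(s,v,T)\,\mathbf{1}_{N \times 1}$. With this, (A.20) can be written as
\[ \frac{\lambda S_1 + (1-\lambda) S_2}{P(\lambda s_1 + (1-\lambda) s_2)} \ge \lambda \frac{S_1}{P_1} + (1-\lambda) \frac{S_2}{P_2}. \tag{A.21} \]
Let $P_\lambda = P(\lambda s_1 + (1-\lambda) s_2)$; then the above inequality simplifies to
\[ \lambda \frac{S_1}{P_1} (P_\lambda - P_1) \le (1-\lambda) \frac{S_2}{P_2} (P_2 - P_\lambda). \tag{A.22} \]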
Since $\frac{S_1}{P_1} = \frac{S_2}{P_2} = \alpha$, the above inequality reduces to
\[ P_\lambda \le \lambda P_1 + (1-\lambda) P_2. \tag{A.23} \]
The above inequality is precisely the convexity condition on $P$, so it holds whenever $P$ is convex. In Appendix A.2 we proved that $\hat{P}$ is a convex function of $s$. $P$ is the sum of the elements in $\hat{P} + G_T T$. Since $\hat{P}$ and $T$ are convex and $G_T$ is constant, $P$ is also a convex function of $s$. Hence (A.23) is true and $f$ is a quasiconcave function over the core speeds.
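The argument above can be illustrated numerically on a toy instance. The sketch below assumes a hypothetical per-core power model $a s_c^3 + b$, which is convex for $s_c \ge 0$, in place of the power model of (3.20), along with randomly drawn weights and speeds in $[0, 1]$. It counts how often the PPW at a convex combination of two speed vectors falls below the smaller of the two endpoint values; for a quasiconcave function this should never happen.

```python
import random

def ppw(s, w, a=1.0, b=0.5):
    """Toy PPW: weighted speed sum divided by a convex total power sum(a*s^3 + b)."""
    throughput = sum(wc * sc for wc, sc in zip(w, s))
    power = sum(a * sc ** 3 + b for sc in s)
    return throughput / power

def count_violations(trials=2000, n=4, seed=0, tol=1e-12):
    """Count random segments on which f(mid) < min(f(s1), f(s2))."""
    rng = random.Random(seed)
    w = [rng.uniform(0.5, 1.5) for _ in range(n)]   # hypothetical core weights
    violations = 0
    for _ in range(trials):
        s1 = [rng.random() for _ in range(n)]
        s2 = [rng.random() for _ in range(n)]
        lam = rng.random()
        mid = [lam * x + (1 - lam) * y for x, y in zip(s1, s2)]
        if ppw(mid, w) < min(ppw(s1, w), ppw(s2, w)) - tol:
            violations += 1
    return violations

print(count_violations())
```

A random search of this kind cannot prove quasiconcavity, but a nonzero count would contradict it, so it is a useful sanity check on the reasoning leading to (A.23).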
A.4 Proof of Quasiconvexity of Temperature w.r.t. Fan Speed
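Let
\[ g(s_{fan}) = \frac{h_2 s_{fan}^{h_3} + h_4}{h_1 s_{fan}}. \tag{A.24} \]
Then $R_{conv}(s_{fan}) = \left( h_1 s_{fan} \left( 1 - e^{-g(s_{fan})} \right) \right)^{-1}$ from (2.24). Differentiating $R_{conv}(s_{fan})$ w.r.t. $s_{fan}$,
\[ \frac{dR_{conv}(s_{fan})}{ds_{fan}} = -\frac{1}{R_{conv}(s_{fan})} \left[ \frac{1}{s_{fan}} + \frac{\dfrac{dg(s_{fan})}{ds_{fan}}}{e^{g(s_{fan})} - 1} \right]. \tag{A.25} \]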
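Differentiating again w.r.t. $s_{fan}$,
\begin{align}
\frac{d^2 R_{conv}(s_{fan})}{ds_{fan}^2} = \frac{1}{R_{conv}(s_{fan})} \Bigg[ & \left( \frac{dR_{conv}(s_{fan})}{ds_{fan}} \right)^2 \tag{A.26} \\
& + \frac{1}{s_{fan}^2} + \frac{e^{g(s_{fan})} \left( \dfrac{dg(s_{fan})}{ds_{fan}} \right)^2}{\left( e^{g(s_{fan})} - 1 \right)^2} - \frac{\dfrac{d^2 g(s_{fan})}{ds_{fan}^2}}{e^{g(s_{fan})} - 1} \Bigg]. \tag{A.27}
\end{align}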
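$\frac{dg(s_{fan})}{ds_{fan}}$ and $\frac{d^2 g(s_{fan})}{ds_{fan}^2}$ are given by
\[ \frac{dg(s_{fan})}{ds_{fan}} = \frac{h_1 h_2 (h_3 - 1) s_{fan}^{h_3} - h_4}{(h_1 s_{fan})^2} \tag{A.28} \]
\[ \frac{d^2 g(s_{fan})}{ds_{fan}^2} = \frac{h_1^2 h_2 h_3 (h_3 - 1)(h_3 - 2) s_{fan}^{h_3} + h_4}{(h_1 s_{fan})^3}. \tag{A.29} \]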
Note that $g(s_{fan}) = hA_e$ from (2.23), where $h$ is the heat transfer coefficient and $A_e$ is the effective area of the heat sink. We know that as $s_{fan}$ is increased, $h$ reduces. This can happen only if $h_3 \le 1$. This can also be verified by substituting the relation for $h$ (see Section 2.4). This ensures $\frac{d^2 g(s_{fan})}{ds_{fan}^2} > 0$. Hence $\frac{d^2 R_{conv}(s_{fan})}{ds_{fan}^2} > 0$, as $e^{g(s_{fan})} < 1$ due to the very high value of $h_1$ ($h_1 = mc_p$). Thus $R_{conv}$ is convex w.r.t. $s_{fan}$.
Note that $\frac{dR_{conv}(s_{fan})}{ds_{fan}} < 0$ as $\frac{dg(s_{fan})}{ds_{fan}} < 0$. Thus $R_{conv}$ monotonically decreases with $s_{fan}$. This also implies that $R_{conv}^{-1}$ monotonically increases with $s_{fan}$ and is hence quasiconvex in $s_{fan}$. Thus $G_{pkg}$ and $A$ in (2.25) and (2.26), respectively, are also quasiconvex in $s_{fan}$ and thereby, $E$ and $R$ in (3.21) and (3.21), respectively, are also quasiconvex in $s_{fan}$ (inverse and exponentials preserve quasiconvexity for monotonic functions). Therefore $T$ in (7.14) is a quasiconvex function of $s_{fan}$.
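The monotonic behaviour of $R_{conv}$ can be illustrated with a short numerical sketch built from (A.24) and the expression for $R_{conv}$ taken from (2.24). The constants `h1` to `h4` and the fan-speed range below are hypothetical values chosen only so that $h_1$ is large and $h_3 \le 1$; they are not the calibrated values of this work. The sketch checks that $R_{conv}$ decreases over the sampled range and that no interior sample exceeds the larger endpoint of any sub-interval, which is the sublevel-set form of quasiconvexity for a scalar function.

```python
import math

def g(s_fan, h1, h2, h3, h4):
    """g(s_fan) as in (A.24)."""
    return (h2 * s_fan ** h3 + h4) / (h1 * s_fan)

def r_conv(s_fan, h1=100.0, h2=5.0, h3=0.8, h4=1.0):
    """R_conv = 1 / (h1 * s_fan * (1 - exp(-g))); constants are hypothetical."""
    # -expm1(-x) equals 1 - exp(-x) and stays accurate for small x.
    return 1.0 / (h1 * s_fan * -math.expm1(-g(s_fan, h1, h2, h3, h4)))

def quasiconvex_on_samples(vals, tol=1e-12):
    """True if every sample between i and j is at most max(vals[i], vals[j])."""
    n = len(vals)
    return all(vals[k] <= max(vals[i], vals[j]) + tol
               for i in range(n) for j in range(i, n) for k in range(i, j + 1))

speeds = [0.5 + 0.1 * i for i in range(46)]        # hypothetical range 0.5 .. 5.0
values = [r_conv(s) for s in speeds]
is_decreasing = all(a > b for a, b in zip(values, values[1:]))
print(is_decreasing, quasiconvex_on_samples(values))
```

Any monotonic scalar function passes the second check, which is why monotonicity alone is enough for the quasiconvexity claim.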
A.5 Proof of Quasiconvexity of BER w.r.t. vm
A function $f$ is said to be quasiconvex iff its domain (dom) and all its sublevel sets $S_\alpha = \{ x \in \operatorname{dom} f \mid f(x) \le \alpha \}$ are convex [73]. Consider an ordinate $\alpha$, which intersects the function $f$ at two points $(x_1, f_1)$ and $(x_2, f_2)$. We need to prove that the function between these points is less than or equal to $\alpha$, i.e.
\[ f(\lambda x_1 + (1-\lambda) x_2) \le \alpha \tag{A.30} \]
where $\lambda \in [0,1]$. In other words, a quasiconvex function of a scalar variable is either monotonic, or it decreases to a single minimum value and increases monotonically from the minimum point.
$Q$ is defined in (8.5) and is the tail probability of a standard normal distribution. Hence as $\rho_i$ gets larger, the tail probability $Q(\rho_i)$ has to decrease monotonically. Hence it is a quasiconvex function [73] w.r.t. $\rho_i$. Next, we will show that $\rho_i$ is a convex function of $v_i^m$.
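The monotonic decrease of the tail probability can be seen from a short numerical sketch. It uses the standard identity $Q(\rho) = \frac{1}{2}\,\mathrm{erfc}(\rho / \sqrt{2})$ for the standard normal tail, on the assumption that (8.5) defines $Q$ in this usual way; the sample range of $\rho$ is hypothetical.

```python
import math

def q_function(rho):
    """Tail probability of the standard normal distribution, P(Z > rho)."""
    return 0.5 * math.erfc(rho / math.sqrt(2.0))

rhos = [-4.0 + 0.25 * i for i in range(33)]        # hypothetical range -4 .. 4
q_vals = [q_function(r) for r in rhos]

# A strictly decreasing scalar function has interval sublevel sets,
# so it is quasiconvex (and also quasiconcave).
q_is_decreasing = all(a > b for a, b in zip(q_vals, q_vals[1:]))
print(q_is_decreasing, q_function(0.0))
```

Because $Q$ is monotonic, every sublevel set $\{\rho \mid Q(\rho) \le \alpha\}$ is a half-line, which is convex.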
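Since $\rho_i$ depends linearly on $v_{t,i}^{min}$ (see (8.6)), it is sufficient to prove the convexity of $v_{t,i}^{min}$. Differentiating (8.22) w.r.t. $v_i^m$,
\[ \frac{dv_{t,i}^{min}}{dv_i^m} = -\frac{s_i^{min}(k)\,\max(T_i(k))^{1.19}}{1.2\,k_i^m} \left( v_i^m(k) - v_{t,i}^{min} \right)^{-0.2}. \tag{A.31} \]
Differentiating again,
\[ \frac{d^2 v_{t,i}^{min}}{d(v_i^m)^2} = -\frac{dv_{t,i}^{min}}{dv_i^m} \, \frac{s_i^{min}(k)\,\max(T_i(k))^{1.19}}{6\,k_i^m} \left( v_i^m(k) - v_{t,i}^{min} \right)^{-1.2}. \tag{A.32} \]
Since $\frac{dv_{t,i}^{min}}{dv_i^m} \le 0$, as $v_i^m(k) \ge v_{t,i}^{min}$, it follows that $\frac{d^2 v_{t,i}^{min}}{d(v_i^m)^2} \ge 0$. Hence $v_{t,i}^{min}$ is convex w.r.t. $v_i^m$. Hence $Q$ is a quasiconvex function of $v_i^m$.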
