








(B.E., Computer Science Engineering,
College of Engineering Guindy, Anna University)
A THESIS SUBMITTED
FOR THE DEGREE OF DOCTOR OF PHILOSOPHY
DEPARTMENT OF COMPUTER SCIENCE






List of Publications xii
List of Figures xiv
List of Tables xvii
1 Introduction 1
1.1 Overview of the Thesis . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.2 Thesis Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . 6
1.3 Thesis Outline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
2 Related Work 9
2.0.1 Heat Production & Removal in a Computing System . . . . 9
2.0.2 Techniques to Reduce On-Chip Temperature . . . . . . . . . 11
2.1 Micro-architectural and System Level Techniques . . . . . . . . . . 13
2.1.1 Comparison with Power Reduction Techniques . . . . . . . . 13
2.1.2 Taxonomy of Micro-Architectural and System Level Thermal
Management . . . . . . . . . . . . . . . . . . . . . . . . . . 14
2.1.3 Static Techniques . . . . . . . . . . . . . . . . . . . . . . . . 15
2.1.4 Runtime Techniques . . . . . . . . . . . . . . . . . . . . . . 17
3 Workload Characterization 21
3.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
3.1.1 Tool Chain for Workload Characterization . . . . . . . . . . 23
3.2 Application Thermal Behavior . . . . . . . . . . . . . . . . . . . . . 25
3.2.1 Thermal Behavior of Individual Applications . . . . . . . . . 26
3.2.2 Impact of Processor Configuration on Thermal Profile . . . . 31
3.3 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
4 Dynamic Thermal Management via Architecture Adaptation 39
4.1 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
4.1.1 Architecture Level Thermal Management . . . . . . . . . . . 41
4.1.2 Software Based Thermal Management . . . . . . . . . . . . 42
4.1.3 Architecture Adaptivity . . . . . . . . . . . . . . . . . . . . 43
4.2 Overview of Thermal Management Framework . . . . . . . . . . . . 43
4.3 Neural Network Classifier . . . . . . . . . . . . . . . . . . . . . . . 46
4.3.1 Classifier Architecture . . . . . . . . . . . . . . . . . . . . . 47
4.3.2 Training the Classifier . . . . . . . . . . . . . . . . . . . . . 49
4.3.3 Accuracy of the Classifier . . . . . . . . . . . . . . . . . . . 50
4.4 Performance Prediction Model . . . . . . . . . . . . . . . . . . . . . 51
4.5 Configuration Search Strategy . . . . . . . . . . . . . . . . . . . . . 57
4.6 Experimental Methodology and Results . . . . . . . . . . . . . . . . 62
4.6.1 Processor Model and Workloads . . . . . . . . . . . . . . . . 62
4.6.2 Dynamic Thermal Management Schemes . . . . . . . . . . . 63
4.6.3 Performance Comparison . . . . . . . . . . . . . . . . . . . . 63
4.6.4 Temperature Profiles and Throughput . . . . . . . . . . . . 64
4.6.5 Configuration Points for Adaptive DTM . . . . . . . . . . . 68
4.6.6 Impact of Inaccuracy in Classifier . . . . . . . . . . . . . . . 69
4.6.7 Impact of Individual Configuration Parameters . . . . . . . 70
4.7 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
5 Adaptive Thermal Management of Multi-Core Systems 72
5.1 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
5.1.1 Multi-core Thermal Management . . . . . . . . . . . . . . . 78
5.1.2 Power Management in Multi-Core Systems . . . . . . . . . . 79
5.2 Hybrid Thermal Management for Multi-Cores . . . . . . . . . . . . 80
5.2.1 Hybrid Thermal Management Architecture . . . . . . . . . . 81
5.3 Problem Formulation and Overview . . . . . . . . . . . . . . . . . . 82
5.3.1 Problem Formulation . . . . . . . . . . . . . . . . . . . . . . 82
5.3.2 Thermal Management Framework . . . . . . . . . . . . . . . 83
5.4 Local Configuration Search . . . . . . . . . . . . . . . . . . . . . . . 86
5.4.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86
5.4.2 Neural Network Classifier . . . . . . . . . . . . . . . . . . . 87
5.4.3 Configuration Search Algorithm . . . . . . . . . . . . . . . . 91
5.4.4 Overhead of the Algorithm . . . . . . . . . . . . . . . . . . . 94
5.5 Global Configuration Routine . . . . . . . . . . . . . . . . . . . . . 94
5.5.1 Inputs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
5.5.2 Operating Frequency . . . . . . . . . . . . . . . . . . . . . . 95
5.5.3 Core Coupling Factor . . . . . . . . . . . . . . . . . . . . . . 96
5.5.4 Final Configurations . . . . . . . . . . . . . . . . . . . . . . 96
5.5.5 Overheads and Scalability . . . . . . . . . . . . . . . . . . . 97
5.6 Experimental Settings and Results . . . . . . . . . . . . . . . . . . 97
5.6.1 Simulation Flow . . . . . . . . . . . . . . . . . . . . . . . . . 98
5.6.2 Benchmarks . . . . . . . . . . . . . . . . . . . . . . . . . . . 99
5.6.3 DTM Techniques . . . . . . . . . . . . . . . . . . . . . . . . 100
5.6.4 Throughput of Different DTM schemes . . . . . . . . . . . . 101
5.6.5 Weighted Performance . . . . . . . . . . . . . . . . . . . . . 104
5.6.6 Configurations Selected . . . . . . . . . . . . . . . . . . . . . 105
5.6.7 Impact of Backup Technique . . . . . . . . . . . . . . . . . . 107
5.7 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107
6 Task Sequencing for Thermal Management 108
6.1 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111
6.2 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 113
6.3 Task Sequencing . . . . . . . . . . . . . . . . . . . . . . . . . . . . 114
6.3.1 Thermal Profile of a Task Sequence . . . . . . . . . . . . . 115
6.3.2 Problem Formulation . . . . . . . . . . . . . . . . . . . . . . 118
6.3.3 Task Sequencing Algorithm . . . . . . . . . . . . . . . . . . 119
6.4 Sequencing & Voltage Scaling . . . . . . . . . . . . . . . . . . . . . 122
6.4.1 Problem Definition . . . . . . . . . . . . . . . . . . . . . . . 122
6.4.2 Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . 123
6.5 Optimal Voltage Scaling . . . . . . . . . . . . . . . . . . . . . . . . 125
6.6 Experimental Evaluation . . . . . . . . . . . . . . . . . . . . . . . . 129
6.6.1 Task Sequencing Algorithm . . . . . . . . . . . . . . . . . . 130
6.6.2 Voltage Scaling . . . . . . . . . . . . . . . . . . . . . . . . . 132
6.6.3 Sensitivity to Thermal Resistance . . . . . . . . . . . . . . . 133
6.6.4 Sensitivity to Slack Amount . . . . . . . . . . . . . . . . . . 134
6.7 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 135
7 Temperature Aware Dynamic Scheduling 136
7.1 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 138
7.1.1 General Purpose Scheduler Driven Thermal Management . . 138
7.1.2 Thermal Management Approaches for Hard Real Time Systems 139
7.1.3 Thermal Management for Media Applications . . . . . . . . 140
7.2 Temperature Aware Scheduling Framework and Thermal Model . . 141
7.2.1 Thermal Model . . . . . . . . . . . . . . . . . . . . . . . . . 142
7.3 Temperature Aware Scheduling . . . . . . . . . . . . . . . . . . . . 143
7.3.1 Thermal Adjustment Phase . . . . . . . . . . . . . . . . . . 145
7.3.2 Best Effort Scheduler . . . . . . . . . . . . . . . . . . . . . . 146
7.3.3 CPU Share between a Hot and Cold Task . . . . . . . . . . 147
7.4 Experimental Evaluation . . . . . . . . . . . . . . . . . . . . . . . . 149
7.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 153
8 Conclusion 154
8.1 Summary of the Thesis . . . . . . . . . . . . . . . . . . . . . . . . . 154
8.2 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 156
Abstract
Rising power density and on-chip temperature are seen as major hurdles
in sustaining processor performance improvement trends. Managing on-chip
temperature has become an important aspect at all levels of computer system de-
sign. In this thesis, we focus on micro-architecture and system level techniques to
manage temperature. Previously proposed approaches for thermal management
have revolved around developing efficient heuristics and control policies which at-
tempt to maximize the performance of the system while maintaining temperature
constraints. In contrast, we take a workload and processor configuration centric
approach to temperature management. We first characterize the thermal behavior
of a processor under variations in workload as well as variations in the hardware
configuration. Our characterization shows that the thermal behavior of the proces-
sor is highly sensitive to workload properties and hardware configuration. Armed
with this characterization, we propose thermal management approaches that (i)
alter the workload or (ii) alter the processor configuration to manage temperature.
In the first part of the thesis we present techniques that manage temperature
by adapting the configuration of the processor at runtime. We model the thermal
management problem as a hardware configuration search problem. Our framework
samples the performance counters to determine the characteristics of the workload
executing on the system and uses an online search algorithm to determine the
most appropriate thermally safe configuration for that workload. This framework
is simple to implement and provides better performance (8.1% better on average)
than the best known existing dynamic thermal management techniques. We
extend the framework to multi-core systems, where it provides better
performance (11.6% on average) than more complicated previously proposed
thermal management approaches for multi-cores.
In the second part of the thesis, we focus on techniques that alter the workload
executing on the processor to manage temperature. In a multi-tasking system, the
workload executing on the processor is determined by the scheduler, which allocates
the CPU to the different tasks in the system. We observe that the temperature
profile depends critically on (i) the order in which the different tasks in
the system are executed, and (ii) the relative shares of CPU time given to the dif-
ferent tasks. We propose two scheduling driven thermal management approaches.
The first approach reorders the tasks in the system to provide an optimal thermal
profile. The second approach adjusts the relative shares of processor time provided
to the different tasks to manage temperature.
Acknowledgements
First and foremost, I would like to thank my thesis advisor Dr. Tulika Mitra for
her encouragement and guidance. I have learnt a lot from her during the course
of my PhD. Despite her busy schedule, she has always made the time to listen to
us. Her passion for research, commitment and professional attitude have been very
inspiring. They set an ideal example for me to emulate throughout my professional
career.
I would also like to extend my gratitude to Dr Weng Fai Wong and Dr Teo Yong Meng
for their valuable suggestions and feedback as part of my dissertation committee.
I would also like to thank Dr Samarjit Chakraborty for his feedback. I would also
like to thank my undergraduate advisor Dr Ranjani Parthasarathy for introducing
me to computer systems.
During the course of my PhD I have had the opportunity to attend two internships.
Both of these have been great learning experiences. I would like to thank Sriram
from Google; Dr Anasua Bhowmik and Swamy Punyamurtula from AMD for these
opportunities. I would also like to thank my manager Dr Anasua Bhowmik for
giving me time off from work to present the thesis.
I would like to thank the National University of Singapore for supporting me with
various scholarships and fellowships. I would also like to thank the School of
Computing technical help desk and administrative staff for their support.
The embedded systems lab provided me with an ideal environment and eco-system
to pursue my research. I have had wonderful and really helpful friends in the lab.
Unmesh, Priya, Pan Yu, Hyuhn, Kathy, Linh, Nga, Swaroop, Eric, Achudhan,
Deepak, Balaji, Ankit, Zeghiou, Senthil and others: thanks for putting up with
me and helping me out. Despite being far away from home, I have never missed
it, thanks to my wonderful flatmates Eswar and Sivapriya, who have been so nice and
friendly. I have also made some really great friends during my stay in Singapore.
I would like to thank them for making my stay memorable and enjoyable.
Finally, I would like to acknowledge my family for being really supportive and
encouraging. I have been blessed with wonderful parents and a brother whose
confidence in me always keeps me going. My uncle, grandfather, grandmother
and the rest of the extended family have played a big role in my development and
education. It has always been their dream to see me finish higher education and
it is with their inspiration that I began this journey. Thanks to them for always
being there for me.
List of Publications
Some of the materials presented in this thesis have been published in conferences
and journals. The list is given below.
• Chapter 4: R.Jayaseelan and T.Mitra. Dynamic Thermal Management via
Architectural Adaptation. Design Automation Conference (DAC) 2009, July
2009.
• Chapter 5: R.Jayaseelan and T.Mitra. A Hybrid Local-Global Approach
for Multi-Core Thermal Management. International Conference on Computer-
Aided Design (ICCAD) 2009, Nov 2009.
• Chapter 6: R.Jayaseelan and T.Mitra. Temperature aware task sequencing
and voltage scaling. International Conference on Computer-Aided Design
(ICCAD) 2008, Nov 2008.
• Chapter 7: R.Jayaseelan and T.Mitra. Temperature Aware Scheduling for
Embedded Processors. International Conference on VLSI Design, January
2009.
• Chapter 7: R.Jayaseelan and T.Mitra. Temperature Aware Scheduling for
Embedded Processors. Invited paper, Special Issue on VLSI Design 2009, Journal
of Low Power Electronics, American Scientific Publishers, 5(3), October 2009.
Other Publications
• R.Jayaseelan, H.Liu and T.Mitra. Exploiting Forwarding to Improve Data
Bandwidth of Instruction-Set Extensions. Design Automation Conference
(DAC) 2006, July 2006.
• R.Jayaseelan, T.Mitra and X.Li. Estimating the Worst-Case Energy Con-
sumption of Embedded Software. Real-Time and Embedded Technology and
Applications Symposium (RTAS) 2006, April 2006.
List of Figures
2.1 Overview of previous approaches for thermal management . . . . . 15
3.1 Temperature effects of application/hardware interaction . . . . . . . 22
3.2 Tool-chain for workload characterization . . . . . . . . . . . . . . . 23
3.3 Temperature profiles for individual programs with initial tempera-
ture 40°C . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
3.4 Temperature profiles for individual programs with initial tempera-
ture 70°C . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
3.5 Temperature curves for two different task sequences of the same task
set . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
3.6 Temperature curves with different shares of execution time to hot
and cold task . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
3.7 Performance/temperature impact of different configuration param-
eters for crafty benchmark . . . . . . . . . . . . . . . . . . . . . . 34
3.8 Performance/temperature impact of applying multiple configuration
parameters simultaneously for crafty benchmark . . . . . . . . . . 36
4.1 Adaptive Architecture: The dotted components are adaptive. . . . . 43
4.2 Components of the Adaptive DTM Framework. . . . . . . . . . . . 44
4.3 Neural network classifier architecture. . . . . . . . . . . . . . . . . . 47
4.4 Accuracy of the neural network classifier. . . . . . . . . . . . . . . . 51
4.5 Accuracy of the Performance Prediction Model . . . . . . . . . . . . 57
4.6 Reduction of the configuration search space. . . . . . . . . . . . . . 59
4.7 Pruning of the configuration search space. . . . . . . . . . . . . . . 59
4.8 Performance comparison of different DTM schemes. . . . . . . . . . 64
4.9 Temperature profile for crafty . . . . . . . . . . . . . . . . . . . . 64
4.10 Temperature profile for gcc . . . . . . . . . . . . . . . . . . . . . . 65
4.11 Performance profile for crafty . . . . . . . . . . . . . . . . . . . . . 65
4.12 Performance profile for gcc . . . . . . . . . . . . . . . . . . . . . . . 65
4.13 Frequency profile for gcc . . . . . . . . . . . . . . . . . . . . . . . . 66
4.14 Frequency profile for crafty . . . . . . . . . . . . . . . . . . . . . . 66
4.15 IPC profile for gcc . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
4.16 IPC profile for crafty . . . . . . . . . . . . . . . . . . . . . . . . . 67
4.17 Impact of inaccuracy of the neural network classifier on performance. 69
4.18 Impact of Different Parameters on Performance . . . . . . . . . . . 70
5.1 Temperature profiles for a workload on multi-core (core 0: wupwise,
core 1: gcc, core 2: art, core 3: crafty). Thread to core mapping is
not applicable for migration. . . . . . . . . . . . . . . . . . . . . . . 74
5.2 Temperature profiles with adaptive DTM for wupwise, gcc, art and
crafty . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
5.3 Hybrid thermal management architecture. The dotted structures
are adaptive . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
5.4 Overview of our thermal management framework . . . . . . . . . . 83
5.5 Overview of local config search . . . . . . . . . . . . . . . . . . . . . 86
5.6 Neural network classifier . . . . . . . . . . . . . . . . . . . . . . . . 87
5.7 Accuracy of neural network classifier . . . . . . . . . . . . . . . . . 90
5.8 Overview of multi-core simulation . . . . . . . . . . . . . . . . . . . 98
5.9 Throughput of different DTM schemes for heterogeneous workloads . 102
5.10 Throughput of different DTM schemes for homogeneous workloads . 104
5.11 Weighted performance for DTM schemes . . . . . . . . . . . . . . . 104
6.1 Peak temperature for all possible task sequences. . . . . . . . . . . 109
6.2 Thermal profiles of voltage scaling and combined approach . . . . . 110
6.3 Thermal profile of a repeating sequence of tasks. . . . . . . . . . . . 115
6.4 Task sequencing algorithm. . . . . . . . . . . . . . . . . . . . . . . . 120
6.5 Accuracy of task sequencing Algorithm. . . . . . . . . . . . . . . . . 131
6.6 Advantage of combined sequencing and voltage scaling (seq+vs)
over voltage scaling alone. . . . . . . . . . . . . . . . . . . . . . . . 132
6.7 Impact of task sequencing on the choice of thermal resistance. . . . 133
6.8 Impact of slack amount on voltage scaling. . . . . . . . . . . . . . . 134
7.1 Temperature aware scheduling framework . . . . . . . . . . . . . . . 141
7.2 Temperature aware scheduling Policy . . . . . . . . . . . . . . . . . 144
7.3 CPU share between hot and cold tasks . . . . . . . . . . . . . . . . 147
7.4 Temperature profile for TAS . . . . . . . . . . . . . . . . . . . . . . 151
List of Tables
3.1 Benchmark Characteristics . . . . . . . . . . . . . . . . . . . . . . . 28
3.2 Parameters of the baseline processor. . . . . . . . . . . . . . . . . . 32
4.1 Frequently selected configuration points by adaptive DTM. . . . . . 68
5.1 Workloads used for evaluation. . . . . . . . . . . . . . . . . . . . . . 99
6.1 Representative task sets. . . . . . . . . . . . . . . . . . . . . . . . . 130
7.1 Composition of task sets . . . . . . . . . . . . . . . . . . . . . . . . 150
7.2 Throughput and fairness of thermal-aware scheduler (TAS) with
smin = 0, smin = 0.2 and DTM Schemes . . . . . . . . . . . . . . . . 150
Chapter 1
Introduction
The micro-processor industry is driven by Moore’s law, which states that the num-
ber of transistors on chip doubles once every eighteen months. This is achieved
through scaling down of the size of the transistors, thereby accommodating more
transistors within the same area [32]. With every generation of scaling, transis-
tors become smaller, dissipate less power, and switch at a faster rate. Thus when
a micro-processor design at a given technology is moved directly to a new tech-
nology, we get a faster (higher clock rate) chip dissipating nearly the same power.
However, when a new micro-processor is released, additional functionality is added
by making use of the available transistors. The additional functionality can be in
the form of bigger and better features (for example larger caches, more complex
pipelines and others) or additional cores. For example, the Intel Pentium 4 proces-
sor designed at 90nm technology uses approximately 74 million transistors, while
the Core 2 Duo processor designed at 65 nm technology uses approximately 191
million transistors. The additional functionality improves the performance of the
system but comes at the cost of more complex circuits resulting in increased power
consumption. Moreover, as a larger number of transistors is packed into the same
area, power density increases. Power density has been rising exponentially with
transistor scaling and is fast approaching the power densities seen in nuclear
reactors [77]. Control of rising power density is seen as one of the main challenges in
sustaining Moore’s law [77, 80].
Power dissipation occurs in the form of heat and hence increased power dissipation
results in rising on-chip temperatures [19]. On-chip temperatures exceeding certain
safety limits [77] can cause permanent physical damage to a chip. However, the
typical operating temperature of the chip is kept well below the physical safety limit
[38] because high on-chip temperatures can affect normal chip operation in the
following ways:
• Reliability: Failure mechanisms such as electro-migration are accelerated
with increasing operating temperature. Studies have shown that the mean
time to failure (MTTF) decreases exponentially with increase in operating
temperature [60, 64].
• Timing Violations: The timing of a circuit is highly sensitive to temperature,
as transistors switch more slowly at higher temperatures [89]. Hence the
operating frequency of a circuit must include margins for different on-chip
temperatures.
• Leakage Power and Thermal Runaway: Leakage power increases expo-
nentially with increase in temperature [58, 93]. There is a positive feedback
between temperature and leakage power. Increase in leakage power can in-
crease temperature, which in turn increases leakage. If this vicious cycle is
not controlled properly, then the rise in temperature can become unbounded
resulting in a thermal runaway.
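The positive feedback between temperature and leakage described in the last item can be illustrated with a small lumped thermal model. The following is a minimal sketch and not a model used in this thesis: it assumes a single thermal resistance and capacitance for the whole chip and an exponential leakage law, and every name and value in it (r_th, c_th, p_leak_ref, k_leak, the 150°C cut-off) is hypothetical and chosen only for illustration.

```python
import math


def simulate_temperature(p_dynamic, t_ambient=45.0, r_th=0.8, c_th=50.0,
                         p_leak_ref=5.0, k_leak=0.02, t_limit=150.0,
                         dt=0.1, duration=2000.0):
    """Euler-integrate a one-node thermal model with temperature-dependent leakage.

    All parameter values are hypothetical. Returns the final temperature in
    degrees C, or None if the temperature crosses t_limit (treated as runaway).
    """
    temp = t_ambient
    for _ in range(int(duration / dt)):
        # Assumed law: leakage grows exponentially with temperature.
        p_leak = p_leak_ref * math.exp(k_leak * (temp - t_ambient))
        # Heat removed through the package is proportional to the temperature rise.
        p_removed = (temp - t_ambient) / r_th
        temp += dt * (p_dynamic + p_leak - p_removed) / c_th
        if temp > t_limit:
            return None
    return temp


if __name__ == "__main__":
    for watts in (40.0, 120.0):
        result = simulate_temperature(watts)
        status = "runaway" if result is None else f"settles near {result:.1f} C"
        print(f"dynamic power {watts:5.1f} W: {status}")
```

With these hypothetical values, the lower power level reaches a steady temperature, while at the higher level the growth in leakage outpaces heat removal and the temperature keeps rising until the cut-off is reached.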
From the preceding discussion, it is clear that thermal limits are among the
most important constraints affecting the performance of modern microprocessors.
Hence, there is a need to control temperature at multiple levels of system design
and operation.
Heat removal and management have been an integral part of computer systems
design. Many commercial systems (starting from the 80486) have used
cooling assemblies such as heat sinks to keep the operating temperature under
control. In early processor generations, power dissipation and power density issues
were not very severe and, in general, heat removal from the package (using fans
and sinks) was sufficient for keeping temperature under control. However, power
density has been increasing in an exponential fashion [77] and recently power
density and thermal issues have become prominent in micro-processor design.
Advanced packaging and heat removal techniques alone cannot manage all temper-
ature related issues in modern processors. Moreover, the shrinking size of computer
systems (laptops, multiple processors together on a server rack, etc.) has placed
further stress on the effectiveness of heat removal. The ability of a package to
remove heat is expressed in terms of thermal design power (TDP). TDP refers
to the average power dissipation that the package can handle while keeping the
temperature under acceptable limits. High-performance processors require higher
TDP (and therefore more expensive) packages.
In addition to effective and efficient heat removal, reduction in heat dissipation is
also required. Effective heat reduction and thermal management techniques are
of critical importance and serve to bridge the gap between the high power den-
sity associated with high performance requirements and the limited heat removal
capacity of cost-effective packaging. Apart from just supplementing heat removal,
thermal management techniques are essential to keep the temperature of hot-spots
under control. Heat sinks, fans and other heat removal mechanisms are very effective
at reducing the average temperature of the chip. However, the temperature across a
chip surface is not uniform: there are a number of concentrated hot-spots (high
temperature points). Unlike heat removal techniques, which do not address hot-spots,
thermal management solutions have the advantage of being able to monitor and
control the temperature of the hot-spots. To summarize, thermal management
techniques are essential to (i) ensure that the temperature of the on-chip hot-spots
is under control and (ii) boost system performance under a given TDP package
by supplementing heat removal techniques.
A computer system has a number of layers of hardware and software interact-
ing with each other. Thermal management and heat reduction aspects can be
developed and explored at each individual layer. In this thesis, we focus on micro-
architecture and system-level approaches for thermal management. We propose
two micro-architectural and two system-level approaches for thermal management.
Our techniques are based on the observation that the temperature of a processor is
strongly dependent on the workload executing on the processor and the configura-
tion of the processor. Our techniques adapt either the workload or the processor
configuration to manage temperature. Next we present a brief overview of the
thermal management techniques presented in this thesis.
1.1 Overview of the Thesis
Traditional micro-architectural design examines the tradeoff between circuit com-
plexity and performance and the goal of micro-architecture design has been to
maximize performance while keeping circuit complexity under control [42]. With
power consumption also becoming an important issue, micro-architectural tech-
niques have focused on maximizing performance while staying within the power
budget. More recently, micro-architectural techniques have focussed on managing
temperature. The goal here is to maximize performance of the system while maintaining
temperature below a specified threshold [89]. At the system software level,
the goal is not only to maximize performance but also to satisfy a number of sys-
tem level requirements [52] while maintaining the temperature below the threshold.
System level requirements include real time deadlines, fairness and performance.
In this thesis we design a set of thermal management techniques that exploit ap-
plication and hardware heterogeneity for thermal management. We observe that
processor thermal behavior is highly sensitive to both the application character-
istics as well as processor configuration. Using these observations, we design two
classes of thermal management techniques. The first class of techniques exploits
hardware adaptivity to manage temperature. We observe that adapting multi-
ple processor parameters simultaneously is a very effective mechanism to manage
temperature. Based on this observation, we design a software based thermal man-
agement strategy that manages multiple adaptation parameters in the architecture.
We present our strategy for uniprocessors in Chapter 4 and extend it to multi-core
processors in Chapter 5. Our thermal management strategy outperforms existing
thermal management techniques for both uniprocessor and multi-core systems.
The second class of techniques we present in this thesis exploits heterogeneity in the
thermal characteristics of applications for thermal management in multi-tasking
systems. We observe that given a set of applications that execute concurrently in
a multi-tasking system, the resulting thermal profile is highly dependent on the
order of execution of the different tasks in the system and the relative share of
CPU time provided to the different (hot and cold) tasks in the system. We exploit
these observations to design two different system level thermal management tech-
niques. The first technique is designed in the context of a simple non-preemptive
scheduler and uses task reordering to manage temperature (presented in Chap-
ter 6). The second technique is applicable in the context of preemptive schedulers
and adjusts the relative execution times provided to the different tasks (hot and
cold) to manage temperature (presented in Chapter 7). Our system-level thermal
management schemes manage to keep the temperature below the threshold while
satisfying a set of system level requirements such as real time constraints, fairness
and performance.
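The observation that the thermal profile depends on the order of execution can be illustrated with a first-order thermal model. The sketch below is not the task sequencing algorithm of Chapter 6: it simply enumerates every order of a small task set and compares the resulting peak temperatures. The single-node RC model, the single pass through the task set, the task names and all power, duration and thermal parameter values are assumptions made purely for illustration.

```python
import math
from itertools import permutations


def peak_temperature(order, tasks, t_start=40.0, t_ambient=40.0,
                     r_th=1.0, c_th=30.0):
    """Peak temperature of one pass through `order` under a one-node RC model."""
    temp = peak = t_start
    for name in order:
        power, duration = tasks[name]
        t_steady = t_ambient + r_th * power          # temperature the task tends to
        decay = math.exp(-duration / (r_th * c_th))  # first-order RC response
        temp = t_steady + (temp - t_steady) * decay
        # Temperature is monotonic within a task, so peaks occur at task boundaries.
        peak = max(peak, temp)
    return peak


# Hypothetical task set: name -> (average power in W, run time in s).
TASKS = {"hot_a": (60.0, 20.0), "hot_b": (50.0, 20.0),
         "cold_a": (10.0, 20.0), "cold_b": (15.0, 20.0)}


def rank_orders(tasks):
    """Return (peak, order) pairs for every task order, coolest first."""
    return sorted((peak_temperature(o, tasks), o) for o in permutations(tasks))


if __name__ == "__main__":
    ranked = rank_orders(TASKS)
    print("coolest order:", ranked[0])
    print("hottest order:", ranked[-1])
```

Exhaustive enumeration is only practical for very small task sets; it is used here just to show that the same tasks, run for the same total time, can produce noticeably different peak temperatures.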
1.2 Thesis Contributions
With modern computer systems being severely constrained by rising on-chip tem-
perature, thermal management solutions have become a central aspect of computer
system design. The goal of any thermal management solution is to keep the tem-
perature of the system within a specific threshold without compromising on perfor-
mance and other requirements. At a very high level, thermal management solutions
try to arrive at the best system performance-temperature tradeoff either at design
time or dynamically at runtime. Among the different parts of a computer system,
the micro-processor is the hottest and so a large body of work has focussed on ther-
mal management solutions for micro-processors. Previously proposed solutions for
thermal management have revolved around the appropriate design of control hard-
ware or choice of heuristics that provide good performance-temperature tradeoff.
For instance, dynamic voltage and frequency scaling (DVFS) based techniques try
to determine the most appropriate voltage and frequency setting for the processor
such that temperature constraints are met.
In contrast to existing heuristic or controller based solutions, we propose workload
centric approaches for thermal management. We observe that the thermal behav-
ior of a micro-processor is highly sensitive to both the application executing on the
processor as well as the processor configuration. We characterize the sensitivity of
thermal behavior to application characteristics and hardware configuration, and
exploit these characteristics to design new thermal management solutions. Our
thermal management solutions (i) have better performance under the same temperature
constraints and (ii) are easier to configure and implement than previously
proposed solutions. Moreover, our solutions also explore previously unexplored as-
pects of temperature/system performance tradeoffs.
We present two software driven approaches and two hybrid approaches for thermal
management in this thesis. The software driven approaches exploit the variability
among the thermal profiles of different applications in a multi-tasking system. The
first approach tries to determine the thermally optimal execution ordering of
tasks and is applicable in the context of a non-preemptive
multi-tasking system. Without any loss in performance, our technique can reduce
the peak temperature of the system by 4.09°C (5.8% reduction on average).
The second approach determines the optimal shares of execution time among the
different tasks of the system such that temperature constraints are satisfied. This
technique can handle both soft real time and best effort tasks and provides 4.3%
better performance on average than more complicated hardware based
mechanisms [35].
Our hybrid solutions employ a combination of hardware and software for thermal
management. The hardware provides multiple thermal management knobs that are
controlled in software. Unlike previously proposed solutions that employ hardware
feedback controllers, we recast the thermal management problem as a hardware
configuration search problem. We design a highly efficient software based dynamic
thermal management framework that provides 8.8% better performance than the
best performing previously proposed thermal management solution. We also re-
design this framework for multi-core systems and our multi-core solution has 12%
better performance than the best performing previously proposed approach.
1.3 Thesis Outline
In the next chapter, we present an overview of previously proposed approaches for
thermal management. In Chapter 3, we characterize the thermal behavior of pro-
cessors when executing different applications and under different configurations.
The observations from this chapter motivate the thermal management techniques
presented in the subsequent chapters. We observe that the temperature profile is
highly sensitive to heterogeneity in architecture and applications. In Chapter 4, we
present our thermal management technique that adapts multiple architectural pa-
rameters exploiting the sensitivity of thermal behavior to architectural parameters.
We extend this technique to multi-core processors in Chapter 5.
In our workload characterization, we also observe that temperature is highly sensi-
tive to application heterogeneity. We exploit application heterogeneity for thermal
management in Chapters 6 and 7. In Chapter 6 we present an approach that
uses task reordering to manage temperature of a multi-tasking system. Chapter 7
presents an approach that adjusts the relative execution shares given to hot and
cold tasks for thermal management. Chapter 8 concludes the thesis and presents
possible directions for future work.
Chapter 2
Related Work
In this chapter we present a general overview of previously proposed thermal man-
agement approaches. A more detailed description of related work associated with
each of the proposed techniques is done as we introduce the techniques in the sub-
sequent chapters. With temperature issues becoming one of the key performance
limiters in modern micro-processors, there has been an increasing focus on thermal
aware design and thermal management. Broadly, the temperature control tech-
niques can be classified into two categories, namely, techniques that improve heat
removal and techniques that reduce the heat production in the processor. Before
we present an overview of temperature management and heat removal techniques,
we present a brief account of how heat is produced and removed in a processor.
2.0.1 Heat Production & Removal in a Computing System
A typical computer system consists of one or more applications executing on a
micro-processor. An application consists of a stream of instructions and each
instruction encodes a specific sequence of activities on the different units of the
processor. For instance, a load instruction encodes access to the data memory, a
multiply instruction encodes usage of the multiplier and so on.
A micro-processor can be described at various levels of design. At the micro-
architecture level it consists of a set of units. The units of the processor are of
three major types: (i) Storage structures (e.g., register files, caches, etc.) that are
meant for storing instructions, data and temporary values, (ii) logic structures or
functional units (adder, multiplier, etc.) that perform the actual computation, and
(iii) control structures that coordinate the movement of both instructions and data.
Each unit is made up of a number of building blocks such as logic gates, flip-flops,
storage cells and others. At the gate and circuit level, a micro-architectural unit is
expressed in terms of the constituent building blocks and their implementation. At
the lowest level, a micro-processor is expressed in terms of a number of transistors
interconnected by a number of wires.
When an application executes on a processor, a stream of instructions is fetched
from storage (cache) and decoded, the operations encoded by each instruction are
performed, and finally the results are written back to storage. During its lifetime, an
instruction uses one or more units of the processor. At the circuit or transistor
level, usage of a unit translates to switching the states of the transistors that
form the unit. Transistor switching involves power dissipation. Similarly during
an instruction execution, signals are driven through wires connecting units and this
process also dissipates power. In addition, keeping a transistor at a particular state
(even without switching) involves some power dissipation known as leakage power.
The power is dissipated as heat and this results in an increase in temperature. To
keep the processor temperature under control, the heat dissipated in the processor
must be removed by an appropriate heat removal technique. Next we discuss a
typical heat removal mechanism found in a high performance processor.
The heat removal package depends strongly on the environment in which the pro-
cessor is deployed. Heat removal is less efficient or absent in embedded and mobile
systems [27]. In desktops and servers, the package typically consists of a spreader,
a thermal sink and some assembly to cool the sink. The spreader is attached to
the silicon die through a thermal interface material. The heat dissipated in the sil-
icon die is transferred to the heat spreader through the thermal interface material
and from the spreader to the sink. The sink loses heat to the ambient. Typical
packages include cooling mechanisms such as fans to aid the heat transfer between
the sink and the ambient [83].
The temperature of the chip surface depends on the difference between the rate
at which heat is dissipated in the chip and the rate at which the heat removal
system is capable of removing it. Modern processors have power densities that
are challenging for the heat removal systems [83]. Hence, in addition to improved
heat removal, reduction in heat dissipation is also necessary. Next we provide an
overview of existing techniques for heat reduction and removal.
2.0.2 Techniques to Reduce On-Chip Temperature
In the first part of this section we review the packaging and circuit level techniques
for thermal management followed by micro-architectural and software based tech-
niques.
Package Level Techniques
Package level techniques are the first line of defence against increasing on-chip
temperature. Improved packaging techniques include improved sinks, spreaders
and better design for improved airflow [9, 99]. More exotic techniques such as
water cooled packages [4] are employed for over-clocking, but are too expensive
to be employed in mainstream production. As packaging becomes more complex,
packaging costs have been increasing steadily [83]. Another major challenge in
designing heat removal packages for micro-processors is that the temperature on
the surface of the silicon die is not uniform. The temperature of different units
of the processor can vary by up to 15°C (92°C − 77°C) and packaging needs to be
designed to keep the temperature of the hottest unit of the chip below acceptable
limits. For the above mentioned reasons, it is no longer possible to design packaging
that is sufficient for worst case power-dissipation [77]. Hence, packaging is designed
for average case power consumption and failsafe hardware mechanisms that can
control heat production are employed to manage on-chip temperature.
Circuit and Implementation Level Techniques
At the circuit level, temperature reduction mechanisms are required for two main
reasons. First, there is a strong dependence between temperature and the max-
imum operating frequency of the circuit [77]. Secondly, transistor variability in-
creases with increase in operating temperature [20]. At the circuit level, the trade-
off is made between performance and power density and is achieved by choosing
an appropriate implementation option for a given functionality. For instance, an adder
can be implemented as a ripple carry adder [75], carry-look-ahead adder [75] and
so on. Ripple carry adders typically take a longer time than carry-look-ahead
adders, but have lower power density. More complex optimizations include tran-
sistor sizing to manage power density [18], selective threshold and supply voltage
control [15] and others.
2.1 Micro-architectural and System Level Techniques
In this thesis we present micro-architecture and software level techniques to manage
temperature. A good starting point to reduce the temperature of a micro-processor
is to reduce the power consumption. Power-aware micro-architectural and software
development has been an actively researched area in the last decade. In the first
part of this section we show the need for micro-architecture and software based
thermal management techniques in addition to power management techniques.
Then we provide an overview of micro-architecture and software based temperature
management techniques.
2.1.1 Comparison with Power Reduction Techniques
Power reduction techniques do not generally suffice to manage temperature. While
reducing power can be a good starting point to reduce temperature, a separate class
of techniques is necessary to manage temperature [89]. Similarly, techniques that
optimize for energy delay product do not directly optimize for temperature. This
is because temperature is a time varying quantity and the goal of thermal manage-
ment is to maintain temperature below the threshold at all times during execution.
Energy delay product, on the other hand, summarizes the energy efficiency of the
system over a significant period of operation. For this reason, energy delay
product does not have any direct correlation with the temperature profile of the
system [89]. The main reasons that dictate the need for architecture-level thermal
management are the following:
• Chip Wide versus Localized Management: The goal of power manage-
ment techniques is to reduce the total power of the entire chip, while temper-
ature management attempts to reduce the temperature of the hottest unit of
the chip. Depending on the workload executing, the temperature difference
between individual units on chip can be as high as 15°C.
• Difference in Power versus Temperature Distribution: In a micro-
processor, caches are the largest power consumers while the execution core
(integer register file + functional units) and branch predictor are the hottest
units [89]. Hence techniques that reduce power consumption and temperature
must target different units.
• Instantaneous Power versus Temperature: Instantaneous power is a
poor indicator of temperature [89]. Hence techniques that try to reduce
instantaneous power consumption may not directly reduce temperature. This
is because temperature changes occur slowly and reflect sustained changes
in power dissipation over a large window of time. Moreover, temperature at
a hot-spot is dependent on unit-wise power distribution while instantaneous
power reduction techniques target total power.
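The third point can be illustrated with a single-node thermal RC model, which is a strong simplification of the RC networks used by thermal simulators. The thermal resistance, capacitance, power levels and time step below are hypothetical values chosen for illustration only; the sketch merely shows that a brief power spike moves temperature far less than a sustained increase of the same magnitude.

```python
# Minimal sketch: single-node thermal RC model,
#   dT/dt = (P - (T - T_amb) / R) / C
# All parameter values are hypothetical and for illustration only.

def simulate(power_trace, r_th=1.0, c_th=50.0, t_amb=40.0, dt=0.01):
    """Return the temperature after each step for a per-step power trace (W)."""
    temp = t_amb
    temps = []
    for p in power_trace:
        # Forward-Euler update of the RC equation.
        temp += dt * (p - (temp - t_amb) / r_th) / c_th
        temps.append(temp)
    return temps

base = [20.0] * 20000                      # 200 s at 20 W
spike = list(base)
spike[10000:10100] = [40.0] * 100          # 1 s burst at 40 W
sustained = base[:10000] + [40.0] * 10000  # 100 s at 40 W

t_base = simulate(base)
# Largest temperature increase relative to the baseline trace.
rise_spike = max(s - b for s, b in zip(simulate(spike), t_base))
rise_sustained = max(s - b for s, b in zip(simulate(sustained), t_base))
print(f"largest rise from 1 s spike:      {rise_spike:.2f} C")
print(f"largest rise from sustained load: {rise_sustained:.2f} C")
```

With a thermal time constant (R times C) of 50 s in this sketch, the one-second burst is almost entirely filtered out, whereas the sustained load dominates the temperature.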
Next we present an overview of architecture and software level thermal manage-
ment techniques.
2.1.2 Taxonomy of Micro-Architectural and System Level
Thermal Management
Figure 2.1 presents an outline of software and architecture level temperature control
techniques. Temperature control techniques can be classified into static/design
time techniques and dynamic/runtime techniques. Static techniques are again
classified into software based and hardware based techniques, while dynamic
techniques comprise hardware based techniques, software based techniques and
hybrid techniques. Next we discuss each of these classes in detail.

Figure 2.1: Overview of previous approaches for thermal management
2.1.3 Static Techniques
Thermal awareness can be built into the system at design time either in hardware or
software. Static software based techniques are mostly applicable in the embedded
domain where the functionality of the system is known in advance and the design
of system software is tightly coupled with hardware design. Another class of static
techniques is micro-architectural design space exploration where the hardware
parameters are selected by including thermal safety as one of the key requirements.
We discuss each one of these classes of techniques in detail.
Software Based Static Techniques
Static thermal management approaches fit naturally in the embedded space as
the workload to be executed on the system is known in advance. Embedded sys-
tems are often designed under strict constraints on area, power, performance and
cost. Many of these systems are designed to satisfy real time constraints. Static
thermal management approaches for embedded systems generally involve choosing
system parameters or scheduling tasks such that both the temperature constraint
and other non-functional constraints such as real time deadlines, performance and
power are satisfied. Wang et al. [96, 97] examine the impact of employing volt-
age scaling for thermal management in hard real time systems. They show that
satisfying temperature and real time constraints can be mutually conflicting goals
and derive the conditions under which both constraints can be satisfied. Zhang
et al. [104] derive the optimal voltage scaling policy for a set of embedded tasks
such that the performance is maximized and the temperature constraints are satis-
fied. Rao et al. [81] derive the optimal processor throttling policy that maximizes
performance while maintaining temperature for a given workload.
Modern embedded systems are designed as systems on chip (SoCs), which include
a number of heterogeneous cores on the same die. Many static thermal manage-
ment approaches have been proposed to control temperature in MPSoCs. These
approaches include task assignment [25, 47], scheduling [25, 29, 30, 31], and volt-
age assignment [59, 67, 68, 69] to manage temperature. Other static approaches
include compile time approaches to manage temperature such as optimizing regis-
ter assignment [105], temperature-aware loop parallelization [71, 72] and functional
unit assignment [70].
Hardware Based Static Techniques
The second class of design-time techniques are used in micro-architectural design
space exploration. Micro-architectural design space exploration for general purpose
processors is a complex process that attempts to choose suitable configurations for
the micro-architectural structures (cache sizes, number of functional units, etc.) to
satisfy a number of conflicting goals such as performance, power, circuit complexity
and cost [42]. Researchers have examined a limited part of this design space for
temperature reduction. Karthick et al. [51] and Nookla et al. [74] evaluate the
performance and temperature impact of different micro-processor floorplans.
In multi-core processors, there is a tradeoff between employing a large number of
simple cores or a smaller number of complex cores. Monchireo et al. [65], Li et
al. [101] and Huang et al. [46] examine the design space comprising a number
of multi-core parameters such as number of cores, size of L2 cache and complexity
of the cores from a temperature and performance perspective. They show that for
high throughput and non-memory bound workloads it is better to employ a large
number of simple cores with smaller caches, whereas for memory bound workloads
a limited number of cores with larger caches is the optimal design choice.
2.1.4 Runtime Techniques
Static or design time techniques are very effective for thermal management when
the workload for the system is known beforehand (such as in static embedded
systems) or for optimal average performance across a range of workloads. Dynamic
techniques, on the other hand, can control the temperature at runtime depending
on operating conditions of the processor and the workload. Most recent micro-
processors have on-chip temperature sensors that can be read by both hardware
and software [1, 5, 6, 84]. Dynamic techniques leverage these on-chip sensors
and control temperature based on the sensor readings and are commonly referred
to as dynamic thermal management (DTM) techniques. Dynamic techniques can
be classified into hardware techniques and software techniques.
Hardware Based Techniques
Hardware based DTM techniques are based on the following general line of opera-
tion. A thermal management controller continuously samples the on-chip tempera-
ture sensors of the processor at a fixed sampling interval. When the temperature of
the processor exceeds the threshold temperature, suitable mechanisms are invoked
to manage the temperature. Different thermal management techniques differ in
the mechanisms which they use to control the temperature when the threshold
temperature is reached. Every mechanism to reduce the temperature of the chip
entails a performance loss. The key challenge in the design of these schemes is to
keep the temperature under the threshold while minimizing performance impact.
Hardware based dynamic thermal management approaches include fetch throt-
tling [88], dynamic voltage scaling [66], clock gating [21], activity migration [43],
cluster assignment [26] and functional unit balancing [79]. The execution core (reg-
ister files, execution units, issue logic, etc.) of the processor is the hottest portion
of the chip [89]. These techniques are designed to reduce the power dissipation in
the execution core either directly or indirectly. Fetch throttling [88] reduces tem-
perature by stalling instruction fetch for a fixed number of cycles once the temperature hits the
threshold. Stalling the fetch unit periodically lowers the number of instructions
delivered to the back end of the pipeline. This lowers the utilization and so the
power consumption of the execution core. In clock gating, the entire processor is
gated for one sampling interval when the temperature exceeds the threshold [21].
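The general line of operation described above (sample the sensor, compare against the threshold, invoke a mechanism) can be sketched as a small simulation of clock gating. This is a minimal sketch, not the controller of any cited scheme: the first-order exponential thermal model, the steady-state temperatures for the active and gated modes, the rate constant and the sampling interval are all hypothetical stand-ins for real hardware.

```python
import math

# Minimal sketch of a threshold-triggered DTM loop with clock gating.
# The thermal model and all numbers are hypothetical stand-ins for real sensors.

def run_dtm(steps, threshold=85.0, t_active=95.0, t_gated=40.0,
            k=0.005, interval=10.0, t_start=60.0):
    """Simulate `steps` sampling intervals; return (temperature trace, gated count)."""
    decay = math.exp(-k * interval)
    temp = t_start
    trace = []
    gated_intervals = 0
    for _ in range(steps):
        # Sample the (simulated) sensor and decide the mode for the next interval.
        gate = temp > threshold
        if gate:
            gated_intervals += 1
        # Temperature relaxes exponentially toward the steady state of that mode.
        target = t_gated if gate else t_active
        temp = target - (target - temp) * decay
        trace.append(temp)
    return trace, gated_intervals

trace, gated = run_dtm(steps=500)
print(f"peak temperature: {max(trace):.2f} C")
print(f"intervals gated:  {gated} of {len(trace)}")
```

The fraction of gated intervals is a rough proxy for the performance loss of the mechanism, and the overshoot above the threshold is bounded by how far the temperature can rise within one sampling interval.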
Another class of dynamic thermal management techniques employs two copies of
hot processor units and swap usage when temperature hits the threshold. Heo et
al. [43] use two copies of the register file and control temperature by balancing the
utilization of the copies. Chaparro et al. [26] employ a clustered execution core
each with its own functional units and register file. The temperature difference
between the clusters is used to guide the issuing of instructions into the different
clusters. Powell et al. [79] use the functional units in the execution core in a round
robin fashion to balance utilization and thereby the temperature.
Software Based Techniques
Many researchers have explored software and system-level techniques for thermal
management. Such techniques leverage software-visible on-chip temperature
sensors. One of the commonly employed software level techniques is to dynamically
characterize the different tasks in the system in terms of their thermal behavior and
context switch to a cold task when the temperature exceeds the threshold [44, 52].
Software based thermal management policies have been proposed for simultane-
ously multi-threaded (SMT) and multi-core architectures. In SMT architectures,
a combination of multiple threads execute on the system simultaneously [41, 98].
The fetch policy chooses the thread from which the instructions are fetched every
cycle. Changing the fetch policy affects the relative number of instructions from
each thread that are active in the pipeline and hence impacts the thermal behavior.
By appropriately altering the fetch policy in an SMT processor, the temperature
profile of the system can be controlled [41, 98].
Multi-core processors have a number of physical cores integrated in the same pro-
cessor package. Given a multi-programmed workload, the temperature of each
core in the system depends on the mapping of the threads to the different cores.
In multi-core systems, dynamically changing the mapping of threads to cores has
an impact on temperature. Periodically migrating hot threads away from over-
heated cores helps balance the temperature. A number of migration driven ther-
mal management schemes have been proposed in the context of multi-core sys-
tems [35, 36, 63]. Different migration schemes for thermal management differ in
the policy used for migration. The policies that have been explored are migration
based on the difference in temperature between cores [36], number of times each
core hit the threshold [63] in a given interval, rate at which temperature rises in
each of the cores [35] and others.
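As an illustration of the first kind of policy, the following minimal sketch picks a migration based on the temperature difference between cores. It is not the algorithm of any cited scheme: the function name, the core identifiers, the thread names and the 5°C margin are hypothetical.

```python
# Minimal sketch of a temperature-difference migration policy.
# Core ids, thread names and the 5 C margin are hypothetical.

def pick_migration(core_temps, core_threads, margin=5.0):
    """Return (thread, src_core, dst_core) to migrate, or None if balanced."""
    hottest = max(core_temps, key=core_temps.get)
    coldest = min(core_temps, key=core_temps.get)
    if core_temps[hottest] - core_temps[coldest] <= margin:
        return None
    # Propose moving the thread on the hottest core to the coldest core.
    return core_threads[hottest], hottest, coldest

temps = {0: 88.0, 1: 71.5, 2: 79.0, 3: 64.0}
threads = {0: "crc", 1: "adpcm", 2: "gsm", 3: "patricia"}
print(pick_migration(temps, threads))
```

A policy based on threshold hits or on the rate of temperature rise would keep the same structure and replace only the quantity used to rank the cores.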
Hybrid Techniques
Hybrid techniques use a combination of hardware and software for thermal man-
agement. Srinivasan et al. [91] propose a hybrid thermal management solution
specialized for multi-media workloads. This scheme exploits the frame based na-
ture of multi-media workloads. Processor parameters such as supply voltage and
architecture configuration can be adjusted in software. Extensive off-line profiling
is done to determine the highest performing and thermally safe setting for each
type of frame of the multi-media workload and the setting is stored. When decod-
ing for a new frame of a particular type is started, the processor parameters are
changed to the prior determined setting for that particular frame type.
In this thesis we propose two hybrid thermal management schemes. Unlike the
above mentioned approach, which is specialized for media applications, our tech-

Chapter 3
Workload Characterization
A typical computing system consists of a set of applications executing on hard-
ware and interacting with one or more components of the hardware to produce
results. Central among the hardware components of the computing system
is the micro-processor, which executes the instruction stream from the applica-
tion. The interaction between the instruction stream from the application and
the processor results in power and heat dissipation. Thus, the heat dissipation
when a specific program executes on the processor depends both on the nature
of the application (program) and the processor. In this chapter, we characterize
the temperature effects when (i) applications with different characteristics execute
on a given processor, and (ii) a single application executes on different processor
configurations.
3.1 Overview
Figure 3.1: Temperature effects of application/hardware interaction (labels in the figure: App1 to App4; 85-100°C, 70-80°C, 60-70°C, 50-60°C; C1 to C4)

When an application executes on a processor, it utilizes different units of the
processor for computation and storage. When a particular unit of the processor is
used, power is dissipated in the unit. The amount of power dissipated in a specific
unit of the processor over a time window depends on the utilization of the unit
in that time window. The power dissipated in the different units clearly depends
on the instruction stream, i.e., the application. This is illustrated in Figure 3.1.
The frequency and type of usage of the components of the processor over a time
interval is called the activity factor. In addition to the activity factor, the power
dissipation also depends on the configuration of the underlying hardware. Thus the
combination of the activity factor (application characteristics) and the hardware
configuration produces a certain pattern of power dissipation. The power consumed
in the processor is dissipated as heat and hence the power dissipation pattern
determines the temperature of the processor.
In this chapter we characterize the thermal impact of both the application and the
hardware characteristics. First, we characterize the thermal behavior of different
applications executing on a fixed processor architecture. We study the variability
of the thermal behavior among different applications individually and also study
the thermal profiles when a combination of such applications execute together in
a multi-tasking scenario. In the second case, we assume hardware flexibility and
examine the impact of varying processor configurations on the thermal profile of
a single long running application. Before we present the characteriza-
tion of the impact of the application and the hardware configuration on the temperature profile, we
present a brief overview of the tool chain used to perform this characterization.

Figure 3.2: Tool-chain for workload characterization
3.1.1 Tool Chain for Workload Characterization
Figure 3.2 shows a description of the tool chain used in our workload character-
ization and in the rest of this thesis. The application is compiled for the
SimpleScalar PISA [23] architecture with the maximum optimization level. The
compiled application binary is executed with the appropriate inputs on the
SimpleScalar [23] processor model. In addition to the processor model, the simulator takes
the appropriate processor configuration as input. As described later in the thesis,
we use two different processor configurations in our evaluation. A general-purpose
processor configuration chosen to closely resemble the Alpha 21264 processor [94]
is used in the techniques targeting general-purpose systems. An embedded pro-
cessor model that closely resembles the ARM Cortex A8 [2] embedded processor is
chosen in the techniques for embedded systems.
When the application binary executes on the processor model that reflects a given
configuration, two important types of statistics are collected (a) the performance
statistics that reflect the execution time of the benchmark, and (b) activity factors
for different units of the processor. The activity factors are presented as input to
the Wattch [22] power simulator that computes the unit-wise power consumption.
The unit-wise power consumption is provided as input to the HotSpot [89] thermal
simulator. The thermal simulator takes the power consumption and the pro-
cessor floorplan as input and provides the dynamic thermal profile as output. This
top level view of the tool chain does not include our modeling of leakage power,
which is dependent on the temperature. For leakage power modeling, we use the
HotLeakage [57] leakage models. We resolve the cyclic dependence of leakage power
and temperature in our simulation using a piecewise linear approximation in the
leakage models as discussed in [81].
We apply the tool-chain to understand and study the thermal impact of appli-
cation and hardware properties. The findings from this study are used to design
thermal management solutions that exploit application properties and hardware
adaptivity for thermal management. Thermal management solutions that exploit
application properties are suitable in the context of embedded systems. This is
because embedded systems have a well defined functionality and the set of pos-
sible applications to be executed on the system is known a priori. Such systems
employ simple processors and are cost, power and area constrained. Hence simple
and easy to implement software driven approaches for thermal management are
very attractive for these systems. We first study the thermal characteristics of em-
bedded applications executing on an Arm Cortex A8 [2] like embedded processor
model and use the results to design software based thermal management solutions
in Chapters 6 and 7.
3.2 Application Thermal Behavior
In this section we characterize the thermal behavior of a number of programs from
the MiBench [39], MediaBench [53], and EEMBC [55] benchmark suites. We ob-
serve the thermal profile of these programs when they execute on an embedded
processor model similar to the Arm Cortex A8 [2] processor. The main aim of
this characterization is to (a) determine the variability in the temperature profiles
among the different embedded applications, and (b) understand the thermal im-
pacts when a combination of such programs execute in a multi-tasking scenario.
We first begin with a description of the processor configuration and the experi-
mental setup used for this characterization.
Processor Configuration and Settings
We simulate the processor in detail using a cycle-accurate processor simulation
model. The simulation model is a modified version of the SimpleScalar [23] simulator.
The processor configuration is chosen to closely resemble an Arm Cortex A8 [2]
processor. We simulate a six-stage in-order processor pipeline with 32KB instruc-
tion and data caches and a 512-entry branch prediction buffer. Like the Cortex A8
pipeline, our model can issue up to two integer instructions per cycle. For power
modeling, we use the configurable power models from Wattch [22], a commonly
used power simulator. We use the supply voltage and operating frequency (1.2
Volts and 1.5 GHz) at 65 nm from the Cortex A8 specification to compute the
power consumption. The power model works in close association with the cycle
accurate processor model and determines the power consumption every cycle. For
temperature simulation we use HotSpot [89] which uses a thermal RC model. The

Figure 3.3: Temperature profiles for individual programs with initial temperature 40°C

that computes the temperature. We use a sink thermal resistance of 1.0°C/W and
an ambient temperature of 40°C.
3.2.1 Thermal Behavior of Individual Applications
In this section, we study the temperature profiles of sixteen embedded benchmarks
under the following scenario: (i) the benchmark is the only program executing on
the processor, and (ii) the benchmark executes indefinitely. The goal of this study
is to understand and characterize the temperature profiles of each benchmark under
different initial temperatures (see Figures 3.3 and 3.4).
Figure 3.3 shows the temperature profile for the sixteen embedded benchmarks
we use in this study. The temperature rises exponentially starting from an initial
state and reaches a steady state value. There is very little change in temperature
once the steady state is reached. It can be seen from these profiles that (i) there
is a large variation in temperature among different benchmarks, and (ii) there is
very little variation in temperature within a given benchmark. We also studied the
thermal behavior of these benchmarks with different values of initial temperature

Figure 3.4: Temperature profiles for individual programs with initial temperature 70°C

the temperature moves exponentially to a steady state temperature. The rise
of temperature to a steady state is expected as a result of the thermal diffusion
equations. However, it is interesting that different applications show different
steady state temperature and there is very little variation in temperature once the
steady state is reached. Figure 3.4 shows the temperature profile starting from
an initial temperature of 70°C. Benchmarks with steady state temperature higher
than 70°C show a rise in temperature to the steady state while others show a
fall in temperature to the steady state. Different programs have different steady
state temperatures. Thus given an embedded task, its thermal characteristics can
be summarized by its steady state temperature. Table 3.1 shows the steady state
temperature of the sixteen embedded benchmarks and the variation in temperature
once the steady state is reached for these benchmarks.
Thermal Model for Single Application
In this section we use our characterization results to express the thermal behavior
of a single application as an exponential function. Let $T_s$ be the steady state
temperature of an application A. Let $T(t)$ represent the temperature at time t
and let $T_{init}$ be the temperature when A starts execution. That is, $T(0) = T_{init}$.
Benchmark      Source       Steady State        Max Variation in
                            Temperature (°C)    Temperature (°C)
adpcm          MiBench      64.35               0.45
blowfish       MiBench      84.75               0.08
crc            MiBench      88.85               0.4
dijkstra       MiBench      65.85               0.68
djpeg          MiBench      82.55               0.17
epic           MediaBench   67.15               0.08
g721           MediaBench   86.55               0.26
ghostscript    MiBench      64.15               0.44
gsm            MediaBench   80.35               0.46
lame           MiBench      62.35               0.51
mp3            MediaBench   71.25               0.64
patricia       MiBench      49.85               0.09
pegwit         MediaBench   59.95               0.34
sha            MiBench      87.15               0.23
stringsearch   MiBench      77.65               0.03
susan          MiBench      81.75               0.13
Table 3.1: Benchmark Characteristics
The single-task thermal profile can be expressed by the function
$$T(t) = T_s - (T_s - T_{init}) \times e^{-Kt} \qquad (3.1)$$
where K is a processor-specific but application-independent constant [89].
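As an illustration, Equation 3.1 can be evaluated directly. The following Python sketch is not part of the original tool chain: the function name is hypothetical, the value of K is the one reported in this section for the modeled processor, and the time unit (taken here to be milliseconds, as on the time axis of Figure 3.5) is an assumption.

```python
import math

K = 0.00472  # value reported in this section; time unit assumed to be ms


def temperature(t, t_steady, t_init, k=K):
    """Temperature at time t according to Equation 3.1."""
    return t_steady - (t_steady - t_init) * math.exp(-k * t)


# crc has a steady state temperature of 88.85 C in Table 3.1; start from 70 C.
for t in (0, 250, 500, 1000):
    print(t, round(temperature(t, 88.85, 70.0), 2))
```

A task whose steady state temperature lies below the initial temperature (for example patricia at 49.85°C) yields a falling curve from the same function.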
Determining Constant K. The value of K is determined offline by observing
the heating and cooling curves of a number of applications on the
target processor. We fit the observed thermal profiles of the benchmarks to
Equation 3.1 and compute the value of K. For our processor model and setting we
observe K = 0.00472. We now verify the accuracy of Equation 3.1 in predicting
the thermal profile of a single task.
Accuracy of Thermal Model. We applied the model (Equation 3.1) to predict
the temperature variations and compared the temperature curves from the model
with those from HotSpot [89] simulation. For each benchmark, we measured the peak
error, that is, the temperature difference at the point where the predicted and the
observed curves diverge the most. The maximum peak error is 0.6°C and the
average peak error is 0.14°C across all the benchmarks.
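One possible way to carry out the offline fit and the peak error check is sketched below in Python. The text does not state which fitting procedure was used; the sketch assumes a least-squares fit of the linearized form ln((T_s − T(t)) / (T_s − T_init)) = −Kt, with the first sample taken at time 0. The names fit_k and peak_error are hypothetical, and the profile used in the example is synthetic, not measured data.

```python
import math


def fit_k(times, temps, t_steady):
    """Least-squares estimate of K (line through the origin) from one profile."""
    t_init = temps[0]  # assumes the first sample is taken at time 0
    num = den = 0.0
    for t, temp in zip(times, temps):
        ratio = (t_steady - temp) / (t_steady - t_init)
        if t > 0 and ratio > 0:  # skip samples where the logarithm is undefined
            num += t * -math.log(ratio)
            den += t * t
    return num / den


def peak_error(predicted, observed):
    """Largest absolute difference between two curves sampled at the same times."""
    return max(abs(p - o) for p, o in zip(predicted, observed))


# Synthetic profile generated from Equation 3.1 itself, for illustration only.
times = [0, 100, 200, 400, 800]
temps = [88.85 - (88.85 - 70.0) * math.exp(-0.00472 * t) for t in times]
print(fit_k(times, temps, 88.85))
```

In practice the estimate would be averaged over the heating and cooling curves of several benchmarks, since K is described as application-independent.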
As seen from the preceding discussion, tasks exhibit heterogeneous behavior in their
thermal profiles. Next we study the thermal characteristics when a combination of
such heterogeneous tasks executes in a multi-tasking embedded system. We observe
that, given a set of heterogeneous tasks, the thermal characteristics of their
execution depend on (i) the relative amount of time given to the hot tasks and the
cold tasks, and (ii) the order of execution of the tasks. We present cases showing
how each of these factors impacts the thermal profile.
Impact of Task Ordering on Temperature Profile. When a set of heterogeneous
tasks executes on the processor in a non-preemptive fashion (we discuss the impact
of preemption and task shares in the next section), the temperature profile is very
sensitive to the order in which the tasks execute. Consider a set of heterogeneous
tasks, namely crc, epic, gsm, stringsearch, dijkstra, djpeg, adpcm and patricia.
Figure 3.5 presents the temperature profiles for two different execution orderings
of this task set. The peak temperatures of the two sequences differ by 8.45°C. To
generalize this observation, we experimented with a number of randomly generated
task sets comprising tasks from Table 3.1 and found that the temperature profile
was sensitive to the order of execution. On average across these task sets, the
temperature can vary by up to 6.2°C among different execution orderings.
Based on the observation that the temperature profile is highly sensitive to the
order of execution of the tasks, we design a technique that employs task reordering for





















Figure 3.5: Temperature curves for two different task sequences of the same task
set (x-axis: Time (ms), 0 to 2000)
technique is presented in Chapter 6.
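The sensitivity to task ordering can be reproduced qualitatively by chaining Equation 3.1 across tasks. The Python sketch below is only illustrative: it assumes every task runs for the same hypothetical duration (250 ms), that time is measured in milliseconds, and that the temperature at the end of one task is the initial temperature of the next. The two orderings are chosen for the example and are not the sequences plotted in Figure 3.5.

```python
import math

K = 0.00472  # value reported in this chapter; time unit assumed to be ms

# Steady state temperatures from Table 3.1 (degrees C).
STEADY = {"crc": 88.85, "epic": 67.15, "gsm": 80.35, "stringsearch": 77.65,
          "dijkstra": 65.85, "djpeg": 82.55, "adpcm": 64.35, "patricia": 49.85}


def peak_temperature(order, duration_ms=250.0, t_init=70.0):
    """Peak temperature when tasks run back to back, each for duration_ms."""
    temp = peak = t_init
    for name in order:
        t_steady = STEADY[name]
        temp = t_steady - (t_steady - temp) * math.exp(-K * duration_ms)
        # Equation 3.1 is monotonic within a task, so the peak is at an end point.
        peak = max(peak, temp)
    return peak


coldest_to_hottest = sorted(STEADY, key=STEADY.get)
interleaved = ["patricia", "crc", "adpcm", "djpeg",
               "dijkstra", "gsm", "epic", "stringsearch"]
print(round(peak_temperature(coldest_to_hottest), 2))
print(round(peak_temperature(interleaved), 2))
```

Running the hot tasks consecutively lets the temperature climb towards the hottest steady state, whereas interleaving cold tasks between hot ones keeps the peak lower under this model.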
Impact of Task Shares on Temperature Profile. In a preemptive multi-tasking
system with heterogeneous tasks, the temperature profile can be controlled by
changing the shares of execution time given to the hot and cold tasks. From our
analysis of the temperature profile of a single task, we observe that the thermal
behavior of a task can be summarized in terms of its steady state temperature.
Given a set of tasks with high steady state temperatures (hot tasks) and tasks with
low steady state temperatures (cold tasks), the temperature can be controlled by
giving a higher share of execution time to the cold tasks. Figure 3.6 shows the
temperature profiles for three different executions of a task set consisting of a hot
task (crc) and a cold task (dijkstra). Clearly, the temperature can be controlled
by giving a larger share of execution time to the cold task. In Chapter 7 we present
a dynamic scheduling technique that manages temperature by controlling the shares
of execution time given to the hot and cold tasks. The scheduling technique
manages temperature while satisfying a number of system level constraints such
as real time deadlines and fairness.
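The effect of task shares can likewise be explored with Equation 3.1. The Python sketch below assumes a simple round-robin schedule with a hypothetical 10 ms period split between crc and dijkstra, a time unit of milliseconds, and no context-switch overhead. It is a toy model and not the scheduler of Chapter 7.

```python
import math

K = 0.00472                # value reported in this chapter; time unit assumed to be ms
HOT, COLD = 88.85, 65.85   # steady state temperatures of crc and dijkstra (Table 3.1)


def settled_temperature(hot_share, period_ms=10.0, periods=400, t_init=70.0):
    """Temperature at the end of the last period for a given hot-task share."""
    temp = t_init
    for _ in range(periods):
        # Each period runs the hot task first, then the cold task.
        for t_steady, share in ((HOT, hot_share), (COLD, 1.0 - hot_share)):
            temp = t_steady - (t_steady - temp) * math.exp(-K * period_ms * share)
    return temp


for share in (0.25, 0.5, 0.75):
    print(share, round(settled_temperature(share), 2))
```

Under these assumptions the settled temperature lies between the two steady state temperatures and moves towards the cold task's value as its share grows.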


























Figure 3.6: Temperature curves with different shares of execution time to hot and
cold task
Next we study the impact of changing the underlying hardware configuration on the
temperature profile. The observations from this study are used to design thermal
management solutions that adapt the processor configuration. General purpose
systems can adapt the underlying micro-architecture at runtime depending on the
executing workload to get the best possible performance and power consumption
for that workload. Dynamic adaptation is not typically employed in embedded pro-
cessors as the processors are very simple and do not support dynamic adaptation.
Hence, we evaluate the impact of processor configuration on thermal behavior in
a general purpose system context.
3.2.2 Impact of Processor Configuration on Thermal Profile
In this section we present the impact of changing the processor configuration on the
temperature profile. We consider a single long running high performance applica-
tion executing on a configurable general purpose processor. Some of the parameters
of the processor are configurable at runtime and we examine the impact of each
Parameter                   Value
Fetch/Issue/Commit Width    6
Window Size                 64 entry
Active List                 128 entry
Integer FU                  6 Int Add / 1 MUL/DIV
Floating point FU           2 FP Add / 1 FPMUL/DIV
L1 I-Cache                  64 KB, 2-way, 64 B blocks
L1 D-Cache                  64 KB, 2-way, 64 B blocks
L2 Cache                    2 MB, unified, 8-way, 12 cycle latency
Main Memory                 75 ns
TLB                         128-entry fully associative
Branch predictor            Hybrid: 4K bimod, 4K entry/12-bit/GAg
Branch Target Buffer        2K, 4-way
Table 3.2: Parameters of the baseline processor.
of these configuration parameters individually and together on performance and
temperature. First we discuss the processor setting and the parameters of the
processor that are configurable.
Processor Settings and Workloads
We use a cycle accurate processor model working in conjunction with a power and
thermal model for our study. We use the SimpleScalar [23] simulator to simulate a
high-performance out-of-order processor with a peak issue width of six instructions
per cycle. The parameters of the processor model are summarized in Table 3.2.
We use a 128 entry active list (reorder buffer) and a 64 entry issue window. To
determine the power consumption we use power models from Wattch [22]. The
power model works in close conjunction with the performance model and the power
values are computed every cycle. We use linear scaling in Wattch to obtain the
power consumption with a supply voltage of 1.4 V and a frequency of 3.6 GHz
at 100 nm, which corresponds to the supply voltage and frequency of the Pentium
4 processor [7]. Wattch's linear scaling assumes uniform capacitance scaling from
one technology generation to another. The dynamic power dissipation of a circuit
is given by P = A × V^2 × C × F, where A is the activity factor, V is the voltage,
C is the capacitance of the circuit, and F is the operating frequency. The Wattch
models estimate the capacitance (C) of different circuits at 130 nm; for another
technology generation, capacitance values are obtained by applying appropriate
scaling factors. The scaling factors are based on data from transistor
device models [3] and industry trends [11]. For thermal modeling we use the
HotSpot-3.0 [89] temperature simulator. The average power consumption over every
10^5 cycles is provided as input to the thermal model to compute the temperature.
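The dynamic power relation and the averaging of per-cycle power into the samples passed to the thermal model can be sketched as follows. This Python fragment is an illustration only: the function names are hypothetical, it does not use the Wattch or HotSpot interfaces, the capacitance and activity values in the example are made up, and the averaging window is assumed to be 10^5 cycles.

```python
def dynamic_power(activity, voltage, capacitance, frequency):
    """P = A * V^2 * C * F (watts, for volts, farads and hertz)."""
    return activity * voltage ** 2 * capacitance * frequency


def window_averages(per_cycle_power, window=10**5):
    """Average power over consecutive full windows of cycles."""
    return [
        sum(per_cycle_power[i:i + window]) / window
        for i in range(0, len(per_cycle_power) - window + 1, window)
    ]


# Hypothetical unit: 1 nF switched capacitance, activity factor 0.5,
# at the 1.4 V / 3.6 GHz operating point used in this section.
print(dynamic_power(0.5, 1.4, 1e-9, 3.6e9))
print(window_averages([1, 2, 3, 4, 5, 6, 7], window=3))
```

A trailing partial window is dropped in this sketch; whether the original setup handles it differently is not stated in the text.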
We consider four configurable parameters namely fetch gating [88], window size
adaptation [24], issue width scaling [13] and frequency scaling [66]. We consider
these parameters as they are easily configurable and have an impact on the hottest
units of the processor [89]. Our simulation model uses four different window sizes
(16, 32, 48 and 64), eight different fetch gating levels (1 to 8) and five different issue
widths (2–6). When a fetch gating level of T is specified, the fetch unit is halted
for one cycle after every T cycles of activity. We consider eight levels of frequency
scaling between a maximum frequency of 3.6 GHz and a minimum frequency of
2.5 GHz. We assume a penalty of 10 µs per frequency/voltage transition [35, 89].
We assume special instructions to resize the adaptive structures in software [45,
91]. The power consumption at different window sizes is obtained based on
previous work [85]. When the issue width is altered, we assume that the additional
functional units are switched off (no leakage power) and the corresponding lines of
the wake-up logic and register ports are not driven [85, 91]. Next we examine the
impact of each of these configuration parameters individually and collectively on
temperature and performance.
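The configurable parameters described above can be written down as a small configuration space. In the Python sketch below, the parameter values come from the text, while two points are assumptions: the eight frequency levels are taken to be evenly spaced between 3.6 GHz and 2.5 GHz, and the fetch duty cycle for gating level T is derived as T / (T + 1) from the statement that fetch halts for one cycle after every T active cycles.

```python
from itertools import product

WINDOW_SIZES = (16, 32, 48, 64)
FETCH_GATING_LEVELS = tuple(range(1, 9))   # T = 1 .. 8
ISSUE_WIDTHS = tuple(range(2, 7))          # 2 .. 6

F_MAX_GHZ, F_MIN_GHZ, F_LEVELS = 3.6, 2.5, 8
# Assumption: levels are evenly spaced; the text only gives the end points.
FREQUENCIES_GHZ = tuple(
    F_MAX_GHZ - i * (F_MAX_GHZ - F_MIN_GHZ) / (F_LEVELS - 1)
    for i in range(F_LEVELS)
)


def fetch_duty_cycle(level):
    """Fraction of cycles in which fetch is active for gating level T."""
    return level / (level + 1)


# One tuple per configuration: (issue width, window size, gating level, frequency).
CONFIGS = list(product(ISSUE_WIDTHS, WINDOW_SIZES, FETCH_GATING_LEVELS, FREQUENCIES_GHZ))
print(len(CONFIGS))
print(fetch_duty_cycle(1), fetch_duty_cycle(8))
```

Under this reading, level 1 is the most aggressive setting (fetch active in half of the cycles) and level 8 the mildest.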
We use 14 benchmarks from the SPEC 2000 benchmark suite. For each of these
benchmarks, we fast forward to the simulation point specified by [76] and simulate























Figure 3.7: Performance/temperature impact of different configuration parameters
for the crafty benchmark (plots: Fetch Gating, Window Size, Issue Width, Frequency)
Temperature and Performance Impact of Hardware Configuration
In this section we examine the temperature and performance impact of different
configurable parameters. As discussed earlier, the parameters we consider are
fetch gating, issue window resizing, issue width scaling and dynamic voltage and
frequency scaling.
Figure 3.7 shows the performance versus temperature plots for each of these
parameters for the crafty benchmark. For example, the curve corresponding to fetch
gating shows the performance and temperature at each of the eight
fetch gating levels. We discuss each of these configuration parameters separately.
Fetch Gating
Fetch gating controls temperature by reducing the number of instructions delivered
to the pipeline. When fewer instructions are delivered to the pipeline, the activity
factor in the pipeline, and hence the power dissipation, is lowered. This lowers the
temperature. However, delivering fewer instructions also lowers the throughput of
the pipeline and hence the performance. In our setting, we support eight different
fetch gating levels, and each of these settings results in a different performance
and temperature point. When fetch gating is applied more aggressively, the
temperature is reduced; however, performance is also lower, as seen from Figure 3.7.
Window Size Reduction and Issue Width Reduction
When the issue width or window size is reduced, both the activity factor in the
pipeline and the power consumed per activity are reduced. A scaled down issue
window dissipates less power per operation than a full size issue window. Similarly,
when the issue width is reduced, no leakage power is dissipated in the additional
functional units, and the additional ports in the register file are switched off. This
results in lower power consumption per instruction issued when either the issue
width or the window size is reduced. Moreover, fewer instructions are issued
at smaller window sizes and issue widths, resulting in a further reduction in power
consumption. Of course, the performance is impacted negatively, as can be seen in
Figure 3.7. In other words, lowering the issue width or window size results in lower
performance and temperature. However, the reduction in temperature per unit of
performance lost (slope of the curve) is higher than with fetch gating and comparable
to that of DVFS. This is because, unlike fetch gating, which only lowers the activity
factor, these techniques lower both the activity factor and the power dissipated per
activity.
Voltage/Frequency Scaling
When dynamic voltage and frequency scaling (DVFS) is employed, both the voltage
and frequency are lowered. When the frequency is lowered by a factor β, the power
























Figure 3.8: Performance/temperature impact of applying multiple configuration
parameters simultaneously for crafty benchmark
lower temperature. With DVFS, there is a cubic relation between the performance
(frequency) and power dissipation. This cubic relation makes DVFS the
most effective thermal management technique [89]. The reduction in temperature
per unit of performance lost (slope of the temperature-performance curve) is the
highest among the investigated techniques.
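The cubic relation follows from P = A × V^2 × C × F if the supply voltage is lowered in proportion to the frequency. That proportionality is an assumption of the Python sketch below, as are the function names; A and C are held constant.

```python
def dvfs_power_ratio(beta):
    """Dynamic power relative to nominal when V and F are both scaled by beta."""
    voltage_term = beta ** 2   # P is proportional to V^2
    frequency_term = beta      # and to F
    return voltage_term * frequency_term


def frequency_only_power_ratio(beta):
    """For comparison: frequency scaled by beta at a fixed voltage."""
    return beta


beta = 2.5 / 3.6  # lowest versus highest frequency level in our setting
print(round(dvfs_power_ratio(beta), 3), round(frequency_only_power_ratio(beta), 3))
```

With these assumptions, running at 2.5 GHz instead of 3.6 GHz costs about 31% of the clock rate but removes roughly two thirds of the dynamic power.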
Combined Impact of Multiple Techniques
Figure 3.8 shows the impact of applying multiple configuration parameters
simultaneously for thermal management. In our setting, we have eight different
fetch gating levels, four different window sizes, five different issue widths and eight
frequency levels, resulting in a total of 1280 possible configurations when all the
parameters are applied simultaneously. The performance versus temperature points
for a subset of these configurations for crafty are shown in Figure 3.8. An
interesting observation from Figure 3.8 is that for a given temperature, there are
multiple configurations with different performance. This is in stark contrast to
the single parameter configurations explored in Figure 3.7, where a reduction in
temperature is always accompanied by a loss in performance. When multiple
parameters are configured simultaneously, they work synergistically towards lowering
temperature and thus provide better temperature-performance tradeoffs than
configuring each parameter individually. We repeated this analysis for a number
of other benchmarks in the SPEC CPU 2000 suite and found similar results. We
also examined the Pareto-optimal points and found large variations in the set of
points from benchmark to benchmark. For applications with low instruction level
parallelism (ILP), we found that design points with smaller machine width (issue
width and window size) were optimal, while high ILP applications favored wider
machine widths and lower frequencies.
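The Pareto-optimal points mentioned above can be extracted from a set of (performance, temperature) pairs with a simple dominance check. The Python sketch below assumes that higher performance and lower temperature are both preferred; the sample points are invented for illustration and are not taken from Figure 3.8.

```python
def pareto_front(points):
    """Return the (performance, temperature) points not dominated by any other."""
    front = []
    for perf, temp in points:
        dominated = any(
            (p >= perf and t <= temp) and (p > perf or t < temp)
            for p, t in points
        )
        if not dominated:
            front.append((perf, temp))
    return sorted(front)


# Hypothetical (relative performance, temperature in C) pairs.
samples = [(1.0, 80.0), (0.9, 75.0), (0.8, 78.0), (0.7, 70.0)]
print(pareto_front(samples))
```

Here (0.8, 78.0) is dropped because (0.9, 75.0) is both faster and cooler, which mirrors the observation that several configurations can share a temperature while differing in performance.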
The goal of our temperature management strategy is to maximize performance
while maintaining temperature below a specified limit. This is a commonly found
objective for thermal management in high performance computer systems [89].
We observe that in such a setting, employing multiple configurable parameters in
a combined fashion provides very good performance-temperature tradeoffs. Given
a particular temperature constraint, the optimal configuration point is the one
providing the highest performance among all the points that satisfy the temperature
constraint. In Chapter 4, we design a software based framework for thermal
management that is motivated by these observations. Our framework employs the
four configuration parameters studied in this section in a combined fashion for
thermal management. Further, in Chapter 5 we show that using a combination
of multiple parameters is highly effective for thermal management in multi-core
architectures. We propose a software based thermal management framework for




In this chapter, we presented a characterization of thermal profiles for (i) different
programs executing on an embedded processor, and (ii) a long running application
executing on different processor configurations. We observed that different embed-
ded applications show a wide variation in their thermal behavior and the thermal
profile of an embedded application can be summarized in terms of its steady state
temperature. For task sets comprising multiple tasks, the thermal profile is
sensitive to the order of execution of different tasks as well as the relative shares of
processing time for different tasks. We also characterized a single long running
application executing on different hardware configurations. In this characterization,
we evaluated the impact of using different configurable parameters individually
and collectively for thermal management. We observed that controlling multiple
configurable parameters synergistically for thermal management results in better
temperature/performance tradeoffs than employing each parameter individually.
We use the observations in this chapter to design thermal management schemes
in the subsequent chapters. The varying thermal profiles of different embedded
benchmarks are used to design two software based thermal management schemes
for embedded systems in Chapters 6 and 7. Based on our observation of the
impact of processor configuration on thermal management, we design a software
based thermal management scheme for general-purpose systems in Chapter 4 and




The goal of any thermal management technique is to maintain the on-chip tempera-
ture below an acceptable limit (threshold). Unfortunately controlling temperature
is almost always accompanied by a performance loss. The main challenge is to
arrive at a thermal management solution that (i) controls temperature effectively
and (ii) has minimal impact on performance. Of course given this scenario, it is
always possible to perform a design time temperature analysis and design the pro-
cessor in such a way that the temperature remains below the threshold. However
there is a serious limitation with only static or design time thermal management.
As shown in Chapter 3, on-chip temperature can vary widely depending on the
workload that is executing on the processor. A purely static or design time ap-
proach would involve designing the processor to be thermally safe for the highest
temperature (worst case) workload that can execute on the processor. It is not
clear if the worst case (highest temperature) workload can be determined statically
in a general purpose system. Even if that were the case, such an approach would
be very conservative in choosing the design parameters and operating speed of the
chip. This would result in sub-optimal performance for most realistic workloads.
Hence, in real systems, the processor is designed to maximize the performance
of realistic workloads, and dynamic thermal management solutions are engaged to
manage thermal stress. Dynamic thermal management techniques leverage the
on-chip temperature sensors that are ubiquitous in recent processors [6, 7, 40].
The on-chip temperature is monitored continuously, and when it exceeds a predefined
threshold, appropriate mechanisms are invoked to keep the temperature in check.
Thus, unlike a conservative static approach, performance sensitive mechanisms to
control temperature are engaged only when needed.
Many temperature control mechanisms such as throttling [21], dynamic voltage
and frequency scaling [66], and activity migration [63] have been used. Each
mechanism is also accompanied by a policy that dictates how and when the mechanism
is applied depending on the temperature. For instance, with clock gating, the
processor might be stopped for one cycle in every ten for mild thermal stress and
every alternate cycle for severe thermal stress. Most of the currently proposed
thermal management techniques have focused on determining the temperature
control mechanism and the most efficient policy to employ that mechanism.
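The clock gating example above separates the mechanism (stopping the clock) from the policy (how often to stop it). A minimal Python sketch of such a policy follows; the 2°C boundary between mild and severe stress and the function name are hypothetical, since the text gives only the two gating rates.

```python
def gating_period(temp_c, threshold_c, severe_margin_c=2.0):
    """Return N such that the clock is stopped one cycle in every N.

    None means no gating. severe_margin_c is a hypothetical boundary
    between mild and severe thermal stress.
    """
    if temp_c < threshold_c:
        return None
    if temp_c - threshold_c < severe_margin_c:
        return 10   # mild stress: stop one cycle in every ten
    return 2        # severe stress: stop every alternate cycle


for temp in (80.0, 85.5, 88.0):
    print(temp, gating_period(temp, threshold_c=85.0))
```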
In this chapter we take a different approach to the design of thermal management
solutions. We observed in Chapter 3 that employing multiple different techniques
synergistically is very promising from a thermal management perspective. In
particular, we observed the impact of employing fetch gating, window size adjustment,
frequency scaling and issue width scaling simultaneously for thermal management.
The key challenge in such a scenario is the explosion of the thermal management
configuration space and the need to determine the optimal performance point at
runtime. We design a framework that samples the workload on the processor
continuously and appropriately manages temperature for that workload with minimal
performance impact. In this chapter we present our thermal management framework,
which employs architecture adaptation, fetch gating and dynamic voltage scaling
synergistically to manage temperature. Of these, fetch gating and dynamic voltage
scaling have been used in the past. Additionally, we use architecture adaptation
for thermal management and show that it is very powerful when employed in
conjunction with frequency scaling. Next we review the related work in architecture
level thermal management.
4.1 Related Work
Our work on employing multiple interacting mechanisms for thermal management
is related to three classes of prior work in thermal and power management. We
review related work in each of these classes.
4.1.1 Architecture Level Thermal Management
Recently there has been widespread interest in architecture level thermal manage-
ment techniques [89]. In these techniques, the on-chip temperature is continuously
monitored with the help of on-chip sensors and when the temperature exceeds a
predefined threshold, different mechanisms are employed to manage the temperature.
Many techniques such as clock gating [21] and dynamic voltage and frequency
scaling [66] have been proposed to manage temperature. In these techniques the
main challenge is to match the extent of the response to the extent of thermal stress.
Skadron et al. [88] employ a feedback controller to solve this problem. The
controller uses the difference between the measured on-chip temperature and the
threshold to control the operating frequency of the processor. Cohen et al. [28]
have shown that employing feedback control yields near optimal results for
single response thermal management techniques.
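To make the idea of feedback control concrete, the Python sketch below adjusts the frequency in proportion to the difference between the threshold and the measured temperature. It is a proportional-only illustration and not the controller of [88]; the gain is hypothetical, and the frequency limits are borrowed from the setting of Chapter 3.

```python
def next_frequency(freq_ghz, temp_c, threshold_c, gain=0.05, f_min=2.5, f_max=3.6):
    """One step of a proportional controller on the temperature error."""
    error = threshold_c - temp_c            # positive when there is thermal headroom
    proposed = freq_ghz + gain * error      # speed up with headroom, slow down without
    return min(f_max, max(f_min, proposed)) # clamp to the supported range


print(next_frequency(3.6, temp_c=90.0, threshold_c=85.0))
print(next_frequency(3.0, temp_c=80.0, threshold_c=85.0))
```

A larger overshoot produces a larger frequency reduction, which is the sense in which the response is matched to the extent of thermal stress.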
In contrast to existing schemes, we argue that employing multiple mechanisms in
a combined fashion for thermal management can produce significantly better
temperature/performance tradeoffs. One possible approach to engaging more than one
response for thermal management is to determine a crossover point between the
thermal management techniques, as in [87], where fetch gating and dynamic voltage
scaling are combined. When multiple mechanisms are used, tuning the thermal
management policy offline becomes extremely cumbersome and the resultant policy
becomes too complex for a hardware implementation. Instead, we recast the thermal
management problem as a dynamic configuration space exploration problem
and present a dynamic software based framework for thermal management.
4.1.2 Software Based Thermal Management
Kumar et al. [52] employ performance counters to distinguish hot and cold tasks
and manage temperature by lowering the priority of hot tasks when required. In
multi-core and multi-processor systems OS migration has been employed to manage
temperature [36, 63].
In contrast to existing scheduler based approaches, our thermal management scheme
is a coordinated hardware/software mechanism which adjusts the hardware config-
uration dynamically to maximize the performance under temperature constraints.
Recently coordinated hardware/software mechanisms for performance/power trade-
offs have been proposed. Li et al. [56] use a software based framework to explore
tradeoffs between number of active cores and operating frequency for parallel work-
loads. Dynamic software based frameworks for cache adaptation in multi-cores
have also been proposed [17, 61].
4.1.3 Architecture Adaptivity
Micro-architectural structures such as caches, issue queues, register files and others
can be adapted at runtime with minimal overhead [12]. Adaptations involve either
changing the size or the width of these structures. In our work we adapt the issue
window size and the issue width at runtime for thermal management. Adapting
architecture structures for power management has been an extensively studied
area with techniques that target caches [14, 33], register files [13], issue window
[24, 78] and functional units [85, 91]. These approaches opportunistically employ
architecture adaptation using local indicators such as issue window occupancy
levels to resize the architecture structures. In contrast to these techniques which are
single point optimizations for power reduction, we employ architecture adaptations
combined with dynamic voltage and frequency scaling for thermal management.
Next we present an overview of our thermal management framework.
























































Figure 4.2: Components of the Adaptive DTM Framework.
In Chapter 3 we showed that employing multiple control mechanisms simultaneously
provides good temperature-performance tradeoffs. Our software based framework
employs four mechanisms, namely issue width scaling, issue window size scaling,
fetch gating and frequency scaling, for thermal management. The adaptive
architecture used is shown in Figure 4.1. The challenge here is to manage the
explosion in the configuration space. The configuration space can be specified by the
tuple <IW, W, T, F>, where IW is the issue width, W is the window size, T is the
fetch gating level and F is the frequency. The goal of our framework is to determine
the optimal configuration for a specific workload. The optimal configuration
is defined as the highest performing configuration at which the temperature is
below the threshold.
Figure 4.2 presents the components of our software-based dynamic thermal
management framework. The configuration management routine (on the right in Figure
4.2) runs in software. It collects the performance counters from the processor once
every adaptation interval (10^7 cycles, or 2.8 ms at 3.6 GHz). As temperature changes
slowly [89], the adaptation interval is set in the order of milliseconds, which
is the period for timer interrupts in many systems. These workload statistics are
used to guide the choice of configuration parameters for the next interval. The
goal of the configuration search routine is to find the configuration with the maximal
performance that satisfies the thermal constraints. Clearly, when we visit a
particular configuration C′ in the search space, we need to answer two questions.
1. What is the expected performance of this configuration? For this purpose,
we develop a model that predicts the performance (in terms of number of
instructions committed per second) of configuration C′ given the counter
values for the currently running configuration C. This model is presented in
Section 4.4.
2. Is the configuration C′ thermally safe, that is, if we switch to configuration C′
in the next interval, will the processor stay below the thermal threshold? We
design a neural network classifier that takes in configuration C′ plus the number
of instructions of each class (integer, floating point, branch, and load/store)
issued per cycle as input and predicts if C′ is thermally safe. The details of
the classifier appear in Section 4.3.
Note that the classifier requires the number of instructions issued per cycle as
input, as the temperature depends on the issued instructions. The performance,
on the other hand, is determined only by the committed instructions. To bridge
this gap, our performance model also estimates the number of instructions of each
class issued per cycle corresponding to configuration C′ and feeds this information
to the classifier. Also note that our framework relies on a neural classifier to
predict if a given workload running on a configuration is thermally safe, and the
classifier can occasionally mispredict. Hence the framework is backed up by a simple
hardware based DTM scheme, which is employed in the rare event (1.14% probability)
of an unsafe configuration being chosen. In our setting we use clock gating as the
backup scheme, where the processor is frozen for one sampling interval when the
temperature hits the threshold.
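The interplay between the performance model, the classifier and the hardware backup can be summarized in a short Python sketch. The callables predict and is_safe are hypothetical stand-ins for the performance model of Section 4.4 and the classifier of Section 4.3, and the loop visits every candidate it is given rather than applying the pruned search of Section 4.5.

```python
def choose_next_config(candidates, counters, predict, is_safe):
    """Return the safe candidate with the highest predicted performance, or None.

    predict(counters, cfg) -> (performance, issue_rates): performance model stand-in.
    is_safe(cfg, issue_rates) -> bool: thermal classifier stand-in.
    """
    best_cfg, best_perf = None, float("-inf")
    for cfg in candidates:
        perf, issue_rates = predict(counters, cfg)
        # Only consult the classifier when the candidate would improve performance.
        if perf > best_perf and is_safe(cfg, issue_rates):
            best_cfg, best_perf = cfg, perf
    return best_cfg


def backup_freeze(temp_c, threshold_c):
    """Hardware backup: freeze for one sampling interval once the threshold is hit."""
    return temp_c >= threshold_c


# Toy stand-ins: performance grows with issue width times frequency,
# and only issue widths of at most 4 are deemed safe.
configs = [(6, 64, 8, 3.6), (4, 32, 8, 3.6), (2, 16, 4, 2.5)]
toy_predict = lambda counters, cfg: (cfg[0] * cfg[3], {"int": 1.0})
toy_is_safe = lambda cfg, rates: cfg[0] <= 4
print(choose_next_config(configs, {}, toy_predict, toy_is_safe))
```

The backup check runs independently of the search, so a misclassified configuration is caught by the temperature sensor rather than by the software.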
The configuration space of our adaptive micro-architecture consists of 1,280 points
(8 fetch gating levels × 4 window sizes × 5 issue widths × 8 frequency levels).
Clearly, it is not feasible to evaluate all the configurations and find the optimal one.
Instead, we design an efficient search strategy that (a) reduces the four-dimensional
configuration space (fetch gating levels, window sizes, issue widths, frequency
levels) to two dimensions (IPC and frequency levels) based on insights gained from
the performance model, and (b) further prunes the two-dimensional configuration
space based on some properties of the space. Due to these optimizations, our
search strategy evaluates only a small subset of the configuration space (32 points
in the worst case). The optimized search strategy is described in Section 4.5.
4.3 Neural Network Classifier
While searching for the optimal configuration, we need to determine if a particular
configuration is thermally safe. The thermal profile of a processor typically shows
large variations among the different components of the processor (up to 15°C
difference) [89]. Moreover, the temperature of a processor component is closely related
to the usage pattern of that component for a workload. Clearly, in our adaptive
micro-architecture, the temperature of the different processor components depends
on both the configuration parameters and the workload.
To accurately determine the temperature of the system under different workloads
and configurations, we need (i) power simulations to determine the power dissipation
at a specified configuration, and (ii) temperature simulations to determine
the thermal impact of the changed power profile. The variation of temperature
with power is relatively well understood and can be approximated with linear [52]
or exponential [106] equations. However, in our setting there are multiple
interrelated factors at play, namely (i) the change in per usage power dissipation of
units due to scaling of structure sizes, (ii) the change in workload behavior and
activity factor due to a different configuration, and (iii) the change in thermal
behavior due to the new structure size and workload behavior. A detailed analytical
model that encompasses all the above effects would be computationally expensive
to use in our setting.
Instead, we employ neural networks to capture the complex nonlinear relationship
between the machine configuration, workload and temperature. In particular, we
model the problem of determining if a particular configuration is thermally safe for
the current workload as a classification problem. A classification problem consists
of a set of input features, output classes and a trained classifier. When an input
is given (i.e., the input features are assigned values), the classifier predicts the
class to which the input belongs. In our framework, the classifier partitions the
{configuration, workload} pairs into thermally safe and thermally unsafe classes.
Neural networks are commonly used to solve many classification problems. So we
design a neural network classifier to predict if a given {configuration, workload}

























Figure 4.3: Neural network classifier architecture.
Figure 4.3 shows the structure of our neural network classifier. The input features
consist of three configuration parameters: (1) instruction window size, (2) issue
width, and (3) operating frequency; plus four workload parameters: (1) number of
integer instructions issued per cycle, (2) number of branch instructions issued per
cycle, (3) number of load/store instructions issued per cycle, and (4) number of
floating point instructions issued per cycle.
We choose the workload features that correlate well with the usage pattern of the
branch predictor and the execution core (instruction window, register file, and
execution units) — the hottest components of the processor [89]. To evaluate the
choice of the features, we collect the workload statistics and the corresponding
temperature of the hottest unit for a large number of benchmarks. By employing
principal component analysis [8], we observe that these four workload features
account for 98% of the variance in the observed temperature.
The values of the workload features vary according to the configuration. For ex-
ample, the number of integer instructions issued per cycle depends on the issue
width, instruction window size as well as fetch gating level. Given a configuration,
the workload parameters are obtained from the performance prediction model (see
Section 4.4). Note that fetch gating level is excluded from input features of the
classifier. This is because fetch gating only alters the usage patterns of the pro-
cessor components, which is reflected sufficiently in the four workload features.
The rest of the configuration parameters, on the other hand, also impact power
consumption per usage.
We use a neural classifier with a single hidden layer and one neuron in the hidden
layer as shown in Figure 4.3. The hidden layer neuron uses a sigmoid trans-
fer function and the output layer neuron uses a threshold transfer function. We
use this neural network architecture as it results in high prediction accuracy with
minimal classification time. Though it involves an exponential function for each
classification (in the sigmoid function), we achieve fast computation of exponential
function using table interpolation and other approximation methods [86]. More-
over, we will show later (see Section 4.5) that our search routine is highly optimized
and evaluates only a few configurations. Therefore, the overhead of classification
is negligible in comparison to the adaptation interval.
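The classifier structure described above can be sketched as follows. This is a minimal illustration, not the thesis implementation: it assumes that the seven input features are already normalized, that w1–w7 are the input weights and w8 is the hidden-to-output weight, and that the output neuron thresholds at zero. The function names and the mapping of the output to the safe/unsafe class are hypothetical.

```python
import math
from typing import Sequence


def sigmoid(x: float) -> float:
    """Sigmoid transfer function used by the single hidden neuron."""
    return 1.0 / (1.0 + math.exp(-x))


def output_fires(features: Sequence[float], w_in: Sequence[float],
                 b1: float, w_out: float, b2: float) -> bool:
    """Forward pass of a one-hidden-neuron classifier.

    features: 3 configuration + 4 workload inputs (assumed normalized).
    w_in, b1: input weights and bias of the hidden neuron.
    w_out, b2: weight and bias of the output neuron.
    Returns True when the thresholded output is 1. Which class this
    represents (safe or unsafe) is an assumption left to the caller.
    """
    if len(features) != len(w_in):
        raise ValueError("one weight is required per input feature")
    hidden = sigmoid(sum(w * x for w, x in zip(w_in, features)) + b1)
    # Threshold transfer function, assumed to switch at zero.
    return w_out * hidden + b2 >= 0.0
```

Each classification costs one dot product and one sigmoid evaluation, which is why the exponential is the only non-trivial operation to optimize.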
4.3.2 Training the Classifier
Our neural network classifier must be trained with a training set comprising
typical inputs and their corresponding outputs to adjust the weights (w1–w8) and
the bias values (b1 and b2) in the model. We use the Levenberg-Marquardt training
algorithm [8] for training our classifier. The training algorithm is an iterative off-
line procedure that adjusts the weights and bias values in the neural classifier to
minimize the classification error. We train the classifier during system installation
and/or when the system conditions (heat sinks, ambient conditions, etc.) change.
A proper training set should include a wide variety of points in the feature space.
In our setup, this corresponds to training the classifier with different possible
configurations and workload parameters. The training set is generated by running a
set of micro-benchmarks under different configurations and checking if the resulting
execution hits the thermal threshold. Each micro-benchmark consists of a loop
body with 100 instructions. The number of loop iterations is large enough to
ensure that the loop execution time is longer than the thermal time constant of
the different processor components.
Our workload feature vector includes the number of integer, floating point, load/store,
and branch instructions issued per cycle. So the loop body of each micro-benchmark
contains a mix of these instructions. The shares of the instruction classes in this
mix are generated randomly for each micro-benchmark. However, we ensure that
the shares reflect the instruction mix observed in real programs; for example, a
typical program tends to execute more integer instructions than branches.
This is achieved by appropriately choosing the range of the randomly generated
shares for each instruction class.
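The generation of a random instruction mix for one micro-benchmark can be sketched as follows. The share ranges below are hypothetical placeholders (the thesis does not list the ranges it uses), and the normalization step is an assumption that can push a share slightly outside its nominal range.

```python
import random
from typing import Dict

# Hypothetical per-class share ranges (assumptions, not values from the thesis).
SHARE_RANGES = {
    "integer": (0.30, 0.60),
    "load_store": (0.20, 0.40),
    "floating_point": (0.00, 0.30),
    "branch": (0.05, 0.20),
}
LOOP_BODY_SIZE = 100  # instructions per loop body, as stated in the text


def random_mix(rng: random.Random, loop_size: int = LOOP_BODY_SIZE) -> Dict[str, int]:
    """Draw a share for each class, normalize, and convert to instruction counts."""
    raw = {cls: rng.uniform(lo, hi) for cls, (lo, hi) in SHARE_RANGES.items()}
    total = sum(raw.values())
    counts = {cls: int(loop_size * share / total) for cls, share in raw.items()}
    # Truncation can leave a few slots unassigned; give them to the integer class.
    counts["integer"] += loop_size - sum(counts.values())
    return counts
```

Calling `random_mix(random.Random(seed))` with different seeds yields different loop bodies while keeping integer instructions more frequent than branches for these ranges.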
4.3.3 Accuracy of the Classifier
We first train the classifier with 30 micro-benchmarks each running on 10 randomly
chosen configurations from the configuration space. After training, our neural
classifier has the following weights [w1, ..., w8] = [-3.96, -1.62, -1.58, -0.43, -1.70, -4.92,
-4.29, -1.01] and bias values b1 = 5.6716 and b2 = 0.01. Now we are ready to classify
any {configuration, workload} pair. We test the accuracy of the classifier in the
following way. First, we run each benchmark (from the SPEC 2000 [10] benchmark
suite) on the simulator for all the 160 points in the configuration feature space
(4 instruction window sizes × 5 issue widths × 8 frequency levels). Corresponding
to each {configuration, benchmark} pair, we collect the statistics of the number
of instructions issued per cycle in each instruction class (integer, floating point,
branch, and load/stores). We also record if the resulting execution hits the thermal
threshold. Next, for each {configuration, benchmark} pair, we feed the instruction
mix statistics and the configuration parameters to our trained neural network
classifier. The classification returned by our classifier (i.e., whether the execution
will hit the thermal threshold or not) is compared with the observed result from
the simulator.
Figure 4.4 plots, for each benchmark, the percentage of configurations (out of 160
total configurations) that are not classified correctly by our neural network classi-
fier. The average classification error is only 3%. However, the wrong classifications























Figure 4.4: Accuracy of the neural network classifier.
can be further divided into false positives and false negatives. In the case of a false
positive, the classifier predicts a thermally safe configuration C as unsafe, thus
excluding the possibility of C being selected by the configuration search routine. If
C turns out to be the optimal configuration, then we incur some performance loss.
In the case of a false negative, the classifier fails to detect a thermally unsafe configuration C ′,
leaving open the possibility of C ′ being selected by the search routine as the optimal
configuration. As mentioned earlier, our framework has a fail-safe option that
engages a simple hardware-level DTM mechanism such as clock gating if a thermally
unsafe configuration gets selected and the temperature hits the threshold. We can
observe from Figure 4.4 that false positives constitute the majority of the classification
errors. Across all the benchmarks, the classifier returns a false
negative for only 1.14% of the configurations on average.
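The split of classification errors into false positives and false negatives can be computed as in the sketch below. It follows the convention used above (a false positive is a safe configuration predicted as unsafe); the function name and the boolean encoding are illustrative assumptions.

```python
from typing import Dict, Sequence


def error_breakdown(predicted_unsafe: Sequence[bool],
                    observed_unsafe: Sequence[bool]) -> Dict[str, float]:
    """Return false-positive, false-negative, and total error fractions.

    predicted_unsafe[i]: classifier output for configuration i.
    observed_unsafe[i]:  whether the simulated run hit the thermal threshold.
    """
    if len(predicted_unsafe) != len(observed_unsafe) or not predicted_unsafe:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(predicted_unsafe)
    pairs = list(zip(predicted_unsafe, observed_unsafe))
    # False positive: predicted unsafe although the configuration was safe.
    fp = sum(1 for pred, obs in pairs if pred and not obs)
    # False negative: predicted safe although the configuration was unsafe.
    fn = sum(1 for pred, obs in pairs if obs and not pred)
    return {"false_positive": fp / n, "false_negative": fn / n, "error": (fp + fn) / n}
```

Only the false-negative fraction matters for thermal safety, since false positives cost performance but never trigger the fail-safe mechanism.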
4.4 Performance Prediction Model
Recall that we collect performance counter values that characterize the workload
corresponding to the current adaptation interval. Assuming that the workload will
remain similar in the next interval (or change slowly), we explore the configuration
search space to identify the optimal configuration that is thermally safe for the next
interval. We have already discussed the design of a neural network classifier that
predicts if a configuration is thermally safe. Now we present a model that can
predict the performance of any configuration.
First, let us look at the input and output parameters of the performance model.
Let C = 〈T, IW,W,F 〉 denote the configuration in the current adaptation interval,
where T is the fetch gating level, IW is the issue width, W is the instruction
window size, and F is the frequency level. That is, the application is currently
running on configuration C. We collect the following statistics from the processor
performance counters, which are input to our performance prediction model.
1. Number of committed instructions $N_{useful}(C)$
2. Number of cycles in the interval $Cycles$
3. Number of committed instructions of type X: $N^{X}_{useful}(C)$, where X can be of
type integer, floating point, branch, or load/store.
4. Instruction cache misses $IC_{miss}$, data cache misses $DC_{miss}$, and branch
mispredictions $Br_{miss}$
The model is used in two ways. First, while searching the configuration space, we
need performance estimation for each configuration C ′ = 〈T ′, IW ′,W ′, F ′〉 that is
visited. The performance is expressed as number of useful instructions committed
per second (to include the effect of frequency scaling). Second, given a configura-
tion C ′, the neural network classifier needs the number of instructions issued per
cycle for each instruction class to predict if C ′ is thermally safe. Therefore, the
performance prediction model has to estimate the number of issued integer, floating
point, branch, and load/store instructions per cycle. Note that the number of issued
instructions is typically more than the number of committed instructions due to
branch mispredictions.
So how do we estimate performance for configuration C ′ given the workload statistics
for configuration C? We extend the interval analysis [49] by Karkhanis and
Smith for this purpose. Interval analysis is based on the notion that any superscalar
processor has a sustained background level of performance that is interrupted by
miss events such as cache misses and branch mispredictions. Based on this assumption,
the CPI (cycles per instruction) of a processor can be expressed as
$$CPI = CPI_{steady} + CPI_{brmiss} + CPI_{icmiss} + CPI_{dcmiss} = CPI_{steady} + CPI_{miss} \tag{4.1}$$
where CPIsteady is the background sustainable performance when there are no miss
events and CPIbrmiss, CPIicmiss, CPIdcmiss represent the performance loss from
branch miss-predictions, instruction cache misses and data cache misses, respec-
tively. For the current configuration C, we get
$$CPI_{steady}(C) = CPI(C) - CPI_{miss}(C), \quad \text{where } CPI(C) = \frac{Cycles}{N_{useful}(C)} \tag{4.2}$$
CPImiss(C) can be computed from the individual miss events and their miss penal-
ties as follows
$$CPI_{miss}(C) = \frac{IC_{miss} \times IC_{penalty} + DC_{miss} \times DC_{penalty} + Br_{miss} \times Br_{penalty}}{N_{useful}(C)} \tag{4.3}$$
We use the results from the first order superscalar model [49, 50] to estimate
the penalties (ICpenalty, DCpenalty, and Brpenalty) associated with each miss event.
To simplify the model, we assume CPImiss remains unchanged across different
configurations in our configuration space. This is because changing instruction
window size, issue width, and fetch gating has minimal impact on the miss ratios
and their penalties. Thus, we set
CPImiss = CPImiss(C ′) = CPImiss(C) (4.4)
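As an illustration of Equations 4.2 and 4.3, the sketch below derives the miss and steady CPI components from one interval's counter values. The `Counters` and `Penalties` types and all numeric values in the usage note are hypothetical; the thesis takes the penalties from the first order superscalar model [49, 50], whereas here they are plain inputs.

```python
from dataclasses import dataclass


@dataclass
class Counters:
    """Performance counter values collected over one adaptation interval."""
    cycles: int
    n_useful: int   # committed instructions
    ic_miss: int    # instruction cache misses
    dc_miss: int    # data cache misses
    br_miss: int    # branch mispredictions


@dataclass
class Penalties:
    """Per-event miss penalties in cycles (inputs to this sketch)."""
    ic: float
    dc: float
    br: float


def cpi_miss(c: Counters, p: Penalties) -> float:
    """Equation 4.3: CPI lost to miss events."""
    lost_cycles = c.ic_miss * p.ic + c.dc_miss * p.dc + c.br_miss * p.br
    return lost_cycles / c.n_useful


def cpi_steady(c: Counters, p: Penalties) -> float:
    """Equation 4.2: measured CPI minus the miss component."""
    return c.cycles / c.n_useful - cpi_miss(c, p)
```

For example, with 1000 cycles, 1000 committed instructions, 10/20/10 instruction-cache/data-cache/branch misses and a uniform 10-cycle penalty, the miss component is 0.4 and the steady component is 0.6.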
Now the CPI of configuration C ′ for which we are estimating the performance can
be expressed as
CPI(C ′) = CPIsteady(C ′) + CPImiss(C ′) = CPIsteady(C ′) + CPImiss (4.5)
Thus for C ′, we only need to compute the steady background performance CPIsteady(C ′)
or, equivalently, IPCsteady(C ′) = 1 / CPIsteady(C ′). This matches well with our intuition that the configuration
parameters we are adapting only impact the sustained background performance.
Now we proceed to explain how the steady IPC is computed at any configuration.
Estimating Steady IPC. Karkhanis and Smith [49] characterize the issue window
characteristics of different applications under perfect caches, perfect branch
prediction, and unbounded issue width. They observe that the number of ready
instructions at window size W is $\sqrt{W}$, i.e., it is independent of the workload
characteristics. The issue window characteristic under limited issue width follows the
unbounded characteristics and saturates at the maximum issue width. Based on
this observation the steady state IPC (in the absence of miss events) at window size
W and issue width IW can be expressed as $IPC_{steady}(IW, W) = \min(IW, \sqrt{W})$.
In our setup, we also need to account for the effect of the fetch gating level. At
fetch gating level T , the fetch unit is gated once after every T cycles. Thus, the
fetch unit can deliver at most $\frac{T}{T+1} \times FW$ instructions per cycle, where FW is
the fetch width. We observe that, in the steady state, the number of instructions
fetched per cycle should be equal to the number of instructions issued per cycle.
Therefore, the steady IPC at configuration C ′ = 〈T ′, IW ′,W ′, F ′〉 can be expressed
as
$$IPC^{ideal}_{steady}(C') = \min\left(\frac{T'}{T'+1} \times FW', \; IW', \; \sqrt{W'}\right) \tag{4.6}$$
We call this ideal IPC as the characterization does not account for non-unit latency
instructions, limited number of functional units of different types, and commit
of multi-cycle operations [49]. To factor in these effects on the steady IPC, we




of the current configuration C based










As we do not adapt the latency of the functional units etc., the factor η remains
constant across different configurations. We do switch off excess unit-latency func-
tional units as we scale down the issue width. However, the number of unit-latency
functional units is always equal to the issue width and hence is never a bottleneck.
So
$$IPC_{steady}(C') = \eta \times IPC^{ideal}_{steady}(C') = \eta \times \min\left(\frac{T'}{T'+1} \times FW', \; IW', \; \sqrt{W'}\right) \tag{4.8}$$
$$CPI(C') = \frac{1}{IPC_{steady}(C')} + CPI_{miss} \tag{4.9}$$




Finally, the performance for C ′ is estimated in terms of the number of instructions
committed per second
Performance(C ′) = IPC(C ′)× F ′ (4.10)
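The prediction steps of this section can be sketched end to end as follows. This is a hedged illustration: it assumes that η is the ratio of the measured steady IPC of the current configuration to its ideal steady IPC, that the fetch width is the same for both configurations, and that frequency is given in GHz (so the result is in billions of instructions per second). The `Config` type and function names are hypothetical.

```python
import math
from typing import NamedTuple


class Config(NamedTuple):
    t: int     # fetch gating level T
    iw: int    # issue width IW
    w: int     # instruction window size W
    f: float   # frequency F in GHz (unit chosen for this sketch)


def ideal_steady_ipc(c: Config, fetch_width: int) -> float:
    """Equation 4.6: the tightest of the fetch, issue, and window limits."""
    return min(c.t / (c.t + 1) * fetch_width, c.iw, math.sqrt(c.w))


def predict_performance(cur: Config, new: Config, cpi_steady_cur: float,
                        cpi_miss: float, fetch_width: int) -> float:
    """Predict committed instructions per nanosecond for configuration `new`."""
    # Assumed definition: eta = measured steady IPC / ideal steady IPC at `cur`.
    eta = (1.0 / cpi_steady_cur) / ideal_steady_ipc(cur, fetch_width)
    ipc_steady_new = eta * ideal_steady_ipc(new, fetch_width)  # Equation 4.8
    cpi_new = 1.0 / ipc_steady_new + cpi_miss                   # steady + miss CPI
    return new.f / cpi_new                                      # Equation 4.10
```

With a current configuration 〈3, 4, 64, 3.6〉, a fetch width of 8, a measured steady CPI of 0.5 and a miss CPI of 0.25, the sketch gives η = 0.5; moving to 〈3, 2, 64, 3.0〉 halves the steady IPC and yields a predicted 2.4 billion instructions per second.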
Estimating Issued Instructions. The performance of a configuration depends
on the number of useful instructions committed per second. However, the temper-
ature of the processor depends on all the issued instructions, which includes the
instructions from the wrong path (in case of branch misprediction) that are not
committed eventually. Recall that our neural network classifier needs the number
of instructions issued (corresponding to different classes) per cycle for a configu-
ration C ′ to predict if C ′ is thermally safe. Let CPIissue(C ′) be the CPI for all the
instructions (useful or otherwise). Clearly,
CPIissue(C ′) = CPI(C ′)−∆ (4.11)
The difference comes from the additional instructions issued along the wrong path
due to branch misprediction. The value of ∆ can be computed based on the dif-
ference between the misprediction penalties when only committed instructions are
counted and issued (committed + wrong path) instructions are counted [50]. The
value of the misprediction penalty is higher when only committed instructions are
counted, as cycles during which wrong path instructions are issued also contribute
to the penalty [50]. Based on this, the value of ∆ is given by
$$\Delta = \frac{Br_{miss} \times (Br^{useful}_{penalty} - Br^{total}_{penalty})}{N_{useful}} \tag{4.12}$$
The branch misprediction penalties for the committed instructions ($Br^{useful}_{penalty}$)
and the issued instructions ($Br^{total}_{penalty}$) can be computed using the equations
in [50].
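Equations 4.11 and 4.12 amount to a single correction term, sketched below. The two penalty values are inputs here (the thesis obtains them from the equations in [50]); the function name and the numbers in the usage note are hypothetical.

```python
def cpi_issue(cpi: float, br_miss: int, n_useful: int,
              br_useful_penalty: float, br_total_penalty: float) -> float:
    """CPI over all issued instructions (committed plus wrong path).

    delta (Equation 4.12) is the per-instruction difference between the
    misprediction penalty counted over committed instructions only and the
    penalty counted over all issued instructions.
    """
    delta = br_miss * (br_useful_penalty - br_total_penalty) / n_useful
    return cpi - delta  # Equation 4.11
```

For instance, with a CPI of 1.0, 10 mispredictions per 1000 committed instructions, and penalties of 15 and 5 cycles, Δ is 0.1 and the issue CPI is 0.9.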






where IPCXissue(C ′) is the number of instructions of type X issued per cycle in
configuration C ′ and X can be of type integer, floating point, branch, or load/store.
























Figure 4.5: Accuracy of the Performance Prediction Model
Accuracy of the Performance Prediction Model. To evaluate the accuracy
of the performance prediction model, we first collect the statistics from the per-
formance counters of the baseline configuration. Using these performance counter
values as input, it is easy to estimate the performance of all the configurations em-
ploying our model. However, it is impossible to simulate all possible configurations,
which is required to evaluate the accuracy of our model. Instead, we simulate a
subset of points in the configuration space consisting of different combinations of
four issue widths 〈2, 4, 5, 6〉, four window sizes 〈16, 32, 48, 64〉 and four fetch gating
levels 〈2, 4, 6, 8〉. This gives rise to 64 different configurations. The performance
predicted by the model is compared against the actual performance obtained from
simulations. Figure 4.5 plots, for each benchmark, the average prediction error
across all the 64 configurations. The error is less than 5% for any benchmark and
the average error for all the benchmarks is only 3.8%.
4.5 Configuration Search Strategy
In this section, we present our strategy to perform an intelligent search of the
configuration space. The goal of this algorithm is to find, for the next adaptation
interval, the best performing configuration that is within the thermal limit. So
far we have seen how to estimate the performance and thermal behavior of a
configuration C ′ given the workload statistics for the configuration C in the current
interval. The configuration search algorithm invokes these estimation routines
to evaluate any point in the configuration space. A simple search strategy is to
evaluate every point in the configuration space and find the optimal configuration.
However, such a search strategy is not feasible given the number of points in the
configuration space (1,280 points) and the online nature of our DTM framework,
which invokes the search once every adaptation interval.
Reducing Search Space. Our search strategy exploits insights and observations
we derived while building the performance prediction model in Section 4.4 to reduce
the search space. Let us take a closer look at Equation 4.8 that models the steady
IPC for a configuration. It is clear from Equation 4.8 that the steady IPC is
constrained either by the fetch gating level (T ), the window size (W ) or the issue
width (IW ). Therefore, the performance of a processor cannot be improved by
over-designing along one of the configuration parameters while restricting the other
parameters — a balanced architecture provides the best performance. In other
words, given a target steady IPC, we can compute appropriate values of T,W , and















Our search strategy exploits this observation to reduce the four dimensional con-
figuration space (T , W , IW and frequency F ) into a two dimensional search space
consisting of only frequency and steady IPC (see Figure 4.6).
Once again, it is possible to employ a naive search strategy in this two-dimensional

















Figure 4.6: Reduction of the configuration search space.
Figure 4.7: Pruning of the configuration search space (axis: steady IPC; legend: pruned by binary search).
quency level. For each 〈F, IPCsteady〉 pair, we first compute the configuration parameters
T , W , and IW that can support the steady IPC from Equation 5.4. Next,
we can compute the performance corresponding to the configuration 〈T, IW,W,F 〉
using the model presented in Section 4.4. The performance model also returns
the number of instructions of each class issued per cycle corresponding to this
configuration. This issued instruction mix, along with the configuration parameters,
is fed into the neural network classifier, which returns whether the configuration is
thermally safe.
Pruning Search Space. It is possible to prune this design space further based
on the observation that any increase in IPC or frequency level results in higher
performance and temperature. Figure 4.6 shows the two-dimensional configuration
space (IPC values and frequency levels in this figure are chosen for illustration
purposes only). The goal of the search algorithm is to find the configuration with
the highest performance that is thermally safe. We employ the following pruning
strategies.
1. If a configuration 〈F, IPC〉 is thermally unsafe, then all the configurations
〈F,X〉 where X > IPC are also thermally unsafe. Therefore, we can prune
all those points along the IPC axis.
2. If a configuration 〈F, IPC〉 is thermally safe, then all the configurations
〈F,X〉 where X < IPC have lower performance than the known thermally
safe point 〈F, IPC〉. Therefore, we can prune all those points along the IPC
axis.
3. If 〈F, IPC〉 is the highest performing, thermally safe configuration at fre-
quency level F , then we can prune all the configurations 〈F ′, IPC ′〉 where
F ′ < F and IPC ′ ≤ IPC. This is because all these pruned configurations
with lower frequency and IPC are guaranteed to have lower performance than
〈F, IPC〉.
To take advantage of the third pruning strategy, our search proceeds from the
highest frequency level to the lowest frequency level. This way, once we have
found the highest performing point at frequency level F , we can prune all the
points with lower IPC for frequency level F ′ < F . The first two pruning strategies
are exploited by employing a binary search along the IPC axis at a particular
frequency level. Let [IPC1, . . . , IPCN ] be the range of IPC values we need to
examine at frequency level F . Then, we first evaluate the configuration C with IPC
value $\frac{IPC_1 + IPC_N}{2}$. If this point is thermally unsafe, then we can eliminate all the
points with higher IPC and only need to look at the points $[IPC_1, \frac{IPC_1 + IPC_N}{2} - 1]$.
If, on the other hand, C is thermally safe, then we need to
look at the points $[\frac{IPC_1 + IPC_N}{2} + 1, IPC_N]$.
Our search strategy for a hypothetical search space with four frequency levels and
eight IPC values is shown in Figure 4.7. The search starts at the highest frequency
F4. It evaluates 〈F4, 5〉, which is the midpoint of the IPC space and determines
that it is thermally safe. Therefore, points with lower values along the IPC axis
〈F4, 1〉 . . . 〈F4, 4〉 are pruned. The search proceeds to the remaining points along
the IPC axis and evaluates the midpoint 〈F4, 7〉 as thermally unsafe. Now the
higher IPC part (〈F4, 8〉) of the search space is pruned. The search continues in
this fashion evaluating the point 〈F4, 6〉 and finally returns 〈F4, 5〉 as the feasible
point with highest IPC at frequency F4. The search then moves to the next
frequency level F3 where F3 < F4 with IPC range [6 . . . 8]. The configuration
space 〈F3, 1〉 . . . 〈F3, 5〉 is pruned as the points within this space are guaranteed
to have lower performance than the point 〈F4, 5〉 returned for higher frequency F4.
In this fashion, the search proceeds for each frequency level and selects the best
performing thermally safe point at each frequency level. Finally, the configuration
with the highest predicted performance among all the frequency levels is selected
as the optimal configuration.
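The search with all three pruning strategies can be sketched as below. This is an illustrative reconstruction, not the optimized routine of the thesis: IPC levels are represented by indices 1..N, the midpoint is rounded up so that the probe order matches the worked example above, and `is_safe` and `performance` are hypothetical callbacks standing in for the neural network classifier and the performance prediction model.

```python
from typing import Callable, List, Optional, Sequence, Tuple


def search(freqs_desc: Sequence[int], n_ipc: int,
           is_safe: Callable[[int, int], bool],
           performance: Callable[[int, int], float]
           ) -> Tuple[Optional[Tuple[float, int, int]], List[Tuple[int, int]]]:
    """Return ((perf, freq, ipc) of the best safe point, list of probed points).

    freqs_desc: frequency levels ordered from highest to lowest.
    Assumes safety is monotonic: if (f, ipc) is unsafe, so is any higher IPC at f.
    """
    best: Optional[Tuple[float, int, int]] = None
    probed: List[Tuple[int, int]] = []
    lo_bound = 1  # strategy 3: raised after each frequency level
    for f in freqs_desc:
        lo, hi = lo_bound, n_ipc
        best_at_f: Optional[int] = None
        while lo <= hi:
            mid = (lo + hi + 1) // 2  # midpoint rounded up (assumption)
            probed.append((f, mid))
            if is_safe(f, mid):
                best_at_f = mid       # strategy 2: lower IPC points are dominated
                lo = mid + 1
            else:
                hi = mid - 1          # strategy 1: higher IPC points are unsafe
        if best_at_f is not None:
            perf = performance(f, best_at_f)
            if best is None or perf > best[0]:
                best = (perf, f, best_at_f)
            lo_bound = best_at_f + 1  # strategy 3: skip IPC <= best at lower freqs
    return best, probed
```

With four frequency levels, eight IPC levels, and a safety rule under which F4 is safe only up to IPC level 5, the probes at F4 are 〈F4, 5〉, 〈F4, 7〉, 〈F4, 6〉, and the search at F3 starts from IPC level 6, as in the example above.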
Complexity of the Search Algorithm. Our configuration search algorithm
has a worst case complexity of O(Lf × ln(LIPC)) where Lf is the number of fre-
quency levels and LIPC is the number of steady IPC levels. In our implementation
we consider 8 levels of frequency scaling. The steady IPC is varied between 2 and
6 in increments of 0.5. This gives us nine possible points in the steady IPC space,
which are combined with eight possible points in the frequency space. In other
words, our search algorithm would examine 32 configuration points in the worst
case. We implement our configuration search algorithm with a number of key optimizations
such as using constants and pre-computations wherever possible, fast
exponentiation [86], unrolling the search loops, use of relational operators instead
of branches and others. We observe that our optimized configuration search rou-
tine takes around 8000 cycles on our simulated architecture in the worst case (32
search points). This overhead is small when compared to our configuration inter-
val (order of milliseconds). For a configuration interval of 1 ms this represents an
overhead of about 0.3% in the worst case.
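The worst-case count of 32 search points follows from the standard bound on binary search, which needs at most $\lceil \log_2(n+1) \rceil$ probes over $n$ points. Ignoring the additional savings from the third pruning strategy, the arithmetic is:

```latex
\[
L_f \times \left\lceil \log_2\!\left(L_{IPC} + 1\right) \right\rceil
  = 8 \times \left\lceil \log_2 10 \right\rceil
  = 8 \times 4
  = 32
\]
```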
4.6 Experimental Methodology and Results
We now present our experimental methodology (including the processor model
and the power model) and an evaluation of our software-based DTM
scheme against state-of-the-art DTM techniques.
4.6.1 Processor Model and Workloads
We use the SimpleScalar 3.0 simulator with Wattch power models [22, 23] for
our experimental evaluation. The micro-architectural parameters of our baseline
processor are given in Table 3.2. We model an out-of-order superscalar processor
with an issue width of 6 instructions per cycle, a 128-entry active list (reorder buffer),
and a 64-entry issue window. As mentioned earlier, our adaptive architecture has
four possible window sizes (16,32,48,64), five possible issue widths (2–6), and eight
fetch gating levels. The performance, power and thermal models use the same
settings as discussed in Chapter 3 (Section 3.2.2).
For the DTM implementation, the maximum allowed temperature is 85°C and after
adjusting for sensor placement and reading errors, we get a threshold of 82°C [89].
We use 14 benchmarks from the SPEC 2000 benchmark suite. For each of these
benchmarks, we fast forward to the simulation point specified by [76] and simu-
late a total of 500 million instructions. Our simulation consists of an architectural
warmup phase and a thermal warmup phase after which the statistics are col-
lected [89].
4.6.2 Dynamic Thermal Management Schemes
We compare our software-based thermal management scheme exploiting archi-
tectural adaptation (called adaptive DTM) against two state-of-the-art hardware
based schemes namely DVS and hybrid DTM (DVS + fetch gating). We use a
PI-control based DVS scheme and use Matlab to design a PI controller with a
set point of 81.8°C using the method described in [35]. The PI controller also
includes a low-pass filter to prevent frequent voltage transitions [89]. The hybrid
DTM scheme [87] combines fetch gating and DVS for thermal management.
As discussed earlier, the configuration search algorithm (implemented in software)
of adaptive DTM has worst-case overhead of 8,000 cycles. In our experiments, we
assume this worst-case overhead for each invocation of the search routine. Another
important parameter for adaptive DTM is the configuration interval or the interval
at which the search routine is invoked. We set the configuration interval to $10^7$
cycles (2.8 ms at 3.6 GHz).
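As a quick check of these numbers, reading the configuration interval as $10^7$ cycles (which is consistent with the stated 2.8 ms at 3.6 GHz), the interval length and the relative cost of one worst-case search invocation work out to:

```latex
\[
\frac{10^{7}\ \text{cycles}}{3.6 \times 10^{9}\ \text{cycles/s}} \approx 2.78\ \text{ms},
\qquad
\frac{8000\ \text{cycles}}{10^{7}\ \text{cycles}} = 0.08\%
\]
```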
4.6.3 Performance Comparison
Figure 4.8 plots the slowdown of the three DTM schemes compared to the base-
line architecture operating at the maximum frequency (i.e., without any thermal



















Figure 4.8: Performance comparison of different DTM schemes.
the processor exceeds the threshold. It is clear that adaptive DTM has significant
performance benefit (lower slowdown) compared to the existing DTM schemes.
On average, adaptive DTM has 11.68% slowdown while DVS and hybrid DTM
have 24.4% and 19.37% slowdown, respectively. This represents a 52% reduction in
slowdown compared to DVS and a 39% reduction in slowdown compared to hybrid
DTM. Next, we explain where this performance benefit of adaptive DTM
comes from.



















Figure 4.9: Temperature profile for crafty
Figures 4.9-4.10 plot the time-varying temperature profiles and throughput for






















































Figure 4.12: Performance profile for gcc
remains above the thermal threshold for the entire duration of execution. The





































































Figure 4.16: IPC profile for crafty
temperature below the threshold. The corresponding performance plots show that
keeping the temperature below the threshold results in loss of performance (measured in billion
instructions per second, or BIPS). The performance of all three DTM schemes
(adaptive DTM, hybrid DTM and DVS) is lower than the baseline. However, the
performance of adaptive DTM is higher than DVS and hybrid DTM for most points
in the plot.
We further analyzed the loss in performance in terms of its frequency and IPC components.
The frequency plots are shown in Figures 4.13 and 4.14, and the IPC plots are
shown in Figures 4.15 and 4.16. As seen from Figures 4.13 and 4.14, adaptive DTM in general
results in a higher operating frequency than DVS and hybrid DTM for the
same thermal constraint, which improves performance. This is because,
unlike the other DTM schemes, adaptive DTM scales the micro-architecture
structures in conjunction with frequency scaling. However, this impacts IPC as
seen from Figures 4.15 and 4.16. Our configuration search strategy optimizes along
both the frequency and IPC axes and hence achieves better performance than
existing techniques.
We also found that DVS and hybrid DTM result in more frequent voltage and
frequency transitions than adaptive DTM. We observe this trend even though these
schemes include a low-pass filter that performs a voltage transition only if the resulting
expected performance is higher than the voltage transition overhead [87].
This also makes DVS and hybrid DTM more sensitive to the voltage transition
overhead than adaptive DTM. When the transition overhead is not accounted for,
the average slowdown of DVS drops from 24.02% to 14.62%, that of hybrid DTM drops
from 19.37% to 13.11%, and that of adaptive DTM drops from
11.68% to 9.23%. Most practical high performance systems incur non-negligible
overhead for voltage/frequency transitions [7, 40, 91].

Benchmark   C1                   C2                   C3                   C4
art         <4, 32, 3, 3.6>      <3, 48, 4, 3.45>     <5, 48, 5, 3.45>
ammp        <7, 48, 5, 3.30>     <3, 48, 4, 3.6>      <−1, 64, 6, 3.14>
bzip        <5, 48, 5, 3.45>     <5, 48, 4, 3.45>     <−1, 64, 6, 3.3>     <3, 32, 3, 3.6>
crafty      <6, 32, 5, 3.3>      <−1, 48, 6, 3.1>     <4, 32, 4, 3.45>
eon         <7, 48, 5, 3.45>     <4, 32, 4, 3.6>
facerec     <4, 32, 4, 3.6>      <6, 48, 5, 3.45>
fma3d       <6, 48, 5, 3.6>
gcc         <5, 48, 5, 3.45>     <3, 32, 4, 3.6>      <4, 32, 4, 3.30>
gzip        <5, 48, 5, 3.45>     <4, 32, 4, 3.6>      <5, 48, 5, 3.30>
mesa        <5, 32, 5, 3.30>     <3, 16, 3, 3.6>      <−1, 48, 6, 3.6>
parser      <5, 32, 5, 3.4>      <4, 32, 4, 3.6>      <2, 32, 3, 3.6>
sixtrack    <4, 48, 4, 3.45>     <3, 16, 3, 3.6>      <5, 32, 5, 3.30>
vortex      <4, 48, 5, 3.45>     <−1, 64, 6, 3.6>
wupwise     <4, 32, 4, 3.45>     <2, 32, 3, 3.6>      <−1, 64, 6, 3.14>
Table 4.1: Frequently selected configuration points by adaptive DTM.
4.6.5 Configuration Points for Adaptive DTM
Let us now examine the configuration points selected by our search algorithm. The
most frequently selected (or dominant) configuration points for each benchmark
are presented in Table 4.1. These configuration points cover more than 90% of all
the configuration intervals for each benchmark. Each configuration is represented
by the tuple < T,W, IW,F > where T specifies the fetch gating level, W denotes






















Figure 4.17: Impact of inaccuracy of the neural network classifier on performance.
GHz. It is clear that there is significant variation among the configuration points
selected for different benchmarks. Indeed, even for the same benchmark, different
phases may require different configuration points. The exception is fma3d, which
has a very static profile consisting of only a single configuration point.
4.6.6 Impact of Inaccuracy in Classifier
As discussed earlier, adaptive DTM relies on a neural network classifier to predict
if a configuration is thermally safe. In case the prediction goes wrong, it must
be supported by some appropriate hardware-based thermal management scheme
that can guarantee thermal safety. We use a simple thermal management scheme,
namely clock gating, where the processor is frozen for one sampling interval when
the temperature hits the threshold [21]. To evaluate the impact of prediction
inaccuracy, we compare the slowdown of adaptive DTM with and without clock
gating in Figure 4.17. Adaptive DTM without clock gating does not incur any
performance loss due to misprediction by the classifier; however, it is not thermally
safe. The average slowdown reduces from 11.68% to 10.3% when clock gating is not
engaged even under thermal distress. As the classifier has very high accuracy, the


















































Figure 4.18: Impact of Different Parameters on Performance
DTM scheme.
4.6.7 Impact of Individual Configuration Parameters
In this section we examine the impact of using each parameter separately and in
a combined fashion. Figure 4.18 shows the slowdown when each of our adaptive
parameters (issue width, window size and fetch gating) is used individually along
with DVFS for thermal management. In each of these cases our software configuration
search is applied but only one of the parameters is adapted. When only
one technique is engaged along with DVFS, applying DVFS in conjunction with
issue window resizing gives the best results. This is consistent with our observation in
Chapter 3, which shows issue window scaling and DVFS to be the two most
effective single response thermal management techniques. However, there is still a
significant difference in performance (up to 4.6% on average) between employing all
these techniques simultaneously and using only DVFS and issue window
size scaling for thermal management.
4.7 Summary
In this chapter we presented a dynamic software-based framework for thermal
management. The framework is inspired by our observation in Chapter 3 that
employing multiple temperature reduction mechanisms simultaneously results in
significantly better temperature-performance tradeoffs than engaging a single response.
Our framework, which uses dynamic voltage and frequency scaling (DVFS),
issue width scaling, window resizing and fetch gating simultaneously for thermal
management, reduces the average slowdown by about 8 percentage points (from 19.37%
to 11.68%) relative to the best performing state-of-the-art DTM technique. In the next chapter we design a




Power and temperature constraints are seriously limiting frequency increase in recent
processor generations. Until recently, the operating frequency of the
chip increased steadily with every processor generation (up to 90 nm) [32].
Since the 65 nm generation, Moore’s law has taken a different path, where the increased
number of on-chip transistors in each generation has resulted in an increased number
of on-chip cores. Each core can operate at a potentially lower frequency than
the cores of the previous generation. For instance, a single-core Pentium 4 processor
at 90 nm operates at a frequency of 3 GHz while a dual-core Core 2 Duo processor
at 65 nm operates at 2.3 GHz. Improvements in performance from generation
to generation hinge critically on being able to exploit the parallelism offered by
additional cores. Hence micro-architectural techniques for improving multi-core
performance and software techniques to exploit multi-core systems have gained
prominence.
Multi-core systems comprise a number of physical CPUs executing on a single
processor package. The performance of these systems depends on the ability to
keep as many cores as possible busy. Keeping multiple cores in a single processor
package busy puts tremendous pressure on the heat removal system, and so
temperature problems can seriously limit performance [77, 80]. Hence, employing
an efficient thermal management solution is critical to improving the performance
of a multi-core system. The thermal management of multi-core architectures
presents new challenges and additional opportunities that are not present in
conventional single-core architectures:
• In a multi-core system, depending on the workload being executed, each
individual core can have a different temperature, and the goal of thermal
management is to keep the temperature of the hottest core below the threshold.
Though this might seem obvious, it has large implications for the performance
and complexity of different thermal management schemes.
• The temperature of a specific core depends not only on the workload executing
on that core but also on the workload executing on other cores.
The dependence of the temperature of a core on the workload in other cores is
primarily due to two important phenomena. The first phenomenon is global
heating [101]: as the number of active cores within the same package increases,
the total amount of heat to be removed from the package increases and thereby
the temperature increases. The second phenomenon, called lateral coupling,
refers to the lateral transfer of heat between adjacent cores. Clearly, any
generic technique for thermal management in multi-cores must be designed to take
these effects into account.
DTM techniques for multi-core architectures can be broadly classified into
distributed techniques (which operate at the individual core level) and global
techniques (which operate globally across all the cores on chip). For instance,
dynamic voltage/frequency scaling can be employed in a distributed fashion
(distributed DVFS), where each core can scale its voltage/frequency
independently. Alternatively, a global approach constrains all the cores to scale
their voltage/frequency uniformly and simultaneously (global DVFS).
Multi-core systems also offer the opportunity to manage temperature through
thread migration. Depending on the workload executing on a multi-core system,
there can be substantial variation in temperature among the different physical
cores on the same die. Migration-based DTM techniques exploit this temperature
differential by periodically moving the threads away from the hot cores to the cold cores and




















Figure 5.1: Temperature profiles for a workload on a multi-core system (core 0:
wupwise, core 1: gcc, core 2: art, core 3: crafty). Each panel plots Core 0 to
Core 3. Thread-to-core mapping is not applicable for migration.
Let us consider a heterogeneous workload comprising four SPEC benchmarks
(wupwise, art, gcc and crafty) executing on a 4-core system. Figure 5.1 shows
the temperature profiles of the four cores without DTM and with the previously
proposed DTM techniques. We assume a threshold temperature of 82.5°C. The
temperatures of all four cores remain above the threshold for the entire
duration of execution when no thermal management is deployed (see Figure 5.1(a)).
However, we observe a large variation in temperature between the core executing
the hottest thread (gcc) and the core executing the coldest thread (art).
Global DVFS scales down the operating frequency of all the cores from 3.6 GHz to
2.92 GHz in an effort to lower the temperature of the hottest core (core 1) below
the threshold. Clearly, this scheme is unfair to the cold threads (e.g., art), as
their performance penalties are on par with those of the hot threads. Distributed
DVFS addresses this fairness issue by allowing each core to choose its operating
frequency independently based on its workload. Thus, distributed DVFS selects a
higher operating frequency (3.4 GHz) for the cold thread and a lower operating
frequency (2.86 GHz) for the hot thread, resulting in substantially better
throughput for the overall system.
Clearly, the additional flexibility of being able to manage the temperature of each
core independently results in a significant performance advantage. However, this
advantage of distributed DVFS (which is the only distributed multi-core DTM
in the literature) comes at the cost of escalating design complexity. First, allowing
each core to have its own supply voltage creates multiple voltage islands on-chip,
and suitable mechanisms need to be built in for communication among the voltage
islands. Secondly, voltage regulators have to be provided per core (so that each
core can scale its supply voltage independently), increasing the complexity of the
power delivery network. Finally, as the communication among the cores needs to
be verified for each possible voltage state of each core, the verification complexity
for the chip increases exponentially.
Thread migration can mitigate, to some extent, the performance impact of global
DVFS without the additional hardware complexity of multiple voltage islands [36].
As threads are periodically migrated among the cores, any single core is less likely
to heat up significantly. Global DVFS coupled with thread migration helps
to smooth out the temperature difference across the cores and hence boosts the
operating frequency of the chip. The thermal profile of global DVFS + migration
is shown in Figure 5.1(d). Employing thread migration along with global DVFS
enables the entire chip to operate at 3.14 GHz, as opposed to 2.92 GHz for
global DVFS alone. However, migration has its own limitations. First, migration
does not reduce the total power dissipated in the system; it simply moves the hot
spots around instead of eliminating them. Secondly, the large penalty
associated with switching a task from one core to another constrains the time
scale at which migration can be performed. Finally, migration is not very scalable






















Figure 5.2: Temperature profiles (Core 0 to Core 3) with adaptive DTM for
wupwise, gcc, art and crafty
In this chapter, we propose a novel two-level hybrid DTM scheme for multi-core
systems. We observe that per-core control is quite powerful in achieving high per-
formance as long as it does not add substantially to the design complexity. Clearly,
DVFS is only suitable at a global level for the entire chip. We need a different set
of knobs to control the performance and power per core independently. We ob-
serve that non-DVFS techniques such as fetch gating and architecture adaptations
(runtime reconfiguration of instruction window size and issue width) are easy to
implement at individual core level as they are largely localized and do not cre-
ate complexity in terms of communication among the cores and multiple voltage
islands. At the same time, such mechanisms provide effective response (though
gentler than DVFS) to thermal stress.
Hence, our hybrid scheme combines local non-DVFS techniques (fetch gating and
architecture adaptation) at the per-core level with global DVFS. Each core can
choose an appropriate configuration (fetch gating level and architectural
parameters) to maintain its own temperature. Thus the global operating frequency
need not be lowered to keep the hottest thread below the threshold. This is
illustrated in Figure 5.2, where our hybrid DTM scheme employs non-DVFS techniques
at the individual core level to balance the core temperatures, so that the global
operating frequency is no longer limited by the temperature of the hottest thread.
The major challenge for our hybrid DTM technique is that it should be accompa-
nied by an efficient runtime mechanism that can select the appropriate per-core
settings and the global setting so as to control the temperature below the threshold
while achieving near-optimal performance. We rephrase the thermal management
problem as a configuration search problem and design an efficient software based
mechanism to adapt the local and global thermal management parameters at run-
time depending on the workload. Next, we review related work in power and
thermal management of multi-core architectures.
5.1 Related Work
In this section we review some previously proposed approaches for thermal man-
agement and power management in multi-core systems.
5.1.1 Multi-core Thermal Management
Most of the approaches for thermal management discussed in Chapter 4
are also applicable in the context of multi-core systems. However, one of the major
design choices for multi-core thermal management is whether to employ techniques
in a distributed or a global fashion. We examine the performance
impact of each of these choices in the next section. Moreover, multi-core
systems offer another unique option for reducing temperature: migration. When a
set of heterogeneous threads executes on a multi-core processor, large variations in
temperature can be observed among the different cores. Migration can be used to
balance the temperature of the cores. Of course, migrating a thread away from a
core involves a significant migration penalty, and different migration-based thermal
management techniques differ in the choice of when and how to migrate threads.
Gomaa et al. [36] migrate a thread from its active execution core whenever the
temperature of a specific processor unit hits the threshold. They select the coldest
among all active cores as the target for migration. Lee et al. [63] use the number
of times a specific core hit the threshold in the previous control interval as an
indicator to decide on thread migration. Donald et al. [35] employ a migration
algorithm which compares every possible core pair to swap threads and decides
on the best swaps using both the temperature of the cores and the rate of
change of temperature in the last window. Our experiments have shown that this
is the best performing migration algorithm among the techniques we investigated.
However, the complexity of the scheme is O(N^2), where N is the number of cores,
and this can impede the scalability of the scheme. Another important point to
note is that migration by itself cannot avert all thermal emergencies. It is generally
employed in conjunction with other thermal management schemes such as DVFS.
In contrast to existing migration based thermal management schemes, we propose
a non-migration approach for thermal management. We design a software based
thermal management framework that uses a combination of architecture adapta-
tion at the per core level and global DVFS for thermal management. Our ther-
mal management framework can provide better temperature/performance tradeoffs
than the best performing existing thermal management techniques.
5.1.2 Power Management in Multi-Core Systems
To the best of our knowledge ours is the first work employing a software based
framework and architecture adaptation for thermal management in multi-cores.
Previously a number of software and hardware based approaches have been pro-
posed for power management in multi-core systems. Isci et al. [48] have examined
a number of local and global policies for power management in multi-core systems
and have concluded that for efficient power management, policies that have a global
view of the system are necessary. Meng et al. [61] have proposed a software based
framework that adapts the cache hierarchy in a multi-core system to save power.
Instead of adapting the cache hierarchy for power management, we employ core-
level adaptations in conjunction with voltage scaling for thermal management. Li
et al. [56] examine the tradeoff between the number of active cores and frequency
of operation for parallel workloads. Given a parallel workload, the power budget
can be satisfied by either operating a small number of cores at high frequency or
a larger number of cores at a lower frequency and the optimal performance point
depends on the workload. They design a software based framework that employs
binary search and an analytical model to choose the optimal point for a given
workload.
Unlike the above-mentioned approaches, which examine power/performance tradeoffs
along a single dimension, we employ a combination of multiple techniques
simultaneously for thermal management. In Chapter 3 we examined the advantage of
using multiple adaptations for thermal management over employing a single
technique. Next we present our hybrid thermal management scheme for multi-cores.

























Figure 5.3: Hybrid thermal management architecture (Core 1 to Core 4 sharing a
global frequency). The dotted structures are adaptive
In this section we present a two-level hybrid DTM scheme for thermal management
in multi-core systems. We show that a combination of non-DVFS thermal
management techniques, such as architecture adaptation and fetch throttling,
at the per-core level in conjunction with global DVFS is very effective for thermal
management. Implementing techniques such as architecture adaptation and fetch
throttling at the per-core level is less complex than per-core DVFS, since it does
not involve complexities such as multiple voltage islands. We observe that
this simpler design can provide performance similar to (and in some cases better
than) the more complex per-core DVFS.
In the rest of our discussion we assume a 4-core system, although the solutions we
propose are generic and applicable to multi-core systems with any number of cores.
Each core is an out-of-order core similar to the Alpha 21264 [94] processor. In
our present evaluation we assume that each core has 512 KB of private L2 cache [1].
The methods proposed in this chapter are generic and are also applicable to systems
with a shared L2 cache, such as the Intel Core 2 Duo [6].
5.2.1 Hybrid Thermal Management Architecture
Our architecture, along with the knobs for thermal management, is shown in
Figure 5.3. On each core, three micro-architectural parameters (the issue width,
the window size and the fetch gating level) can be controlled at runtime. We chose
these parameters because they directly impact the temperature of the hotspots
within the processor and are easily configurable at runtime [91]. The issue width
can be scaled down by disabling the appropriate selection tree [12]. The fetch unit
can be controlled by setting the appropriate fetch gating level. When the fetch
gating level is set to T, the fetch unit is deactivated once after every T cycles.
When the gating level is set to 0, the fetch unit is active for all cycles (no fetch
gating). The instruction window has four equal partitions, and each partition can be
enabled/disabled separately [24, 78]. Changing the window size involves a pipeline
flush and resetting the window to the appropriate new size. Since our adaptivity
is coarse grained (the sampling interval is on the order of milliseconds), this
simple reset mechanism involving a pipeline flush has negligible impact on performance.
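The fetch gating rule can be illustrated with a minimal sketch. It assumes that "deactivated once after every T cycles" means T active cycles followed by one gated cycle, repeating with period T + 1; the function names are hypothetical and are not part of the framework.

```python
def fetch_active(cycle: int, gating_level: int) -> bool:
    """Return True if the fetch unit is active in the given 0-indexed cycle.

    Assumption: level T > 0 means T active cycles followed by one gated
    cycle; level 0 disables fetch gating altogether.
    """
    if gating_level == 0:
        return True
    # Within each period of T + 1 cycles, only the last cycle is gated.
    return cycle % (gating_level + 1) != gating_level


def active_fraction(gating_level: int) -> float:
    """Long-run fraction of cycles in which the fetch unit is active."""
    if gating_level == 0:
        return 1.0
    return gating_level / (gating_level + 1)
```

Under this reading, smaller non-zero levels throttle fetch more aggressively: level 1 gates every other cycle, while level 7 gates one cycle in eight.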
An important parameter in runtime adaptivity is the interval at which the
adaptivity is triggered. Since temperature changes slowly (on the order of
milliseconds), we can afford to take adaptivity decisions in software. We assume
that the instruction window size, issue width and fetch gating level can be changed
by using special instructions [45, 91]. In addition, we assume global dynamic
voltage and frequency scaling, i.e., the supply voltage and frequency of all cores
are scaled together.
Given these parameters to control the temperature, the goal of thermal
management is to maximize performance while maintaining the temperature below
the threshold. We formulate the thermal management problem as a configuration
search problem. In the next section we present the thermal management problem
and an overview of our framework for thermal management.
5.3 Problem Formulation and Overview
In this section we formulate the thermal management problem and present our
thermal management framework.
5.3.1 Problem Formulation
The goal of our thermal management framework is to determine the optimal
parameters for each core (issue width, window size and fetch gating level) as well
as the global supply voltage/frequency that maximize performance (in billion
instructions per second, or BIPS) while maintaining the temperature of all the cores
below the threshold. Formally, given N cores, we would like to choose a
configuration $C_i$ for each core ($1 \le i \le N$, $1 \le C_i \le M$), where M is
the total number of configurations per core, and a global operating frequency F such that
• The system performance $\mathrm{Perf} = F \times \sum_{i=1}^{N} \mathrm{IPC}_i$ is maximized, where $\mathrm{IPC}_i$
is the instructions per cycle of core i for configuration $C_i$
• For all cores $1 \le i \le N$, $\mathrm{Temp}_i < T_h$, where $\mathrm{Temp}_i$ is the temperature of
core i and $T_h$ is the threshold temperature.
The configuration Ci for each core consists of three parameters: issue width (IWi),
window size (Wi), and fetch gating level (Ti).
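The objective and the constraint stated above can be written down directly. The following is a minimal sketch; the function names and the choice of GHz and degrees Celsius as units are assumptions made for illustration.

```python
from typing import Sequence


def system_performance_bips(freq_ghz: float, ipcs: Sequence[float]) -> float:
    """Perf = F x sum of per-core IPC; with F in GHz the result is in BIPS."""
    return freq_ghz * sum(ipcs)


def is_thermally_feasible(temps_c: Sequence[float], threshold_c: float) -> bool:
    """Every core must stay strictly below the threshold temperature."""
    return all(t < threshold_c for t in temps_c)
```

For example, four cores with IPCs of 1.0, 1.5, 0.5 and 1.0 running at 3.0 GHz deliver 12 BIPS, and the assignment is admissible only if all four core temperatures are below the threshold.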
There are two main challenges in solving the configuration search problem. First,
given a specific configuration for each core and a global operating frequency, we
need to determine whether this configuration is thermally safe and evaluate its
performance. In our framework, we use a neural classifier to determine if a
configuration is thermally safe and a performance prediction model to evaluate the
performance of a given configuration.
Secondly, the large configuration space and the online nature of our scheme make
an exhaustive search of all configuration points infeasible. For instance, given a
4-core architecture with five possible issue widths, four possible window sizes and
eight possible fetch gating levels at each core (5 × 4 × 8 = 160 configurations per
core) and eight global frequency levels, there is a total of $160^4 \times 8$
configuration points. We devise a three-phase search strategy that efficiently
solves this configuration search problem.
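The size of the search space in this example can be checked with a few lines of arithmetic. The constant names below are illustrative only; the counts are the ones given in the text.

```python
ISSUE_WIDTHS = 5          # possible issue widths per core
WINDOW_SIZES = 4          # possible window sizes per core
FETCH_GATING_LEVELS = 8   # possible fetch gating levels per core
FREQUENCY_LEVELS = 8      # global frequency levels
NUM_CORES = 4

# Each core independently picks one of 5 * 4 * 8 configurations.
configs_per_core = ISSUE_WIDTHS * WINDOW_SIZES * FETCH_GATING_LEVELS

# The cores choose independently, and one frequency is chosen for the chip.
total_points = configs_per_core ** NUM_CORES * FREQUENCY_LEVELS

print(configs_per_core)  # 160
print(total_points)      # 5242880000
```

With roughly 5.2 billion points, evaluating even a small fraction of them within a sampling interval of a few milliseconds is impractical, which motivates the phased search.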











































Figure 5.4: Overview of our thermal management framework
Figure 5.4 presents the structure of our hybrid local-global thermal management
framework. During execution, each core periodically (every 0.1 ms) samples the
on-chip thermal sensors. In addition, it collects the performance counter values and
the instruction signatures once every 3 ms. The instruction signatures are used
to determine if there is a phase change [34] in the application executing on the
core. A phase change triggers a configuration search process to fit the
architecture to the new workload. Note that a local phase change in the
workload of processor Pi might result in a new configuration for a processor Pj (i ≠ j)
that did not encounter a phase change itself.
The central component of our framework is an efficient multi-phase configuration
search strategy. As mentioned earlier, the configuration space is too large to be
searched at runtime. On the other hand, we cannot cleanly partition the search
space at core boundaries, as the frequency has to be selected globally and the
configuration selected for a core impacts the temperature of the neighboring cores
due to lateral transfer of heat. Our search strategy breaks this cyclic
dependency by employing multiple phases: (1) a local phase that suggests the optimal
configuration at every global frequency level based on minimal global information, (2)
a global phase that selects the global operating frequency, and (3) another
local phase that concludes the search by selecting appropriate local configurations
armed with more global information.
The first search phase proceeds locally (i.e., independently) for all the cores that
detect phase changes in their workload. It suggests for each core i the optimal safe
configuration $C_i^f$ (highest throughput without exceeding the temperature threshold)
at every frequency level f, together with the expected instructions per cycle (IPC)
for this configuration. This phase uses an estimate of lateral coupling, as only
minimal global information is known at this point. Our local search algorithm is a
modified version of binary search and examines a very small portion of the entire
search space (presented in Section 5.4). It employs a neural classifier to determine
if a configuration is thermally safe for a workload. As we predict the thermal safety
of a configuration, we must provide a failsafe mechanism for the rare
mispredictions that may result in the system exceeding the thermal threshold. Thus,
our technique is backed up by global clock gating, where all the cores are made
inactive (by hardware) for one sampling interval of the thermal sensors (0.1 ms)
when the temperature of any core hits the threshold.
The second phase is global in nature. It takes in the results from the local con-
figuration searches as input and determines the global operating frequency F that
maximizes throughput. In addition, it estimates the core coupling factor. The
temperature of a core depends on configuration and workload of all the cores in
the system. The core coupling factor reflects the operating conditions of the other
cores in the system. We define the core coupling factor in Section 5.4.2.
Now that more accurate lateral coupling information is available, we need one final
local phase for each core to check the thermal safety and optimality of the suggested
configuration at frequency level F . If the previously suggested configuration (with
limited global information) is either unsafe (more thermal impact from neighbors)
or sub-optimal (less thermal impact from neighbors allowing higher performance
configuration), the core can choose an appropriate safe and optimal configuration
Ci.
Once all three phases are over, the global operating frequency is set to F and
the reconfigurable components of each core are set according to the selected
parameters.
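The interaction between the three phases can be sketched as follows. This is an illustrative skeleton only: the callables `is_safe`, `ipc_of` and `coupling_of` are hypothetical stand-ins for the neural classifier, the performance model and the core coupling factor, and the local search is a plain linear scan rather than the modified binary search used in our framework.

```python
from typing import Callable, Dict, Optional, Sequence, Tuple

# Hypothetical stand-ins (names and signatures are assumptions):
#   is_safe(core, config, freq, coupling) -> bool   role of the neural classifier
#   ipc_of(core, config) -> float                   role of the performance model
#   coupling_of(core, chosen) -> float              role of the core coupling factor
SafeFn = Callable[[int, int, float, float], bool]
IpcFn = Callable[[int, int], float]
CouplingFn = Callable[[int, Dict[int, int]], float]


def best_safe_config(core: int, freq: float, coupling: float,
                     configs: Sequence[int], is_safe: SafeFn,
                     ipc_of: IpcFn) -> Optional[int]:
    """Highest-IPC configuration predicted safe, or None if none is safe."""
    safe = [c for c in configs if is_safe(core, c, freq, coupling)]
    return max(safe, key=lambda c: ipc_of(core, c)) if safe else None


def three_phase_search(cores: Sequence[int], freqs: Sequence[float],
                       configs: Sequence[int], is_safe: SafeFn, ipc_of: IpcFn,
                       coupling_estimate: float, coupling_of: CouplingFn
                       ) -> Optional[Tuple[float, Dict[int, Optional[int]]]]:
    # Phase 1 (local): per core and per frequency, using only a coupling estimate.
    suggested = {core: {f: best_safe_config(core, f, coupling_estimate,
                                            configs, is_safe, ipc_of)
                        for f in freqs}
                 for core in cores}

    # Phase 2 (global): choose F that maximizes F * sum(IPC_i).
    best_freq, best_perf = None, float("-inf")
    for f in freqs:
        chosen = [suggested[core][f] for core in cores]
        if any(cfg is None for cfg in chosen):
            continue  # some core has no safe configuration at this frequency
        perf = f * sum(ipc_of(core, cfg) for core, cfg in zip(cores, chosen))
        if perf > best_perf:
            best_freq, best_perf = f, perf
    if best_freq is None:
        return None

    # Phase 3 (local): re-check each core at F with the refined coupling value.
    tentative = {core: suggested[core][best_freq] for core in cores}
    final = {core: best_safe_config(core, best_freq,
                                    coupling_of(core, tentative),
                                    configs, is_safe, ipc_of)
             for core in cores}
    # A None entry here means no safe configuration was found for that core;
    # in the framework, hardware clock gating remains the failsafe.
    return best_freq, final
```

The sketch shows why the final local phase is needed: a configuration that looked safe under the initial coupling estimate may be rejected once the configurations of the neighboring cores are known.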
5.4 Local Configuration Search
The local configuration search takes the performance counters from the local core
and the core coupling factor as input and determines the optimal operating config-
uration as well as the corresponding IPC for each frequency level. We first present















Figure 5.5: Overview of local config search
An overview of the local configuration search is presented in Figure 5.5. It has
two main components: the configuration search algorithm and the classifier.
The highest performing configuration that is feasible at a given frequency depends
on the workload running on a specific core. Clearly, determining this configuration
requires us to evaluate the IPC and the thermal safety of the configuration. We use
an analytical performance model to quickly estimate the IPC for a {configuration,
workload} pair using the sampled performance counters as input. We use a neural
classifier to predict if a given {configuration, workload, frequency} tuple is
thermally safe. As in Chapter 4, examining every point in the search space to
determine the optimal configuration point is computationally expensive. We
convert a four-dimensional search space into a two-dimensional search space to
reduce the search complexity. Next we present our neural classifier.












































Figure 5.6: Neural network classifier (input layer, hidden layer, output layer)
In a multi-core system, the temperature of a core depends both on the power
consumption of the specific core and on the power consumption of the other cores
in the system. Hence our classifier takes both local and global
parameters as inputs. An example classifier running on core 0 (C0) of a four-core
system is shown in Figure 5.6. The top six inputs correspond to the parameters
provided by the local search. These include the number of instructions of different
classes issued per cycle, the configuration and the frequency of operation. We
reduce the configuration space consisting of three variables (issue width, window
size and throttling level) to a single variable based on observations from the
performance model. The instruction classes are chosen based on the observation that
these parameters strongly reflect the utilization of the execution units and the
branch predictor, which are the hottest units in a core.
The classifier also includes parameters that capture the power consumption of the
other cores in the system, to account for lateral coupling and global heating
effects [101]. In our setup, the power consumption of a core depends both on the
configuration of the core and on the workload on the core. We summarize the power
consumption of the remote cores by the total instructions issued per cycle on the
remote cores. Previous work has shown that there is a reasonably strong correlation
between core power consumption and instructions per cycle [87]. As discussed earlier,
the global part of the classifier (the core coupling factor) is computed by the
global configuration search routine. We next illustrate the working of the classifier
and show how the global configuration routine needs to transfer only a single value
to each of the local configuration search routines.
Working of Classifier
We use a back propagation network with a single hidden layer containing one neuron,
as shown in Figure 5.6. We chose this configuration as it provides
a good tradeoff between classification accuracy and complexity. The operation of
our classifier is as follows. The input layer performs the following computation:

$\mathrm{Out}_{L1} = w_1 I_1 + w_2 I_2 + \dots + w_{12} I_{12} + b_1$

$CF = w_7 I_7 + w_8 I_8 + \dots + w_{12} I_{12}$
$\quad = w_{c01}\,\mathrm{config}_1 + w_{t01}\,\mathrm{tipc}_1 + w_{c02}\,\mathrm{config}_2 + w_{t02}\,\mathrm{tipc}_2 + w_{c03}\,\mathrm{config}_3 + w_{t03}\,\mathrm{tipc}_3$

$\mathrm{Out}_{L1} = w_1 I_1 + w_2 I_2 + \dots + w_6 I_6 + CF + b_1$

where $\mathrm{Out}_{L1}$ is the output of the first layer (input layer) of the neural classifier,
$w_1, \dots, w_{12}$ are the weights in the neural classifier,
$I_1, \dots, I_6$ are the inputs corresponding to the local workload and configuration
parameters,
$I_7, \dots, I_{12}$ are the inputs corresponding to the workload and configuration of the other
cores in the system, and
$CF$ is the core coupling factor computed by the global configuration routine and
transferred to the local configuration routine.
The hidden layer applies the sigmoid function and the output layer applies the
threshold function, as shown in Figure 5.6. The final output of the classifier is a
binary value which predicts whether the combination of core configurations, workload
and operating frequency is thermally safe or unsafe. Neural classifiers must be
trained with training examples which comprise example inputs and outputs [8].
In our setup, the training is performed off-line during system installation or when
the operating conditions (thermal sinks and others) change. Next we present our
training process.
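The split of the first-layer sum into a local part and the core coupling factor can be sketched as below. The weights, the output-layer threshold of 0.5 and the convention that a high output means "unsafe" are assumptions made for illustration; they are not the trained values or the exact output encoding of our classifier.

```python
import math
from typing import Sequence


def core_coupling_factor(remote_inputs: Sequence[float],
                         remote_weights: Sequence[float]) -> float:
    """CF = w7*I7 + ... + w12*I12, computed once by the global routine."""
    return sum(w * x for w, x in zip(remote_weights, remote_inputs))


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def predict_unsafe(local_inputs: Sequence[float],
                   local_weights: Sequence[float],
                   coupling_factor: float,
                   bias: float,
                   threshold: float = 0.5) -> bool:
    """Out_L1 = w1*I1 + ... + w6*I6 + CF + b1, then sigmoid, then threshold.

    Assumption: an activation at or above `threshold` is read as "unsafe".
    """
    out_l1 = sum(w * x for w, x in zip(local_weights, local_inputs))
    out_l1 += coupling_factor + bias
    return sigmoid(out_l1) >= threshold
```

Because the remote terms enter only through their weighted sum, the global routine can send a single number per core instead of the six remote inputs, and the local search can then evaluate many candidate configurations against that fixed value.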
Classifier Training
Neural classifiers must be trained with a training set comprising typical input
and output pairs. For good classification accuracy, the training set must comprise
a set of points resembling the possible inputs during the operation of the classifier.
In our setup, the input space consists of the different possible configurations of the
cores and the workloads running on them. We create the training set by running
a set of micro-benchmarks on the system and determining if the corresponding
execution of the micro-benchmark hits the threshold temperature. Each
micro-benchmark comprises a loop body with 100 randomly generated instructions.
The instructions are chosen from the input instruction classes, and the random
selection of instructions is skewed to represent the instruction mix found in real
benchmarks. We generate thirty such micro-benchmarks. We then generate 3000
different training examples, each involving a randomly generated workload mix of
micro-benchmarks and randomly chosen core configurations. Each of these 3000
examples is executed on the machine, and we observe whether the execution hits the
threshold. After generating the training set, we use the Levenberg-Marquardt [8]
training algorithm for training the classifier. The training algorithm is an iterative
process that adjusts the weights and bias values in the neural network to minimize
the number of misclassifications in the training set. Next we present the accuracy
of our neural classifier.
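The generation of the micro-benchmarks can be sketched as follows. The instruction class names and the mix weights are hypothetical placeholders (the text only states that the selection is skewed towards the mix of real benchmarks); the loop length of 100 and the suite size of 30 follow the description above.

```python
import random
from typing import List

# Hypothetical instruction classes and mix weights, for illustration only.
INSTRUCTION_CLASSES = ["integer", "floating_point", "branch", "load_store"]
INSTRUCTION_MIX = [0.45, 0.15, 0.15, 0.25]


def make_microbenchmark(rng: random.Random, length: int = 100) -> List[str]:
    """Loop body of `length` instruction classes drawn with a skewed mix."""
    return rng.choices(INSTRUCTION_CLASSES, weights=INSTRUCTION_MIX, k=length)


def make_suite(seed: int = 0, count: int = 30) -> List[List[str]]:
    """Generate `count` micro-benchmarks from one seeded generator."""
    rng = random.Random(seed)
    return [make_microbenchmark(rng) for _ in range(count)]
```

Each training example would then pair a random assignment of such micro-benchmarks to cores with randomly chosen core configurations, labeled by whether the run reached the threshold temperature.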























Figure 5.7: Accuracy of neural network classifier
We evaluate the accuracy of the neural classifier by testing it on the workload mixes
of SPEC 2000 benchmarks presented in Table 5.1. For each workload, we execute the
workload with 10 randomly chosen configurations and check if the corresponding
execution hits the threshold or not. Using the data gained from the corresponding
execution as input, we employ the neural classifier to predict if the execution is
thermally safe and compare the results of the simulation and the neural network.
The results are presented in Figure 5.7. Two types of mispredictions are
possible: false positives, where the classifier predicts a thermally safe
configuration as unsafe, and false negatives, where the classifier predicts a
thermally unsafe configuration as safe. It can be clearly seen that our neural
classifier is accurate (less than 5% misprediction in all cases) and in general
makes more false positive mispredictions than false negative mispredictions. In the
next section we present the configuration search process, which encompasses a
performance prediction model.
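The two misprediction rates can be tallied as in the following minimal sketch, which uses the convention of this section (positive means "predicted unsafe"). The function name and the boolean encoding are assumptions.

```python
from typing import Sequence, Tuple


def misprediction_rates(actual_unsafe: Sequence[bool],
                        predicted_unsafe: Sequence[bool]) -> Tuple[float, float]:
    """Return (false_positive_rate, false_negative_rate) over all runs.

    False positive: a safe run predicted as unsafe (costs performance).
    False negative: an unsafe run predicted as safe (risks overheating).
    """
    if len(actual_unsafe) != len(predicted_unsafe) or not actual_unsafe:
        raise ValueError("inputs must be non-empty and of equal length")
    pairs = list(zip(actual_unsafe, predicted_unsafe))
    false_pos = sum(1 for actual, pred in pairs if not actual and pred)
    false_neg = sum(1 for actual, pred in pairs if actual and not pred)
    return false_pos / len(pairs), false_neg / len(pairs)
```

A bias towards false positives is the less harmful direction: it gives up some performance, whereas a false negative has to be caught by the hardware clock gating failsafe.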
5.4.3 Configuration Search Algorithm
The configuration search algorithm examines points in the frequency, architecture
adaptation space to determine the best performing thermally feasible configura-
tion for each frequency level. It encompasses a performance model to predict the
performance at a given configuration. Our search strategy is based on insights
from the performance model. Next we present the performance model.
Performance Prediction Model
The performance prediction model is identical to the performance model presented
in Section 4.4. Our analytical model does not explicitly model multi-core effects,
such as the impact of shared caches, cache contention and shared memory bandwidth,
on the IPC (instructions per cycle) of each core. This is because our performance
model is a first-order approximation that captures the impact of the workload (miss
events and data dependence) on IPC. Contention for shared resources such as caches
would first affect the number of miss events observed on each core, which in turn
would affect the IPC of the core. Since our performance prediction model captures
the impact of misses on IPC, such contention is eventually reflected in our model.
Our model is used for online performance estimation and tuning rather than offline
design space exploration. We therefore chose to be less rigorous and detailed than
analytical performance prediction methodologies for multi-core systems [16].




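$\mathrm{IPC}(C) = \dfrac{1}{\dfrac{1}{\mathrm{IPC}_{steady}(C)} + \mathrm{CPI}_{miss}}$ (5.2)
$\mathrm{IPC}_{steady}(C) = \eta \times \min\left(\dfrac{T'}{T'+1} \ldots\right)$ (5.3)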




IPCsteady(C) is the steady IPC at configuration C and η is an application spe-
cific constant given by equation 4.7. Next we discuss how the model is used for
performance estimation.
Performance Estimation
Inverting Equation 5.3, given a particular value of steady IPC (sipc), the configu-















In other words, a particular configuration point consisting of three values can be summarized by a single value, the steady IPC (sipc). Our model takes the performance counters sampled at a specific configuration (steady IPC) as input and evaluates the IPC and the total instructions per cycle at a different configuration. The performance counters taken by the model as input are the following:
1. Number of committed instructions Ninst(C)
2. Number of cycles in the interval Cycles
3. Number of committed instructions of type X: NXinst(C), where X can be integer, floating point, branch, or load/store
4. Instruction cache misses ICmiss, data cache misses DCmiss, and branch mispredictions Brmiss
Given a specific configuration defined by steady IPC sipc and the corresponding
parameters defined by Equation 5.4, the steady IPC at the particular configuration





The CPImiss can be computed from the sampled miss events (performance counters) using Equation 4.3. As in Chapter 4, we also make slight modifications to the model to compute the total instructions (correct + wrong path) issued per cycle. Once this is known, we use the measured instruction ratios from the performance counters to determine the instructions issued per cycle for each instruction class (the input to the classifier). Next we present our configuration search algorithm.
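The last step above, splitting a predicted total issue rate across instruction classes, can be sketched as follows. This is only an illustration: the function name, the dictionary layout and the counter values are hypothetical, and the sketch assumes that the class mix measured at the sampled configuration also holds at the configuration being evaluated.

```python
def per_class_issue_rates(total_ipc, class_counts):
    """Split a predicted total issue rate across instruction classes.

    total_ipc    -- predicted total (correct + wrong path) instructions per cycle
    class_counts -- committed-instruction counters per class (hypothetical layout)
    Assumption: the measured class ratios carry over to the target configuration.
    """
    n_inst = sum(class_counts.values())
    if n_inst == 0:
        return {name: 0.0 for name in class_counts}
    return {name: total_ipc * count / n_inst
            for name, count in class_counts.items()}


# Hypothetical counter sample for one interval (not measured data).
counts = {"int": 500_000, "fp": 100_000, "branch": 150_000, "load_store": 250_000}
rates = per_class_issue_rates(2.0, counts)
```

The per-class rates always sum to the predicted total issue rate, so the classifier inputs stay consistent with the performance model output.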
Configuration Search Process
The configuration search process proceeds along two axes: the operating frequency and the steady IPC. For each (frequency, steady IPC) pair it first uses the performance prediction model to determine the IPC and the total instructions per cycle. The total instructions per cycle, the IPC and the core coupling factor (from the global routine) are given as input to the neural classifier, which determines whether the (frequency, steady IPC) pair is thermally safe. Using this process, the highest performing thermally safe configuration for each operating frequency level can be determined. However, the search can be optimized further based on two observations. First, any increase in steady IPC (a wider configuration) results in an increase in performance at an increased temperature. Hence the problem of finding the highest performing steady IPC for a given frequency is the same as finding the highest feasible steady IPC for that frequency. Second, if for a given frequency F a steady IPC S1 is infeasible, then all steady IPC values higher than S1 are also infeasible. Exploiting these properties, our configuration search proceeds as a binary search along the steady IPC axis. It first examines the midpoint of the steady IPC space. If this point is feasible, it proceeds in the upper half of the search space; otherwise it proceeds in the lower half. Given F frequency levels and S steady IPC levels, the total number of points examined is at most F × ⌈log2(S)⌉. In our experiments, we have eight frequency levels between 3 GHz and 2.14 GHz and nine steady IPC levels between 2 and 6, resulting in a maximum of 8 × 4 = 32 search points.
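The binary search described above can be sketched as follows. This is a minimal illustration, not the thesis implementation: the evenly spaced steady IPC and frequency levels are assumed, and `toy_classifier` is a hypothetical stand-in for the performance model plus neural classifier.

```python
def highest_feasible_index(levels, is_safe):
    """Binary search for the highest thermally safe level.

    Relies on monotonicity: if a level is unsafe, every higher level is too.
    Returns (index or None, number of points probed).
    """
    lo, hi, best, probes = 0, len(levels) - 1, None, 0
    while lo <= hi:
        mid = (lo + hi) // 2
        probes += 1
        if is_safe(levels[mid]):
            best = mid          # feasible: continue in the upper half
            lo = mid + 1
        else:
            hi = mid - 1        # infeasible: continue in the lower half
    return best, probes


def search_all(freqs, sipc_levels, classifier):
    """Build the per-frequency table of highest feasible steady IPC."""
    table, total_probes = {}, 0
    for f in freqs:
        best, probes = highest_feasible_index(
            sipc_levels, lambda s: classifier(f, s))
        total_probes += probes
        table[f] = None if best is None else sipc_levels[best]
    return table, total_probes


# Assumed level spacing: nine steady IPC levels in [2, 6], eight frequencies
# in [2.14, 3.0] GHz. The classifier below is a made-up thermal budget.
sipc_levels = [2.0 + 0.5 * i for i in range(9)]
freqs = [3.0 - i * (3.0 - 2.14) / 7 for i in range(8)]
toy_classifier = lambda f, s: f * s <= 12.0

table, probes = search_all(freqs, sipc_levels, toy_classifier)
_, worst_case = search_all(freqs, sipc_levels, lambda f, s: True)
```

With nine levels the loop probes at most four points per frequency, which gives the 32-point worst case for eight frequency levels.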
5.4.4 Overhead of the Algorithm
As discussed earlier, our local configuration search routine evaluates 32 search points in the worst case. Each search point involves performance evaluation using the performance model and an invocation of the neural classifier to determine thermal safety. We implemented the search algorithm using a number of key optimizations such as fast exponentiation for the classifier, loop unrolling, and relational operators instead of branches. Profiling the routine shows that it takes about 8000 cycles in the worst case. Since our search routine is triggered by coarse-grained phase detection, it is invoked very infrequently. Even if the routine were invoked once every phase detection sample (one million cycles), it would represent an overhead of 0.8%. In practice the overhead is less than 0.1%. Next we present our global configuration routine.
5.5 Global Configuration Routine
The global configuration routine is responsible for determining the global operating frequency and the core coupling factors for the cores. We first describe the input to the global configuration routine and then discuss how it chooses the operating frequency and computes the core coupling factor. At the end of the section we present a hierarchical design which can greatly improve the scalability of our approach.
5.5.1 Inputs
The input to the global configuration routine is a set of tables, one from each of the local search routines. The table for each core is indexed by operating frequency level; for each frequency level, the entry contains the highest performing configuration (steady IPC), the IPC at this configuration and the total instructions (correct + wrong path) issued in this configuration. These are the outputs of the local configuration search routine running on each core. Next we show how the global configuration routine determines the frequency.
5.5.2 Operating Frequency
The aim of the framework is to maximize the performance, which is defined by P = F \times \sum_{i=1}^{N} IPC_i, where F is the operating frequency, N is the number of cores and IPC_i is the IPC of the i-th core. The global configuration routine makes a linear scan of the frequency levels in all tables and determines the performance for each frequency level (the maximum possible IPC for a core at a particular frequency level is recorded in the table). It then chooses the frequency level with the maximum performance. Note that the global configuration routine can be easily modified to maximize other metrics such as weighted speedup, harmonic speedup and others. We discuss another variation which optimizes the weighted speedup in Section 7.4.
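The linear scan over frequency levels can be sketched as follows. The table layout (one dictionary per core mapping frequency to the best predicted IPC) and the numbers are hypothetical and serve only to illustrate the selection of the frequency that maximizes P.

```python
def choose_global_frequency(tables):
    """Pick the frequency maximizing P = F * sum_i IPC_i.

    tables -- one dict per core, mapping frequency (GHz) to the best
              predicted IPC reported by that core's local search
              (hypothetical layout).
    Returns (frequency, performance).
    """
    common = set(tables[0])
    for table in tables[1:]:
        common &= set(table)          # only levels reported by every core
    best_f, best_p = None, float("-inf")
    for f in sorted(common):
        p = f * sum(table[f] for table in tables)
        if p > best_p:
            best_f, best_p = f, p
    return best_f, best_p


# Made-up tables for four cores and two frequency levels.
tables = [
    {3.0: 1.0, 2.5: 1.5},
    {3.0: 2.0, 2.5: 2.2},
    {3.0: 0.8, 2.5: 1.2},
    {3.0: 1.2, 2.5: 1.6},
]
f_opt, p_opt = choose_global_frequency(tables)
```

In this made-up example the lower frequency wins because the wider configurations it permits raise the summed IPC by more than the clock is lowered.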
5.5.3 Core Coupling Factor
In a multi-core system, the temperature of a core depends both on the workload running on that core and on the workloads running on the other cores in the system. Our neural network classifier takes this into account using the core coupling factor as input (see Section 5.4.2). The coupling factor of each core is a weighted sum of the configurations and the total instructions per cycle of the other cores in the system. For example, the coupling factor for core 0 is
CF_0 = wc_{01} config_1 + wt_{01} tipc_1 + wc_{02} config_2 + wt_{02} tipc_2 + wc_{03} config_3 + wt_{03} tipc_3    (5.6)
The weights in the coupling factor are the classifier weights determined during the off-line training (Section 5.4.2). Once the optimal frequency is calculated, the global configuration routine determines the coupling factor for all the cores using the configuration and the total instructions per cycle from the tables.
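Equation 5.6 generalizes to any core as a sum over all other cores. The sketch below illustrates this; the weight matrices and the configuration and tipc values are made up, whereas in the framework the weights come from off-line classifier training.

```python
def coupling_factor(core, configs, tipcs, wc, wt):
    """Coupling factor of `core`: weighted sum over all other cores j of
    wc[core][j] * configs[j] + wt[core][j] * tipcs[j] (cf. Equation 5.6)."""
    return sum(wc[core][j] * configs[j] + wt[core][j] * tipcs[j]
               for j in range(len(configs)) if j != core)


# Hypothetical values for a four-core system.
configs = [4.0, 3.0, 5.0, 2.0]        # configuration summarized as steady IPC
tipcs = [2.0, 1.0, 3.0, 1.0]          # total instructions per cycle
wc = [[0.1] * 4 for _ in range(4)]    # made-up configuration weights
wt = [[0.2] * 4 for _ in range(4)]    # made-up tipc weights

cf0 = coupling_factor(0, configs, tipcs, wc, wt)
```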
5.5.4 Final Configurations
The computed optimal frequency Fopt and the core coupling factors are communicated to the individual cores. Each core then re-runs the local configuration search only for the optimal frequency Fopt with the new coupling factor to determine the configuration for execution.
As can be seen from the discussion, the coupling factor depends on the present configuration of the cores. The configuration chosen at each core in turn uses the coupling factor computed earlier (the local configuration search uses the coupling factor to determine the optimal configuration). It is possible to run this loop multiple times with an initial set of configurations to reach a stable point at which the configurations do not change. However, in our implementation we perform only a single pass to keep the implementation simple. We find that this does not affect our scheme significantly, since the temperature of each core is more sensitive to the local workload and configuration than to the workload/configuration of other cores.
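The difference between the single pass and iterating to a stable point can be sketched as follows. The two toy functions are hypothetical stand-ins for the coupling factor computation and the local configuration search; only the control flow is meant to be illustrative.

```python
def settle_configurations(initial, local_search, coupling, max_passes=1):
    """Alternate coupling-factor computation and local search.

    Returns (configs, passes_run, converged). With max_passes=1 this is the
    single-pass variant; larger values iterate until nothing changes.
    """
    configs = list(initial)
    for passes in range(1, max_passes + 1):
        factors = [coupling(i, configs) for i in range(len(configs))]
        updated = [local_search(i, factors[i]) for i in range(len(configs))]
        if updated == configs:
            return configs, passes, True
        configs = updated
    return configs, max_passes, False


# Toy stand-ins (hypothetical): a core narrows its configuration as the
# coupling from the other cores grows.
def toy_coupling(i, configs):
    return 0.1 * sum(c for j, c in enumerate(configs) if j != i)


def toy_local_search(i, factor):
    return max(2, 6 - int(factor))


single = settle_configurations([6, 6, 6, 6], toy_local_search, toy_coupling)
stable = settle_configurations([6, 6, 6, 6], toy_local_search, toy_coupling,
                               max_passes=5)
```

In this toy setting the single pass already lands on the configuration that the iterated version confirms as stable one pass later, which mirrors the weak dependence on remote cores noted above.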
5.5.5 Overheads and Scalability
Our algorithm performs at most O(N^2 + F) multiplications and O(NF + N^2) additions, where N is the number of cores and F is the number of distinct frequency levels. We implemented and profiled our algorithm and determined the worst-case overhead to be 1215 cycles for a four-core system with eight frequency levels. We assume this worst-case overhead for each invocation of our algorithm in our results.
Our algorithm is hierarchical and hence can scale with the number of cores. In fact we find that as the number of cores in the system increases, the temperature of a specific core is less likely to be affected by remote cores in the floorplan. Hence the algorithm can be extended to a multi-level hierarchy with multiple levels of global control. In the next section we present the experimental settings and results.
5.6 Experimental Settings and Results























Figure 5.8: Overview of multi-core simulation
5.6.1 Simulation Flow
We use a modified multi-core version of the SimpleScalar [23] Alpha toolkit for our experimental evaluation and HotSpot-3.0 for thermal modeling. An overview of our simulation methodology is shown in Figure 5.8. We model a four-core architecture and simulate each of the pipeline stages in detail using a cycle-accurate SimpleScalar model. Our pipeline model is augmented with power models from Wattch [22], which calculate the power dissipated every cycle. Once every sampling interval, the average power consumed during the interval is sent to the thermal modeling routine, which uses the thermal modeling libraries from HotSpot-3.0 to calculate the temperature of each micro-architectural unit within each core. The temperature values are given as input to the DTM module, which uses them to guide thermal management decisions (frequency scaling, clock gating and others); these decisions are in turn provided as input to the core pipeline models. The input to the thermal management module is the chip floorplan and some thermal modeling parameters such as the ambient temperature and heat sink parameters. We use a four-core floorplan similar to previous work [35, 36], which contains four identical cores; the rest of the die area is used by the L2 cache. For thermal modeling we assume an ambient temperature of 45°C, a heat sink convection resistance of 1.0°C/W and a maximum tolerable temperature of 85°C [89]. After adjusting for placement and sensor errors, we get a threshold of 82.5°C [89]. The micro-architectural parameters of our baseline processor are given in Table 3.2. We model an out-of-order superscalar processor with an issue width of 6 instructions per cycle, a 128-entry active list (reorder buffer), and a 64-entry issue window. As mentioned earlier, our adaptive architecture has four possible window sizes (16, 32, 48, 64), five possible issue widths (2–6), and eight fetch gating levels. The performance and power models for each core use the same settings as discussed in Chapter 3 (Section 3.2.2).
5.6.2 Benchmarks
       Heterogenous                             Homogenous
Label  Benchmarks                        Label  Benchmarks
T1     bzip2, parser, wupwise, gcc       H1     gcc, gcc, gcc, gcc
T2     crafty, mesa, gcc, vortex         H2     bzip2, bzip2, bzip2, bzip2
T3     bzip2, gzip, equake, vortex       H3     vortex, vortex, vortex, vortex
T4     eon, gcc, art, wupwise            H4     art, art, art, art
T5     gzip, vortex, art, parser         H5     eon, eon, eon, eon
T6     fma3d, parser, facerec, bzip2     H6     fma3d, fma3d, fma3d, fma3d
T7     wupwise, gcc, art, crafty         H7     mesa, mesa, mesa, mesa
T8     parser, gzip, bzip2, gcc          H8     wupwise, wupwise, wupwise, wupwise
T9     vortex, bzip2, eon, equake
T10    mesa, vortex, facerec, parser
Table 5.1: Workloads used for evaluation.
We use multi-programmed workloads consisting of homogenous and heterogenous threads to evaluate our scheme. The workloads are constructed using fifteen SPEC 2000 benchmarks; for each of these benchmarks, we fast forward to the first simulation point specified by [76] and then perform detailed simulation. Our simulation includes an architectural warmup and a thermal warmup [89], after which statistics are collected. The workload mix is shown in Table 5.1. We use a set of heterogenous workloads, each comprising four independent SPEC benchmarks, and also a configuration similar to the SPEC rate MP evaluation suite [10] where multiple copies of the same benchmark are executed. This is representative of the homogenous threads found in data parallel benchmarks. We evaluate this configuration as it is commonly used to evaluate and compare the performance of modern high performance multi-core systems.
The heterogenous workload mixes were chosen to include programs with high ILP and low ILP in the same set. Migration based approaches are very effective in this scenario, so these mixes give a good idea of how our method compares against previously proposed migration based approaches.
5.6.3 DTM Techniques
We compare our technique (and a simpler variation of our technique) with three other DVFS schemes. The DTM schemes we model are as follows.
Global DVFS: The frequency/voltage of all cores is scaled in response to thermal stress on any of the cores. We use feedback control (PI controller) to set the appropriate frequency level for a given thermal stress [35].
Distributed DVFS: Each core can choose its own frequency/voltage setting. We use feedback control (distributed PI controller) to set the appropriate frequency level for a given thermal stress [35].
Global DVFS + Migration: We compared a number of approaches for migration to avert thermal stress [35, 36, 63] in conjunction with global DVFS. Among these we use the multi-loop method that combines an inner feedback loop for global DVFS with an outer feedback loop for migration, as we observed it to be the best performing of the migration techniques investigated. We assume a 100µs overhead for migration [35].
Adaptive DTM: This is the proposed technique, which uses global voltage scaling and architecture adaptivity locally.
DVFS + Throttling: This is a variation of our approach where we engage global voltage scaling and fetch throttling locally to control the temperature. This mechanism is suitable when per-core adaptivity cannot be supported and a simpler technique needs to be employed at the core level. Next we compare the throughput of the different approaches.
5.6.4 Throughput of Different DTM schemes
The throughput of the different schemes is shown in Figure 5.9 for the heterogenous task sets and in Figure 5.10 for the homogenous workloads. We present the results for the two cases separately.
Heterogenous Task Sets
When a set of heterogenous tasks executes on a multi-core system, the difference in the properties of the tasks results in a difference in temperature between the cores. In the case of global DVFS, the performance of the entire system is limited by the temperature of the hottest core (task). Distributed DVFS provides the flexibility to scale the frequency of each core separately, and hence the performance of the system is not limited by the hottest task. From Figure 5.9 it can be seen that Distributed DVFS results in 18% higher throughput than Global DVFS. Global DVFS + Migration employs migration to balance the temperature between the cores and helps to partially bridge the performance difference between global DVFS and distributed DVFS. On average, migration boosts the performance of global DVFS by 7.25%.
Our proposed techniques, Global DVFS + throttling and Global DVFS + adaptive,
employ DVFS globally across all cores and simpler, easy-to-implement techniques such as
throttling and adaptivity at the core level. Global DVFS + throttling performs comparably
with distributed DVFS (4% slower) and better than global DVFS + migration
(3%) on average. If adaptive hardware is employed at the per-core level with global
DVFS, it can outperform the more complex distributed DVFS, resulting in 5%
better throughput on average. It results in 12% better throughput than global
DVFS+ migration. The results show that there is significant performance advan-

















Figure 5.9: Throughput of different DTM schemes for heterogenous workloads (series: Dist DVFS, Global DVFS, Global DVFS+Migration, Adaptive, Global DVFS+Fetch Gating)
Homogenous Task Sets
We simulate a homogenous task set by executing the same benchmark on all four cores; this is a commonly used configuration for benchmarking multi-core designs in the SPEC suite [10]. Homogenous threads are also commonly found in data parallel benchmarks. Figure 5.10 presents the throughput of all DTM schemes for the homogenous workloads. When a set of homogenous tasks executes on the system, the temperature difference between the cores is not very significant. Therefore the flexibility of being able to change the frequency of each core independently (distributed DVFS) gives only a very small benefit over global DVFS. For the same reason, there are very few opportunities to migrate threads for thermal balancing, so migration + global DVFS also provides a throughput similar to global DVFS.
For homogenous workloads both our proposed techniques (Global DVFS + Throttling and Global DVFS + adaptive) outperform both global DVFS and distributed DVFS. Adaptive DTM performs 21% better than distributed DVFS, while Global DVFS + Throttling performs 9% better. When the hardware is adaptive, the maximum operating frequency under a temperature constraint depends on both the hardware configuration of the cores and the workload. For homogenous workloads, adaptive DTM has the flexibility to scale down processor structures if necessary and ramp up the operating frequency.
For instance, in a workload mix comprising four copies of gcc, when the issue
width of each core is six, the processor has a maximum possible operating
frequency of 2.7 GHz, and this is the best possible operating point for all other
DTM techniques. However, if adaptivity is employed in conjunction with DVFS,
then an operating frequency of 3.4 GHz is possible when the issue width of the cores is
scaled down to 4 at some points during execution. In a similar fashion, when fetch
gating is employed at a per-core level, the operating frequency can be scaled up
by employing fetch gating selectively at some cores. Thus our proposed techniques
exploit a wider range of options than previously proposed techniques for thermal





















Figure 5.10: Throughput of different DTM schemes for homogenous workloads (series: Dist DVFS, Global DVFS, Global DVFS+Migration, Adaptive, Global DVFS+Fetch Gating)






















Figure 5.11: Weighted performance for DTM schemes (series: Dist DVFS, Global DVFS, Global DVFS+Migration, Adaptive, Global DVFS+Fetch Gating)
5.6.5 Weighted Performance
Another commonly used metric for evaluating the performance in multi-programmed








N is the number of programs executing
Thi is the throughput of program i under a DTM scheme
ThBasei is the throughput of thread i without any thermal constraints
Weighted performance is commonly used to evaluate the performance of multi-threaded and multi-core architectures [90]. It reflects the performance impact of a scheme on all programs in the workload. Given a set of programs, the total throughput can be a misleading metric to gauge the impact on individual programs in some scenarios. For instance, given a high IPC and a low IPC program, the throughput can be artificially boosted by giving higher preference to the high IPC thread. Weighted performance, on the other hand, normalizes the performance of each program and hence does not suffer from this problem. Unlike existing hardware based schemes, which use a single design time policy, our software based framework can be altered to optimize any metric.
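The contrast between the two metrics can be illustrated with a small sketch. It assumes the common form of weighted performance in which each program's throughput Th_i is divided by its unconstrained throughput ThBase_i and the ratios are averaged over the N programs; the averaging step and all numbers below are assumptions made for illustration.

```python
def total_throughput(th):
    """Plain sum of per-program throughput."""
    return sum(th)


def weighted_performance(th, th_base):
    """Mean of per-program throughput normalized by the unconstrained
    throughput (assumed form: (1/N) * sum_i Th_i / ThBase_i)."""
    return sum(t / b for t, b in zip(th, th_base)) / len(th)


# Hypothetical two-program workload: one high IPC and one low IPC program.
th_base = [4.0, 1.0]
favor_high_ipc = [3.6, 0.5]   # policy that starves the low IPC program
balanced = [3.2, 0.8]         # policy that slows both programs equally

throughput_gap = total_throughput(favor_high_ipc) - total_throughput(balanced)
weighted_gap = (weighted_performance(balanced, th_base)
                - weighted_performance(favor_high_ipc, th_base))
```

Under these made-up numbers the policy that favors the high IPC program wins on total throughput, while the balanced policy wins on weighted performance.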
Figure 5.11 shows the weighted performance of all DTM schemes; in this case our framework is programmed to optimize weighted performance. Our adaptive and throttling schemes result in better weighted performance than the existing DVFS schemes. Weighted performance is one illustration of the flexibility of our scheme. Our scheme can also be extended to handle priorities if some of the executing programs are more critical than others.
5.6.6 Configurations Selected
In this section we present an overview of the configurations selected by our DTM framework. We provide the points in the configuration space most often used for a heterogenous workload and a homogenous workload. Consider the task set T7 comprising wupwise, gcc, art and crafty. From the unconstrained temperature profiles (Figure 5.1(a)) it is clear that gcc is the hottest benchmark and art is the coldest benchmark. For this workload, our scheme chooses the configuration <4, 32, 4> for art, and this remains active for more than 90% of the sampling intervals. For crafty, two configurations, <5, 48, 5> and <3, 16, 3>, are chosen for more than 40% and 55% of the sampling intervals respectively; gcc has three active configurations, <4, 32, 4>, <3, 16, 3> and <5, 48, 5>, each for about 32% of the samples, while the configuration of wupwise oscillates equally between <4, 32, 4> and <3, 16, 3>. The global supply frequency varies between a maximum of 3.4 GHz and a minimum of 2.8 GHz, with the frequency at 3.2 GHz for over 90% of the time.
From the data we can see that a wide variety of points in the configuration space get chosen. It can also be seen that applications have phases of similar behavior and that a specific configuration can optimize for a specific phase. As discussed earlier, our framework includes a simple phase detector, and the whole configuration search is triggered by phase changes. In a global optimization framework where we optimize for total throughput, phase changes in one of the programs can trigger configuration changes in other cores. In this task set, frequent phase changes are observed in gcc, which trigger configuration changes both in the core executing gcc and in other cores. On each configuration search, the configuration and global supply frequency are chosen to optimize total throughput for the workload observed.
Let us now consider the homogenous task set H1, which has four copies of gcc, one executing on each of the four cores. In this case three configurations are selected for all four cores, namely <−1, 64, 6>, <5, 48, 5> and <3, 16, 3>, with operating frequencies of 2.89 GHz, 3.07 GHz and 3.28 GHz. These configuration points are the maximum throughput points for the large phases we observe in the execution of gcc. Two phases are high ILP phases that produce the best throughput at a wider issue width and a lower operating frequency, while the third phase is a relatively low IPC phase which is best served by a narrower issue width and a higher operating frequency. Clearly a wide variety of points in the feature space get selected in our configuration search, and the optimal configuration depends on the amount of ILP available.
5.6.7 Impact of Backup Technique
Our software based framework uses prediction to determine whether a specific configuration and workload pair is thermally safe, and is backed up by a hardware DTM mechanism. We use a simple hardware DTM scheme, namely global clock gating, for backup: all cores are made inactive for one sampling interval when any of the cores hits the threshold temperature [21]. In this section we examine the impact of the backup DTM mechanism on our framework. We compare the throughput of our scheme (a) with global clock gating as backup DTM and (b) with no backup DTM scheme. Clearly (b) might be thermally unsafe in the (rare) scenarios where the prediction of our classifier is wrong. We observe that the backup DTM mechanism is invoked in only three (T2, T10 and H1) of the eighteen task sets we evaluate. Even in these three task sets the backup mechanism is invoked very infrequently and the impact on performance is minimal: there is less than 1% difference in throughput between the two cases.
5.7 Summary
In this chapter we presented a new software based thermal management framework for multi-core architectures. Our framework is hierarchical and employs a combination of architecture adaptation locally and DVFS globally for thermal management. It provides better temperature-performance tradeoffs and has lower design/verification complexity than distributed DVFS, which is the best performing previously proposed approach for thermal management of multi-cores. Also, our technique performs well for both homogenous and heterogenous workloads, unlike migration based approaches, which are largely ineffective for homogenous workloads.
Chapter 6
Task Sequencing for Thermal
Management
Temperature constraints have become one of the key issues in the design of embedded systems [9]. The challenges in cooling embedded systems are more severe than in general purpose systems. While general purpose desktop systems can afford complex cooling assemblies like fans, embedded systems are size constrained and can afford only minimal or no heat removal assemblies. With further increases in the complexity and miniaturization of embedded systems, thermal problems are expected to become more severe [104]. Hence thermal management solutions in the context of embedded systems have started gaining attention.
The problems of thermal management in embedded systems are fundamentally different
from the challenges in general purpose systems. In general purpose systems the
goal of thermal management is performance maximization while handling thermal
constraints. In embedded systems the problem is to maintain temperature (often
to minimize temperature) while satisfying a host of other system level constraints.
These system level constraints include soft and hard real time constraints,
ensuring a certain level of service/fairness and others. However, embedded systems in
general have very well defined functionality, i.e. the workload running on an
embedded system is often well known. This increases the scope of applicability of
static/design time techniques for thermal management in such systems. In this
chapter we present a static design time technique for thermal management in em-























Figure 6.1: Peak temperature for all possible task sequences.
Our technique is motivated by the observations in Chapter 3 on the thermal behavior of embedded programs executing on a processor. We observed that different embedded tasks show a wide variety in their thermal behavior. When a set of such tasks executes on a processor, the resulting temperature profile is highly sensitive to the schedule of the tasks in the task set. In particular, we observe that the temperature profile of a set of tasks depends on the order of execution of the tasks. Figure 6.1 further elaborates this observation. Consider a set of eight tasks (crc, epic, gsm, stringsearch, dijkstra, djpeg, adpcm, patricia) executing together on an embedded processor. There are 8! = 40320 possible execution sequences of these tasks. Figure 6.1 shows a plot of the peak temperature (maximum temperature reached) of each of these execution sequences. Clearly, the temperature behavior of a set of tasks is very sensitive to the execution sequence of the tasks.
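The kind of exhaustive enumeration behind such a plot can be sketched as follows for a smaller task set. The sketch uses the lumped RC thermal model of Equation 6.1; the thermal parameters, task names, powers and durations are all hypothetical, and the task set is assumed to repeat periodically.

```python
import math
from itertools import permutations

# Hypothetical thermal parameters: resistance (°C/W), capacitance (J/°C),
# ambient temperature (°C).
R, C, T_AMB = 2.0, 0.05, 45.0


def end_temperature(t_init, power, duration):
    """Temperature after running at constant power (lumped RC model)."""
    t_steady = power * R + T_AMB
    return t_steady - (t_steady - t_init) * math.exp(-duration / (R * C))


def peak_temperature(sequence, periods=30):
    """Peak temperature when `sequence` repeats periodically from ambient.

    Within one task the temperature moves monotonically toward that task's
    steady state, so checking task boundaries is enough to find the peak.
    """
    temp = peak = T_AMB
    for _ in range(periods):
        for power, duration in sequence:
            temp = end_temperature(temp, power, duration)
            peak = max(peak, temp)
    return peak


# Made-up tasks: name -> (average power in W, execution time in s).
tasks = {"t1": (30.0, 0.05), "t2": (10.0, 0.05), "t3": (25.0, 0.05),
         "t4": (8.0, 0.05), "t5": (15.0, 0.05)}

peaks = {order: peak_temperature([tasks[name] for name in order])
         for order in permutations(tasks)}
best = min(peaks, key=peaks.get)
worst = max(peaks, key=peaks.get)
```

Even for this small made-up set the spread between `peaks[best]` and `peaks[worst]` is several degrees. Enumeration grows factorially with the number of tasks, so it only serves to illustrate the sensitivity, not as a practical search method.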
This observation gives us another dimension for thermal management of simple static embedded systems. In such systems, the task set that executes in the system is often well known and can be profiled offline. In the past, voltage scaling has been used to control temperature in such systems. Voltage scaling involves deciding at design time the voltage at which each task in the system will execute. We suggest another powerful knob to manage temperature in such systems, namely, to execute the tasks of the system in an optimal order to produce the best temperature profile. This chapter elaborates on how to determine a thermally optimal execution order for an arbitrary set of periodic tasks.






















Figure 6.2: Thermal profiles of voltage scaling and combined approach
We also observe that our proposed approach (task sequencing) and voltage scaling are complementary approaches that must be used in unison to achieve effective temperature reduction. When voltage scaling is employed in isolation (as with earlier approaches), we find that the resultant temperature profile is very sensitive to the execution order of the tasks that is presented to voltage scaling as input. This is illustrated in Figure 6.2. When a voltage-scaling-only approach is provided with the worst execution sequence of a task set (crc, epic, gsm, stringsearch, dijkstra, djpeg, adpcm, patricia) as input, the resultant profile has a maximum temperature of 80.45°C. When the voltage scaling approach is provided with the best execution sequence as input, we get a peak temperature of 73.07°C. These are the worst and best possible results with a voltage-scaling-only approach. When voltage scaling and sequencing are employed together, the resultant temperature profile has a peak temperature of 71.25°C, which is better than even the best case behavior of a voltage-scaling-only approach.
6.1 Related Work
In static embedded systems, the functionality of the system and the workload executing on it are often known at design time. Hence, design time static approaches for both performance and temperature optimization can be applied very effectively. In general purpose systems, dynamic thermal management solutions are used to manage temperature. Such approaches are based on monitoring on-chip thermal sensors continuously; when the temperature exceeds a predefined threshold, mechanisms are invoked to lower it. Such approaches are not commonly used or preferred in embedded systems, mainly for two reasons. First, DTM control often involves a non-trivial amount of hardware complexity in both the sensors and the controller, hence it might not be preferred in simple embedded systems which have tight constraints on area and power [104]. Second, in some embedded system designs such as real time systems, predictable system behavior is often as important as performance [95]. Dynamic thermal management techniques can adversely impact the predictability of the system. Hence static thermal management techniques have been widely explored and used in embedded system design. We present an overview of the static techniques for thermal management in embedded systems.
Most modern embedded processors can operate at multiple voltage levels, and when the voltage level of the processor is lowered, both operating speed and power consumption are reduced. Most prior work on static thermal management in embedded systems has exploited voltage scaling to lower the temperature. Zhang et al. [104] examine the problem of thermal management of a set of periodic tasks. Given a set of periodic tasks, they decide the voltage at which each task must run to satisfy a temperature constraint. They prove that the problem of voltage mapping is NP-hard and develop polynomial time approximation algorithms. Liu et al. [59] examine the problem of assigning voltage levels to individual tasks in an MPSoC system for both power and temperature minimization. Xie et al. [47] map tasks to different cores in an MPSoC to minimize the peak temperature. Chantem et al. [25] formulate the problem of allocation and scheduling on an MPSoC as an MILP problem. While the MILP problem for minimizing steady state temperatures can be solved efficiently, it becomes unsolvable when transient variations in temperature are considered [25]. Wang et al. [96] derive the processor utilization for a simple reactive dynamic thermal management scheme where the processor executes at a constant (low) voltage when the temperature hits the threshold. Rao et al. [81] derive the optimal throttling policy to maintain temperature for a given static workload. They express the optimal throttling policy as a function of time and also show that a two-step approximation provides good results. Murali et al. [67] apply convex optimization to determine the voltage at which different cores in an MPSoC must execute to minimize the temperature.
As seen above, prior work in the area of static thermal management in embedded systems has dealt with exploiting multiple voltage levels for thermal management. In Chapter 3 we observed that embedded tasks show heterogeneity in their thermal behavior and that, when a set of heterogenous tasks execute together, the temperature profile is critically dependent on the order in which these tasks execute. In this chapter we exploit this observation for thermal management of a set of heterogenous periodic tasks. In particular, we choose the task execution order to minimize the temperature. To the best of our knowledge, ours is the first approach that exploits task reordering for thermal management. For systems where voltage scaling is applicable, we observe that task reordering and voltage scaling are complementary and can be used very effectively in combination. Our scheme of combining task reordering and voltage scaling outperforms optimal voltage scaling.
Next we briefly present our thermal model and then apply it to formalize the problem of task reordering for temperature minimization.
6.2 Background
Thermal Model We choose a lumped RC model proposed by Skadron et al. [89]
as our processor thermal model. If the processor dissipates an average power of
$P$ Watt over a time interval $t$, then the temperature $T(t)$ at the end of the time
interval is given by
$$T(t) = P \times R + T_{amb} - (P \times R + T_{amb} - T_{init})\, e^{-t/RC} \tag{6.1}$$
where $R$ is the thermal resistance measured in °C/Watt, $C$ is the thermal
capacitance measured in Joules/°C, $T_{amb}$ is the ambient temperature, and $T_{init}$
is the initial temperature. $T_S = P \times R + T_{amb}$ is the steady state temperature
associated with an average power dissipation of $P$ Watt.
Our approach only assumes that given an execution time t and a power profile
P , the final temperature T (t) is linearly dependent on the initial temperature
Tinit. Thus our approach is agnostic to the choice of processor thermal model.
We can easily incorporate more detailed thermal models (e.g., [81]) that consider
temperatures at the granularity of individual functional units within the processor.
Thermal Profile of a Task Let us now look at the thermal profile of an individual
task $J_i$ with average power consumption $P_i$ and execution time $c_i$ running
on a processor. The thermal model of the processor is given by Eqn 6.1. Therefore
$$T(c_i) = P_i \times R + T_{amb} - (P_i \times R + T_{amb} - T_{init})\, e^{-c_i/RC} \tag{6.2}$$
The steady state temperature of the task $J_i$ is defined as
$$T_{S_i} = P_i \times R + T_{amb} \tag{6.3}$$
$T_{S_i}$ is the temperature that would be reached if an infinite number of instances
of task $J_i$ executed continuously on the processor. Let
$$m_i = e^{-c_i/RC} \tag{6.4}$$
Then substituting into Eqn 6.2 and rearranging the terms, we get
$$T(c_i) = (1 - m_i)\, T_{S_i} + m_i\, T_{init} \tag{6.5}$$
We observe from Eqn 6.5 that if $T_{init} < T_{S_i}$, then the temperature rises towards
$T_{S_i}$. Alternatively, if $T_{init} > T_{S_i}$, then the temperature falls towards $T_{S_i}$.
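Eqns 6.3 to 6.5 translate almost directly into code. The following is a minimal sketch rather than the simulator used in this thesis: the function names are illustrative, R and C are the values reported in Section 6.6, and the ambient temperature of 45 °C is an assumption made only so that the example produces concrete numbers.

```python
import math

# R and C are the values reported in Section 6.6; T_AMB is an assumed value.
R = 1.83        # thermal resistance, degC/Watt
C = 0.1122      # thermal capacitance, Joules/degC
T_AMB = 45.0    # ambient temperature, degC (assumption)


def steady_state_temp(power):
    """Eqn 6.3: temperature approached if the task ran forever."""
    return power * R + T_AMB


def decay_factor(exec_time):
    """Eqn 6.4: m = exp(-c / RC), which lies in (0, 1] for c >= 0."""
    return math.exp(-exec_time / (R * C))


def temp_after_task(power, exec_time, t_init):
    """Eqn 6.5: weighted blend of the steady state and initial temperatures."""
    m = decay_factor(exec_time)
    return (1.0 - m) * steady_state_temp(power) + m * t_init


# A 20 W task started at 60 degC heats up towards its steady state temperature,
# while a 5 W task started at the same temperature cools down towards its own.
HEATING = temp_after_task(power=20.0, exec_time=0.1, t_init=60.0)
COOLING = temp_after_task(power=5.0, exec_time=0.1, t_init=60.0)
```

The shorter the task (m close to 1), the more the final temperature is inherited from the initial temperature; this is the property that task sequencing exploits.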
6.3 Task Sequencing
In this section, we concentrate on the task sequencing problem. Given a periodic
set of heterogeneous tasks (i.e., tasks with different thermal profiles), our goal is to
construct a task sequence that minimizes the peak temperature. However, a proper
formulation of this problem first requires a clear definition of the thermal profile
and the peak temperature of a task sequence. So we first proceed to analytically
model the thermal profile of a task sequence.





















Figure 6.3: Thermal profile of a repeating sequence of tasks. (Horizontal axis: time,
0 to 1200, with iterations 1 to 6 marked.)
Let us consider a particular sequence $L = \langle J_1, \ldots, J_N \rangle$ of $N$ tasks with
execution times $c_1, \ldots, c_N$, average power $P_1, \ldots, P_N$, and the corresponding
steady state temperatures $T_{S_1}, \ldots, T_{S_N}$ where $T_{S_i} = P_i \times R + T_{amb}$,
$1 \le i \le N$. Since the task set is periodic, the sequence $L$ repeats itself infinitely.
Figure 6.3 shows the thermal profile of a repeating sequence of 4 tasks. It is
interesting to observe that starting from an initial temperature, the processor
temperature rises as the sequence repeats itself. But it gradually reaches a steady
state where the thermal profile of the sequence exhibits a recurring pattern. The
existence of this recurring pattern is a result of the fact that (a) given a starting
temperature and a repeating sequence, the temperatures at the end of successive
iterations of the sequence form a monotone series, either non-increasing or
non-decreasing (this can be proved using induction on the number of tasks and
Eqn 6.5), and (b) the final temperature at the end of the sequence is upper (lower)
bounded by the steady state temperature of the hottest (coldest) task. There are
two important constraints that are satisfied for the recurring thermal profile in the
steady state.
• The initial temperature Ti−1 of a task Ji (1 ≤ i ≤ N) is the same for all
its execution instances in the steady state. This also implies that the final
temperature Ti of a task Ji (which is the initial temperature of the next task
Ji+1 in the sequence) is the same for all its execution instances.
• For a single instance of execution of L in the steady state, the temperature
at the beginning of the sequence is identical to the temperature at the end
of the sequence, i.e., T0 = TN .
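The convergence to a recurring pattern can be observed numerically by applying Eqn 6.5 task by task and repeating the sequence until the end-of-iteration temperature stops changing. This is only an illustrative sketch: the four tasks, the ambient temperature, and the function names are hypothetical, while R and C are the values from Section 6.6.

```python
import math

R, C, T_AMB = 1.83, 0.1122, 45.0   # R, C as in Section 6.6; T_AMB is assumed


def run_once(tasks, t_start):
    """Execute the sequence once; return the temperature at the end of each task."""
    temps, t = [], t_start
    for exec_time, power in tasks:
        m = math.exp(-exec_time / (R * C))
        t = (1.0 - m) * (power * R + T_AMB) + m * t     # Eqn 6.5
        temps.append(t)
    return temps


def end_of_iteration_temps(tasks, t_start, tol=1e-9, max_iters=1000):
    """Repeat the sequence until the end-of-iteration temperature stops changing."""
    history, t = [], t_start
    for _ in range(max_iters):
        t_end = run_once(tasks, t)[-1]
        history.append(t_end)
        if abs(t_end - t) < tol:
            break
        t = t_end
    return history


# Hypothetical 4-task sequence: (execution time in s, average power in W)
TASKS = [(0.05, 20.0), (0.10, 10.0), (0.08, 30.0), (0.03, 5.0)]
HISTORY = end_of_iteration_temps(TASKS, t_start=T_AMB)
```

Starting from the ambient temperature, the values in `HISTORY` rise monotonically and settle at a fixed point that lies between the steady state temperatures of the coldest and hottest tasks, which is the behavior described in (a) and (b) above.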
The peak temperature of the task sequence in the steady state is given by
$peak(L) = \max(T_1, \ldots, T_N)$. Next we show how to analytically compute $peak(L)$.
Following Eqn 6.5 and the constraints on the thermal profile in the steady state,
we can express $T_1, \ldots, T_N$ using linear equations.
$$\begin{aligned}
T_1 &= (1-m_1)\, T_{S_1} + m_1 T_N \\
T_2 &= (1-m_2)\, T_{S_2} + m_2 T_1 \\
&\;\;\vdots \\
T_N &= (1-m_N)\, T_{S_N} + m_N T_{N-1}
\end{aligned} \tag{6.6}$$
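This system of $N$ linear equations in $N$ variables $T_1, \ldots, T_N$ can be solved by
Cramer's rule. Writing the system as $A\,T = \mathbf{c}$ with
$$A = \begin{pmatrix}
1 & 0 & \cdots & 0 & -m_1 \\
-m_2 & 1 & \cdots & 0 & 0 \\
\vdots & & \ddots & & \vdots \\
0 & 0 & \cdots & -m_N & 1
\end{pmatrix}, \qquad
\mathbf{c} = \begin{pmatrix}
(1-m_1)\, T_{S_1} \\ (1-m_2)\, T_{S_2} \\ \vdots \\ (1-m_N)\, T_{S_N}
\end{pmatrix},$$
we have
$$T_i = \frac{\det(A_i)}{\det(A)}$$
where $A_i$ is the matrix formed by replacing the $i$th column of $A$ by the column
vector $\mathbf{c}$. That is,
$$T_i = \frac{\begin{vmatrix}
1 & 0 & \cdots & (1-m_1)\, T_{S_1} & \cdots & 0 & -m_1 \\
-m_2 & 1 & \cdots & (1-m_2)\, T_{S_2} & \cdots & 0 & 0 \\
\vdots & & & \vdots & & & \vdots \\
0 & 0 & \cdots & (1-m_N)\, T_{S_N} & \cdots & -m_N & 1
\end{vmatrix}}{\begin{vmatrix}
1 & 0 & \cdots & 0 & -m_1 \\
-m_2 & 1 & \cdots & 0 & 0 \\
\vdots & & \ddots & & \vdots \\
0 & 0 & \cdots & -m_N & 1
\end{vmatrix}}$$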
The determinant of the coefficient matrix is $\det(A) = 1 - m_1 m_2 \cdots m_N$, and hence the
system of equations has a solution if $1 - m_1 m_2 \cdots m_N \ne 0$. Recall that $m_i = e^{-c_i/RC}$.
So the system of equations has a solution if the sum of the execution times of all tasks
is non-zero, i.e., $\sum_{i=1}^{N} c_i \ne 0$. Thus
$$T_N = \frac{(1-m_N)\, T_{S_N} + m_N (1-m_{N-1})\, T_{S_{N-1}} + m_N m_{N-1} (1-m_{N-2})\, T_{S_{N-2}} + \cdots}{1 - m_1 m_2 \cdots m_N}$$
To express $T_i$ for all values of $i$ ($1 \le i \le N$), we need to define a new operator $/$
that computes the index of the predecessor tasks in the sequence. Note that due to
the repeating nature of the task sequence, the predecessor of task $J_1$ is task $J_N$. Thus,
given a task $J_i$ in the task sequence $\langle J_1, \ldots, J_N \rangle$, $i / k$ is defined as the index of the
$k$th predecessor task of $J_i$. Clearly
$$i / k = \begin{cases} i - k & \text{if } k < i \\ N + (i - k) & \text{otherwise} \end{cases}$$
Now the temperature $T_i$ at the end of task $J_i$ can be defined as
$$T_i = \frac{(1-m_i)\, T_{S_i} + m_i (1-m_{i/1})\, T_{S_{i/1}} + \cdots + m_i m_{i/1} \cdots m_{i/(N-2)} (1-m_{i/(N-1)})\, T_{S_{i/(N-1)}}}{1 - m_1 m_2 \cdots m_N} \tag{6.7}$$
The maximum of all the intermediate temperatures is the peak temperature of the
sequence, that is,
$$peak(L) = \max(T_1, \ldots, T_i, \ldots, T_N) \tag{6.8}$$
Now that we have formally defined the peak temperature of a task sequence, we can
present the formulation of the task sequencing problem.
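The steady state profile and peak(L) can also be computed without forming determinants. The sketch below first evaluates the closed form for T_N (the numerator is accumulated as `a`, the product m_1 ... m_N as `b`) and then applies the rows of Eqn 6.6 in order. Function names, the example tasks, and the ambient temperature are assumptions made for illustration.

```python
import math

R, C, T_AMB = 1.83, 0.1122, 45.0   # R, C as in Section 6.6; T_AMB is assumed


def steady_state_profile(tasks):
    """Return [T_1, ..., T_N] for a sequence of (exec_time, power) that repeats forever."""
    m = [math.exp(-c / (R * C)) for c, _ in tasks]
    ts = [p * R + T_AMB for _, p in tasks]
    # One pass over the sequence maps T_0 to T_N = a + b * T_0, where a is the
    # numerator of the closed form for T_N and b = m_1 * ... * m_N.
    a, b = 0.0, 1.0
    for m_i, ts_i in zip(m, ts):
        a = (1.0 - m_i) * ts_i + m_i * a
        b *= m_i
    t = a / (1.0 - b)                        # steady state constraint T_0 = T_N
    profile = []
    for m_i, ts_i in zip(m, ts):
        t = (1.0 - m_i) * ts_i + m_i * t     # Eqn 6.6, one row at a time
        profile.append(t)
    return profile


def peak(tasks):
    """Eqn 6.8: peak temperature of the repeating sequence."""
    return max(steady_state_profile(tasks))


# Hypothetical sequence: (execution time in s, average power in W)
TASKS = [(0.05, 20.0), (0.10, 10.0), (0.08, 30.0), (0.03, 5.0)]
PROFILE = steady_state_profile(TASKS)
```

For a sequence containing a single task the profile collapses to that task's steady state temperature, which is a convenient sanity check on the closed form.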
6.3.2 Problem Formulation
The input to our task sequencing problem is a set of $N$ tasks $J = \{J_1, \ldots, J_N\}$ with
execution times $c_1, \ldots, c_N$, average power consumption $P_1, \ldots, P_N$, and the corresponding
steady state temperatures $T_{S_1}, \ldots, T_{S_N}$ where $T_{S_i} = P_i \times R + T_{amb}$, $1 \le i \le N$. Our goal
is to construct a sequence of these $N$ tasks that minimizes the peak temperature.
Clearly, given $N$ tasks, there exist $N!$ possible sequences. Let $L$ be one such sequence
and let $pos_L(J_i)$ ($1 \le i \le N$) define the position of task $J_i$ in sequence $L$. By definition,
$pos_L(J_i) \ne pos_L(J_k)$ if $i \ne k$. Also, for any position $p$ ($1 \le p \le N$) in the sequence $L$,
there exists a task $J_i$ ($1 \le i \le N$) such that $pos_L(J_i) = p$. Given the sequence $L$, we can
determine the peak temperature of the sequence in the steady state by solving a system
of linear equations as described in Section 6.3.1. An optimal solution is the sequence
with the minimum peak temperature among the N ! possible sequences. An exhaustive
search technique can enumerate each of the N ! possible sequences, compute the peak
temperature for each such sequence, and then return the sequence with the minimum
peak temperature. However, as the number of tasks N increases, the computational
complexity of this search technique becomes prohibitive.
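The exhaustive search described above is easy to state in code, which also makes its factorial cost explicit. This is a sketch for small N only; the task tuples, the ambient temperature, and the function names are hypothetical, and `peak` evaluates the steady state closed form of Section 6.3.1.

```python
import itertools
import math

R, C, T_AMB = 1.83, 0.1122, 45.0   # R, C as in Section 6.6; T_AMB is assumed


def peak(tasks):
    """Steady state peak temperature of a repeating sequence of (name, exec_time, power)."""
    a, b = 0.0, 1.0
    for _, c, p in tasks:
        m = math.exp(-c / (R * C))
        a = (1.0 - m) * (p * R + T_AMB) + m * a
        b *= m
    t, hottest = a / (1.0 - b), float("-inf")
    for _, c, p in tasks:
        m = math.exp(-c / (R * C))
        t = (1.0 - m) * (p * R + T_AMB) + m * t
        hottest = max(hottest, t)
    return hottest


def best_sequence(tasks):
    """Try all N! orders and keep one with the lowest peak temperature."""
    return min(itertools.permutations(tasks), key=peak)


# Hypothetical task set: (name, execution time in s, average power in W)
TASKS = [("A", 0.05, 20.0), ("B", 0.10, 10.0), ("C", 0.08, 30.0), ("D", 0.03, 5.0)]
BEST = best_sequence(TASKS)
```

Because the sequence repeats, rotating an order leaves its steady state profile unchanged, so only (N-1)! orders are distinct; this does not change the exponential nature of the search.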
Moreover, even for a special case of the problem where (a) the task sequence executes only
once starting with some initial temperature $T_{init}$ and (b) the temperature at the end of a
task in the sequence depends only on the previous task, finding the optimal sequence that
minimizes the peak temperature is still NP-hard. This can be proved by a polynomial
reduction from the well known bottleneck traveling salesman problem (bottleneck TSP),
which is NP-hard. The bottleneck TSP asks for the Hamiltonian cycle in a weighted
graph that minimizes the weight of the heaviest edge in the cycle. Let us construct
a complete weighted graph $G$ with $N$ vertices, where each vertex $u$ maps to a distinct
task $task(u)$ and the weight of the edge $u \to v$ is the temperature at the
end of execution of the task sequence $\langle task(u), task(v) \rangle$. Finding an optimal solution to
the special case of our problem is equivalent to solving the bottleneck TSP on
graph $G$. Thus, even the special case of our problem is NP-hard.
In the next subsection, we present a heuristic to solve the task sequencing problem with
the objective of minimizing peak temperature.
6.3.3 Task Sequencing Algorithm
Our heuristic for task sequencing is based on the following observation. Eqn 6.7 defines,
in the steady state, the temperature after task $J_i$ in the sequence $\langle J_1, \ldots, J_i, \ldots, J_N \rangle$.
Note that $m_i = e^{-c_i/RC}$. Thus $0 < m_i < 1$ and in practice, for all our tasks, $m_i$ varies
between 0.2356 and 0.6832 depending on the execution time $c_i$ of the tasks. A closer
look at Eqn 6.7 reveals that for task $J_i$, its execution time $c_i$ (contributing towards $m_i$)
and its steady state temperature $T_{S_i}$ have the maximum influence on the temperature
$T_i$ at the end of execution of $J_i$. This is followed by the contribution from its immediate
predecessor $J_{i/1}$. The contributions from other predecessors of $J_i$ decrease in the order
$J_{i/2}, \ldots, J_{i/(N-1)}$. Based on this observation, what should be the characteristics of a task
sequence that minimizes the peak temperature?
First, a task with higher steady state temperature and longer execution time is more
likely to produce the peak temperature of a task sequence. We can reduce the temperature
at the end of this hot task by choosing a cooler task as its predecessor. Also, a
cold task is a better candidate to absorb the temperature impact of execution of a hot
task. Therefore, it makes sense to put a cold task as the successor of a hot task. In other
words, a good task sequence that minimizes the peak temperature must place tasks with
opposite characteristics close to each other to get a balanced thermal profile.

Algorithm 1 Task Sequencing
Require: Task set J = {J1, . . . , JN}
for (i = 0, . . . , N − 1) Li = Ji+1;
while N > 1 do
    for (i = 0, . . . , N − 1) compute metric(Li);
    sort (L0, . . . , LN−1) in decreasing order of metric;
    for (i = 0, . . . , N/2 − 1) Li = Li • LN−(i+1);
    N = N/2;
end while
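Figure 6.4: Task sequencing algorithm. (Tree diagram: the leaf tasks J1 to J8, with sample
metric values 77.94, 62.40, 68.20, 68.88, 69.43, 72.30, 74.4 and 72.88, are paired bottom-up
into the final sequence J2 J1 J3 J7 J4 J8 J5 J6; the surviving metric values of the
intermediate subsequences are 67.59, 71.59 and 70.73.)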
Based on this observation about the characteristics of a good task sequence, we propose
a hierarchical algorithm for task sequence construction. Our algorithm proceeds
in a bottom-up fashion by pairing up tasks or task subsequences with opposing thermal
characteristics till a single task sequence is constructed. Given a set of $N$ tasks, we first
pair up tasks with opposite characteristics to create $N/2$ subsequences each containing two
tasks. These subsequences are further paired up to create $N/4$ subsequences each containing
four tasks. We proceed in this manner till we obtain a single sequence containing all
the $N$ tasks. The task sequencing algorithm is illustrated in Figure 6.4.
So how do we choose tasks or task subsequences with opposing characteristics? First,
we need to define a “metric” that characterizes or summarizes the thermal behavior of
a task or a subsequence. Let us first consider individual tasks. Following Eqn 6.5, the
temperature at the end of task $J_i$ is defined as $T_i = (1-m_i)\, T_{S_i} + m_i T_{init}$ where $T_{init}$ is the
temperature before execution of task $J_i$. In a task sequence, however, $T_{init}$ depends on
the sequence of tasks executed prior to $J_i$ as shown in Eqn 6.7. As we are constructing the
task sequence, $T_{init}$ is unknown. Instead, we approximate the temperature contribution
from the other tasks, $contrib_{J-\{J_i\}}$, and replace $T_{init}$ with $contrib_{J-\{J_i\}}$ in Eqn 6.5 to get
$$metric_{J_i} = (1-m_i) \times T_{S_i} + m_i \times contrib_{J-\{J_i\}}$$
Here $metric_{J_i}$ summarizes the thermal characteristics of task $J_i$. We will describe shortly
how we approximate $contrib_{J-\{J_i\}}$. But before that let us discuss how we compute the metric
for a subsequence.
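Let $L$ be a task sequence consisting of a set of tasks denoted by $tasks(L)$. We treat $L$
as a single task with execution time $c_L$ and average power $P_L$, so that
$$m_L = e^{-c_L/RC} \quad \text{and} \quad T_{S_L} = P_L \times R + T_{amb}$$
$$metric_L = (1-m_L) \times T_{S_L} + m_L \times contrib_{J-tasks(L)}$$
Now how do we approximate the thermal contribution of a set of tasks as in $contrib_{J-\{J_i\}}$
or $contrib_{J-tasks(L)}$? We simply set
$$contrib_{J-tasks(L)} = T_{S_{J-tasks(L)}} \quad \text{and} \quad contrib_{J-\{J_i\}} = T_{S_{J-\{J_i\}}}$$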
Once we compute this “metric” for individual tasks or subsequences, our pairing strategy
is quite straightforward. Let us assume that we have $N$ tasks or subsequences at some
level. We sort the tasks or subsequences in decreasing order of metric value and pair up
the entity in the $i$th position with the one in the $(N-(i+1))$th position, for $i = 0, \ldots, N/2-1$.
For ease of exposition, we assume without loss of generality that $N$ is an even number.
When we pair up two tasks or subsequences, the resulting sequence is formed by placing
the colder subsequence (lower metric value) before the hotter subsequence (higher metric
value). This is based on the observation that the temperature at the end of a task is
influenced the most by its predecessor task in the final sequence. An illustration of the
working of the algorithm for a task set consisting of eight tasks with sample metric values
for the nodes is presented in Figure 6.4. It can be seen that the algorithm works in a
bottom-up fashion, pairing up tasks or task sequences with opposite characteristics to
get the final balanced sequence. The complexity of our task sequencing algorithm is
$O(N \times (\lg N)^2)$ where $N$ is the number of tasks.
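Algorithm 1 and the metric can be sketched as follows. Two points are assumptions of this sketch rather than statements from the text: a group of tasks (a subsequence, or the set of remaining tasks) is summarized by its total execution time and its time-weighted average power, and an odd number of subsequences is handled by carrying the median one to the next level. The task tuples, names, and the ambient temperature are hypothetical.

```python
import math

R, C, T_AMB = 1.83, 0.1122, 45.0   # R, C as in Section 6.6; T_AMB is assumed


def group_steady_temp(tasks):
    """Steady state temperature of a group (assumption: time-weighted average power)."""
    total_time = sum(c for _, c, _ in tasks)
    avg_power = sum(c * p for _, c, p in tasks) / total_time
    return avg_power * R + T_AMB


def metric(subseq, others):
    """metric_L = (1 - m_L) * T_S_L + m_L * contrib, with contrib = T_S of the other tasks."""
    c_l = sum(c for _, c, _ in subseq)
    m_l = math.exp(-c_l / (R * C))
    return (1.0 - m_l) * group_steady_temp(subseq) + m_l * group_steady_temp(others)


def sequence_tasks(tasks):
    """Bottom-up pairing of (name, exec_time, power) tasks, following Algorithm 1."""
    subseqs = [[t] for t in tasks]
    while len(subseqs) > 1:
        scored = []
        for k, s in enumerate(subseqs):
            others = [t for j, o in enumerate(subseqs) if j != k for t in o]
            scored.append((metric(s, others), s))
        scored.sort(key=lambda pair: pair[0], reverse=True)      # hottest first
        n = len(scored)
        # Pair position i with position n-1-i, placing the colder one first.
        merged = [scored[n - 1 - i][1] + scored[i][1] for i in range(n // 2)]
        if n % 2:                     # odd count: carry the median forward (assumption)
            merged.append(scored[n // 2][1])
        subseqs = merged
    return subseqs[0]


# Hypothetical task sets: (name, execution time in s, average power in W)
TWO = [("hot", 0.2, 30.0), ("cold", 0.2, 5.0)]
EIGHT = [(f"J{k + 1}", 0.05 + 0.02 * k, 5.0 + 3.0 * ((5 * k) % 8)) for k in range(8)]
ORDER = sequence_tasks(EIGHT)
```

With eight tasks the loop runs three times (8, 4, then 2 subsequences), mirroring the three levels of the tree in Figure 6.4.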
6.4 Sequencing & Voltage Scaling
In Section 6.3, we assume that we can only reduce the peak temperature by sequencing
the tasks appropriately; we have no control over the power consumption and the exe-
cution time of each individual task. However, if the deadline is greater than the total
execution time of the tasks, we have additional flexibility to lower the peak tempera-
ture through voltage scaling and insertion of idle times. Let us first state the problem
formally.
6.4.1 Problem Definition
The inputs to the task sequencing and voltage scaling problem are:
• A voltage scalable processor having $r$ distinct active states with supply voltages
and frequencies $\{(V_1, f_1), \cdots, (V_r, f_r)\}$ where $(V_1, f_1)$ is the highest voltage and
frequency level. We also assume an idle state (idle) where no useful work is done
and the corresponding power consumption is $P_{idle}$.
• A set of tasks $\{J_1, \cdots, J_N\}$ with execution times $\{c_1, \cdots, c_N\}$ and power
consumptions $\{P_1, \cdots, P_N\}$ where $c_i$ and $P_i$ are the execution time and power consumption
of task $J_i$ at the highest frequency $f_1$ and supply voltage $V_1$. The power
consumption and execution time of task $J_i$ ($1 \le i \le N$) at active state $j$ ($1 \le j \le r$) are
given by
$$P_{i,j} = P_i \times \frac{V_j^2 \times f_j}{V_1^2 \times f_1}, \qquad c_{i,j} = c_i \times \frac{f_1}{f_j}$$
• The deadline for one instance of execution of all the tasks, where $deadline \ge \sum_{i=1}^{N} c_i$.
The slack can be defined as $slack = deadline - \sum_{i=1}^{N} c_i$.
The goal is to produce a task sequence and an assignment of idle times and/or voltage levels
to the tasks such that the peak temperature is minimized while satisfying the deadline
constraint. We assume that the voltage switching times are negligible in comparison to
the execution times of the tasks. This is similar to the processor model used in previous
work on voltage assignment, for example [104].
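The effect of moving a task to a lower active state can be sketched as below. The sketch assumes the usual dynamic-power model, in which power scales with V²f and execution time with 1/f; the helper name `scale_task` is illustrative, and only the base point of 1.2 V at 1.5 GHz comes from the evaluation setup, while the second level is hypothetical.

```python
def scale_task(exec_time, power, level, levels):
    """Return (exec_time, power) of a task at the given active state.

    levels is a list of (voltage, frequency) pairs, highest first, so index 0
    corresponds to (V1, f1).
    Assumption: power scales with V^2 * f and execution time with 1 / f.
    """
    v1, f1 = levels[0]
    vj, fj = levels[level]
    return exec_time * f1 / fj, power * (vj ** 2 * fj) / (v1 ** 2 * f1)


# 1.2 V at 1.5 GHz is the base point of the evaluation; the lower level is hypothetical.
LEVELS = [(1.2, 1.5e9), (1.0, 1.0e9)]

BASE = scale_task(0.1, 20.0, 0, LEVELS)      # unchanged at the highest level
SLOWER = scale_task(0.1, 20.0, 1, LEVELS)    # runs longer but dissipates less power
EXTRA_TIME = SLOWER[0] - BASE[0]             # slack consumed by the lower level
```

Under this model a lower level trades slack for a lower steady state temperature: the task runs f1/fj times longer, and its average power drops by the factor (Vj² fj)/(V1² f1).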
6.4.2 Algorithm
Clearly, this problem requires solutions to two mutually dependent sub-problems, namely,
voltage assignment and task sequencing. The task sequencing algorithm described in
Section 6.3.3 takes the power consumption and execution time of tasks as input to construct
a sequence that minimizes the peak temperature. The power consumption and execution
time of a task depend on the active state in which the task executes (voltage assignment).
The voltage assignment for peak temperature minimization, in turn, depends on task
sequencing as the sequence determines the temperatures reached by the tasks.
A straightforward approach would be to perform one phase completely followed by the
other, for example, find the task sequence with minimal peak temperature and then use
the slack to perform voltage scaling on the sequence (this is discussed further in
Section 7.4). However, this approach is restrictive: by assigning the entire slack in one
go, it does not account for the inter-dependence between task sequencing and voltage
assignment. A better approach is to use the slack in small increments and to determine
the best possible sequence before the use of each increment. This is achieved by our
iterative approach that repeatedly performs (a) task sequencing to minimize the peak
temperature and (b) voltage assignment to the tasks based on the current sequence so as
to lower the peak temperature. The voltage assignment step exploits the “slack” (the
difference between the deadline and the total execution time) to lower the
voltage/frequency of a hot task or insert idle states to lower the peak temperature.
Therefore, the iterative algorithm terminates when slack = 0.

Algorithm 2 Task sequencing with voltage scaling
Require: Task set J = {J1, . . . , JN}; deadline D
slack = D − (c1 + . . . + cN);
for (i = 1, . . . , N) level(Ji) = 1;
repeat
    L = Task Sequencing (J);
    Compute peak(L) where Ji is the task with peak(L);
    if c(i, level(Ji)+1) − ci ≤ slack then
        slack = slack − (c(i, level(Ji)+1) − ci);
        level(Ji) = level(Ji) + 1; ci = c(i, level(Ji)); Pi = P(i, level(Ji));
    else
        if minidle ≤ slack then
            Insert idle task with execution time minidle into the task set;
            slack = slack − minidle;
        end if
    end if
until slack = 0
Algorithm 2 presents our iterative solution for task sequencing and voltage assignment.
Initially, we assume all tasks are executing at the highest voltage and frequency level
$(V_1, f_1)$. We first employ the task sequencing algorithm (Algorithm 1) to return a
task sequence $L$ that heuristically minimizes the peak temperature. Given the sequence $L$,
we can determine the peak temperature of the sequence $peak(L)$ in the steady state by
solving a system of linear equations as described in Section 6.3.1. Let $J_i$ be the task that
produces the peak temperature in the sequence $L$, that is, the peak temperature is reached
at the end of execution of $J_i$.
Now we proceed to lower $peak(L)$ by exploiting the slack. Let $level(J_i)$ be the current
voltage and frequency level of task $J_i$. We first check if we can lower the
voltage/frequency level of task $J_i$ by one step to $level(J_i) + 1$ and still meet the deadline.
If the answer is yes, then the algorithm updates the voltage/frequency level of task $J_i$,
its execution time, power consumption, and the slack. If there is not enough slack to
lower the active state of the hottest task $J_i$, then we introduce an “idle task” with
execution time $minidle$ and power consumption $P_{idle}$ to the task set. $minidle$ is the minimum
granularity at which we claim the slack time and clearly $minidle \le slack$. We leave the
appropriate sequencing of this idle task with respect to the existing tasks to the next
iteration when the task sequencing algorithm is invoked again. Note that although
we introduce at most one idle task in each iteration, up to $\lfloor slack / minidle \rfloor$
idle tasks can be introduced over the iterations. The algorithm continues till there is no
remaining slack to be exploited. We compare our algorithm that performs combined task
sequencing and voltage scaling with optimal voltage scaling to minimize peak temperature
given an input execution sequence. We present the optimal voltage scaling algorithm in the
next section.
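The iterative scheme of Algorithm 2 can be sketched end to end as follows. Several points are assumptions of this sketch and not statements from the text: the voltage levels and ambient temperature are hypothetical, power and execution time are scaled with the usual V²f and 1/f rules, groups of tasks are summarized by time-weighted average power inside the sequencing step, an idle task is inserted when the hottest task is already at its lowest level, and the loop stops as soon as neither a level change nor an idle task fits into the remaining slack (rather than testing slack = 0 literally).

```python
import math

R, C, T_AMB = 1.83, 0.1122, 45.0           # T_AMB is an assumed value
LEVELS = [(1.2, 1.5e9), (1.0, 1.0e9)]      # (V, f), highest first; second level is hypothetical


def scaled(c, p, lvl):
    """Assumed scaling: execution time ~ 1/f, power ~ V^2 * f."""
    (v1, f1), (vj, fj) = LEVELS[0], LEVELS[lvl]
    return c * f1 / fj, p * (vj ** 2 * fj) / (v1 ** 2 * f1)


def profile(seq):
    """Steady state end-of-task temperatures of a repeating sequence of (name, c, p)."""
    a, b = 0.0, 1.0
    for _, c, p in seq:
        m = math.exp(-c / (R * C))
        a, b = (1 - m) * (p * R + T_AMB) + m * a, b * m
    t, out = a / (1 - b), []
    for _, c, p in seq:
        m = math.exp(-c / (R * C))
        t = (1 - m) * (p * R + T_AMB) + m * t
        out.append(t)
    return out


def group_ts(tasks):
    """Steady state temperature of a group (assumption: time-weighted average power)."""
    total = sum(c for _, c, _ in tasks)
    return sum(c * p for _, c, p in tasks) / total * R + T_AMB


def sequence_tasks(tasks):
    """Algorithm 1: pair hottest with coldest by metric, colder subsequence first."""
    subs = [[t] for t in tasks]
    while len(subs) > 1:
        def metric(k):
            rest = [t for j, s in enumerate(subs) if j != k for t in s]
            m = math.exp(-sum(c for _, c, _ in subs[k]) / (R * C))
            return (1 - m) * group_ts(subs[k]) + m * group_ts(rest)
        order = sorted(range(len(subs)), key=metric, reverse=True)
        n = len(order)
        nxt = [subs[order[n - 1 - i]] + subs[order[i]] for i in range(n // 2)]
        if n % 2:                          # odd count: carry the median forward
            nxt.append(subs[order[n // 2]])
        subs = nxt
    return subs[0]


def sequence_and_scale(base, deadline, p_idle, min_idle):
    """base: {name: (c, P)} at the highest level. Returns (sequence, peak, unused slack)."""
    level = {name: 0 for name in base}
    idles = 0
    slack = deadline - sum(c for c, _ in base.values())
    while True:
        tasks = [(n, *scaled(c, p, level[n])) for n, (c, p) in base.items()]
        tasks += [(f"idle{k}", min_idle, p_idle) for k in range(idles)]
        seq = sequence_tasks(tasks)
        temps = profile(seq)
        hot = seq[temps.index(max(temps))][0]          # task that ends at the peak
        extra = math.inf
        if hot in base and level[hot] + 1 < len(LEVELS):
            c, p = base[hot]
            extra = scaled(c, p, level[hot] + 1)[0] - scaled(c, p, level[hot])[0]
        if extra <= slack:                 # lower the hottest task by one level
            slack -= extra
            level[hot] += 1
        elif min_idle <= slack:            # otherwise claim one idle quantum
            slack -= min_idle
            idles += 1
        else:                              # nothing fits into the remaining slack
            return seq, max(temps), slack


# Hypothetical task set: name -> (execution time in s, average power in W)
BASE = {"A": (0.05, 20.0), "B": (0.10, 10.0), "C": (0.08, 30.0), "D": (0.03, 5.0)}
SEQ, PEAK, LEFT = sequence_and_scale(BASE, deadline=0.32, p_idle=1.0, min_idle=0.01)
```

Re-running the sequencing step after every change is what lets a newly inserted idle task or a slowed-down task move to a better position in the next iteration.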
6.5 Optimal Voltage Scaling
We would like to perform a fair comparison of the combined task sequencing and volt-
age scaling algorithm proposed in the previous section (Algorithm 2) with pure voltage
scaling approaches proposed in the literature. Among previous works, the one with a
problem formulation closest to ours is [104]. However, [104] performs voltage assignment
with the objective of minimizing execution time under peak temperature constraint. In
contrast, we consider the dual problem of minimizing peak temperature under execution
time constraint. Therefore, we develop an optimal pseudo-polynomial time algorithm
(based on dynamic programming) to solve our voltage assignment problem. Our algorithm
is inspired by [104].
The problem formulation is identical to the formulation discussed in Section 6.4.1 with
one major difference: The input is a fixed task sequence L = 〈J1, · · · , JN 〉 instead of a
set of tasks. The goal is to produce an assignment of idle times to the sequence and/or
voltage levels to the tasks so as to minimize the peak temperature while satisfying the
deadline.
To incorporate idle times or sleep modes in the formulation, we consider a sequence of
M = 2N+1 jobs 〈S1, · · · , SM 〉. The task Si refers to the task Ji/2 in the original sequence
when i is even. The task Si, when i is odd, denotes an idle task with an execution time
in the range [0, slack] and power consumption Pidle. We assume q distinct values in
increments of minidle in the range [0, slack], namely [t1, t2, · · · , tq] where t1 = 0. Clearly,
if Si chooses an idle time of t1 = 0, then it implies the processor does not enter the idle
state between tasks J(i−1)/2 and J(i+1)/2 in the original sequence.
We now describe our dynamic programming algorithm that performs voltage assignment
to minimize the peak temperature. Our algorithm is based on the following observation.
Given multiple voltage assignments for a sequence of i tasks with the same final tem-
perature and peak temperature, the voltage assignment that results in the smallest total
execution time is preferred.
Let $E_i(T_{max}, T_f)$ represent the minimum total execution time over voltage assignments
and sleep states for the tasks $S_1, \ldots, S_i$, $1 \le i \le M$, with maximum observed
temperature $T_{max}$ and final temperature $T_f$. If no such voltage assignment exists then
$E_i(T_{max}, T_f) = \infty$. Note that $E_i(T_{max}, T_f) = \infty$ when $T_{max} < T_f$ as $T_{max} < T_f$ is
not feasible. Before we derive the recurrence equations for $E_i(T_{max}, T_f)$, we need to
introduce a new function.
Given a final temperature $T_f$ and power consumption $P$ over $c$ time units, the function
$T_{init}(T_f, P, c)$ returns the initial temperature as an inverse of Eqn 6.2.
$$T_{init}(T_f, P, c) = \begin{cases}
RP + T_{amb} - (RP + T_{amb} - T_f)\, e^{c/RC} & \text{if } RP + T_{amb} \ge T_f \\
\text{undefined} & \text{otherwise}
\end{cases} \tag{6.9}$$
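Eqn 6.9 can be checked against Eqn 6.2 by a round trip: inverting a final temperature and then running the task forward again should return the same final temperature. In this sketch the function names and the ambient temperature are assumptions, and `None` stands in for the undefined case.

```python
import math

R, C, T_AMB = 1.83, 0.1122, 45.0   # R, C as in Section 6.6; T_AMB is assumed


def final_temp(t_init, power, duration):
    """Eqn 6.2: temperature after running at the given power for duration seconds."""
    t_steady = power * R + T_AMB
    return t_steady - (t_steady - t_init) * math.exp(-duration / (R * C))


def initial_temp(t_final, power, duration):
    """Eqn 6.9: inverse of Eqn 6.2; None when R*P + T_amb < T_f (undefined)."""
    t_steady = power * R + T_AMB
    if t_steady < t_final:
        return None
    return t_steady - (t_steady - t_final) * math.exp(duration / (R * C))


# A 20 W task that ends at 70 degC after 0.1 s must have started cooler than 70 degC.
T_START = initial_temp(70.0, 20.0, 0.1)
```

Note the sign change in the exponent: going backwards in time amplifies the distance from the steady state temperature instead of shrinking it.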
We are now ready to present the recurrence equations for Ei(Tmax, Tf ) on a case by case
basis.
Case 1: $i$ is even and $T_{max} > T_f$. When $i$ is even, $S_i$ represents an original task
and has a choice of $r$ active states with power consumption $P_{i/2,j}$ and execution time
$c_{i/2,j}$ for $1 \le j \le r$. Hence, in the recurrence of $E_i(T_{max}, T_f)$, we consider all possible
voltage assignments for the task $S_i$. For each voltage assignment, we calculate the initial
temperature $T_{init}$ at the end of $S_{i-1}$ based on Eqn 6.9 such that the final temperature
at the end of $S_i$ is $T_f$. Moreover, as $T_{max} > T_f$ and the temperature at the end of task
$S_i$ is $T_f$, the peak temperature $T_{max}$ must have been reached during the execution of the
tasks $S_1, \cdots, S_{i-1}$.
The total execution time for one such voltage assignment $j$, $1 \le j \le r$, is given by
$E_{i-1}\big(T_{max}, T_{init}(T_f, P_{i/2,j}, c_{i/2,j})\big) + c_{i/2,j}$. The assignment is feasible only if this value
does not exceed the deadline. Among the feasible assignments we select the one with minimum
total execution time. Thus
$$E_i(T_{max}, T_f) = \min_{1 \le j \le r} \Big\{ E_{i-1}\big(T_{max}, T_{init}(T_f, P_{i/2,j}, c_{i/2,j})\big) + c_{i/2,j}
\;\text{ if }\; E_{i-1}\big(T_{max}, T_{init}(T_f, P_{i/2,j}, c_{i/2,j})\big) + c_{i/2,j} \le deadline \Big\} \tag{6.10}$$
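Case 2: $i$ is even and $T_{max} = T_f$. Similar to the previous case, we can choose
among $r$ active states. The key difference from the previous case is that the maximum
temperature $T_{max}$ in $E_i(T_{max}, T_f)$ is reached at the end of the task $S_i$. Hence for the tasks
$S_1, \cdots, S_{i-1}$, we need to consider all possible voltage assignments with peak temperature
$T_m \le T_{max}$. The final recurrence is given by
$$E_i(T_{max}, T_f) = \min_{\substack{1 \le j \le r \\ T_m \le T_{max}}} \Big\{ E_{i-1}\big(T_m, T_{init}(T_f, P_{i/2,j}, c_{i/2,j})\big) + c_{i/2,j}
\;\text{ if }\; E_{i-1}\big(T_m, T_{init}(T_f, P_{i/2,j}, c_{i/2,j})\big) + c_{i/2,j} \le deadline \Big\}$$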
Case 3: $i$ is odd and $T_{max} > T_f$. Here the recurrence is similar to Case 1. But as $i$
is odd, $S_i$ is an idle task with power consumption $P_{idle}$ and $q$ possible options $[t_1, \ldots, t_q]$ for
execution time in the range $[0, slack]$. The recurrence is given by
$$E_i(T_{max}, T_f) = \min_{1 \le j \le q} \Big\{ E_{i-1}\big(T_{max}, T_{init}(T_f, P_{idle}, t_j)\big) + t_j
\;\text{ if }\; E_{i-1}\big(T_{max}, T_{init}(T_f, P_{idle}, t_j)\big) + t_j \le deadline \Big\}$$
Case 4: $i$ is odd and $T_{max} = T_f$. Here the recurrence is similar to Case 2. But as $i$
is odd, $S_i$ is an idle task with power consumption $P_{idle}$ and $q$ possible options $[t_1, \ldots, t_q]$ for
execution time in the range $[0, slack]$. The recurrence is given by
$$E_i(T_{max}, T_f) = \min_{\substack{1 \le j \le q \\ T_m \le T_{max}}} \Big\{ E_{i-1}\big(T_m, T_{init}(T_f, P_{idle}, t_j)\big) + t_j
\;\text{ if }\; E_{i-1}\big(T_m, T_{init}(T_f, P_{idle}, t_j)\big) + t_j \le deadline \Big\}$$
Base Case and Optimal Solution The base case considers all possible $q$ values
of execution time $[t_1, \ldots, t_q]$ for the first idle task $S_1$ with all possible values of initial
temperature $T_0$. For each case we determine the final temperature $T_f$ and initialize
$E_1(T_f, T_f)$. Then the dynamic programming algorithm can be employed to compute
$E_i(T_{max}, T_f)$ for $2 \le i \le M$. The interval for $T_0$ is derived from the interval of $T_f$.
The interval of $T_f$ and $T_{max}$ spans from the ambient temperature to the
steady state temperature of the hottest task in the input set.
The optimal solution is the one with the lowest value of $T_{max}$ among all feasible solutions
$E_M(T_{max}, T_f)$ where $E_M(T_{max}, T_f) \ne \infty$ and $T_f \le T_0$. The second constraint ensures
that the schedule is repeatable. The complexity of the algorithm is
$O(interval(T_f) \times interval(T_{max}) \times M \times \max(r, q))$. The algorithm is pseudo-polynomial,
and polynomial time approximations can be developed [104].
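On very small instances, the decision space of this formulation (one active state per task and one idle duration per gap, M = 2N + 1 jobs in total) can be searched exhaustively, which gives a reference against which an implementation of the recurrences can be compared. The sketch below is that exhaustive reference and not the dynamic program itself. It assumes the V²f and 1/f scaling rules, hypothetical voltage levels and ambient temperature, and it evaluates each candidate in the repeating steady state (T_0 equal to the final temperature) instead of enumerating initial temperatures.

```python
import itertools
import math

R, C, T_AMB = 1.83, 0.1122, 45.0           # T_AMB is an assumed value
LEVELS = [(1.2, 1.5e9), (1.0, 1.0e9)]      # (V, f), highest first; second level is hypothetical


def scaled(c, p, lvl):
    """Assumed scaling: execution time ~ 1/f, power ~ V^2 * f."""
    (v1, f1), (vj, fj) = LEVELS[0], LEVELS[lvl]
    return c * f1 / fj, p * (vj ** 2 * fj) / (v1 ** 2 * f1)


def peak(jobs):
    """Steady state peak of a repeating list of (exec_time, power); empty jobs are skipped."""
    jobs = [(c, p) for c, p in jobs if c > 0]
    a, b = 0.0, 1.0
    for c, p in jobs:
        m = math.exp(-c / (R * C))
        a, b = (1 - m) * (p * R + T_AMB) + m * a, b * m
    t, hottest = a / (1 - b), float("-inf")
    for c, p in jobs:
        m = math.exp(-c / (R * C))
        t = (1 - m) * (p * R + T_AMB) + m * t
        hottest = max(hottest, t)
    return hottest


def brute_force_voltage_scaling(seq, deadline, p_idle, idle_options):
    """seq: fixed list of (c, P) at the highest level. Returns (peak, levels, idle_times)."""
    n, best = len(seq), None
    for lv in itertools.product(range(len(LEVELS)), repeat=n):
        run = [scaled(c, p, l) for (c, p), l in zip(seq, lv)]
        busy = sum(c for c, _ in run)
        for idle in itertools.product(idle_options, repeat=n + 1):
            if busy + sum(idle) > deadline + 1e-12:
                continue                               # misses the deadline
            jobs = [(idle[0], p_idle)]                 # S_1 is an idle slot
            for k, job in enumerate(run):
                jobs += [job, (idle[k + 1], p_idle)]   # task, then the next idle slot
            cand = (peak(jobs), lv, idle)
            if best is None or cand[0] < best[0]:
                best = cand
    return best


# Hypothetical fixed sequence: (execution time in s, average power in W)
SEQ = [(0.05, 20.0), (0.08, 30.0), (0.10, 10.0)]
BEST = brute_force_voltage_scaling(SEQ, deadline=0.27, p_idle=1.0,
                                   idle_options=[0.0, 0.01, 0.02])
```

The search visits r^N × q^(N+1) candidates, so it is usable only for a handful of tasks; the dynamic program avoids this blow-up by indexing partial solutions by (T_max, T_f).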
6.6 Experimental Evaluation
In this section, we evaluate our thermal management approach. We use the SimpleScalar-3.0
[23] toolset for our experimental evaluation. The power consumptions of the tasks are
obtained from Wattch [22], an architectural level power simulator. We use a base supply
voltage of 1.2 V and a processor frequency of 1.5 GHz. For voltage scaling results, we use
five different frequency values between 1.5 GHz and 800 MHz. The corresponding supply
voltages are interpolated from the settings in [40]. Wattch includes a set of configurable
libraries to compute the power consumption of the different micro-architectural units. We
configure these libraries to model a simple embedded processor architecture resembling
ARM Cortex A8 [2]: in-order issue with two integer execution units, a 13 stage pipeline,
32 KB instruction and data caches and a 512 entry branch target buffer.
The temperature values are obtained from the HotSpot thermal simulator [89]. The floorplan
and silicon area of the ARM Cortex A8 processor are provided as input to the thermal
simulator. In a typical processor package, the heat flow occurs in the vertical direction
from the processor to the spreader through the interface material, followed by heat flow
from the spreader to the heat sink, and finally to the ambient environment. The heat
flow can be modeled as a thermal RC-network by computing the thermal resistances
and capacitances at each level. We use the equations and default configuration from
HotSpot [89] to compute the thermal resistances and capacitances for our input chip area
at each layer. This gives us an RC-network with a thermal resistance of 1.83 °C/Watt and
a capacitance of 112.2 mJoules/°C. We use these as the default values for our experiments.
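As a quick check on the scale implied by these values, the thermal time constant of the lumped model follows directly from R and C. The 0.1 s task used in the second line is a hypothetical example, not one of the benchmarks.

```latex
\[
RC = 1.83\ \tfrac{^{\circ}\mathrm{C}}{\mathrm{W}} \times 0.1122\ \tfrac{\mathrm{J}}{^{\circ}\mathrm{C}}
   \approx 0.205\ \mathrm{s}
\]
\[
c_i = 0.1\ \mathrm{s} \;\Rightarrow\; m_i = e^{-c_i/RC} = e^{-0.1/0.205} \approx 0.61,
\qquad 1 - m_i \approx 0.39
\]
```

In terms of Eqn 6.5, such a task would end at a temperature made up of roughly 39% of its own steady state temperature and 61% of the temperature at which it started, which is why the preceding tasks matter for tasks of this length.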
To collect the temperature values, we follow the methodology discussed in HotSpot. The
power consumption is collected every $10^5$ cycles and provided as input to the thermal
model. The thermal simulation is done in two phases. In the first phase the steady state
temperature is obtained from the average power consumption of all the samples. In the
second phase, transient simulation is performed by feeding the steady state temperature
from the first phase as the initial temperature [89].

Label   Tasks
T1      lame, sha, djpeg, mp3, ghostscript, blowfish, dijkstra, epic
T2      gsm, patricia, adpcm, pegwit, susan, crc, dijkstra, epic
T3      g721, lame, sha, djpeg, pegwit, blowfish, adpcm, lame
T4      gsm, patricia, pegwit, mp3, susan, blowfish, strsearch, epic
T5      lame, g721, ghostscript, patricia, blowfish, strsearch, pegwit, sha
T6      gsm, mp3, ghostscript, susan, crc, stringsearch, dijkstra, epic
T7      gsm, sha, strsearch, pegwit, mp3, susan, blowfish, patricia
T8      g721, gsm, sha, djpeg, patricia, adpcm, pegwit, strsearch
Table 6.1: Representative task sets.
We use a total of 16 benchmarks from MiBench [39] and MediaBench [53] in this study.
The tasks have execution lengths in the range of $4 \times 10^7$ to $6 \times 10^8$ cycles. Each task was
run with all possible inputs provided in the standard benchmark distribution to obtain
the execution times and power consumptions. We investigated the temperature behavior
of these tasks in Chapter 3. The results showed that there is significant variation in the
steady state temperature among the different benchmarks (the standard deviation is 12 °C).
We also note that there is very little intra-task temperature variation once the steady state
temperature is reached. For our experiments we create 100 task sets, each with eight tasks
chosen from the 16 benchmarks listed in Table 3.1. Some representative task sets are
shown in Table 6.1.
6.6.1 Task Sequencing Algorithm
We evaluate our task sequencing algorithm by comparing the peak temperature of the
sequence produced by our algorithm with (a) the peak temperature of the best sequence,
(b) the peak temperature of the worst sequence, and (c) the average value of the peak























T1 T2 T3 T4 T5 T6 T7 T8
Task Set
Figure 6.5: Accuracy of task sequencing Algorithm.
8! (each task set has 8 tasks) possible task sequences and obtain the peak temperature
of each sequence through thermal simulation. From these simulation results for each
task set, we get the peak temperatures of the best sequence (sequence with lowest peak
temperature), the worst sequence (sequence with highest peak temperature) and the ex-
pected value of peak temperature (average peak temperature over all possible sequences).
Finally, we employ our task sequencing algorithm described in Section 6.3 to construct
a sequence L that we expect will minimize the peak temperature. We estimate the peak
temperature of the sequence L returned by our algorithm through thermal simulation.
The results for the eight representative task sets in Table 6.1 are shown in Figure 6.5.
Our task sequencing algorithm achieves significantly lower peak temperature compared
to the worst sequence and the expected value of peak temperature. More importantly,
the peak temperature of the sequence constructed by our algorithm is very close to the
peak temperature of the best sequence. The same trends are reflected when we consider
all the 100 task sets. When all the 100 task sets are considered, our algorithm has, on
an average, a peak temperature 7.47oC lower than the worst sequence and 4.09oC lower
than the expected value of peak temperature. The peak temperature of the sequence
returned by our algorithm is within 0.5oC of the peak temperature of the best sequence
for all the 100 task sets. In the next subsection we examine the impact of voltage scaling
techniques on the peak temperature.
Task sequencing does not work well when the task set contains a very long running
high temperature task, i.e., when the hottest task in the system runs long enough
to reach steady state independent of the initial temperature. However, in our simulations
we found that the hottest tasks in the benchmark suite do not run long enough for the
temperature to reach the steady state independent of the initial temperature. Thus the
temperature reached by the hottest tasks in the system is dependent on the temperature



























Figure 6.6: Advantage of combined sequencing and voltage scaling (seq+vs) over
voltage scaling alone (x-axis: task sets T1 to T8).
We compare the peak temperature returned by our iterative task sequencing and voltage
scaling algorithm (Section 6.4.2) with the optimal voltage scaling algorithm (Section 6.5).
The result of the optimal voltage scaling depends on the task sequence provided as
input. For each task set, we provide the best and the worst task sequence (at the
highest voltage/frequency level) as input to the voltage scaling algorithm. These two,
(Best + VS) and (Worst + VS) represent the best and worst possible scenarios if only
voltage scaling is used. The slack is assumed to be 5% of the total execution time of
the tasks at highest frequency. The results for the task sets in Table 6.1 are presented
in Figure 6.6. Our iterative algorithm that combines voltage scaling and sequencing
performs better than even the best possible results with voltage scaling alone. On
average, our algorithm (Seq + VS) results in a peak temperature that is 2.1°C lower than
the best scenario (Best + VS) and 6.94°C lower than the worst scenario (Worst + VS).
Our algorithm for task sequencing is more efficient than the optimal voltage assignment
even though it is iterative in nature. Its runtime (average 1.45 sec) is much
lower than that of the optimal voltage assignment (average 78.57 sec) on a 3 GHz Pentium
4 machine with 1 GB memory.



























Figure 6.7: Impact of task sequencing on the choice of thermal resistance (x-axis: thermal resistance in °C/W, 1.00 to 3.00).
We have shown that the task ordering has significant impact on the peak temperature.
An indirect benefit of our task sequencing algorithm is that we can lower the packaging
and cooling costs due to lower peak temperature. To quantify this benefit, we compute
the peak temperature of a task set at different thermal resistance values. The thermal
resistance depends on a number of factors such as the packaging and sink parameters.
A higher value of thermal resistance implies lower cooling and packaging cost [89].
Figure 6.7 plots the peak temperature of the worst sequence, the best sequence, the sequence
returned by our task sequencing algorithm, and the expected value of peak temperature
over all possible sequences for task set T1. As expected, the peak temperature
increases with higher values of thermal resistance (cheaper packaging). The trend is
similar for other task sets.
Consider a specific value of the peak temperature, say 85°C. The peak temperature of
85°C is reached at the following thermal resistance values: (a) 2.33°C/W with our
sequencing algorithm, (b) 1.89°C/W for the worst sequence, and (c) 2.03°C/W for
a sequence with the expected value of peak temperature. That is, for the same thermal
limit, the worst case sequence requires a thermal resistance 19% lower (costlier
packaging) than our sequencing algorithm, and the average sequence requires one 13% lower.


























Figure 6.8: Impact of slack amount on voltage scaling (x-axis: slack, 0% to 10%).
The amount of slack available to be exploited has direct impact on the reduction in peak
temperature achieved by the voltage scaling approaches. We study the impact of slack
amount on (a) our iterative algorithm (Seq + VS), (b) optimal voltage scaling algorithm
with the best sequence as input (Best + VS), and (c) optimal voltage scaling algorithm
with the worst sequence as input (Worst + VS). The results are summarized in Figure
6.8. The slack is expressed as a percentage of the total execution time of all the tasks executing
at maximum frequency. Again, task set T1 is used, though we see a similar trend for
other task sets.
As expected, the peak temperature decreases with increasing slack value for all the ap-
proaches. More importantly, our algorithm (Seq + VS) achieves lower peak temperature
than pure voltage scaling at any slack value. By performing task sequencing and voltage
assignments in a continuous fashion, our algorithm exploits the interdependence between
the two steps and results in lower values of peak temperature compared to voltage scaling
alone. Interestingly, at a low slack value (say 1%), if the designer inadvertently provides
the worst task sequence to the voltage scaling algorithm, the peak temperature can be
9°C higher compared to our algorithm. This clearly shows that task sequencing is a
powerful mechanism that is complementary to voltage scaling in thermal management.
6.7 Summary
In this chapter, we presented task sequencing as a powerful mechanism for thermal
management in static embedded systems. We formalized the problem of task sequencing
to minimize temperature using our analytical model and presented a heuristic to sequence
the tasks. We also examined the interaction between task sequencing and voltage scaling
and presented a combined approach that outperforms optimal voltage scaling.
In the next chapter we present a scheduling technique that is applicable to more generic
embedded systems such as PDAs, low functionality laptops and others. In such systems,
though the functionality of the system is known at design time, the exact operating
conditions do not remain static. For instance, for a PDA the set of possible applications
might be known at design time, but the exact tasks executing at any point on
a PDA depend on the usage scenario. Such systems also have a mixture of soft real
time (media player) and best effort applications (text search, virus scan, etc.) executing





In the previous chapter, we presented a scheduling strategy for thermal management
of static embedded systems. Such techniques are applicable in the context of simple
fixed functionality systems such as industrial controllers where the tasks executing on
the systems are fixed and periodic. Such systems have very light weight non-preemptive
schedulers [104].
In this chapter, we present a more generic approach applicable in the context of more
dynamic embedded systems such as cell phones, PDAs, and others. Such systems have
characteristics in between pure static embedded systems and general purpose systems.
Unlike general purpose systems, the set of applications that might execute on a cell
phone/PDA can be known a priori and they can be profiled extensively offline. However,
unlike pure static embedded systems, the applications active at any point in the embedded
system depend on the usage scenario. Moreover, such systems include a mixture of
soft real time tasks and best effort tasks. We explain this with a typical scenario.
A typical usage scenario for a PDA could include one or more of these applications
executing simultaneously. The user could be playing a video (audio/video decoder),
while syncing e-mail over the internet (crc decoding + text processing) as virus
scanning software (string search) executes on incoming/outgoing e-mail attachments.
This usage scenario includes a set of tasks executing simultaneously on the system while
a different usage scenario might result in a different set of tasks. In addition, if we
observe the previous scenario, the requirements of each of these tasks in the system are
different. The audio/video decoder must satisfy real time constraints, i.e., ideally the
processing of each audio/video frame must be complete in time to meet the display rates
(30 frames per second). The best effort tasks do not have any real time constraints but
the scheduler must try to ensure reasonable response time for these applications.
Clearly, static non-preemptive scheduling approaches such as Time Division Multiplexing
(TDM) are not suitable for these systems as the task model is not periodic and the set
of tasks in the system at any point is variable. Hence dynamic preemptive schedulers
are commonly used in such systems. In addition to variable task sets, some tasks in the
system have real time constraints while others require the best possible response time
(best effort applications). To satisfy the requirements of different classes of applications, a
hierarchy of schedulers is employed [37]. In this chapter we present a temperature aware
hierarchical scheduler which schedules both soft real time and best effort applications. The
goals of this scheduler are those of conventional embedded hierarchical schedulers, i.e., (i) to
ensure soft real time tasks meet their deadlines and (ii) to ensure the best possible response times
for the best effort tasks, together with (iii) keeping the temperature of the processor below the threshold.
Our scheduler is based on the observations in Chapter 3. We observed that given a
task set comprising heterogeneous tasks, temperature can be controlled by changing
the relative shares of execution time given to the hot and cold tasks. Our scheduler
controls the temperature by (i) characterizing online the thermal properties of the
executing tasks with the help of a predictive thermal model and (ii) using this characterization to
control the processor temperature by changing the shares of execution time given to different
tasks and employing voltage scaling when required. Next we review the related work.
7.1 Related Work
Increasing power density and on-chip temperature have made temperature management
an important aspect of computer design. High performance general purpose systems
are equipped with on-chip temperature sensors which are monitored continuously and
when the temperature exceeds a predefined threshold, techniques are engaged to lower
the temperature [89]. However, there are a few limitations in hardware based thermal
management. First, hardware based thermal management solutions are generic in nature
and often involve considerable additional hardware for DTM controllers, which is often
not required in simpler embedded processors where the workload executing on the
processors is known in advance. Second, hardware based DTM schemes are unpredictable
and try to maximize performance, so by design they cannot directly satisfy system level
requirements such as real time constraints, fairness and others. Hence software based
thermal management solutions are commonly employed, either
independently or in addition to hardware based thermal management. In this chapter
we present a software based thermal management scheme. We review the related work
in three closely related directions, namely (i) scheduling driven approaches for thermal
management in general purpose systems, (ii) thermal management solutions for hard real
time systems, and (iii) application specific approaches for soft real time systems.
7.1.1 General Purpose Scheduler Driven Thermal Management
Scheduling driven approaches have been proposed which involve modifications to the
standard Linux scheduler for thermal management. Such approaches generally include a
thermal model to classify tasks into hot and cold tasks and a scheduling policy to maintain
temperature. Kumar et al. [87] employ a regression model that takes hardware
performance counters as input to classify tasks as hot or cold. The priority
of the hot tasks is lowered whenever the temperature exceeds a predefined threshold.
Similarly, Merkel et al. [62] employ performance counters to guide processor assignment
and migrations in multi-processor SMP systems. Yeo et al. [100] employ a predictive
thermal model which observes the change in temperature over a window of time to
characterize applications and employ a heuristic to change task priorities and migrate
tasks. Our approach also employs a thermal model to perform online task characterization.
Zhu et al. [106] present an online thermal management strategy for heterogeneous
3D systems. In a 3D system, different cores in the system have different thermal (cooling)
characteristics. They modify the Linux scheduler to perform (i) appropriate task assignment
to cores and (ii) appropriate choice of voltage level for each core with the aim of
maximizing the performance of the system. Srinivasan et al. [67] present a software
based approach for voltage scaling in a multi-core system. Their objective is to determine
the optimal operating frequencies (voltages) for different cores in a multi-core system
to satisfy thermal constraints. They model the problem of voltage assignment to cores
as a convex optimization problem. The convex optimization problem is solved offline
for different combinations of input values (workloads, initial temperature, etc.) and the
results are stored in a table. Appropriate settings are chosen online by comparing the
state of the system against the values in the table. All the above mentioned approaches
use task scheduling and voltage scaling to maximize the performance of the system under
thermal constraints. In contrast to these schemes, our scheme employs task scheduling
and voltage scaling to satisfy a wide variety of system level requirements such as real
time constraints, throughput, fairness and others in an online setting.
7.1.2 Thermal Management Approaches for Hard Real Time Systems
In hard real time systems, thermal management approaches try to satisfy thermal con-
straints while maintaining real time constraints. Wang et al. [96] employ a simple DVS
scheme for thermal management in real time systems where the processor starts execut-
ing at a given safe voltage/frequency once the temperature hits the threshold. Clearly
such a scheme cannot directly guarantee that real time constraints would be met as the
constant safe frequency might not be enough to satisfy the real time constraints. Hence
they analytically derive the conditions under which real time constraints can be satisfied
with such a scheme. Chantem et al. [25] use integer linear programming to determine
task mappings and schedules in an MPSoC to satisfy real time constraints. In contrast to
these works, which involve modifications to hard real time schedulers, our approach deals
with a mixture of soft real time tasks and best effort tasks. Moreover, whereas the above
mentioned works primarily exploit voltage scaling for thermal management, our approach
exploits both task heterogeneity and voltage scaling for thermal management.
7.1.3 Thermal Management for Media Applications
Many researchers have explored thermal management solutions for soft real time applications.
The application model for soft real time applications is a set of frames; each frame
has a processing time and an associated deadline, and the frame is dropped if it
is not processed before the deadline. Lee et al. [54] specialize clock gating for MPEG-2
thermal management, where periods of frame processing are interspersed with idle time
to keep the temperature of the processor below the threshold. Yeo et al. [103] predict
the processor workload for frame processing and employ voltage scaling to maintain the
temperature below the threshold while minimizing frame loss. Srinivasan et al. [91]
adapt the architecture (window size etc.) to minimize the frame processing time
for media applications while maintaining temperature constraints. In contrast to media
specific thermal management approaches, our technique is a scheduler driven approach




















Figure 7.1: Temperature aware scheduling framework
7.2 Temperature Aware Scheduling Framework and Thermal Model
The goal of our thermal management framework is to maintain the temperature below
the threshold while satisfying a variety of system level scheduling requirements such
as throughput, fairness and real time constraints. Power consumption and hence the
thermal profiles vary between different tasks. Given a set of tasks with varying ther-
mal profiles, the temperature can be controlled by varying the relative amount of time
for which hot tasks and cold tasks execute. Our thermal aware scheduling framework
exploits this observation in conjunction with voltage scaling to maintain the processor
temperature below the maximum specified temperature.
Figure 7.1 shows the outline of our thermal management framework. The framework
consists of a predictive thermal model and the temperature aware scheduler. The pre-
dictive thermal model is used to characterize the thermal properties of each task and also
predict the change in temperature when a task executes starting from an initial temper-
ature. The temperature aware scheduler uses the thermal properties of the tasks from
the model and the task execution time requirements to determine the time for which
each task executes. Our framework consists of both real time and best effort tasks. The
scheduler ensures that real time tasks meet the deadline and for best effort tasks, the
goal is to maximize the throughput while maintaining a user supplied level of fairness.
For maintaining fairness as well as to maintain real time constraints, our scheduler ex-
ploits dynamic voltage and frequency scaling. Next we present our thermal model and
Section 7.3 presents our temperature aware scheduler.
7.2.1 Thermal Model
In this section, we present a thermal model to predict the processor temperature at any
point during the execution of a specific application. We use a predictive thermal model
which models the temperature profile of a given application as an exponential function
of the form [104]
$T(t) = T_s - (T_s - T_{init}) \times e^{-Kt}$ (7.1)
where Ts is the steady state temperature of the application which is defined as the
temperature the processor would reach if the application executes indefinitely, T (t) is
the temperature of the processor after the application executes for t time units, Tinit
is the initial temperature, and K is a processor specific and application independent
constant.
The value of the application independent processor specific constant K can be determined
by fitting the observed temperature profiles for different applications into the exponential
function. This process is done offline and the computed value of K is used in the predictive
thermal model. For our processor model (see footnote 1), we compute K = 0.00472.
The steady state temperature of an application can be determined online by observing the
temperature change over a period of time when the application executes and rearranging
Equation 7.1.
$T_s = \frac{T_c - T_{init} \times e^{-Kc}}{1 - e^{-Kc}}$ (7.2)
where Ts is the steady state temperature of the application, Tc is the temperature after
the application executes for c time units, Tinit is the initial temperature before the
application starts execution, and K is the application independent processor specific
constant that is computed offline.
Footnote 1: The details of the processor model are presented in the experimental section.
Once the steady state temperature of the application is known, Equation 7.1 can be used
to predict the change in temperature when this application executes starting from any
initial temperature.
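The two equations can be exercised with a few lines of Python. This is a minimal sketch rather than the thesis implementation: the function names and the sample temperatures are illustrative assumptions, and time is assumed to be measured in the same unit in which K = 0.00472 was fitted.

```python
import math

K = 0.00472  # processor constant reported in the text


def predict_temperature(t_init, t_steady, t, k=K):
    """Equation 7.1: temperature after the application runs for t time units."""
    return t_steady - (t_steady - t_init) * math.exp(-k * t)


def estimate_steady_state(t_init, t_c, c, k=K):
    """Equation 7.2: steady state temperature from one observation window of length c."""
    decay = math.exp(-k * c)
    return (t_c - t_init * decay) / (1.0 - decay)


# Round trip with made-up numbers: observe 20 time units of a task whose
# steady state is 85 degC, starting from 60 degC, then recover the steady state.
t_observed = predict_temperature(60.0, 85.0, 20.0)
t_steady_estimate = estimate_steady_state(60.0, t_observed, 20.0)
```

Because Equation 7.2 is an exact rearrangement of Equation 7.1, the round trip recovers the assumed steady state up to floating point error; with real sensor readings the estimate would inherit the sensor noise.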
Accuracy of the Prediction Model In order to check the accuracy of the predictive
thermal model, we run a set of embedded benchmarks on an ARM Cortex A8 [2]
like embedded processor model and observe the temperature profiles of the processor.
We compare the observed temperature from these runs with the temperatures predicted
from the model. For each application, we obtained the temperature curve from HotSpot
with a sampling interval of 1 millisecond. We also applied our model to predict the
temperature variations and compared the temperature curves from the model and from
HotSpot. For each benchmark, we measured the peak error, that is, the temperature
difference at the point at which the predicted and the observed curves diverge the most.
The maximum peak error is 0.6°C and the average peak error is 0.14°C across all the
benchmarks. Hence our model provides sufficient accuracy for software based thermal
management. In the next section we present our temperature aware scheduler.
7.3 Temperature Aware Scheduling
An overview of our thermal aware scheduler is shown in Figure 7.2. Our system consists
of soft real-time (multimedia) and best-effort tasks. Our soft real time tasks comprise
periodic multimedia tasks that release a job per period, e.g., decoding a video frame
every 30 ms. We employ a hierarchical scheduling structure typically used in multimedia
systems [37, 73, 82].
The thermal aware scheduler consists of two sub-schedulers to handle soft real time tasks




















Figure 7.2: Temperature aware scheduling Policy
scheduler consists of a frame execution time predictor and a soft real time scheduler.
The execution requirements for the next frame are predicted using a frame execution
time predictor. We employ the histogram based method for execution time prediction
proposed in [102] for its accuracy and ease of implementation. The predicted frame
execution times are given as input to the soft real time scheduler which schedules the
soft real time tasks. We employ a simple static priority soft real time scheduler in our
scheme where the audio decoding task has a higher priority than the video decoding
task.
Our thermal aware scheduler has an additional thermal adjustment phase. This phase
takes the soft real time schedule and the predicted frame execution time requirements as
input and has two main parts: (i) ensure that the temperature during the soft real time tasks remains below the
threshold, applying frequency/voltage scaling to the soft real time tasks if necessary, and (ii) compute the
starting temperature (Treq) for the next period so that the temperature of the soft real
time tasks remains below the threshold. The slack and the required temperature (Treq)
are provided as input to the best effort task scheduler. We employ a modified version of
a round robin scheduler as our best effort scheduler. Our best effort scheduler classifies
tasks into hot and cold tasks and controls temperature by changing the execution time
provided to the hot and cold tasks. Next we present the thermal adjustment phase
employed in conjunction with the soft real time scheduler.
7.3.1 Thermal Adjustment Phase
The thermal adjustment phase takes the frame execution time prediction and soft real
time schedule as input and performs the following tasks: (i) compute Treq, the starting
temperature for the next set of soft real time tasks such that the temperature during the
next invocation remains below the threshold, and (ii) ensure that the current set of soft real time
tasks maintains the temperature below the threshold, applying voltage scaling or dropping frames
if necessary. We explain the case of two real time tasks R1 and R2 in the remainder of
this discussion, but the scheme can be extended to multiple soft real time tasks.
Computing Treq
Treq is defined as the maximum initial temperature such that the execution of the next
real-time task(s) is guaranteed not to exceed Tmax. As the execution time and period of
real-time tasks are known, it is easy to compute Treq. For example, suppose the system
will execute two soft real-time tasks for t1 and t2 time units in the near future with steady
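state temperatures Ts1 and Ts2. Requiring the temperature at the end of the second task to
equal Tmax and inverting Equation 7.1 for each task in turn gives
$T_{req} = T_{s1} - (T_{s1} - T_{s2})\,e^{Kt_1} - (T_{s2} - T_{max})\,e^{K(t_1+t_2)}$ (7.3)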
This can be easily extended to multiple soft real-time tasks.
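The extension to several tasks can be sketched by inverting Equation 7.1 one task at a time, starting from Tmax at the end of the last task and walking backwards. The code below is an illustrative sketch under that reading: the function names and sample values are assumptions, and it only pins the temperature at the end of the last task to Tmax, without checking intermediate task boundaries.

```python
import math

K = 0.00472  # processor constant reported in the text


def required_start_temperature(t_max, tasks, k=K):
    """Work backwards from t_max through (steady_state_temp, duration) pairs.

    Each step inverts Equation 7.1: T_start = Ts - (Ts - T_end) * exp(K * t).
    """
    temp = t_max
    for t_steady, duration in reversed(tasks):
        temp = t_steady - (t_steady - temp) * math.exp(k * duration)
    return temp


def final_temperature(t_start, tasks, k=K):
    """Forward application of Equation 7.1 over the same task list."""
    temp = t_start
    for t_steady, duration in tasks:
        temp = t_steady - (t_steady - temp) * math.exp(-k * duration)
    return temp


# Two soft real time tasks with made-up steady states and durations.
upcoming = [(84.0, 10.0), (88.0, 8.0)]
t_req = required_start_temperature(80.0, upcoming)
t_end = final_temperature(t_req, upcoming)  # lands on 80.0 up to rounding
```

Starting the task pair at any temperature below `t_req` leaves the final temperature below Tmax, since Equation 7.1 is monotonic in the initial temperature.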
Voltage Scaling Soft Real Time Tasks
This phase also uses the model to check whether the execution of the soft real time tasks keeps the
temperature below the threshold. For instance, suppose the present temperature of
the system is Tinit and there are two soft real-time tasks that will run for t1 and t2 time units in the
near future with steady state temperatures Ts1 and Ts2. The temperatures at the end of
Task 1 and Task 2 are given by
$T_1 = T_{s1} - (T_{s1} - T_{init})\,e^{-Kt_1}$ (7.4)
$T_2 = T_{s2} - (T_{s2} - T_1)\,e^{-Kt_2}$ (7.5)
If T1 < Tmax and T2 < Tmax then the phase computes the slack and presents the slack
and Treq to the best effort scheduler. If either of the tasks exceeds the threshold, then
the corresponding task's frequency is lowered to the next lower frequency level.
After lowering the frequency, the temperature and deadline constraints are verified. If
either constraint is not met, the frame is dropped. At the end of the thermal
adjustment phase, a feasible soft real time schedule with frequency levels for each task
as well as the corresponding slack and Treq values have been computed. The slack and Treq
values are sent to the best effort scheduler, which uses them for scheduling the best effort
tasks. Next we present our best effort scheduler.
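The check, lower, verify, drop sequence of this phase can be sketched as follows. Several details here are assumptions made only for illustration and are not stated in the text: each task carries a list of (duration, steady state temperature) pairs ordered from the highest to the lowest frequency, all tasks share one period as their deadline, and a dropped frame consumes no processor time and leaves the temperature unchanged.

```python
import math

K = 0.00472  # processor constant reported in the text


def end_temp(t_init, t_steady, duration, k=K):
    """Equation 7.1 evaluated at the end of a task."""
    return t_steady - (t_steady - t_init) * math.exp(-k * duration)


def thermal_adjust(t_init, tasks, t_max, period, k=K):
    """Pick a frequency level per soft real time task, or drop its frame.

    tasks: list of (name, levels); levels is a list of (duration, steady_temp)
    ordered from the highest frequency to the lowest (assumed input format).
    Returns (schedule, slack, final_temp); schedule[name] is the chosen level
    index, or None when the frame is dropped.
    """
    schedule, temp, used = {}, t_init, 0.0
    for name, levels in tasks:
        schedule[name] = None
        # Try the highest frequency first, then only the next lower level.
        for idx, (duration, t_steady) in enumerate(levels[:2]):
            t_end = end_temp(temp, t_steady, duration, k)
            # Temperature moves monotonically within a task, so checking the
            # end point is enough when the start is already below t_max.
            if t_end < t_max and used + duration <= period:
                schedule[name], temp, used = idx, t_end, used + duration
                break
    return schedule, period - used, temp


# Made-up task data: R1 overheats at full speed and is scaled down.
rt_tasks = [("R1", [(10.0, 95.0), (18.0, 70.0)]),
            ("R2", [(8.0, 85.0), (15.0, 65.0)])]
plan, slack, t_final = thermal_adjust(79.5, rt_tasks, 80.0, 30.0)
```

In this example `plan` maps R1 to level 1 (the scaled version) and R2 to level 0, and the remaining slack is what would be handed to the best effort scheduler together with Treq.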
7.3.2 Best Effort Scheduler
We first categorize the best-effort tasks into hot tasks and cold tasks. Pc is a cold task
if its steady state temperature is below Treq as it would cool down the system. Similarly,
Ph is a hot task if its steady state temperature is above Treq as it may heat up the system
beyond Treq.
We observe that a schedule alternating between hot and cold tasks provides a good
solution. The scheduler considers a pair of tasks (one hot and one cold) at a time. The
problem now boils down to dividing up the CPU share between these two tasks so as to keep
the temperature below Tmax throughout and the temperature at the end of the schedule below
Treq. If the set of best effort tasks consists of only hot tasks then our best effort scheduler


















Figure 7.3: CPU share between hot and cold tasks
7.3.3 CPU Share between a Hot and Cold Task
Given (1) current temperature Tcurr, (2) a hot task (Ph) with steady state temperature
Tsh, and (3) a cold task (Pc) with steady state temperature Tsc, the goal of the scheduler
is to allocate N time units between Ph and Pc so as to maintain the system temperature
below Tmax and the temperature at the end of the schedule is less than Treq. In particular,
we determine the maximum share 0 ≤ β ≤ 1 that can be allocated to the hot task (i.e., it
executes for βN) while maintaining the system temperature below Tmax and temperature
at the end of the schedule is less than Treq.
Case 1: Tcurr ≥ Tsc In this case, the cold task should be scheduled first to cool down
the system and maximize the share for the hot task. Figure 7.3(a) shows the temperature
curve over N time units. Tmid is the temperature after executing the cold task and Tfin
is the final temperature.
$T_{fin} = T_{sh} - (T_{sh} - T_{mid}) \times e^{-K\beta N}$ (7.6)
$T_{mid} = T_{sc} - (T_{sc} - T_{curr}) \times e^{-K(1-\beta)N}$ (7.7)
Clearly, the temperature constraints are satisfied if Tfin < Treq. Hence, the maximum






where C1 = Tsh − Treq; C2 = Tsh − Tsc; C3 = Tcurr − Tsc
Case 2: Tcurr < Tsc In this case the maximum share for the hot task is obtained
when it is scheduled first. This scenario is shown in Figure 7.3(b). Here the temperature
is guaranteed to be below Tmax if Tmid <= Tmax and the final temperature constraint is
satisfied if Tfin <= Treq. So the value of β can be obtained from
$T_{mid} = T_{sh} - (T_{sh} - T_{curr}) \times e^{-K\beta N}$ (7.9)
$T_{fin} = T_{sc} - (T_{sc} - T_{mid}) \times e^{-K(1-\beta)N}$ (7.10)












β = min(β1, β2) (7.13)
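The share computation for both cases can be sketched in Python. The closed forms used below are re-derived here from Equations 7.6, 7.7, 7.9 and 7.10 by setting Tfin = Treq (and Tmid = Tmax in Case 2) and solving for β, so they may differ in form from the thesis equations; the function names and the assumption Tsc < Treq < Tsh with Treq ≤ Tmax are also introduced only for this sketch.

```python
import math

K = 0.00472  # processor constant reported in the text


def simulate(t_curr, first_ts, second_ts, first_time, second_time, k=K):
    """Apply Equation 7.1 to two back-to-back tasks; returns (T_mid, T_fin)."""
    t_mid = first_ts - (first_ts - t_curr) * math.exp(-k * first_time)
    t_fin = second_ts - (second_ts - t_mid) * math.exp(-k * second_time)
    return t_mid, t_fin


def hot_share(t_curr, t_sh, t_sc, t_req, t_max, n, k=K):
    """Largest share beta in [0, 1] of n time units that the hot task can get.

    Assumes t_sc < t_req < t_sh, t_req <= t_max and t_curr <= t_max.
    """
    kn = k * n
    if t_curr >= t_sc:
        # Case 1: cold task first, hot task second; require T_fin <= T_req.
        c1, c2, c3 = t_sh - t_req, t_sh - t_sc, t_curr - t_sc
        beta = math.log(c2 / (c1 + c3 * math.exp(-kn))) / kn
    else:
        # Case 2: hot task first, cold task second.
        # beta1 keeps T_mid <= T_max; it only binds if the hot task can reach T_max.
        if t_sh > t_max:
            beta1 = math.log((t_sh - t_curr) / (t_sh - t_max)) / kn
        else:
            beta1 = 1.0
        # beta2 keeps T_fin <= T_req.
        numerator = (t_req - t_sc) * math.exp(kn) + t_sh - t_curr
        beta2 = math.log(numerator / (t_sh - t_sc)) / kn
        beta = min(beta1, beta2)
    return max(0.0, min(1.0, beta))


# Case 1 example with made-up temperatures: cold task runs (1 - beta) * N first.
beta = hot_share(75.0, 88.0, 65.0, 78.0, 80.0, 100.0)
t_mid, t_fin = simulate(75.0, 65.0, 88.0, (1 - beta) * 100.0, beta * 100.0)
```

Running the forward simulation with the returned share is a convenient sanity check: in the Case 1 example the final temperature lands on Treq, which is what a maximal share should produce.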
Best Effort Scheduling Policy The run queue consisting of the ready tasks is split
into two queues corresponding to the hot and cold tasks, respectively. The scheduler
also keeps track of the CPU share given to each task so far. Whenever the scheduler is
invoked, it selects the task with least share in the hot queue (Ph) and the task with the
least share in the cold queue (Pc). Let N be the scheduling unit. The maximum share,
β, that can be allocated to Ph in the next 2N time units is determined using Eqn 7.8
or Eqn 7.13. Our best effort scheduler examines the CPU share allocated to both the
tasks and tries to maintain fairness while ensuring that the hot task gets no more than
β × 2N time units.
Enforcing Fairness The scheduling scheme discussed earlier cannot ensure fairness
as it gives higher preference to cold tasks in trying to keep the system temperature
below Tmax. To obtain a tradeoff between throughput and fairness, we employ selective
voltage scaling for the hot tasks in conjunction with our thermal-aware scheduler. We
assume that the processor supports two voltage levels Vmin and Vmax with corresponding
frequencies fmin and fmax. To ensure fairness, we define minimum share smin for any
task. If the current share of a hot task is below smin, then its voltage scaled version
is transferred to the cold queue. The parameter smin represents the tradeoff between
fairness and throughput. Given an aggressive value of smin, the system spends most of
its time in voltage scaled mode thus reducing throughput. A smaller value of smin, in
contrast, may lead to unfairness towards the hot tasks.
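The queue handling described in the last two paragraphs can be sketched as below. This is an illustration only: the `Task` record, the reading of "share" as the fraction of CPU time received so far, and the function names are assumptions, and the sketch does not model how the voltage scaled version changes the task's steady state temperature.

```python
from dataclasses import dataclass


@dataclass
class Task:
    name: str
    share: float                 # fraction of CPU time received so far (assumed definition)
    voltage_scaled: bool = False


def enforce_fairness(hot, cold, s_min):
    """Move hot tasks whose share is below s_min to the cold queue, voltage scaled."""
    still_hot = []
    for task in hot:
        if task.share < s_min:
            task.voltage_scaled = True
            cold.append(task)
        else:
            still_hot.append(task)
    return still_hot, cold


def pick_pair(hot, cold):
    """Least-served task from each queue; None when a queue is empty."""
    p_h = min(hot, key=lambda t: t.share) if hot else None
    p_c = min(cold, key=lambda t: t.share) if cold else None
    return p_h, p_c


# Made-up shares: sha has been starved and is moved to the cold queue.
hot_queue = [Task("sha", 0.10), Task("susan", 0.30)]
cold_queue = [Task("crc", 0.35)]
hot_queue, cold_queue = enforce_fairness(hot_queue, cold_queue, s_min=0.2)
p_h, p_c = pick_pair(hot_queue, cold_queue)
```

With smin = 0 no task is ever moved, which corresponds to the throughput-oriented setting; larger values of smin move more hot tasks into voltage scaled mode.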
7.4 Experimental Evaluation
Setup We use the SimpleScalar 3.0 architectural simulator with a configuration similar to
an ARM Cortex A8 embedded processor [2]. The processor model is an in-order dual issue
processor with 32 KB instruction/data caches, a 512 entry branch misprediction buffer
and a 13 entry branch misprediction pipeline. The temperature values are obtained
using HotSpot-3.0 [89], an architecture-level thermal simulator working in conjunction
with Wattch [22], an architecture-level power simulator. We use Wattch's linear scaling
to obtain the power consumption at 1.2 V and 1.5 GHz [2], and for voltage scaling we use a
lower operating frequency of 800 MHz, which we find sufficient to remove all temperature
violations [87]. We use a thermal resistance of 1.83°C/W, a thermal capacitance of
112.4 mJ/°C and an ambient temperature of 40°C. We assume that the temperature
should not exceed 80°C based on the cooling solution. The benchmarks selected from the
MiBench, MediaBench and EEMBC benchmark suites have steady state temperatures
in the range 63.65°C to 88.5°C. We have tasks with low (patricia, gs, adpcm), medium
(jpeg, mpeg, mp3, blowfish, crc) and high (rijndael, sha, susan) thermal profiles.
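As a rough cross-check that is not part of the original evaluation, the setup values can be plugged into a single-node lumped RC approximation of the package. Under that assumption the decay constant is 1/(RC) and the steady state temperature is the ambient temperature plus R times the power. The sketch below, with illustrative function names, also assumes that K in the thermal model is expressed per millisecond.

```python
R_TH = 1.83        # thermal resistance from the setup, degC/W
C_TH = 0.1124      # thermal capacitance from the setup, J/degC (112.4 mJ/degC)
T_AMBIENT = 40.0   # ambient temperature from the setup, degC


def rc_decay_constant_per_ms(r_th=R_TH, c_th=C_TH):
    """Decay constant 1/(RC) of a single-node RC model, converted to 1/ms."""
    return 1.0 / (r_th * c_th) / 1000.0


def steady_state_power(t_steady, r_th=R_TH, t_ambient=T_AMBIENT):
    """Average power (W) implied by a steady state temperature: (Ts - Tamb) / R."""
    return (t_steady - t_ambient) / r_th


k_rc = rc_decay_constant_per_ms()
p_coolest = steady_state_power(63.65)
p_hottest = steady_state_power(88.5)
```

Under these assumptions `k_rc` comes out within a few percent of the fitted K = 0.00472, and the reported steady state range corresponds to roughly 13 W to 26.5 W of average power.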
We create eight task sets using different combinations of these benchmarks as shown in
Task Set   Soft Real-Time   Best Effort
S1         mpeg, mp3        sha, jpeg, adpcm, crc
S2         mpeg, mp3        rijndael, susan, patricia, gs
S3         mpeg, mp3        susan, jpeg, blowfish, gs
S4         mpeg, mp3        rijndael, sha, adpcm, patricia
S5         -                jpeg, susan, crc, gs
S6         -                sha, rijndael, crc, gs
S7         -                sha, susan, jpeg, patricia
S8         -                susan, rijndael, jpeg, blowfish
Table 7.1: Composition of task sets
           Throughput                      Fairness
Task Set   TAS(0)  TAS(0.2)  DVS   CG      TAS(0)  TAS(0.2)  DVS   CG
S1         1.0     0.96      0.91  0.82    0.94    0.97      0.98  0.94
S2         1.0     0.92      0.89  0.84    0.89    0.96      0.98  0.92
S3         1.0     0.89      0.86  0.82    0.84    0.94      0.98  0.94
S4         1.0     0.98      0.94  0.88    0.91    0.96      0.97  0.90
S5         1.0     1.0       0.95  0.87    0.97    0.97      0.98  0.91
S6         1.0     0.93      0.92  0.89    0.88    0.96      0.98  0.96
S7         1.0     0.96      0.90  0.81    0.89    0.92      0.98  0.90
S8         0.82    0.82      0.84  0.73    0.99    0.99      0.99  0.96
Table 7.2: Throughput and fairness of the thermal-aware scheduler (TAS) with smin = 0
and smin = 0.2, and of the DTM schemes (DVS and CG)
Table 7.1. Each task set contains applications with varying thermal characteristics. Four
of these task sets have soft real-time applications (mpeg and mp3) while the other four
have only best-effort applications. We assume a frame period of 30 ms for mpeg and 26
ms for mp3. Task sets containing real-time applications are simulated until 450 video or
audio frames complete decoding. Task sets containing only best effort applications are
simulated for a total of 500 time slices (each time slice = 20 ms). Of these task sets, S8
consists of only hot applications, and hence in this case our scheduler performs scheduling
by using voltage scaled versions of the available hot tasks as cold tasks.
Traditional Thermal Management: Our temperature-aware scheduler (TAS) maintains
the temperature below the threshold by appropriately scheduling the best-effort
tasks. We compare it against a standard round robin (RR) scheduler for the best-effort
tasks with a time slice of 20 ms. The RR scheduler cannot guarantee that the temperature
will not exceed the threshold, and hence a dynamic thermal management (DTM) scheme
is needed alongside it. We consider two DTM schemes with the RR scheduler: dynamic
voltage scaling (DVS) and global clock gating (CG). DVS lowers the voltage/frequency
of the processor whenever the system hits the threshold temperature. In the case of CG,
the processor global clock is gated (i.e., the processor remains idle). Once a DTM
mechanism is engaged, the system begins to cool down. The normal operating voltage and
frequency are resumed once the system temperature goes sufficiently below the maximum
temperature. We implement a binary DVS scheme [87] that has been shown to perform as
well as multi-level DVS schemes.
Figure 7.4: Temperature profile for TAS
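The reactive policy described above (throttle when the threshold is hit, resume once the temperature is sufficiently below it) can be sketched as a two-state controller. This is a minimal illustration and not the implementation from [87]: the class name `BinaryDVS`, the trigger temperature of 85 and the 2 degree resume margin are hypothetical values chosen for the example. Only the two frequencies (1.5 GHz and 800 MHz) are taken from the text.

```python
from dataclasses import dataclass

F_HIGH_MHZ = 1500  # normal operating frequency
F_LOW_MHZ = 800    # scaled-down frequency used while throttled


@dataclass
class BinaryDVS:
    t_trigger: float         # threshold temperature in deg C (hypothetical)
    hysteresis: float        # required cooling below the trigger before resuming (assumed)
    throttled: bool = False  # current controller state

    def step(self, temperature: float) -> int:
        """Return the frequency (MHz) to use for the next interval."""
        if not self.throttled and temperature >= self.t_trigger:
            self.throttled = True    # threshold hit: engage DVS
        elif self.throttled and temperature <= self.t_trigger - self.hysteresis:
            self.throttled = False   # cooled down enough: resume full speed
        return F_LOW_MHZ if self.throttled else F_HIGH_MHZ


ctrl = BinaryDVS(t_trigger=85.0, hysteresis=2.0)
trace = [80.0, 84.9, 85.0, 84.5, 83.5, 83.0, 82.0]  # made-up sensor readings
freqs = [ctrl.step(t) for t in trace]
print(freqs)  # [1500, 1500, 800, 800, 800, 1500, 1500]
```

A global clock gating controller has the same structure, with the throttled state corresponding to an idle processor rather than a lower frequency.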
Maintaining Temperature: Figure 7.4 shows the temperature profiles for our
thermal-aware scheduling scheme as well as the temperature profile if no DTM scheme is
present. We observe that TAS is able to keep the temperature below Tmax for all the
task sets, either by changing the share of processing time given to hot and cold tasks or
by appropriately scaling the operating frequency.
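To see why adjusting the hot/cold share controls temperature, consider a simplified model. It is an assumption made for this illustration, not the model used by TAS: when time slices are short relative to the thermal time constant, the long-run temperature is approximated by the share-weighted average of the steady-state temperatures of the hot and cold tasks. The function name `max_hot_share` and the threshold of 80 are hypothetical; the two steady-state temperatures are the extremes of the range reported for the benchmarks.

```python
def max_hot_share(t_hot, t_cold, t_max):
    """Largest share of time the hot task may receive under the averaging model.

    Solves s * t_hot + (1 - s) * t_cold <= t_max for s, clamped to [0, 1].
    """
    if t_hot <= t_max:
        return 1.0  # even the hot task alone stays below the limit
    if t_cold >= t_max:
        return 0.0  # no share works; frequency scaling is needed instead
    return (t_max - t_cold) / (t_hot - t_cold)


# Steady-state extremes from the text; T_MAX is a hypothetical threshold.
T_HOT, T_COLD, T_MAX = 88.5, 63.65, 80.0
share = max_hot_share(T_HOT, T_COLD, T_MAX)
print(f"hot task share <= {share:.2f}")
```

With these numbers the hot task can receive roughly two thirds of the processing time. When every available task is hot (the second branch), changing shares cannot help, which corresponds to the case where the operating frequency has to be scaled instead.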
Throughput: Let tmin and tmax be the time the system spends at the lower (800 MHz) and
higher (1.5 GHz) frequency, respectively, for DVS and TAS. Note that TAS does not lower
the voltage as a consequence of hitting the threshold temperature. Instead, it selectively
lowers the voltage of hot tasks to ensure fairness. For both cases, the throughput is defined as
\[ \mathrm{Throughput} = \frac{t_{max} + 0.53 \times t_{min}}{t_{max} + t_{min}} \]
Table 7.2 shows the throughput for TAS, DVS and CG. We use two versions of the TAS
scheme with different values of smin, which controls fairness (see Section 7.3.3). The first
version (smin = 0) maximizes the throughput while ignoring fairness. The second version
(smin = 0.2) introduces a voltage-scaled version of a hot task when its share drops below
0.2.
The results show that TAS performs better than clock gating in all cases. It performs
better than DVS in cases where there is at least one cold task (S1-S7). In the case
where only hot tasks are available (S8), TAS gives almost the same throughput as DVS.
As expected, the throughput of TAS drops as the minimum expected share (smin) is
increased from 0 to 0.2.
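The throughput numbers can be related to the time spent at each frequency with a small helper. This is a sketch under a stated assumption: time at 800 MHz is weighted by 800/1500 (about 0.53) and the result is normalized by the total time tmax + tmin. The function names are hypothetical, and the normalization is a reading of the metric rather than a confirmed definition.

```python
F_LOW_MHZ = 800.0
F_HIGH_MHZ = 1500.0
LOW_SPEED = F_LOW_MHZ / F_HIGH_MHZ  # relative speed at the lower frequency, about 0.53


def throughput(t_max, t_min):
    """Work done relative to running at the higher frequency the whole time."""
    total = t_max + t_min
    if total <= 0:
        raise ValueError("total time must be positive")
    return (t_max + LOW_SPEED * t_min) / total


def low_freq_fraction(tp):
    """Fraction of time at the lower frequency implied by a throughput value."""
    return (1.0 - tp) / (1.0 - LOW_SPEED)


# Example: 6 time units at 1.5 GHz and 4 at 800 MHz.
tp = throughput(6.0, 4.0)
print(round(tp, 3), round(low_freq_fraction(tp), 3))
```

Under this assumption, a throughput of 1.0 means the system never left the higher frequency, and lower values translate directly into the fraction of time spent voltage-scaled. The helper does not apply to clock gating, where the processor idles rather than running slowly.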
Fairness: The share of a best-effort task Pi is given by
\[ s(i) = \frac{t_{max}(i) + 0.53 \times t_{min}(i)}{\sum_{j=1}^{Q} \left( t_{max}(j) + 0.53 \times t_{min}(j) \right)} \]
where tmax(i) and tmin(i) are the amounts of time task Pi spends at maximum and
minimum voltage, respectively, and Q is the number of best-effort tasks. Based on the
share given to each best-effort application, our metric for fairness is





where sfair is the expected fair share.
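The per-task shares that feed into the fairness metric can be computed as follows. This is a sketch with labeled assumptions: each task's work is tmax(i) + 0.53 × tmin(i) as in the share formula, shares are normalized over the Q best-effort tasks so that they sum to one, and the expected fair share is taken to be 1/Q. The function name `shares` and the example times are hypothetical. The fairness metric itself is not computed here.

```python
LOW_SPEED = 0.53  # weight of time spent at minimum voltage, as in the share formula


def shares(times):
    """times: list of (t_max_i, t_min_i) pairs, one per best-effort task."""
    work = [t_max + LOW_SPEED * t_min for t_max, t_min in times]
    total = sum(work)
    if total <= 0:
        raise ValueError("at least one task must have run")
    return [w / total for w in work]  # assumption: normalize over all Q tasks


# Hypothetical example: two cold tasks always at full voltage and two hot
# tasks that spend 60% of their time voltage-scaled.
times = [(100.0, 0.0), (100.0, 0.0), (40.0, 60.0), (40.0, 60.0)]
s = shares(times)
s_fair = 1.0 / len(times)  # assumption: equal split among the Q tasks
print([round(x, 3) for x in s], s_fair)
```

In this example the voltage-scaled tasks end up below the equal split and the full-speed tasks above it, which is the kind of deviation a fairness metric based on s(i) and sfair would penalize.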
Table 7.2 shows the fairness metric for TAS, CG, and DVS. TAS with smin = 0 maximizes
the throughput but is less fair than DVS. However, with smin = 0.2, the fairness
improves and becomes comparable to that of DVS, while the throughput remains higher
than that of DVS. As the scheduler tries to be fairer (higher smin), the throughput drops.
Thus our predictive scheme can provide a tradeoff between system throughput and fairness
when operating under a thermal constraint. A reactive DVS scheme, on the other hand,
operates at only one of these points.
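The tradeoff can be summarized by averaging the columns of Table 7.2 over the eight task sets. The snippet below only re-uses the reported numbers; the averaging itself is an illustrative summary and the names `column_means`, `avg_throughput` and `avg_fairness` are hypothetical.

```python
from statistics import mean

SCHEMES = ("TAS(0)", "TAS(0.2)", "DVS", "CG")

# Rows S1..S8 of Table 7.2, columns in the order of SCHEMES.
THROUGHPUT = [
    (1.0, 0.96, 0.91, 0.82),
    (1.0, 0.92, 0.89, 0.84),
    (1.0, 0.89, 0.86, 0.82),
    (1.0, 0.98, 0.94, 0.88),
    (1.0, 1.0, 0.95, 0.87),
    (1.0, 0.93, 0.92, 0.89),
    (1.0, 0.96, 0.90, 0.81),
    (0.82, 0.82, 0.84, 0.73),
]
FAIRNESS = [
    (0.94, 0.97, 0.98, 0.94),
    (0.89, 0.96, 0.98, 0.92),
    (0.84, 0.94, 0.98, 0.94),
    (0.91, 0.96, 0.97, 0.90),
    (0.97, 0.97, 0.98, 0.91),
    (0.88, 0.96, 0.98, 0.96),
    (0.89, 0.92, 0.98, 0.90),
    (0.99, 0.99, 0.99, 0.96),
]


def column_means(rows):
    """Average each scheme's column over all task sets."""
    return {name: mean(col) for name, col in zip(SCHEMES, zip(*rows))}


avg_throughput = column_means(THROUGHPUT)
avg_fairness = column_means(FAIRNESS)
for name in SCHEMES:
    print(f"{name:9s} throughput={avg_throughput[name]:.3f} fairness={avg_fairness[name]:.3f}")
```

On these averages, raising smin from 0 to 0.2 trades some throughput for fairness, while TAS with smin = 0.2 still has a higher average throughput than DVS and CG.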
7.5 Summary
In this chapter we presented a software based thermal management technique for
embedded processors. Our technique involves modifications to a hierarchical
scheduler which can handle both soft real time (multimedia) and best effort tasks. Our
temperature aware scheduler can handle multiple system level requirements such as soft
real time constraints, fairness and others while maintaining the temperature below the
threshold. For a given level of fairness, our temperature aware scheduler can provide
better throughput than hardware based DTM schemes. Moreover, our scheme includes
a dynamically tunable parameter which can provide good throughput while maintaining
a desired level of fairness.
Chapter 8
Conclusion
This chapter concludes the thesis by summarizing the major results and outlining
directions for future work.
8.1 Summary of the Thesis
This thesis focusses on the design of workload centric thermal management approaches
for processors. Modern computer systems are severely thermally constrained and one of
the key challenges in the design of these systems is how to maximize the performance of
the system while keeping the temperature under acceptable limits. Solutions and tech-
niques that are aimed at keeping processor temperature under control are referred to as
thermal management solutions. Recently thermal management has been a very actively
researched area and a number of thermal management solutions have been proposed.
These approaches monitor the temperature of the processor using on-chip sensors and
employ mechanisms to reduce the temperature when it exceeds the threshold.
Most of the previously proposed thermal management solutions have focussed on (i) the
choice of appropriate mechanisms to control temperature, and (ii) determining appropriate
control algorithms or heuristics to employ the mechanism. The second aspect of the
design is especially challenging because any mechanism to reduce the temperature of the
chip entails a performance loss. Thus, the main objective is to choose control algorithms
or heuristics that adjust the severity of the response to the severity of thermal stress.
In this thesis we take a complementary approach to thermal management. We observe
that the thermal behavior of a processor system is highly dependent on the properties
of the workload and the hardware configuration. Thus temperature can be controlled
either by (i) altering the workload or (ii) altering the hardware configuration. Using this
observation we design two classes of thermal management techniques.
The first class of techniques is entirely software driven and controls temperature by
altering the workload executing on the processor in a multi-tasking system. In a
multi-tasking system, the workload executing on the processor is controlled by the scheduler
that decides on the allocation of the CPU to different processes in the system. Our
thermal management strategies operate in conjunction with the scheduler and construct
a thermally efficient schedule.
The second class of techniques uses a combination of hardware and software (hybrid) to
dynamically determine the best possible processor configuration for a given workload.
Hardware based schemes such as dynamic voltage and frequency scaling (DVFS), fetch
throttling, clock gating and others have previously been proposed for thermal management.
However, these approaches have been viewed as competing alternatives, and thermal
management solutions have focussed on comparing these techniques and determining the most
appropriate mechanism. We observe that when a combination of such techniques is
employed together, they work synergistically and provide a highly efficient thermal
management solution. The key challenge in such a scenario is to manage the explosion in
the configuration space. Our techniques sift dynamically through a large configuration
space and determine the most thermally efficient setting. Our key results can
be summarized as follows:
• The temperature of a set of tasks executing on a processor is highly dependent
on the order of execution of the tasks. We design a thermal management strategy
that determines the thermally optimal ordering for a set of tasks. On average,
our technique can lower the temperature of the processor by 4.09°C without any
impact on performance.
• We also observe that the temperature of a multitasking system is sensitive to the
shares of execution time between hot and cold tasks in the system. Based on this
observation, we design a thermal management strategy that manages temperature
by controlling the relative shares of execution time between hot and cold tasks.
Our thermal management strategy provides better performance than more compli-
cated schemes while maintaining a host of scheduling requirements such as fairness
and real time constraints.
• We observe that configuring multiple hardware parameters simultaneously has a
large impact on performance and temperature. We design a software based thermal
management solution that employs multiple thermal control mechanisms simulta-
neously and our framework results in a 39% reduction in overhead in comparison
to the best known existing technique.
• We extend our thermal management strategy to multi-core systems and design
a software based thermal management scheme for them. Our strategy is
simpler to implement and results in significantly better throughput than the
best performing thermal management scheme for multi-core systems.
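The first result, that execution order alone changes the peak temperature, can be illustrated with a toy model. The sketch below is not the ordering algorithm of this thesis: it assumes a first-order thermal model in which the temperature moves exponentially toward the steady-state temperature of the running task, uses hypothetical task names, steady-state temperatures, start temperature and decay factor, and finds the best order by brute force over all permutations.

```python
import math
from itertools import permutations

# Hypothetical steady-state temperatures in deg C (two hot and two cold tasks).
STEADY_STATE = {"hot_a": 88.0, "hot_b": 85.0, "cold_a": 65.0, "cold_b": 64.0}
DECAY = math.exp(-0.5)  # exp(-slice / tau), assuming slice / tau = 0.5
T_START = 70.0          # assumed temperature before the first task runs


def peak_temperature(order, t_start=T_START):
    """Peak temperature when each task in `order` runs for one slice."""
    temp = peak = t_start
    for task in order:
        t_ss = STEADY_STATE[task]
        # First-order model: move toward the task's steady-state temperature.
        temp = t_ss + (temp - t_ss) * DECAY
        peak = max(peak, temp)
    return peak


orders = list(permutations(STEADY_STATE))
best = min(orders, key=peak_temperature)
worst = max(orders, key=peak_temperature)
print(best, round(peak_temperature(best), 2))
print(worst, round(peak_temperature(worst), 2))
```

In this toy setting, running the hot tasks back to back produces a noticeably higher peak than interleaving them with cold tasks, even though every ordering does the same amount of work. Brute force is only practical for a handful of tasks and stands in here for a real ordering strategy.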
8.2 Future Work
In this thesis we have established the need for a workload driven approach for thermal
management of high performance computer systems. Exploiting flexibility either in terms
of altering the workload (scheduler driven approaches) or hardware can help manage
temperature effectively without compromising on performance.
One avenue for future work is a detailed exploration of the boundaries between scheduling
driven and hardware thermal management solutions. Hardware DTM mechanisms can
provide an immediate response to a thermal emergency, while scheduling driven mechanisms
respond more slowly but help shape the long term thermal profile. All previously explored
solutions including ours have been either entirely hardware based or scheduling based
solutions. Examining the boundary and synergy between hardware and scheduling based
approaches would help us design better thermal management strategies.
A natural extension of our solutions would be thermal management of heterogeneous
multi-core systems where not all cores are equal. A typical example of such a system
is the Intel Core i7 [5] processor, which is an SMT processor with four cores and two SMT
threads per core. In such a system, the software sees eight logical cores, but not all of
them are identical, because two SMT contexts share the same physical core. From
the perspective of thermal management there is an asymmetry, since two logical cores that
share the same physical core would show more temperature dependence than two logical
cores on distinct physical cores. Extending our framework to such a setting would be an
interesting avenue for future work.
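The asymmetry between logical cores can be made explicit with a small topology map. This is a sketch only: the numbering in which logical cores 2k and 2k+1 are SMT siblings is an assumption for illustration (real systems enumerate logical processors differently), and the function names are hypothetical.

```python
from collections import defaultdict

PHYSICAL_CORES = 4     # four physical cores, as in the example above
THREADS_PER_CORE = 2   # two SMT threads per core


def physical_core(logical_id):
    """Assumed numbering: logical cores 2k and 2k+1 share physical core k."""
    return logical_id // THREADS_PER_CORE


def share_physical_core(a, b):
    """True if two logical cores are SMT siblings and hence strongly coupled."""
    return physical_core(a) == physical_core(b)


# Group the eight logical cores by the physical core they run on.
siblings = defaultdict(list)
for logical in range(PHYSICAL_CORES * THREADS_PER_CORE):
    siblings[physical_core(logical)].append(logical)

print(dict(siblings))              # {0: [0, 1], 1: [2, 3], 2: [4, 5], 3: [6, 7]}
print(share_physical_core(0, 1))   # True
print(share_physical_core(1, 2))   # False
```

A thermal-aware scheduler for such a system could use a map like this to avoid placing two hot tasks on sibling logical cores.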
In this thesis we have focussed on techniques that optimize the performance of the
computer system under thermal constraints. But of late power dissipation and energy
consumption have become important design parameters. The techniques proposed in this
thesis can be extended to handle multiple objective optimizations such as optimizing for
a combination of energy consumption, performance and temperature.
Bibliography
[1] AMD Phenom II X4 and AMD Phenom II X3 Processors. http://www.amd.com/
us-en/Processors/ProductInformation/0,,30_118_15331_15917,00.html.
[2] ARM Cortex A8 Processor. http://www.arm.com/products/CPUs/ARM_
Cortex-A8.html.
[3] BSIM Device Models. http://www-device.eecs.berkeley.edu/~bsim3/.
[4] IBM cools 3-D chips with H2O. http://www.zurich.ibm.com/news/08/3D_
cooling.html.
[5] Intel Core i7 Processor. http://www.intel.com/products/processor/corei7/
index.htm.
[6] Intel Core2 Duo Processor. http://www.intel.com/products/processor/
core2duo/index.htm.
[7] Intel. Pentium 4 Processor. http://www.intel.com/products/processor/
pentium4/specs.htm.
[8] Matlab Neural Network Toolbox. www.mathworks.com/access/helpdesk/help/
pdf_doc/nnet/nnet.pdf.
[9] Semiconductor research corporation packing thrust strategic needs. http://www.
src.org/fr/S200504packaging_needs.pdf, 2005.
[10] SPEC CPU2000 Benchmark Suite. http://www.spec.org/cpu2000/.
[11] TSMC Device Scaling Trends. http://www.tsmc.com/english/b_technology/
b01_platform/b0101_advanced.htm.
[12] David H. Albonesi, Rajeev Balasubramonian, Steven G. Dropsho, Sandhya
Dwarkadas, Eby G. Friedman, Michael C. Huang, Volkan Kursun, Grigorios Magk-
lis, Michael L. Scott, Greg Semeraro, Pradip Bose, Alper Buyuktosunoglu, Pe-
ter W. Cook, and Stanley E. Schuster. Dynamically tuning processor resources
with adaptive processing. Computer, 36(12), 2003.
[13] R. Iris Bahar and Srilatha Manne. Power and energy reduction via pipeline bal-
ancing. In ISCA ’01: Proceedings of the 28th annual international symposium on
Computer architecture, 2001.
[14] Rajeev Balasubramonian, David Albonesi, Alper Buyuktosunoglu, and Sandhya
Dwarkadas. Memory hierarchy reconfiguration for energy and performance in
general-purpose processor architectures. In MICRO 33: Proceedings of the 33rd
annual ACM/IEEE international symposium on Microarchitecture, 2000.
[15] Anirban Basu, Sheng-Chih Lin, Vineet Wason, Amit Mehrotra, and Kaustav
Banerjee. Simultaneous optimization of supply and threshold voltages for low-
power and high-performance circuits in the leakage dominant era. In DAC ’04:
Proceedings of the 41st annual conference on Design automation, 2004.
[16] Reinaldo Bergamaschi, Indira Nair, Gero Dittmann, Hiren Patel, Geert Janssen,
Nagu Dhanwada, Alper Buyuktosunoglu, Emrah Acar, Gi-Joon Nam, Dorothy
Kucar, Pradip Bose, John Darringer, and Guoling Han. Performance modeling for
early analysis of multi-core systems. In CODES+ISSS ’07: Proceedings of the 5th
IEEE/ACM international conference on Hardware/software codesign and system
synthesis, pages 209–214, New York, NY, USA, 2007. ACM.
[17] Ramazan Bitirgen, Engin Ipek, and Jose F. Martinez. Coordinated management of
multiple interacting resources in chip multiprocessors: A machine learning approach.
In MICRO 41: Proceedings of the 41st annual ACM/IEEE international sympo-
sium on Microarchitecture, 2008.
[18] Manjit Borah, Robert Michael Owens, and Mary Jane Irwin. Transistor sizing for
minimizing power consumption of cmos circuits under delay constraint. In ISLPED
’95: Proceedings of the 1995 international symposium on Low power design, 1995.
[19] Shekhar Borkar. Design challenges of technology scaling. IEEE Micro, 19(4), 1999.
[20] Shekhar Borkar. Designing reliable systems from unreliable components: The
challenges of transistor variability and degradation. IEEE Micro, 25(6), 2005.
[21] David Brooks and Margaret Martonosi. Dynamic thermal management for high-
performance microprocessors. In HPCA ’01: Proceedings of the 7th International
Symposium on High-Performance Computer Architecture, 2001.
[22] David Brooks, Vivek Tiwari, and Margaret Martonosi. Wattch: a framework for
architectural-level power analysis and optimizations. SIGARCH Comput. Archit.
News, 28(2), 2000.
[23] Doug Burger and Todd M. Austin. The simplescalar tool set, version 2.0.
SIGARCH Comput. Archit. News, 25(3), 1997.
[24] Alper Buyuktosunoglu, David Albonesi, Stanley Schuster, David Brooks, Pradip
Bose, and Peter Cook. A circuit level implementation of an adaptive issue queue
for power-aware microprocessors. In GLSVLSI ’01: Proceedings of the 11th Great
Lakes symposium on VLSI, 2001.
[25] Thidapat Chantem, Robert P. Dick, and X. Sharon Hu. Temperature-aware
scheduling and assignment for hard real-time applications on mpsocs. In DATE
’08: Proceedings of the conference on Design, automation and test in Europe, 2008.
[26] Pedro Chaparro, Jose Gonzalez, and Antonio Gonzalez. Thermal-aware clustered
microarchitectures. In ICCD ’04: Proceedings of the IEEE International Confer-
ence on Computer Design, 2004.
[27] Yen-Kuang Chen and S. Y. Kung. Trend and challenge on system-on-a-chip de-
signs. J. Signal Process. Syst., 53(1-2), 2008.
[28] Aviad Cohen, Lev Finkelstein, Avi Mendelson, Ronny Ronen, and Dmitry
Rudoy. On estimating optimal performance of cpu dynamic thermal management.
IEEE Comput. Archit. Lett., 2(1), 2003.
[29] Ayse Kivilcim Coskun, Tajana Simunic Rosing, and Kenny C. Gross. Proactive
temperature management in mpsocs. In ISLPED ’08: Proceeding of the thirteenth
international symposium on Low power electronics and design, 2008.
[30] Ayse Kivilcim Coskun, Tajana Simunic Rosing, and Kenny C. Gross. Temperature
management in multiprocessor socs using online learning. In DAC ’08: Proceedings
of the 45th annual conference on Design automation, 2008.
[31] Ayse Kivilcim Coskun, Tajana Simunic Rosing, and Keith Whisnant. Temperature
aware task scheduling in mpsocs. In DATE ’07: Proceedings of the conference on
Design, automation and test in Europe, 2007.
[32] Erik P. DeBenedictis. Will moore’s law be sufficient? In SC ’04: Proceedings of
the 2004 ACM/IEEE conference on Supercomputing, 2004.
[33] Ashutosh S. Dhodapkar and James E. Smith. Managing multi-configuration hard-
ware via dynamic working set analysis. SIGARCH Comput. Archit. News, 30(2),
2002.
[34] Ashutosh S. Dhodapkar and James E. Smith. Managing multi-configuration hard-
ware via dynamic working set analysis. SIGARCH Comput. Archit. News, 30(2),
2002.
[35] James Donald and Margaret Martonosi. Techniques for multicore thermal man-
agement: Classification and new exploration. In ISCA ’06: Proceedings of the 33rd
annual international symposium on Computer Architecture, 2006.
[36] Mohamed Gomaa, Michael D. Powell, and T. N. Vijaykumar. Heat-and-run: lever-
aging smt and cmp to manage power density through the operating system. In
ASPLOS-XI: Proceedings of the 11th international conference on Architectural sup-
port for programming languages and operating systems, 2004.
[37] Pawan Goyal, Xingang Guo, and Harrick M. Vin. A hierarchical cpu scheduler for
multimedia operating systems. 2001.
[38] S. Gunther, F. Binns, D. M. Carmean, and J. C. Hall. Managing the impact of
increasing microprocessor power consumption. In Intel Technology Journal,
2001.
[39] M. R. Guthaus, J. S. Ringenberg, D. Ernst, T. M. Austin, T. Mudge, and R. B.
Brown. Mibench: A free, commercially representative embedded benchmark suite.
In WWC ’01: Proceedings of the Workload Characterization, 2001. WWC-4. 2001
IEEE International Workshop, 2001.
[40] Heather Hanson, Stephen W. Keckler, Soraya Ghiasi, Karthick Rajamani, Freeman
Rawson, and Juan Rubio. Thermal response to dvfs: analysis with an intel pentium
m. In ISLPED ’07: Proceedings of the 2007 international symposium on Low power
electronics and design, 2007.
[41] Jahangir Hasan, Ankit Jalote, T. N. Vijaykumar, and Carla E. Brodley. Heat
stroke: Power-density-based denial of service in smt. In HPCA ’05: Proceedings of
the 11th International Symposium on High-Performance Computer Architecture,
2005.
[42] John L. Hennessy and David A. Patterson. Computer Architecture: A Quantitative
Approach. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 1992.
[43] Seongmoo Heo, Kenneth Barr, and Krste Asanović. Reducing power density
through activity migration. In ISLPED ’03: Proceedings of the 2003 international
symposium on Low power electronics and design, 2003.
[44] Michael Huang, Jose Renau, Seung-Moon Yoo, and Josep Torrellas. A frame-
work for dynamic energy efficiency and temperature management. In MICRO 33:
Proceedings of the 33rd annual ACM/IEEE international symposium on Microar-
chitecture, 2000.
[45] Michael C. Huang, Jose Renau, and Josep Torrellas. Positional adaptation of
processors: application to energy reduction. SIGARCH Comput. Archit. News,
31(2), 2003.
[46] Wei Huang, Mircea R. Stan, Karthik Sankaranarayanan, Robert J. Ribando, and
Kevin Skadron. Many-core design from a thermal perspective. In DAC ’08: Pro-
ceedings of the 45th annual conference on Design automation, 2008.
[47] W-L. Hung, Y. Xie, N. Vijaykrishnan, M. Kandemir, and M. J. Irwin. Thermal-
aware task allocation and scheduling for embedded systems. In DATE ’05: Pro-
ceedings of the conference on Design, Automation and Test in Europe, 2005.
[48] Canturk Isci, Alper Buyuktosunoglu, Chen-Yong Cher, Pradip Bose, and Margaret
Martonosi. An analysis of efficient multi-core global power management policies:
Maximizing performance for a given power budget. In MICRO 39: Proceedings of
the 39th Annual IEEE/ACM International Symposium on Microarchitecture, 2006.
[49] Tejas S. Karkhanis and James E. Smith. A first-order superscalar processor model.
SIGARCH Comput. Archit. News, 32(2).
[50] Tejas S. Karkhanis and James E. Smith. Automated design of application specific
superscalar processors: an analytical approach. In ISCA ’07: Proceedings of the
34th annual international symposium on Computer architecture, 2007.
[51] Karthik Sankaranarayanan, Sivakumar Velusamy, Mircea Stan, and Kevin Skadron.
A case for thermal-aware floorplanning at the microarchitectural level. Journal
of Instruction-Level Parallelism, 2005.
[52] Amit Kumar, Li Shang, Li-Shiuan Peh, and Niraj K. Jha. Hybdtm: a coordinated
hardware-software approach for dynamic thermal management. In DAC ’06: Pro-
ceedings of the 43rd annual conference on Design automation, 2006.
[53] Chunho Lee, Miodrag Potkonjak, and William H. Mangione-Smith. Mediabench:
a tool for evaluating and synthesizing multimedia and communications systems. In
MICRO 30: Proceedings of the 30th annual ACM/IEEE international symposium
on Microarchitecture, 1997.
[54] Wonbok Lee, Kimish Patel, and Massoud Pedram. Dynamic thermal manage-
ment for mpeg-2 decoding. In ISLPED ’06: Proceedings of the 2006 international
symposium on Low power electronics and design, 2006.
[55] M. Levy. Keynote talk #1- eembc and the purposes of embedded processor bench-
marking. In ISPASS ’05: Proceedings of the IEEE International Symposium on
Performance Analysis of Systems and Software, 2005, 2005.
[56] Jian Li and Jose F. Martinez. Dynamic power-performance adaptation of parallel
computation on chip multiprocessors. In HPCA ’06: Proceedings of the 12th
International Symposium on High-Performance Computer Architecture, 2006.
[57] Yingmin Li, Dharmesh Parikh, Yan Zhang, Karthik Sankaranarayanan, Mircea
Stan, and Kevin Skadron. State-preserving vs. non-state-preserving leakage control
in caches. In DATE ’04: Proceedings of the conference on Design, automation and
test in Europe, 2004.
[58] Yongpan Liu, Robert P. Dick, Li Shang, and Huazhong Yang. Accurate
temperature-dependent integrated circuit leakage power estimation is easy. In
DATE ’07: Proceedings of the conference on Design, automation and test in Eu-
rope, 2007.
[59] Yongpan Liu, Huazhong Yang, Robert P. Dick, Hui Wang, and Li Shang. Thermal
vs energy optimization for dvfs-enabled processors in embedded systems. In ISQED
’07: Proceedings of the 8th International Symposium on Quality Electronic Design,
2007.
[60] Zhijian Lu, John Lach, Mircea R. Stan, and Kevin Skadron. Improved thermal
management with reliability banking. IEEE Micro, 25(6), 2005.
[61] Ke Meng, Russ Joseph, Robert P. Dick, and Li Shang. Multi-optimization power
management for chip multiprocessors. In PACT ’08: Proceedings of the 17th in-
ternational conference on Parallel architectures and compilation techniques, 2008.
[62] Andreas Merkel and Frank Bellosa. Balancing power consumption in multipro-
cessor systems. In EuroSys ’06: Proceedings of the 1st ACM SIGOPS/EuroSys
European Conference on Computer Systems 2006, 2006.
[63] Pierre Michaud, André Seznec, Damien Fetis, Yiannakis Sazeides, and Theofanis
Constantinou. A study of thread migration in temperature-constrained multicores.
ACM Trans. Archit. Code Optim., 4(2), 2007.
[64] Kresimir Mihic, Tajana Simunic, and Giovanni De Micheli. Reliability and power
management of integrated systems. In DSD ’04: Proceedings of the Digital System
Design, EUROMICRO Systems, 2004.
[65] Matteo Monchiero, Ramon Canal, and Antonio González. Design space
exploration for multicore architectures: a power/performance/thermal view. In ICS
’06: Proceedings of the 20th annual international conference on Supercomputing,
2006.
[66] Rajarshi Mukherjee and Seda Ogrenci Memik. Physical aware frequency selection
for dynamic thermal management in multi-core systems. In ICCAD ’06: Proceed-
ings of the 2006 IEEE/ACM international conference on Computer-aided design,
2006.
[67] Srinivasan Murali, Almir Mutapcic, David Atienza, Rajesh Gupta, Stephen Boyd,
Luca Benini, and Giovanni De Micheli. Temperature control of high-performance
multi-core platforms using convex optimization. In DATE ’08: Proceedings of the
conference on Design, automation and test in Europe, 2008.
[68] Srinivasan Murali, Almir Mutapcic, David Atienza, Rajesh Gupta, Stephen Boyd,
Luca Benini, and Giovanni De Micheli. Temperature control of high-performance
multi-core platforms using convex optimization. In DATE ’08: Proceedings of the
conference on Design, automation and test in Europe, 2008.
[69] Srinivasan Murali, Almir Mutapcic, David Atienza, Rajesh Gupta, Stephen Boyd,
and Giovanni De Micheli. Temperature-aware processor frequency assignment for
mpsocs using convex optimization. In CODES+ISSS ’07: Proceedings of the 5th
IEEE/ACM international conference on Hardware/software codesign and system
synthesis, 2007.
[70] Madhu Mutyam, Feihui Li, Vijaykrishnan Narayanan, Mahmut Kandemir, and
Mary Jane Irwin. Compiler-directed thermal management for vliw functional units.
SIGPLAN Not., 41(7), 2006.
[71] Sri Hari Krishna Narayanan, Guilin Chen, Mahmut Kandemir, and
Yuan Xie. Temperature-sensitive loop parallelization for chip multiprocessors. In
ICCD ’05: Proceedings of the 2005 International Conference on Computer Design,
2005.
[72] Sri Hari Krishna Narayanan, Mahmut Kandemir, and Ozcan Ozturk. Compiler-
directed power density reduction in noc-based multi-core designs. In ISQED ’06:
Proceedings of the 7th International Symposium on Quality Electronic Design,
2006.
[73] Jason Nieh and Monica S. Lam. A smart scheduler for multimedia applications.
ACM Trans. Comput. Syst., 21(2), 2003.
[74] Vidyasagar Nookala, David J. Lilja, and Sachin S. Sapatnekar. Temperature-aware
floorplanning of microarchitecture blocks with ipc-power dependence modeling and
transient analysis. In ISLPED ’06: Proceedings of the 2006 international sympo-
sium on Low power electronics and design, 2006.
[75] David A. Patterson and John L. Hennessy. Computer organization & design: the
hardware/software interface. 1993.
[76] Erez Perelman, Greg Hamerly, and Brad Calder. Picking statistically valid and
early simulation points. In PACT ’03: Proceedings of the 12th International Con-
ference on Parallel Architectures and Compilation Techniques, 2003.
[77] Fred J. Pollack. New microarchitecture challenges in the coming generations of
cmos process technologies (keynote address)(abstract only). In MICRO 32: Pro-
ceedings of the 32nd annual ACM/IEEE international symposium on Microarchi-
tecture, 1999.
[78] Dmitry Ponomarev, Gurhan Kucuk, and Kanad Ghose. Reducing power require-
ments of instruction scheduling through dynamic allocation of multiple datapath
resources. In MICRO 34: Proceedings of the 34th annual ACM/IEEE international
symposium on Microarchitecture, 2001.
[79] Michael D. Powell, Ethan Schuchman, and T. N. Vijaykumar. Balancing resource
utilization to mitigate power density in processor pipelines. In MICRO 38: Pro-
ceedings of the 38th annual IEEE/ACM International Symposium on Microarchi-
tecture, 2005.
[80] R. Viswanath, V. Wakharkar, A. Watwe, and V. Lebonheur. Thermal performance
challenges from silicon to systems. In Intel Technology Journal, Q3 2000.
[81] Ravishankar Rao and Sarma Vrudhula. Performance optimal processor throttling
under thermal constraints. In CASES ’07: Proceedings of the 2007 international
conference on Compilers, architecture, and synthesis for embedded systems, 2007.
[82] John Regehr and John A. Stankovic. Hls: A framework for composing soft real-
time schedulers. In RTSS ’01: Proceedings of the 22nd IEEE Real-Time Systems
Symposium (RTSS’01), 2001.
[83] H. Rosten and R. Viswanath. Thermal modeling of the pentium processor package.
Proc. 44th Electron. Comp. Technol. Conf, 1994.
[84] Hector Sanchez, Belli Kuttanna, Tim Olson, Mike Alexander, Gian Gerosa, Ross
Philip, and Jose Alvarez. Thermal management system for high performance
powerpc microprocessors. In COMPCON ’97: Proceedings of the 42nd IEEE
International Computer Conference, 1997.
[85] Ruchira Sasanka, Christopher J. Hughes, and Sarita V. Adve. Joint local and
global hardware adaptations for energy. In ASPLOS-X: Proceedings of the 10th
international conference on Architectural support for programming languages and
operating systems, 2002.
[86] N. N. Schraudolph. A Fast, Compact Approximation of the Exponential Function.
Technical Report IDSIA-07-98.
[87] Kevin Skadron. Hybrid architectural dynamic thermal management. In DATE ’04:
Proceedings of the conference on Design, automation and test in Europe, 2004.
[88] Kevin Skadron, Tarek Abdelzaher, and Mircea R. Stan. Control-theoretic tech-
niques and thermal-rc modeling for accurate and localized dynamic thermal man-
agement. In HPCA ’02: Proceedings of the 8th International Symposium on High-
Performance Computer Architecture, 2002.
[89] Kevin Skadron, Mircea R. Stan, Karthik Sankaranarayanan, Wei Huang, Sivaku-
mar Velusamy, and David Tarjan. Temperature-aware microarchitecture: Modeling
and implementation. ACM Trans. Archit. Code Optim., 1(1), 2004.
[90] Allan Snavely and Dean M. Tullsen. Symbiotic jobscheduling for a simultaneous
multithreaded processor. In ASPLOS-IX: Proceedings of the ninth international
conference on Architectural support for programming languages and operating sys-
tems, 2000.
[91] Jayanth Srinivasan and Sarita V. Adve. Predictive dynamic thermal management
for multimedia applications. In ICS ’03: Proceedings of the 17th annual interna-
tional conference on Supercomputing, 2003.
[92] Gilbert Strang. Introduction to Linear Algebra. Wellesley Cambridge Press, 1993.
[93] Haihua Su, Frank Liu, Anirudh Devgan, Emrah Acar, and Sani Nassif. Full
chip leakage estimation considering power supply and temperature variations. In
ISLPED ’03: Proceedings of the 2003 international symposium on Low power elec-
tronics and design, 2003.
[94] Scott Taylor, Michael Quinn, Darren Brown, Nathan Dohm, Scot Hildebrandt,
James Huggins, and Carl Ramey. Functional verification of a multiple-issue, out-
of-order, superscalar alpha processor—the dec alpha 21264 microprocessor. In
DAC ’98: Proceedings of the 35th annual conference on Design automation, 1998.
[95] Lothar Thiele and Reinhard Wilhelm. Design for timing predictability. Real-Time
Syst., 28(2-3), 2004.
[96] Shengquan Wang and Riccardo Bettati. Delay analysis in temperature-constrained
hard real-time systems with general task arrivals. In RTSS ’06: Proceedings of the
27th IEEE International Real-Time Systems Symposium, 2006.
[97] Shengquan Wang and Riccardo Bettati. Reactive speed control in temperature-
constrained real-time systems. Real-Time Syst., 39(1-3), 2008.
[98] Jonathan A. Winter and David H. Albonesi. Addressing thermal nonuniformity in
smt workloads. ACM Trans. Archit. Code Optim., 5(1), 2008.
[99] Raj Yavatkar and Murli Tirumala. Platform wide innovations to overcome thermal
challenges. Microelectron. J., 39(7), 2008.
[100] Inchoon Yeo, Chih Chun Liu, and Eun Jung Kim. Predictive dynamic thermal
management for multicore systems. In DAC ’08: Proceedings of the 45th annual
conference on Design automation, 2008.
[101] Yingmin Li, B. Lee, D. Brooks, Zhigang Hu, and K. Skadron. Cmp design space
exploration subject to physical constraints. In High-Performance Computer
Architecture, 2006. The Twelfth International Symposium on High-Performance Computer
Architecture, 2006.
[102] Wanghong Yuan and Klara Nahrstedt. Energy-efficient soft real-time cpu schedul-
ing for mobile multimedia systems. In SOSP ’03: Proceedings of the nineteenth
ACM symposium on Operating systems principles, 2003.
[103] Inchoon Yeo, Heung Ki Lee, Eun Jung Kim, and Ki Hwan Yum. Effective dynamic
thermal management for mpeg-4 decoding. In 25th International Conference on
Computer Design, ICCD, 2007.
[104] Sushu Zhang and Karam S. Chatha. Approximation algorithm for the temperature-
aware scheduling problem. In ICCAD ’07: Proceedings of the 2007 IEEE/ACM
international conference on Computer-aided design, 2007.
[105] Xiangrong Zhou, Chenjie Yu, and Peter Petrov. Compiler-driven register re-
assignment for register file power-density and temperature reduction. In DAC
’08: Proceedings of the 45th annual conference on Design automation, 2008.
[106] Changyun Zhu, Zhenyu Gu, Li Shang, R.P. Dick, and R. Joseph. Three-
dimensional chip-multiprocessor run-time thermal management. Computer-Aided
Design of Integrated Circuits and Systems, IEEE Transactions on, 27(8):1479–
1492, Aug. 2008.
