Application aware performance, power consumption, and reliability tradeoff by Gorti, Naga Pavan Kumar
Graduate Theses and Dissertations Iowa State University Capstones, Theses and Dissertations
2014
Application aware performance, power
consumption, and reliability tradeoff
Naga Pavan Kumar Gorti
Iowa State University
Recommended Citation
Gorti, Naga Pavan Kumar, "Application aware performance, power consumption, and reliability tradeoff " (2014). Graduate Theses and
Dissertations. 13933.
https://lib.dr.iastate.edu/etd/13933
Application aware performance, power consumption, and reliability tradeoff
by
Naga Pavan Kumar Gorti
A dissertation submitted to the graduate faculty
in partial fulfillment of the requirements for the degree of
DOCTOR OF PHILOSOPHY
Major: Computer Engineering
Program of Study Committee:
Arun K. Somani, Major Professor
Akhilesh Tyagi
Joseph Zambreno
Philip Jones
David Fernandez Baca
Iowa State University
Ames, Iowa
2014
Copyright © Naga Pavan Kumar Gorti, 2014. All rights reserved.
DEDICATION
To my family and to my daughter, who is yet to step into this beautiful world
TABLE OF CONTENTS
LIST OF TABLES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . vii
LIST OF FIGURES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . viii
ACKNOWLEDGEMENTS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xi
ABSTRACT . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xiii
CHAPTER 1. INTRODUCTION . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.1 Current trends in computing industry . . . . . . . . . . . . . . . . . . . . . . . 1
1.1.1 Performance trend . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.1.2 Power consumption trend . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.1.3 Reliability trend . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.1.4 Nature of PPR demands . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.2 Research contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
1.3 Thesis organization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
CHAPTER 2. REVIEW OF LITERATURE . . . . . . . . . . . . . . . . . . . . 9
2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.2 Dynamic voltage and frequency scaling . . . . . . . . . . . . . . . . . . . . . . . 10
2.2.1 Benefit of DVFS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.2.2 Early work on DVFS and classification of DVFS schemes . . . . . . . . 11
2.2.3 Online DVFS schemes . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
2.2.4 Offline DVFS schemes . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
2.2.5 Hybrid DVFS schemes . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
2.2.6 Thermal aware DVFS . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
2.3 Microarchitectural adaptation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
2.3.1 Classic research in microarchitectural adaptation . . . . . . . . . . . . . 19
2.3.2 Closely related research in microarchitectural adaptation . . . . . . . . 26
2.4 Uniqueness of current research . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
CHAPTER 3. PERFORMANCE RELIABILITY TRADEOFF . . . . . . . . 30
3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
3.2 System model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
3.2.1 Assumptions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
3.2.2 Procurement of task execution characteristics . . . . . . . . . . . . . . . 34
3.3 OC Selection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
3.3.1 GOPS Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
3.3.2 LOPS Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
3.4 Evaluation of the developed DVFS based schemes . . . . . . . . . . . . . . . . 44
3.4.1 Experimentation with synthetic task sets . . . . . . . . . . . . . . . . . 45
3.4.2 Simulation based performance reliability tradeoff analysis . . . . . . . . 50
3.5 Performance reliability tradeoff using DVFS and microarchitectural adaptation 56
3.5.1 Need for considering DVFS and microarchitectural adaptation together 56
3.5.2 Selection of adaptive microarchitectural components . . . . . . . . . . . 57
3.5.3 Performance reliability tradeoff considering both DVFS and microarchitectural adaptation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
3.6 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
CHAPTER 4. ADAPTIVE MICROARCHITECTURAL CONFIGURATION
SPACE PRUNING . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
4.2 Selection of Advantageous Control Knobs (SACK ) . . . . . . . . . . . . . . . . 63
4.2.1 Observations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
4.3 Elimination of ineffective configurations (ELIC ) . . . . . . . . . . . . . . . . . . 66
4.3.1 Observations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
4.4 Configuration Set Selection for Runtime (CSSR) . . . . . . . . . . . . . . . . . 67
4.4.1 Merit based selection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
4.4.2 Bound based selection . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
4.4.3 Neighborhood based selection . . . . . . . . . . . . . . . . . . . . . . . . 73
4.5 Evaluation of the different CSSR pruning methods . . . . . . . . . . . . . . . . 76
4.5.1 Final configuration space . . . . . . . . . . . . . . . . . . . . . . . . . . 76
4.5.2 User demand tracking . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
4.6 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
CHAPTER 5. DEGRADATION OF PERFORMANCE-POWER TRADEOFF UNDER PERMANENT FAULTS . . . . . . . . . . . . . . . . . . . . . 82
5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
5.2 Fault model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
5.3 Evaluation of tradeoff degradation . . . . . . . . . . . . . . . . . . . . . . . . . 87
5.3.1 Dispatch port failure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
5.3.2 Cache way failure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89
5.3.3 Instruction window chunk failure . . . . . . . . . . . . . . . . . . . . . . 90
5.3.4 Voltage and frequency control failure . . . . . . . . . . . . . . . . . . . . 91
5.3.5 Power saving with reduced performance requirements . . . . . . . . . . 95
5.3.6 Avoiding available adaptations for increased power saving . . . . . . . . 96
5.4 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98
CHAPTER 6. APPLICATION AWARE PERFORMANCE-POWER TRADEOFF . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100
6.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100
6.2 Two stage static cum dynamic adaptation strategy . . . . . . . . . . . . . . . . 102
6.2.1 Application phase demarcation . . . . . . . . . . . . . . . . . . . . . . . 102
6.2.2 Static reconfiguration stage . . . . . . . . . . . . . . . . . . . . . . . . . 104
6.2.3 Dynamic reconfiguration stage . . . . . . . . . . . . . . . . . . . . . . . 108
6.3 Dynamic adaptation strategies . . . . . . . . . . . . . . . . . . . . . . . . . . . 110
6.4 Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 112
6.4.1 Evaluation Methodology . . . . . . . . . . . . . . . . . . . . . . . . . . . 112
6.4.2 Determination of maximum interval length . . . . . . . . . . . . . . . . 114
6.4.3 Adaptation strategies . . . . . . . . . . . . . . . . . . . . . . . . . . . . 116
6.4.4 Handling runtime variations in performance and power . . . . . . . . . . 117
6.4.5 Comparison of SDC and dynamic adaptation strategies . . . . . . . . . 118
6.4.6 Scaling of power consumption with performance . . . . . . . . . . . . . 121
6.4.7 Comparison with previous schemes . . . . . . . . . . . . . . . . . . . . . 122
6.5 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 124
CHAPTER 7. CONCLUSIONS AND FUTURE WORK . . . . . . . . . . . . 126
7.1 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 126
7.2 Future work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 128
LIST OF TABLES
Table 3.1 OC selection schemes for Peak Reduction . . . . . . . . . . . . . . . . . 39
Table 3.2 Evaluation parameters used for analyzing effectiveness of GOPS algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
Table 3.3 QOS satisfaction and Energy savings . . . . . . . . . . . . . . . . . . . 49
Table 3.4 Simulation parameters used for performance-reliability tradeoff analysis 51
Table 3.5 SPEC workloads used for simulations . . . . . . . . . . . . . . . . . . . 52
Table 3.6 MTTF model parameters . . . . . . . . . . . . . . . . . . . . . . . . . . 54
Table 3.7 Operating voltages and frequencies for Intel Pentium M processor . . . 56
Table 3.8 Adaptive hardware configurations . . . . . . . . . . . . . . . . . . . . . 58
Table 4.1 Considered adaptive components and adaptations . . . . . . . . . . . . 64
Table 4.2 tp for the considered adaptive components . . . . . . . . . . . . . . . . 65
Table 4.3 Performance-power variations provided using the chosen configuration
space . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
Table 4.4 Different pruning methods for CSSR . . . . . . . . . . . . . . . . . . . 68
Table 4.5 Different performance-power demand scenarios . . . . . . . . . . . . . . 77
Table 5.1 Investigated fault scenarios . . . . . . . . . . . . . . . . . . . . . . . . . 86
Table 5.2 Performance and power characteristics obtained for different performance demands . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
LIST OF FIGURES
Figure 1.1 Transistor count trend (per chip) for commercial processors . . . . . . 2
Figure 1.2 Core count trend for HPC platforms . . . . . . . . . . . . . . . . . . . 2
Figure 1.3 Power consumption trend for HPC platforms . . . . . . . . . . . . . . . 3
Figure 1.4 Relative importance of factors limiting server growth potential . . . . . 4
Figure 1.5 Decrease in lifetime reliability of processors with shrinking gate length 5
Figure 2.1 Adaptation in Complexity-Adaptive processors . . . . . . . . . . . . . 20
Figure 2.2 Smart Memories tile floorplan . . . . . . . . . . . . . . . . . . . . . . . 25
Figure 2.3 2Bc-gskew-pskew branch predictor organization . . . . . . . . . . . . . 26
Figure 3.1 Example Operations Table . . . . . . . . . . . . . . . . . . . . . . . . . 33
Figure 3.2 Example OC selection using WOPS . . . . . . . . . . . . . . . . . . . . 42
Figure 3.3 Effectiveness of GOPS algorithms in reducing inter-task temperature
gradients . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
Figure 3.4 Effectiveness of GOPS algorithms in providing energy savings . . . . . 48
Figure 3.5 Scaling of the number of algorithmic iterations of GOPS algorithms
with task set size and task stretch factor . . . . . . . . . . . . . . . . . 49
Figure 3.6 Simulation framework . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
Figure 3.7 MTTF increase using DVFS using (a) Window based selection, and (b)
Peak reduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
Figure 3.8 Normalized performance vs. (a) IL1-Assoc. (b) Operating VF for se-
lected SPEC benchmarks . . . . . . . . . . . . . . . . . . . . . . . . . . 57
Figure 3.9 Normalized power vs. (a) IL1-Assoc. (b) Operating VF for selected
SPEC benchmarks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
Figure 3.10 Power consumption breakdown among different units on Alpha EV6
floorplan . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
Figure 3.11 Expected MTTF improvement through the combined use of DVFS and
microarchitectural adaptation with (a) Window based selection, and (b)
Peak reduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
Figure 4.1 Number of configurations eliminated by ELIC . . . . . . . . . . . . . . 67
Figure 4.2 Usage frequency of the individual adaptive settings for the considered
adaptive components . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
Figure 4.3 Example merit based selection for CSSR. n=8 and k=4 . . . . . . . . . 70
Figure 4.4 CSSR using bound-based selection when n=8 and k=5 . . . . . . . . . 72
Figure 4.5 Graphical representation of the retained configuration space for soplex
benchmark after ELIC pruning step . . . . . . . . . . . . . . . . . . . 73
Figure 4.6 Final adaptive microarchitectural configuration space size . . . . . . . 77
Figure 4.7 PI in tracking high performance demands . . . . . . . . . . . . . . . . 78
Figure 4.8 PI in tracking low power demands . . . . . . . . . . . . . . . . . . . . 78
Figure 4.9 PI in tracking balanced demands . . . . . . . . . . . . . . . . . . . . . 79
Figure 4.10 PI in tracking stringent demands . . . . . . . . . . . . . . . . . . . . . 80
Figure 5.1 Tradeoff degradation when one dispatch port fails . . . . . . . . . . . . 88
Figure 5.2 Tradeoff degradation when one cache way fails . . . . . . . . . . . . . . 90
Figure 5.3 Tradeoff degradation when an instruction window chunk fails . . . . . 91
Figure 5.4 Tradeoff degradation when the lowest VF setting fails . . . . . . . . . 92
Figure 5.5 Tradeoff degradation when the intermediate VF setting fails . . . . . . 93
Figure 5.6 Tradeoff degradation when the highest VF setting fails . . . . . . . . . 94
Figure 5.7 Deliverable peak performance and the associated power consumption
utilizing a subset of available configuration space . . . . . . . . . . . . 97
Figure 6.1 Adaptive architectural reconfiguration process . . . . . . . . . . . . . . 104
Figure 6.2 Evaluation platform . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 113
Figure 6.3 Phase count scaling with maximum phase length . . . . . . . . . . . . 115
Figure 6.4 PI degradation with maximum phase length . . . . . . . . . . . . . . . 115
Figure 6.5 Benefit of utilizing intra-application adaptation . . . . . . . . . . . . . 116
Figure 6.6 Effectiveness of static cum dynamic adaptation strategy . . . . . . . . 118
Figure 6.7 Comparison of PI for SDC and dynamic adaptation strategies . . . . . 119
Figure 6.8 Performance delivered for various performance demands . . . . . . . . 121
Figure 6.9 Power consumed for various performance demands . . . . . . . . . . . 122
Figure 6.10 Performance delivered for peak power constraint of 70% . . . . . . . . 123
Figure 6.11 Energy efficiency when serving different performance needs . . . . . . . 124
Figure 7.1 Normalized performance vs. normalized power for FFT . . . . . . . . 129
Figure 7.2 Normalized performance vs. normalized power for Barnes hut algorithm 130
ACKNOWLEDGEMENTS
I am glad to take this opportunity to express my gratitude to the people who helped me with
my graduate education and research in particular, and with life in general.
Firstly, I would like to thank Dr. Arun K. Somani for being my major professor. He
has been the guiding beacon for my research. I have always been inspired by his discipline,
dedication and intelligence. I would like to specially thank him for his constant effort to impart
good presentation and technical writing skills to me, and for encouraging free thinking.
I would also like to thank Dr. Akhilesh Tyagi, Dr. Joseph Zambreno, Dr. Philip Jones, and
Dr. David Fernandez-Baca for serving on my committee and providing me with constructive
feedback regarding my research. I have also been associated with all these people in a classroom
setting and am thankful to them for honing my thinking abilities that proved essential for my
research. I am also thankful to other professors in the ECpE department: Dr. Manimaran, Dr.
Zhao Zhang, Dr. Nicola Elia and Dr. Chris Chu for teaching me various aspects of computing
and optimization. I would like to extend a special thanks to the ECpE department and Iowa
State University for providing me an excellent atmosphere to conduct my research.
Research is often inspired by the daily conversations we have with other people. My research
group at Iowa State University, namely the Dependable Computing and Networking Laboratory,
enabled me to network with a large set of such people. I would like to specially thank Prem,
Prasad, Nishanth, Vishwanathan, David, Parijat, Karthik, Piyush, Cory, Ashish, Joy, Utkarsh,
Haoyuan, Teng, Jin Xu, Koray, Matt, Lizandro, Harini, Suresh, Ganesh, and Kritanjali among
others in this regard.
I express my heartfelt gratitude to my parents and grandparents for teaching me about
life and pushing me towards academic success. I am very fortunate to have met the love of my
life, Priyam, at Iowa State University. It would be very fair to say that I would not have
accomplished what I have in my research without her continuous support. I am extremely
indebted to my uncle and his family who helped me both in academic and non-academic
aspects of my life. I would also like to extend a special thanks to my cousin Karthik for taking
care of my transition from India to the USA and for all the help he has rendered to me. Finally, I
would like to thank my dear friends Sharath, Adwait, Teja, Ramya, Jyani, Shantan, Indranil,
Madhu, Yuzhu, Sista, Azhar, Swagath, Avinash, Harsha, Guru, Suman, and many others for the
wonderful times they have shared with me and for making my journey through the PhD very pleasant.
ABSTRACT
There has been an unprecedented increase in the drive for microprocessor performance.
This drive is motivated by the increase in software complexity, the opportunity to solve previously
unattempted problems, especially in the scientific domain, and the need to crunch ever-growing
Big Data to enable a multitude of technological advances that benefit mankind. A consequence
of these phenomena is the ever increasing transistor count in deployed computing systems.
Although technology scaling leads to lower power consumption per transistor, the overall
system level power consumption is on the rise. This leads to a variety of power supply related
issues. As the chip die area is not increasing significantly, and the supply voltage reduction
is not keeping pace with the reduction in device dimensions, an increase in power density
is observed. This manifests as an increased temperature profile on the chip floorplan. A rise
in temperature necessitates aggressive and costly cooling mechanisms adding to the design
complexity and manufacturing efforts. It also triggers various failure mechanisms leading to
reduction in the expected chip lifetime/reliability.
Given the conflicting trends in Performance, Power consumption, and chip Reliability
(PPR), it is imperative to balance them in a fine-grained fashion to meet system level goals and
expectations. Sole dependence on the advancements in manufacturing technology is no longer
sufficient. Alternate avenues for PPR management are receiving increasing attention.
On the other hand, the PPR demands are usually time dependent. For example, the
constraint on power consumption in a green data center is dictated by the energy reserve.
The demand on performance in a cloud based platform depends on the agreed Quality of
Service (QoS) requirements. The reliability of a microprocessor is dependent on the deployment
domain.
The goal of our research is to address the issue of growing microprocessor power consumption
subject to performance and/or reliability goals. Through our developed schemes, we tailor the
execution context to match application requirements. This leads to judicious use of power
while adhering to the aforementioned constraints. It is to be noted that the actual demands on
performance, power consumption, and reliability are highly variable, and depend upon the executing
applications and operating conditions. As such, we develop schemes to cater to these varying
demands.
To meet these demands efficiently, the solutions developed are tailored to current hardware-
software interaction characteristics. Two techniques that are very relevant in this area, namely
dynamic voltage and frequency scaling (DVFS) and microarchitectural adaptation, are leveraged
to produce expected PPR characteristics when executing a wide variety of tasks.
In this dissertation, we demonstrate how the expected chip lifetime can be augmented in
a real-time setting using DVFS while paying heed to performance constraints modeled as QoS
requirements. Individual tasks in a task queue are assigned specific voltage and frequency pairs
to utilize for their execution. This assignment is empowered by knowledge of application-wise
hardware-software interactions to reach solutions that are tailored to the current execution
scenario. Our observations indicate that a 2 to 18 fold improvement in chip lifetime can be
expected by the utilization of the schemes we develop in this regard. Capitalizing on the power
of microarchitectural adaptation, we further improve chip lifetime expectations by 2 to 8 times,
depending on the failure mechanism investigated. This increase in expected chip lifetime directly
translates to a reduction of both operational and replacement costs.
We also provide mechanisms to co-manage performance and power consumption constraints.
The comprehensive microarchitectural adaptation space is very complex, and its usage thus leads to
significant runtime overhead. To tackle this, we devote considerable attention to pruning it so
as to identify and utilize only the most effective adaptations. A two stage adaptation
process is provided to a) improve the optimality of the solutions delivered, and b) keep the
runtime overhead in check. We observe that our schemes provide 20% higher normalized energy
efficiency compared to the state of the art techniques, while using just a very small
fraction of the configuration space. We also find that our schemes effectively cater to a wide
variety of demands on performance and power consumption, providing the necessary hardware
characteristics within a 10% bound.
Since only the most useful configuration space is retained for adaptation, occurrence of a
fault that prohibits the usage of a certain adaptive control can lead to the inability to satisfy
a subset of hardware demands. A detailed analysis has been carried out to understand how
the remaining active configurations can preserve the expected hardware behavior. To a good
extent, we observe that the system behavior under a failure closely tracks (with less than 5%
tracking error) the obtainable behavior without the presence of the fault.
We believe that application tailored schemes for PPR management will become increasingly
relevant as microprocessor design advancements saturate in the future. They will be extremely
relevant for extracting every possible ounce of performance while conforming to constraints
on power consumption and reliability. Given the effectiveness of our schemes, we are confident
that such schemes are applicable in different markets like embedded computing, desktop computing,
cloud platforms and high performance computing. Insights drawn from our research
will guide chip designers in the provision of effective adaptive controls to combat increasing
demands on PPR characteristics.
CHAPTER 1. INTRODUCTION
This chapter discusses some recent trends in the computing industry and their implications
on microprocessor characteristics. The need for managing performance, power consumption,
and/or reliability (PPR) through careful hardware-software co-design is detailed. This is followed
by an outline of the contributions of the current research. An overview of how the thesis is
organized is presented at the end of this chapter.
1.1 Current trends in computing industry
1.1.1 Performance trend
Demand for computing performance is growing at an unprecedented pace. This demand is
motivated by several factors. First, the increase in software complexity necessitates aggressive
hardware designs to provide acceptable latency, response time or throughput. Second, the scientific
computing community is attempting to solve ever more challenging large problems at high speeds.
Examples of such problems include genome sequencing, weather modeling, and molecular dynamics
simulations. Quick processing of applications in this domain results in major advances in
our understanding of the universe and everything it encompasses. Third, immense computing
potential is required to handle ’big data’ that is being produced in various fields like social
networking, universities, remote sensing, etc. It is reported by IBM [46] that more than 90% of the
data in the world was generated within the past two years.
To satisfy the growing drive for performance, more and more transistors are crammed onto
computing platforms. The increased transistor count is used to both scale-in and scale-out
the computing architectures. Scale-in refers to increasing the chip complexity by making cores
more and more aggressive. Scale-out refers to the increase in the number of cores employed in a
compute node as well as the number of compute nodes employed in a computing platform. Figures
1.1 [104] and 1.2 [97] show the consequences of scale-in and scale-out, respectively, in industry
over the last 14 years. In particular, we can observe that the industry has been consistently
outperforming Moore’s law based predictions starting from 2000.
Figure 1.1 Transistor count trend (per chip) for commercial processors (plot: number of transistors on chip vs. year, 2000 to 2012; series: actual count and Moore prediction)
Figure 1.2 Core count trend for HPC platforms (plot: number of cores in system vs. year, 2000 to 2014)
1.1.2 Power consumption trend
According to semiconductor scaling theory [4], the power consumption per transistor decreases
by a factor of $U^2$ with each new technology generation, where $U$ is the reduction factor
for supply voltage. However, the trends in transistor count, especially due to scale-out, pave
the way for increased net power consumption. This occurs despite the growing maturity in
semiconductor manufacturing and device scaling. Figure 1.3 [97] shows the increase in system
level power consumption for the most power hungry HPC platforms over the past 14 years.
Figure 1.3 Power consumption trend for HPC platforms (plot: power consumption in KW vs. year, 2000 to 2012)
It is to be noted that this increase can be observed in the desktop and embedded computing
segment as well. Figure 1.3 serves just as an example showing the trend in one particular
computing segment.
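The scaling argument at the start of this subsection can be tried out with a small calculation. The sketch below is illustrative only: the function name and the example factors (a supply voltage reduction factor of 1.2 and a doubling of the transistor count per generation) are assumptions chosen for the arithmetic and are not values taken from the cited sources.

```python
def net_power_ratio(voltage_reduction: float, transistor_growth: float) -> float:
    """Ratio of net power after one technology generation to net power before it.

    Per-transistor power falls by voltage_reduction**2 (the U^2 rule quoted in
    the text), while the number of transistors grows by transistor_growth.
    """
    if voltage_reduction <= 0 or transistor_growth <= 0:
        raise ValueError("scaling factors must be positive")
    return transistor_growth / voltage_reduction ** 2


# Hypothetical example values (assumptions): U = 1.2, transistor count doubles.
ratio = net_power_ratio(voltage_reduction=1.2, transistor_growth=2.0)
print(f"net power changes by a factor of {ratio:.2f}")
```

Under these assumed factors the ratio exceeds 1, which is the situation the text describes: net power rises whenever the transistor count grows faster than the square of the voltage reduction factor.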
The increased power consumption leads to various issues related to power supply and inflates
electricity bills in the high capacity server domain. According to a survey by Intel [47], the
challenges involved in powering and cooling constitute the number one factor limiting the
expansion of the server industry to meet the current global demand. The survey’s results, depicted
in Figure 1.4, show that 59% of the surveyed people concur that power consumption is the
bottleneck in the development of the server market. In the embedded and personal computing
domain, the increase in power consumption leads to decreased battery life and discomfort in
device handling. It also necessitates the design of aggressive cooling mechanisms and expensive
heat sinks [101, 88, 70].
Figure 1.4 Relative importance of factors limiting server growth potential
1.1.3 Reliability trend
We have noted that the power consumption per transistor reduces by a factor of $U^2$ for each
new technology generation, where $U$ is the voltage scaling factor. It is also the case that
the transistor density scales up by a factor of $S^2$, where $S$ is the device dimension scaling
factor. It is a well-known fact that $U < S$ because a) departure from standardized voltage
levels raises compatibility issues, and b) a decrease in supply voltage increases leakage power due
to the corresponding decrease in the threshold voltage $V_{th}$ needed to maintain proper noise
margins. As such, the power density scales as $S^2/U^2$. This ratio is greater than 1. Due to
the increased power density, the temperature at which the chip operates increases as well. An
increase in temperature reduces the lifetime of the devices. In fact, the expected chip lifetime
goes down by a factor of 2 as the operating temperature rises by 10 °C [78]. Although most
of the power dissipated as heat can be effectively discarded from the chip packaging by heat
sinks, this will not hold well in the future given the rate of rise in chip power density. Active
cooling mechanisms like fans are not practical for computing platforms like smartphones and
tablets. Some aggressive and innovative cooling techniques like fluid based cooling have been
implemented in the high end server market (e.g., IBM Aquasar [108]). However, the usage of
such mechanisms in the low end desktop and embedded computing markets is impractical.
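The two relations in this subsection, the power density scaling and the halving of lifetime with rising temperature, can be explored numerically. The following is a minimal sketch under stated assumptions: the function names and the example scaling factors are hypothetical, the halving step is taken as 10 degrees Celsius by default, and the halving rule is assumed to extend smoothly to other temperature rises, which the text does not state.

```python
def power_density_ratio(dimension_scaling: float, voltage_scaling: float) -> float:
    """Per-generation change in power density, S^2 / U^2 as stated in the text."""
    if dimension_scaling <= 0 or voltage_scaling <= 0:
        raise ValueError("scaling factors must be positive")
    return (dimension_scaling / voltage_scaling) ** 2


def lifetime_fraction(temp_rise_c: float, halving_step_c: float = 10.0) -> float:
    """Fraction of the expected lifetime remaining after a temperature rise.

    Assumes the lifetime halves for every halving_step_c degrees of rise and
    that the rule applies continuously (an assumption of this sketch).
    """
    if halving_step_c <= 0:
        raise ValueError("halving step must be positive")
    return 0.5 ** (temp_rise_c / halving_step_c)


# Hypothetical example values (assumptions): S = 1.4, U = 1.2, 20 degree rise.
print(f"power density ratio per generation: {power_density_ratio(1.4, 1.2):.2f}")
print(f"lifetime fraction after a 20 degree rise: {lifetime_fraction(20.0):.2f}")
```

Because $U < S$, the first function returns a value above 1 for any realistic pair of factors, and the second shows how quickly repeated temperature increases compound into lifetime loss.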
It should be noted that the expected lifetime of a processor depends on the application executing
on it, in addition to the technology dependent parameters. This is because the various
applications exercise the different units on the chip floorplan to varying degrees, leading to different
operating temperatures. Figure 1.5 [92] shows how the lifetime of processors changes in
accordance with transistor gate length and the executing applications (from the SPEC 2000 suite
[38]). The lifetime is modeled in terms of chip Mean Time To Failure (MTTF).
Figure 1.5 Decrease in lifetime reliability of processors with shrinking gate length
In addition to the decrease in lifetime reliability, there are a few other issues with increasing
chip temperature. Firstly, the carrier mobility decreases with increasing temperature, leading
to performance degradation. Secondly, the sub-threshold leakage, which is a dominant leakage
power mechanism, increases with increasing temperature. The increased thermal energy
possessed by the electrons at higher temperature makes it easier for them to traverse the channel
when gate voltages are lower than the threshold. Thirdly, the interconnect delay increases since
its resistivity increases with temperature.
From the trends discussed so far, it is clear that PPR considerations play a major role in
both the design and operation of computing infrastructure. It is not sufficient to optimize the
hardware to satisfy one of the PPR demands at a time. Increasing performance generally leads
to higher power consumption and decreased reliability. As such, it is important to co-manage
performance-power and performance-reliability together. This dissertation makes an effort to
provide efficient solutions to the aforementioned co-management issues.
1.1.4 Nature of PPR demands
Computational needs in terms of delivered PPR characteristics often vary over time.
In the server domain, high performance is required when the system load is high and the agents
loading the system are guaranteed a high Quality of Service (QoS ). Performance can be traded
off for lower power consumption and higher reliability during periods of low activity. Servers
employed in green data centers are powered by electricity generated through various renewable
energy sources, like sun and wind. Since the availability of these sources is time variant, so is
the power generated. In spite of maintaining energy reservoirs, it becomes imperative to
look for measures to trade off performance for power to avoid catastrophic blackouts. In the
personal and embedded computing domain, one might require high performance when executing
intense tasks, while low power operation may be required when running on battery without
any active power sources. A major motivation to reduce power consumption in these domains
is the impracticality of active cooling mechanisms like fans and the need to avoid
aggressive heat sinks to keep costs low.
It is clear from the PPR trends discussed in the last subsection that sole reliance on technology
scaling to tackle the related issues is not sufficient. Exploration of alternative avenues
to manage all or a few of the PPR demands is crucial. It is also the case that making rigid
architectural decisions that provide the expected PPR characteristics is impossible at microprocessor
design time. For example, we have seen in Figure 1.5 that reliability of a processor is dependent on
the applications executing on it. Similar arguments can also be made for performance and
power consumption. As such, a single architectural configuration does not lead to similar PPR
characteristics when executing different applications.
We strongly believe that the solution to the intricate problem of providing required PPR
characteristics can be achieved through careful hardware-software co-design. Our research
leverages the simple fact that applications vary with regard to their interaction with hardware.
As a result, different hardware components become critical to provide good performance
for different applications. Leveraging this fact, our research focuses on utilization of DVFS and
microarchitectural adaptation hand in hand to produce expected behavior on a microprocessor
when executing a variety of applications. The complexity and effectiveness of such techniques
in PPR management are analyzed and tackled. The techniques we develop in this research work
utilize static knowledge of hardware-software interactions to provide good performance at a
given power level or to provide good reliability at a given performance level. Simplistic runtime
mechanisms are also developed to ascertain rigid PPR guarantees in the face of misjudgment
of the static expectations and runtime uncertainties.
1.2 Research contributions
The goal of this dissertation is to address the issue of growing microprocessor power con-
sumption subject to performance and/or reliability constraints. We provide application-aware
mechanisms for PPR management according to the corresponding constraints specified as inputs.
It is clear that providing good performance is at odds with lowering power consumption
or improving reliability. As such, a design goal is to balance out the obtainable PPR character-
istics in the most efficient manner in accordance with the priority associated with the different
constraints. Since at least a component of the management mechanisms falls under application
execution time, we include the design goal to make such a component fast and lightweight in
nature. The main contributions of this thesis are as follows.
1. Existing DVFS schemes are not equipped to deal with thermal cycling. This phenomenon
occurs when the chip temperature rapidly fluctuates due to the varying hardware-software
interactions resulting from different applications executing sequentially. We incorporate this
awareness into the DVFS mechanism and observe an improvement in expected chip lifetime.
2. Microarchitectural adaptation space is very large, given the number of adaptive compo-
nents and the ways in which each component can be individually configured. We make
a conscious effort to methodically prune the configuration space to retain only the most
effective configurations for adaptation. We evaluate how our pruned configuration space
can be used to produce varied PPR characteristics.
3. Static schemes for PPR management are comprehensive and provide optimal
operation characteristics. However, they suffer on account of their rigidity and inability
to adapt to runtime fluctuations. On the other hand, dynamic schemes are adaptive but
they are not optimal due to runtime overhead considerations. We combine the advantages
of these two approaches to develop hybrid schemes that contain a comprehensive static
component to optimize the decision process as well as a runtime component to adapt to
runtime fluctuations. We also develop alternate lightweight runtime only schemes that
perform close to the comprehensive approaches.
4. We utilize simplistic metrics as inputs reflecting the PPR demands/expectations. This
helps users with varied computer literacy levels to be able to specify and achieve required
characteristics from hardware.
1.3 Thesis organization
The remainder of this thesis is organized as follows. Chapter 2 provides an overview of
previous research on both DVFS and microarchitectural adaptation. A few examples of classi-
cal work as well as current state-of-the-art approaches in these areas are elaborated in detail.
Chapter 3 describes how DVFS alone as well as DVFS in conjunction with microarchitec-
tural adaptation can lead to increased chip reliability in real-time environments. A Quality
of Service (QoS) metric is used to dictate performance requirement. A methodology to prune
adaptive microarchitectural configuration space is presented in Chapter 4. Since we reduce
the configuration space to include only the most beneficial adaptations for performance-power
tradeoff, Chapter 5 analyzes how the tradeoff is affected when a single configuration out of this
set is unavailable for use due to a permanent fault. Chapter 6 deals with application aware
performance-power tradeoff. Questions on when and how to adapt the microarchitecture to
satisfy set constraints on performance and power are answered in this chapter. The merit of
using the developed adaptation mechanisms in catering to varying user demands from hardware
is investigated and the relevant results are presented. We conclude our discussion in Chapter
7 by reiterating the achievements of the research work. Since the research is focused on a
uniprocessor adaptation, a discussion on how to leverage this work to perform adaptations
in homogeneous or heterogeneous multicore environments is added. The conclusions of this
discussion will trigger investigation of multicore adaptation.
CHAPTER 2. REVIEW OF LITERATURE
This chapter provides an overview of previous research on both DVFS and microarchitec-
tural adaptation. A few examples of classical work in this area are elaborated in detail. Also,
some recently proposed cutting-edge approaches are discussed and distinguished. We
conclude this chapter by specifying how the current research work differs from the existing
body of work.
2.1 Introduction
The drive for improved microprocessor performance led to aggressive superscalar designs
during the late 1990s and early 2000s. These superscalar designs increased in their complexity
from being a simplistic single-issue low frequency design to multi-issue designs operating at
very high frequencies. As a result, the chip power consumption rose exponentially, needing
aggressive heat sink designs to adequately dissipate the heat produced. The increased power
consumption motivated the research for a variety of hardware/software based techniques to
counter the increased power dissipation. An alternate solution proposed was to design and
utilize multicores. Due to the increasing aggressiveness of PPR constraints, there exists a need
to adopt both these solutions together. In this dissertation, we focus on the hardware/software
based techniques for PPR management or tradeoff.
A large number of schemes were developed to manage power and/or limit energy consump-
tion and/or maintain hardware reliability, while sacrificing little performance. These schemes
are based either on microarchitectural adaptation or DVFS. The primary objective of these
schemes can be one of the following: 1) reduction of dynamic power, 2) reduction of static
power, 3) adhering to a particular power or energy budget, etc. The hardware entity carefully
regulated to achieve the set objective can be the CPU voltage and/or frequency, or the vari-
ous microarchitectural entities like issue queue size, number of active functional units, certain
ways of cache, etc. Some schemes are particularly targeted towards multi-cores while others
are generic and can be applied individually to each core. In the following sections, we will
look at a few classic as well as state-of-the-art schemes based on DVFS and microarchitectural
adaptation before distinguishing the current research from the previous work. Since there exists
a plethora of schemes in both these areas, we limit ourselves to providing examples of
groundbreaking or closely related work.
2.2 Dynamic voltage and frequency scaling
2.2.1 Benefit of DVFS
The power consumed by an application executing on a processor, both static ($P_{static}$) and
dynamic ($P_{dynamic}$), depends on the core voltage ($V$) and frequency ($f$) as
$$P_{static} \propto V \qquad (2.1)$$
$$P_{dynamic} \propto V^{2} f \qquad (2.2)$$
Hence, lowering the core voltage and frequency is an effective way to manage power con-
sumption, and indirectly, the chip reliability. When V is lowered, the circuits become slower
since it takes more time to charge and discharge the load capacitances in the CMOS gates. As
such, f has to be lowered to allow enough time to charge a capacitance to reflect a logic ’1’ or
to discharge it to reflect a ’0’. Reduction in voltage and frequency negatively affects system
performance as well. It is well known that the time consumed to execute an application (t) can
be computed by
$$t = \frac{\text{instruction count} \times CPI}{f} \qquad (2.3)$$
where instruction count and CPI refer to the number of instructions in the program and
the average cycles per instruction respectively. From the above equation, the time consumed
is inversely proportional to core frequency. However, the CPI changes with frequency since
memory latency is unaffected by core frequency change. Hence, a sub-linear decrease in per-
formance can be observed with lowered voltage and frequency. These trends strongly motivate
the use of DVFS for PPR management.
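The sub-linear performance loss can be illustrated with a small numerical sketch of the relations (2.2) and (2.3). The split of CPI into a core part and a memory part, the fixed memory latency in seconds, and all function and parameter names below are assumptions made for illustration; they are not taken from the cited work.

```python
def execution_time(instr_count, cpi_core, mem_accesses_per_instr, mem_latency_s, f_hz):
    """Equation (2.3) with CPI split into a core part and a memory part.

    Assumption: memory latency is fixed in seconds, so its cost measured in
    core cycles grows linearly with the core frequency.
    """
    cpi = cpi_core + mem_accesses_per_instr * mem_latency_s * f_hz
    return instr_count * cpi / f_hz


def relative_dynamic_power(v, f_hz, v_ref, f_ref_hz):
    """Dynamic power relative to a reference operating point, using (2.2)."""
    return (v / v_ref) ** 2 * (f_hz / f_ref_hz)


if __name__ == "__main__":
    n = 1_000_000_000  # hypothetical instruction count
    t_fast = execution_time(n, 1.0, 0.01, 100e-9, 2.0e9)
    t_slow = execution_time(n, 1.0, 0.01, 100e-9, 1.0e9)
    print("slowdown when halving f:", t_slow / t_fast)
    print("relative dynamic power:", relative_dynamic_power(0.9, 1.0e9, 1.2, 2.0e9))
```

With these hypothetical numbers, halving the frequency lengthens execution by roughly one third rather than doubling it, while dynamic power at the lower voltage and frequency drops to roughly 28% of the reference value.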
2.2.2 Early work on DVFS and classification of DVFS schemes
One of the earliest research works proposing the reduction in voltage to counter power
consumption was put forward by Chandrakasan et al. [16]. To reduce the detrimental effect
on performance, pipelining and resource duplication were proposed. Using performance and
power consumption modeling techniques, the authors derive the optimal voltage and frequency
to minimize energy-delay product. However, voltage and frequency are not dynamically varied
during execution.
Since then, a slew of DVFS techniques have been proposed for general purpose processors
(e.g., [81, 77]), embedded systems (e.g., [80, 53, 76]), high performance computing platforms (e.g.,
[42, 40, 57, 60]), as well as real-time systems (e.g., [80, 107, 18]). Real-time systems have the
notion of task deadlines which can be used as explicit performance constraints. In the other
computing domains, such a notion does not exist. Hence, a number of alternate metrics are
targeted for optimization like energy efficiency, energy-delay product, etc.
DVFS incurs two kinds of performance overhead. First, as stated earlier, a decrease in voltage
leads to slower circuits and hence a lower operable frequency. Second, transitions between
different voltage and frequency levels incur runtime overhead, as does the time taken to
decide which voltage and frequency to deploy. As such, any DVFS scheme should avoid
very frequent voltage and frequency transitions and has to be able to make decisions
quickly.
Based upon when and how the deployed voltage and/or frequency is (re-)assigned, existing
schemes in this field can be classified into offline [50, 72, 85, 43, 66], online [102, 34, 20, 73,
28, 94], and hybrid schemes [75, 90]. In an offline scheme, the decisions on when and how to
perform DVFS are taken statically. The decisions are usually based on expected application
execution profiles or code characteristics. Since decisions are taken statically, more complex
analysis can be performed leading to better DVFS decisions. However, they cannot exploit
runtime hardware-software interactions. In an online scheme, all the DVFS related decisions
are based upon observed system state and a few hardware-software interaction metrics traced
during execution. These schemes are lightweight in nature and work on peepholes of instruction
traces profiled at runtime. Hybrid schemes combine the benefits of both the online and offline
schemes.
2.2.3 Online DVFS schemes
In these schemes, DVFS trigger points can be interval-based (window-based) or event based.
Weiser et al. [102] put forward the idea of interval-based DVFS for general purpose computing
domain. The authors used trace based simulations to evaluate a set of DVFS approaches. The
traces used for simulations were collected from a Unix based workstation over a period of many
hours. Three different DVFS schemes were considered.
1. OPT: The entire execution trace is analyzed and the run times of all the tasks are stretched
to fill all the idle times.
2. FUTURE: Similar to OPT, but peers into a small window in the future rather than the entire
trace. Run times of the tasks within the window are stretched.
3. PAST: A small window in the past execution profile is considered, and decisions for
a future window are based on hardware-software interactions observed in the previous
window.
The authors observed that the merits of the FUTURE and PAST schemes approach that of the
OPT scheme when the DVFS intervals or window sizes are lengthened.
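The idea behind a PAST-style policy can be sketched as follows. This is a deliberately simplified illustration, not the exact algorithm of [102]: it assumes that work arriving in each interval is expressed as a fraction of what the CPU can do at full speed in one interval, that unfinished work is carried over, and that the next interval is predicted to look like the one just observed. The function name and the `min_speed` parameter are hypothetical.

```python
def past_policy(arriving_work, min_speed=0.2):
    """Return the relative speed (0..1] used in each interval.

    arriving_work[i] is the work arriving in interval i, as a fraction of
    the work a full-speed CPU completes in one interval.
    """
    speeds = []
    speed = 1.0    # start at full speed, nothing has been observed yet
    excess = 0.0   # backlog carried over from earlier intervals
    for work_in in arriving_work:
        speeds.append(speed)
        work = work_in + excess      # demand seen in this interval
        done = min(work, speed)      # capacity of the interval at this speed
        excess = work - done
        # PAST-style prediction: the next interval resembles this one.
        speed = min(1.0, max(min_speed, work))
    return speeds


if __name__ == "__main__":
    print(past_policy([0.5, 0.5, 0.5]))
    print(past_policy([0.2, 0.9, 0.1]))
```

In the second call the speed chosen for the busy interval is too low because the preceding interval was nearly idle, which shows how a policy driven purely by the past can lag behind sudden changes in load.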
Govil et al. [34] proposed a few additional sophisticated schemes for DVFS management.
Two new metrics are recorded for each interval and utilized for their schemes. The first metric,
namely run percent, computes the percentage of time for which the CPU is active. The second
metric, namely excess cycles, represents the work in an interval that is not accomplished by
using the selected speed setting. The new schemes proposed base the voltage and frequency
decisions on being able to run the work corresponding to a predicted run percent for the current
interval as well as the excess cycles accumulated over the previous intervals. Different methods
are used to predict the run percent for the current interval. The different prediction schemes
are as follows.
1. FLAT: A speed setting, which can accomplish the predicted work for the current interval
plus the excess cycles pushed into the current interval is selected. The run percent is
assumed to be a constant.
2. LONG SHORT: Two averages of previous run percents, one averaged over the last 3
intervals and the other averaged over the last 12 intervals, are maintained. The run percent
for the current interval is predicted as a weighted average of the previous two recordings.
A speed setting that satisfies the predicted run percent and the accumulated excess cycles
is selected for the current interval.
3. AGED AVERAGES: This is a variant of LONG SHORT, where the predicted run percent
is calculated by a weighted average as in the previous case. The weights decrease as we
go deeper into the past.
4. CYCLE: A cyclic behavior of run percents in the past history is looked for. If such a
cycle is found, it is extended to predict the run percent for the current interval. If not, a
constant run percent is assumed.
5. PATTERN: The run percents from previous intervals are categorized into coarse spells.
The pattern of the run percents for the last few intervals is matched against patterns in
the deeper past successively until a match occurs. The extension of such a pattern in the
past dictates the predicted run percent.
The authors observed that FLAT and LONG SHORT provide the best energy savings compared
to the other strategies.
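A LONG SHORT style predictor, together with the speed selection step shared by these schemes, might look like the sketch below. The weight given to the short-term average, the representation of run percent as a fraction measured at full speed, and the discrete speed levels are all assumptions for illustration; the text above does not specify them.

```python
def long_short_prediction(history, short_n=3, long_n=12, short_weight=0.75):
    """Predict the next run percent (as a fraction in [0, 1]).

    Combines the average of the last `short_n` intervals with the average
    of the last `long_n` intervals. `short_weight` is a hypothetical value.
    """
    if not history:
        return 1.0  # no information yet: assume a fully busy interval
    short = history[-short_n:]
    long_ = history[-long_n:]
    short_avg = sum(short) / len(short)
    long_avg = sum(long_) / len(long_)
    return short_weight * short_avg + (1.0 - short_weight) * long_avg


def select_speed(predicted_run, excess_cycles, interval_cycles, levels):
    """Pick the lowest speed level covering predicted work plus backlog."""
    demand = predicted_run + excess_cycles / interval_cycles
    for level in sorted(levels):
        if level >= demand:
            return level
    return max(levels)  # demand exceeds capacity: run at full speed


if __name__ == "__main__":
    history = [0.2] * 9 + [0.6] * 3
    predicted = long_short_prediction(history)
    print(predicted, select_speed(predicted, 0, 1000, [0.25, 0.5, 0.75, 1.0]))
```

Setting `short_weight` to 1.0 makes the predictor follow only the most recent three intervals, while lower values make it less reactive to short bursts.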
There are two problems inherent with interval driven schemes based on past workload traces.
First, it may not be possible to always find a suitable pattern in the deeper past reflecting the
characteristics associated with the profile in the recent past. This results in inaccurate workload
prediction for future intervals. Second, the assumption that future workload characteristics
mirror the past may not hold well.
Childers et al. [20] proposed to use external demands on the target million instructions
per second (MIPS) to make voltage and frequency scaling decisions. New scaling decisions are
made once per interval of 2 µs. The new frequency for an interval ($f_{new}$) is calculated as
$$f_{new} = f_{old} \times \frac{MIPS_{goal}}{MIPS_{observed}} \qquad (2.4)$$
where $f_{old}$ corresponds to the frequency employed for the previous interval. $MIPS_{goal}$ and
$MIPS_{observed}$ are the MIPS set as the target and the observed value for the previous interval
respectively. The authors observed a 47% reduction in energy consumption using their scheme
when compared to using a fixed voltage and frequency.
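The update rule in (2.4) is simple enough to state in a few lines. In the sketch below, clamping the result to a supported frequency range and falling back to the maximum frequency when no instructions were observed are assumptions added for robustness; they are not described in the text above, and the function and parameter names are hypothetical.

```python
def next_frequency(f_old_hz, mips_goal, mips_observed, f_min_hz, f_max_hz):
    """Scale the frequency by the ratio of target to observed MIPS, as in (2.4)."""
    if mips_observed <= 0:
        return f_max_hz  # assumption: no progress observed, run at full speed
    f_new = f_old_hz * mips_goal / mips_observed
    # Assumption: the hardware supports only a bounded frequency range.
    return min(f_max_hz, max(f_min_hz, f_new))


if __name__ == "__main__":
    # Observed throughput exceeds the goal, so the frequency is lowered.
    print(next_frequency(400e6, 300, 400, 100e6, 600e6))
```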
Dhiman et al. [28] propose a machine learning based approach for DVFS. A set of hardware-
software interactions are recorded using hardware counters. These interactions are used to
classify the current execution context into one of the regions/baskets in possible execution
spectrum. Each basket has associated with it a voltage and frequency range that is deployed.
The authors achieved a 49% decrease in energy consumption and reduced the implementation
overhead by a factor of 2 over existing state of the art approaches.
Event driven DVFS schemes trigger a voltage and frequency scaling decision when a certain hardware
event occurs. For example, Marculescu [73] used the knowledge of CPU stall cycles to switch
the processor to lower voltage and frequency levels while preserving performance. The author
observed a 20% reduction in energy consumption, 22% reduction in power consumption, and
14% reduction in peak power while sacrificing less than 6% performance. Stanley et al. [94]
similarly propose hardware based monitoring to detect application regions that are memory
bound and reduce voltage and frequency to reduce energy consumption.
2.2.4 Offline DVFS schemes
Offline DVFS algorithms calculate the trigger points and deployed voltage and frequency
levels statically. Such algorithms calculate the optimal voltages and frequencies by formulating
and solving an integer linear program (ILP) based upon the set constraints. Ishihara and Yasuura [50] provided an
ILP formulation for selecting optimum voltage levels for a task execution given a processor
with discrete voltage levels. They provided an additional insight that employing just two
voltage levels per task can optimize the energy consumption. In contrast, the authors of [72]
provided a solution to the optimal voltage and frequency selection problem with the assumption
of continuous voltage and frequency levels. We believe that the consideration of continuous
voltages and frequencies is impractical and Ishihara and Yasuura’s approach can also cater to
such situations, if they exist. Zomaya et al. [85] made similar observations for using just two
discrete frequency levels capitalizing on dynamic frequency scaling.
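The two-level insight can be made concrete with a small sketch. It assumes that the two levels used are the ones bracketing the ideal frequency (cycles divided by deadline), that switching overhead is negligible, and that energy per cycle is proportional to the square of the voltage. The function names and the level table are hypothetical and do not reproduce the ILP formulation of [50].

```python
def two_level_schedule(cycles, deadline_s, levels):
    """Split `cycles` between the two levels that bracket the ideal frequency.

    levels: iterable of (voltage, frequency_hz) pairs.
    Returns a list of (voltage, frequency_hz, cycles_at_this_level).
    """
    levels = sorted(levels, key=lambda lv: lv[1])
    f_ideal = cycles / deadline_s
    if f_ideal > levels[-1][1]:
        raise ValueError("deadline cannot be met at the highest frequency")
    if f_ideal <= levels[0][1]:
        v, f = levels[0]
        return [(v, f, cycles)]
    for (v_lo, f_lo), (v_hi, f_hi) in zip(levels, levels[1:]):
        if f_lo <= f_ideal <= f_hi:
            # Solve c_hi / f_hi + (cycles - c_hi) / f_lo = deadline_s for c_hi.
            c_hi = (cycles / f_lo - deadline_s) / (1.0 / f_lo - 1.0 / f_hi)
            return [(v_lo, f_lo, cycles - c_hi), (v_hi, f_hi, c_hi)]


def relative_energy(schedule):
    """Energy up to a constant factor, assuming energy per cycle ~ V^2."""
    return sum(v * v * c for v, _, c in schedule)


if __name__ == "__main__":
    table = [(0.8, 0.5e9), (1.0, 1.0e9), (1.2, 2.0e9)]
    plan = two_level_schedule(1.5e9, 1.0, table)
    print(plan, relative_energy(plan), relative_energy([(1.2, 2.0e9, 1.5e9)]))
```

For this hypothetical table, mixing the 1.0 GHz and 2.0 GHz levels meets the deadline exactly and uses less energy than running the whole task at the highest level and idling afterwards.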
A compiler level DVFS technique was proposed by Hsu and Kremer [43] where the com-
piler instruments applications to supply DVFS related commands. The application is profiled
to identify regions in application code with differing timing characteristics and execution fre-
quencies. For each program region, an optimal voltage and frequency is assigned so that the
performance penalty is never more than a preset threshold. Since the DVFS decisions are based
on profiled characteristics, they may not hold for future executions.
Offline techniques for DVFS are comprehensive and result in optimum system behavior.
However, the solutions are not adaptive to runtime fluctuations in expected performance and/or
power consumption characteristics. On the other hand, online techniques can adjust to runtime
deviations in application behavior. Since their operation falls under actual runtime, they
have to be lightweight and hence consider only intervals of execution for making voltage and
frequency selection decisions. As such, they are non-optimal. Hybrid schemes, which leverage
the advantages of both these techniques are highly desirable.
2.2.5 Hybrid DVFS schemes
Checkpointing based hybrid DVFS techniques are proposed by [75, 90]. In the scheme
proposed in [90], static voltage and frequency assignments are provided for different intervals
within an application’s worst case execution path. Intervals are demarcated at the branching
edges of the control flow graph, which correspond to branch or loop statements. When the
actual execution path deviates from the predicted path, the expected time difference for the
application executing along these two paths is used to speed up or slow down the processor
accordingly. Also, the predicted execution path is now updated. Using their schemes resulted
in 34% lower energy consumption when compared to the state of the art intra-task DVFS
schemes while executing an MPEG-4 decoder program.
DVFS in HPC domain
Use of DVFS to provide energy savings in high performance
computing domain has been proposed by [57] and [60]. The inter-task computational imbalance
is used to slow down tasks on the non-critical path to achieve significant energy savings. Freeh
et al. in [32] run the application to collect profile information. Using this information, the
application is divided by hand into multiple phases. Once the phase boundaries are demarcated,
the application is augmented to use the different voltage/frequency pairs to determine the best
combination to use for each phase. The authors in [63] take this a step further by automating
the phase boundary and voltage/frequency pair selection. The hit rates of the different caches,
the ratio of number of floating point operations to the number of memory operations, etc. are
used to characterize the behavior of individual loops. Each loop is then compared against a
known set of benchmark loops in terms of these observed characteristics. Each benchmark loop
has an associated voltage/frequency pair that is deemed best for it. The pair corresponding to
the matched benchmark loop is utilized for a program loop.
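The loop matching step can be pictured as a nearest-neighbor lookup. The text does not state which similarity measure [63] uses, so the Euclidean distance below is an assumption, as are the feature choice (cache hit rate, ratio of floating point to memory operations) and all names and numbers.

```python
import math


def match_loop(loop_features, benchmark_loops):
    """Return the (voltage, frequency_hz) pair of the closest benchmark loop.

    benchmark_loops: list of (features, (voltage, frequency_hz)) entries,
    where `features` has the same length as `loop_features`.
    Assumption: similarity is measured by Euclidean distance.
    """
    _, best_setting = min(
        benchmark_loops, key=lambda entry: math.dist(loop_features, entry[0])
    )
    return best_setting


if __name__ == "__main__":
    # Hypothetical benchmark table: (cache hit rate, FP ops per memory op).
    benchmarks = [
        ((0.95, 4.0), (1.2, 2.0e9)),  # compute bound: high voltage/frequency
        ((0.60, 0.5), (0.9, 1.0e9)),  # memory bound: low voltage/frequency
    ]
    print(match_loop((0.65, 0.8), benchmarks))
```

In practice the features would likely need to be normalized so that one characteristic with a large numeric range does not dominate the distance.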
2.2.6 Thermal aware DVFS
As technology scales down, the power density on the chip increases leading to higher tem-
peratures. The increase in temperature necessitates the use of costly heat sinks and cooling
mechanisms. To counter this demand and to preserve hardware reliability, thermal awareness
has been introduced into the voltage and frequency selection process.
Xie et al. [106] proposed thermal aware task scheduling policies. These policies scheduled
tasks on a System on chip (SOC) based upon a set of calculated static and dynamic criticalities.
Application control flow graph is used to calculate the static criticality of tasks. Dynamic
criticalities were based both on the position of a task in the control flow graph and the expected
operating temperature. Although voltage and frequency scaling is not used, we mention this
work since it represents one of the earliest efforts in thermal aware task scheduling.
Bao et al. [7] looked at temperature aware voltage selection. The authors propose a scheme
that takes the task mapping onto a multicore SOC as an input. The target temperature at
which a core should run also constitutes an input to the voltage selection process. First, a
set of voltages is assigned to the tasks to minimize the energy consumption. The resulting
thermal profile is then used to readjust the assigned voltages and, in turn, the temperature
profile. This process is repeated until the temperature converges to the set target. No particular
constraints on performance are considered.
The authors of [17] propose a mixed ILP based approach to assigning and scheduling tasks
with hard real-time constraints in MPSOCs. To solve large problem instances, heuristics are
also proposed. The chip peak temperature is subject to constraints in this approach.
It is observed that rapid temperature variations on the chip, along with the absolute temperatures,
are responsible for a large number of chip failures [87]. This chip failure mechanism,
called thermal cycling, has been studied in previous research but not dealt with proactively.
For example, the authors of [24] show the effect of thermal cycles on lifetime reliability, but do
not put forward an online approach to voltage scaling taking into account the thermal cycles.
The scheme proposed by Bao et al. [7] can be used to tackle thermal cycling but performance
awareness needs to be integrated into the scheme, essentially leading to a complete redesign.
The authors in [26] develop an ILP based approach to tackle thermal challenges. Thermal
cycles are minimized indirectly by constraining the peak temperature. This scheme is static
and does not deal with runtime variations in temperature and expected performance whose
knowledge is assumed for the ILP formulation. An online strategy is proposed in [25] where
an intelligent runtime management system selects one of the possible thermal policies to man-
age temperature profiles. However, the authors do not consider specific performance bounds.
Instead, performance and various reliability aspects are treated with the same priority. In this
dissertation, we develop a set of schemes to co-manage performance and chip reliability (includ-
ing proactive management of thermal cycles) employing DVFS. Specific bounds on performance
are utilized and guaranteed as per defined Quality of Service (QoS) standards while improv-
ing reliability. Effects of both absolute temperatures and thermal gradients are considered in
reliability calculations.
Le Sueur and Heiser [64] pointed out that the efficacy of DVFS alone in achieving energy savings
is diminishing over time due to 1) the increase in leakage power component, 2) reduced memory
latency, and 3) improved sleep modes. In fact, their experiments revealed that DVFS could
result in increased energy usage on modern hardware platforms. As such, the benefits provided
by DVFS should be coupled with techniques leading to leakage power reduction to provide
an increased dynamic range of performance-power values obtainable on a given architecture.
In the next section, a review of such a technique leveraging the capability to tune the
aggressiveness of the different microarchitectural components is provided.
2.3 Microarchitectural adaptation
Previous work in this area can be classified according to the adaptation granularity mea-
sured along two directions. Spatial adaptivity corresponds to the size of configuration space
considered. A larger configuration space leads to fine-grained control at the expense of analysis
complexity and runtime overhead. Temporal adaptivity refers to the frequency with which
adaptations are carried out during an application execution. Similar to the previous case, a
tradeoff between control granularity and runtime overhead exists in this regard.
Researchers have proposed adaptive architectural schemes using single-component [1, 13,
33, 44, 100] or multiple-component [2, 6, 31, 51, 61, 74] adaptation. For a single component, the
configuration space is small, leading to faster reconfiguration decisions. The authors of [100]
proposed a cache where the fetch size is continually modified based on the application access
patterns. The authors of [13] present a circuit design targeting issue queue in a superscalar
processor where the speed and size are adapted. Branch target buffer adaptation is performed
in [44] along with adapting the components of a hybrid predictor to save a significant amount of
energy consumption with very little performance loss. The authors of [1] present a methodol-
ogy to deactivate certain cache ways based on application cache intensity. Some performance
degradation is also tolerated in this process to attain considerable energy savings.
For multi-component schemes, the authors in [2] have adapted the L1 and L2 caches, reorder
buffer, instruction and load/store queues and register files. Cache resizing and DVFS are
considered by [74]. Very large configuration spaces are handled in [61, 31, 65] as well.
Microarchitectural adaptation schemes with different temporal granularity have been pro-
posed. The authors in [71] use a single architectural configuration for an application. The
configuration chosen can change for different applications. Such schemes are simplistic but
do not exploit intra-application variations in hardware-software interactions. Dhodapkar et
al. [30] base adaptation decisions on working set signatures. An application execution profile
is split into different regions each having a particular execution signature. This signature is
constructed from hardware-software interactions. All regions having a similar signature use
the same adaptive configuration. Different configurations that are amenable to different ap-
plication signatures are constructed and used. Adaptation based on frame level granularity is
proposed for multimedia applications in [45]. Adaptations are also considered at the granularity
of application intervals [65, 61].
In the following subsections, we will be reviewing a few important or closely related research
works in microarchitectural adaptation.
2.3.1 Classic research in microarchitectural adaptation
IPC and clock speed adaptation
The roots of research in microarchitectural adaptation
can be traced back to the work of David Albonesi, who introduced the concept of
Complexity-Adaptive processors (CAP) [1]. A CAP employed configurable hardware for the
core superscalar control and cache resources. A dynamic clock is also provided to adapt along
with the envisaged hardware structure to optimize the clock speed based on the inherent delay
of the configurable hardware resources. Using these features, the Instructions committed per
cycle (IPC) and clock speed can be traded off with one another. The idea behind the envi-
sioned tradeoff is discussed below. Superscalar microprocessor resources like cache hierarchy,
branch predictor, register rename logic, instruction queue, issue logic, etc. are traditionally
implemented as RAM or CAM based arrays of replicated storage elements driven
by global address and data buses, as shown in Figure 2.1 [1]. The inherent circuit delay as-
sociated with such resources is dependent on the number of active elements in the array. If a
few elements in an array are deactivated, the inherent circuit delay is reduced, thereby making
the circuit operable at a higher clock speed. Simultaneously, the IPC takes a hit. The
actual tradeoff potential is dependent on the application that is being executed and the RAM
or CAM array that is being adapted. Based on this idea, Albonesi configured data caches and
instruction queue to suit the needs of various applications.
Figure 2.1 Adaptation in Complexity-Adaptive processors
Cache line size adaptation
Veidenbaum et al. suggested adapting cache line sizes to
suit application needs [100]. The line size of a 16 KB L1 cache is adapted between
8 and 256 bytes, with each size double the previous one. Since
line size adaptation would require reconfiguration of the circuitry between RAM and the cache
under consideration, as well as some aspects of the RAM design, a virtual line size is defined
and adapted. Instead of adapting the physical line size, which is assumed to be 8 bytes in this
research, multiple cache lines (constituting a single virtual line) are transferred to the cache
from RAM upon a cache miss.
The adaptation scheme reduces the virtual line size upon a cache miss if the previously fetched
words are not being used, but increases the size when an adjacent line is already present in
the cache. In this context, the adjacent line for a particular virtual line is defined as the cache
line of the same size which would have been part of the line of interest, had the virtual line
size been double its present value. To track the line usage and adjacency information, each
physical line is augmented to store three extra pieces of information.
1. The current virtual line size corresponding to the physical line
2. A single adjacent bit, to indicate if the adjacent line is already present in the cache
3. A 2-bit saturating counter which measures the use of the cache line of interest
The speed of adaptation is further tuned by sending out line size decrement or increment
requests upon a cache miss. An actual adaptation occurs only when a particular number of such
requests occur consecutively for a single cache line upon cache misses associated with it. The
authors considered a mixture of two adaptation approaches that are outlined below.
1. Inc-fast : Increment line size instantly but decrement only after 2 consecutive decrement
requests are issued for a single line.
2. Dec-fast : Decrement line size instantly but increment only after 2 consecutive increment
requests are issued for a single line.
The authors found that the most effective solution to reduce memory bandwidth was to use
inc-fast for small line sizes and dec-fast for large line sizes. However, the miss rate encountered
was observed to be higher for some benchmarks when compared to a statically determined best
line width for the particular benchmark.
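The request filtering described above can be sketched as follows. This is a minimal illustration rather than the implementation from [100]: the class name, the starting size, and the `switch_point` that separates "small" from "large" line sizes are assumptions. Only the 8 to 256 byte range and the "2 consecutive requests" rule come from the text.

```python
from dataclasses import dataclass

MIN_LINE, MAX_LINE = 8, 256  # bytes, the range stated in the text


@dataclass
class LineSizeFilter:
    size: int = 32          # current virtual line size (hypothetical start)
    pending_inc: int = 0    # consecutive increment requests seen so far
    pending_dec: int = 0    # consecutive decrement requests seen so far
    switch_point: int = 64  # hypothetical: below this inc-fast, else dec-fast

    def request(self, direction: str) -> int:
        """Process one 'inc' or 'dec' request issued on a miss; return the size."""
        inc_fast = self.size < self.switch_point
        if direction == "inc":
            self.pending_dec = 0  # an opposite request breaks the run
            self.pending_inc += 1
            needed = 1 if inc_fast else 2
            if self.pending_inc >= needed:
                self.size = min(self.size * 2, MAX_LINE)
                self.pending_inc = 0
        elif direction == "dec":
            self.pending_inc = 0
            self.pending_dec += 1
            needed = 2 if inc_fast else 1
            if self.pending_dec >= needed:
                self.size = max(self.size // 2, MIN_LINE)
                self.pending_dec = 0
        else:
            raise ValueError(f"unknown request: {direction}")
        return self.size
```

A small line grows on the first increment request but needs two back-to-back decrement requests to shrink; a large line behaves the other way around.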
Cache and TLB hierarchy adaptation A memory hierarchy reconfiguration scheme
was proposed by Balasubramonian et al. [5]. In this work, a single large L1 cache is converted
to a mixture of L1 and L2 caches, and a single large TLB is converted into a two-level TLB
through runtime adaptation. The cache and TLB usage are monitored by detecting phase
changes using miss rates and branch frequencies, and performance is boosted by balancing the
hit latency and miss latency intolerance during execution.
The authors start out with a 2 MB 4-way data cache and convert it into a mixture of L1 and
L2 caches using intelligent addition of repeater switches to electrically isolate specific wordlines
of interest, upon L1 cache access. The following configurations for the L1 cache are allowed.
1. 256 KB direct-mapped cache
2. 768 KB, 3-way cache
3. 1 MB, 4-way cache
4. 1.5 MB, 3-way cache
5. 2 MB, 4-way cache
The cache miss rate, IPC, and branch frequency are monitored for every 100K cycles of ex-
ecution (called an interval) using hardware counters. When an application starts executing,
the L1 cache is configured to be 256 KB by default and an optimal cache exploration process
begins. After every interval during this exploration process, the obtained CPI is recorded.
If the cache miss rate is greater than 1%, the next larger L1 size is chosen for the next interval.
This process continues until the largest cache size is selected, or the miss rate drops to a value
less than 1%. At the end of this process, the configuration that provides the smallest CPI
is chosen and is used for the future intervals of execution. A further exploration of the optimal
cache size is triggered when the number of branches and misses differs between two intervals
of execution by more than a set threshold. If successive exploration phases lead to the
same optimal cache configuration, the threshold is incremented. Otherwise, the threshold is
decremented. This avoids unnecessary exploration phases that would select the same cache
configuration again.
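The exploration procedure described above can be sketched as follows. The `run_interval` callback, the threshold step, and the sample measurements are hypothetical placeholders. The list of sizes and the 1% miss-rate limit are taken from the text.

```python
L1_SIZES_KB = [256, 768, 1024, 1536, 2048]  # configurations listed above
MISS_RATE_LIMIT = 0.01                       # 1% threshold from the text


def explore(run_interval):
    """Try L1 sizes from smallest to largest; return the size with the lowest CPI.

    run_interval(size_kb) is a hypothetical callback that executes one
    interval at the given L1 size and returns (miss_rate, cpi).
    """
    best_size, best_cpi = None, float("inf")
    for size_kb in L1_SIZES_KB:
        miss_rate, cpi = run_interval(size_kb)
        if cpi < best_cpi:
            best_size, best_cpi = size_kb, cpi
        if miss_rate <= MISS_RATE_LIMIT:
            break  # miss rate is low enough; stop growing the cache
    return best_size


def update_threshold(threshold, previous_best, new_best, step=1):
    """Raise the phase-change threshold if exploration found nothing new."""
    if new_best == previous_best:
        return threshold + step
    return max(threshold - step, 0)


# Hypothetical per-size measurements: (miss rate, CPI).
measurements = {
    256: (0.05, 1.9),
    768: (0.02, 1.5),
    1024: (0.008, 1.6),
    1536: (0.006, 1.7),
    2048: (0.005, 1.8),
}
selected = explore(lambda size_kb: measurements[size_kb])
```

In this made-up trace the search stops at 1 MB, where the miss rate falls below 1%, but the 768 KB configuration is kept because it produced the lowest CPI among the sizes tried.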
To adapt the TLB, a counter tracks TLB miss handler cycles for every 1 million cycles of
execution. The L1 TLB size is incremented if this counter exceeds 3% of the total execution
time for the interval. In contrast, the L1 TLB size is reduced if the TLB usage is less than half.
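A minimal sketch of this TLB sizing rule is shown below. The resize step, the size bounds, and the use of "entries used" as the measure of TLB usage are assumptions made for illustration. Only the 3% and one-half conditions come from the text.

```python
def next_tlb_size(size, miss_handler_cycles, interval_cycles, entries_used,
                  step=64, min_size=64, max_size=512):
    """Return the L1 TLB size for the next interval.

    step, min_size, and max_size are hypothetical values in entries.
    """
    if miss_handler_cycles > 0.03 * interval_cycles:
        return min(size + step, max_size)  # too much time in the miss handler
    if entries_used < size / 2:
        return max(size - step, min_size)  # less than half of the TLB is in use
    return size
```

For example, under these assumptions a 128-entry TLB that spends 40,000 of 1,000,000 cycles in the miss handler would grow to 192 entries.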
Application of this adaptation methodology for a two level cache and TLB hierarchy at
0.1µm technology led to a 15% improvement in CPI when compared to the best conventional
two-level hierarchy of a comparable size. The authors also experimented with a conventional L1
cache and a similarly adapted L2/L3 cache in sub-0.1µm technology. In this case, the following
energy aware modifications are also made.
1. Only low energy configurations are used for L2 cache.
2. Data and tag lookup processes are serialized.
Using these modifications, a 43% reduction in the energy consumed by the memory hierarchy is
observed, in addition to the performance improvement.
Microprocessor queue adaptation Dmitry Ponomarev et al. proposed adaptation of
the sizes of issue queue (IQ), the reorder buffer (ROB), and the load/store queue (LSQ) based
on periodic sampling of their occupancies. These components are adapted independently of one
another, and the interplay between adaptation of the various components is not considered.
Further, these occupancies were not monitored continuously, to avoid overhead. All the queues
are assumed to be designed using individually controllable partitions, each of which can be
activated or deactivated using a simple control signal.
Different strategies are used to deactivate (downsize) or activate (upsize) partitions. The
downsizing of partitions is considered periodically. During each period, the number of active
entries in a queue is sampled at regular intervals. An average number of active entries is
derived from this monitored data. At the end of a period, the difference between the current
size of the queue and the average active queue size is calculated. If the difference is greater
than the size of a single partition, downsizing occurs. Two different downsizing strategies are
implemented and analyzed.
1. Conservative downsizing: At most one partition is deactivated in a single
period.
2. Aggressive downsizing: Multiple partitions, which fit within the difference calculated at
the end of the period, are subject to deactivation.
An overflow counter is used for each individual queue to support upsizing. This counter mea-
sures the number of cycles the microprocessor pipeline is stalled because of inability to find an
empty slot in the queue. This counter is initialized to 0 upon commencement of application
execution, and after each upsizing operation. Upsizing is triggered when this counter
overflows. Only a single partition is reactivated during each upsizing operation.
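The downsizing and upsizing rules above can be sketched as follows. This is a simplified software model of a hardware mechanism. The class name, the `overflow_limit` value, and the rule of always keeping one partition active are assumptions, not details from the original work.

```python
from dataclasses import dataclass


@dataclass
class ResizableQueue:
    partition_size: int     # entries per partition
    num_partitions: int     # total partitions built in hardware
    active_partitions: int  # partitions currently enabled
    overflow_limit: int     # hypothetical: stall cycles that trigger upsizing
    overflow_count: int = 0

    @property
    def size(self) -> int:
        return self.active_partitions * self.partition_size

    def end_of_period(self, samples, aggressive=False):
        """Downsize based on the occupancy samples collected in this period."""
        average = sum(samples) / len(samples)
        diff = self.size - average
        if diff > self.partition_size:
            fit = int(diff // self.partition_size)  # whole partitions in diff
            drop = fit if aggressive else 1
            # Assumption: at least one partition always stays active.
            self.active_partitions = max(self.active_partitions - drop, 1)

    def stall_cycle(self):
        """Record one cycle stalled for lack of a free entry."""
        self.overflow_count += 1
        if self.overflow_count >= self.overflow_limit:
            self.active_partitions = min(self.active_partitions + 1,
                                         self.num_partitions)
            self.overflow_count = 0  # counter restarts after each upsizing
```

With four active partitions of eight entries and an average occupancy of 12, the conservative policy removes one partition, while the aggressive policy removes two.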
The authors evaluated the effectiveness of the proposed adaptation methodology by per-
forming the aforementioned adaptations during the execution of various SPEC 95 benchmarks
[84]. The Simplescalar simulator [12] is modified to provide the proposed adaptations and
the execution of these benchmarks is simulated on a 4-way superscalar processor. In these
simulations, an average power savings of 53% is observed for the combination of IQ, ROB, and LSQ,
while incurring a performance penalty of only 5%.
Modular reconfiguration Mai et al. [71] proposed a high-level modular reconfiguration
platform called Smart Memories. This platform combines the benefit of architectural adapta-
tion with the performance advantage of domain specific computing hardware, while providing
a streamlined modular architecture that can be reconfigured easily to match the demands of
applications from multiple domains.
A Smart Memories chip consists of multiple processing tiles, each containing configurable
memory, wiring, and processing resources which employ multiple computational models. Figure
2.2 shows the Smart Memories tile floorplan. The floorplan consists of a processor, local
memory organized in multiple blocks called mats, and an interface to interact with other tiles
in the system. A crossbar is provided for the processor to interact with the local memory, as
well as other tiles in the system. To support multiprocessing, dedicated networking is provided
between sets of 4 tiles each, which are referred to as a Quad. Per Quad DRAM resources are
also provided to promote efficient communication. Microarchitectural adaptation support
is provided for the processor and the local memory in each tile. The processor in each tile
has two integer clusters and one floating point cluster to enable parallel execution. The Smart
Memories instruction path can be configured to support wide or narrow instruction encoding.
A 256-bit wide instruction format suits explicitly parallel instructions found in media and
signal processing kernels. The processor uses these wide instructions to supply instructions
to all available units in parallel. A 128-bit VLIW instruction format is supported to benefit
Figure 2.2 Smart Memories tile floorplan
applications which contain ILP but are less regular. A 32-bit narrow instruction format is
further supported to benefit applications that do not exhibit high ILP. However, thread-level
parallelism is used for such applications, where parallel execution of 2 concurrent threads issuing
32-bit instructions is supported. The mats in each Smart Memories tile can be configured in multiple
ways to support different cache organizations. Further, memory accesses can be performed
either in regular mode or in an auto-decrement/auto-increment mode with configurable strides.
To showcase the flexibility of the adaptive hardware, two widely variant processor topologies,
namely the Hydra speculative multiprocessor [35] and the Imagine stream processor [56], are
mapped onto the Smart Memories architecture. Although these implementations perform sub-
optimally with respect to their corresponding standalone implementations, the authors claim
that the flexibility provided to map multiple architectures onto the proposed platform trumps
the sub-optimality.
Branch predictor adaptation Huang et al. [44] proposed reconfiguring various branch
predictor parameters through structure resizing and access gating to minimize energy consump-
tion during periods of ineffective branch prediction. The authors tie the branch prediction
accuracy to program code structure and utilize profiling to analyze branch prediction efficacy
(of different allowed branch predictor configurations) individually for the different program
Figure 2.3 2Bc-gskew-pskew branch predictor organization
subroutines. Based on the data obtained, the components of a hybrid branch predictor are re-
configured at the granularity of program subroutines by solving a knapsack problem to minimize
overall energy consumption. The program code is dynamically instrumented to reconfigure the
branch predictor. This offline approach minimizes runtime overhead in determining the optimal
branch predictor configuration. Figure 2.3 shows the baseline hybrid predictor configuration
employed by the authors. The branch predictor contains three top level components: gskew,
pskew, and bimodal. The gskew and pskew components can be activated/deactivated using the
GEN and PEN signals respectively. The bimodal component is not targeted for reconfiguration
since it does not consume a significant amount of energy. The number of sets in the branch
target buffer is also adapted, in addition to the gskew and pskew components.
2.3.2 Closely related research in microarchitectural adaptation
Reducing Peak Power with a Table-Driven Adaptive Processor Core Kontorinis
et al. [61] present a processor peak power management technique. Peak power is of importance
since it directly affects the thermal budgeting, packaging and cooling costs for the processor.
The authors control the peak power by designing a centralized control mechanism that controls
architectural configuration. Peak configuration is not assigned to all the adapted units at the
same time. The units that are considered for adaptation are the I and D caches, integer and
FP instruction queues, reorder buffer, load/store units, integer and FP execution units, and
register renaming unit. A table driven approach is utilized for consultation on power and
performance characteristics before making a configuration decision.
Their design consists of two major components: a config ROM and an adaptation manager.
For any given application, the config ROM is first loaded with a set of allowed configurations
that do not exceed the statically decided peak power limit. Once set, a configuration stays
constant for a long interval of time, which the authors call an epoch (1M instructions). During this epoch,
the adaptation manager collects performance characteristics from the processor. At the end of
an epoch, the adaptation manager chooses a new configuration from the config ROM based on
the performance characteristics. Using this technique, the authors are able to reduce the peak
power by 25% by sacrificing no more than 5% of performance.
This work draws some powerful conclusions and inferences. Firstly, the authors observe
that setting just 2 or 3 microarchitectural units at their maximum level is enough to achieve
high levels of performance. Secondly, by limiting the allowed peak power to appropriate levels,
the search space for the configurations is reduced drastically. For example, out of 6,144
configurations allowed by the combinations of the different configuration levels of the
adaptive hardware units considered, only 285 combinations remain valid when the peak power
limit is to be reduced by at least 15%.
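The effect of a peak power limit on the size of the search space can be illustrated with a small sketch. It assumes, purely for illustration, that the peak power of a configuration is the sum of the peak powers of its units. The unit names, the levels, and the power values are hypothetical and do not correspond to the configurations studied in [61].

```python
from itertools import product

# Hypothetical peak power (W) of each unit at each configuration level.
UNIT_LEVELS = {
    "icache": [1.0, 1.5, 2.0],
    "dcache": [1.5, 2.5, 3.5],
    "int_queue": [0.5, 1.0],
    "rob": [1.0, 2.0],
}


def allowed_configs(unit_levels, budget):
    """Yield every level assignment whose summed peak power fits the budget."""
    units = list(unit_levels)
    level_ranges = (range(len(unit_levels[u])) for u in units)
    for levels in product(*level_ranges):
        peak = sum(unit_levels[u][lvl] for u, lvl in zip(units, levels))
        if peak <= budget:
            yield dict(zip(units, levels))


# Peak power when every unit is at its maximum level.
max_peak = sum(levels[-1] for levels in UNIT_LEVELS.values())
all_configs = list(allowed_configs(UNIT_LEVELS, max_peak))
# Configurations that survive a 15% reduction of the peak power limit.
pruned = list(allowed_configs(UNIT_LEVELS, 0.85 * max_peak))
```

Lowering the budget removes the configurations that set most units to their maximum level at once, which is the kind of pruning the config ROM contents reflect.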
There are two drawbacks with the proposed methodology. Firstly, the size of the configuration
search space is entirely dependent on the allowed peak power level. No attempt is made to
remove a subset of the combinations that perform very closely. As such, the pruning methodology
is not very robust. Secondly, there is no analysis of how well the proposed technique works
when the epoch length is varied. Such an analysis would make a case for an optimal control granularity.
Predictive Model for Dynamic Microarchitectural Adaptivity Control This re-
search [31] presents a prediction based model to improve the energy efficiency of a processor.
The energy efficiency is defined as the ratio of the number of instructions executed to the energy
consumed by the processor. The model developed is constructed empirically by identifying
optimal designs on training data. The model takes as input a set of hardware characteristics
monitored and predicts the best architectural configuration for the 14 adaptive units consid-
ered. The characteristics monitored are in the form of temporal counters with different bins
for different ranges of values.
A soft-max distribution is assumed for the probability of the model generating a certain
output configuration given a set of temporal hardware statistics. This distribution has cer-
tain parameters associated with it, which need to be quantified in order to use the model for
generating optimal configurations later on. In order to obtain these parameters, the authors
formulated a training process where a large number of program phases are studied against the
corresponding optimal configurations.
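For reference, a soft-max distribution over K candidate configurations has the generic form below. The feature vector $\mathbf{x}$ (the monitored hardware statistics) and the per-configuration weight vectors $\mathbf{w}_k$ are illustrative symbols. The exact parameterization used in [31] is not reproduced here and may differ.

$$P(c = k \mid \mathbf{x}) = \frac{\exp\left(\mathbf{w}_k^{\top}\mathbf{x}\right)}{\sum_{j=1}^{K}\exp\left(\mathbf{w}_j^{\top}\mathbf{x}\right)}$$

Under this form, training amounts to estimating the weights $\mathbf{w}_k$ from pairs of program-phase statistics and their optimal configurations, and prediction selects the configuration with the highest probability.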
During real program execution, the hardware characteristics are logged when the program
enters a new phase of execution. These characteristics are then fed to the generated model
which will output the optimal configuration for all the adaptive components. The authors
found that their model is effective in doubling the energy efficiency.
Since the model uses the parameters obtained through working on a few selected applica-
tions, it may not be reflective of the requirements of any generic application executing on the
processor. The authors do not justify the effectiveness of using the large set of adaptive con-
trols, since each control potentially has a different degree of effect in improving energy efficiency.
Also, the effects of configuring the different architectural controls are dealt with separately and
their interplay is not considered.
Efficiency trends and limits from comprehensive microarchitectural adaptivity
The large dimensionality of microarchitectural configuration space analysis prohibits designers
from considering a wide range of adaptations since the corresponding analysis becomes pro-
hibitively expensive. Two possible solutions exist. First, the analysis accuracy can be toned
down by employing sampling and predictive modeling techniques. Second, the adaptive config-
uration space is reduced as per requirements using existing simulation methodologies. Lee and
Brooks [65] employ the former strategy and consider an adaptive architectural space contain-
ing 240B configurations. The authors utilize random instruction trace sampling, spline based
regression for predictive modeling, and genetic algorithms for refinement of considered configuration
space. They also consider the ill effects of low degree of spatial and temporal adaptivity
in delivering the optimal $bips^3/w$, which is their optimization metric. The authors also perform
an analysis of how DVFS adds to system efficiency on top of microarchitectural adaptation.
However, this analysis is performed at a later stage and we believe that microarchitectural
adaptation and DVFS considerations should go hand in hand rather than one after the other.
In addition, we believe that reducing the microarchitectural adaptation configuration space to
a small subset of possible adaptations will significantly decrease the control complexity as well
as the design complexity. To this end, we develop mechanisms to systematically prune the
available configuration space.
2.4 Uniqueness of current research
The following factors distinguish the current research from the existing body of work.
1. Existing DVFS based schemes for performance-reliability management do not focus on
reduction of thermal cycling. We include the constraint to maintain the chip temperature
within a small window while adhering to performance constraints. This translates to
higher chip lifetime expectations, as observed in our experiments.
2. There have been no schemes that evaluate the effectiveness of microarchitectural adap-
tation for performance-reliability co-management with thermal cycling awareness. We
show how addition of microarchitectural adaptation can improve chip lifetime expecta-
tions when combined with DVFS.
3. A few pointers are provided in the earlier research to reduce the adaptive configuration
space. However, no formal pruning methodology has been proposed in this regard. We
develop a three stage pruning methodology to 1) identify the most effective adaptive
controls to build in hardware, and 2) choose the most beneficial configurations to use
given an application. Such a pruning methodology is essential to decrease the design and
control complexity associated with microarchitectural adaptation.
4. Existing schemes do not attempt to evaluate the effectiveness of the working adaptive
configuration space in trading off performance and power in the presence of faults. Such
an analysis is necessary and indicates what additional hardware capability is needed to
handle failures gracefully. We carry out such an analysis in this research work.
CHAPTER 3. PERFORMANCE RELIABILITY TRADEOFF
This chapter presents details of our research on performance and reliability tradeoff. A
soft real-time environment is considered for application execution. Performance guarantees are
provided through a Quality of Service (QoS) constraint. The chip reliability is studied in terms
of the expected mean time to failure (MTTF) under different chip failure mechanisms. Specific
hardware parameters are adapted to suit application needs and the constraint on performance,
while improving expected chip lifetime. A two stage performance and reliability management
scheme employing DVFS is designed and its effectiveness in providing the required tradeoff
is investigated. This scheme is then augmented with microarchitectural adaptation to further
improve the chip lifetime expectation.
3.1 Introduction
The need to manage performance and reliability together has been detailed in Chapter 1.
Microprocessor performance and hardware reliability are generally conflicting goals.
Increasing performance requires increasing the aggressiveness of the hardware, which
leads to higher power density. This increased power density translates to higher temperature
and reduced lifetime reliability. Since different applications stress the hardware to different
degrees, higher performance goals also lead to larger thermal gradients. This further reduces
the reliability. As such, it is important to provide mechanisms to cater to different demands
representing these two entities. Since performance is generally of higher importance than
reliability, we consider it as the primary constraint while reliability is considered as a secondary
constraint. We develop a scheme to improve chip lifetime expectations when executing a set of
known tasks while adhering to a performance constraint.
As mentioned earlier, heterogeneous tasks executing on a processor drive the chip to differ-
ent temperatures. The various microarchitectural components are exercised to different extents,
due to the varying CPU and memory intensities of these tasks. In [93], the authors report the
chip peak temperatures when different SPEC2000 integer benchmarks are executed on proces-
sors from different technology nodes. Their report provides the insight that the temperature
difference on a single processor when executing different benchmarks goes up as the feature
size shrinks. Executing a series of tasks leading to different temperature profiles causes the
processor to heat up and cool down alternately, a phenomenon known as thermal cycling.
Thermal cycles are categorized into large and small scale cycles. The large scale thermal cycles
are often a consequence of switching the processor on and off, and are expected to happen
infrequently. The small scale thermal cycles occur more frequently, and are a function of the
executing task characteristics.
Electronic circuits essentially consist of interconnected materials with different thermal
coefficients (e.g. metals and dielectrics). These individual materials are subjected to differential
expansion and contraction when the chip is heated and cooled. Due to their physical intercon-
nection, a mechanical stress develops in this process, leading to die cracking, thin film cracking,
solder joint fatigue, etc., over time. The authors of [87] mention that the corresponding failure
mechanisms contribute to a large fraction of chip failures. Most of these failure mechanisms have
not been analyzed statistically due to the complexity involved in the intricate interplay among
the various mechanisms. Even the JEDEC testing standards [52], used for chip reliability
quantification, do not account for the effects of small scale thermal cycles. We make an effort
to reduce both the thermal cycles and the steady state temperatures (SSTs) in an integrated
fashion, thereby increasing the expected chip lifetime. In the following, we refer to SST as a
relatively constant temperature to which a chip is heated when executing an application.
3.2 System model
We choose a soft real-time system model for application scheduling and execution. Task
schedulers in real-time environments typically base their schedule construction on the worst
case execution times (WCET ) of the tasks. When deadlines are not tight, static slack accumu-
lates in the schedule. The actual execution times also differ from the worst case assumptions
since the application contains multiple execution paths, each incurring potentially different
execution time. As a result, dynamic slack also accumulates in the schedule. Most modern
processors typically support execution of the tasks at multiple frequencies (and correspond-
ingly voltages). As a result, any positive slack generated in the system is currently exploited
by lowering the operating voltage and frequency. This reduces the overall energy consumption
and chip temperature. A majority of the existing DVFS approaches are not thermally aware
and aggressively tune down the operating voltages (frequencies) to achieve maximal energy
savings. To the best of our knowledge, there have been no approaches targeted to handle the
small scale thermal cycles. The following system model is considered.
1. There are N tasks in the system (T1, T2, ..., TN), listed in order of decreasing priority, i.e.,
Pri(Ti) > Pri(Tj) when i < j, where Pri(Tk) represents the priority of task Tk. Such task
lists are readily available in real-time environments and also exist in batch processing
systems. It is further assumed that a non-preemptive real-time scheduler is present
which assigns priority to these tasks.
2. The underlying processor supports m modes of operation. Each mode k, represented by
ck, is associated with a specific voltage vk and frequency fk, and optionally a unique
hardware configuration. For the purposes of DVFS, ck contains just vk and fk. In a later
section, we vary the hardware configuration along with DVFS to improve reliability.
3. Each task Ti is associated with a deadline Di, an expected execution time tik, an expected
energy consumption Eik (product of power consumption and execution time), and steady
state temperature Tik for the k-th mode of operation.
The different task characteristics mentioned above are arranged in a 2-Dimensional grid
which we refer to as the operations table (OT). The OT has m ∗ N operational points (OPs).
Each OP is represented as a 4-tuple < ck, Eik, tik, Tik >. Figure 3.1 shows the OT structure.
The values of tik, Eik, and Tik can be obtained through profiling, statistical modeling, or a
simulation based study. The entries along a column of the OT are arranged in increasing order
          Task 1                 Task 2                 ...   Task N
Mode 1    OP11 <c1,E11,t11,T11>  OP21 <c1,E21,t21,T21>  ...   OPN1 <c1,EN1,tN1,TN1>
Mode 2    OP12 <c2,E12,t12,T12>  OP22 <c2,E22,t22,T22>  ...   OPN2 <c2,EN2,tN2,TN2>
...       ...                    ...                    ...   ...
Mode m    OP1m <cm,E1m,t1m,T1m>  OP2m <cm,E2m,t2m,T2m>  ...   OPNm <cm,ENm,tNm,TNm>
Figure 3.1 Example Operations Table
of voltage and frequency. This implicitly arranges the entries in a column (from top to bottom)
in the increasing order of Tik, and in the decreasing order of tik. We develop a two stage OC
selection process to assign voltages and frequencies to different tasks in the queue. The first
stage statically assigns voltages and frequencies to each task by comprehensively considering
all the OT entries. We develop two complementary polynomial time algorithms, together
called the global operational point selection algorithms (GOPS), for this purpose. The OP
assignments made in this stage satisfy the following constraints in order.
• All task deadlines are met.
• The system steady state temperature never exceeds a threshold limit (Tthresh).
• All the selected OPs are constrained to a temperature window of predetermined size
(Twindow).
These GOPS algorithms are global in the sense that they consider the whole set of tasks
together at once. Our approach is based on the expected execution time of the tasks rather
than the WCETs, hence catering for both static and dynamic slack in the system. We monitor
the actual execution time values during task execution. Significant deviation of the actual
execution time from the expected execution time triggers the second stage of voltage and
frequency selection, called the local operational point selection algorithm (LOPS). Thus, LOPS
deals with the incremental slack when each single task gets launched for execution.
A set of OPs, one selected per task, is referred to as an operational chain (OC). Thus, an
OC dictates the set of voltages and frequencies assigned for the set of tasks in the schedule.
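The OT and OC structures, together with the three constraints listed above, can be sketched as follows. This is an illustrative data model and not the GOPS algorithm itself. It assumes back-to-back non-preemptive execution starting at time 0, absolute deadlines, and a temperature window measured as the spread between the hottest and coolest selected OPs. All numeric values are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OperationalPoint:
    mode: int      # index k of the mode c_k (voltage/frequency pair)
    energy: float  # E_ik
    time: float    # t_ik
    temp: float    # T_ik


def satisfies_constraints(chain, deadlines, t_thresh, t_window):
    """Check one operational chain (one OP per task, in schedule order)."""
    finish = 0.0
    for op, deadline in zip(chain, deadlines):
        finish += op.time
        if finish > deadline:
            return False  # a task deadline is missed
    temps = [op.temp for op in chain]
    if max(temps) > t_thresh:
        return False  # steady state temperature exceeds the threshold
    return max(temps) - min(temps) <= t_window


# Hypothetical OT for N = 2 tasks and m = 2 modes: ot[i][k] is OP_ik.
# Mode 0 is the lower voltage/frequency pair (slower and cooler).
ot = [
    [OperationalPoint(0, 5.0, 10.0, 60.0), OperationalPoint(1, 8.0, 6.0, 80.0)],
    [OperationalPoint(0, 4.0, 8.0, 55.0), OperationalPoint(1, 7.0, 5.0, 78.0)],
]
deadlines = [10.0, 16.0]
chain_mixed = [ot[0][0], ot[1][1]]  # meets deadlines, but an 18 degree spread
chain_fast = [ot[0][1], ot[1][1]]   # meets deadlines with a 2 degree spread
chain_slow = [ot[0][0], ot[1][0]]   # second task finishes after its deadline
```

With a threshold of 85 and a window of 10, only `chain_fast` passes all three checks, which shows how the window constraint can rule out an OC that would otherwise meet every deadline.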
3.2.1 Assumptions
The following assumptions are made in developing the OC selection algorithms.
1. The soft real-time environment guarantees a Quality of Service (QoS) requirement. QoS
is defined as the ratio of the number of tasks meeting their deadline to the number of
tasks scheduled for execution.
2. The real-time tasks execute long enough so that the chip reaches a steady state temperature
(which may be different for each task). The transient thermal gradients that arise during
task switching have minimal effect.
3. The task performance and thermal characteristics are obtained in advance through ex-
tensive profiling or an analytical model.
4. The voltage and frequency pairs supported by the processor are discrete and finite in
number.
5. Tasks are non pre-emptive and a task schedule constructed by an EDF scheduler is already
known.
3.2.2 Procurement of task execution characteristics
In the construction of OT, we need the values tik, Eik, and Tik for all the tasks in the task
queue. This information will be utilized by our GOPS algorithms to select proper OPs for
each task. We obtained these characteristics both using analytical modeling and cycle accurate
simulations separately. Since simulations are costly, we use analytical modeling to perform a
first order analysis to evaluate the effectiveness of the designed algorithms. Simulations are
later used to include the benefits of microarchitectural adaptation since analytical models in
this regard are not very accurate.
The analytical model we use is based on a similar model used by [37]. This model is used to
calculate tik, Eik, and Tik for the tasks operating at different OPs, given the nominal values at
the highest (nominal) voltage and frequency settings. The analytical model is presented below
for clarity.
For an OP, the normalized voltage vnorm is defined as the ratio between the operating
voltage V and the maximum permitted operating voltage Vmax. Similarly, the normalized
frequency fnorm is defined as the ratio between the operating frequency F and the maximum
permitted operating frequency Fmax.
$$v_{\mathrm{norm}} = V/V_{\max}, \qquad f_{\mathrm{norm}} = F/F_{\max} \tag{3.1}$$
The relation between the operating voltage and frequency is approximated as
$$V = aF^{\alpha} \tag{3.2}$$
where a and α are hardware related parameters. Hence, the relation between vnorm and fnorm
can be modeled as
$$v_{\mathrm{norm}} = f_{\mathrm{norm}}^{\alpha} \tag{3.3}$$
where α is an architecture dependent constant.
Two additional parameters, namely ρ and µ, are introduced to model the dependence of
energy consumption, execution time and temperature on the nature of tasks and the underlying
architecture. The parameter ρ represents the normalized value of the leakage power consump-
tion to the total power consumption when the processor is operating at Vmax and Fmax. This
is a measure of the transistor leakage characteristics, and is highly dependent on the microar-
chitectural design. The parameter µ represents the CPU intensity of the task. It is defined as
the ratio of CPU computational time to the net execution time for the task. µ can be obtained
by runtime profiling of the tasks. ρ and µ are given as
$$\rho = \frac{\text{static power}}{\text{static power} + \text{dynamic power}} \tag{3.4}$$
$$\mu = \frac{\text{CPU time}}{\text{CPU time} + \text{memory access time}} \tag{3.5}$$
The normalized task execution time, t is calculated as
$$t = (1-\mu) + \mu/f_{\mathrm{norm}} \tag{3.6}$$
This is because the execution time component corresponding to the memory accesses does
not change when changing the processor frequency and voltage settings alone. Only the CPU
intensive component speeds up (slows down) at a more aggressive (less aggressive) OP.
Both the static and dynamic power scale down as voltage and frequency are decreased. If
the static and dynamic powers at the maximum voltage and frequency are represented by PS
and PD, then the corresponding values when using vnorm and fnorm, as given by [27], are
$P_{Snorm} = \rho \, v_{norm}$ (3.7)
$P_{Dnorm} = (1 - \rho) \, v_{norm}^{2} \, f_{norm}$ (3.8)
Combining these expressions for static and dynamic power with Equations 3.3 and 3.6 yields
the normalized energy consumption
$e = (1-\rho)\mu f_{norm}^{2\alpha} + \rho(1-\mu) f_{norm}^{\alpha} + \rho\mu f_{norm}^{\alpha-1}$ (3.9)
Finally, the normalized steady state temperature is approximately proportional to the power
density on the processor chip. Since chip area stays constant, temperature is proportional to
net power consumption. Hence, it is modeled as
$T = \frac{(1-\rho)\mu f_{norm}^{2\alpha+1} + \rho(1-\mu) f_{norm}^{\alpha+1} + \rho\mu f_{norm}^{\alpha}}{(1-\mu) f_{norm} + \mu}$ (3.10)
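The model above can be evaluated numerically. The following is a minimal sketch, not the
implementation used in this work: the function name and argument names are illustrative
assumptions. It evaluates Equations 3.6, 3.9, and 3.10 for one OP, and relies on the
observation that Equation 3.10 equals e/t. Note that, as written, Equation 3.9 charges static
power over the whole execution time and dynamic power over the CPU-bound portion only.

```python
def normalized_metrics(f_norm, alpha, rho, mu):
    """Return (t, e, T) for one operating point, following Eqs. 3.6, 3.9 and 3.10.

    f_norm: normalized frequency in (0, 1]; alpha: exponent of Eq. 3.3;
    rho: leakage fraction (Eq. 3.4); mu: CPU intensity (Eq. 3.5).
    """
    if not 0.0 < f_norm <= 1.0:
        raise ValueError("f_norm must lie in (0, 1]")
    # Eq. 3.6: only the CPU-bound fraction stretches when the frequency drops.
    t = (1.0 - mu) + mu / f_norm
    # Eq. 3.9, term by term.
    e = ((1.0 - rho) * mu * f_norm ** (2 * alpha)    # dynamic power x CPU time
         + rho * (1.0 - mu) * f_norm ** alpha        # static power x memory time
         + rho * mu * f_norm ** (alpha - 1))         # static power x CPU time
    # Eq. 3.10 is algebraically e / t (average power over the task).
    temp = e / t
    return t, e, temp


if __name__ == "__main__":
    # Purely CPU-bound task at half the maximum frequency, alpha = 1, rho = 0.25.
    print(normalized_metrics(0.5, alpha=1.0, rho=0.25, mu=1.0))
```

With α = 1, ρ = 0.25, and µ = 1, halving the frequency doubles the normalized execution time
while the normalized energy drops to 0.4375.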
3.3 OC Selection
A set of algorithms is developed to select voltages and frequencies for the different tasks
scheduled for execution. This selection is based on the expected and actual execution times
of the tasks. As mentioned earlier, a two-step approach is formulated for the OC selection.
The first step, referred to as GOPS, selects an OC based on the expected execution times of
the different tasks considered. If additional positive or negative slack arises in the
schedule during actual execution, it is handled in the second step, referred to as LOPS.
Since the LOPS algorithm is primarily used to manage and utilize the dynamic slack, it can
increase the temperature gradient between the selected adjacent OPs. To minimize this
negative effect, the LOPS algorithm selects an alternate OP for a task in the case of
positive runtime slack only if this gradient is not significantly aggravated. The next two
subsections detail the GOPS and LOPS algorithms.
3.3.1 GOPS Algorithms
The problem of assigning voltages and frequencies to the given task set can be viewed as
the selection of a proper OC from the OT that satisfies the constraints detailed in Section
3.2. A total of $m^{N}$ OCs can be constructed from the OT, since each of the N tasks can
operate at any of the m available operating modes. Using a brute force approach to compare
the merit of all these combinations becomes computationally daunting as the number of tasks
or operating modes increases. We develop two complementary polynomial time algorithms to
select a suitable OC from the OT, namely peak reduction and window based selection. The
details of these two algorithms are provided in the following subsections.
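To give a sense of scale for the brute force search, the worked figure below uses m = 8
operating modes and N = 16 tasks, values that match one of the configurations evaluated in
Section 3.4.1. It is only an illustration of the count of candidate OCs.

```latex
m^{N} = 8^{16} = 2^{48} \approx 2.8 \times 10^{14} \ \text{candidate OCs}
```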
3.3.1.1 Peak Reduction Algorithm
The Peak reduction algorithm is iterative in nature, and each iteration involves selection of
a candidate OC from the feasible pool of OCs in the OT. The OPs for the different tasks are
selected such that they fall close to a target temperature set for the iteration. At the end
of the iteration, the currently selected OC is compared against the best OC selected so far
during the previous iterations. If the newly selected OC is deemed to be more meritorious, it
will now be considered as the best OC for future iterations. The following variables are
defined and used for the algorithm execution.
1. Pkr: A reference to the peak reduction algorithm.
2. o_chain: The OC selected during a Pkr iteration.
3. CurrentBest_OC: A reference to the best OC selected by Pkr until a particular iteration.
4. targetT: The target temperature employed for a Pkr iteration.
5. feasibilityflag: A flag indicating whether the chosen OC satisfies the deadline constraints
for all the tasks.
6. doneflag: A flag signaling algorithm termination.
The algorithm contains two phases, Tup and Tdown, each of which may execute in multiple
iterations. The Tdown phase is executed first, where the chip temperature profile is
iteratively lowered until the algorithm finds a set of OPs that do not satisfy the task
deadlines, or the algorithm selects the lowest OPs for all tasks. When the former situation
arises, the algorithm switches to the Tup phase. In this phase, faster OPs are progressively
selected, leading to a higher temperature profile while striving to satisfy task deadlines.
The algorithm initializes by making the OPs corresponding to the highest voltage and
frequency a part of o_chain and CurrentBest_OC.
Tdown phase In the Tdown phase, the temperature corresponding to the coolest OP in the
o_chain is selected as the targetT. In each iteration, the OPs for the different tasks that
are closest to targetT are selected into o_chain. We refer to this selection process as
nearestSelect. Once a new o_chain is completely formed, it is checked for deadline
feasibility, and the feasibilityflag is updated accordingly. If all the deadlines are
satisfied, the currently selected o_chain is compared with CurrentBest_OC to see how it fares
in satisfying the constraints on temperature in the order mentioned in Section 3.2. A
conditional update of CurrentBest_OC occurs accordingly. The algorithm then proceeds to the
next iteration, and this process repeats until one or more task deadlines are not satisfied.
If the o_chain selected during an iteration coincides with the OC selected in the previous
iteration, the algorithm can get stuck in an infinite loop. To break out of this loop, a new
o_chain is selected which contains, for each task, the OP with the next lower voltage and
frequency setting compared to the OP selected in the previous iteration. We refer to this
selection process as lowerSelect.
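The nearestSelect step described above can be sketched as follows. This is an illustrative
sketch only: the data layout (a per-task list of expected OP temperatures) and the function
names are assumptions, and ties are broken in favor of the lower OP index, which the text does
not specify.

```python
def coolest_temperature(op_temps, chain):
    """Temperature of the coolest OP currently in the chain (used as targetT)."""
    return min(op_temps[k][j] for k, j in enumerate(chain))


def nearest_select(op_temps, target_t):
    """For every task k, pick the OP index j whose temperature is closest to target_t.

    op_temps[k][j] is the expected steady state temperature of task k at OP j.
    """
    return [min(range(len(temps)), key=lambda j: abs(temps[j] - target_t))
            for temps in op_temps]


if __name__ == "__main__":
    # Two tasks with three OPs each (temperatures in K, hypothetical values).
    op_temps = [[320.0, 345.0, 360.0], [310.0, 330.0, 350.0]]
    chain = [2, 2]                              # start from the fastest OPs
    target = coolest_temperature(op_temps, chain)
    print(target, nearest_select(op_temps, target))
```

In this example the initial chain has a coolest temperature of 350 K, so task 0 moves to its
345 K OP while task 1 stays at 350 K.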
Tup phase We reach the Tup phase when one or more task deadlines are not met. Selecting
slower alternate OPs would not help fulfill the performance constraints, so new o_chains
consisting of more aggressive OPs are selected in this phase. Each iteration in the Tup phase
proceeds as follows. The currently existing coolest OP is taken out of further consideration.
The new target temperature is calculated as the temperature of the coolest OP still under
consideration. Once targetT is calculated, a new o_chain is selected in a fashion similar to
nearestSelect. To distinguish the way of selecting targetT from the method used in the
nearestSelect process, we refer to the current OC selection process as alternateSelect. The
Tup phase ends when a deadline feasible OC is selected or when no OPs remain under
consideration in the o_chain that was selected just before the Tup phase started.

Scheme          | New OC                                                                                                  | Use condition
lowerSelect     | $OP_{k(j-1)} \in OC_i \iff OP_{kj} \in OC_{i-1}$                                                        | $OC_{i-1} = OC_{i-2}$
nearestSelect   | $OP_{kj} \in OC_i \iff (\nexists\, OP_{kl},\ |T_{kl} - targetT| < |T_{kj} - targetT|)$                  | $OC_{i-1} \neq OC_{i-2}$ and ff = 1
alternateSelect | $OP_{kj} \in OC_i \iff (\nexists\, OP_{kl},\ |T_{kl} - targetT| < |T_{kj} - targetT|)$                  | $OC_{i-1} \neq OC_{i-2}$ and ff = 0
Table 3.1 OC selection schemes for Peak Reduction (ff denotes the feasibilityflag)
Listing 3.1 The Peak Reduction Algorithm
Function PeakReduction_Algorithm
    feasibilityflag = 1; doneflag = 0; initializer();
    while (doneflag == 0) {
        if (feasibilityflag == 0) {
            // Tup phase: a deadline was missed, so faster OPs are tried
            o_chain = alternateSelect();
        } else {
            // Tdown phase: keep lowering the temperature profile
            if (o_chain == prev_chain) o_chain = lowerSelect();
            else o_chain = nearestSelect(o_chain.COT);
            feasibilityflag = checkFeasibility(o_chain);
            if (feasibilityflag == 1) optimizeChain();
        }
        prev_chain = o_chain;
    }
    return CurrentBest_OC;
End Function

Function alternateSelect
    COPindex = 1
    while (COPindex < num_of_tasks) {
        o_chain = nearestSelect(o_chain.COT);
        feasibilityflag = checkFeasibility(o_chain);
        if (feasibilityflag == 1) optimizeChain();
        COPindex = COPindex + 1
    }
    doneflag = 1;
End Function

Function optimizeChain
    if (CurrentBest_OC.HOT < Tthresh && o_chain.HOT < Tthresh) {
        // both chains stay below the threshold: keep the one with the smaller Sum_Tdiff
        if (CurrentBest_OC.Sum_Tdiff > o_chain.Sum_Tdiff)
            CurrentBest_OC = o_chain;
    } else CurrentBest_OC = o_chain;
End Function
The three different OC selection processes are listed in Table 3.1. In the table, the notation
OCi is used to denote the OC selected at the end of iteration i. Listing 3.1 depicts the pseudo
code.
Algorithm runtime For a given OT, there can be a maximum of m∗N Tdown iterations and a
maximum of N−1 Tup iterations. The worst case behavior for the Tdown phase occurs when each
iteration changes only one OP in the selected o_chain. Similarly, the worst case behavior for
the Tup phase occurs when only the last o_chain selected in this phase is deadline feasible.
Each Tup or Tdown iteration has a time complexity of O(mN). Hence, the worst case complexity
of this algorithm is $O(m^{2}N^{2})$. Although this seems reasonable, the complexity does not
scale well with increasing m, which becomes an issue if a large number of operating modes is
considered. An increase in N can be handled by dividing the task queue into multiple windows
and performing OC selection on the different windows separately. In the next subsection, a
complementary algorithm for GOPS that scales better with increasing m is detailed.
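The worst case bound follows directly from the iteration counts stated above. As a worked
instance, the numbers m = 8 and N = 64 are borrowed from the evaluation in Section 3.4.1 and
serve only as an illustration.

```latex
\underbrace{\bigl(mN + (N-1)\bigr)}_{\text{maximum iterations}} \times
\underbrace{O(mN)}_{\text{cost per iteration}} = O(m^{2}N^{2}),
\qquad
m = 8,\ N = 64:\quad 8 \cdot 64 + 63 = 575 \ \text{iterations at most}
```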
3.3.1.2 Window Based Selection Algorithm
An alternative to the peak reduction algorithm is the Window based OP selection (WOPS)
algorithm. This algorithm restricts the selection of OPs in favor of reducing the temperature
gradients between the tasks. The following variables and metrics are utilized.
1. SumTdiff: Sum of absolute temperature differences between adjacent tasks in the schedule.
2. o_chain: The OC selected during a WOPS iteration.
3. CurrentBest_OC: A reference to the best OC selected by WOPS until the specified iteration.
4. Pivot: A selected OP on whose basis the other OPs in an o_chain are selected.
5. dirn: A variable specifying the direction (in terms of temperature) in which the next
candidate OP for o_chain has to be selected.
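The SumTdiff metric defined in the list above can be computed as in the following minimal
sketch. The function name and the representation of a chain as a list of per-task steady state
temperatures are assumptions made for illustration.

```python
def sum_tdiff(chain_temps):
    """Sum of absolute temperature differences between adjacent tasks in the schedule."""
    return sum(abs(b - a) for a, b in zip(chain_temps, chain_temps[1:]))


if __name__ == "__main__":
    # Four tasks in schedule order (temperatures in K, hypothetical values).
    print(sum_tdiff([350.0, 340.0, 355.0, 355.0]))
```

For the four hypothetical temperatures shown, the adjacent differences are 10 K, 15 K, and
0 K, giving a SumTdiff of 25 K.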
The WOPS algorithm is iterative in nature. Each iteration employs a pivot OP and a virtual
temperature window. In each iteration, an o_chain is chosen such that the constituting OPs
lie close to this window. The pivot for an iteration is simply a candidate OP for the first
task in the queue. The virtual temperature window constitutes a small temperature range
around the pivot's expected temperature. Since m such pivots are possible, the algorithm
executes in m iterations. As in the case of peak reduction, each o_chain selection in WOPS is
followed by a conditional update of CurrentBest_OC. The WOPS algorithm also starts by
initializing CurrentBest_OC in the same way as the Peak reduction approach.
Listing 3.2 The WOPS Algorithm
Function WOPS_Algorithm
    initialize();
    // one outer iteration per candidate pivot OP of the first task
    while ((o_chain[0] = selectpivot()) != NULL) {
        calculateVbounds();
        while (next_task_id < num_of_tasks) {
            o_chain[next_task_id] = selectionAlgo(o_chain, next_task_id, dirn);
            dirn = setDirection();
            next_task_id++;
        }
        current_schedule_feasibility = checkFeasibility(o_chain);
        if (current_schedule_feasibility) {
            calculateNetSwing(); updateBestChain(); updateFlags();
        }
    }
End Function
Listing 3.2 shows the pseudo code for our WOPS algorithm. The function selectionAlgo takes
the updated o_chain as the input to find the next OP to be included in the chain. A direction
variable dirn is maintained to guide the selection process. This variable is updated on the
fly to force the selection algorithm to choose OPs closer to the virtual bounds, if required.
This process is illustrated in Figure 3.2. The worst case complexity for one iteration of
selectionAlgo is O(N log m), since there are N tasks and a binary search can be used to find
the candidate OP for each task that is closest to the window.
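The per-task binary search mentioned above can be sketched with Python's standard bisect
module. This is a sketch under stated assumptions rather than the selectionAlgo used in this
work: the OP temperatures of a task are assumed to be sorted in ascending order, the function
name is hypothetical, and dirn is interpreted relative to the target temperature (0: closest,
1: closest but higher, -1: closest but lower), falling back to the nearest end when no such OP
exists.

```python
from bisect import bisect_left, bisect_right


def closest_op(sorted_temps, target, dirn=0):
    """Index of the OP closest to target in O(log m), optionally restricted by dirn."""
    n = len(sorted_temps)
    if dirn > 0:
        # first temperature strictly above the target, or the hottest OP if none exists
        return min(bisect_right(sorted_temps, target), n - 1)
    if dirn < 0:
        # last temperature strictly below the target, or the coolest OP if none exists
        return max(bisect_left(sorted_temps, target) - 1, 0)
    i = bisect_left(sorted_temps, target)   # first index with temperature >= target
    if i == 0:
        return 0
    if i == n:
        return n - 1
    before, after = sorted_temps[i - 1], sorted_temps[i]
    return i if after - target < target - before else i - 1


if __name__ == "__main__":
    temps = [310.0, 325.0, 340.0, 355.0, 370.0]   # hypothetical OP temperatures in K
    print(closest_op(temps, 342.0), closest_op(temps, 342.0, 1), closest_op(temps, 342.0, -1))
```

Calling this routine once per task gives the O(N log m) cost per o_chain quoted above.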
Figure 3.2 shows the o_chain selection round when there are 4 tasks in the system, each of
which can operate at 2 different OPs. Assume that the point P11 is currently selected as the
pivot. We term the most recently selected OP the o_chain header. The dashed lines in the
figure represent the virtual temperature bounds. Initially, the o_chain header lies between
the virtual bounds, and dirn is set to 0 to indicate this. For the second task, there are two
potential OPs available for selection. The selection algorithm selects the point closest to
P11, which is P21. P31 is selected for task 3 using similar logic. Since P31 falls outside the
virtual bounds, it is beneficial for the selection algorithm to select the next OP towards
the upper virtual bound, in order to avoid a large thermal gradient with respect to P11. dirn
is set to -1 to achieve this effect. The selection algorithm thus selects P42 instead of P41.

[Figure 3.2: candidate OPs P11 and P12 (Task 1), P21 and P22 (Task 2), P31 and P32 (Task 3),
and P41 and P42 (Task 4), with P11 as the current pivot. Legend: Dirn = 0, find the OP with
the closest temperature; Dirn = 1, find the OP with the closest but higher temperature;
Dirn = -1, find the OP with the closest but lower temperature.]
Figure 3.2 Example OC selection using WOPS
Once a complete OC is selected, its deadline feasibility is calculated using the
checkFeasibility function, which takes O(N) processing time. If the feasibility check
succeeds, the newly selected OC replaces the CurrentBest_OC utilizing the same logic as that
used for Peak reduction. At the end of the current iteration, the pivot is marked as invalid
and the algorithm starts another iteration by selecting a new pivot.
Runtime for window based selection The WOPS algorithm terminates when it runs out of valid
pivots to choose from. Since there are m pivot points that can be chosen, the total
complexity of the algorithm is $O(Nm \log m)$. The time complexity can be further reduced to
$O(N \log m^{2})$ by using a binary search for selection of the pivots.
3.3.2 LOPS Algorithm
The LOPS algorithm takes the best OC selected by a GOPS algorithm as input and deals with
the runtime slack by (potentially) altering each task's OP locally before it starts
execution. Since the execution of LOPS falls into the actual task schedule, it is designed to
be fast. The runtime slack is calculated as the difference between the estimated start time
of a task (obtained from the schedule constructed by GOPS) and its actual start time
(obtained during execution). When there is negative runtime slack in the system, the LOPS
algorithm selects a more aggressive OP compared to the one preselected by GOPS. On the other
hand, when there is positive runtime slack, the LOPS algorithm changes the OP selected by the
GOPS algorithm only if the newly selected point does not cause an additional local SumTdiff
of Tmax with respect to its adjacent selected OPs in the schedule. The LOPS algorithm, as
listed in Listing 3.3, has a time complexity of O(m).
Listing 3.3 The LOPS Algorithm
Function LOPS_Algorithm
    int i;
    if (current_slack == 0) return selected_point;
    else if (current_slack < 0) {
        // negative slack: search the more aggressive OPs
        for (i = selected_point + 1; i < m; i++)
            if ((i.time - selected.time) < current_slack) return i;
    } else if (current_slack > 0) {
        // positive slack: search the slower OPs, subject to the thermal check
        for (i = selected_point - 1; i >= 0; i--)
            if ((i.time - selected.time) < current_slack)
                if (cycleEffect(selected_point, i, task_id) < Twindow/2) return i;
    }
End Function
Scheduling in periods of persistent slack As described earlier, the GOPS algorithm creates
an initial OC based on expected execution times. To account for scenarios leading to
continual runtime slack in a single direction, the GOPS algorithm creates additional OCs
based on scaled expected execution times. This helps make comprehensive static decisions that
perform better when slack arises. The OT is modified by scaling the expected execution times
of the different tasks with different scaling factors. For each modified OT, an OC is chosen
using a GOPS algorithm. We have limited the scaling factor to the range 0.8 to 1.2 (with a
step of 0.05). Using this methodology, we can cater for dynamic slack of up to 20% (both
positive and negative). The LOPS algorithm can employ any of these OCs, corresponding to the
different scaling factors, during task execution.
The entire task execution schedule is split into windows. Each window contains W tasks.
During task execution, a miss counter (MC) is employed to monitor the number of deadline
misses so far in the schedule. After the ith window of tasks finishes execution, we consider
the OC selected for a particular scaling factor for deployment in the next window. The
scaling factor used for the (i+1)th window is adjusted based on the criterion below.
$(MC + W) / (W (i + 1)) < 1 - QOS$ (3.11)
If the above criterion is satisfied and there exists positive slack in the schedule, the
scaling factor employed for the next window is reduced by a step to improve the energy
savings. If there exists negative slack in the schedule, the OC corresponding to the next
higher scaling factor is utilized by the LOPS algorithm for the next window. The OC selection
mechanism also learns how the runtime slack is evolving over time. If slack in the same
direction is observed for two successive windows, the step size used for choosing the scaling
factor increments by 1. If the slack direction reverses between two adjacent window
boundaries, the step size is reset to 1.
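The window-by-window adjustment described above can be sketched as follows. This is one
possible reading of the text, not the scheduler used in this work: all names are hypothetical,
the scaling factors are indexed from 0.80 to 1.20 in steps of 0.05, and the move to a higher
scaling factor under negative slack is assumed to happen regardless of criterion 3.11, which
the text leaves open.

```python
# Scaling factors 0.80, 0.85, ..., 1.20 (bounds and step taken from the text).
SCALING_FACTORS = [round(0.8 + 0.05 * k, 2) for k in range(9)]


def criterion_met(miss_count, window_size, i, qos):
    """Equation 3.11, evaluated after window i for use in window i + 1."""
    return (miss_count + window_size) / (window_size * (i + 1)) < 1.0 - qos


def next_factor_index(index, step, met, slack):
    """Move down on positive slack (if 3.11 holds), up on negative slack; clamp to bounds."""
    if slack > 0 and met:
        index -= step
    elif slack < 0:
        index += step
    return max(0, min(index, len(SCALING_FACTORS) - 1))


def next_step(step, prev_slack, slack):
    """Grow the step while the slack keeps its sign; reset it to 1 when the sign flips."""
    if prev_slack * slack > 0:
        return step + 1
    if prev_slack * slack < 0:
        return 1
    return step


if __name__ == "__main__":
    met = criterion_met(miss_count=10, window_size=1024, i=39, qos=0.95)
    idx = next_factor_index(4, 1, met, slack=+1.0)
    print(met, SCALING_FACTORS[idx])
```

In the usage example, 10 misses with W = 1024 and i = 39 satisfy the criterion for a QOS of
0.95, so positive slack moves the scaling factor from 1.0 down to 0.95.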
3.4 Evaluation of the developed DVFS based schemes
We perform both analytical and simulation based analysis to quantify how our GOPS and LOPS
algorithms lead to increased lifetime expectations. The performance of the GOPS algorithms in
reducing peak temperatures and thermal gradients on the chip is evaluated first. The runtime
of the GOPS algorithms in terms of the number of iterations is also studied. Such an analysis
is essential since GOPS also needs to execute frequently when there is no single set order in
which tasks arrive in the task queue. This is followed by an analysis of how the LOPS
algorithm caters to the QoS requirements.
For our analysis based on the analytical PPR model, a large number of task sets are
synthetically generated. Results obtained over 10000 task sets are averaged to smooth out
isolated deviant behavior. For the simulation based studies, a simulation framework
consisting of the Simplescalar simulator, Wattch, and Hotspot is considered. A set of SPEC
2000 benchmarks is used for analysis. The execution profile of each benchmark is divided into
blocks of 10 million instructions each, and configurations are chosen for these intervals. We
observed that temperature swings within each block are low (1-4 degrees). The details of our
evaluation are presented in the next few subsections.
3.4.1 Experimentation with synthetic task sets
The first set of experiments we performed is intended to test the effectiveness of our GOPS
algorithms. We synthetically generated 10000 task sets with µ values ranging between 0.1 and
1. Task sets of 8, 16, 32, and 64 different periodic tasks are investigated. A ρ value of
0.25 is assumed. This is on par with current predictions in the semiconductor industry. Eight
different (V, f) combinations are assumed, given by (V, f) ∈ {(1.2, 300), (1.23, 400), (1.35,
500), (1.53, 600), (1.75, 700), (2.0, 800), (2.35, 900), (2.80, 1000)}, where V values are in
volts and f values are in MHz. These voltage and frequency combinations are taken from a real
world processor [105]. The tasks' steady state temperatures (SSTs) at the OP with maximum
voltage and frequency are selected in the interval [310 K, 390 K], linearly increasing with
respect to µ: the higher the value of µ, the higher the SST selected. Similarly, the power
consumed at the nominal voltage and frequency for the different tasks is selected in the
range [5 W, 10 W] based on µ. Once the task parameters at the nominal OPs are fixed, their
corresponding values for the other supported operating modes are calculated using the model
detailed in Section 3.2.2. The task execution times at the nominal voltage and frequency are
assumed to lie in the range [120 s, 240 s]. The (percentage of) performance degradation that
is accepted to improve reliability is modeled as the task stretch factor. Task stretch factors
ranging between 5% and 45% are considered in increasing steps of 5%. These stretch factors
create static slack in the schedule, which is utilized by our GOPS algorithms for OP
selection. The different evaluation parameters employed are listed in Table 3.2. The
performance of the different GOPS algorithms is analyzed at these different task stretch
factors.
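A synthetic task set with the parameter ranges described above could be generated along the
following lines. This is a sketch under explicit assumptions, not the generator used for the
reported results: µ and the nominal execution time are drawn uniformly, the nominal power is
assumed to grow linearly with µ (the text only states that it is selected based on µ), and
the function and field names are hypothetical.

```python
import random


def make_task_set(n, rng):
    """Generate n synthetic tasks with nominal-OP parameters in the stated ranges."""
    tasks = []
    for _ in range(n):
        mu = rng.uniform(0.1, 1.0)              # CPU intensity in [0.1, 1]
        w = (mu - 0.1) / 0.9                    # position of mu within its range
        tasks.append({
            "mu": mu,
            "sst_nom": 310.0 + 80.0 * w,        # K, linear in mu over [310, 390]
            "p_nom": 5.0 + 5.0 * w,             # W, assumed linear in mu over [5, 10]
            "t_nom": rng.uniform(120.0, 240.0), # s, nominal execution time
        })
    return tasks


if __name__ == "__main__":
    rng = random.Random(0)                      # fixed seed for repeatability
    for task in make_task_set(8, rng):
        print(task)
```

Passing a seeded random.Random instance keeps the generated task sets reproducible across
runs.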
Parameter           | Value
µ                   | [0.1, 1]
ρ                   | 0.25
(V, f)              | {(1.2 V, 300 MHz), (1.23 V, 400 MHz), (1.35 V, 500 MHz), (1.53 V, 600 MHz), (1.75 V, 700 MHz), (2.0 V, 800 MHz), (2.35 V, 900 MHz), (2.80 V, 1000 MHz)}
tnom                | [120 s, 240 s]
Pnom                | [5 W, 10 W]
Tnom                | [310 K, 390 K]
Tthresh             | 353 K
Twindow             | 10 K
Task stretch factor | 5% - 45%, in steps of 5%
Table 3.2 Evaluation parameters used for analyzing effectiveness of GOPS algorithms

Reduction in thermal gradients Figure 3.3 shows the strength of the GOPS algorithms in
minimizing the inter-task SST differences when the number of tasks in the schedule is 8
(Fig. 3.3 (a)), 16 (Fig. 3.3 (b)), 32 (Fig. 3.3 (c)), or 64 (Fig. 3.3 (d)). The x-axis in the
figure indicates the task stretch factors and the y-axis shows the sum of absolute
differences between the SSTs of adjacent tasks in the schedule (SumTdiff). The max scheme
corresponds to operating all tasks at maximum (V, f).
Figure 3.3 shows three trends.
1. pkr does a slightly better job than WOPS in reducing thermal gradients. This trend is
expected since pkr chooses the final OPs from a larger pool of candidate OCs when compared to
WOPS. It is observed that pkr outperforms WOPS when the task stretch factor is between 15%
and 35%. The two algorithms perform similarly when either very little or very high
performance degradation is allowed.
2. As the task stretch factor increases, the GOPS algorithms can do a better job of reducing
temperature gradients. This is because the algorithms have at their disposal a larger number
of OCs that satisfy the task deadlines.
3. Both the pkr and WOPS algorithms fare well even with a larger number of tasks in the
schedule. Even though it becomes harder to find OPs with closely matching SSTs as the number
of tasks increases, this effect is not very pronounced.
Figure 3.3 Effectiveness of GOPS algorithms in reducing inter-task temperature gradients
Energy savings The energy savings produced by the different GOPS schemes for the task set
sizes a) 8, b) 16, c) 32, and d) 64 are shown in Figure 3.4. The x-axis in the figure
indicates the task stretch factors and the y-axis shows the energy savings as a percentage of
the energy consumed using nominal voltages and frequencies.
It is observed that pkr outperforms WOPS in terms of reducing energy when the number of tasks
is small. As this number increases, WOPS catches up with and even surpasses pkr. pkr results
in very low energy savings when the number of tasks in the schedule is high and the task
stretch factor is low. Note that pkr reduces thermal gradients more effectively than WOPS for
all task set sizes and stretch factors. Hence, it can be concluded that pkr trades off energy
reduction for thermal balance. The energy savings generally decrease slowly as the number of
tasks increases, because OPs that do not reduce energy the most are selected into the final
OC in order to maintain thermal balance. However, this degradation in energy savings gets
less pronounced as higher performance degradation is accepted. It is observed that the GOPS
algorithms provide up to about 55% energy savings when the task stretch factor is set at 45%.
Figure 3.4 Effectiveness of GOPS algorithms in providing energy savings
Algorithm iterations Though the theoretical maximum number of iterations in Pkr is high
(m x N + (N-1)), our experiments revealed that the actual value is much smaller than this
bound. In particular, the average number of iterations for Pkr is observed to scale linearly
with the number of tasks. On the other hand, the average number of iterations for WOPS is
always bounded by the number of operating modes and is typically less than 4. As the task
stretch factor increases, the GOPS algorithms can consider a larger number of candidate OCs
which can satisfy all task deadlines. This results in a very slight increase in the average
number of iterations. Figure 3.5 shows how the number of algorithmic iterations scales with
the performance sacrifice and the number of tasks considered. The number of algorithmic
iterations for a) 8, b) 16, c) 32, and d) 64 task set sizes is shown along the y-axis and the
task stretch factor as a percentage value is shown on the x-axis.
Figure 3.5 Scaling of the number of algorithmic iterations of GOPS algorithms with task set
size and task stretch factor
QoS satisfaction To demonstrate how our algorithms perform with respect to meeting the QOS
constraints, we have scheduled 32 tasks separately for different QOS constraints ranging
between 0.91 and 0.99. The peak reduction algorithm is used for GOPS. Each task window is
constrained to the size of 1024 tasks. It has been verified that our LOPS scheme meets the
QOS requirements specified. We also observed that as the QOS constraint is tightened, the
scheduler utilizes the positive runtime slack more conservatively, resulting in lower energy
savings. However, this difference in savings is marginal (around 5%).
QOS Constraint (as %) | QOS Delivered (as %) | Avg. Energy Savings (as %)
91                    | 92.13                | 51.75
93                    | 93.85                | 50.17
95                    | 95.54                | 48.86
97                    | 97.92                | 46.47
99                    | 99.99                | 46.76
Table 3.3 QOS satisfaction and Energy savings
[Figure 3.6: Simplescalar (cycle accurate simulator) takes the application and its inputs and
produces performance data and cycle by cycle access characteristics. Wattch (power modeling
tool) combines these with technology dependent power consumption characteristics to produce
power data and a power profile. Hotspot (temperature modeling tool) combines the power
profile with the chip floorplan to produce temperature data.]
Figure 3.6 Simulation framework
3.4.2 Simulation based performance reliability tradeoff analysis
Thus far, we have reported the effectiveness of our GOPS and LOPS algorithms in reducing
thermal gradients and providing energy savings. The performance, energy, and temperature
characteristics for the different tasks are obtained through an analytical model. We now obtain
these characteristics through cycle accurate simulations. Since such simulations are costly, we
consider only a small task set and provide insight into the reliability improvement provided by
our schemes.
3.4.2.1 Simulation framework
For our simulations, we have used the Simplescalar cycle accurate simulator. This simulator
can be used to model the execution of applications on an Alpha EV6-like processor.
Simplescalar provides different simulation engines with varying degrees of detail and
accuracy. In particular, we use the sim-outorder engine, which models out-of-order execution
on a superscalar processor. sim-outorder is execution driven, making it very accurate for
obtaining performance data. Simplescalar is integrated with Wattch, a cycle accurate power
modeling tool, to obtain cycle-by-cycle access characteristics of the different units on the
chip floorplan. Wattch is driven by a parameterized power model that estimates the power
consumed each cycle based on the aforementioned access characteristics and technology
dependent circuit load parameters. To obtain temperature data, we use the temperature
modeling tool called Hotspot. With the knowledge of the microprocessor floorplan, Hotspot
decomposes the logic circuits into RC networks. Heat sources are modeled as current sources
in the RC network, with temperature playing the role of voltage. The power consumption
profiles are provided by Wattch to Hotspot. Figure 3.6 shows the interfaces and data flow
between Simplescalar, Wattch, and Hotspot. The floorplan assumed by Wattch is slightly
different from the floorplan used by Hotspot. As such, modifications are made to Wattch to
produce power profiles amenable to Hotspot. These modifications were made by Prem Kumar
Ramesh, a colleague in my research group. The baseline parameters used for simulation are
shown in Table 3.4.

Parameter                              | Value
Fetch, decode, issue, and commit width | 4
Functional units                       | 4 INT (and FP) ALUs, 1 INT (and FP) MUL/DIV
L1-D cache                             | 2 KB, 4 way
L1-I cache                             | 2 KB, 1 way
Unified L2 cache                       | 32 KB, 4 way
Technology node                        | 45 nm
Voltage                                | 1.25 V, 1.15 V, 1.05 V
Frequency                              | 2536 MHz, 2475 MHz, 2402 MHz
Table 3.4 Simulation parameters used for performance-reliability tradeoff analysis
3.4.2.2 Simulation workloads
A set of 8 SPEC 2000 benchmarks were chosen for experimentation. These benchmarks have been
widely used for analysis in past research. The benchmarks used are listed in Table 3.5 [23].
The inputs for the benchmarks were obtained from the Simplescalar website [69].
3.4.2.3 Reliability modeling
Reducing thermal cycling and chip temperatures improves the lifetime of the chip. To quantify
this effect of our schemes, we utilize the chip Mean Time To Failure (MTTF), a metric that is
widely used to quantify chip reliability [91, 93, 96]. MTTF is defined as the expected time
to failure of a non-repairable component. Accordingly, the failure mechanisms investigated
should make the chip non-functional and non-repairable. The following failure mechanisms are
analyzed: electromigration, stress migration, large and small scale thermal cycles, time
dependent dielectric breakdown, and negative bias temperature instability. Although there are
many other failure mechanisms, we have restricted ourselves to the most investigated ones,
due to the availability of near-accurate analytical models to predict the associated MTTF.
For each failure mechanism, we calculate the ratio of the MTTF when the processor executes
tasks with the OPs selected using our scheme to the MTTF obtained when operating at nominal
voltage and frequency. In the latter case, the tasks finish faster, and the processor enters
a low power mode dictated by the lowest possible voltage and frequency, which is accounted
for. Since operating each task at a particular OP results in a different MTTF value, we use
the weighted harmonic mean to calculate the average MTTF value. The details of the
investigated failure mechanisms and our experiments are provided below. Although the
temperature profiles on a chip are continuous in nature, we believe that the discrete
temperature modeling detailed below gives a good first order estimate of the lifetime
improvement.

Benchmark | Description
gzip      | File compression
mcf       | Combinatorial optimization
perlbmk   | PERL programming language
vpr       | FPGA placement and routing
gcc       | C compiler
eon       | Computer visualization
bzip2     | Compression
gap       | Group theory, interpreter
Table 3.5 SPEC workloads used for simulations
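The weighted harmonic mean mentioned above can be computed as in the following sketch. The
function name is hypothetical, and the choice of weights (for instance, the time spent at each
OP) is an assumption, since the text does not state which weights are used.

```python
def weighted_harmonic_mean(mttfs, weights):
    """Weighted harmonic mean: sum(w) / sum(w / x) over the per-OP MTTF values x."""
    if len(mttfs) != len(weights) or not mttfs:
        raise ValueError("mttfs and weights must be non-empty and of equal length")
    return sum(weights) / sum(w / m for m, w in zip(mttfs, weights))


if __name__ == "__main__":
    # Two OPs with relative MTTFs 2.0 and 4.0, each used for the same amount of time.
    print(weighted_harmonic_mean([2.0, 4.0], [1.0, 1.0]))
```

The harmonic form reflects that failure rates (the reciprocals of MTTF) are what accumulate
over the time spent at each OP.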
3.4.2.4 Chip Failure Modeling
Electromigration The atoms in interconnects are gradually displaced due to momentum transfer
from the conducting electrons. Because of this, the atoms get shifted within the interconnect
and lead to higher resistance values and, possibly, shorts. The MTTF due to electromigration
is given by
$MTTF_{EM} \propto J^{-n} e^{E_{aEM}/KT}$ (3.12)
where J is the current density in the interconnect, n is a material dependent constant, EaEM
is the activation energy for electromigration, K is the Boltzmann constant, and T is the
steady state operating temperature. The value of J is directly proportional to the operating
voltage and frequency.
Stress Migration Due to differential thermal coefficients in the interconnect material, a
thermo-mechanical stress is generated when the interconnect heats, leading to migration of
atoms. It results in open circuits and high resistance values within interconnects. The MTTF
due to stress migration is given by
$MTTF_{SM} \propto |T_0 - T|^{-m} e^{E_{aSM}/KT}$ (3.13)
where T0 is the metal deposition temperature, T is the steady state operating temperature, m
is a material dependent constant, EaSM is the activation energy for stress migration, and K
is the Boltzmann constant.
Thermal Cycling We have mentioned both large and small scale thermal cycling earlier. The
MTTF due to large scale thermal cycling is given by
$MTTF_{LTC} \propto \left( \frac{1}{T - T_{ambient}} \right)^{q}$ (3.14)
where T is the steady state operating temperature, Tambient is the ambient temperature, and
q is the Coffin-Manson exponent, which is a measure of the effect of thermo-mechanical
stress. Small scale thermal cycles cause solder joint failures due to uneven expansion and
contraction. We model the MTTF for small scale thermal cycles based on the MTTF for the
solder joints. This is given by
$MTTF_{STC} \propto |T_1 - T_2|^{n} e^{E_a/KT_{max}}$ (3.15)
where T1 and T2 are the steady state temperatures of two tasks, Tmax is the maximum of T1 and
T2, n is the Coffin-Manson based exponent, Ea is the activation energy, and K is the
Boltzmann constant.
Failure mechanism | Parameters
Electromigration  | n = 1.1, EaEM = 0.9 eV
Stress migration  | m = 2.5, EaSM = 0.9 eV
LTC               | q = 2.35
STC               | n = −1.9, Ea/K = 1414
NBTI              | A = 1.6328, B = 0.07377, C = 0.01, D = 0.06852, β = 0.3
Table 3.6 MTTF model parameters
Negative Bias Temperature Instability The negative bias applied to the gate re-
sults in gradual increase in the threshold voltage and associated decrease in drain current and
transconductance. The MTTF for negative bias temperature instability is given by
\[ \mathrm{MTTF}_{NBTI} \propto \left\{ \left[ \ln(E) - \ln(E - C) \right] \cdot \frac{T}{e^{-D/KT}} \right\}^{1/\beta} \tag{3.16} \]
\[ E = \frac{A}{1 + 2e^{B/KT}} \tag{3.17} \]
where A, B, C, D and β are curve fitting parameters, K is the Boltzmann constant and T is
the steady state temperature.
We have used the fitting parameters employed in the RAMP model [93] to calculate the MTTF
values. The parameters used to calculate the MTTF for small scale thermal cycles are
derived from the model used in [99]. The values of all these parameters are shown in Table 3.6.
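Since Equations 3.12 to 3.14 are proportionalities, they are most conveniently used as ratios, where the unknown constants cancel. The following Python sketch evaluates the relative MTTF at two steady state temperatures with the Table 3.6 parameters. The function names, the Boltzmann constant expressed in eV/K, the treatment of both activation energies as electron-volts, the 500 K metal deposition temperature, and the 300 K ambient temperature are illustrative assumptions and are not values stated in this chapter.

```python
import math

K_EV = 8.617e-5  # Boltzmann constant in eV/K (assumed unit system)


def mttf_em(temp_k, j_rel=1.0, n=1.1, ea=0.9):
    """Relative electromigration MTTF, Equation 3.12 (constant factor dropped)."""
    return j_rel ** (-n) * math.exp(ea / (K_EV * temp_k))


def mttf_sm(temp_k, t0_k=500.0, m=2.5, ea=0.9):
    """Relative stress migration MTTF, Equation 3.13. t0_k is an assumed value."""
    return abs(t0_k - temp_k) ** (-m) * math.exp(ea / (K_EV * temp_k))


def mttf_ltc(temp_k, t_ambient_k=300.0, q=2.35):
    """Relative large scale thermal cycling MTTF, Equation 3.14."""
    return (1.0 / (temp_k - t_ambient_k)) ** q


def mttf_gain(model, t_ref_k, t_new_k, **params):
    """Ratio MTTF(t_new) / MTTF(t_ref); values above 1 mean a longer lifetime."""
    return model(t_new_k, **params) / model(t_ref_k, **params)


if __name__ == "__main__":
    for model in (mttf_em, mttf_sm, mttf_ltc):
        print(model.__name__, mttf_gain(model, 360.0, 350.0))
```

Working with ratios keeps the sketch independent of the proportionality constants, which is also how the MTTF improvements are reported in the remainder of this chapter.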
3.4.2.5 Performance reliability tradeoff
In this section, the improvement in expected chip lifetime when stretching tasks to fit
the acceptable task stretch factor is presented. As the tasks are stretched by decreasing the
operating voltage and frequency, chip power consumption goes down. Accordingly, the power
density and temperature decrease, leading to longer lifetime expectation. Figure 3.7 shows how
the increase in MTTF scales with the task stretch factor.
It can be seen that the MTTF increase occurs in steps. This is expected since the
available performance and temperature points are discrete in nature. The highest increase is
observed corresponding to electromigration. Both peak reduction and window based selection
lead to similar MTTF improvements at significant task stretch factors. Since only a small set of
Figure 3.7 MTTF increase using DVFS using (a) Window based selection, and (b) Peak
reduction
voltages and frequencies are available, the improvement in MTTF also tapers off with increase
in task stretch factor. As the performance traded off for reliability slowly increases, we observe
that window based selection exploits the performance sacrifice first. Peak reduction does not
result in any reliability improvement until the task stretch factor increases over 5%. This is a
consequence of the inherent differences in the way these two schemes select OPs. In particular,
peak reduction uses the lowest temperature point in each iteration as the target temperature
for next iteration. If the newly selected chain does not satisfy performance constraint, the
algorithm reverts back to selecting the nominal OPs. Window based selection always considers
a candidate OP for the first task for deciding the target temperature.
It should be noted that a larger increase in lifetime expectation is hindered by the coarse-grained
performance-temperature points provided by the different voltage and frequency settings. To
further improve the MTTF value, more operating points as well as even lower power operating
points are needed. There are two architectural alternatives available in this regard. Firstly,
more voltage and frequency settings can be provided. Secondly, the aggressiveness of a few
architectural components can be adjusted (microarchitectural adaptation). In this dissertation,
both DVFS and microarchitectural adaptation are considered together. The motivation behind
this synergistic strategy is explained in the next subsection.
Mode id Voltage (Volts) Frequency (GHz)
1 1.484 1.6
2 1.420 1.4
3 1.276 1.2
4 1.164 1.0
5 1.036 0.8
6 0.956 0.6
Table 3.7 Operating voltages and frequencies for Intel Pentium M processor
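As a rough, first-order illustration of what each mode in Table 3.7 trades, the sketch below applies the standard CMOS approximation that dynamic power scales with V^2 f, and assumes that execution time scales inversely with frequency. Both are simplifying assumptions (leakage power and memory-bound behavior are ignored) and they are not the simulation-based models used in this dissertation. The names `MODES`, `relative_dynamic_power`, and `slowdown` are hypothetical.

```python
# (voltage in volts, frequency in GHz) for each mode id, copied from Table 3.7
MODES = {
    1: (1.484, 1.6),
    2: (1.420, 1.4),
    3: (1.276, 1.2),
    4: (1.164, 1.0),
    5: (1.036, 0.8),
    6: (0.956, 0.6),
}


def relative_dynamic_power(mode, ref=1):
    """Dynamic power of `mode` relative to `ref`, assuming P is proportional to V^2 * f."""
    v, f = MODES[mode]
    v_ref, f_ref = MODES[ref]
    return (v / v_ref) ** 2 * (f / f_ref)


def slowdown(mode, ref=1):
    """Execution time of `mode` relative to `ref`, assuming time is proportional to 1/f."""
    return MODES[ref][1] / MODES[mode][1]


if __name__ == "__main__":
    for mode in sorted(MODES):
        print(mode, round(relative_dynamic_power(mode), 3), round(slowdown(mode), 3))
```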
3.5 Performance reliability tradeoff using DVFS and microarchitectural
adaptation
3.5.1 Need for considering DVFS and microarchitectural adaptation together
Consider the results we obtained through interval simulations shown in Figures 3.8 and 3.9,
which depict the impact of (a) L1 instruction cache (IL1) associativity (assoc) and (b) operating
voltage and frequency (together referred to as VF) on the normalized performance (Figure
3.8) and power consumption (Figure 3.9) for the SPEC 2006 benchmarks astar, xalancbmk,
tonto, and milc executing on an adaptive Intel Nehalem processor [49]. The execution is simulated
using the Sniper simulation platform [15], and the cache size per associative way is 4 KB.
More details on the simulation platform are provided in Chapter 4. The adaptive processor
is assumed to support three levels of cache associativity: 2, 4, and 8. Similarly, the processor
supports 6 different VF pairs, shown in Table 3.7. The values listed in the table are taken
from the datasheet for an Intel M processor [48] based on Nehalem microarchitecture. From
the data obtained through simulations, we observe that decreasing the IL1 assoc from 8 to
2 does not affect normalized performance significantly for astar and milc. In contrast, a
significant impact is noticed for tonto and xalancbmk. It must be noted that the core voltage
and frequency are significant factors driving performance for all of these benchmarks. If a
15% reduction in power is desired, assoc can be lowered to 2 for astar and milc, conserving
15.5% power in both cases. If voltage and frequency are scaled instead of adapting IL1, a
higher performance loss is incurred to obtain this power reduction (25% in both cases). In case
of xalancbmk and tonto, trading off performance for power reduction by adapting IL1 assoc
Figure 3.8 Normalized performance vs. (a) IL1-Assoc. (b) Operating VF for selected SPEC
benchmarks
Figure 3.9 Normalized power vs. (a) IL1-Assoc. (b) Operating VF for selected SPEC bench-
marks
results in a larger performance loss (32% and 12.5% respectively) when compared to operating
in mode 2 to satisfy power constraint (21% and 9% performance loss respectively). From the
above discussion, it is clear that the effectiveness of trading off performance for power
consumption, and consequently reliability, through DVFS or microarchitectural adaptation is
application dependent. Hence, a unified scheme that considers both DVFS and microarchitec-
tural adaptation together can lead to a better tradeoff.
3.5.2 Selection of adaptive microarchitectural components
A large number of microarchitectural components can be adapted [31]. In this dissertation, a
small subset of those components is used for performance reliability adaptation. Essentially,
the components contributing largely to overall chip power consumption are chosen. Figure
[Figure 3.10 plot: x-axis "Benchmark" (bzip2, crafty, eon, gap, gcc, gzip, mcf, parser, perlbmk, vortex, vpr, wupwise, swim, sixtrack, mgrid, mesa, lucas, fma3d, facerec, equake, art, apsi, ammp, avg.); y-axis "Power (W)", 0 to 40; legend: L2 cache, ITB, LdStQ, FPQ, IntExec, IntReg, IntQ, IntMap, FPMap, FPMul, FPReg, FPAdd, DTB]
Figure 3.10 Power consumption breakdown among different units on Alpha EV6 floorplan
Table 3.8 Adaptive hardware configurations
Component                   Considered configurations
L1 D-cache associativity    {1, 2, 4}
Int Exec.                   {1, 2, 4}
FP Add                      {1, 2, 4}
(V in volts, f in MHz)      {(1.25, 2536), (1.15, 2475), (1.05, 2420)}
3.10 shows the relative power consumption of different units on the floorplan of an Alpha EV6
processor when executing SPEC 2000 benchmarks. In the figure, the x-axis denotes the different
benchmarks considered and y-axis denotes the power consumption in watts. From the figure, it
is clear that the L1 data cache, Integer ALU, and the FP ADD unit are the three most power
hungry components. As such, these components are chosen for adaptation along with DVFS.
The configurations that are considered for adaptation are listed in Table 3.8. Also, different
components affect the performance-power balance differently. Hence, it is not guaranteed that
higher performance levels necessarily translate to higher chip temperatures. This insight is
used to remove, on a per-application basis, inefficient OPs that result in lower performance and
higher temperature when compared to another valid configuration for the same application.
The details of this pruning strategy are presented in Chapter 4.
Figure 3.11 Expected MTTF improvement through the combined use of DVFS and microar-
chitectural adaptation with (a) Window based selection, and (b) Peak reduction
3.5.3 Performance reliability tradeoff considering both DVFS and microarchitec-
tural adaptation
Figure 3.11 shows how a combination of DVFS and microarchitectural adaptation can
further improve lifetime expectations when the WOPS algorithm is utilized. In the figure,
the x-axis denotes the task stretch factors considered and the y-axis denotes the MTTF values.
The values along the y-axis are the ratio of the MTTF improvement observed with DVFS
plus microarchitectural adaptation to that observed with DVFS alone.
The MTTF improvement in this case is the highest for short term thermal cycling behavior.
Since a large number of temperature points are available for each task to choose from, it becomes
easier to reduce thermal gradients as well. It is also observed that the increase in MTTF ratio
does not monotonically scale with performance degradation. For small task stretch factors,
window based selection even results in lower MTTF when compared to utilizing just DVFS.
Some OPs that are utilized in the latter case are now eliminated due to their inferiority. As
window based selection tries to restrict all OPs within a window, unavailability of a few such
configurations leads to choosing of higher temperature OPs. It is also observed that the actual
MTTF increases as the task stretch factor increases.
3.6 Conclusion
In this chapter, the microprocessor performance reliability tradeoff is addressed for a
real-time task execution environment. A two stage methodology is developed for selecting good
hardware operating modes to improve reliability when slack is available in the task schedule.
This methodology leverages knowledge of hardware-software interaction
characteristics, which are obtained separately through analytical modeling and cycle
accurate simulations. Both DVFS and microarchitectural adaptation are utilized to provide
different operating modes on a microprocessor. The results obtained through experimentation
indicate that the developed schemes are effective in decreasing chip temperatures and
thermal gradients. The thermal gradients for a task schedule with 32 tasks are reduced by as
much as 80% when the performance degradation accepted is 45%. When 10% performance
degradation is allowed, a combination of DVFS and microarchitectural adaptation led to a
2.5- to 15-fold increase in expected chip MTTF values corresponding to different failure
mechanisms, and a 41-fold increase in chip MTTF values corresponding to short term thermal cycling.
CHAPTER 4. ADAPTIVE MICROARCHITECTURAL
CONFIGURATION SPACE PRUNING
This chapter details the methodology developed for adaptive configuration space pruning.
The adaptive configuration space is introduced formally and a three stage approach designed for
pruning is detailed. The effectiveness of these schemes in retaining the most relevant/beneficial
configurations is analyzed. This configuration space reduction enables the further development
of a static cum dynamic framework for performance-power tradeoff based on profiling/simula-
tion data. The actual methodology for performance-power tradeoff follows in Chapter 5.
4.1 Introduction
Let there be K adaptive hardware components or control knobs (CK). Let the ith component
be configurable in $w_i$ ways, represented by $C_i = \{c_{i1}, c_{i2}, \ldots, c_{iw_i}\}$, where $C_i$ is totally ordered
under <. Assuming that the configuration choices for different components are independent,
the total number of possible architectural configurations is given by $T = \prod_{i=1}^{K} w_i$. The set of
all possible configurations makes the configuration set/space (S). The jth configuration in S is
represented as
\[ s_j = \{c_{1a_1}, c_{2a_2}, \ldots, c_{Ka_K}\} \mid \forall\, 1 \le a_i \le w_i,\; c_{ia_i} \in C_i \tag{4.1} \]
For an application, let the normalized performance delivered (Pnorm) and the normalized av-
erage power consumed (Wnorm) by configuration sj be denoted by Pj and Wj , respectively.
Performance and power values are normalized with respect to the corresponding values ob-
tained with the maximal or non-adapted configuration.
The large value of T is a major hindrance to designing lightweight microarchitectural adaptation
methods. This problem has been identified in previous research (e.g., [31, 65]). For example, the
authors in [31] identify 627 billion configurations arising from adapting 14 different hardware
components simultaneously. To counter complexity, peephole optimizations are applied to win-
dows of instructions. The solutions obtained are impressive, yet suboptimal when compared
to an ideal scheme that considers the entire application execution profile holistically. A similar
condition is also encountered by authors in [61], where the effects of adapting the different
components are considered individually, rather than in a holistic manner. In order to enable
the development of an adaptation scheme that overcomes these limitations, reduction of T is
necessary. Reduction of T also motivates the design of adaptive microarchitectures.
To make T small, a reduction in K and in all $w_i$ is required. Since the initially considered
configuration space is very large, a comprehensive analysis of such a space is expensive. However,
a detailed analysis leads to the selection of the most beneficial configurations. We develop
a three step pruning methodology to achieve the configuration space pruning. Following these
steps, the configuration space considered reduces progressively while the analysis complexity
increases. This enables us to reap the benefits of comprehensive analysis within a short
time. Each of these steps can be viewed as a transformation of S to S′ using a specific
criterion, such that |S′| ≤ |S|. Further, reducing K also reduces the hardware complexity in
provisioning the associated controls. The three steps, detailed in the following sections, are
as follows.
1. Selection of advantageous control knobs (SACK): This step targets the reduction of K
and all $w_i$ for a given microarchitecture. The recommendations following this step can
be used by chip designers to provide the appropriate adaptive controls.
2. Elimination of inferior configurations (ELIC): This step eliminates configurations that
deliver lower performance yet consume more power than some other configuration
in S. This step chooses the most beneficial configurations to use for adaptation while
executing a particular application.
3. Configuration set selection for runtime (CSSR): This step brings down T to any desired
number. The target size |S′| can be decided on the basis of expected adaptation
granularity and complexity.
4.2 Selection of Advantageous Control Knobs (SACK )
Let Smax (or simply max) represent the maximal processor configuration, which consumes
the highest amount of power Wmax while most probably delivering the highest performance
Pmax. Similarly, let Sminj represent the processor configuration which exactly matches Smax
with the exception of configuring the adaptive component j minimally. Let this configuration
deliver the performance Pminj while consuming power Wminj. In this step, an adaptive component
is eliminated if its adaptation cannot trade off at least 10% of overall power consumption.
This value is chosen so as to avoid unnecessary control and hardware overhead associated with
adapting the component while reaping insignificant benefits. The pruning criterion can be
represented as
\[ \{ s_i \notin S' \;\; \forall\, c_{jw_j} \notin s_i \} \Leftrightarrow tp_j < 0.1 \tag{4.2} \]
where $tp_j$ is the tradeoff potential for the adaptive component j, computed on normalized power values and given by
\[ tp_j = W_{max} - W_{min_j} \tag{4.3} \]
We initially consider a processor with the adaptive units and available adaptive levels shown
in Table 4.1. The adaptive configuration space contains 1,728 configurations. The maximum
adaptive level mentioned for each component is based on an Intel Nehalem family processor [49].
The minimum levels are limited by technology for all controls except for instruction window
size, where performance implications led to the limit. For the case of caches, the total cache size
scales up proportionately with the number of associativity levels. The selected components
have been adapted in previous research and the hardware complexity involved is shown to be
acceptable ([13, 55, 82]).
We first observed that reducing cache associativity from 2 to 1 does not result in significant
reduction in power, but has an adverse effect on performance. Hence, we restrict the minimum
associativity to 2. This leads to T = 729 configurations.
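The configuration counts quoted above follow directly from the product of the number of levels per control knob. The short Python sketch below reproduces them and enumerates the space using the levels listed in Table 4.1. The dictionary layout and the names `LEVELS`, `space_size`, and `enumerate_space` are hypothetical and serve only as an illustration.

```python
from itertools import product
from math import prod

# Adaptive levels per control knob, copied from Table 4.1
LEVELS = {
    "DW": [4, 2, 1],
    "IW": [128, 64, 32],
    "IL1": [8, 4, 2, 1],
    "DL1": [8, 4, 2, 1],
    "L2": [8, 4, 2, 1],
    "VF": [(1.484, 1.6), (1.228, 1.2), (1.036, 0.8)],
}
CACHE_KNOBS = ("IL1", "DL1", "L2")


def space_size(levels):
    """T = product of w_i over all knobs."""
    return prod(len(choices) for choices in levels.values())


def enumerate_space(levels):
    """Return every configuration s_j as a dict {knob: chosen level}."""
    names = list(levels)
    return [dict(zip(names, combo)) for combo in product(*levels.values())]


# Restrict the minimum cache associativity to 2, as described in the text
restricted = {
    knob: [lvl for lvl in choices if knob not in CACHE_KNOBS or lvl >= 2]
    for knob, choices in LEVELS.items()
}

if __name__ == "__main__":
    print(space_size(LEVELS))       # full space
    print(space_size(restricted))   # after dropping associativity 1
```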
We explored the merit of adapting the six components individually by simulating bench-
marks from the SPEC 2006 suite [39] on an Intel Nehalem processor using the Sniper simulation
platform [15]. All the simulation analysis performed from this point is based on observations
id CK name Adaptive configuration levels
1 Dispatch width (DW) 4, 2, 1
2 Instruction window size (IW) 128, 64, 32
3 L1 Instruction cache associativity (IL1) 8, 4, 2, 1
4 L1 Data cache associativity (DL1) 8, 4, 2, 1
5 L2 cache associativity (L2) 8, 4, 2, 1
6 (Voltage (V), Frequency (GHz)) (VF) (1.484, 1.6), (1.228, 1.2), (1.036, 0.8)
Table 4.1 Considered adaptive components and adaptations
that we made for these benchmarks. Sniper is a high-speed and accurate x86 simulator that
employs an accurate mechanistic analytical model which drives the timing simulation of an
individual core. A branch predictor, memory hierarchy, cache coherence and interconnection
network simulators built into Sniper determine miss events. The analytical model derives the
timing for each interval between successive miss events. The cooperation between the analytical
model and the miss event simulators enables the accurate modeling of process execution. The
simulator is also integrated with McPAT [68] to produce accurate power estimations.
4.2.1 Observations
For each benchmark, execution under seven configurations, represented by {Smax, Smin1, ...,
Smin6}, is simulated. The resulting performance and power consumption values are recorded
and normalized with respect to the corresponding values obtained under Smax. Table 4.2
shows the maximum tradeoff potential (tp) values for the different components considered for
adaptation. We observed that voltage and frequency scaling control followed by dispatch width
have the highest tradeoff potential. Further, the components DL1 and L2 fit the pruning
criterion mentioned in Equation 4.2, and are thus eliminated. This reduces K by two and
brings down T to 3^4 = 81 configurations. This is a significant reduction. Altering IW provides
very fine variations in performance and power. This component is retained along with DW,
IL1, and VF to provide fine-grained control.
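The SACK decision described above can be sketched as a simple threshold filter over the tradeoff potentials. The tp values below are copied from Table 4.2, and each surviving knob is assumed to keep three adaptive levels, as in the text. The names `TP` and `sack` are hypothetical.

```python
# Maximum tradeoff potential per control knob, copied from Table 4.2
TP = {"DW": 0.31, "IW": 0.12, "IL1": 0.22, "DL1": 0.03, "L2": 0.04, "VF": 0.6}
LEVELS_PER_KNOB = 3  # each retained knob keeps three adaptive levels


def sack(tp, threshold=0.1):
    """Keep only the knobs whose tradeoff potential reaches the threshold (Equation 4.2)."""
    return [knob for knob, value in tp.items() if value >= threshold]


kept = sack(TP)
remaining = LEVELS_PER_KNOB ** len(kept)

if __name__ == "__main__":
    print(kept, remaining)
```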
To quantitatively observe the tradeoff potential for the entire chosen configuration space,
we tabulate (Table 4.3) the average, minimum, and maximum variations in performance and
id CK name tp
1 Dispatch width (DW) 0.31
2 Instruction window size (IW) 0.12
3 L1 Instruction cache associativity (IL1) 0.22
4 L1 Data cache associativity (DL1) 0.03
5 L2 cache associativity (L2) 0.04
6 (Voltage (V), Frequency (GHz)) (VF) 0.6
Table 4.2 tp for the considered adaptive components
Characteristic   Measured quantity   Variation (%)
Performance      Minimum             47.9
                 Maximum             80.3
                 Average             71.7
Power            Minimum             65
                 Maximum             73.9
                 Average             69.9
Table 4.3 Performance-power variations provided using the chosen configuration space
power consumption that can be provided for the studied set of benchmarks. We observe that
the envisioned configuration space is sufficient to cater to varied user demands.
Provision of adaptive controls in hardware also leads to power consumption overhead. It
also leads to a larger number of failure points in the system. SACK can reduce the ill effects
concerning the above two scenarios. It is true that elimination of certain adaptive controls
can lead to suboptimal tradeoff decisions. However, the resulting simplicity makes a case for
the implementation of such hardware controls. Further, the reduced configuration space can be
analyzed accurately in more detail to counter the sub-optimality of tradeoff. Finally, the SACK
step considers adaptive controls closest to the core presently. The analysis performed can be
extended to larger configuration spaces including off-chip adaptive controls as well. However,
such an analysis is not a part of the current research.
4.3 Elimination of ineffective configurations (ELIC )
Reduction in configuration space in this step is based on a per-application analysis. We make
the following observation to further reduce the configuration space. Adaptation of different
control knobs impacts the performance-power balance differently. As a consequence, it is not
guaranteed that Pi > Pj whenever Wi > Wj for two configurations si and sj. If si consumes
more power than sj yet delivers lower performance, si can be removed from S. Using this
criterion, all ineffective hardware configurations are eliminated. Since the performance and
power values obtained for the various configurations depend on the application under
consideration, this pruning step is performed separately for each benchmark. The pruning
criterion can be represented as
\[ s_i \notin S' \;\text{ if }\; \exists\, s_j \in S \mid \{P_i < P_j\} \wedge \{W_i > W_j\} \tag{4.4} \]
Pruning is done by sorting configurations in decreasing order of P values and then inspecting
the W values on the sorted list. If a configuration consumes more power than its predecessor,
it is removed. For n configurations, the sorting and the inspection processes take O(n log n)
and O(n) time, respectively.
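The sort-and-scan procedure can be sketched in Python as follows. This is a minimal illustration under stated assumptions and not the implementation used in this work: configurations are hypothetical (name, performance, power) tuples, and each configuration is compared against the lowest power seen among all strictly faster configurations rather than only its immediate predecessor, so that the strict inequalities of Equation 4.4 hold even when the predecessor itself was removed or has equal performance.

```python
def elic_prune(configs):
    """Drop every configuration dominated per Equation 4.4.

    configs: list of (name, perf, power) tuples with normalized values.
    Returns the retained tuples in decreasing order of performance.
    """
    ordered = sorted(configs, key=lambda c: c[1], reverse=True)  # O(n log n)
    retained = []
    min_w_faster = float("inf")   # lowest power among strictly faster configs
    group_perf = None             # performance level currently being scanned
    group_min_w = float("inf")    # lowest power within that level
    for name, perf, power in ordered:  # O(n) inspection
        if perf != group_perf:
            # Moving to a lower performance level: fold the previous level in
            min_w_faster = min(min_w_faster, group_min_w)
            group_perf, group_min_w = perf, float("inf")
        group_min_w = min(group_min_w, power)
        if power > min_w_faster:
            continue  # a faster configuration consumes less power
        retained.append((name, perf, power))
    return retained


if __name__ == "__main__":
    sample = [("a", 1.0, 1.0), ("b", 0.9, 0.8), ("c", 0.85, 0.9), ("d", 0.7, 0.6)]
    print([name for name, _, _ in elic_prune(sample)])
```

In the sample, configuration c is removed because b is both faster and less power hungry.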
4.3.1 Observations
We found that the application of the ELIC step reduced the configuration space significantly.
The average number of configurations retained is just 24, while the actual number differs
across benchmarks. The largest number of configurations was retained for libquantum
(38), while the smallest number was retained for astar and gobmk (17 each). The number
of configurations eliminated for each of the benchmarks considered is shown in Figure 4.1. In
the figure, the X-axis shows the different benchmarks and the Y-axis shows the number of
configurations eliminated by this pruning step. One important observation we make is that
although an individual benchmark can benefit only from a few configurations, different applications
benefit from different subsets of configurations. Overall, we noticed that 72 out of the
81 configurations considered are useful for at least one application. Another important observation
is that at high (V, f) or at intermediate (V, f), a low dispatch width does not deliver good
[Figure 4.1 plot: x-axis "Benchmark"; y-axis "configurations pruned by ELIC", 0 to 70; legend: merit based, bound based, neighborhood based]
Figure 4.1 Number of configurations eliminated by ELIC
performance for the power consumed, and the corresponding combinations have either been
eliminated or sparingly retained for all benchmarks.
Further, each knob does require all the adaptive levels considered, and the distribution of
their usage is fairly uniform. Figure 4.2 shows the percentage utilization of different adaptive
configuration levels for each considered adaptive control in the configurations retained after
ELIC step. This figure shows the frequency in percentage (Y-axis) with which each adaptive
configuration for every adaptive component (X-axis) is used in beneficial configurations for the
considered benchmarks. It is observed that the lowest adaptation level is retained with a higher
frequency for the different components other than dispatch width. This can be explained by
a stronger positive correlation between normalized power and performance delivered at lower
power levels, than at intermediate or higher power levels. If one adaptive configuration has to
be removed further, the lowest dispatch width can be targeted for elimination.
4.4 Configuration Set Selection for Runtime (CSSR)
The previous two steps eliminate several less effective CK s and configurations. The goal
in this step is to reduce |S′| to a target number k. For our experiments, we have set the value
of k at 16. Thus far, the configurations eliminated are deemed ineffective. Further pruning
[Figure 4.2 plot: x-axis "Adapted component" (VF, IW, DW, L1I); y-axis "% contribution to pruned space", 0 to 100; legend: Config 1, Config 2, Config 3]
Figure 4.2 Usage frequency of the individual adaptive settings for the considered adaptive
components
Pruning method                 Pruning criterion
Merit based selection          Retain a set of configurations that provide the best performance per watt
Bound based selection          Retain a set of configurations that are spread evenly along the possible performance-power spectrum
Neighborhood based selection   Aggressively prune subsets of configurations that behave similarly
Table 4.4 Different pruning methods for CSSR
would potentially eliminate good configurations to reduce runtime adaptation overhead. Hence,
the criterion used for further pruning has to be sensitive to the tradeoff effectiveness. Three
different pruning criteria are individually considered and the pruning has been implemented
accordingly. The three different pruning methods and criteria are listed in Table 4.4. The
details of these methods follow in the next three subsections.
4.4.1 Merit based selection
The merit based selection pruning methodology selects the k best configurations out of the
configuration space when compared with each other using a metric merit (M).
The merit of a particular configuration is defined by the ratio of how much performance
advantage the configuration provides with respect to the configuration with the lowest perfor-
mance point (Pl) to the additional power it consumes with respect to the configuration with
the lowest power consumption (Wl). The M value for the configuration with Pl and Wl is set
to 1. For every other configuration i, its merit value $M_i$ is calculated as
\[ M_i = \frac{P_i - P_l}{W_i - W_l} \tag{4.5} \]
Merit-based CSSR performs the following actions, in order.
1. The merit M for each configuration is calculated. This step incurs O(n) time complexity.
2. Configurations having the k highest M values are retained in the configuration space.
An approach similar to heap sort is utilized for selection of configurations with high M
values. This step incurs a time complexity of O(k log n).
The algorithm inserts all the configurations into a max-heap with the M value as the
key. This ensures that the configuration with the highest merit value is at the root position.
Merit-based CSSR then iteratively removes the root of the heap, selects the corresponding
configuration, and re-heaps until all the required k configurations are chosen. The number of such
iterations is thus k and the complexity of each iteration is O(log n). The algorithm uses O(n)
extra space for the heap. While merit-based CSSR guarantees the selection of k configurations,
it does not guarantee uniform sampling over the possible range of performance and power
values, which might lead to higher TE when satisfying user demands. The pseudo code for
merit-based selection for CSSR is shown in Algorithm 1.
If all the Pi and Wi values are plotted along two parallel straight lines, and the points on the lines
that correspond to the same configuration si are joined, the selection process can be graphically
viewed as selecting the configurations whose joining lines have the k lowest slope values. This
situation is shown in Figure 4.3, where n = 8, k = 4. In the figure, the performance and power
consumption characteristics of an initial set of configurations are plotted along two parallel
lines. The configurations are represented as si, where i varies between 1 and 8. The merit
values for all the configurations are calculated as per Equation 4.5. Then, the configurations
with the four largest merit values are retained in the configuration space.
[Figure 4.3 plot: performance and power values of the configurations plotted along two parallel lines]
S = {S1 (0.5, 0.45), S2 (0.55, 0.5), S3 (0.65, 0.6), S4 (0.7, 0.65), S5 (0.75, 0.7), S6 (0.8, 0.85), S7 (0.9, 0.9), S8 (1, 1)}
M = {1, 1, 1, 1, 1, 0.75, 0.88, 0.9}
S′ = {S1 (0.5, 0.45), S2 (0.55, 0.5), S3 (0.65, 0.6), S4 (0.7, 0.65)}
Figure 4.3 Example merit based selection for CSSR. n=8 and k=4
Algorithm 1 Merit based selection
for s = s_1 → s_n do
    Merit ← CalcMerit(s)
    HeapInsert(heap, s, Merit)
end for
for i = 1 → k do
    s ← RetrieveHead(heap)
    Select(s)
    ReHeap(heap)
end for
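A runnable counterpart of Algorithm 1 can be sketched with Python's standard `heapq` module. This is an illustrative sketch under assumptions and not the implementation used in this work: each configuration is a hypothetical (name, performance, power) tuple, the pair order of the Figure 4.3 data is taken to be (performance, power) because that order reproduces the listed merit values, merits are rounded to absorb floating point noise, and ties are broken in favor of the configuration listed first. Since `heapq` provides a min-heap, merits are negated to obtain max-heap behavior.

```python
import heapq

# Example data from Figure 4.3, assumed to be (name, performance, power)
FIG_4_3 = [
    ("S1", 0.5, 0.45), ("S2", 0.55, 0.5), ("S3", 0.65, 0.6), ("S4", 0.7, 0.65),
    ("S5", 0.75, 0.7), ("S6", 0.8, 0.85), ("S7", 0.9, 0.9), ("S8", 1.0, 1.0),
]


def merit_select(configs, k):
    """Return the names of the k configurations with the highest merit (Equation 4.5)."""
    p_low = min(perf for _, perf, _ in configs)
    w_low = min(power for _, _, power in configs)
    heap = []
    for order, (name, perf, power) in enumerate(configs):
        if power == w_low:
            merit = 1.0  # lowest-power configuration: merit is defined as 1
        else:
            merit = round((perf - p_low) / (power - w_low), 9)
        heap.append((-merit, order, name))  # negate: heapq is a min-heap
    heapq.heapify(heap)                     # O(n)
    count = min(k, len(heap))
    return [heapq.heappop(heap)[2] for _ in range(count)]  # k pops, O(log n) each


if __name__ == "__main__":
    print(merit_select(FIG_4_3, 4))
```

Building the heap in one `heapify` call keeps the construction linear, which matches the O(n) plus O(k log n) cost discussed above.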
4.4.2 Bound based selection
Since user demands can be diverse in nature, it makes sense to retain configurations spread
out uniformly along the performance-power spectrum to be able to satisfy diverse demands
well. The provided performance spectrum is divided into intervals and configurations that
provide performance values closest to the interval bounds are chosen. Bound based selection
for CSSR performs the following actions in the scenario where k out of n configurations have
to be retained.
1. With the knowledge of the highest (Ph) and lowest (Pl) provided performance levels,
the performance range is divided into k−1 equal sized intervals such that the first interval
starts at Pl and the last interval ends at Ph. The intervals are thus delimited by bounds
bj, where 1 ≤ j ≤ k.
2. Bound-based CSSR selects one configuration that is closest to each bound and retains
it in the configuration space. Since each configuration si has an associated Pi, a distance
metric (DM) is used to quantify the distance between configuration si and a bound bj.
We define DMj for a configuration si and a bound bj as
\[ DM_j = |P_i - b_j| \tag{4.6} \]
3. For each bj, the configuration providing the least value for DMj is retained in the
final configuration space. We designate this configuration as the minimizer for DMj
($i = \mathrm{mzr}_{DM_j}$). This ensures the uniform sampling of the configuration space. Ties
between two equally close configurations can be broken in favor of the one that provides
a higher $P_{norm}/W_{norm}$ ratio.
Although it appears that the time complexity of the algorithm is O(kn) (we have to calculate
DM for every configuration with respect to every bound), it can be computed in O(n) time by
observing the following.
\[ \{ i = \mathrm{mzr}_{DM_l} \wedge j = \mathrm{mzr}_{DM_m} \} \wedge \{ b_l \ge b_m \} \Rightarrow \{ P_i \ge P_j \} \wedge \{ W_i \ge W_j \} \tag{4.7} \]
Notice that this condition holds since configurations that do not satisfy this property are
removed from the configuration space by the ELIC step. Since the same configuration can
be selected for multiple bounds, an additional constraint of selecting a previously unselected
configuration into the configuration space for every bound is enforced.
The pseudo code for the algorithm is shown in Algorithm 2. An example demonstrating
the working of this algorithm is shown in Figure 4.4, where n = 8, k = 5. In the figure,
the performance and power consumption characteristics of an initial set of configurations are
plotted along two parallel lines. The configurations are represented as si, where i varies between
1 and 8. The dashed vertical lines represent the bounds generated. The distance metric
(labeled UM in the figure and U in Algorithm 2) is initially set to a large
value, say 1. The algorithm starts by considering configuration s1 for b1. The corresponding
UM_1 is measured as |0.5 − 0.5| = 0. As UM decreased from its initial value (1), the next
configuration is now considered for the same bound. UM_1 for s2 is calculated as 0.05. Since we
observe an increase in UM, s1 is selected as the minimizer for b1. The algorithm then considers
s2 for b2, and the process continues.
[Figure 4.4 plot: performance and power values of the configurations plotted along two parallel lines, with bounds marked at 0.5, 0.625, 0.75, 0.875 and 1]
S = {S1 (0.5, 0.45), S2 (0.55, 0.5), S3 (0.65, 0.6), S4 (0.7, 0.65), S5 (0.75, 0.7), S6 (0.8, 0.85), S7 (0.9, 0.9), S8 (1, 1)}
Bounds = {0.5, 0.625, 0.75, 0.875, 1}
S′ = {S1 (0.5, 0.45), S3 (0.65, 0.6), S5 (0.75, 0.7), S7 (0.9, 0.9), S8 (1, 1)}
Algorithm actions:
b1: S1: UM_1 = 0; S2: UM_1 = 0.05; mzr(UM_1) = S1; S1 selected
b2: S2: UM_2 = 0.075; S3: UM_2 = 0.025; S4: UM_2 = 0.075; mzr(UM_2) = S3; S3 selected
b3: S4: UM_3 = 0.05; S5: UM_3 = 0; S6: UM_3 = 0.05; mzr(UM_3) = S5; S5 selected
b4: S6: UM_4 = 0.075; S7: UM_4 = 0.025; S8: UM_4 = 0.125; mzr(UM_4) = S7; S7 selected
b5: S8: UM_5 = 0; mzr(UM_5) = S8; S8 selected
Figure 4.4 CSSR using bound-based selection when n=8 and k=5
Algorithm 2 Bound based selection
  b ← b_1
  U_Best ← U_MAX
  U_Best_Conf ← NULL
  s ← s_1
  while s ≠ NULL do
    if b = NULL then
      break
    end if
    if UCalc(s, b) < U_Best then
      U_Best ← UCalc(s, b)
      U_Best_Conf ← s
      s_prev ← s
      s ← (s → next)
    else
      Select(s_prev)
      b ← b_next
      U_Best ← U_MAX
      U_Best_Conf ← NULL
    end if
  end while
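The single pass of Algorithm 2 can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the implementation used in this work: UCalc is taken to be |P_s − b| (which reproduces the UM values listed in Figure 4.4), the linked lists are replaced by Python lists, and a final selection after the loop is added so that the pending candidate is not lost when the configurations run out. The names u_calc and bound_based_selection and the tuple layout are hypothetical.

```python
def u_calc(config, bound):
    """Assumed metric: distance between a configuration's performance and the bound."""
    _name, perf, _power = config
    return abs(perf - bound)


def bound_based_selection(configs, bounds):
    """Select at most one configuration per bound in a single pass.

    configs: list of (name, performance, power), sorted by increasing performance.
    bounds:  list of performance bounds in increasing order.
    """
    selected = []
    s = 0                      # index of the configuration under consideration
    b = 0                      # index of the bound under consideration
    u_best = float("inf")      # plays the role of U_MAX
    best_conf = None
    while s < len(configs) and b < len(bounds):
        u = u_calc(configs[s], bounds[b])
        if u < u_best:
            # Still approaching the bound: remember this candidate and move on.
            u_best, best_conf = u, configs[s]
            s += 1
        else:
            # The metric stopped improving: the previous candidate is the minimizer.
            selected.append(best_conf)
            b += 1
            u_best, best_conf = float("inf"), None
    # Assumption (not part of Algorithm 2): keep the pending candidate when the
    # configurations run out before the bounds do.
    if best_conf is not None and b < len(bounds):
        selected.append(best_conf)
    return selected


# Data from Figure 4.4 (n = 8, k = 5), as (name, performance, power).
S = [("S1", 0.5, 0.45), ("S2", 0.55, 0.5), ("S3", 0.65, 0.6), ("S4", 0.7, 0.65),
     ("S5", 0.75, 0.7), ("S6", 0.8, 0.85), ("S7", 0.9, 0.9), ("S8", 1.0, 1.0)]
BOUNDS = [0.5, 0.625, 0.75, 0.875, 1.0]
print([name for name, _, _ in bound_based_selection(S, BOUNDS)])
```

Because the configuration index never moves backwards, each configuration is evaluated at most twice (once when it ends the search for one bound and once when it starts the search for the next), and a configuration that has already been selected is never selected again.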
[Figure: graph over nodes 1 to 18, one node per retained configuration]
Figure 4.5 Graphical representation of the retained configuration space for soplex benchmark
after ELIC pruning step
4.4.3 Neighborhood based selection
This pruning method exploits the way in which the different configurations are
distributed in the performance-power space. It also eliminates more configurations
than the above methods while still meeting the user demands with a defined degree of precision.
Let us first consider an example to motivate the neighborhood based selection approach. Figure
4.5 shows the configurations retained for the soplex benchmark after the ELIC step as a graph.
The retained configurations are shown as nodes in the graph. Nodes
whose performance and power values are within 5% of each other are connected by an
edge. Configurations connected by an edge are referred to as neighbors.
It can be seen that the graph contains a number of connected components. Since connected
nodes deliver similar performance while consuming similar amounts of power, only one node
from each pair of connected nodes needs to be kept in the final configuration set. This guarantees
that the retained configuration space can cater to user demands almost as well as the
un-pruned set. For example, in the figure, the connected components are: {1}, {2}, {3}, {4},
{5}, {6, 7}, {8, 9}, {10, 11}, {12, 13, 14}, {15, 16}, {17, 18}, where nodes 12, 13, and 14 are
pairwise connected. In this example, if
configuration 13 is retained in the final configuration space, then configurations 12 and 14 can
be eliminated. If any of the eliminated configurations was optimal for satisfying a user demand,
then its removal increases the inaccuracy in satisfying that demand by at most 0.05.
Following this discussion, if a total of only 11 configurations can be retained in the final configuration
space, a good set of configurations would be {1, 2, 3, 4, 5, 6, 8, 10, 13, 15, 17}.
4.4.3.1 Satisfying a user demand
An individual user demand (Ud) is represented as a 2-tuple Ud = ⟨Pd, Wd⟩, specifying
the normalized performance (Pd) and power consumption (Wd). We use the Inaccuracy (I) in
satisfaction of Ud as a measure of imprecision in demand satisfaction. For a given configuration
set S and a user demand Ud, I is given by
\[ I(U_d, S) = \min_{s_i \in S} \Big( \max\big( \max(P_d - P_{s_i}, 0),\ \max(W_{s_i} - W_d, 0) \big) \Big) \tag{4.8} \]
The goal is to transform S to S′ such that, for any arbitrary Ud, I(Ud,S′) − I(Ud,S) < p and |S′| ≤ k,
where p is the acceptable loss in precision.
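Equation 4.8 can be evaluated directly, as in the minimal Python sketch below. The function name inaccuracy and the (performance, power) tuple layout are hypothetical, and the sample set is the pruned set S′ of Figure 4.4, read here as (performance, power) pairs.

```python
def inaccuracy(demand, configs):
    """Inaccuracy I(Ud, S) of Equation 4.8.

    demand:  (Pd, Wd), normalized demanded performance and power.
    configs: iterable of (P, W), normalized performance and power per configuration.
    """
    p_d, w_d = demand
    return min(
        max(max(p_d - p, 0.0),   # performance shortfall, if any
            max(w - w_d, 0.0))   # power overshoot, if any
        for p, w in configs
    )


# Pruned set S' from Figure 4.4, as (performance, power) pairs.
S_PRIME = [(0.5, 0.45), (0.65, 0.6), (0.75, 0.7), (0.9, 0.9), (1.0, 1.0)]

print(inaccuracy((0.7, 0.7), S_PRIME))   # (0.75, 0.7) meets both constraints
print(inaccuracy((0.8, 0.6), S_PRIME))   # no configuration meets both constraints
```

A demand is satisfied exactly (I = 0) only when some configuration delivers at least Pd while consuming at most Wd; otherwise the configuration with the smallest worst-case violation determines I.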
We use a greedy method, which we call neighborhood-based elimination, to eliminate a
subset of configurations. We represent the configurations in S as a graph G(V, L), where V
represents the set of vertices and L represents the set of links. The construction of G follows
these properties.
• Each configuration si ∈ S is represented by a vertex vi ∈ V .
• Vertices representing two configurations si and sj in S are connected by an edge in G
if max(|Psi − Psj |, |Wsi − Wsj |) < p. In this case, (vi, vj) ∈ L. This is referred to
as the closeness property. For our experiments, we chose p to be 0.05. The condition
ensures that utilizing si instead of sj to satisfy any arbitrary Ud yields
I(Ud,S′) − I(Ud,S) < p.
For each vertex in G, a neighbor set is constructed. The neighbor set for a given vertex is
defined as the set of vertices that have a link to it. The construction of all neighbor sets can
be collectively represented as ∀vi, vj ∈ V, vj ∈ N(vi) ⇔ (vi, vj) ∈ L, which takes O(n^2) time
if |S| = n. The pruning problem translates to selecting a set of k vertices (denoted by the
set V′) from V such that V′ includes at least one vertex from each neighbor set. This follows
from the argument that utilizing the configuration corresponding to any one vertex in the
neighbor set of a vertex vi will satisfy any arbitrary user demand that is originally satisfied by
the configuration corresponding to vi without increasing I by more than p. Since the neighbor
sets are symmetric, i.e., vj ∈ N(vi) ⇔ vi ∈ N(vj), the pruning problem can also
be stated as the selection of a vertex set V′ from V such that |V′| ≤ k and ∪vi∈V′ N(vi) = V.
In addition to the neighbor sets, the neighborhood-based pruning algorithm also maintains
a satisfaction set (T), which contains the configurations for which the closeness property is
satisfied by configurations in S′. Both T and S′ are initialized to the empty set.
The pruning algorithm executes iteratively to select configurations into S′. In each iteration,
the following actions are performed in order.
1. A configuration whose representative vertex in G has the highest cardinality of neighbor
set is selected. Let this configuration be represented by sm and the corresponding vertex
in G be represented by vm.
2. sm is included into S′, i.e., S′ = {sm} ∪ S′.
3. All the vertices in N(vm) are included into T , i.e., ∀vi ∈ N(vm), T = {vi} ∪ T . If the
vertex being included already exists in T , it is not added again.
4. All neighbor sets are updated to remove vertices added to T in the present iteration.
Each iteration of the algorithm requires O(n^2) time. The algorithm terminates when either of the
following situations arises.
1. The cardinality of S′ reaches the target, i.e., |S′| = k.
2. The closeness condition is satisfied for all configurations, i.e., |T| = |S|.
The worst case runtime for the algorithm is O(kn^2). If the algorithm terminates due to
condition 1 above but |T| ≠ |S|, the algorithm has failed to prune S to size k while
achieving the set precision level. In such a situation, p is increased by a step and the whole
process repeats.
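The greedy iteration and the relaxation of p described above can be sketched in Python as follows. This is an illustrative sketch under stated assumptions rather than the implementation used in this work: every vertex is counted as its own neighbor (its distance to itself is 0 < p, so isolated configurations can still enter T), ties in neighbor set size are broken by the lowest index, the relaxation step of 0.01 is a hypothetical value, k is assumed to be at least 1, and the sample data are made up.

```python
def neighbor_sets(configs, p):
    """Neighbor set of every vertex under the closeness property.

    Assumption: each vertex is included in its own neighbor set.
    """
    n = len(configs)
    return [
        {j for j in range(n)
         if max(abs(configs[i][0] - configs[j][0]),
                abs(configs[i][1] - configs[j][1])) < p}
        for i in range(n)
    ]


def neighborhood_prune(configs, k, p=0.05):
    """Greedy neighborhood-based elimination.

    Returns (indices selected into S', True if every configuration is in T).
    """
    n = len(configs)
    neighbors = neighbor_sets(configs, p)
    selected = []        # S'
    satisfied = set()    # T
    while len(selected) < k and len(satisfied) < n:
        # Step 1: vertex with the largest remaining neighbor set.
        m = max(range(n), key=lambda i: len(neighbors[i]))
        selected.append(m)                 # Step 2: add s_m to S'
        newly = set(neighbors[m])
        satisfied |= newly                 # Step 3: add N(v_m) to T
        for nb in neighbors:               # Step 4: update all neighbor sets
            nb -= newly
    return selected, len(satisfied) == n


def prune_with_relaxation(configs, k, p=0.05, step=0.01):
    """Repeat the pruning with a larger p until at most k configurations suffice."""
    while True:
        selected, complete = neighborhood_prune(configs, k, p)
        if complete:
            return selected, p
        p += step


# Hypothetical (performance, power) data, for illustration only.
CONFIGS = [(0.50, 0.45), (0.52, 0.47), (0.54, 0.49),
           (0.80, 0.85), (0.82, 0.86), (1.0, 1.0)]
print(neighborhood_prune(CONFIGS, k=3))
```

With k = 3 the three clusters in the sample data are each represented once; with k = 2 the first call reports an incomplete T, which is the case in which prune_with_relaxation enlarges p and retries.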
For the set of benchmarks studied, the pruning algorithm was always able to achieve |S′| <
k. The average and the per-benchmark smallest and largest number of configurations retained
after neighborhood based selection for CSSR are 11, 9, and 14 respectively.
The overall runtime for the algorithm is O(kn^2) when k configurations need to be
retained. As n is already restricted to a low value by ELIC, this does not pose a big problem.
However, an additional optimization can be performed to speed up the algorithm, if deemed
necessary. Since configurations separated by larger differences in their indices
are generally observed not to be connected, selecting them in parallel can lead to faster population of T. To
achieve this, the list of configurations can be hashed into different buckets using a modulo hashing
scheme, and the pruning algorithm can then select buckets of configurations into S′ in
the above mentioned fashion rather than individual configurations. However, we have not implemented this
strategy since |S| − |S′| is small after the ELIC step.
4.5 Evaluation of the different CSSR pruning methods
4.5.1 Final configuration space
Figure 4.6 shows |S′| following the three CSSR pruning methods. In the figure, x-axis shows
the different benchmarks used for evaluation and y-axis shows |S′|. Merit based selection for
CSSR always selects the target k configurations based upon merit. Bound based selection for
CSSR selects one configuration per interval bound. However, it is observed that |S′| is slightly
less than k. This happens because configurations with high values of Pnorm are sometimes
selected for intermediate interval bounds which leads to unavailability of additional configura-
tions to choose for the final few bounds. As expected, neighborhood based selection for CSSR
aggressively prunes the configuration space and leads to small configuration spaces.
4.5.2 User demand tracking
A first order analysis is performed to quantify how the different methods for CSSR fare in
satisfying user demands. Since user demands are variable, demands with different degrees of
performance and power requirements are considered. The different demand scenarios considered
are shown in Table 4.5. For each scenario, a set of 10,000 user demands is synthetically
generated. For each user demand, a single configuration from the retained configuration space
that is expected to satisfy it is chosen. The inaccuracy in demand satisfaction is noted. The
inaccuracies in demand satisfaction for the different demands generated for a demand scenario
are averaged to smooth out the effects of individual outliers.
[Figure: bar chart; x-axis: Benchmark; y-axis: Final configuration space size (0 to 18); series: merit based, bound based, neighborhood based]
Figure 4.6 Final adaptive microarchitectural configuration space size
Scenario                   Pd and Wd
High performance (HP)      Pd > 0.7, Wd = 1
Low power (LW)             0.3 < Wd < 0.5, Pd = 0.3
Balanced demands (BDδ)     0.5 < Pd < 0.7, Pd − Wd = δ
Stringent demands (SDδ)    Pd > 0.7, Pd − Wd = δ
Table 4.5 Different performance-power demand scenarios
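The demand scenarios of Table 4.5 and the averaging step can be sketched in Python as below. The sampling distribution is not specified above, so uniform sampling is an assumption, as is capping Pd at 1.0 (it is a normalized value); the function names, the scenario keys, and the seed are hypothetical. The sample configuration set is the pruned set S′ of Figure 4.4, read as (performance, power) pairs.

```python
import random


def generate_demands(scenario, count=10_000, delta=0.1, seed=0):
    """Generate (Pd, Wd) demands for one scenario of Table 4.5.

    Assumptions: values are drawn uniformly and Pd is capped at 1.0; the
    boundary values of each range occur with negligible probability.
    """
    rng = random.Random(seed)
    demands = []
    for _ in range(count):
        if scenario == "HP":      # high performance
            demands.append((rng.uniform(0.7, 1.0), 1.0))
        elif scenario == "LW":    # low power
            demands.append((0.3, rng.uniform(0.3, 0.5)))
        elif scenario == "BD":    # balanced, Pd - Wd = delta
            p_d = rng.uniform(0.5, 0.7)
            demands.append((p_d, p_d - delta))
        elif scenario == "SD":    # stringent, Pd - Wd = delta
            p_d = rng.uniform(0.7, 1.0)
            demands.append((p_d, p_d - delta))
        else:
            raise ValueError(f"unknown scenario: {scenario}")
    return demands


def average_pi(demands, configs):
    """Average inaccuracy (Equation 4.8) over all demands, as a percentage."""
    total = 0.0
    for p_d, w_d in demands:
        total += min(max(max(p_d - p, 0.0), max(w - w_d, 0.0))
                     for p, w in configs)
    return 100.0 * total / len(demands)


# Pruned set S' from Figure 4.4, as (performance, power) pairs.
S_PRIME = [(0.5, 0.45), (0.65, 0.6), (0.75, 0.7), (0.9, 0.9), (1.0, 1.0)]
print(average_pi(generate_demands("HP"), S_PRIME))
print(average_pi(generate_demands("SD", delta=0.3), S_PRIME))
```

Averaging over the whole demand set is what keeps a single poorly served demand from dominating the reported PI for a scenario.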
Figure 4.7 shows the inaccuracies reported for the high performance demand scenarios.
The x-axis in the figure denotes the different benchmarks and the y-axis reports the inaccuracy
values as percentages (PI). Both bound based selection and neighborhood based selection
track different user demands efficiently. The average PI reported for these selection methods
is less than 1%. Merit based selection leads to an average PI of 9%. The increase in power
consumption to boost performance at the higher end of the performance spectrum is typically
higher than the corresponding increase at the low end of the performance spectrum. As such,
meritorious configurations are concentrated at the lower end of the performance spectrum.
Many configurations at the higher end of the performance spectrum are pruned by merit based
selection, which leads to unavailability of proper configurations to satisfy user demands in this
scenario.
[Figure: bar chart; x-axis: Benchmark; y-axis: PI (%), 0 to 30; series: bound based, merit based, neighborhood based]
Figure 4.7 PI in tracking high performance demands
[Figure: bar chart; x-axis: Benchmark; y-axis: PI (%), 0 to 1.2; series: bound based, merit based, neighborhood based]
Figure 4.8 PI in tracking low power demands
Figure 4.8 shows the inaccuracies reported for low power scenarios. All three CSSR
pruning methods perform well in tracking user demands in this scenario. The slightly high PI in
the case of astar, gobmk, and mcf can be attributed to the limitations of the considered adaptive
controls. The lowest values of normalized power consumption possible for these benchmarks when
using the configuration space retained after the ELIC step are 0.34, 0.35, and 0.33 respectively.
Figure 4.9 shows the inaccuracies reported for balanced demands when (a) δ = 0.1, and (b)
δ = 0.2. As expected, the PI values increase when δ is increased from 0.1 to 0.2. Once again,
all three pruning methods efficiently track the user demands. The increase in PI between
the cases of δ = 0.1 and δ = 0.2 is lower for merit based selection than for the other pruning
methods. This shows how the merit metric used leads to selection of configurations providing
higher performance while consuming lower power. The average PI reported is less than 1%
when δ = 0.1 and is around 4% when δ = 0.2. These low values are an indication that the
power needed to deliver intermediate performance levels is low.
Figure 4.9 PI in tracking balanced demands
Figure 4.10 shows the inaccuracies reported for stringent demands when (a) δ = 0.1, (b)
δ = 0.2, and (c) δ = 0.3. It is observed that utilizing the configurations retained by bound based
selection results in the lowest PI. The corresponding PI values are closely followed by those
obtained for neighborhood based selection. This shows the effectiveness of neighborhood based
selection in simultaneously reducing PIs and configuration space size. Merit based selection
results in large PI values due to reasons explained earlier for the case of the high performance
demand scenario. Both bound based selection and neighborhood based selection lead to PIs
which are less than 8% on average when δ = 0.3. Note that further reduction in PI is
possible if intra-application adaptation is considered.
Figure 4.10 PI in tracking stringent demands
4.6 Conclusion
In this chapter, a methodology to prune the adaptive microarchitectural configuration space is
presented. Using an x86 processor, it is first demonstrated that only a small set of adaptive
hardware components is sufficient to achieve effective performance-power tradeoff. Next, it is
shown that among the chosen adaptive components with different levels of adaptivity, only
a small number of combinations (configurations) of them are meritorious to deliver the most
effective performance-power tradeoff. A set of algorithms is further designed to reduce the
number of configurations to a specified size to keep the run time complexity of utilizing them
low. Finally we make an observation that a small ( 16) pruned set of configurations is effective
in satisfying varied user demands. The most effective pruning technique also depends on the
user needs of high performance, low power, or a balance between them.
CHAPTER 5. DEGRADATION OF PERFORMANCE-POWER
TRADEOFF UNDER PERMANENT FAULTS
The adapted microarchitectural components and the glue logic providing the adaptivity
are susceptible to permanent faults, like any other component on the microprocessor floorplan.
A fault prohibits usage of a subset of adaptations originally provided. In Chapter 4, the
configuration space has been significantly reduced to retain only the most useful configurations.
Since all configurations retained are deemed important, it is necessary to evaluate how the
tradeoff is affected when one or a few of the considered adaptations fail. This study also provides
insight into how important each of the considered adaptive components and its associated
adaptations are to effectively trade off performance and power. Our observations indicate that
the change in system behavior in terms of performance delivered and power consumed due to
the occurrence of a fault typically stays below 10% of the required/desired levels when serving a
large set of demands. When the required performance cannot be provisioned, we observe that a
significant power saving can be achieved as well. Additionally, we identify the
adaptive controls that can be deliberately left unused to conserve 5-7% additional power while
sacrificing minimal performance when a fault occurs.
5.1 Introduction
The implementation of an adaptive hardware component requires extra logic circuitry to
enable/disable the use of the various adaptive levels associated with it. This logic circuitry
and the other building blocks in these adaptive components are both subject to permanent
failure. The reason for the failure can be manufacturing defects, early life failures, or wear-outs
([22, 3, 11]). Wear-out failures can be further caused by one or a few of the following phenomena:
electromigration, stress migration, dielectric breakdown, etc. We analyze the system behavior
when a permanent fault manifests in the hardware associated with the adaptive components.
Since we select only the most effective hardware configurations for implementation, the presence
of all these configurations may be required for effective performance-power tradeoff. This makes
the failure analysis very important.
In the following analysis, it is assumed that the presence of the aforementioned faults
is implicitly detectable. Analysis of fault detection methods is not within the scope of this
research. It is assumed that faults are of permanent nature and are located and marked
through existing fault detection mechanisms (e.g., [9], [67], [10]). The reconfiguration is carried
out so as to not utilize the faulty modules. The adaptation scheme then manages only the
available components.
Integrity checking and fault detection and tolerance techniques pertaining to cache memory
already exist in commercial microprocessors [79]. Modern processors like the ARM Cortex
series of processors use a 64-bit ECC [62] to protect the instruction cache. To reduce the
associated latency and power consumption, IBM Power 6/7 [83], AMD Opteron [59], SPARC64
[89], etc. protect the L1 instruction cache using parity. When the parity check detects an
error, the instruction can be refetched. The non-stop architecture proposed in [8] provides a
sniffer/scrubber that tests all memory locations for errors periodically. Errors are corrected
if possible and written back. A reread followed by an integrity check can be used to decide
whether the fault is transient or permanent. The memory locations containing permanent
faults are taken out of service. Given the above, the only fault considered in a cache block is
when the entire tag check logic fails for a single way. Thus, one way becomes inoperable. In
such a case, the cache will reduce from 8-way to 4-way associative in the considered fault scenarios.
The considered fault scenarios are detailed next.
5.2 Fault model
Analysis is performed to study the effectiveness of utilizing the pruned configuration space
in catering to varied user demands when a single permanent fault occurs. The following fault
scenarios are individually studied. Combinations of these fault scenarios, though possible,
are not investigated in detail since the probability of such a fault manifestation is low. It will
be noticed that when one component fails, to achieve proper performance-power balance, a
possible adaptation corresponding to another adaptive component is not utilized anyway. So
effectively some scenarios of multiple faults in different control knobs are already implicitly
covered. Further, absence of particular adaptive configurations may require the consideration
of some configurations that were eliminated by the ELIC pruning step. Usage of such ineffective
configurations may lead to lower PI in case of faults. Hence, the entire configuration space
retained after the SACK step is considered for this analysis.
Fault scenarios
1. Adaptive dispatch port failure: This scenario encompasses the failures associated with the
functioning of a single dispatch port. The instructions scheduled for execution reside in
FIFO buffers waiting to be issued for execution by the dispatch logic. Faults in the buffer
can affect the functioning of the dispatch port. Faults in individual entries of the buffer
can be handled using mechanisms discussed in [9]. The adaptive logic to enable/disable a
dispatch port can encounter a failure at which point the dispatch port becomes inoperable.
We study the system behavior under such failures. Since the allowed configurations for
the dispatch width are 1, 2, and 4, failure of a dispatch port prohibits the use of a dispatch
width of 4. Note that we assume that all the dispatch ports are homogeneous and all the
ports can issue any arbitrary allowed instruction.
2. Adaptive cache way failure: Faults in individual memory locations can be dealt with using
parity, error correcting codes, or other related techniques. Examples for these are provided
later on. However, the tag comparator or the way enable logic for a single cache way can
also fail. This prohibits the use of an associativity setting of 8. The allowed cache
associativity levels when a cache way fails are 2 and 4.
3. Instruction window chunk failure: The instruction window can be designed as multiple
chunks, each of which can be enabled or disabled independently [13]. To provide for the
considered adaptive configurations, the instruction window can be designed as 4 chunks,
each containing 32 entries. When a single chunk fails, the deployable configurations
are 32 and 64 entries. We do not consider the case of 96 entries, although possible under
a single fault scenario, to keep all the adaptation sizes as powers of 2. We consider faults
that impede the operation of an entire instruction window chunk. Failures in individual
entries can be masked using techniques proposed in previous research [9].
4. VF setting failure: Dynamic voltage and frequency scaling (DVFS ) control is provided in
real-world microprocessors as a 2-step process. Once the DVFS controller determines the
correct VF setting to use, a control signal is sent to an on-board oscillator which adapts
the clock frequency accordingly. Voltage scaling can be implemented in multiple ways [19].
The chip can be fed with multiple supply voltages and an individual voltage rail can be
provided to carry each of these levels. A set of pull up transistors tap these voltage rails to
provide supply voltage for the hardware components. The DVFS controller provides the
necessary gate control signals for these transistors. A failure can occur either in the power
rail (open circuit), the pull up transistor, or the input pin on the chip to which a voltage
supply is connected. If the number of provisioned voltages is large, it becomes impractical
to provide a large number of supply voltages (and power rails) to the chip. Alternatively,
a DC-DC voltage converter can be provisioned on-board to generate the required voltage
levels using a single input supply voltage. Requirement of large inductors and capacitors
in this regard complicates the design. Since the chosen number of VF settings is small,
we consider the former design practice. A single failure prohibits the use of a
single voltage and frequency level. The other two voltage and frequency levels can still
be used.
Table 5.1 summarizes the different fault scenarios investigated and the available adaptation
levels for the faulty components in the presence of faults.
Potential fault detection schemes. Current processors come with a set of counters and
mechanisms to measure and store the on-chip voltage and frequency values. This infrastructure
can be used to check the occurrence of desired DVFS transitions. Any discrepancy will indicate
a failure of a certain voltage or frequency setting.
No.  Fault scenario                     Retained adaptations
1    Dispatch port failure              {1, 2}
2    Cache way failure                  {2, 4}
3    Instruction window chunk failure   {32, 64}
4    Lowest VF setting failure          {(1.484 V, 1.6 GHz), (1.228 V, 1.2 GHz)}
5    Intermediate VF setting failure    {(1.484 V, 1.6 GHz), (1.036 V, 0.8 GHz)}
6    Highest VF setting failure         {(1.228 V, 1.2 GHz), (1.036 V, 0.8 GHz)}
Table 5.1 Investigated fault scenarios
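The effect of each fault scenario in Table 5.1 on the usable configuration space can be sketched in Python as a simple filter. This is only an illustration: the dictionary keys and function name are hypothetical, the fault-free instruction window levels {32, 64, 128} are an assumption inferred from the 4 chunks of 32 entries and the power-of-2 restriction, and the full cross product of levels used in the example is not the configuration space actually retained after the SACK step.

```python
from itertools import product

# Fault-free adaptation levels. The instruction window levels are an assumption.
FULL_LEVELS = {
    "dispatch_width": [1, 2, 4],
    "icache_ways": [2, 4, 8],
    "iw_entries": [32, 64, 128],
    "vf": [(1.484, 1.6), (1.228, 1.2), (1.036, 0.8)],   # (volts, GHz)
}

# Table 5.1: fault scenario -> (affected control, retained adaptation levels).
FAULT_SCENARIOS = {
    "dispatch_port": ("dispatch_width", {1, 2}),
    "cache_way": ("icache_ways", {2, 4}),
    "iw_chunk": ("iw_entries", {32, 64}),
    "lowest_vf": ("vf", {(1.484, 1.6), (1.228, 1.2)}),
    "intermediate_vf": ("vf", {(1.484, 1.6), (1.036, 0.8)}),
    "highest_vf": ("vf", {(1.228, 1.2), (1.036, 0.8)}),
}


def available_configs(configs, fault=None):
    """Keep only the configurations that avoid the faulty adaptation level."""
    if fault is None:                       # the fault-free ("all") scenario
        return list(configs)
    control, retained = FAULT_SCENARIOS[fault]
    return [c for c in configs if c[control] in retained]


# Illustrative space: every combination of the fault-free levels.
ALL_CONFIGS = [dict(zip(FULL_LEVELS, combo))
               for combo in product(*FULL_LEVELS.values())]

print(len(ALL_CONFIGS))
print(len(available_configs(ALL_CONFIGS, "dispatch_port")))
```

Each single fault removes exactly one level of one control, so all configurations that rely on that level become unavailable while the remaining controls keep their full range.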
Dispatch port failures are also easy to detect. The function of a dispatch port is to issue
one or more selected instructions to available functional units. Further, the type of instruction
dispatched will also determine which execution unit should process the instruction. As part
of dispatch, proper inputs corresponding to the dispatched instruction are forwarded to the
input wires of the functional unit. Comparing the actual inputs being fed to the functional
unit against the expected inputs can determine if a dispatch occurred properly.
Additional cache tag comparators can be provided in the hardware and a triple modular
redundancy based approach can be used for detection of cache tag comparator failure. Since
the number of cache ways is usually small, we do not expect this to add significant hardware
overhead.
For detecting failures in instruction window chunks, individual entries need to be monitored
both before and after each update to ensure that the update happens as desired. Additional
auxiliary registers can be provided to record the contents of individual entries before updates
happen. The update can be mirrored in the additional registers using additional update logic.
The contents of the auxiliary registers can be compared with corresponding instruction window
entries after the update. Although this mechanism works for detecting faults, the associated
hardware overhead will be significant. Additional studies need to be carried out to investigate
alternate fault detection mechanisms.
5.3 Evaluation of tradeoff degradation
To quantify the system behavior in the presence of faults, the average performance delivered
(Pavg), power consumed (Wavg), and the inaccuracy in satisfying the performance (Pin)
and power constraints (Win) produced while catering to various user demands are analyzed.
Pavg and Wavg are measured as fractions normalized to the maximal possible values (obtainable
using Smax). The values for these metrics resulting from choosing an optimal configuration
from the configuration space available with and without the presence of a fault are obtained
and compared. The set of SPEC 2006 benchmarks considered for evaluating the different CSSR
methods in Chapter 4 is used as the workload for analyzing the system behavior in the presence of faults.
The Pavg, Wavg, Pin, and Win reported are averaged over the corresponding values obtained
for these benchmarks.
Similar to the evaluation procedure in Chapter 4, user demands are synthetically generated
to represent different demand scenarios. For each scenario, 10,000 user demands are generated.
Each demand contains a primary as well as a secondary constraint. A single configuration that
is deemed best to serve each user demand is chosen and the corresponding inaccuracy values
are measured. For a demand, the chosen configuration satisfies the following properties.
1. The chosen configuration satisfies the primary constraint. If no available configuration
satisfies the primary constraint, the configuration leading to the lowest PI in satisfying
it is selected.
2. If primary constraint is satisfied by multiple configurations, the chosen configuration
performs the best with regards to the secondary constraint.
For all except low power demands, performance is considered the primary constraint. The
tradeoff degradation for these 10,000 demands is averaged to eliminate individual outliers.
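The two selection properties above can be sketched in Python as follows. This is a minimal sketch, not the selection code used in this work: "best with regard to the secondary constraint" is read here as the smallest violation of that constraint, ties keep the earliest configuration in the list, and the function name and (performance, power) tuple layout are hypothetical. The sample set is the pruned set S′ of Figure 4.4, read as (performance, power) pairs.

```python
def choose_configuration(demand, configs, primary="performance"):
    """Pick one (P, W) configuration for a demand (Pd, Wd).

    Assumption: the secondary constraint is scored by its violation only.
    """
    if primary not in ("performance", "power"):
        raise ValueError("primary must be 'performance' or 'power'")
    p_d, w_d = demand

    def perf_gap(c):
        return max(p_d - c[0], 0.0)     # performance shortfall

    def power_gap(c):
        return max(c[1] - w_d, 0.0)     # power overshoot

    if primary == "performance":
        first, second = perf_gap, power_gap
    else:
        first, second = power_gap, perf_gap

    satisfying = [c for c in configs if first(c) == 0.0]
    if satisfying:
        # Property 2: among those meeting the primary constraint, take the
        # one that does best on the secondary constraint.
        return min(satisfying, key=second)
    # Property 1 (fallback): nothing meets the primary constraint, so take
    # the configuration with the lowest inaccuracy in satisfying it.
    return min(configs, key=first)


# Pruned set S' from Figure 4.4, as (performance, power) pairs.
S_PRIME = [(0.5, 0.45), (0.65, 0.6), (0.75, 0.7), (0.9, 0.9), (1.0, 1.0)]

print(choose_configuration((0.7, 0.7), S_PRIME))
print(choose_configuration((0.3, 0.4), S_PRIME, primary="power"))
```

For low power demands the roles are swapped, so the power constraint is enforced first and performance is only used to choose among the configurations that meet it.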
5.3.1 Dispatch port failure
Figure 5.1 ((a) and (b)) shows how the system behavior changes/degrades when one of the
dispatch ports or the associated adaptive logic fails. In the figure, the x-axis denotes the user
demand and fault scenarios as mode_scene, where mode represents the user demand scenario and
scene indicates whether a fault exists (f) or not (all). The y-axis denotes (a) the demand satisfaction
inaccuracies (Pin and Win), and (b) the delivered average performance (Pavg) and consumed
power (Wavg).
Figure 5.1 Tradeoff degradation when one dispatch port fails
Observations.
1. A 4% decrease in performance is observed when serving high performance demands.
Although the microprocessor is limited to dispatching a maximum of two instructions
per cycle, utilization of high voltage and frequency guarantees that the performance
stays reasonably high. However, provision of the demanded high performance requires
the utilization of more dispatch ports. It is also noticed that the power consumption stays
fairly similar with or without the fault. This happens since power inefficient configurations
are selected to meet the performance demand when a fault occurs. Hence, higher power
consumption is observed while providing lower performance than demanded.
2. The performance loss observed when serving the low power demands is just ∼2%. Higher
dispatch widths are usually used only for catering to stricter performance requirements,
and the absence of the associated configurations does not affect low power user demands.
For the few cases when a higher dispatch width paired with minimal configurations for
the other adaptive units can be used to serve low power demands, absence of such configurations
necessitates the use of other ineffective configurations to satisfy the primary
constraint. This results in the observed performance loss.
3. When serving balanced user demands, the power consumption is traded off slightly to
satisfy the performance constraint. It is observed that the increase in power consumption
(compared to the all scenario) is ∼3%.
4. The power constraints in the stringent demands are satisfied inaccurately. However, such
an inaccuracy is also observed for the all scenario. The difference between the
Win values for the all and f scenarios is negligible. The average increase in Pin due to
failure is capped at 3.5%.
5. Overall, loss of the wider dispatch width adaptation can be reasonably compensated for
by choosing alternate configurations except when performance demands are at the peak
level. We found no one-to-one correlation between the faulty configurations (otherwise
employed in the all scenario) and the alternate configurations chosen in place of these.
5.3.2 Cache way failure
Figure 5.2 shows how the tradeoff changes/degrades when one of the cache ways fails. The
axes in this figure, as well as in the figures for the individual failure scenarios in the
following subsections, follow the notation of Figure 5.1. Note that failure of a cache
way still theoretically retains 7 other possible settings (1-7). Since we have restricted our
adaptations to just 2, 4, and 8 ways (powers of 2), we proceed to choose only configurations
that utilize 2 or 4 cache ways in case of a cache way failure. If the resultant tradeoff is found
to be significantly suboptimal compared to the case when no fault occurs, we can opt to consider
additional cache adaptations in the future.
Figure 5.2 Tradeoff degradation when one cache way fails
Observations.
1. We observe that the tradeoff in the presence of a fault is strikingly similar to the
behavior when all adaptive controls are usable. In particular, the primary constraints are
always satisfied as per the requirement.
2. The observed differences in delivered performance and consumed power between
the fully adaptive and the faulty scenarios are always less than 1%. However, the
limitations in the accuracy of the simulation framework used prohibit us from accurately
commenting on the fine differences between the observed performance and power
of the two scenarios.
3. Based on these observations, we claim that investigation of additional levels of adaptivity
for the instruction cache may not be required. The other adaptive controls included into
the configuration space are enough to compensate for the loss of the maximum adaptive
level for the instruction cache.
5.3.3 Instruction window chunk failure
Figure 5.3 shows how the tradeoff changes/degrades when one of the instruction window
chunks fails. Since adaptation of the instruction window has the lowest tradeoff potential, we
expect the system to degrade only marginally when a fault occurs.
Figure 5.3 Tradeoff degradation when an instruction window chunk fails
Observations.
1. A slight inaccuracy (∼1.5%) manifests in serving the performance constraint in the high
performance demands. A similar case also holds for stringent user demands. The maximum
possible instruction window size becomes the performance bottleneck. To deliver
the required performance, a number of aggressive yet ineffective configurations which are
otherwise eliminated by the ELIC pruning step are now utilized. This results in a slight
increase in the power consumption when compared to the all scenario for the stringent
user demands.
2. Low power demands are still satisfied well. Lower instruction window sizes, which are
typically used to serve the associated demands, are still available for selection.
3. When serving balanced user demands, power is traded off slightly to satisfy the perfor-
mance as per the requirement. It is observed that the increase in power consumption
(compared to the all scenario) is ∼2% when δ = 0.2.
4. In general, loss of the highest IW setting leads to a slight loss of performance as well as
a slight increase in power consumption when serving high performance requirements.
5. From these observations, we can claim that consideration of the currently unused but
theoretically possible instruction window adaptation size (96 entries) cannot guarantee
significantly better system behavior.
5.3.4 Voltage and frequency control failure
Since the different VF controls can independently fail, we analyze the system behavior
under the possible faulty scenarios separately.
Figure 5.4 Tradeoff degradation when the lowest VF setting fails
5.3.4.1 Failure of the lowest VF setting
Figure 5.4 shows how the tradeoff changes/degrades when the lowest VF setting becomes
unusable. We make the following observations.
Observations.
1. Since the lowest VF setting is almost never used to serve high performance user demands,
the system behavior while serving these demands remains unchanged under this failure.
2. The observed inaccuracy in serving the power constraint in the low power demands is just
3%. Choosing an intermediate VF setting while lowering the dispatch width, instruction
cache associativity, and instruction window size resulted in maintaining a similar power
profile (as that produced in all scenario). A 2% loss of performance is also observed
in this case, since other ineffective configurations are now utilized to satisfy low power
demands.
3. The tradeoff behavior while satisfying balanced and stringent user demands remains rea-
sonably unchanged (∼1% deviation).
Figure 5.5 Tradeoff degradation when the intermediate VF setting fails
5.3.4.2 Failure of the intermediate VF setting
The changes in tradeoff behavior noted when the intermediate setting for VF control fails
are shown in Figure 5.5. We make the following observations under this fault scenario.
Observations.
1. The highest VF setting is almost always required to satisfy high performance demands
and the intermediate VF control is sparsely used under such circumstances. Since the
highest VF control is still active, the PI when serving such demands remains unchanged.
Due to the discrete nature of the provided performance and power values, an increase in
both Pavg (1%) and Wavg (3%) is observed when the fault manifests.
2. The effects of trading off the secondary constraint to satisfy the primary constraint are
quite noticeable when the intermediate VF setting fails. The primary constraint is always
satisfied while the degradation in system behavior with respect to the secondary constraint
varies between 1% and 12% when serving balanced and stringent user demands. This
effect is more pronounced for the case of balanced user demands.
3. Since the obtainable performance and power characteristics are discrete in nature, the
exact satisfaction of the primary constraint is not always possible. This leads to the selec-
tion of configurations that provide for the primary constraint well above the requirement.
Figure 5.6 Tradeoff degradation when the highest VF setting fails
Although this can happen under any failure scenario, we observed that these effects are
pronounced when the intermediate VF setting fails. For balanced user demands, we observed
that there is a 3% increase in performance even when compared to the all scenario.
This also becomes a contributing factor for the aggressive tradeoff of power consumption.
5.3.4.3 Failure of the highest VF setting
The change/degradation in tradeoff noted when the highest setting for VF control fails is
shown in Figure 5.6. We make the following observations under this fault scenario.
Observations.
1. A 7.5% shortfall in performance manifests in the satisfaction of high performance and
stringent user demands. The utilization of an intermediate VF setting coupled with an
aggressive configuration of the other adapted components is insufficient to provide for the
high performance requirements. However, the system behavior degrades gracefully and
the inaccuracy in demand satisfaction is not very large. An average decrease of 10% in
power consumption is observed as well when the performance constraint cannot be satisfied.
2. The tradeoff remains strikingly similar to the scenario where all adaptive controls are
active when serving low power and balanced user demands.
5.3.5 Power saving with reduced performance requirements
While investigating single component failures, we have encountered two different effects on
the performance-power balance when demands containing intermediate to higher performance
constraints are posed. When the performance requirement can be satisfied using alternate
ineffective configurations (when the ideal configuration fails), an associated increase in power
consumption is noticed. When such a possibility does not exist, a lower performance is deliv-
ered. In this case, a reduction in power consumption is noticed as well. We next investigate
the sensitivity of each adaptive component’s failure towards provisioning different degrees of
performance and note the power consumption. The results presented in this regard also provide
insights to the user about the achievable power reduction when performance demands are
relaxed. The observed performance and power consumption values reported are averaged
over the set of benchmarks investigated so far.
Table 5.2 shows the Pavg and Wavg values when the performance demanded is set to 90%,
80%, and 70% separately. In the table, none refers to the scenario where all adaptive compo-
nents are active.
                               Performance requirement
                        90%              80%              70%
Failure               Pavg    Wavg     Pavg    Wavg     Pavg    Wavg
                      (%)     (%)      (%)     (%)      (%)     (%)
None                  94.2    79.0     84.3    68.8     72.9    52.3
Dispatch port         86.7    76.2     83.8    70.2     74.8    59.2
Cache way             93.6    78.3     84.7    64.0     73.1    53.3
Instruction window    90.8    80.6     84.0    73.2     73.6    58.4
Low VF                94.7    79.7     84.3    68.8     72.9    52.5
Intermediate VF       94.3    79.0     85.8    71.4     75.7    64.5
High VF               79.3    63.4     78.7    61.8     73.7    53.7
Table 5.2 Performance and power characteristics obtained for different performance demands
Observations.
1. We observe that the performance requirement of 90% is deliverable unless a dispatch port
or the highest VF setting fails.
2. The maximum deliverable performance when a dispatch port fails is 86.7%. Similarly,
the maximum deliverable performance when the highest VF setting fails is 79.3%. When
a 90% performance demand is encountered in these situations, the observed power saving
is 23.8% and 36.6% respectively.
3. 80% and 70% performance demands are always closely satisfied under faulty scenarios.
4. Since the configuration space is discrete, there arise situations when performance is over-
provisioned just to stay above the required level. It is noticed that this over-provisioning
never exceeds 5.8% for the set performance constraints.
5. When 90% performance is demanded, the power saving obtained under the fault condi-
tions where the performance demand can be satisfied is at least 19.5%.
6. When 80% performance is demanded, the power saving obtained is around 30%. The
power saving obtained while satisfying the performance constraint is nearly equal in the
cases of all single component failures, except when the highest VF setting fails. When
the highest VF setting fails, the reduction in power seen is higher (∼39%). The other
fault scenarios overcompensate for performance and lose some achievable power gain.
7. When 70% performance is demanded, the power saving obtained is 35.5-47.5%.
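The power savings quoted in the observations above can be re-derived from the Wavg columns of Table 5.2. The following is a minimal sketch, assuming that Wavg is expressed as a percentage of the power consumed by the fault-free maximal configuration, so that the saving is simply 100 minus Wavg. The names `W_AVG` and `power_saving` are illustrative and do not appear in the original work.

```python
# Wavg values (%) copied from Table 5.2, one tuple per failure scenario.
# Tuple entries correspond to the 90%, 80%, and 70% performance requirements.
W_AVG = {
    "None": (79.0, 68.8, 52.3),
    "Dispatch port": (76.2, 70.2, 59.2),
    "Cache way": (78.3, 64.0, 53.3),
    "Instruction window": (80.6, 73.2, 58.4),
    "Low VF": (79.7, 68.8, 52.5),
    "Intermediate VF": (79.0, 71.4, 64.5),
    "High VF": (63.4, 61.8, 53.7),
}


def power_saving(w_avg_percent):
    """Saving relative to 100% power (assumption: Wavg is normalized to the maximum)."""
    return round(100.0 - w_avg_percent, 1)


# Saving per scenario, in the same 90% / 80% / 70% column order as the table.
savings = {name: tuple(power_saving(w) for w in row) for name, row in W_AVG.items()}

for name, row in savings.items():
    print(f"{name:20s} {row}")
```

Under this assumption, the dispatch port and highest VF failures at the 90% demand give the 23.8% and 36.6% savings stated in observation 2.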
5.3.6 Avoiding available adaptations for increased power saving
The microprocessor design process is inclusive of the analysis of synergy between the max-
imal configurations provided for the different hardware components. The failure of the most
aggressive configuration for an adaptive component can result in disruption of this synergy for
the available configuration space. Such situations can be analyzed and exploited to save more
power. Consider two adaptive components i and j. Let the maximum performance obtainable
(and power consumed) when the most aggressive configuration for i fails be $P'$ ($W'$). Similarly,
let the maximum performance obtainable (and power consumed) when the most aggressive
configuration for j is additionally disabled/not considered for adaptation be $P''$ ($W''$). If $P''$ is
close to $P'$ but $W'' \ll W'$, we can choose to not consider the most aggressive configuration for
[Bar chart: the x-axis lists the faulty+disabled component pairs IW+IC, VF+DW, IW+DW, and VF+IW; the y-axis shows P_max and W on a scale of 0 to 1.]
Figure 5.7 Deliverable peak performance and the associated power consumption utilizing a
subset of available configuration space
j as well. We studied the properties of the available configuration spaces under the presence of
the different considered faults to see if such opportunities exist. The analysis performed also
provides some insights into the system behavior under multiple faults.
Figure 5.7 shows the maximum performance obtainable and the associated power consumption
when utilizing a subset of the configuration space. The x-axis in the figure shows the list of
components whose maximal configurations have either failed or are not considered. The y-axis shows
the maximum performance obtainable using the available and considered configuration space,
and the associated power consumed when delivering that performance.
Observations.
1. When the maximal setting for an adaptive control other than instruction cache fails,
it is observed that further ignoring the instruction cache set associativity setting of 8
results in significant power saving, while limiting the performance sacrifice. The obtained
power savings when the highest setting for VF, IW, or DW fails are 5%, 7%, and 7%
respectively, while the performance losses are limited to 1%, 2.5%, and 1% respectively.
The increase in execution time due to bottlenecks in the failed controls overshadows the
additional execution time due to the extra cache misses caused.
2. When a failure occurs in an instruction window chunk, instruction cache way, or dispatch
port, further disabling the highest VF setting provides a 25-30% power gain. However,
a significant performance loss (about 20%) is also observed in these cases.
3. Other possibilities of disabling the aggressive setting for an adaptive component when a
failure occurs in any other component are not beneficial either.
5.4 Conclusion
In this chapter, we analyzed the effectiveness of a pruned adaptive microarchitectural config-
uration space in effectively trading off performance and power when a permanent fault occurs.
The presence of a fault in an adaptive component necessarily prohibits the use of a subset of
the originally provided configurations. For the considered adaptive controls except VF, the
permanent fault prohibits the use of the associated maximal configuration. For the VF control,
the fault prohibits the use of a single VF combination. We observed that the change in the
system behavior in terms of performance delivered and power consumed due to occurrence of
a fault typically stays below 10% when serving a large set of varying demands. At least 90%
of the original performance can still be delivered when a fault occurs in an instruction window
chunk, instruction cache way, or the lower or intermediate VF setting employed. In other faulty
situations, the obtainable maximum performance stays above 80%.
When an adaptive control setting that leads to system operation in a particular region in
the performance-power spectrum fails, alternate configurations are chosen to satisfy demands
associated with this region of the spectrum. The satisfaction of secondary constraint in the
user demand is traded off more aggressively to satisfy the primary constraint when compared to
the scenario where no fault occurs. In particular, this effect is more pronounced (∼10%) when
a fault renders the intermediate VF setting useless. We have observed that an inaccuracy
in satisfying the primary constraint rarely manifests. Under some fault scenarios, extreme
high performance demands cannot be satisfied accurately. In such cases, a drop in power
consumption is observed as well. We note that at least 80% of the maximal performance can
still be delivered when a fault occurs, while saving at least 30% power.
Finally, we noticed that the use of high associativity (8) in the instruction cache doesn’t
produce any noticeable performance benefit when a fault occurs that affects the most aggres-
sive setting in any other adaptive control. The performance advantage provided by the high
associativity is overshadowed by the performance loss caused by the failed adaptive control.
Hence, the adaptations inclusive of this associativity setting for cache need not be considered
once a fault is detected. This leads to an additional 5-7% power saving.
From our observations, we conclude that the considered adaptive configuration space produces
system behavior which is generally resilient to a single permanent fault. Failure of
configurations corresponding to a single control setting still leaves alternate configurations
(which are still operable) that guarantee similar system behavior.
CHAPTER 6. APPLICATION AWARE PERFORMANCE-POWER
TRADEOFF
This chapter covers the details of the different adaptation strategies implemented for mi-
croprocessor performance-power tradeoff. An alternate classification of previously proposed
adaptation strategies is introduced. The advantages and disadvantages associated with various
adaptation strategies are presented. The details of our comprehensive static cum dynamic,
as well as dynamic only adaptation strategies follow this discussion. A detailed evaluation
is carried out to analyze the effectiveness of the developed adaptation strategies in utilizing
the pruned configuration space to provide the required performance-power tradeoff in different
regions of the possible performance-power spectrum. In most cases, it is observed that these
demands on performance and power are satisfied with up to 90% accuracy. For a set power
level, the obtained performance leveraging on our pruned configuration space is found to be 7%
lower compared to a state-of-the-art scheme using 10 times the configuration space. It is also
observed that the use of the developed dynamic adaptation strategies leads to energy efficient
execution. The observed energy efficiency is close to 95% of the ideal efficiency obtained with
a comprehensive oracular adaptation scheme.
6.1 Introduction
It is a well-known fact that hardware-software interactions vary during application exe-
cution. For example, the exhibited parallelism depends on the instruction stream, and as a
consequence, wide variation in ALU or cache usage exists. The application execution profile is
generally demarcated by phases/intervals with different execution characteristics. It is impor-
tant to exploit these changes in application behavior and find the best configuration for each
phase. This serves two purposes. First, it results in better tradeoff decisions since each chosen
configuration will be tailor fit to the associated program phase. Second, due to the discrete
nature of performance-power points provided by the configuration space, a single configuration
may not be sufficient to exactly satisfy different demands. However, a composition of a few
configurations works well.
Microarchitectural adaptation strategies can further be classified into static [44, 65, 29]
and dynamic [61, 31, 2] strategies based upon when the adaptation decisions are made. Static
strategies assume knowledge of the entire application execution characteristics and statically
find a configuration to use for each application phase. Since decisions are made statically, they can
be comprehensive in nature, as the analysis overhead does not fall within execution time. Also, the
knowledge of future application phases makes the adaptation decisions optimal. Such strategies
have two limitations. First, the assumed knowledge of the execution profile may not be prac-
tical for general purpose computing platforms. Second, the assumed knowledge may become
invalid due to runtime variations. Such strategies are generally suited for real-time systems
and HPC platforms where execution time is quite predictable. A dynamic strategy monitors
specific execution characteristics at runtime and deploys suitable configurations accordingly.
Since the adaptation analysis is performed at runtime, it is generally restrictive in nature to
avoid impractical overhead. Also, the analysis is limited to a peephole of instructions and the
adaptation decisions are suboptimal. In Section 6.2, the details of our static cum dynamic
strategy for performance-power tradeoff are presented. The application execution profile is
divided into phases and configurations are chosen for deployment during the individual phases
statically. A simplistic runtime manager adjusts these static decisions to meet runtime re-
quirements on performance and power consumption. The static component ensures tradeoff
optimality and the dynamic component adapts the adaptation decisions to observed runtime
variations. Two lightweight dynamic only adaptation strategies are also developed that lever-
age on the small size of the configuration space and make quick adaptation decisions. The
details of these strategies are presented in Section 6.3.
6.2 Two stage static cum dynamic adaptation strategy
6.2.1 Application phase demarcation
Consider an application whose execution profile contains I instructions. A phase generator
(PG) divides this into M intervals, a process that will be referred to as Phase generation
(PGen), where phase i contains Ii instructions. The idea is to initiate a microarchitectural
adaptation once after each phase finishes execution. The process can be represented as
$\langle I_1, I_2, \ldots, I_M \rangle = PGen(I, HSIP) \mid \{ I_i \neq 0,\ \sum_i I_i = I \}$. In this representation, HSIP
denotes the hardware-software interaction patterns. An HSIP is a 2-tuple $\langle IPS, W_a \rangle$, where
IPS denotes the average instructions committed per nanosecond, and Wa denotes the average
power consumption, in Watts. HSIPs are collected at the granularity of 1 million instructions
each to avoid impractical profiling overhead.
The value of M directly affects the tradeoff optimality. A large M implies a large overall
runtime overhead, while providing fine-grained adaptations. On the other hand, a small M
restricts adaptation control, as the number of adaptation opportunities becomes limited. Our
strategy is to necessitate reconfiguration when either 1) the HSIP changes significantly, or
2) any application phase gets very long. Since HSIPs are collected at the granularity of 1
million instructions, this also constitutes the interval length, i.e., the minimum number of
instructions separating two adaptation instances. Incidentally, this choice keeps the runtime
adaptation overhead below 1%, as explained in Section 6.2.3.2.
The phase generator divides the application execution profile into intervals of 1 million
instructions each, and marks phase changes at the end of selected intervals. The achievable
IPS spectrum utilizing the allowed architectural configurations is first divided into 20 equal
sized intervals. This implies that adjacent intervals are centered on IPS values which are
separated by 5% of the maximum IPS. A similar division is made for the Wa spectrum as
well.
A set of 400 buckets is generated from these windows, each of which covers a specific range
of IPS and Wa values. For each interval, execution with $S_{max}$ yields IPS and Wa values that
fall into a single bucket (FindBucket function in Algorithm 3). A phase change is demarcated when
the associated buckets for two adjacent intervals are different. A phase change demarcation
also occurs when the current phase length becomes larger than 10 million instructions (RecordChange
function in Algorithm 3). The latter demarcation strategy is especially useful in
cases where HSIP remain invariant. In such cases, employment of a single configuration for the
entire execution may lead to poor demand satisfaction due to the discrete nature of provided
performance-power spectrum. The working of the phase generator for an application with n
million instructions is shown in Algorithm 3.
Algorithm 3 Phase generation algorithm
  p_sep ← 0.2
  w_sep ← 1
  bkt_prev ← 0
  p_len ← 0
  i ← 1
  while i ≠ n do
    bkt ← FindBucket(interval i)
    if bkt ≠ bkt_prev or p_len = 20 then
      RecordChange(i)
      bkt_prev ← bkt
      p_len ← 0
    else
      p_len ← p_len + 1
    end if
    i ← i + 1
  end while
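The bucket lookup and the phase demarcation loop of Algorithm 3 can be illustrated with a short sketch. It assumes that each interval's HSIP is available as an (IPS, Wa) pair, that buckets are numbered from 1 so that 0 can stand for "no bucket yet" as in the initialization of Algorithm 3, and that every interval in the profile is visited. The function names, the clamping of boundary values, and the `max_len` parameter (defaulting to the value 20 used in Algorithm 3) are illustrative choices, not details confirmed by the original work.

```python
def find_bucket(ips, wa, ips_max, wa_max, levels=20):
    """Map an (IPS, Wa) sample to a bucket id in 1..levels*levels (400 for 20 levels)."""
    def level(value, vmax):
        idx = int(value / vmax * levels)
        return min(max(idx, 0), levels - 1)  # clamp the top edge into the last level
    return level(ips, ips_max) * levels + level(wa, wa_max) + 1


def generate_phases(hsips, ips_max, wa_max, max_len=20):
    """Return the 1-based interval indices at which a phase change is recorded."""
    changes = []
    bkt_prev = 0  # 0 is never a valid bucket id in this sketch
    p_len = 0
    for i, (ips, wa) in enumerate(hsips, start=1):
        bkt = find_bucket(ips, wa, ips_max, wa_max)
        if bkt != bkt_prev or p_len == max_len:
            changes.append(i)   # plays the role of RecordChange(i)
            bkt_prev = bkt
            p_len = 0
        else:
            p_len += 1
    return changes


# Hypothetical profile: three intervals in one bucket, then two in another.
profile = [(1.0, 10.0)] * 3 + [(2.0, 30.0)] * 2
print(generate_phases(profile, ips_max=4.0, wa_max=40.0))
```

A profile whose HSIP never changes is still split by the length-based rule, which is the case the text singles out as important for demand satisfaction.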
The first stage of our adaptation strategy is called the static reconfiguration stage. In
this stage, the phase-wise performance-power consumption characteristics (jointly referred to
as PWC) are collected and recorded into a Stats database. The performance (power)
is stored in terms of execution time (wattage). A set of configurations deemed best for the
different application phases (one per phase) are selected by the Static configuration generator
(SCG) to satisfy the overall demand. The information generated in this stage is passed through
a Metadata database onto the second stage, namely the dynamic reconfiguration stage.
The second stage of adaptation utilizes a novel lightweight Runtime manager (rm) that
selectively alters the deployed configurations from those chosen by SCG for each application
phase locally. These alterations are meant to account for any deviation between the expected
[Block diagram with the following blocks: Application inputs, Phase Generator (Phase 1 to Phase m), Interval Simulator, Allowed configuration list, Stats, Static Configuration Generator, Ordering Generator, Expectation Generator, and Metadata (optimal configurations, expected PWC, phase info, performance/power ordering).]
Figure 6.1 Adaptive architectural reconfiguration process
and actual PWC values aggregated over the past phases. Figure 6.1 shows the architectural
adaptation process. The next two sections describe the two stage adaptation strategy in detail.
6.2.2 Static reconfiguration stage
Consider an application execution profile that contains m phases. Let there be n allowed
configurations. Also, let the expected time (power) consumed for executing phase i with $C_j$ be
given by $t_{ij}$ ($w_{ij}$). The set of all $t_{ij}$ ($w_{ij}$) can be represented using a single $nm \times 1$ ($1 \times nm$)
matrix $T$ ($W$), such that the $k$th row (column) of $T$ ($W$) contains $t_{(k/m)(k\%m)}$ ($w_{(k/m)(k\%m)}$).
T and W are computable from the phase info and offline profiling discussed in Section 6.2.1.
The problem of demand satisfaction is formulated as an optimization problem involving
binary variables $s_{ij}$. These variables represent the usage of the allowed configurations for
different phases. The utilization (and non-utilization) of $C_j$ for the phase i is represented by
$s_{ij} = 1$ ($s_{ij} = 0$). All $s_{ij}$ are collectively represented using an $nm \times 1$ selection matrix $S$. The
element in the $k$th row of $S$ is given by $s_{(k/m)(k\%m)}$. The following conditions are implicit.
$$ s_{ij} \in \{0, 1\} \quad \forall\, i \in [1,m],\ j \in [1,n]; \qquad \sum_{j=1}^{n} s_{ij} = 1 \quad \forall\, i \in [1,m] \tag{6.1} $$
The last condition can also be formulated as M × S = K, where M is a mask matrix of
dimensions mn ×mn. The matrix M is formulated such that in the ith row of M , elements
(i − 1) × n + 1 to i × n are set to 1, and the rest are set to 0. K is a nm × 1 unit matrix.
Utilizing this representation combines the constraints for selecting one configuration per phase
over all the phases into a single equation.
Using these notations, the net time (Tnet) and average power consumed (Wavg) for the
execution can be calculated as
$$ T_{net} = S' \times T, \qquad W_{avg} = \frac{S' \times (T \mathbin{.\times} W')}{S' \times T} \tag{6.2} $$
where A′ represents the transpose of a matrix A.
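As a small illustration of Equation 6.2, the sketch below evaluates the net time and the time-weighted average power for one choice of configuration per phase. Because exactly one $s_{ij}$ equals 1 in each phase, the selection matrix is stored compactly as a list holding the chosen configuration index for every phase. The function name, the nested-list layout, and the numbers are assumptions made for the example only.

```python
def net_time_and_avg_power(selection, time, power):
    """Evaluate Eq. 6.2 for a one-configuration-per-phase selection.

    selection[i] is the index j of the configuration used in phase i;
    time[i][j] and power[i][j] are the expected time and average power
    of phase i when executed with configuration C_j.
    """
    t_net = sum(time[i][j] for i, j in enumerate(selection))
    energy = sum(time[i][j] * power[i][j] for i, j in enumerate(selection))
    return t_net, energy / t_net  # average power is energy divided by net time


# Hypothetical two-phase, two-configuration example.
time = [[2.0, 4.0], [1.0, 3.0]]
power = [[10.0, 5.0], [8.0, 4.0]]
t_net, w_avg = net_time_and_avg_power([0, 1], time, power)
print(t_net, w_avg)
```

The weighting by time mirrors the element-wise product of T and W in the numerator of Equation 6.2.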
Different priorities (primary and secondary) are allotted to the satisfaction of Pd and Wd as
part of Ud. This distinction enables solutions that satisfy at least the primary constraint when
satisfaction of both is impossible. The SCG targets the exact satisfaction of the primary con-
straint, while optimizing system behavior with respect to the other. The secondary constraint
is implicitly optimized, even if there is no explicit demand on it. Assignment of priorities to
constraints leads to two classes of problems, which are detailed next.
6.2.2.1 Ud with primary constraint on Wd
In this situation, the minimization of Tnet is targeted subject to a hard bound (Wbound)
on Wavg. Wavg is considered the primary constraint rather than energy consumption (Enet).
The latter can favor solutions that adopt the max configuration, which leads to higher Wavg
and reduced reliability. Note that hardware reliability is inverse exponentially proportional to
the chip temperature, which follows the trend of Wavg. It could be argued that such solutions
provide an opportunity to move the processor into a deep sleep state after quick execution.
However, it is assumed that tasks are generally present at all times in the task queue, negating
the possibility of drifting to deep sleep states. This makes our solution very amenable to data
centers running batch jobs. The problem can be formulated as the following Integer linear
program (ILP) and solved.
$$ \text{minimize } S' \times T \quad \text{subject to } \frac{S' \times (T \mathbin{.\times} W')}{S' \times T} < W_{bound}, \quad M \times S = K \tag{6.3} $$
The constraint on Wavg can be transformed as follows.
$$ S' \times (T \mathbin{.\times} W_*') < 0 \quad \text{where } W_* = W \mathbin{.-} W_{mat} \tag{6.4} $$
Here, Wmat is a 1×mn matrix with each element being Wbound and the operation ’.−’ represents
element-wise subtraction for two matrices. This transformation makes the problem a convex
binary ILP. Several methods for solving such problems exist, e.g. cutting planes [58], branch
and bound [86], branch and cut [95], and heuristics like tabu search [36], hill climbing [54],
simulated annealing [98], etc. The extensively used branch and cut method [103] is utilized due
to its advantage of combining the optimality provided by cutting planes and time efficiency
provided by branch and bound solutions. Further details on this method are not provided,
since it is not a part of our research contribution. The solution S found by solving the ILP in
turn yields the best configurations selected for the different application phases.
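To make the formulation concrete, the sketch below solves a tiny instance of the power-bounded problem by exhaustive enumeration. This is only a stand-in for the branch and cut solver mentioned above: it explores all n^m selections and is usable for toy sizes only. It applies the linearized constraint of Equation 6.4, which is equivalent to the average power bound because the net time is always positive. The function name and the data are hypothetical.

```python
from itertools import product


def solve_power_bounded(time, power, w_bound):
    """Pick one configuration per phase, minimizing net time subject to
    average power < w_bound. Returns (net time, selection) or None."""
    m, n = len(time), len(time[0])
    best = None
    for selection in product(range(n), repeat=m):
        t_net = sum(time[i][j] for i, j in enumerate(selection))
        # Linearized constraint (Eq. 6.4): sum of t_ij * (w_ij - w_bound) < 0
        lin = sum(time[i][j] * (power[i][j] - w_bound)
                  for i, j in enumerate(selection))
        if lin < 0 and (best is None or t_net < best[0]):
            best = (t_net, selection)
    return best


# Hypothetical two-phase, two-configuration instance.
time = [[2.0, 4.0], [1.0, 3.0]]
power = [[10.0, 5.0], [8.0, 4.0]]
print(solve_power_bounded(time, power, w_bound=7.0))
print(solve_power_bounded(time, power, w_bound=4.0))  # no feasible selection
```

When no selection meets the bound, the sketch returns None; how the actual SCG handles infeasible demands is not specified here.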
6.2.2.2 Ud with primary constraint on Pd
In this situation, the minimization of Wavg is targeted subject to a hard bound (Tbound) on
Tnet. The problem can be formulated as follows.
$$ \text{minimize } \frac{S' \times (T \mathbin{.\times} W')}{S' \times T} \quad \text{subject to } S' \times T \le T_{bound}, \quad M \times S = K \tag{6.5} $$
It can be immediately observed that the objective function is neither convex nor linear. Several
approaches have been proposed in literature ([14], [41], [21], etc.) to tackle such problems.
The widely utilized interior point algorithm is employed to select appropriate configurations
for different application phases.
6.2.2.3 Storage overhead of Metadata database
As mentioned earlier, information generated during static reconfiguration stage is passed
onto dynamic reconfiguration stage through the Metadata database. In this section, the storage
overhead associated with this database is quantified. Four pieces of information are stored in
the Metadata database.
1. The phase changes list (PCL). Precisely, the serial numbers of the intervals after which
phase changes are demarcated are passed on. If there are m phases, this data occupies
$m \times \lceil \log_2 m \rceil$ bits, since $\lceil \log_2 m \rceil$ bits are required to encode a single phase id. For convenience
of storage and retrieval, it is assumed that each id is actually stored as a short integer
(16 bits), limiting the value of m to $2^{16}$. In such a case, the storage required for PCL is
2×m bytes. Limits on m can be imposed by reiterating the phase demarcation algorithm
while increasing $p_{sep}$ and $w_{sep}$ in steps, until the number of resultant phases falls below
the imposed limit.
2. The statically selected best configuration list (BCL). If there are n allowed configurations,
$\lceil \log_2 n \rceil$ bits are required to uniquely represent a single configuration. Hence, a total of
$m \times \lceil \log_2 n \rceil$ bits are required. Similar to the previous case, it is assumed that each configuration
id is represented using a single byte. In such a case, the storage required for BCL is m
bytes.
3. The expected aggregated PWC (simply referred to as expected PWC from here on) after
each application phase. Timing values are aggregated over multiple application phases by
simple addition. The average power is aggregated by weighted arithmetic mean, where
times for the different application phases act as weights. Since performance and power
can be represented using floating point values (64 bits each), the net storage for expected
PWC values is 16×m bytes.
4. The relative ordering of the allowed configurations, separately for performance (Perfor-
mance order list- POL) and power (Power order list- WOL). This needs a total storage
of 2×m× n bytes.
The total storage overhead (Sover) can be calculated as
$$ S_{over} = 2m + m + 16m + 2mn \text{ bytes} = (19 + 2n) \times m \text{ bytes} \tag{6.6} $$
Since the maximum value of n is 16, Sover is bounded to 51 × m bytes. When m is further
restricted to $2^{16}$ (example), Sover is approximately 3.19 MB (51 × $2^{16}$ bytes). Note that the entire database need not be cached
at all times, since there is no temporal locality. Intelligent prefetching allows the data to flow
into the cache seamlessly. The real cache footprint for this data corresponds to the data stored
per phase, which is just 51 bytes.
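The storage figures above follow from simple arithmetic, sketched below. The per-item sizes are the ones assumed in the text (a 16-bit phase id, a one-byte configuration id, two 64-bit floating point values, and two order lists); the one-byte-per-entry size of the order lists is inferred from the 2 × m × n term, and the function names are illustrative.

```python
def metadata_bytes_per_phase(n_configs):
    """Bytes stored in the Metadata database for a single phase."""
    pcl = 2                     # phase change id stored as a 16-bit short
    bcl = 1                     # preselected configuration id, one byte
    pwc = 16                    # expected time and average power, 64 bits each
    orderings = 2 * n_configs   # POL and WOL, assumed one byte per entry
    return pcl + bcl + pwc + orderings


def metadata_bytes(m_phases, n_configs):
    """Total Metadata storage, i.e. (19 + 2n) * m bytes."""
    return m_phases * metadata_bytes_per_phase(n_configs)


total = metadata_bytes(2 ** 16, 16)
print(metadata_bytes_per_phase(16), total, total / 2 ** 20)
```

With n = 16 this gives 51 bytes per phase, matching the cache footprint quoted above.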
6.2.3 Dynamic reconfiguration stage
6.2.3.1 Working of the runtime manager
A lightweight runtime manager (rm) is developed to handle runtime PWC variations after
each application phase. The working of the runtime manager is illustrated in Algorithm 4. The
runtime manager observes the slack in the schedule with respect to the primary constraint and
selectively alters the statically chosen configuration for the next phase. Consider its invocation
after execution of phase i. In the following, the statically selected configuration for phase i+ 1
is referred to as the preselected configuration, and the alternate configuration chosen by the
runtime manager as the reselected configuration.
A Performance monitor (Power monitor) is employed to measure the absolute time (average
power) consumed by the application so far (ReadPerformanceMonitor and ReadPowerMonitor
functions in Algorithm 4). Such functionality is readily available in the hardware of
modern processors, e.g., the Intel Sandy Bridge family. These measured values together represent
the actual PWC. The expected PWC values are retrieved from the Metadata database using
ReadMetaPerf and ReadMetaPow functions. A significant difference (∼5%) between the ex-
pected and actual PWC values for the primary constraint, calculated by CalculateSlackDirn
in Algorithm 4, triggers configuration reselection process for phase i+ 1.
The preselected configuration for phase i+ 1 is read from BCL (ReadMetaConfig in Algo-
rithm 4). The relative ordering of the allowed configurations in terms of the primary constraint
satisfaction for phase i + 1 is read from OL (either POL or WOL, according to the primary
constraint). If CalculateSlackDirn registers a positive (negative) slack, a less (more) aggressive
configuration in terms of satisfying the primary constraint is reselected.
Algorithm 4 Runtime reconfiguration algorithm
  actual_perf ← ReadPerformanceMonitor()
  actual_pow ← ReadPowerMonitor()
  expected_perf ← ReadMetaPerf(i)
  expected_pow ← ReadMetaPow(i)
  preselected_config ← ReadMetaConfig(i + 1)
  slack_dirn ← CalculateSlackDirn(expected, actual)
  if slack_dirn = prev_slack_dirn then
    step_size ← step_size + 1
  else
    step_size ← 1
  end if
  prev_slack_dirn ← slack_dirn
  pos ← ReadMetaPos(OL, i + 1, preselected_config)
  if slack_dirn = positive then
    pos ← pos + step_size
  else
    pos ← pos − step_size
  end if
  AdjustPos()
  reselected_config ← FindMetaConfig(OL, i + 1, pos)
The rm also tracks the slack development trend. This information is used to tune the
aggression in reselecting alternate configurations for the future application phases. A special
variable step size is used to dictate this aggression. A step size value of k implies that the
preselected and reselected configurations are separated by k entries in the OL. The position of
the preselected configuration is read off from the OL using the ReadMetaPos function. The
new position for the reselected configuration is calculated using the slack direction and step
size, and the reselected configuration is read from OL using FindMetaConfig function. The
AdjustPos function ensures that the position of reselected configuration in the OL is valid.
At the beginning of application execution, step size is set to 1. Monotonic appearance of
slack in a single direction increases step size by 1. If the slack changes direction, step size is
reset to 1. The overall runtime of the algorithm is O(n), where n is the number of configurations
in the OL. Since n is limited to a small value (16), the rm effectively operates in constant time.
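The reselection steps of Algorithm 4 can be sketched in Python as follows. This is a minimal illustration rather than the dissertation's implementation: the class and function names are hypothetical, the OL is assumed to be a list ordered from the most aggressive configuration (index 0) to the least aggressive one, and the handling of the no-slack case (keep the preselected configuration and restart the trend tracking) is an assumption, since the algorithm listing does not spell it out.

```python
from dataclasses import dataclass

POSITIVE, NEGATIVE, NO_SLACK = 1, -1, 0


def calculate_slack_dirn(expected, actual, threshold=0.05):
    """Classify slack on the primary constraint.

    Both arguments are costs where lower is better (execution time or watts).
    The 5% default mirrors the "significant difference" quoted in the text.
    """
    deviation = (expected - actual) / expected
    if deviation > threshold:
        return POSITIVE      # better than expected
    if deviation < -threshold:
        return NEGATIVE      # worse than expected
    return NO_SLACK


@dataclass
class RuntimeManager:
    step_size: int = 1
    prev_slack_dirn: int = NO_SLACK

    def reselect(self, ordered_list, preselected_config, expected, actual):
        """Return the configuration to use for the next phase.

        ordered_list is the OL for the next phase, assumed to run from the
        most aggressive configuration (index 0) to the least aggressive one.
        """
        slack_dirn = calculate_slack_dirn(expected, actual)
        if slack_dirn == NO_SLACK:
            # Assumption: without significant slack the preselected
            # configuration is kept and the trend tracking restarts.
            self.step_size, self.prev_slack_dirn = 1, NO_SLACK
            return preselected_config

        # Grow the step while slack keeps the same direction, else reset it.
        if slack_dirn == self.prev_slack_dirn:
            self.step_size += 1
        else:
            self.step_size = 1
        self.prev_slack_dirn = slack_dirn

        pos = ordered_list.index(preselected_config)          # ReadMetaPos
        pos += self.step_size if slack_dirn == POSITIVE else -self.step_size
        pos = max(0, min(pos, len(ordered_list) - 1))         # AdjustPos
        return ordered_list[pos]                              # FindMetaConfig
```

In this sketch the list lookup is linear in the number of configurations, which matches the O(n) cost stated above; clamping the position plays the role of AdjustPos.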
6.2.3.2 Runtime management overhead
As the rm is invoked after each phase during execution, its runtime overhead needs to be
factored into the overall execution time. The worst case runtime for rm during execution is
given by
$$T_{overhead} = (m - 1) \times (t_{rm} + t_{reconfig}) \qquad (6.7)$$
where $t_{rm}$ denotes the time required to reselect a configuration, $t_{reconfig}$ denotes the time
required to perform the desired reconfiguration in hardware, and m denotes the number of
application phases. Previous research has indicated that $t_{reconfig}$ is low. If it is ensured
that $t_{rm}$ is low compared to all $t_{ij}$, a low value for $T_{overhead}$ is automatically guaranteed.
This eliminates the need to consider $T_{overhead}$ during static configuration selection, since small
deviations in PWC can be handled efficiently during runtime. It is observed that the ratio $t_{rm}/t_{ij}$
is less than 1% for all i and j, considering a minimum phase length of 1 million instructions.
Following this observation, the minimum phase length is fixed at 1 million instructions.
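The step from a small per-phase ratio to a small total overhead can be written out explicitly. The following is a sketch under stated assumptions: $t_{ij}$ is taken to be the execution time of phase $i$ under configuration $j$, $j_i$ denotes the configuration actually used for phase $i$, and $\epsilon$ and $\delta$ are hypothetical upper bounds on $t_{rm}/t_{ij}$ and $t_{reconfig}/t_{ij}$ that hold for all $i$ and $j$.

```latex
\begin{aligned}
T_{overhead} &= (m-1)\,(t_{rm} + t_{reconfig})
              = \sum_{i=1}^{m-1} \left(t_{rm} + t_{reconfig}\right) \\
             &\le (\epsilon + \delta) \sum_{i=1}^{m-1} t_{i j_i}
              \le (\epsilon + \delta) \sum_{i=1}^{m} t_{i j_i}
\end{aligned}
```

Under these assumptions the overhead is at most a fraction $\epsilon + \delta$ of the total execution time; with the observed $\epsilon = 0.01$, the reselection part alone contributes less than 1%.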
6.3 Dynamic adaptation strategies
Two factors motivate the dynamic adaptation methodology. First, it may be too cumbersome to
obtain phase-wise PWC values, so the static reconfiguration stage may not be practical
for all computing scenarios. Such characteristics may also vary with program inputs, so values
obtained through offline profiling may not hold. Second, the size of the Metadata database
becomes a concern as larger m values are allowed. To avoid these concerns, two
simple dynamic adaptation strategies that employ the runtime manager developed above
are proposed in this research. Since the expected PWC values are no longer stored
for each program phase, an alternate mechanism to assess performance and power
consumption is required. Constant TPI (time per instruction) and power
profiles are therefore expected for the entire execution. This is not ideal, because performance and
power consumption vary frequently during execution; the strategies consequently cannot
exploit information about future application phases to optimize the tradeoff.
However, this simplification keeps the dynamic adaptation strategies simple.
Two variations of the dynamic-only adaptation strategy are considered. For both,
a single POL (or WOL) is used for the entire application. This list is obtained
through offline profiling. The CPI and watts consumed with $S_{max}$ are obtained similarly. These
values are scaled according to the normalized components in the demands to produce the expected
TPI and power consumption values. Based on the normalized demands, the single configuration
that is expected to satisfy them best is chosen. This configuration can be found through offline profiling
as follows. The application is executed with each allowed configuration in turn,
and the execution time and average wattage are recorded. These values are normalized
with respect to the corresponding values obtained for the execution with $S_{max}$. Any input
demand can then be checked against this list of normalized values to choose the configuration
that best satisfies it.
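The lookup against the normalized profile can be sketched as below. The text does not define "best satisfies" precisely, so the ranking used here (meet the primary constraint first, then the secondary one, then break ties in favor of the secondary metric) is an assumption, as are the function name, the dictionary layout, and the interpretation of the performance demand as a fraction of the performance delivered by $S_{max}$.

```python
def select_configuration(profile, perf_demand, power_demand, primary="performance"):
    """Pick one configuration for the whole run from an offline profile.

    profile maps a configuration id to (norm_time, norm_power), both divided
    by the values measured with S_max. perf_demand is the required fraction of
    S_max performance; power_demand is the allowed fraction of S_max power.
    """
    def rank(item):
        _, (norm_time, norm_power) = item
        perf = 1.0 / norm_time                        # performance relative to S_max
        perf_shortfall = max(0.0, perf_demand - perf)
        power_excess = max(0.0, norm_power - power_demand)
        if primary == "performance":
            # Meet performance first, then power, then prefer lower power.
            return (perf_shortfall, power_excess, norm_power)
        # Meet power first, then performance, then prefer higher performance.
        return (power_excess, perf_shortfall, -perf)

    return min(profile.items(), key=rank)[0]
```

For example, with a hypothetical profile `{"S_max": (1.0, 1.0), "C1": (1.10, 0.80), "C2": (1.30, 0.60)}`, a demand of 90% performance and 85% power selects `C1`, since it is the only entry that meets both constraints.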
The application execution is demarcated into phases, each consisting of 1 million instructions.
A longer phase length is not used since intra-application variations in HSIP are not analyzed;
a conservative assumption is made that the HSIP changes very frequently. After each
program phase, the runtime manager records the observed time and average power values.
The static estimates of expected TPI and wattage are used to calculate the expected execution
time (and power consumption) at the end of that phase. The actual and expected
values of the primary constraint are compared, and the runtime manager chooses alternative
configurations instead of the statically chosen configuration as explained in Section 6.2.3.1.
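A minimal sketch of the constant-profile expectation is given below. The scaling rule is an assumption made for illustration: a performance demand expressed as a fraction of $S_{max}$ performance is taken to stretch the allowed TPI by its reciprocal, and a power demand expressed as a fraction of $S_{max}$ power is taken to cap the wattage proportionally. The function names and the example numbers are hypothetical.

```python
PHASE_LENGTH = 1_000_000  # instructions per phase, as fixed in the text


def expected_phase_values(tpi_smax, watts_smax, perf_demand, power_demand):
    """Constant per-phase expectations derived from the S_max profile."""
    expected_tpi = tpi_smax / perf_demand         # slower execution is tolerated
    expected_time = expected_tpi * PHASE_LENGTH   # seconds per phase
    expected_power = watts_smax * power_demand    # watts
    return expected_time, expected_power


def primary_slack(expected, actual):
    """Relative slack; positive means the phase beat its expectation."""
    return (expected - actual) / expected
```

The sign of `primary_slack` for the primary constraint is what the runtime manager would feed into the reselection step of Section 6.2.3.1.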
In our experiments, it has also been observed that the baseline POL (and WOL) order
is not valid for all the application phases. On average, the baseline performance order is
invalid for about 95% of all application phases considered (totaled over all benchmarks), while the
baseline power order is invalid for about 50% of the application phases. Our two dynamic adaptation
strategies differ in their management of the POL and WOL at runtime.
In the first strategy, referred to as dynamic non-learning (DNL) adaptation, the POL
and WOL are unaltered during execution. In the second strategy, referred to as dynamic
learning based (DL) adaptation, the POL (or WOL) is updated whenever the existing
order is found to be violated during execution. To understand the working of the learning
based dynamic adaptation strategy, consider the following example when performance is the
primary constraint. Suppose positive slack manifests after phase i, and the
configuration used for phase i was $C_k$. The runtime manager chooses a slower
configuration $C_l$ as per the POL for phase i+1 to reduce power consumption. If the time
consumed for phase i+1 is found to be less than the time consumed for phase i, the positions
of $C_k$ and $C_l$ in the POL are interchanged. Such interchanges help the runtime manager track
the varying requirements of the different application phases over time. This strategy has a
small pitfall. When a configuration that provides low performance on average ends up at the
top of the POL, it hinders the runtime manager from properly managing negative slack in
the future. Such a situation can arise if this configuration performs well for a few intermediate
phases and is thus promoted to the top of the POL. To avoid this, the ends of the POL and
WOL are periodically reverted to the corresponding configurations at the ends of the baseline
ordered lists.
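The POL update in the example above, together with the periodic reversion of the list ends, can be sketched as follows. This covers only the performance-primary case described in the text. The function names are hypothetical, the POL is assumed to be a list ordered from the fastest expected configuration to the slowest, and implementing the reversion as a swap is an assumption, since the text does not say how the displaced entries are repositioned.

```python
def update_pol(pol, config_k, config_l, time_k, time_l):
    """Swap two POL entries when observed phase times contradict their order.

    config_k ran in phase i and config_l, listed as slower, ran in phase i+1.
    Returns True when the two entries were interchanged.
    """
    idx_k, idx_l = pol.index(config_k), pol.index(config_l)
    if idx_k < idx_l and time_l < time_k:
        pol[idx_k], pol[idx_l] = pol[idx_l], pol[idx_k]
        return True
    return False


def revert_ends(pol, baseline):
    """Restore the baseline's first and last configurations to the POL ends."""
    for end in (0, -1):
        current = pol.index(baseline[end])
        # Swapping keeps every configuration in the list exactly once.
        pol[end], pol[current] = pol[current], pol[end]
```

Because all phases have the same length in the dynamic strategies, comparing raw phase times is meaningful here; `revert_ends` would be called at whatever period the designer chooses.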
6.4 Evaluation
6.4.1 Evaluation Methodology
A commonly used set of benchmarks from the SPEC 2006 suite has been utilized for evaluating
the newly developed architectural adaptation strategies. Multiple $U_d$s are synthetically
generated corresponding to the different operating modes described in Chapter 4. Since the objective
of the current research is the satisfaction of $U_d$s, the percentage
inaccuracy PI associated with the satisfaction of the performance ($p_{in}$) and power ($w_{in}$)
constraints is measured individually. For each constraint, the corresponding PI is 0 if the constraint is satisfied.
Otherwise, PI is calculated as
$$p_{in} = (P_d - P_{act}) \times 100 \qquad (6.8)$$
$$w_{in} = (W_{act} - W_d) \times 100 \qquad (6.9)$$
where $P_{act}$ and $W_{act}$ are the delivered performance and power consumption, respectively.
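Equations 6.8 and 6.9, combined with the rule that a satisfied constraint has zero inaccuracy, can be sketched as a small helper. The function name is hypothetical, and the inputs are assumed to be normalized fractions (for example, 0.9 for a 90% demand).

```python
def percentage_inaccuracy(perf_demand, perf_actual, power_demand, power_actual):
    """Return (p_in, w_in) following Equations 6.8 and 6.9.

    A constraint that is met contributes 0; otherwise the shortfall in
    performance or the excess in power is reported in percentage points.
    """
    p_in = max(0.0, (perf_demand - perf_actual) * 100)
    w_in = max(0.0, (power_actual - power_demand) * 100)
    return p_in, w_in
```

For instance, delivering 0.85 performance against a 0.90 demand while staying under the power budget gives a `p_in` of about 5 and a `w_in` of 0.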
Figure 6.2 shows our evaluation platform. The PWC corresponding to the different application
phases observed when using all the allowed configurations are first collected through
interval simulations using the Sniper simulator. These values can be obtained using hardware counters
provided on the chip. Since such adaptive hardware is not available, values obtained
through simulations are used. Synthetic variations are also generated in the schedule to model
profiling inaccuracy and runtime effects. The system behavior with and without variations is
investigated separately.
Figure 6.2 Evaluation platform (block diagram; labeled elements: Runtime Manager, Deviator with max. deviation and dirn inputs, Metadata database, expected performance, expected power, actual performance, actual power, statically selected configuration id and position in order list, previous id, new position, new id)
The Metadata database is populated according to the steps described in Section 6.2.2.3. At
the end of every application phase i except the last, the runtime manager fetches the expected
values of performance ($P_{expect}$) and power ($W_{expect}$) from the Metadata database. Similarly,
the configuration selected by the SCG for the next phase ($C_{i+1}$) and the POL (or WOL) for
phase i+1 are fetched from the Metadata database. A deviator module is responsible for generating
the actual PWC. This module takes as input the expected PWC values at the end of phases i and
i−1, and a special dirn variable. In this process, a deviation dev is applied to the execution time and
power consumed while utilizing $C_i$ during phase i. dev is randomly generated
between 0 and dev_max (set to 20% for our experiments). Larger values for dev_max are not
considered at this point. The dirn variable, which stays constant for all application phases,
can be given one of three values: 0, 1, or -1. If dirn is set to 0, no deviations are produced.
Setting the dirn variable to 1 (-1) produces negative deviation to power (performance) while
producing positive deviation to performance (power), subject to the bounds explained earlier.
For the static cum dynamic adaptation strategy, the runtime manager selectively modifies
$C_{i+1}$ in the presence of slack affecting the primary constraint. A configuration reached by moving
Step_size entries away from the index of the original $C_{i+1}$ in the corresponding ordered list (POL
or WOL) is chosen. This process repeats for all the application phases until the end of execution.
At the end of execution, $p_{in}$ and $w_{in}$ are calculated.
For the dynamic adaptation strategy without learning, the runtime manager considers a
single configuration selected statically for adjustment in all application phases. A single POL or
WOL predetermined statically is also considered. For each phase, the runtime manager adjusts
the baseline configuration similar to the static cum dynamic adaptation strategy. For the
dynamic adaptation strategy with learning, the POL and WOL are also modified additionally
based upon runtime observations as explained in Section 6.3.
6.4.2 Determination of maximum interval length
As mentioned earlier, a smaller application phase length leads to a larger number of application
phases. This allows fine-grained adaptation decisions, leading to better demand satisfaction.
At the same time, the net adaptation overhead increases. In Section 6.2.1, the maximum
interval length has been set to 10 million instructions. This choice is based on observations of
how the number of phases decreases and how the PI for demand satisfaction degrades as the maximum
phase length increases. First, the decrease in the
actual number of program phases is recorded as the maximum interval length is increased from 1 million
instructions to 25 million instructions. Figure 6.3 shows how the phase count
scales as the maximum interval length is varied. In the figure, the maximum interval lengths
investigated are shown on the x-axis. The values along the y-axis are the maximum, minimum, and
average (over different benchmarks) phase counts as a percentage of the phases demarcated when
a constant interval length of 1 million instructions is employed. It can be seen that the number of
phases drops very quickly as the maximum interval length is increased to 10 million instructions.
After this point, the decrease slows and gradually saturates. For two benchmarks, namely hmmer and specrand,
the phase count scales down almost linearly with the maximum interval length. This shows
that the performance and power profiles stay fairly constant throughout the execution for these
benchmarks.
The adaptation approach specified in Section 6.2.2 is used to select a configuration for each individual
phase demarcated in the scenarios with different maximum interval lengths. The most aggressive
demands from the balanced and stringent modes are input, and the resulting PI values are measured.
For each mode, 10,000 demands are generated and the reported PI values are averaged
over these demands. Figure 6.4 shows the observed PI values. In the figure, the x-axis denotes
the maximum allowed interval length. The y-axis shows both the average and maximum
degradation in PI when compared to the PI observed for a constant interval length of 1 million
instructions. As expected, the PI degrades as the maximum interval length is increased. The
PI degradation stays below 5% when the maximum interval length is restricted
to 10 million instructions or less. Following these trends, the maximum
interval length is set at 10 million instructions.
Figure 6.3 Phase count scaling with maximum phase length (x-axis: max. interval length in millions of instructions; axis labels: number of program phases (%), % reduction in readaptation points; series: Min, Max, Avg.)
Figure 6.4 PI degradation with maximum phase length (x-axis: max. interval length in millions of instructions; y-axis: % increase in PI; series: Max, Avg.)
Figure 6.5 Benefit of utilizing intra-application adaptation (x-axis: user demands hp, lw, bp_0.1, bp_0.2, sp_0.1, sp_0.2, sp_0.3; y-axis: PI; series: Pin_SC, Win_SC, Pin_ONE, Win_ONE)
6.4.3 Adaptation strategies
6.4.3.1 Benefit of intra-application adaptation
Static intra-application adaptation (referred to as SC) exploits the runtime variation in
hardware-software interactions. Such an adaptation strategy can lead to better tradeoff decisions
compared to an adaptation strategy that employs a single configuration for the entire
application based on the inherent algorithm behavior (referred to as ONE). As the
configuration space is pruned, the benefit of intra-application adaptation decreases since the
available performance-power points are discrete in nature. To evaluate the merit of the pruned
configuration space in enhancing the tradeoff through intra-application adaptation, the PI observed
for the different demand scenarios is measured. These values are reported in Figure 6.5. In
the figure, the x-axis denotes the different demand scenarios and the y-axis denotes $p_{in}$ and
$w_{in}$ separately. It is observed that the primary constraint is always satisfied except for the
low power demands. The considered configuration space cannot serve some of the low power
constraints, due to which a small amount of $w_{in}$ is observed. As expected, the inaccuracy
in serving user demands increases as $P_d - W_d$ increases. The PI values are highest for stringent
demands since, in this region of the operating spectrum, high performance requires high power
consumption. Intra-application adaptation saves 10% additional power when serving stringent
demands with $P_d - W_d = 0.3$. For balanced intermediate demands on performance and power,
the observed power saving due to intra-application adaptation is 5%.
6.4.4 Handling runtime variations in performance and power
Static adaptation strategy results in optimal tradeoff subject to the correctness of expected
performance and power consumption knowledge. However, such expectations typically do not
hold in a real-world scenario. Variations can arise due to runtime effects like cache contention,
temperature induced mobility degradation, operating system management, inaccuracy of per-
formance and power profiling, etc. When real execution proceeds faster than expected, the
available performance slack should be utilized to reduce inaccuracy in tracking power demands.
Similarly, situations with lower actual performance need to be handled by selecting alternate
aggressive configurations to provide the required performance. The runtime manager designed
for static cum dynamic adaptation strategy (referred to as SDC ) is meant to handle such vari-
ations. The evaluation procedure described above is used to measure the effectiveness of the
runtime manager in handling runtime variations. In Figure 6.6, the PI values observed for SC
and SDC strategies when (a) dirn=1 and (b) dirn=-1 respectively are reported.
When dirn is set to 1, the available positive performance slack is traded off to reduce
power consumption. This effect is more pronounced for stringent demands, which otherwise
result in high $w_{in}$. In particular, a 6% decrease in $w_{in}$ is observed when
$P_d - W_d = 0.3$. Due to the discrete nature of the available performance-power points, the
runtime manager overshoots this tradeoff, which results in a slight increase in $p_{in}$. However, this
effect is found to be negligible. When serving low power demands, performance is aggressively
traded off to constrain power, which is the primary constraint. This manifests as a 1.2%
increase in $p_{in}$.
When dirn is set to -1, power consumption is aggressively traded off to satisfy the performance
constraint for all except low power demands. This results in ∼5% lower $p_{in}$ when employing
dynamic configuration adjustments. An increase in $w_{in}$ is also noticed. The PI increases for
stringent and balanced demands as $P_d - W_d$ increases and is observed to be 7% when $P_d - W_d = 0.3$.
From these observations, it can be concluded that the runtime manager helps track the
primary constraint well when it is negatively affected by runtime deviations.
Figure 6.6 Effectiveness of static cum dynamic adaptation strategy
6.4.5 Comparison of SDC and dynamic adaptation strategies
The PI values observed for SDC strategy depict the tradeoff that can be achieved when
detailed PWC for different application phases are readily available. Such expectations are
possible in real-time system environments but are not entirely practical for general purpose
computing. Our dynamic adaptation strategy uses just a baseline performance and power
order and the measured TPI and power values to perform intra-application adaptation. Figure
6.7 compares the dynamic only strategies with SDC in terms of the observed PI values.
The dynamic adaptation strategies are further compared with the static cum dynamic
adaptation strategy. The dynamic adaptation strategies consider re-adaptation more frequently
(every 1M instructions). However, they do not exploit knowledge of the PWC for future
application phases and are thus suboptimal. Figure 6.7 compares the PI values obtained for
these different strategies when the dirn variable is set to (a) 0, (b) 1, and (c) -1. The x-axis in
the figure denotes the different user demand scenarios considered. The observed PI values for
the performance and power constraints are plotted separately along the y-axis.
Figure 6.7 Comparison of PI for SDC and dynamic adaptation strategies
When runtime deviations do not occur, the SDC strategy provides the best tradeoff. Both
learning based and non-learning dynamic adaptation strategies have their own merits. Learning
based dynamic adaptation slightly lowers the PI for secondary constraint while increasing the
same for the primary constraint. It is observed that suboptimal configurations, which are fast
only for a few application phases but provide mediocre overall performance can sometimes settle
at the top of POL. When such a situation arises, the corresponding configuration is repeatedly
selected for periods of negative slack leading to lazy slack reclaim. Since such a configuration
consumes lower power than the most aggressive configuration, a power saving is noted. This
effect is pronounced for balanced demands when $P_d - W_d = 0.2$. In this case, 2% performance
is sacrificed to reduce power consumption by 5%.
Since applications execute in phases, it is reasonable to expect that learning based adap-
tation strategy provides better system behavior when compared to the non-learning strategy.
However, the observed PI values do not reflect this understanding. It is noticed that the POL
and WOL change very rapidly and the learning based strategy is not provided with enough time
to adapt to this change. In particular, it is observed that the POL changes between adjacent
phases with a probability of 0.8.
When positive (performance) slack arises, both the dynamic strategies utilize it to reduce
power consumption. The only exception in this regard is observed for the learning based
dynamic strategy when serving balanced performance demands. As mentioned earlier, mediocre
performance configurations are sometimes pushed to the top of POL. Since there exists positive
slack, such configurations are not employed, and configurations in the middle of the POL are
usually utilized. Note that a few aggressive configurations are pushed to this area of the POL.
Since they are deployed, the overall power consumption increases. Performance constraints are
always satisfied as well.
When negative slack arises, all three considered adaptation strategies sacrifice power con-
sumption while trying to improve performance. The performance constraints are still not
satisfied for high performance and stringent demands. However, the PI for performance stays
low (∼ 3%). The associated increase in power consumption is negligible. For all the consid-
ered slack scenarios, the dynamic adaptation strategies perform closely to the SDC strategy.
Figure 6.8 Performance delivered for various performance demands (x-axis: performance demand (%), from 95% to 65%; y-axis: delivered performance (%); series: P_DNL, P_DL, P_SC, P_ONE)
This means that it is possible to avoid the large overhead associated with SDC strategy with-
out sacrificing the tradeoff significantly. Learning based dynamic adaptation strategy can be
used when a slight reduction in performance is acceptable to lower power consumption. Al-
ternatively, the non-learning strategy can be used when the baseline POL or WOL is easily
available.
6.4.6 Scaling of power consumption with performance
Figure 6.8 shows the delivered performance when the performance demanded from the
microprocessor is varied from 95% to 65% in steps of 5%. The different performance demands
are plotted along x-axis. The y-axis in this figure denotes the obtained performance as a
percentage of maximal value. The SC and ONE strategies always deliver at least the required
performance. In particular, the ONE strategy overcompensates for performance since the
available performance-power points are discrete and no re-adaptation is considered. SC utilizes
the knowledge of hardware software interactions for all application phases, and thus provides
almost exactly the required performance. Both the non-learning and learning based dynamic
strategies provide slightly lower performance than required. This effect is pronounced when
the required performance is high (95%). In this case, these schemes provide 2.5% and 4% lower
performance respectively.
The resulting power consumption for these performance demands is compared as well.
The observed power consumption values for the different adaptation schemes are depicted
in Figure 6.9. The y-axis denotes the power consumption as a percentage of the maximal
value. The SC strategy results in the lowest power consumption while the ONE strategy
results in the highest power consumption. The learning based and non-learning
adaptation strategies consume similar power. The former results in slightly lower
power consumption than the latter when performance demands are high, while the latter
results in slightly lower power consumption when performance demands are low. Both
dynamic adaptation strategies consume power within 5% of the power consumed by the
SC strategy. The difference is larger when performance demands are between
80% and 90%.
Figure 6.9 Power consumed for various performance demands (x-axis: performance demand (%), from 95% to 65%; y-axis: power consumption (%); series: W_DNL, W_DL, W_SC, W_ONE)
6.4.7 Comparison with previous schemes
Kontorinis et al. [61] proposed adaptation strategies based on a table-driven adaptive core
to reduce peak power. The authors adjust a set of 10 different adaptive components to obtain
good performance when the peak power is restricted to different levels. Since the configuration
space is pruned aggressively in this research, it is essential to analyze how the tradeoff is affected by the
pruning process. The performance obtained, as a fraction of the performance obtained for $S_{max}$, is compared for
both our schemes and the schemes developed in [61]. The peak power is constrained
at 70%. This bound is used since it is the lowest power bound
considered in [61], and the authors present the relevant observations for it. The configuration
space considered in [61] for this peak power constraint consists of 132 configurations (against
16 that are considered for this research). Figure 6.10 shows the normalized performance values. In
the figure, the x-axis denotes the different adaptation strategies while the y-axis shows the performance
values. The first five strategies are developed by the authors in [61] while the last three
are ours. It is observed that our SDC strategy results in the highest performance. In cases
where SDC is not practical, our dynamic adaptation strategies provide 7% less performance
when compared to the best strategies proposed in the considered previous research. We believe
this degradation is acceptable considering the reduction in hardware complexity required for
adaptation. Also, the authors of the table-driven adaptation scheme report their performance
values based upon execution characteristics of SPEC 2000 benchmarks. In contrast, this
research uses SPEC 2006 benchmarks for evaluation, which pose larger hardware requirements
for the provision of good performance. Thus it is expected that the performance gap between these
strategies will diminish even further in reality.
Figure 6.10 Performance delivered for peak power constraint of 70% (x-axis: adaptation strategy; y-axis: normalized performance)
The proposed dynamic adaptation strategies are also analyzed in terms of how well they
lead to energy efficient execution. The energy/performance efficiency provided when different
performance constraints are imposed is measured. In [31], the authors report that their dynamic
machine learning based adaptation strategy leads to 74% energy efficiency. The reported
value is normalized to an oracular ideal scheme that chooses the best configuration for each
application phase. Since the configuration space is pruned to retain only the best possible
configurations, the normalized energy efficiency is expected to go up. Figure 6.11 shows
the normalized energy efficiency for our dynamic adaptation strategies. It is observed that
both the learning and non-learning dynamic strategies result in about 95% normalized energy
efficiency when the performance demand is 95%. This shows how well our configuration space
pruning methodology results in the selection of efficient configurations. The normalized energy
efficiency decreases as the performance demanded decreases, since the optimal oracular scheme
gets to consider a larger number of configuration combinations. However, the normalized energy
efficiency never drops below 75% when the performance demand is greater than 65%.
Figure 6.11 Energy efficiency when serving different performance needs (x-axis: performance demand (%), from 95 to 65; y-axis: normalized energy efficiency; series: DNL, DL)
6.5 Conclusion
In this chapter, the details of a two-stage adaptation strategy for microarchitectural adap-
tation are first presented. The first stage gathers and utilizes comprehensive information re-
garding expected PWC for the entire application to statically determine a set of configurations
to be used for the various application phases. A lightweight runtime manager is designed to
account for the differences between expected and actual PWC as part of the second stage. Two
alternate dynamic only adaptation strategies are developed for situations where static profiling
for fine-grained gathering of application wide PWC is impractical. These strategies can easily
be adopted even in multicores by specifying core level TPI and wattage requirements.
An extensive evaluation process is employed to analyze how the newly developed adaptation
strategies cater to widely variant demands from hardware. In particular, it is noticed that
primary demands can always be served with less than 2% inaccuracy on average. The
inaccuracy in serving the secondary constraint never rises above 10% unless very low power
consumption is demanded along with ultra-high performance. The power required to serve
95% and 90% performance demands is close to 80% and 75%, respectively. The performance
provided by the dynamic adaptation strategies while constraining peak power to 70% is about
7% lower compared to schemes utilizing 10 times the configuration space. Our dynamic adaptation
strategies also lead to energy efficient execution, where the efficiency normalized to an
ideal oracular scheme is about 95% when a similar performance is demanded.
All in all, the major contribution of the research presented in this chapter lies in
making microarchitectural adaptation more tractable. The issues of when to adapt and how
to adapt are tackled. The corresponding techniques make microarchitectural
adaptivity an elegant solution for performance-power tradeoff and let the user dictate the
hardware behavior in a simple and flexible manner.
CHAPTER 7. CONCLUSIONS AND FUTURE WORK
7.1 Conclusions
In this dissertation, we address the issue of using microprocessor systems in an effective
manner to achieve a balance between performance, power, and reliability. In particular, we
address the challenges of avoiding thermal cycling at the core level to minimize the reliability
threat. We also address the challenge of providing maximum performance for a set power
constraint, or consuming minimum power for a given performance constraint.
Provision of good performance is at odds with lowering power consumption and improving
reliability. Higher performance requires the inclusion of a larger number of transistors, thereby
increasing power consumption, and possibly the chip temperature. Both the increase in chip temperature
and its runtime fluctuations further affect reliability. As these factors are interrelated, it is
not sufficient to consider the optimization of a single entity among them. In the current research
work, we provide mechanisms to co-manage performance and reliability, and performance and
power.
Design of aggressive cores for good single-threaded performance, as well as the aggregation
of a large number of such cores to suit current software needs lead to chip reliability concerns.
Factors affecting chip reliability include the chip temperature as well as its fluctuations over
time. The latter factor, which is otherwise referred to as thermal cycling, has been identified as
a major concern in previous research. However, schemes targeting reduction of thermal cycling
have not been investigated. In this research, we provide mechanisms to keep both the chip
peak temperature and the temperature fluctuations in check while adhering to set performance
constraints. A real-time task execution environment with Quality of Service (QoS) guarantees is
considered to enforce performance demands. The capabilities of DVFS and microarchitectural
adaptation are leveraged to select appropriate hardware configurations for the execution of
tasks in a set schedule.
A two stage configuration selection scheme is developed to co-manage performance and reli-
ability. The first stage statically selects the hardware configurations for individual tasks based
upon knowledge of configuration-wise timing and temperature characteristics. Two alternate
algorithms for configuration selection, namely peak reduction and window based selection, are
developed and their effectiveness is analyzed. It is found that the former algorithm is slower but
results in lower thermal gradients when compared with the latter algorithm. The second stage
of configuration selection selectively alters the statically chosen configurations for deployment
based upon runtime slack conditions. We have observed a 3- to 48-fold increase in chip lifetime
expectancy with respect to several failure mechanisms when our configuration selection schemes
are employed on a schedule with 8 tasks. The increase in lifetime expectancy with respect to
thermal cycling is about 20-fold.
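As an illustration of the two-stage idea, the following Python sketch pairs a static pass with a runtime pass. It is not the peak reduction or window based algorithm developed in this work; it is a simplified greedy stand-in. The names Config, static_select, and dynamic_adjust, the single end-of-schedule deadline, and the per-configuration exec_time and peak_temp tables are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Config:
    """One hardware configuration for a task (hypothetical model)."""
    exec_time: float  # execution time of the task under this configuration
    peak_temp: float  # peak chip temperature while the task runs


def static_select(tasks: List[List[Config]], deadline: float) -> List[int]:
    """Stage 1 (static): start from the fastest configuration of every task,
    then repeatedly move the hottest task to a cooler configuration as long
    as the whole schedule still meets the deadline."""
    choice = [min(range(len(c)), key=lambda i: c[i].exec_time) for c in tasks]

    def total_time() -> float:
        return sum(tasks[t][i].exec_time for t, i in enumerate(choice))

    if total_time() > deadline:
        raise ValueError("infeasible even with the fastest configurations")

    frozen = set()  # tasks that cannot be cooled any further
    while len(frozen) < len(tasks):
        # Hottest task that may still be improved.
        hot = max((t for t in range(len(tasks)) if t not in frozen),
                  key=lambda t: tasks[t][choice[t]].peak_temp)
        cur = tasks[hot][choice[hot]]
        slack = deadline - total_time()
        # Cooler candidates whose extra execution time fits in the slack.
        cooler = [i for i, c in enumerate(tasks[hot])
                  if c.peak_temp < cur.peak_temp
                  and c.exec_time - cur.exec_time <= slack]
        if cooler:
            # Take the cheapest cooling step so slack is left for other tasks.
            choice[hot] = min(cooler, key=lambda i: tasks[hot][i].exec_time)
        else:
            frozen.add(hot)
    return choice


def dynamic_adjust(task_cfgs: List[Config], planned: int,
                   extra_slack: float) -> int:
    """Stage 2 (runtime): given slack observed at run time, pick the coolest
    configuration that fits in the planned time plus that slack."""
    budget = task_cfgs[planned].exec_time + extra_slack
    fits = [i for i, c in enumerate(task_cfgs) if c.exec_time <= budget]
    if not fits:
        # Behind schedule: fall back to the fastest configuration.
        return min(range(len(task_cfgs)), key=lambda i: task_cfgs[i].exec_time)
    return min(fits, key=lambda i: task_cfgs[i].peak_temp)
```

In this sketch the static pass takes the smallest cooling step on the currently hottest task, and the runtime pass only spends slack that has actually been observed, so under this simple model the deadline assumed by the static pass is not put at risk.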
A major hindrance to the employment of microarchitectural adaptation is the control complexity.
The size of the possible configuration space prohibits the analysis needed to decide
the optimal configuration based on task execution characteristics. Previous research provides
certain cues on how to reduce the configuration space. However, no formal approach to
configuration space pruning exists. We designed a three-stage methodology to bring down the
configuration space to any desired size. The pruning methodology is based on application-
specific expected performance and power characteristics made available through interval sim-
ulations. Multiple mechanisms to prune the configuration space are developed and compared.
Our observations indicate that the pruned configuration space can be used to provide varied
performance and power consumption levels with up to 92% accuracy. Since only the most
useful configurations are retained after pruning, the presence of a fault can potentially degrade
the system behavior. Our analysis in this regard shows that up to 95% of this degradation can be
masked by utilizing the configuration space that remains available.
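A minimal sketch of what pruning to a desired size can look like is given below. It is not the three-stage methodology of this work; it assumes each configuration has already been reduced to an expected (CPI, power) pair, drops dominated configurations, and then thins the remaining front to the requested size. The names pareto_front and prune are hypothetical.

```python
from typing import List, Tuple

Point = Tuple[float, float]  # (cpi, watts); lower is better for both


def pareto_front(points: List[Point]) -> List[Point]:
    """Keep only configurations that no other configuration beats in both
    CPI and power."""
    front: List[Point] = []
    best_power = float("inf")
    # Sorted by CPI (then power): a point survives only if it strictly
    # lowers the best power seen so far.
    for cpi, watts in sorted(set(points)):
        if watts < best_power:
            front.append((cpi, watts))
            best_power = watts
    return front


def prune(points: List[Point], size: int) -> List[Point]:
    """Reduce the configuration space to at most `size` entries."""
    if size < 1:
        raise ValueError("size must be at least 1")
    front = pareto_front(points)
    if len(front) <= size:
        return front
    if size == 1:
        return [front[0]]
    # Evenly spaced indices along the front, keeping both end points.
    step = (len(front) - 1) / (size - 1)
    return [front[round(k * step)] for k in range(size)]
```

Keeping both end points preserves the highest-performance and the lowest-power configurations, so the pruned space still spans the full range of operating points; the even spacing by index is a simplification and ignores how far apart neighbouring points are.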
We further investigate mechanisms to provide a performance-power tradeoff with the pruned
configuration space by exploiting program phases. A comprehensive two-stage static-cum-dynamic
adaptation strategy is developed that exploits phase-wise knowledge of performance and power
characteristics. Similar to our performance-reliability co-management scheme, the static com-
ponent strives to preserve tradeoff optimality while the dynamic component deals with runtime
variations. A similar approach to adaptation is not possible without the configuration space
pruning due to the complexity of the associated tradeoff optimization problem. We also de-
velop two alternate dynamic adaptation strategies that provide the required tradeoff without
the knowledge of phase-wise operating characteristics. It is observed that single-constraint demands
(on performance or power) are served with ∼98% accuracy. For demands involving both
performance and power, the inaccuracy in tracking the demands rarely exceeds 10%. For a set
power consumption level, our adaptation strategies provide 93% of the performance provided
by a previously proposed strategy that considers a configuration space 10 times larger. Our
dynamic adaptation schemes achieve about 95% of the maximum possible energy efficiency,
compared with 75% for a related state-of-the-art scheme that adapts 14 billion configurations.
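A purely dynamic strategy that needs no phase-wise knowledge can be pictured as a feedback step over the pruned configurations. The sketch below is an illustration only, not either of the two strategies developed in this work. It assumes the pruned configurations are sorted so that a higher index means higher performance and higher power, and that CPI and power are measured once per interval; next_config and the tolerance parameter are hypothetical.

```python
from typing import Optional


def next_config(index: int, n_configs: int,
                measured_cpi: float, measured_watts: float,
                cpi_target: Optional[float] = None,
                watts_budget: Optional[float] = None,
                tolerance: float = 0.02) -> int:
    """Choose the configuration index for the next interval.

    Assumes a higher index means lower CPI and higher power.
    """
    over_power = (watts_budget is not None
                  and measured_watts > watts_budget * (1 + tolerance))
    too_slow = (cpi_target is not None
                and measured_cpi > cpi_target * (1 + tolerance))

    if over_power:
        # The power demand takes priority: step down.
        return max(index - 1, 0)
    if too_slow:
        # Performance is below the demand: step up.
        return min(index + 1, n_configs - 1)
    if cpi_target is None and watts_budget is not None:
        # Power-only demand: spend unused budget on performance.
        if measured_watts < watts_budget * (1 - tolerance):
            return min(index + 1, n_configs - 1)
        return index
    if cpi_target is not None and measured_cpi < cpi_target * (1 - tolerance):
        # Faster than demanded: step down to save power.
        return max(index - 1, 0)
    return index
```

Because the controller moves one step per interval, it can oscillate between two neighbouring configurations around a demand; the tolerance band is a simple way to damp this and would need tuning in practice.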
To summarize, our current research addresses the various aspects associated with micro-
processor performance-power and performance-reliability co-management. Various mechanisms
to obtain required operating characteristics from hardware are proposed. These mechanisms
leverage DVFS and microarchitectural adaptation. Previously unconsidered ill effects of
thermal cycling on chip reliability are included to provide holistic performance-reliability
co-management solutions. The adaptive microarchitectural configuration space is reduced as per
set requirements and lightweight schemes are developed for configuration selection. Our re-
search demystifies the complexity involved in microarchitectural adaptation and motivates the
design of such architectures.
7.2 Future work
Our research currently deals with uniprocessor performance-reliability and performance-power
co-management. In the future, this will be extended to address similar issues in multicores.
Our dynamic performance-power adaptation strategies are amenable to use in the
multicore scenario. Per-core CPI and wattage requirements can be set up from the given
performance and power budget/demand. Core level adaptation can be performed using the
proposed strategies. The demands can further dictate the number of cores to utilize for a given
application. Our preliminary analysis in this regard confirms this claim. For example, Figures
7.1 and 7.2 show the variation of normalized power with the normalized performance for FFT
and the Barnes-Hut method for solving N-body interactions. A multicore platform with 4 cores is
used for the analysis. Either 1, 2, or 4 cores can be made available for application execution.
[Figure 7.1 Normalized performance vs. normalized power for FFT. Axes: Norm. performance (x), Norm. power (y). Curves: 1 core, 2 cores, 4 cores. Regions marked: Use 1 core, Use 2 cores, Use 4 cores.]
In these figures, normalized performance is shown on the x-axis and normalized power
consumption on the y-axis. The performance and power consumption values are normalized
with respect to the values obtained with execution on 1 processor with the maximal configuration.
It is observed that the benefit of using either 1, 2, or 4 cores can be associated
with different regions of the performance spectrum. Also, these regions are application dependent.
As such, the performance demand and the application at hand can be first used to decide the
number of cores to utilize. The performance and power budget can be decomposed to core-level
budgets. Following this, the required tradeoff using the selected cores can be provided using
our dynamic adaptation strategies. Our analysis in this regard is still preliminary and further
experimentation is needed to formalize the tradeoff methodology.
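The decision sequence described above (choose the number of cores from the performance demand, then derive core-level budgets) can be sketched as follows. This is a hedged illustration, not a formalized methodology: it assumes that, for each core count, a set of measured (normalized performance, normalized power) operating points is available, and it splits the power budget evenly across cores, which is a simplifying assumption. The names choose_cores and per_core_power_budget are hypothetical.

```python
from typing import Dict, List, Optional, Tuple

# Operating points for one core count:
# (normalized performance, normalized power)
Curve = List[Tuple[float, float]]


def choose_cores(curves: Dict[int, Curve], perf_demand: float
                 ) -> Optional[Tuple[int, float, float]]:
    """Return (cores, performance, power) for the lowest-power operating
    point that meets the demand, or None if no core count can meet it."""
    best: Optional[Tuple[int, float, float]] = None
    for cores, curve in curves.items():
        for perf, power in curve:
            if perf >= perf_demand and (best is None or power < best[2]):
                best = (cores, perf, power)
    return best


def per_core_power_budget(total_power: float, cores: int) -> float:
    """Even split of the chip-level power budget (simplifying assumption)."""
    if cores < 1:
        raise ValueError("cores must be at least 1")
    return total_power / cores
```

Once a core count and per-core budgets are fixed, each core can be adapted independently against its own budget. How a chip-level performance demand maps to per-core CPI targets depends on the parallel efficiency of the application and is deliberately left out of this sketch.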
We are also interested in building a hardware prototype for the envisaged adaptive processing
platform. FPGAs provide an excellent platform for such hardware emulation. Many current
FPGAs include a general-purpose processor integrated on the same chip as the fabric. This
processor can be used to make adaptation decisions, which can then be communicated to an
adaptive processor configured on the FPGA. A thorough analysis will be performed to understand
and implement the required adaptivity in hardware.
[Figure 7.2 Normalized performance vs. normalized power for Barnes-Hut algorithm. Axes: Norm. performance (x), Norm. power (y). Curves: 1 core, 2 cores, 4 cores. Regions marked: Use 1 core, Use 2 cores, Use 4 cores.]
The schemes proposed in this research can also be extended to include other components
of computing platforms. Nest configuration, on-board graphics cards, and shared caches
can be included to provide comprehensive system-level performance, power, and reliability
management mechanisms. Such mechanisms are expected to have a higher impact on the considered operating
characteristics than mechanisms that consider individual cores alone.
131
Bibliography
[1] David H. Albonesi. Dynamic ipc/clock rate optimization. In Proceedings of the 25th
annual international symposium on Computer architecture, ISCA ’98, pages 282–292,
Washington, DC, USA, 1998. IEEE Computer Society.
[2] David H. Albonesi, Rajeev Balasubramonian, Steven G. Dropsho, Sandhya Dwarkadas,
Eby G. Friedman, Michael C. Huang, Volkan Kursun, Grigorios Magklis, Michael L. Scott,
Greg Semeraro, Pradip Bose, Alper Buyuktosunoglu, Peter W. Cook, and Stanley E.
Schuster. Dynamically tuning processor resources with adaptive processing. Computer,
36(12):49–58, December 2003.
[3] A Avizienis. Faulty-tolerant computing: An overview. Computer, 4(1):5–8, 1971.
[4] G Baccarani, MR Wordeman, and RH Dennard. Generalized scaling theory and its
application to a 1/4 micrometer mosfet design. Electron Devices, IEEE Transactions on,
31(4):452–462, 1984.
[5] Rajeev Balasubramonian, David Albonesi, Alper Buyuktosunoglu, and Sandhya
Dwarkadas. Memory hierarchy reconfiguration for energy and performance in general-
purpose processor architectures. In Proceedings of the 33rd annual ACM/IEEE interna-
tional symposium on Microarchitecture, pages 245–257. ACM, 2000.
[6] Amirali Baniasadi and Andreas Moshovos. Instruction flow-based front-end throttling
for power-aware high-performance processors. In Proceedings of the 2001 international
symposium on Low power electronics and design, ISLPED ’01, pages 16–21, New York,
NY, USA, 2001. ACM.
132
[7] Min Bao, Alexandru Andrei, Petru Eles, and Zebo Peng. Temperature-aware voltage
selection for energy optimization. In Proceedings of the conference on Design, automation
and test in Europe, pages 1083–1086. ACM, 2008.
[8] David Bernick, Bill Bruckert, Paul Del Vigna, David Garcia, Robert Jardine, Jim Klecka,
and Jim Smullen. Nonstop R© advanced architecture. In Dependable Systems and Net-
works, 2005. DSN 2005. Proceedings. International Conference on, pages 12–21. IEEE,
2005.
[9] Fred A Bower, Paul G Shealy, Sule Ozev, and Daniel J Sorin. Tolerating hard faults in mi-
croprocessor array structures. In Dependable Systems and Networks, 2004 International
Conference on, pages 51–60. IEEE, 2004.
[10] Fred A Bower, Daniel J Sorin, and Sule Ozev. A mechanism for online diagnosis of hard
faults in microprocessors. In Proceedings of the 38th annual IEEE/ACM International
Symposium on Microarchitecture, pages 197–208. IEEE Computer Society, 2005.
[11] David Brooks, Robert P Dick, Russ Joseph, and Li Shang. Power, thermal, and reliability
modeling in nanometer-scale microprocessors. Micro, IEEE, 27(3):49–62, 2007.
[12] Doug Burger and Todd M. Austin. The simplescalar tool set, version 2.0. SIGARCH
Comput. Archit. News, 25(3):13–25, June 1997.
[13] Alper Buyuktosunoglu, David Albonesi, Stanley Schuster, David Brooks, Pradip Bose,
and Peter Cook. A circuit level implementation of an adaptive issue queue for power-
aware microprocessors. In Proceedings of the 11th Great Lakes symposium on VLSI,
GLSVLSI ’01, pages 73–78, New York, NY, USA, 2001. ACM.
[14] Richard H Byrd, Mary E Hribar, and Jorge Nocedal. An interior point algorithm for
large-scale nonlinear programming. SIAM Journal on Optimization, 9(4):877–900, 1999.
[15] T.E. Carlson, W. Heirman, and L. Eeckhout. Sniper: Exploring the level of abstraction
for scalable and accurate parallel multi-core simulation. In High Performance Computing,
Networking, Storage and Analysis (SC), 2011 International Conference for, 2011.
133
[16] Anantha P Chandrakasan, Samuel Sheng, and Robert W Brodersen. Low-power cmos
digital design. IEICE Transactions on Electronics, 75(4):371–382, 1992.
[17] Thidapat Chantem, Xiaobo Sharon Hu, and Robert P Dick. Temperature-aware schedul-
ing and assignment for hard real-time applications on mpsocs. Very Large Scale Integra-
tion (VLSI) Systems, IEEE Transactions on, 19(10):1884–1897, 2011.
[18] Jian-Jia Chen and Chin-Fu Kuo. Energy-efficient scheduling for real-time systems on
dynamic voltage scaling (dvs) platforms. In RTCSA, pages 28–38, 2007.
[19] Wayne H Cheng and Bevan M Baas. Dynamic voltage and frequency scaling circuits with
two supply voltages. In Circuits and Systems, 2008. ISCAS 2008. IEEE International
Symposium on, pages 1236–1239. IEEE, 2008.
[20] Bruce Childers, Hongliang Tang, and Rami Melhem. Adapting processor supply voltage
to instruction-level parallelism. In Kool Chips 2000 Workshop, 2000.
[21] Thomas F Coleman and Yuying Li. An interior trust region approach for nonlinear
minimization subject to bounds. SIAM Journal on optimization, 6(2):418–445, 1996.
[22] Cristian Constantinescu. Trends and challenges in vlsi circuit reliability. Micro, IEEE,
23(4):14–19, 2003.
[23] Standard Performance Evaluation Corporation. Cint 2000, 2003.
[24] A.K. Coskun, R. Strong, D.M. Tullsen, and T.S. Rosing. Evaluating the impact of job
scheduling and power management on processor lifetime for chip multiprocessors. In
Proceedings of the eleventh international joint conference on Measurement and modeling
of computer systems, 2009.
[25] Ayse Kivilcim Coskun, Tajana Simunic Rosing, and Kenny C Gross. Temperature man-
agement in multiprocessor socs using online learning. In Design Automation Conference,
2008. DAC 2008. 45th ACM/IEEE, pages 890–893. IEEE, 2008.
134
[26] Ayse Kivilcim Coskun, Tajana Simunic Rosing, Keith A Whisnant, and Kenny C Gross.
Temperature-aware mpsoc scheduling for reducing hot spots and gradients. In Proceedings
of the 2008 Asia and South Pacific Design Automation Conference, pages 49–54. IEEE
Computer Society Press, 2008.
[27] G. Dhiman and T.S. Rosing. System-level power management using online learning.
Computer-Aided Design of Integrated Circuits and Systems, IEEE Transactions on, 2009.
[28] Gaurav Dhiman and Tajana Simunic Rosing. Dynamic voltage frequency scaling for
multi-tasking systems using online learning. In Proceedings of the 2007 international
symposium on Low power electronics and design, pages 207–212. ACM, 2007.
[29] A.S. Dhodapkar and J.E. Smith. Timing reconfigurable microarchitectures for power
efficiency. In Parallel and Distributed Processing Symposium, 2004. Proceedings. 18th
International, pages 133–, April 2004.
[30] Ashutosh S. Dhodapkar and James E. Smith. Managing multi-configuration hardware
via dynamic working set analysis. SIGARCH Comput. Archit. News, 30(2):233–244, May
2002.
[31] Christophe Dubach, Timothy M. Jones, Edwin V. Bonilla, and Michael F. P. O’Boyle. A
predictive model for dynamic microarchitectural adaptivity control. In Proceedings of the
2010 43rd Annual IEEE/ACM International Symposium on Microarchitecture, MICRO
’43, pages 485–496, Washington, DC, USA, 2010. IEEE Computer Society.
[32] Vincent W Freeh and David K Lowenthal. Using multiple energy gears in mpi programs
on a power-scalable cluster. In Proceedings of the tenth ACM SIGPLAN symposium on
Principles and practice of parallel programming, pages 164–173. ACM, 2005.
[33] Chris Gniady, Ali R. Butt, Y. Charlie Hu, and Yung-Hsiang Lu. Program counter-based
prediction techniques for dynamic power management. IEEE Trans. Comput., 55(6):641–
658, June 2006.
135
[34] Kinshuk Govil, Edwin Chan, and Hal Wasserman. Comparing algorithm for dynamic
speed-setting of a low-power cpu. In Proceedings of the 1st annual international conference
on Mobile computing and networking, pages 13–25. ACM, 1995.
[35] Lance Hammond, Benedict A Hubbert, Michael Siu, Manohar K Prabhu, Michael Chen,
and K Olukolun. The stanford hydra cmp. Micro, IEEE, 20(2):71–84, 2000.
[36] Said Hanafi and Arnaud Freville. An efficient tabu search approach for the 0–1 multidi-
mensional knapsack problem. European Journal of Operational Research, 106(2):659–675,
1998.
[37] Jun Lu Hao Shen and Qinru Qiu. Learning based dvfs for simultaneous energy temper-
ature and performance control. In ISQED’2012, july 2012.
[38] John L Henning. Spec cpu2000: Measuring cpu performance in the new millennium.
Computer, 33(7):28–35, 2000.
[39] John L. Henning. Spec cpu2006 benchmark descriptions. SIGARCH Comput. Archit.
News, 34(4):1–17, September 2006.
[40] Yoshihiko Hotta, Mitsuhisa Sato, Hideaki Kimura, Satoshi Matsuoka, Taisuke Boku, and
Daisuke Takahashi. Profile-based optimization of power performance by using dynamic
voltage scaling on a pc cluster. In Parallel and Distributed Processing Symposium, 2006.
IPDPS 2006. 20th International, pages 8–pp. IEEE, 2006.
[41] Christopher R Houck, Jeffery A Joines, and Michael G Kay. A genetic algorithm for
function optimization: a matlab implementation. NCSU-IE TR, 95(09), 1995.
[42] Chung-hsing Hsu and Wu-chun Feng. A power-aware run-time system for high-
performance computing. In Proceedings of the 2005 ACM/IEEE conference on Super-
computing, page 1. IEEE Computer Society, 2005.
[43] Chung-Hsing Hsu and Ulrich Kremer. Compiler-directed dynamic voltage scaling for
memory-bound applications. C.-H. Hsu and U. Kremer. Compiler-directed dynamic volt-
136
age scaling for memory-bound applications. Technical Report DCS-TR-498, Department
of Computer Science, Rutgers University, 2002.
[44] Michael C. Huang, Daniel Chaver, Luis Pinuel, Manuel Prieto, and Francisco Tirado.
Customizing the branch predictor to reduce complexity and energy consumption. IEEE
Micro, 23(5):12–25, September 2003.
[45] Christopher J Hughes, Jayanth Srinivasan, and Sarita V Adve. Saving energy with
architectural and frequency adaptations for multimedia applications. In Proceedings of
the 34th annual ACM/IEEE international symposium on Microarchitecture, pages 250–
261. IEEE Computer Society, 2001.
[46] IBM. Ibm watson, 2013.
[47] Intel. Dynamic data center power management: Trends, issues, and solutions.
[48] Intel. Enhanced intel speedstep technology for the intel pentium m processor, 2004.
[49] Intel. First the tick, now the tock: Next generation intel microarchitecture (nehalem),
2008.
[50] Tohru Ishihara and Hiroto Yasuura. Voltage scheduling problem for dynamically vari-
able voltage processors. In Low Power Electronics and Design, 1998. Proceedings. 1998
International Symposium on, pages 197–202. IEEE, 1998.
[51] A. Iyer and D. Marculescu. Power aware microarchitecture resource scaling. In Pro-
ceedings of the conference on Design, automation and test in Europe, DATE ’01, pages
190–196, Piscataway, NJ, USA, 2001. IEEE Press.
[52] jedec.org. Jedec thermal cycling test. "http://www.jedec.org/standards-documents/
results/jesd22-a104", 2009.
[53] Ravindra Jejurikar, Cristiano Pereira, and Rajesh Gupta. Leakage aware dynamic volt-
age scaling for real-time embedded systems. In Proceedings of the 41st annual Design
Automation Conference, pages 275–280. ACM, 2004.
137
[54] Alan W Johnson and Sheldon H Jacobson. A class of convergent generalized hill climbing
algorithms. Applied Mathematics and Computation, 125(2):359–373, 2002.
[55] A. B. Kahng, S. Kang, R. Kumar, and J. Sartori. Enhancing the efficiency of energy-
constrained dvfs designs. volume PP, pages 1–1, 2012.
[56] Ujval J Kapasi, William J Dally, Scott Rixner, John D Owens, and Brucek Khailany.
The imagine stream processor. In Computer Design: VLSI in Computers and Processors,
2002. Proceedings. 2002 IEEE International Conference on, pages 282–288. IEEE, 2002.
[57] Nandini Kappiah, Vincent W Freeh, and David K Lowenthal. Just in time dynamic volt-
age scaling: Exploiting inter-node slack to save energy in mpi programs. In Proceedings of
the 2005 ACM/IEEE conference on Supercomputing, page 33. IEEE Computer Society,
2005.
[58] James E Kelley, Jr. The cutting-plane method for solving convex programs. Journal of
the Society for Industrial & Applied Mathematics, 8(4):703–712, 1960.
[59] Chetana N Keltcher, Kevin J McGrath, Ardsher Ahmed, and Pat Conway. The amd
opteron processor for multiprocessor servers. Micro, IEEE, 23(2):66–76, 2003.
[60] Hideaki Kimura, Mitsuhisa Sato, Yoshihiko Hotta, Taisuke Boku, and Daisuke Taka-
hashi. Emprical study on reducing energy of parallel programs using slack reclamation
by dvfs in a power-scalable high performance cluster. In Cluster Computing, 2006 IEEE
International Conference on, pages 1–10. IEEE, 2006.
[61] Vasileios Kontorinis, Amirali Shayan, Dean M. Tullsen, and Rakesh Kumar. Reducing
peak power with a table-driven adaptive processor core. In Proceedings of the 42nd Annual
IEEE/ACM International Symposium on Microarchitecture, MICRO 42, pages 189–200,
New York, NY, USA, 2009. ACM.
[62] Travis Lanier. Exploring the design of the cortex-a15 processor. Technical report, ARM,
Tech. Rep, 2011.
138
[63] Michael A Laurenzano, Mitesh Meswani, Laura Carrington, Allan Snavely, Mustafa M
Tikir, and Stephen Poole. Reducing energy usage with memory and computation-aware
dynamic frequency scaling. In Euro-Par 2011 Parallel Processing, pages 79–90. Springer,
2011.
[64] Etienne Le Sueur and Gernot Heiser. Dynamic voltage and frequency scaling: The laws
of diminishing returns. In Proceedings of the 2010 international conference on Power
aware computing and systems, pages 1–8. USENIX Association, 2010.
[65] Benjamin C. Lee and David Brooks. Efficiency trends and limits from comprehensive
microarchitectural adaptivity. SIGARCH Comput. Archit. News, 36(1):36–47, March
2008.
[66] Young Choon Lee and A.Y. Zomaya. Minimizing energy consumption for precedence-
constrained applications using dynamic voltage scaling. In Cluster Computing and the
Grid, 2009. CCGRID ’09. 9th IEEE/ACM International Symposium on, pages 92–99,
May 2009.
[67] Man-Lap Li, Pradeep Ramachandran, Swarup Kumar Sahoo, Sarita V Adve, Vikram S
Adve, and Yuanyuan Zhou. Trace-based microarchitecture-level diagnosis of permanent
hardware faults. In Dependable Systems and Networks With FTCS and DCC, 2008. DSN
2008. IEEE International Conference on, pages 22–31. IEEE, 2008.
[68] Sheng Li, Jung Ho Ahn, Richard D Strong, Jay B Brockman, Dean M Tullsen, and
Norman P Jouppi. Mcpat: an integrated power, area, and timing modeling framework
for multicore and manycore architectures. In Microarchitecture, 2009. MICRO-42. 42nd
Annual IEEE/ACM International Symposium on, pages 469–480. IEEE, 2009.
[69] Simplescalar LLC. Simplescalar benchmarks, 2004.
[70] Ravi Mahajan and Chia-pin Chiu. Cooling a microprocessor chip. Proceedings of the
IEEE, 94(8), 2006.
139
[71] Ken Mai, Tim Paaske, Nuwan Jayasena, Ron Ho, William J Dally, and Mark Horowitz.
Smart memories: A modular reconfigurable architecture, volume 28. ACM, 2000.
[72] Ali Manzak and C Chakrabarti. Variable voltage task scheduling for minimizing energy
or minimizing power. In Acoustics, Speech, and Signal Processing, 2000. ICASSP’00.
Proceedings. 2000 IEEE International Conference on, volume 6, pages 3239–3242. IEEE,
2000.
[73] Diana Marculescu. On the use of microarchitecture-driven dynamic voltage scaling. In
Workshop on Complexity-Effective Design, volume 42. Citeseer, 2000.
[74] Ke Meng, Russ Joseph, Robert P. Dick, and Li Shang. Multi-optimization power man-
agement for chip multiprocessors. In Proceedings of the 17th international conference on
Parallel architectures and compilation techniques, PACT ’08, pages 177–186, New York,
NY, USA, 2008. ACM.
[75] Daniel Mosse, Hakan Aydin, Bruce Childers, and Rami Melhem. Compiler-assisted dy-
namic power-aware scheduling for real-time applications. In In Workshop on Compilers
and Operating Systems for Low Power. Citeseer, 2000.
[76] Masakatsu Nakai, Satoshi Akui, Katsunori Seno, Tetsumasa Meguro, Takahiro Seki, Tet-
suo Kondo, Akihiko Hashiguchi, Hirokazu Kawahara, Kazuo Kumano, and Masayuki
Shimura. Dynamic voltage and frequency management for a low-power embedded micro-
processor. Solid-State Circuits, IEEE Journal of, 40(1):28–35, 2005.
[77] Kevin J Nowka, Gary D Carpenter, Eric W MacDonald, Hung C Ngo, Bishop C Brock,
Koji I Ishii, Tuyet Y Nguyen, and Jeffrey L Burns. A 32-bit powerpc system-on-a-chip
with support for dynamic voltage scaling and dynamic frequency scaling. Solid-State
Circuits, IEEE Journal of, 37(11):1441–1447, 2002.
[78] US Department of Defense. Aci broad agency announcement w911nf-12-r-0010, 2012.
[79] Anurag Patel and Kamlesh Prakash. Fault tolerant features of modern processors a case
study. Technical report, Technical report. University of Wisconsin-Madison, 2010.
140
[80] Padmanabhan Pillai and Kang G Shin. Real-time dynamic voltage scaling for low-power
embedded operating systems. In ACM SIGOPS Operating Systems Review, volume 35,
pages 89–102. ACM, 2001.
[81] Johan Pouwelse, Koen Langendoen, and Henk Sips. Dynamic voltage scaling on a low-
power microprocessor. In Proceedings of the 7th annual international conference on Mobile
computing and networking, pages 251–259. ACM, 2001.
[82] Moinuddin K Qureshi, David Thompson, and Yale N Patt. The v-way cache: demand-
based associativity via global replacement. In Computer Architecture, 2005. ISCA’05.
Proceedings. 32nd International Symposium on, pages 544–555. IEEE, 2005.
[83] Kevin Reick, Pia N Sanda, Scott Swaney, Jeffrey W Kellington, Michael J Mack,
Michael S Floyd, and Daniel Henderson. Fault-tolerant design of the ibm power6 mi-
croprocessor. Micro, IEEE, 28(2):30–38, 2008.
[84] Jeff Reilly. Spec discusses the history and reasoning behind spec 95. SPEC Newsletter,
7(3):1–3, 1995.
[85] Nikzad Babaii Rizvandi, Javid Taheri, and Albert Y Zomaya. Some observations on
optimal frequency selection in dvfs-based energy consumption minimization. Journal of
Parallel and Distributed Computing, 71(8):1154–1164, 2011.
[86] G Terry Ross and Richard M Soland. A branch and bound algorithm for the generalized
assignment problem. Mathematical programming, 8(1):91–103, 1975.
[87] Shahrzad Salemi, Liyu Yang, Jun Dai, Jin Qin, and Joseph B. Bernstein. Physics-of-
Failure Based Handbook of Microelectronic Systems. A K Peters, UTICA, NY, 2008.
[88] Roger Schmidt. Challenges in electronic coolingopportunities for enhanced thermal man-
agement techniquesmicroprocessor liquid cooled minichannel heat sink. Heat Transfer
Engineering, 25(3):3–12, 2004.
[89] Manish Shah, J Barren, Jeff Brooks, Robert Golla, Gregory Grohoski, Nils Gura, Rick
Hetherington, Paul Jordan, Mark Luttrell, Christopher Olson, et al. Ultrasparc t2:
141
A highly-treaded, power-efficient, sparc soc. In Solid-State Circuits Conference, 2007.
ASSCC’07. IEEE Asian, pages 22–25. IEEE, 2007.
[90] Dongkun Shin and Jihong Kim. A profile-based energy-efficient intra-task voltage schedul-
ing algorithm for real-time applications. In Proceedings of the 2001 international sympo-
sium on Low power electronics and design, pages 271–274. ACM, 2001.
[91] Tajana Simunic, Kresimir Mihic, and Giovanni Micheli. Optimization of reliability and
power consumption in systems on a chip. In Vassilis Paliouras, Johan Vounckx, and
Diederik Verkest, editors, Integrated Circuit and System Design. Power and Timing Mod-
eling, Optimization and Simulation, volume 3728 of Lecture Notes in Computer Science,
pages 237–246. Springer Berlin Heidelberg, 2005.
[92] J. Srinivasan. Lifetime reliability aware microprocessors. 2006.
[93] Jayanth Srinivasan, Sarita V. Adve, Pradip Bose, and Jude A. Rivers. Lifetime Reliability:
Toward an Architectural Solution. IEEE Micro, May/Jun 2005.
[94] Phillip Stanley-Marbell, Michael S Hsiao, and Ulrich Kremer. A hardware architecture
for dynamic performance and energy adaptation. In Power-Aware Computer Systems,
pages 33–52. Springer, 2003.
[95] Robert A Stubbs and Sanjay Mehrotra. A branch-and-cut method for 0-1 mixed convex
programming. Mathematical Programming, 86(3):515–532, 1999.
[96] V. Subramanian, P.K. Ramesh, and A.K. Somani. Managing the impact of on-chip
temperature on the lifetime reliability of reliably overclocked systems. In Dependability,
2009. DEPEND ’09. Second International Conference on, 2009.
[97] top500.org. Efficiency, power, cores, ..., 2014.
[98] Peter JM Van Laarhoven and Emile HL Aarts. Simulated annealing. Springer, 1987.
[99] V. Vasudevan and Xuejun Fan. An acceleration model for lead-free (sac) solder joint relia-
bility under thermal cycling. In 58th Electronic Components and Technology Conference,
may 2008.
142
[100] Alexander V. Veidenbaum, Weiyu Tang, Rajesh Gupta, Alexandru Nicolau, and Xiaomei
Ji. Adapting cache line size to application behavior. In Proceedings of the 13th inter-
national conference on Supercomputing, ICS ’99, pages 145–154, New York, NY, USA,
1999. ACM.
[101] Ram Viswanath, Vijay Wakharkar, Abhay Watwe, Vassou Lebonheur, et al. Thermal
performance challenges from silicon to systems. 2000.
[102] Mark Weiser, Brent Welch, Alan Demers, and Scott Shenker. Scheduling for reduced cpu
energy. In Mobile Computing, pages 449–471. Springer, 1996.
[103] Wikipedia. Integer programming.
[104] Wikipedia. Transistor count, 2014.
[105] Wikipedia. Transmeta crusoe, 2014.
[106] Yuan Xie and Wei-Lun Hung. Temperature-aware task allocation and scheduling for em-
bedded multiprocessor systems-on-chip (mpsoc) design. Journal of VLSI signal processing
systems for signal, image and video technology, 45(3):177–189, 2006.
[107] Yifan Zhu and Frank Mueller. Feedback edf scheduling exploiting dynamic voltage scaling.
In Real-Time and Embedded Technology and Applications Symposium, 2004. Proceedings.
RTAS 2004. 10th IEEE, pages 84–93. IEEE, 2004.
[108] Severin Zimmermann, Ingmar Meijer, Manish K Tiwari, Stephan Paredes, Bruno Michel,
and Dimos Poulikakos. Aquasar: A hot water cooled data center with direct energy reuse.
Energy, 43(1):237–245, 2012.
