Improving the Energy Efficiency of Microprocessor Cores Through Accurate Resource Utilisation Prediction by Court, Craig A.
Imperial College London
Department of Computing
Improving the Energy Efficiency of
Microprocessor Cores Through Accurate
Resource Utilisation Prediction
Craig A. Court
Submitted in part fulfilment of the requirements for the degree of
Doctor of Philosophy in Computing of Imperial College London and
the Diploma of Imperial College, June 2012

Abstract
CMOS technology scaling improves the speed and functionality of microprocessors by reducing
the size of transistors. However, static power dissipation also increases as a result of scaling and
has been identified as a limiting factor in technology scaling. As current technology approaches
that limit, techniques are required both at the technology level and in the architecture design
to reduce subthreshold leakage, which accounts for the majority of static power dissipation.
This thesis presents an approach to predict the idle periods of execution units at runtime and
power-gate them during these periods to eliminate their static power leakage. We exploit
similar execution characteristics across loop iterations to build a prediction of the units required
to execute an entire loop from the units used over the first few iterations. The utilisation of
each execution unit is monitored for each iteration, and thresholds are used to determine which
units should be power-gated for the remainder of the loop. Three techniques are presented:
Loop-Directed Mothballing (LDM), Extended Loop-Directed Mothballing (ELDM) and
schedule balancing. LDM power-gates execution units only during innermost loops, which are simple
to detect at runtime. ELDM extends this method to all loops using loop entry and exit
information gathered offline. The balancing scheduler is developed to balance the types of instruction
issued each cycle, to encourage reuse of execution units and make unnecessary units easier to
detect.
Extensive simulation using traces of 16 benchmarks from the SPEC CPU2006 suite
demonstrates that LDM reduces the energy-delay product of our simulated superscalar processor by
10.3%. For traces with a low proportion of executed instructions inside innermost loops, ELDM
improves the energy-delay product by up to 13% by allowing the technique to be applied to
other loops in the trace. Employing schedule balancing with ELDM achieves similar savings,
and simplifies the hardware required to make predictions.
Acknowledgements
I would like to thank Professor Paul Kelly for taking me on as a student quite late into the
PhD studies. His advice and guidance have been invaluable to me.
I would also like to thank my family for their encouragement and support throughout my
education; I most definitely would not be where I am today without them.
I would like to thank the friends I have made at Clayponds Village, the postgraduate hall where
I have lived during my studies, for many thoroughly enjoyable years. I would especially like to
thank Kostas Glaros, with whom I have spent many hours discussing ideas and problems. I will
particularly miss our morning cup of tea in the Junior Common Room.
Dedication
This work is dedicated to the creators of LEO, the Lyons Electronic Office, who were pioneers
of business computing. Their story inspired me to go into research.
‘Never trust a man, who when left alone with a tea cosy... Doesn’t try it on.’
Billy Connolly
Contents
Abstract
Acknowledgements
1 Introduction
  1.1 Motivation
  1.2 Contributions
  1.3 Thesis structure
  1.4 Statement of originality
  1.5 Publications
2 Background theory
  2.1 Power dissipation in CMOS logic
    2.1.1 Dynamic power
    2.1.2 Static power
    2.1.3 Technology scaling
  2.2 Reducing CMOS power consumption
    2.2.1 Reducing dynamic power
    2.2.2 Reducing static power
    2.2.3 Power management in existing designs
  2.3 Energy-delay product
  2.4 Energy savings and costs
  2.5 Summary
3 Related Work
  3.1 Coarse-grain power management
  3.2 Fine-grain power management
  3.3 Power-managing the superscalar pipeline
    3.3.1 Power-managing the issue queue and register file
    3.3.2 Power-managing caches
    3.3.3 Power-managing execution units
    3.3.4 Compiler and profiling vs. hardware
  3.4 SMT and efficient processing
  3.5 Summary
4 Problem Analysis
  4.1 Fundamental power-gating goals
  4.2 Analysis of application domains
  4.3 Analysis of execution features
    4.3.1 Frequently executed basic blocks
    4.3.2 Basic block sequences and branch misprediction
    4.3.3 Out-of-order execution
  4.4 The Loop-Directed Mothballing approach
    4.4.1 Prediction refinement
    4.4.2 Changes in control flow
    4.4.3 Conditions for application
  4.5 Summary
5 Loop-Directed Mothballing
  5.1 Introduction
  5.2 LDM overview
    5.2.1 False loop exits
  5.3 Finding the set of required execution units
    5.3.1 Exhaustive approach
    5.3.2 Threshold approach
  5.4 Hardware components
  5.5 Example loop from perlbench
    5.5.1 Description of the loop
    5.5.2 Utilisation and predictions
  5.6 Summary
6 Extended Loop-Directed Mothballing
  6.1 Introduction
  6.2 ELDM Overview
    6.2.1 Recording resource utilisation in loops
  6.3 Offline trace analysis
  6.4 Hardware
  6.5 Summary
7 Schedule balancing
  7.1 Introduction
  7.2 Balancing scheduler overview
  7.3 Processing the issue queue
  7.4 Summary
8 Simulation and power estimation
  8.1 Simulator accuracy
  8.2 Simulation and power estimation toolchain
  8.3 Simulation of LDM, ELDM and schedule balancing
  8.4 Power estimation
  8.5 Hardware overheads
  8.6 Benchmarks
  8.7 Summary
9 Results and discussion
  9.1 Optimal threshold selection
  9.2 Comparison of LDM, ELDM and schedule balancing
    9.2.1 Benchmark runtime statistics
  9.3 Representative 10 million instruction traces
    9.3.1 Applying schedule balancing offline
    9.3.2 Oracle results
    9.3.3 Operating frequency
  9.4 Discussion of related work
  9.5 Summary
10 Conclusion
  10.1 Summary of thesis achievements
  10.2 Assumptions
  10.3 Applications
  10.4 Future work
    10.4.1 Increasing power savings
    10.4.2 Improving performance
A SimpleScalar/Alpha command line arguments
B McPAT preprepared XML file
Bibliography
List of Tables
2.1 Low power C-states for Intel’s 2nd Generation Core family of processors.
3.1 Area of selected architecture components.
3.2 Summary of related work.
5.1 Example Loop-Directed Mothballing resource prediction cache.
5.2 Assembly code of an example loop from the perlbench benchmark.
5.3 Iteration statistics from execution of the perlbench benchmark.
7.1 Execution unit costs for the scheduler modification.
8.1 DEC Alpha 21264 specification used for simulation and power estimation.
8.2 Proportions of each benchmark that are similar to the chosen sample.
9.1 Loop statistics for the 1 million instruction traces of the 16 benchmarks.
9.2 Execution statistics for the first 8 benchmark traces while using ELDM.
9.3 Execution statistics for the second 8 benchmark traces while using ELDM.
9.4 LDM results when executing representative 10 million instruction traces of the benchmarks.
9.5 ELDM results when executing representative 10 million instruction traces of the benchmarks.
9.6 Balancing scheduler results when executing representative 10 million instruction traces of the benchmarks.
A.1 SimpleScalar/Alpha command line arguments.
List of Figures
2.1 CMOS inverter showing leakage current.
2.2 Estimated breakdown of power dissipated by a processor based on the DEC Alpha 21264.
3.1 Diagram of a superscalar architecture based on the DEC Alpha 21264.
3.2 Component breakdown of power consumption in the superscalar architecture.
3.3 Direct-mapped cache structure.
3.4 Utilisation of different execution units during a trace of sjeng.
3.5 Utilisation of different execution units during a trace of cactusADM.
5.1 Three schedules showing possible sets of instructions issued on each cycle of a loop and the set of units required to execute the loop.
5.2 Coverage of executed instructions that are from innermost loops.
5.3 Example control flow graph, showing conditional branches inside an innermost loop.
5.4 The superscalar architecture (based on the DEC Alpha 21264) with additional components required for LDM.
6.1 Proportion of executed instructions where power-gating can be applied when using LDM and ELDM.
6.2 An example of merging basic blocks to hide irrelevant control flow information.
6.3 Control flow examples from the benchmark traces.
6.4 The superscalar architecture (based on the DEC Alpha 21264) with additional components required for ELDM.
7.1 Example issue queue containing 8 issuable instructions.
7.2 Example issue queue showing issue costs associated with each instruction.
7.3 Effect of modified scheduler on execution unit utilisation during a trace of mcf.
8.1 Simulation and power estimation toolchain.
9.1 EDP savings when using LDM with different combinations of thresholds.
9.2 EDP savings when using ELDM with different combinations of thresholds.
9.3 EDP savings when using ELDM and schedule balancing with different combinations of thresholds.
9.4 Normalised execution time for the 16 benchmark traces.
9.5 Normalised power consumption for the 16 benchmark traces.
9.6 Normalised EDP for the 16 benchmark traces.
9.7 Scatter plot of EDP and average number of iterations per loop visit for the 16 benchmark traces.
9.8 Scatter plot of EDP and average number of cycles per loop visit for the 16 benchmark traces.
9.9 Power-gating statistics for the perlbench trace when using ELDM and an oracle implementation.
9.10 Power-gating statistics for the sjeng trace when using ELDM and an oracle implementation.
9.11 Power-gating statistics for the gromacs trace when using ELDM and an oracle implementation.
9.12 Power-gating statistics for the lbm trace when using ELDM and an oracle implementation.
9.13 EDP of the processor using ELDM and the oracle implementation.
9.14 Power consumption at different operating frequencies.
9.15 Normalised power savings at different operating frequencies when using ELDM.
9.16 Normalised execution time for the 16 benchmark traces when using Shift and IPC methods.
9.17 Normalised power consumption for the 16 benchmark traces when using Shift and IPC methods.
9.18 Normalised EDP for the 16 benchmark traces when using Shift and IPC methods.
List of Algorithms
1 EveryCycle, structures and signals (LDM)
2 ApplyPrediction (LDM)
3 EveryCycle, structures and signals (ELDM)
4 UpdatePrediction (ELDM)
Chapter 1
Introduction
This thesis addresses the problem of increasing static power dissipation in microprocessors as
the underlying CMOS technology is scaled to smaller feature sizes. Execution units are among
the most power hungry devices on the chip and can be power-gated during idle periods to
eliminate the subthreshold leakage current through the unit, which is the major component
of static power. Our approach exploits similar execution profiles across loop iterations to
accurately predict these idle periods. Power-gating execution units using these predictions
permits consistent power savings across 16 benchmarks from the SPEC CPU2006 suite, while
causing very low performance loss.
1.1 Motivation
The power dissipated by CMOS integrated circuits has attracted much attention in recent
years and could limit further advancement of the technology [TPB98]. CMOS technology
scaling attempts to keep up with market demands for faster and more feature rich processors,
by providing higher operating frequencies and more available transistors to computer architects.
However, this leads to increased energy requirements as more computation is performed per
second and the static power of the processor increases with each technology generation. The
higher energy requirements have a detrimental impact on many processing environments:
• Mobile devices are expected to perform more and more computationally expensive tasks,
such as media decoding, but the increase in power consumption lowers the battery life.
• High performance machines contain many cores per chip and the increasing density of
the transistors makes cooling more difficult as there is less surface area through which to
transmit the heat generated.
• High volumes of servers in company data centres can consume up to 200 megawatts
(enough to power 200,000 homes) [eco08]. With server requirements continually increasing,
this can rapidly become a large financial burden.
Emerging technologies such as the FinFET [HLK+00] leak less static power (which is now
the largest contributor to processor power consumption) than standard CMOS, and allow the
technology feature size to be scaled as low as 17nm, but continued scaling and higher operating
frequencies will still drive this static power consumption higher. Therefore, to control the power
consumption of future processors, it is necessary to mitigate static power consumption through
architectural techniques.
Power-gating is one such technique: it allows processor resources to be disconnected from the
power supply when they are not required. It has been successfully applied to various processor devices,
including Texas Instruments’ ARM-based processors and Intel’s 2nd Generation Core processors
[Tex11, Int11]. In most implementations, power-gating is applied at a coarse granularity
to entire cores, although there is a large amount of research into power-gating at a smaller
granularity to save power in active cores by switching off individual units inside the core that
are not required (see Chapter 2).
Deciding when to power-gate resources is the key problem for any power-gating technique, as
powering resources back on has a cost in terms of both time and energy, which could cancel out
some energy savings. The two main strategies are to use the compiler or hardware to guide the
decision. Compile time decisions have the benefit of foresight, as access to the application code
allows the amount of time that a resource should be disabled to be estimated. However, the
runtime information available to hardware techniques provides a more accurate picture of the
resource requirements of an application and so hardware techniques often outperform compiler
ones.
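The trade-off between leakage savings and the cost of powering a resource back on can be illustrated with a minimal sketch. It assumes a simple model in which the energy saved is the leakage power multiplied by the idle time, less a fixed wake-up energy. The function names and the numeric values are hypothetical and are not taken from our evaluation.

```python
def net_energy_saved(leakage_power_w, idle_time_s, wakeup_energy_j):
    """Energy saved by gating a unit for one idle period, minus the wake-up cost."""
    return leakage_power_w * idle_time_s - wakeup_energy_j


def break_even_time(leakage_power_w, wakeup_energy_j):
    """Shortest idle period for which power-gating does not lose energy."""
    return wakeup_energy_j / leakage_power_w


# Hypothetical values, chosen only for illustration.
LEAKAGE_W = 0.5     # static power of one execution unit (assumed)
WAKEUP_J = 2.0e-8   # energy to switch the unit off and back on (assumed)

t_be = break_even_time(LEAKAGE_W, WAKEUP_J)
print("break-even idle time (s):", t_be)
for idle in (1.0e-8, 4.0e-8, 1.0e-6):
    # Negative values mean that gating cost more energy than it saved.
    print(idle, net_energy_saved(LEAKAGE_W, idle, WAKEUP_J))
```

Idle periods shorter than the break-even time lose energy, which is why the accuracy of the idle-period prediction matters as much as the gating mechanism itself.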
A disadvantage of hardware techniques is the limited view of what instructions are to come,
which means that power-gating decisions are often made based on the instructions that have just
been executed, with the assumption that the same resources will be required to execute future
instructions. For many cases this assumption may hold, but when it does not, insufficient or
inappropriate resources could have a large impact on application performance. More accurate
prediction of resource requirements would reduce the amount of performance degradation due
to inappropriate power-gating decisions and could allow greater energy savings.
1.2 Contributions
Our approach increases the accuracy of resource requirement predictions by exploiting the fact
that different iterations of the same loop will have similar resource requirements. By monitoring
resource usage over a few iterations, a power-gating decision will be made which is likely to be
appropriate for the remaining iterations of the loop. Loop-Directed Mothballing (LDM) uses
this to power-gate the execution units of a processor during innermost loops, which are simple
to detect at runtime. In an attempt to increase savings, loop information gathered offline is
passed to the processor so that the method can be applied to all loops in Extended
Loop-Directed Mothballing (ELDM). Finally, a scheduler modification is provided to reveal more
power-gating opportunities for ELDM by discouraging the use of some execution units. The
main contributions of this research are as follows:
• Analysis of the problems surrounding execution unit power-gating, and the rationale
behind our loop based approach to solve these problems (Chapter 4).
• A method for LDM, including the tasks that must be carried out and hardware required
to implement the technique (Chapter 5).
• A method for ELDM, describing the shortcomings of LDM and a means to produce offline
loop information so that execution units can be power-gated during all loops (Chapter
6).
• A method for schedule balancing, which encourages execution of different types of
instruction where possible, to reduce the utilisation of some execution units so that power-gating
candidates are easier to identify (Chapter 7).
• Simulation of 16 benchmark traces from the SPEC CPU2006 suite that demonstrates
power savings of up to 20% for LDM and ELDM with low performance loss. The
simulation results also show how ELDM can achieve up to 13% more power savings than LDM
by applying the technique to non-innermost loops (Chapter 9).
• An exploration of the two key thresholds for LDM and ELDM that demonstrates the
effectiveness of schedule balancing, which as a result permits less complex resource monitoring
logic (Chapter 9).
• A comparison of ELDM to two existing interval-based hardware techniques,
demonstrating the importance of prioritising low performance loss when attempting to save energy
(Chapter 9).
• Analysis of the potential power savings that could be achieved by power-gating execution
units at the optimal points without performance loss. These are then compared to the
savings achieved with our approach to show that we achieve up to 77% of this optimal
value with ELDM (Chapter 9).
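The loop-based prediction outlined at the start of this section can be sketched in a few lines. This is only an illustration under stated assumptions: it uses a single utilisation threshold, hypothetical unit names and made-up cycle counts, and it ignores the constraint that at least one unit of each required type must stay powered. The actual method and its thresholds are described in Chapter 5.

```python
from typing import Dict, List


def predict_required_units(
    per_iteration_usage: List[Dict[str, int]],
    cycles_per_iteration: List[int],
    threshold: float,
) -> Dict[str, bool]:
    """Map each unit to True if it should stay powered for the rest of the loop.

    per_iteration_usage[i][unit] is the number of cycles in monitored
    iteration i during which `unit` was busy (hypothetical bookkeeping).
    """
    total_cycles = sum(cycles_per_iteration)
    units = set()
    for usage in per_iteration_usage:
        units.update(usage)
    keep = {}
    for unit in sorted(units):
        busy = sum(usage.get(unit, 0) for usage in per_iteration_usage)
        utilisation = busy / total_cycles if total_cycles else 0.0
        # Units used less often than the threshold become power-gating candidates.
        keep[unit] = utilisation >= threshold
    return keep


# Three monitored iterations of 10 cycles each (illustrative numbers).
monitored = [
    {"int_alu0": 9, "int_alu1": 1, "fp_mul": 0},
    {"int_alu0": 8, "int_alu1": 0, "fp_mul": 0},
    {"int_alu0": 9, "int_alu1": 1, "fp_mul": 0},
]
decision = predict_required_units(monitored, [10, 10, 10], threshold=0.1)
print(decision)
```

In this example the rarely used second integer unit and the unused floating-point multiplier fall below the threshold, so only the first integer unit would remain powered for the remaining iterations.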
1.3 Thesis structure
The next chapter describes CMOS and the factors relating to power consumption. The energy-
delay product is discussed as a figure of merit and then equations are given that describe
the architecture-level factors that affect energy savings and costs. This chapter is included to
provide some background knowledge for the reader.
Chapter 3 describes the related work. Coarse-grain approaches to save power at the device-level
are more mature and are discussed first. These methods are compared to fine-grain techniques
and then existing research to power-gate different components inside the processor is discussed.
Chapter 4 performs an analysis of the difficulties of power-gating execution units, and the
features of instruction stream execution that may assist a hardware technique. A list of four
goals and one requirement for an effective power-gating technique is produced, and our approach
is presented showing how we address these goals and requirements.
Chapters 5, 6 and 7 describe the implementation of LDM, ELDM and schedule balancing,
including the hardware that would need to be added to the processor architecture.
Our simulation methodology, including the choice of simulation and power estimation tools, is
described in Chapter 8. This is used in our evaluation of LDM, ELDM and schedule balancing
in Chapter 9. We draw conclusions and discuss our assumptions in Chapter 10.
1.4 Statement of originality
The work contained within this thesis is my own, and the work of any others that has been
used to support this thesis is appropriately referenced.
1.5 Publications
C.A. Court and P.H.J. Kelly. Loop-directed mothballing: Power-gating execution units using
fast analysis of inner loops. In Cool Chips XIV, pages 13, IEEE, 2011.
C.A. Court and P.H.J. Kelly. Loop-directed mothballing: Power-gating execution units using
runtime analysis of loops. In Micro, PP, IEEE, 2011.
Chapter 2
Background theory
CMOS logic is the fundamental building block from which nearly all microprocessors are built.
The power consumption of CMOS logic, and therefore microprocessors, depends on many
low-level technology parameters. This chapter provides an overview of the factors contributing to
power consumption and how the balance between static power and dynamic power dissipation is
changing with technology scaling. The key parameters that can be varied to reduce power, and
the consequences of varying them are also discussed. The key source for most of this section
is [WH05] (pages 186–196), which the interested reader is referred to for a comprehensive
explanation of the issues.
Measuring the effectiveness of a power saving approach can be difficult, as secondary effects
such as performance degradation may occur. We discuss the use of the energy-delay product
(EDP) in comparing power-gating techniques, and also show the factors that contribute to
the overall energy savings and overheads. The McPAT tool will be used to estimate the power
consumption of the processor in our evaluation, and will be discussed in Chapter 8.
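To make the role of performance degradation concrete, the following minimal sketch computes the energy-delay product as energy multiplied by execution time, with energy taken as average power multiplied by execution time. The baseline power, the 15% power saving and the 2% slowdown are hypothetical values chosen for illustration, not results from this thesis.

```python
def energy_delay_product(avg_power_w, exec_time_s):
    """EDP = energy * delay, where energy = average power * execution time."""
    energy_j = avg_power_w * exec_time_s
    return energy_j * exec_time_s


# Hypothetical baseline: 20 W average power over 1 s of execution.
baseline_edp = energy_delay_product(20.0, 1.0)

# Hypothetical power-gated run: 15% lower power, 2% longer execution time.
gated_edp = energy_delay_product(20.0 * 0.85, 1.0 * 1.02)

edp_ratio = gated_edp / baseline_edp
print("normalised EDP:", edp_ratio)
```

Because the delay enters the product twice (once through energy and once directly), a slowdown is penalised more heavily than an equal percentage increase in power, so the normalised EDP here is higher than the 0.85 power ratio alone would suggest.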
2.1 Power dissipation in CMOS logic
A competitive advantage of CMOS over other transistor technologies is low power consumption.
Transistors are arranged in a complementary manner (CMOS stands for Complementary Metal
Oxide Semiconductor) as shown in the example inverter in Figure 2.1, such that when one
transistor is on the other will be off. The output of the inverter will achieve the appropriate
voltage, but no current flows directly from source to ground. A small current may leak through
the off transistor, but in technology generations with feature sizes greater than around 100nm,
this leakage is negligible.
Figure 2.1: CMOS inverter showing leakage current through an off transistor. Diagram
reproduced from [BS02]
2.1.1 Dynamic power
The main component of power dissipation is therefore the dynamic power or power dissipated
while switching the states of the transistors. This power is required to charge the load
capacitance (shown on the right of Figure 2.1) and can be calculated using the following equation:
$$P_{\text{dynamic}} = \alpha C V^{2} f \qquad (2.1)$$
C is the load capacitance, V is the supply voltage, f is the clock frequency and α is the activity
factor, which relates to the proportion of cycles that the transistors will switch state. The clock
has an activity factor of α = 1 as it switches every cycle, but normal logic has an activity factor
around α = 0.1 due to some inputs remaining constant [WH05] (page 191).
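As a small worked example of Equation 2.1, the sketch below evaluates the dynamic power for the two activity factors mentioned above. The capacitance, voltage and frequency are hypothetical round numbers chosen only to make the arithmetic easy to follow.

```python
def dynamic_power(alpha, capacitance_f, voltage_v, frequency_hz):
    """Dynamic power from Equation 2.1: alpha * C * V^2 * f."""
    return alpha * capacitance_f * voltage_v ** 2 * frequency_hz


# Hypothetical parameters, not taken from any real design.
C_LOAD = 1.0e-9   # total switched capacitance in farads (assumed)
V_DD = 1.0        # supply voltage in volts (assumed)
F_CLK = 2.0e9     # clock frequency in hertz (assumed)

clock_power = dynamic_power(1.0, C_LOAD, V_DD, F_CLK)   # clock: alpha = 1
logic_power = dynamic_power(0.1, C_LOAD, V_DD, F_CLK)   # logic: alpha = 0.1
print(clock_power, logic_power)
```

For the same capacitance, logic with an activity factor of 0.1 dissipates one tenth of the power of a net that switches every cycle.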
Power is also consumed during a brief short circuit that is created when both transistors are
switching and both are temporarily on. The power dissipated during this short circuit is around
10% of the power needed to charge the load capacitance [Vee84].
2.1.2 Static power
CMOS circuits are low power, as their complementary nature prevents a direct connection
between the source and the ground. However, current flowing through a transistor is exponentially
related to gate voltage and cannot be cut off completely even if the gate voltage is reduced to
zero (Equation 2.3). This is called a leakage current (see Figure 2.1) or subthreshold leakage:
$$P_{\text{static}} = I_{\text{static}} V \qquad (2.2)$$
$$I_{\text{static}} = I_{0}\, e^{(V_G - V_T)/nvS} \qquad (2.3)$$
Istatic is the leakage current that flows through an off transistor and V is the supply voltage. I0 is the
current that flows through an on transistor, VG is the voltage at the transistor gate, VT is the
threshold voltage, n is a process-dependent term typically between 1.4 and 1.5, v is the thermal
voltage, and S is a significance term (see [WH05], page 88 for more details). For technology
generations with feature sizes at or above 0.35µm, the static power leakage has been a negligible
contribution to overall CMOS power.
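The exponential dependence in Equation 2.3 can be illustrated with a short sketch that computes how much the leakage current grows when the threshold voltage is lowered. It makes two assumptions that are not part of the text above: the significance term S is taken as 1, and the thermal voltage is taken as roughly 26 mV (its approximate value at room temperature). The function name is hypothetical.

```python
import math


def leakage_ratio(delta_vt_v, n=1.5, thermal_voltage_v=0.026):
    """Factor by which subthreshold current grows when V_T drops by delta_vt_v.

    Derived from Equation 2.3 with S assumed to be 1: the I_0 and V_G terms
    cancel in the ratio, leaving exp(delta_vt / (n * v)).
    """
    return math.exp(delta_vt_v / (n * thermal_voltage_v))


# Sweep a few hypothetical reductions in threshold voltage.
for delta in (0.0, 0.05, 0.1):
    print(delta, leakage_ratio(delta))
```

Under these assumptions, lowering the threshold voltage by 100 mV increases the leakage current by roughly an order of magnitude, which is why threshold scaling has such a strong effect on static power.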
Other forms of leakage also exist, the most significant of which is gate leakage. This is a current
leaking through the gate dielectric creating a connection either from the source to the gate, or
from the gate to ground. This leakage is currently much smaller than subthreshold leakage
(accounting for around 3% of total leakage in our simulations) and new high-k dielectric materials
can be used to greatly reduce this form of leakage [KAB+03]. As subthreshold leakage is
dominant and predicted to remain so, we will use static leakage or simply leakage interchangeably
with subthreshold leakage from now on.
2.1.3 Technology scaling
Technology scaling improves the density and performance of CMOS by decreasing the size of
transistors. As a result, more transistors can be placed onto a single die, which can be used for
added functionality or increased parallelism, and improved performance allows the technology
to keep pace with market demands.
As transistors are scaled below 0.1µm, the supply voltage must also be scaled down to ensure
correct operation and prevent damage to the transistors [BS02, Dav96]. This has a positive
impact on the dynamic power dissipation, as the dynamic power is quadratically dependent
on voltage (Equation 2.1). Load capacitance also decreases as a consequence of technology
scaling as it is dependent on transistor size.
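As an illustration of Equation 2.1 under scaling, suppose (purely as an assumption for this example) that one technology generation scales both the supply voltage and the load capacitance by a factor $s = 0.7$, with the activity factor and clock frequency held constant. The ratio of the scaled dynamic power $P'_{\text{dynamic}}$ to the original is then:

```latex
\frac{P'_{\text{dynamic}}}{P_{\text{dynamic}}}
  = \frac{\alpha\,(sC)\,(sV)^{2} f}{\alpha\,C\,V^{2} f}
  = s^{3} = 0.7^{3} \approx 0.34
```

In practice the clock frequency is usually raised at the same time, which offsets part of this reduction.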
The decrease in supply voltage must be matched by a reduced threshold voltage, which is
the voltage where the transistor ’switches’ and above which the transistor is considered to be
on. The subthreshold current increases exponentially as the threshold voltage decreases
(Equation 2.3), which causes the static power to increase despite the lower voltage [BS02]. The
threshold voltage also determines the performance of the circuits, so a compromise must
be reached when setting the threshold voltage to balance power against performance.
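To make the exponential dependence in Equation 2.3 concrete, the following Python sketch computes the factor by which subthreshold leakage grows when the threshold voltage is lowered while the gate voltage is held fixed. The numbers are assumptions chosen for illustration only: n = 1.5 (the upper end of the range quoted in Section 2.1.2), a thermal voltage of 26mV (roughly room temperature), the significance term S taken as 1, and the exponent read as (VG − VT) divided by the product of n, v and S. The function name is hypothetical.

```python
import math


def leakage_ratio(delta_vt, n=1.5, v_thermal=0.026, s=1.0):
    """Factor by which I_static grows when V_T is lowered by delta_vt volts.

    From Equation 2.3, I_static = I_0 * exp((V_G - V_T) / (n * v * S)),
    so with V_G fixed, lowering V_T by delta_vt multiplies the current
    by exp(delta_vt / (n * v * S)). The default values are assumptions.
    """
    return math.exp(delta_vt / (n * v_thermal * s))


# Lowering the threshold by 100mV under the assumed parameters.
ratio = leakage_ratio(0.100)
print(f"Leakage grows by a factor of about {ratio:.1f}")
```

Under these assumed parameters a 100mV reduction in threshold voltage raises the leakage current by roughly an order of magnitude, which is why the lower supply voltage does not compensate for it.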
In [TPB98], it is predicted that sustaining the performance trends seen in 0.25µm technology
for the next generation would result in a 100x increase in leakage current. Other trends are
shown that demonstrate static power is an increasing contributor to overall CMOS power in
microprocessors as technology is scaled down to 0.18µm. Of the six fundamental scaling limits
described in [TPB98], half are limited by subthreshold leakage, and one is limited by gate
leakage. Kim et al. [KAB+03] plot dynamic and static power projections from the 2002 ITRS
roadmap and indicate that subthreshold leakage power will exceed dynamic power before 2005
and will be around 10 times the dynamic power by 2020 (measured from graph).
Using the power estimation framework described later in Chapter 8, the power consumption of
a superscalar processor (including caches) based on the DEC Alpha 21264 architecture [Com02]
was estimated for 90nm, 65nm and 45nm technology generations (Figure 2.2). The McPAT tool
is used to calculate reasonable designs for the architecture and estimate power consumption
based on technology data in the 2007 ITRS roadmap. The figure demonstrates that the ratio
of static power to dynamic power may already be approaching the expected values predicted
in [KAB+03].
Figure 2.2: Estimated breakdown of power dissipated by a processor based on the DEC Alpha
21264 for different technology sizes. (Bar chart of static and dynamic power in W, on an axis
from 0 to 25, for the 90nm, 65nm and 45nm generations.)
To collect the data for the figure the design parameters have been kept the same, including the
clock frequency which is set at 1200MHz. In reality the additional logic afforded by technology
scaling may be used for added functionality and the smaller technology could support higher
clock frequencies. Each would result in increases of both dynamic and static power. The choice
of clock frequency is discussed later in Chapter 8.
2.2 Reducing CMOS power consumption
2.2.1 Reducing dynamic power
From Equation (2.1) we can see that dynamic power is quadratically dependent on supply
voltage, so reducing the voltage should yield large dynamic power savings. Dynamic voltage
scaling (DVS) [BB95] is a popular way to decrease dynamic power consumption by decreasing
the supply voltage at runtime.
Maximum clock frequency is proportional to $(V - V_T)^n$, where n is technology dependent and
is between 1 and 2, and $V_T$ is the threshold voltage [TPB98]. Therefore, when the voltage
is scaled down, the clock frequency f must be decreased accordingly, which further reduces the
dynamic power but represents a loss in performance (for this reason DVS is normally referred
to as dynamic voltage and frequency scaling, or DVFS).
In [WH05], it is shown that reducing the frequency to 2/3 of its original value, combined with
voltage scaling, decreases dynamic power to 30%. In situations where lower performance is
acceptable, for instance during idle periods or in a real-time application, this can be an
effective way to reduce power consumption. In computation-bound applications however, the
energy savings will be much smaller: energy falls only to 45% of the original (30% × 3/2), as
the processor will dissipate dynamic power for longer to complete computation.
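The arithmetic can be checked with a short Python sketch using the figures quoted from [WH05]: dynamic power at 30% of the original, and a runtime stretched by a factor of 3/2. It assumes a purely computation-bound task whose runtime grows in inverse proportion to clock frequency; the function name is illustrative.

```python
def relative_energy(power_fraction, frequency_fraction):
    """Energy relative to the unscaled baseline for a compute-bound task.

    Energy is power multiplied by time, and for a compute-bound task the
    runtime is assumed to scale as 1 / frequency_fraction.
    """
    runtime_factor = 1.0 / frequency_fraction
    return power_fraction * runtime_factor


# Power falls to 30% while the clock runs at 2/3 of its original rate.
energy = relative_energy(0.30, 2.0 / 3.0)
print(f"Energy relative to baseline: {energy:.0%}")
```

The gap between the 30% power figure and the resulting energy figure is the reason power alone is a misleading measure for workloads that must run to completion.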
Clock-gating can completely eliminate the dynamic power consumption of idle devices by stop-
ping the input clock for that device. The clock network consumes a large amount of dynamic
power as the activity factor is α = 1, and other transistors connected to the clock may also
switch even though the device performs no useful computation. By stopping the clock for an
idle device this dynamic power is eliminated as no transistors will be able to switch states.
2.2.2 Reducing static power
The threshold voltage has an exponential relationship with leakage current such that increas-
ing the threshold decreases subthreshold leakage (Equation 2.3). However, transistor delay is
inversely proportional to $(V - V_T)^n$, so globally increasing the threshold voltage decreases the
achievable clock frequency, and therefore performance. The choice of threshold voltage will de-
pend on the requirements of the system, and is a compromise between performance and power
consumption.
Circuits that are not on critical paths may be able to tolerate longer delays without impacting
overall performance, so the threshold voltage can be higher for these circuits to reduce their
power consumption [WH05, TPB98, BS02]. Alternatively, it is possible to decrease the supply
voltage for non-critical circuits with similar effects, but level shifter circuits are required to
communicate between the different supply voltage domains [BS02], so this approach would be
limited to coarse grain application.
When circuits are idle, they dissipate static power even though no computation is being per-
formed. To prevent leakage in idle circuits, they can be isolated from the supply voltage or
ground. This is called power-gating and is often applied by connecting circuits to a virtual
ground, that is either connected to the ground during normal operation or to the power supply
to prevent leakage. A slow transistor with a high threshold is used to connect the virtual ground
so that it is not a significant source of leakage itself.
To switch on a circuit the slow power-gating transistor must switch and the capacitances in the
circuit must all be recharged (or discharged) to return it to an operational state. This takes
time, which depends on the maximum current that can be supplied to charge the circuit. Power
will also be consumed each time the circuit is power gated, both to switch the power-gating
transistor and to charge the circuit.
Connecting the virtual ground to intermediate voltages between the supply voltage and ground
allows a tradeoff to be made between power savings and switch on costs [ADSN06]. Connecting
the virtual ground to a lower voltage will reduce the costs in terms of both delay and power,
but lower power savings will be achieved. If overall performance is particularly sensitive to the
switch on time of a circuit, this may be a more appropriate option than powering off the circuit
completely.
A key disadvantage of power-gating idle circuits is that the state of the transistors is lost.
Power-gating can be safely applied to stateless circuits, such as execution units, as these simply
compute an output based on inputs. Storage structures such as caches or register files would
lose their information when they are power-gated however, so data must be copied to alternative
storage (such as a higher level of cache or main memory) before the circuits can be switched
off. Copying large amounts of data each time a circuit is power-gated will incur an undesirable
loss of performance and will also consume dynamic power.
Kim et al. show that switching the supply voltage from 1V to 200-300mV (using a virtual
power supply rather than a virtual ground) allows the information in a cache cell to be pre-
served while reducing the subthreshold leakage by 90% [KFBM04]. An alternative approach
[ALR02] redesigns the cache cell so that it retains the data when the supply is completely
removed. Although the data degrades somewhat, enough information remains for the data to
be recovered when power is restored to the cell. With state preservation, power-gating becomes
an attractive leakage reduction technique for idle circuits, as subthreshold leakage is almost
completely eliminated. In addition, a natural side effect of power-gating is dynamic power
reduction, as the transistors cannot switch while the circuit is off.
A different technique for reducing subthreshold leakage in idle circuits is to apply a body
bias to the transistor bodies to either increase or decrease the threshold voltage of transistors
dynamically. As stated earlier, increasing the threshold voltage will slow down the transistors
and reduce the subthreshold leakage. Applying the body bias takes longer than power-gating
however [BS02], and requires additional power supply rails for the transistors [WH05].
2.2.3 Power management in existing designs
Texas Instruments’ Sitara ARM microprocessor using a Cortex-A8 core [Tex11] is partitioned
into four power domains: the Graphics Domain, Active Domain, Default Domain and Always-On
Domain. All domains except the Always-On Domain can be power-gated off when in
standby mode to save static and dynamic power. The Always-On Domain contains all modules
that are required even in standby mode, including the host ARM core and modules that generate
wake-up interrupts. The Default Domain consists of modules that may be required in standby,
which can be powered on during standby if needed. The Active Domain and Graphics Domain
contain modules that will only be used when the processor is not in standby mode.
In addition to power-gating, the Texas Instruments microprocessor employs SmartReflex mod-
ules to perform DVFS (which they call adaptive voltage scaling) based on the desired perfor-
mance and chip temperature.
Intel performs DVFS (called SpeedStep) in their 2nd Generation Intel Core family of processors
(including the i7, i5 and i3) [Int11] by defining a set of P-states for the processor, each of which
represents a different frequency and voltage combination. The appropriate frequency for each
core is requested by software, and the processor is set to the highest frequency P-state requested
by adjusting the voltage and clock.
They implement low power idle states (called C-states) for threads, cores and the processor as
a whole according to the Advanced Configuration and Power Interface (ACPI) specification in
[Hew10]. Higher C-states provide lower power consumption, but also incur a longer delay when
returning to active mode. The cores are set to the lowest C-state of their executing threads, and
the processor is set to the lowest C-state of the cores. The implemented C-states are detailed
in Table 2.1.
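The resolution rules described above can be sketched in a few lines of Python. This is an illustrative model only, not Intel's implementation: C-states are represented by their numbers (C0 as 0, C6 as 6), the requested frequencies are invented, and the function names are hypothetical.

```python
def shallowest(c_states):
    """A shared resource takes the lowest (shallowest) C-state of its members."""
    return min(c_states)


def highest_frequency(requested_mhz):
    """The processor runs at the highest frequency P-state that was requested."""
    return max(requested_mhz)


# Requested C-state numbers for the threads on each of two cores (assumed values).
thread_c_states = [[6, 3], [6, 6]]

# Each core is set to the lowest C-state of its executing threads.
core_c_states = [shallowest(threads) for threads in thread_c_states]

# The processor is set to the lowest C-state of its cores.
package_c_state = shallowest(core_c_states)

print(core_c_states, package_c_state)
print(highest_frequency([1600, 2400, 2000]))
```

Taking the minimum means a single active thread keeps its core, and therefore the whole processor, out of the deeper states, which is one motivation for managing power at a finer granularity.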
C-state   Description
C0        Active state, including DVFS P-states
C1/C1E    Low power state
C3        Level 1 and Level 2 local caches flushed to Level 3 shared cache.
          Architectural state is maintained and the core is clock-gated
C6        Architectural state is saved to dedicated SRAM and the core is power-gated
Table 2.1: Low power C-states for Intel’s 2nd Generation Core family of processors.
Power-gating, clock-gating and DVFS are applied to idle devices, but devices can be defined at
different granularities. In the Texas Instruments and Intel examples a coarse-grained approach
is taken and low power techniques are applied at the core or module level. However, even if a
core is active, there may be independent units within the core that are idle while other units
are performing computation. Techniques to power-gate and clock-gate at a finer granularity
(for example individual cache lines and execution units) have been studied in various research
projects, but they have yet to be broadly adopted by manufacturers. These techniques will be
discussed in Chapter 3.
2.3 Energy-delay product
Power, power-delay product and energy-delay product can all be used to measure the effective-
ness of a power saving technique [WH05]. On their own, power and power-delay product do
not take performance into account (power-delay product essentially calculates the total energy
consumed for the computation). In the most extreme case, both may be reduced to zero by
power-gating the entire processor, although performance would drop to zero also.
All processors must maintain an acceptable level of performance and most research that uses
power or power-delay product as a measure of fitness will also provide performance measure-
ments. To directly compare different techniques however, both power and performance must
be combined into a single metric, and the energy-delay product is becoming a de facto standard
for this purpose.
Energy-delay product explicitly includes both the energy for computation and performance.
For delay we use the time to complete execution of a trace, which is the inverse of performance.
Another metric, energy-delay² product, places more significance on the performance
of the processor. It is a matter of choice which is more appropriate when comparing power
saving techniques, and energy-delay product was chosen for our evaluation due to the increasing
significance of power consumption with continued technology scaling.
The energy-delay product (EDP) is a figure of merit that is normally only used to compare
processors or power saving techniques. It should be noted however, that two processors with
the same EDP may not have the same energy or performance characteristics, as one may be
fast and power hungry, and the other slow with low power requirements. Therefore, EDP will
be used alongside power and performance results in this thesis when performing comparisons.
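As a brief illustration of why EDP is reported alongside power and performance, the sketch below compares two hypothetical processors. The energy and delay values are invented for the example and are not results from this thesis; the function name is likewise illustrative.

```python
def energy_delay_product(energy_joules, delay_seconds, delay_exponent=1):
    """Energy-delay product; delay_exponent=2 gives the energy-delay^2 product."""
    return energy_joules * delay_seconds ** delay_exponent


# (energy in joules, delay in seconds): invented values for illustration.
fast_power_hungry = (20.0, 1.0)
slow_low_power = (10.0, 2.0)

for name, (energy, delay) in [("fast", fast_power_hungry), ("slow", slow_low_power)]:
    edp = energy_delay_product(energy, delay)
    ed2p = energy_delay_product(energy, delay, delay_exponent=2)
    print(f"{name}: EDP = {edp}, ED2P = {ed2p}")
```

Both designs have the same EDP even though one uses twice the energy and the other takes twice as long, while the energy-delay² product favours the faster design.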
2.4 Energy savings and costs
The following formulae describe the energy savings and costs when power-gating execution
units:
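$$\text{Energy}_{saved} = \sum_{unit=0}^{n} \Big( \text{Energy}_{per\ cycle}(unit) \times \big( \text{Cycles On}_{orig}(unit) - \text{Cycles On}_{PG}(unit) \big) \Big)$$

$$\text{Energy}_{cost} = \sum_{unit=0}^{n} \Big( \text{Number of Switch Ons}(unit) \times \text{Energy}_{switch\ on}(unit) \Big) + \text{Energy}_{per\ cycle}(\text{rest of core}) \times \big( \text{Cycles On}_{PG}(\text{rest of core}) - \text{Cycles On}_{orig}(\text{rest of core}) \big)$$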
The orig and PG subscripts denote original execution without power-gating and execution with
power-gating respectively.
Energy is saved when units are power-gated off, so the savings are equal to the power consump-
tion of each unit multiplied by the amount of time the unit is off. The energy cost has direct
and indirect components. The direct cost of power-gating is the static and dynamic power
dissipated while the unit is being powered on and is temporarily unavailable for instruction
issue. The indirect cost is from the remainder of the processor, which will dissipate additional
static and dynamic power during any extra cycles due to loss of performance.
As the energy consumed by the rest of the processor is larger than the energy consumed by an
execution unit, savings from many cycles of an execution unit being powered-off could easily
be negated by only a few extra cycles powering the rest of the processor due to increased
execution time. Therefore, care must be taken to ensure the savings are larger than the costs
by minimising performance loss. Frequently power-gating units could also result in high costs,
so power-gating for long intervals should be encouraged.
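The two formulae can be evaluated directly, as in the following Python sketch. The data structure, function names and all numeric values are assumptions made for illustration (energies are in arbitrary units); they are not measurements from the evaluation.

```python
from dataclasses import dataclass


@dataclass
class UnitStats:
    """Per-unit counters; field names are illustrative."""
    energy_per_cycle: float   # energy the unit consumes per cycle while on
    cycles_on_orig: int       # cycles on without power-gating
    cycles_on_pg: int         # cycles on with power-gating
    switch_ons: int           # number of times the unit is powered back on
    energy_switch_on: float   # energy consumed by one switch on


def energy_saved(units):
    return sum(u.energy_per_cycle * (u.cycles_on_orig - u.cycles_on_pg) for u in units)


def energy_cost(units, rest_energy_per_cycle, rest_cycles_orig, rest_cycles_pg):
    direct = sum(u.switch_ons * u.energy_switch_on for u in units)
    indirect = rest_energy_per_cycle * (rest_cycles_pg - rest_cycles_orig)
    return direct + indirect


units = [UnitStats(1.0, 1000, 400, 5, 10.0)]
saved = energy_saved(units)

# The rest of the core is assumed to use 20 times the energy of the unit per cycle.
net_small_slowdown = saved - energy_cost(units, 20.0, 1000, 1020)
net_larger_slowdown = saved - energy_cost(units, 20.0, 1000, 1030)
print(saved, net_small_slowdown, net_larger_slowdown)
```

With these assumed values, 600 gated cycles are outweighed once execution is lengthened by only 30 cycles, which illustrates why minimising performance loss takes priority.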
2.5 Summary
This chapter describes the two forms of power dissipation in CMOS logic, static and dynamic
power, and describes how the technology parameters can be varied to reduce these two compo-
nents. Section 2.1.1 describes the dynamic power, which is dissipated when transistors change
state. Static power is described in Section 2.1.2, and is dissipated regardless of transistor switch-
ing. Section 2.2 shows how the dynamic and static power can be reduced and the consequences
of varying the technology-level parameters.
Technology scaling increases the density and performance of transistors, but the reduction in
the threshold voltage results in an exponential increase in the subthreshold leakage and static
power component. It is predicted that static power will dominate overall power consumption,
and it may already do so. Subthreshold leakage can be controlled for active, non critical-path
circuits by adjusting the threshold voltage to match the required delay for the circuit. For idle
circuits, the leakage can be reduced by power-gating the circuit or by dynamically increasing
the threshold voltage using body biasing. Power-gating is selected in our approach to reduce
the power consumption of execution units for the following reasons:
• Dynamically setting the threshold to reduce subthreshold leakage during idle periods
using body biasing is slow.
• Power-gating allows subthreshold leakage to be almost completely eliminated for idle
circuits and is relatively fast.
Given the secondary effects of power reduction, such as performance loss, it is crucial to use an
appropriate measure of fitness for a power-saving approach. Section 2.3 compares the different
measures and explains our choice of energy-delay product (EDP). Power and energy can be
reduced at the cost of reduced performance, but this is often undesirable for a user. As EDP
takes both energy and performance into account, it combines both requirements into a single
metric. EDP is not a directly measurable quantity however, so performance and power data
will be provided to link the results to these directly measurable quantities.
Finally, in Section 2.4 equations are defined to describe the factors that contribute to energy
savings and overheads when power-gating execution units. Costs are associated with both pow-
ering up the units and powering the remainder of the processor for additional cycles due to
lost performance. Savings are proportional to the amount of time a unit remains off. Minimising
performance loss and avoiding frequent power mode transitions are shown to be two key
priorities.
Chapter 3
Related Work
Power management techniques can be applied to devices to reduce their power consumption
when they experience periods of idleness. This chapter describes some of these techniques and
how they are applied to system devices at different granularities. It was shown in the previous
chapter that dynamic voltage and frequency scaling (DVFS) can be used to reduce dynamic
power consumption, but since static power dissipation is now a greater contributor to overall
power, we will focus on power-gating as a power management mechanism. We will also compare
power management to simultaneous multi-threading (SMT), which attempts to make use of
idle periods to perform computation from other program threads.
3.1 Coarse-grain power management
Power-gating allows idle devices to be switched off by isolating them from the power supply.
A coarse-grain approach, called dynamic power management or DPM, is to power-gate entire
system devices such as disk drives, microprocessor cores or other on-chip modules [Tex11,
HII+06, KSS08]. At this granularity, devices are requested to perform tasks by software, which
may take some time to complete. Processor cores will be requested to execute a thread of
computation by the operating system, and I/O devices such as disks will service data requests
from applications (via the operating system). After all requests have been serviced, the device
becomes idle and may be power-gated to reduce power consumption, by selecting one of many
different low power modes.
The frequency and distribution of requests depends on the application type, number of applica-
tions running and user interaction, so runtime (online) techniques must be applied to determine
an appropriate time to switch off a device. Ideally, devices will be power-gated off for long pe-
riods, which will amortise the power cost of powering on the device for the next request. There
will always be a delay while the device is power-gated on, and requests may have to wait to be
serviced. Normally, power modes that consume less power while the device is off will require
more power and greater delay when powering the device on.
Most online DPM techniques fall into three categories: timeout, predictive and stochastic
modelling. Timeout policies switch off a device if it has been idle for a set period, and are
commonly used for screens and disk devices. The advantages of such a policy are that it is
easy to implement in hardware, and the timeout can be adjusted to control the likely impact
on performance. Increasing the timeout to reduce the impact on performance will decrease the
power savings, however, as the device will remain on longer before it is switched to a low power
mode.
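The trade-off controlled by the timeout can be seen in a minimal Python sketch. It replays a list of idle period lengths (invented values, in arbitrary time units) against a fixed timeout and reports how long the device spends switched off; the function name is hypothetical and wake-up delay is ignored.

```python
def timeout_policy(idle_periods, timeout):
    """Return (time spent switched off, number of power-downs) for a fixed timeout.

    The device stays on for `timeout` units at the start of every idle
    period and is switched off only for the remainder, if any.
    """
    off_time = 0
    power_downs = 0
    for idle in idle_periods:
        if idle > timeout:
            off_time += idle - timeout
            power_downs += 1
    return off_time, power_downs


idle_periods = [2, 50, 8, 120, 15]  # assumed idle period lengths
print(timeout_policy(idle_periods, timeout=10))
print(timeout_policy(idle_periods, timeout=30))
```

Raising the timeout from 10 to 30 avoids one short power-down, and its associated wake-up delay, at the cost of less time spent in the low power mode.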
In the prediction strategy, the length of the next idle period is predicted based on the history of
past idle periods. The optimal power mode for the predicted length of the idle period is applied,
and the device is set to return to full power just before the next request is due to arrive. Two
recent predictive techniques employ exponential averages of past idle periods [CGZ09] and
genetic algorithms [KTYZ06] to predict the length of the next idle period.
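As an illustration of the first of these ideas, the sketch below shows the textbook form of an exponential average, in which each new observation is blended with the previous prediction. It is not the specific algorithm of [CGZ09]; the smoothing factor, the break-even threshold and the function names are all assumptions.

```python
def exponential_average(idle_history, alpha=0.5, initial=0.0):
    """Predict the next idle period as an exponentially weighted average.

    alpha weights the most recent observation; (1 - alpha) weights the
    previous prediction, so older periods decay geometrically.
    """
    prediction = initial
    for idle in idle_history:
        prediction = alpha * idle + (1 - alpha) * prediction
    return prediction


def choose_mode(predicted_idle, break_even):
    """Hypothetical rule: power down only if the predicted idle period
    is long enough to amortise the cost of powering back on."""
    return "power_down" if predicted_idle > break_even else "stay_on"


predicted = exponential_average([100, 100, 100])
print(predicted, choose_mode(predicted, break_even=60))
```

Because the prediction converges gradually, a predictor of this form lags behind sudden changes in the request pattern, which is the weakness the histogram-based approach below is intended to address.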
The predictive DPM policy assumes a stationary (regular) pattern of request arrivals and can
perform badly when non-stationary request patterns are observed. In [ISG03], past idle periods
are used to update a histogram which represents the probability distribution for the next idle
period. This contains much more information than a single value prediction of the length of
the next idle period. When selecting the appropriate power mode for the period, the likelihood
of the prediction is taken into account so that a mode with a smaller penalty can be applied if
there is low confidence in the prediction.
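A rough Python sketch of this idea follows. It is a simplified illustration rather than the method of [ISG03]: the bin width, the confidence threshold, the mode names and the class interface are all assumptions.

```python
from collections import Counter


class IdleHistogram:
    """Histogram of past idle period lengths, used as an empirical distribution."""

    def __init__(self, bin_width):
        self.bin_width = bin_width
        self.counts = Counter()
        self.total = 0

    def record(self, idle):
        self.counts[idle // self.bin_width] += 1
        self.total += 1

    def prob_at_least(self, length):
        """Estimated probability that the next idle period falls in the bin
        containing `length` or in a later bin."""
        if self.total == 0:
            return 0.0
        first_bin = length // self.bin_width
        hits = sum(count for b, count in self.counts.items() if b >= first_bin)
        return hits / self.total


def choose_mode(histogram, deep_break_even, confidence=0.7):
    """Use the deep mode only when a long idle period is sufficiently likely."""
    if histogram.prob_at_least(deep_break_even) >= confidence:
        return "deep"
    return "shallow"


hist = IdleHistogram(bin_width=10)
for idle in [5, 12, 48, 55, 61, 9, 70, 52]:
    hist.record(idle)

print(hist.prob_at_least(50), choose_mode(hist, deep_break_even=50))
print(hist.prob_at_least(10), choose_mode(hist, deep_break_even=10))
```

Keeping the whole distribution lets the policy fall back to a mode with a smaller penalty when the evidence for a long idle period is weak, rather than committing to a single predicted length.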
Stochastic modelling methods assume probability models for the idle period and device response
time, and determine the optimal power mode given these assumptions [SDM08]. A limitation
of stochastic models for idle periods is that they cannot dynamically adapt to different work-
loads and also perform poorly with non-stationary workloads that do not fit the modelling
assumptions. As DPM policies tend to be better suited to specific workload types, a more
versatile approach is to use a policy selection mechanism to apply one of a set of DPM policies
(which may include timeout, predictive and stochastic policies) depending on the workload
[RKM05, DR06].
A comprehensive survey of DPM, including timeout, predictive and stochastic strategies can
be found in [BBDM00].
3.2 Fine-grain power management
It was mentioned earlier that power-managed devices can be defined at different granularities.
Fine-grain power management considers smaller devices that may be components of a larger
device, such as the pipeline stages and units that make up a processor core. Each fine-grain
device has a well defined set of inputs and outputs, and can be isolated from a power supply
(power-gating).
Fine-grain power-gating has the potential for greater power savings, as the active hardware can
be matched to the resource requirements of a workload at a higher resolution. For instance, a
core may be active to execute a thread of execution (either using DPM policies or DVFS), but
not all of the individual components of the core may need to be active to support execution.
Superscalar processors use wide architectures to support high instruction-level parallelism (ILP)
where it exists, but this can result in many components being idle each cycle. In the next section,
the superscalar pipeline and existing techniques to power-gate components are discussed.
At the component granularity, DVFS becomes difficult to implement as DVFS requires different
voltage and clock domains, which would need to be distributed to all components through the
CMOS fabric. Also, level shifter circuits would need to be introduced to communicate between
circuits operating at different voltages [BS02]. Power and clock-gating require virtual ground
lines and control signals for each component and can be used instead to power-manage fine-grain
components.
Fine-grain processor components can be considered to be accessed in a request based fashion
like coarse-grain devices, but requests are shorter and more frequent. The smaller size of
components results in lower power savings when compared to coarse-grain devices, but the
power and time to power components back on are reduced also, and start up times in the order
of 10 clock cycles are reported in [ADSN06, BS02]. Another factor contributing to the short
power on delay is that most components have no internal state. Many simply compute results
from inputs, so additional delay is not incurred in retrieving and restoring the state.
The remainder of this chapter will provide related work on fine-grain power management of
processor cores.
3.3 Power-managing the superscalar pipeline
The increasing availability of transistors on a chip due to technology scaling allows more func-
tionality to be added with each technology generation. This can be used to widen the processor
pipeline, or add more cores, or both. Instruction-level parallelism (ILP) exists in instruction
streams, when instructions have no data dependencies with each other and could therefore be
executed in parallel. Wide (superscalar) architectures exploit ILP to improve performance by
providing multiple execution units to execute instructions in parallel where possible. The in-
creased throughput requires that the other components of the design are widened also. Modern
architectures can execute around four instructions in parallel.
When the instruction stream contains low ILP, the width of the superscalar components may
not be fully utilised, leaving parts of the pipeline idle. The execution stage of the pipeline is
different to other stages, in that several different types of execution unit are implemented to
execute different types of instruction. This normally results in more units than necessary to
maintain the maximum throughput of the rest of the processor, which means that some units
must remain idle each cycle. Figure 3.1 shows the superscalar architecture that we model later
to evaluate our power saving techniques. It is based on the DEC Alpha 21264 architecture
[Com02]. To support simulation and power modelling, compound execution units that can
perform two or more types of instruction are broken down into separate units. The integer
units are split into simple ALUs (ALU) and complex ALUs (MUL), and the load/store units
are split into simple ALUs and dedicated memory access units (MEM).
Figure 3.1: Diagram of a superscalar architecture based on the DEC Alpha 21264.
Figure 3.2 shows the breakdown of power consumption for the superscalar architecture for
two different technology generations, 90nm and 45nm. The figure was produced using the
simulation and power estimation framework described later in Chapter 8, with a 1 million cycle
trace of perlbench. In Section 2.1.3 it was shown that dynamic power dissipation exceeds static
power dissipation in 90nm technology, but static power dissipation is dominant in the more
recent 45nm technology. This shift means that power consumption becomes more dependent
on the number of transistors (area) than transistor switching activity (computation performed).
Table 3.1 shows area predictions made by our power estimation framework for selected units.
From the table and Figure 3.2, we can see that devices with larger area, namely the Level 2
cache and execution units, consume more power after technology scaling when compared to
smaller area devices such as the data and instruction caches, MMU and bypass logic.
Figure 3.2: Breakdown of power consumption in the superscalar architecture estimated for
90nm and 45nm technology.
Component        Area (mm²)
l2cache          13.23
dcache            1.46
icache            0.63
MUL (2 units)     0.94
FPU (2 units)     4.66
ALU (4 units)     1.26
MMU               0.24
bypass            0.02
Table 3.1: Area of selected superscalar architecture components estimated for implementation
in 45nm technology. The total area is given where there are multiple instances of a unit.
Since 2009, mass market processors implemented in 32nm technology have been available
[Int10b]; however, the modelling tool chosen for our evaluation does not support power
estimation for this generation. It is expected that the trends in power consumption will continue
to 32nm. The implications of future technologies described in the ITRS roadmap such as
multi-gate transistors [ITR10], which may be necessary for further technology scaling, will be
discussed in Chapter 10.
The following subsection examines techniques to power-manage the issue queue and register file
in the superscalar pipeline. Execution units and the Level 2 cache will be considered in detail in
their respective subsections, as these contribute highly to overall processor power consumption.
3.3.1 Power managing the issue queue and register file
Instructions to be executed are fetched from main memory and decoded in the processor
pipeline’s front end (shown near the top of Figure 3.1). Instructions are then dispatched
to the issue queue, which contains both instructions that have their operands available and
instructions that do not. Each cycle, instructions are updated with operands that have been
made available by instructions executed in the previous cycle, in a process called wakeup. Then
a set of instructions which have all of their operands available are issued to the execution units,
in a process called selection [PJS97]. Although instructions enter the queue in the original
program order, they may be issued out-of-order if certain dependencies are satisfied.
The selection logic attempts to achieve maximum throughput by filling as many issue slots as
possible with instructions. The issue width defines how many slots are available and therefore
the upper limit on throughput. Data dependencies between instructions in the queue may
mean that instructions will not have their operands available until other instructions produce
them as a result. This can prevent all of the issue slots being filled. A larger issue queue (or
instruction window) increases the probability that enough instructions will be issuable, at the
cost of increased power consumption.
The issue queue can be power-managed by power-gating parts of the queue to reduce its length.
In [BKAB03], the length of the reorder buffer is used to estimate the length of issue queue that
is needed to sustain the current level of parallelism. Completed instructions must be committed
to the architectural state in the original program order, so when instructions are issued out-of-
order from the issue queue, they must be buffered in the reorder buffer (ROB) in the original
order until all previous instructions have committed. The length of the ROB indicates how far
out-of-order an instruction has been issued, and therefore indicates how long the issue queue
needs to be to maintain the current level of parallelism. A longer queue is not necessary, so the
rest of the issue queue can be power-gated off to reduce power consumption.
The issue queue can be monitored itself to estimate the required length, by measuring the num-
ber of instructions that are issued from the youngest part of the queue [FG01]. If instructions
are infrequently issued from this part of the queue, the size of the queue can be dynamically
reduced. Conversely, if many are issued from the youngest part, then the queue size may be
limiting the amount of ILP that can be exploited and should therefore be enlarged. In [IM02],
Iyer and Marculescu dynamically scale the width of the entire processor, including the length
of the issue queue. They identify execution hotspots and determine the best processor width
by estimating the power consumption when different widths are enforced.
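The queue-monitoring heuristic attributed to [FG01] above can be sketched as a simple feedback rule. The Python below is an illustration of the general idea only: the thresholds, step size, size limits and function name are assumptions, not parameters taken from [FG01].

```python
def resize_queue(current_size, issued_from_youngest, total_issued,
                 step=8, min_size=16, max_size=64, low=0.05, high=0.25):
    """Return the issue queue size to use for the next monitoring interval.

    If few instructions issue from the youngest part of the queue, that
    part contributes little and can be power-gated; if many do, the queue
    may be limiting the ILP that can be exploited and is enlarged.
    """
    if total_issued == 0:
        return current_size
    fraction = issued_from_youngest / total_issued
    if fraction < low:
        return max(min_size, current_size - step)
    if fraction > high:
        return min(max_size, current_size + step)
    return current_size


print(resize_queue(48, issued_from_youngest=1, total_issued=100))
print(resize_queue(48, issued_from_youngest=40, total_issued=100))
print(resize_queue(48, issued_from_youngest=10, total_issued=100))
```

Leaving a band between the two thresholds in which the size is unchanged avoids resizing the queue on every interval, which would itself consume power.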
The techniques discussed so far fall into the prediction category of power management, as
runtime information is used to predict the length of queue that will be most appropriate for
future instructions. A more accurate approach is to determine at compile time the length of
queue that would be required to achieve the maximum amount of parallelism available in the
instructions [JOAG05]. Special instructions are then inserted to change the maximum length
of the issue queue at the appropriate time.
The register file stores the working data used by instructions. Data can be accessed directly
during issue, so instructions are not required to access caches or main memory for their operands
or to store results. Storing data in the register file speeds up execution, but there are a
limited number of registers available, normally around 32 to 128. The size of the register
file is dependent on the number of ports, which must be high so that the register file can
provide operands for each issued instruction in parallel while storing the results of the executed
instructions.
Park, Powell and Vijaykumar [PPV02] reduce the size and therefore the power consumption
of the register file by statically reducing the number of register file ports using bypass hints
and decoupled renaming. A bypass network forwards the results of executed instructions to the
inputs of execution units, so that the following instruction can use the result immediately as
an operand, avoiding the delay of writing and reading the result in the register file. In [PPV02]
it is noted that around 50%–70% of operands are obtained from the bypass network. They
reduce the number of register file read ports, and use bypass hints in the issue queue to guide
the selection logic in choosing instructions that will not exceed the number of available ports.
Banking can reduce the size of the register file, but each write port can only access the registers
contained within a particular bank. This can lead to bank conflicts if many instructions produce
results for the same bank. Operands of an instruction are usually allocated a virtual register,
which is used to resolve false operand dependencies. In [PPV02], the assignment of the operand
to a physical register is delayed until the values are to be written to the register file, so that
non-conflicting registers are assigned to the operands, permitting smaller register files through
banking.
Compiler techniques can also be used to reduce the number of registers that are needed and
allow power management policies to be applied to unused registers [JOA+05]. It is difficult to
determine in hardware if the most recent access to the data in a register is the last. Therefore
all registers must be kept active in case they are accessed again. Using the compiler, the authors
of [JOA+05] detect single use results and use special instructions to flag registers as single use
registers. By assigning results to these registers, the hardware can make the register available
after it is next accessed. It can either be made available for results of other instructions, or it
can be power-gated to reduce power consumption.
Power-gating registers dynamically at runtime is potentially dangerous, as isolating registers
from the power causes them to lose their data. Prediction-based power management is therefore
avoided, as misprediction can result in incorrect program execution. Even the compiler technique
just described employs slow, low power shadow registers to hold a copy of the previous
value in case of exceptions or branch mispredictions.
3.3.2 Power-managing caches
The structure of a direct-mapped cache is shown in Figure 3.3. To access data, the address is
first broken down into a tag, index and displacement. The index identifies the row of the cache,
and the tag is then compared to determine if the cached data block corresponds to the data
address. The displacement locates the specific data required in the data block.
[Diagram omitted: an instruction/data address is divided into tag, index and displacement fields; the index selects one line (0, 1, 2, ...) of the direct-mapped cache, and each cache line holds a tag, a data block and a valid bit.]
Figure 3.3: Direct-mapped cache structure.
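The tag, index and displacement lookup described above can be illustrated with a short sketch. The cache geometry (256 lines of 32 bytes) and all names are assumptions made for the example; the data block itself is not modelled.

```python
def split_address(address, num_lines=256, block_size=32):
    """Split an address into (tag, index, displacement) for a direct-mapped cache.

    num_lines and block_size are illustrative and must both be powers of two.
    """
    offset_bits = block_size.bit_length() - 1
    index_bits = num_lines.bit_length() - 1
    displacement = address & (block_size - 1)
    index = (address >> offset_bits) & (num_lines - 1)
    tag = address >> (offset_bits + index_bits)
    return tag, index, displacement


class DirectMappedCache:
    """Tracks only the valid bit and tag of each line (data blocks omitted)."""

    def __init__(self, num_lines=256, block_size=32):
        self.num_lines = num_lines
        self.block_size = block_size
        self.lines = [(False, None)] * num_lines   # (valid bit, tag) per line

    def access(self, address):
        """Return True on a hit; on a miss, fill the line and return False."""
        tag, index, _ = split_address(address, self.num_lines, self.block_size)
        valid, stored_tag = self.lines[index]
        if valid and stored_tag == tag:
            return True
        self.lines[index] = (True, tag)
        return False
```

Two addresses that share an index but differ in tag evict one another in this model, which is the conflict behaviour that makes cached data sensitive to any change in the mapping function.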
Different applications will naturally exhibit a variety of demand on cache resources. Loops
will limit instruction cache accesses to a range of addresses for long periods of time, and access
patterns of the data cache will depend on stride length and size of data structures. These access
patterns make caches amenable to predictive power management strategies, which can use the
history of accesses as a guide. DVFS is not suitable for caches, as their latency is already a
limiting factor of execution time. Caches can be power-gated however, either by switching off
specific cache lines [KHM01, ALR02, AGVO05], consecutive sets of cache lines [PYF+00], or
cache subbanks [KFBM04].
Powell et al. [PYF+00] adapt the size of the instruction cache to an application’s requirements
by iteratively halving the cache, power-gating off the regions of cache that are not needed.
They use a prediction based policy, which estimates the cache requirements of an application
by monitoring the cache miss ratio over an interval. When resizing a cache, the mapping
function used to locate the appropriate cache line must be adapted by masking the lower bits
of the index according to the current size.
It has been stated previously that circuits lose their state when they are power-gated, which
means that data contained in power-gated cache lines will be lost and will need to be reloaded
from the next level in the memory hierarchy if it is needed again. This can be damaging to
performance if the prediction is incorrect, or cached data will be reused far in the future. The
technique in [PYF+00] requires many cache lines to be reloaded each time the cache is resized,
as changing the mapping function will change the location that a particular address is looked
for in the cache.
Kim et al. reduce leakage while preserving the data by reducing the supply voltage to the cells
rather than removing it completely [KFBM04]. Reducing the voltage from 1V to 200-300mV
decreases leakage by 90%, but the full voltage must be restored for the cache cell data to be
read or written. This incurs a delay and can affect the performance of the cache, as the memory
accesses must wait for the cell to power up fully before the data can be accessed.
Instruction caches exhibit more spatial locality than data caches, and will normally address
small regions of the cache for long periods, before a jump to a new region due to function calls
and returns, or branch instructions. To exploit this pattern, Kim et al. divide the cache data
blocks into subbanks and switch all subbanks except the active one to the low-leakage (drowsy)
mode. A prediction mechanism records jumps that cause a transition to a new cache subbank
so that the new cache subbank can be power-gated on early without incurring a delay. As
the data is not lost during the power-gating, this method will result in fewer accesses to the
remainder of the cache hierarchy than one that destroys data, resulting in less performance
degradation and potentially higher energy savings.
Data cache accesses show more temporal locality, as data items are normally accessed repeatedly
for a period, and then possibly never again. Kim et al. put all data cache lines into drowsy
mode every 4k cycles, and restore them to full power when they are first accessed. Multiple
accesses of the data amortise the wake-up delay and data items that are not accessed again
remain in the low-leakage state. Intuitively, historical access information from the previous
period should provide a good guide as to which cache lines will be required during the next
period. The authors implemented this by applying the drowsy mode only to the cache lines
not accessed in the previous period, but this produced inferior results to the simpler policy.
Although an improvement in performance was seen using the access history, the policy put
fewer cache lines in the low-leakage mode, reducing power savings.
Kaxiras, Hu and Martonosi also make use of the temporal locality property in their Cache Decay
technique [KHM01]. They observe that the dead time of a cache line is orders of magnitude
larger than the active time when the data is being accessed. They use a timeout policy that
monitors the number of cycles since the last access, and switch off lines if they are not accessed
for a set interval of time. They improve upon this, dynamically adapting the timeout interval
for each line by measuring the length of the idle period. If the period is short, the interval is
increased so that the cache line will remain on longer and catch the next access. If the time
is long, however, the interval is decreased to attempt to maximise the amount of time the line
is in the low-leakage mode. Abella et al. employ a similar power management strategy,
but relate the number of accesses to cache lines to the length of the timeout interval [AGVO05].
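A per-line adaptive timeout of the kind described for Cache Decay can be sketched as below. The sketch is not the mechanism of [KHM01]: the doubling and halving steps, the bounds, and the rule that an idle period counts as short when it is less than twice the current timeout are all assumptions made for illustration.

```python
class DecayLine:
    """One cache line with an adaptive decay timeout (thresholds are assumptions)."""

    def __init__(self, timeout=1024, min_timeout=256, max_timeout=65536):
        self.timeout = timeout
        self.min_timeout = min_timeout
        self.max_timeout = max_timeout
        self.idle_cycles = 0      # cycles since the last access
        self.powered = True

    def tick(self):
        """Advance one cycle without an access; gate the line once the timeout expires."""
        self.idle_cycles += 1
        if self.powered and self.idle_cycles >= self.timeout:
            self.powered = False

    def access(self):
        """Access the line; returns True if it was still powered (data retained)."""
        was_powered = self.powered
        if not was_powered:
            if self.idle_cycles < 2 * self.timeout:
                # Short idle period: the line was gated too early, so wait longer next time.
                self.timeout = min(self.timeout * 2, self.max_timeout)
            else:
                # Long idle period: gate sooner to spend more time switched off.
                self.timeout = max(self.timeout // 2, self.min_timeout)
        self.idle_cycles = 0
        self.powered = True
        return was_powered
```

An access that returns False corresponds to a decay-induced miss, where the data must be reloaded from the next level of the memory hierarchy.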
In [ALR02] a new cache cell is studied, which retains the stored data while removing the voltage
from the cell completely. Although the data degrades somewhat, it retains enough information
for the data to be retrieved when power is restored. Only a small power-up delay is estimated
and the device was fabricated in 250nm technology. At a small cost to performance, potentially
large leakage reductions in caches could be seen with this technique, as all cache cells are
power-gated when not being accessed.
Compiler driven techniques can also be used to power-gate cache lines, by inserting directives
into the code to specify which cache lines will be needed during execution [ZHD+02, ZKKC03].
An advantage of this technique is that it is more accurate than predictive techniques. The
compiler directives can power-gate a cache line as soon as it enters a long enough idle period,
and can switch the line on again in advance of the next access. A comparison of compiler
techniques to hardware techniques will be given in Section 3.3.4.
The cache power management approaches described here consistently show leakage reduction of
around 70%. Since static power leakage contributes significantly to the overall processor power
consumption, and the L2 cache is likely to have relatively low dynamic power dissipation, we
can assume that current techniques are able to reduce the total power contribution of the L2
cache by around 70% also.
3.3.3 Power-managing execution units
Assuming the L2 cache power can be reduced by 70% with current techniques, execution units
will contribute around 44% of the overall power of the superscalar processor in Figure 3.2 when
using the 45nm technology, and are by far the largest contributor to overall power consumption.
Execution units can also be power-managed by using power-gating to reduce their leakage when
they are not required to execute instructions. Unlike caches, no state needs to be preserved, so
almost all of the leakage can be prevented with power-gating. There is still a delay to start up
a unit however, so the pipeline may stall while an unavailable unit is power-gated on.
[Plot omitted: y-axis "Unit utilisation per cycle" from 0.0 to 1.0; x-axis execution units ALU, ALU, ALU, ALU, MUL, MUL, MEM, MEM, FPADD, FPMUL.]
Figure 3.4: Utilisation of different execution units during a trace of the integer benchmark sjeng
[Plot omitted: y-axis "Unit utilisation per cycle" from 0.0 to 1.0; x-axis execution units ALU, ALU, ALU, ALU, MUL, MUL, MEM, MEM, FPADD, FPMUL.]
Figure 3.5: Utilisation of different execution units during a trace of the floating point benchmark cactusADM
Figures 3.4 and 3.5 show the utilisation of different execution units while executing a trace of
the sjeng and cactusADM benchmarks, which are integer and floating point benchmarks respectively. The
simulation framework is the same as used for the remainder of this work and is described in
detail in Chapter 8. It can be seen that some units have very low utilisation, which indicates
potential for power saving. Identifying the appropriate time to power-gate a unit is non-
trivial however, as the utilisation can vary significantly from cycle to cycle, which can result in
short power-gating periods where power overheads exceed savings. Restricting the number of
units also directly limits the amount of parallelism that can be exploited, which can affect the
performance of the entire processor.
As with caches, execution units can be power-gated at different granularities. Maro et al. power-
gate entire clusters of execution units in an architecture containing two identical integer clusters
and a floating point cluster [MBB01]. They propose three prediction policies for the integer
clusters, using different metrics to guide the prediction. The first measures the utilisation of
clusters and switches off a consistently under-used cluster. If the remaining cluster is consis-
tently over-used the second cluster is powered back on. The second method monitors the IPC
(number of committed instructions per clock) over a 512 cycle window to gauge the general
trend in parallelism and power-gate an integer cluster on or off accordingly. The final method
counts the number of dependencies between instructions in the issue queue as a guide to the
maximum achievable parallelism. An integer cluster is then power-gated when low parallelism
is detected. We showed earlier that the issue queue can be monitored to resize the length of the
issue queue in [FG01], but that method does not attempt to estimate the amount of parallelism,
rather the distance between instructions that are executed in parallel. Half of the floating point
cluster is power-gated when the integer clusters are power-gated, as the low integer ILP may
indicate low floating point ILP also. The entire floating point cluster is power-gated using a
timeout policy, if no floating point instructions are issued.
Power-managing clusters of units means that the approach in [MBB01] suffers some of the
same problems as coarse-grain power management, as some units inside a cluster will remain
on even if they are idle. Moreover, although the architecture can be scaled to match the
amount of parallelism, the units that are available will not be selected to match the types of
instruction that make up that parallelism. Each cluster contains a balance of unit types, which
may not be appropriate for instruction streams that are dominated by one type of instruction.
Therefore, power management at the component level should ideally monitor individual units
and power-gate each separately.
Hu et al. also find a timeout technique suitable for floating point units, and power-gate the
units if they have been idle for a predefined number of cycles [HBS+04]. They attribute large
savings and low performance losses for these units to long idle periods and a short break-even
time. Youssef et al. [YAE06] dynamically adjust the timeout at runtime for each functional
unit, based on the number of times the idle period exceeds the break-even time. The policy
is similar to the Cache Decay policy in [KHM01], in that the timeout value is increased if the
idle period is frequently less than the break-even time, so that the unit will remain on in future
and catch the next access. If the idle period is frequently longer than the break-even time, the
timeout value is decreased, to increase the energy savings.
In [LBBS09] it is shown that the timeout method in [HBS+04] increases the power consumption
of the processor when executing some benchmarks. By measuring the number of successful
instances (cycles a unit remains off after the break-even time) and the number of harmful
instances (number of times the unit is switched back on before the break-even time) over an
interval, they can determine if power-gating has been successful in saving power. If not, power-
gating is disabled for the unit over the next interval to mitigate losses. The authors also bound
the amount of loss using tokens that represent the leakage power of a unit over a cycle. The
acceptable power increase over a number of intervals is represented as a bag of these tokens,
and at the end of each interval a number of tokens representing the power savings or wastage
are added or removed from the bag respectively. Power-gating will only be enabled for the next
interval if there are enough tokens left in the bag to tolerate the worst possible wastage over
that interval.
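The token accounting described above can be sketched as a small bookkeeping class. The class and parameter names are hypothetical, one token stands for the leakage energy of one unit over one cycle as in the text, and the sketch assumes that savings and wastage are reported for every interval, including intervals in which power-gating was disabled (when both would be zero).

```python
class TokenBag:
    """Bounds the energy that power-gating may waste (a sketch; values are assumptions)."""

    def __init__(self, initial_tokens, worst_case_loss_per_interval):
        self.tokens = initial_tokens                 # acceptable loss still available
        self.worst_case = worst_case_loss_per_interval
        self.gating_enabled = self.tokens >= self.worst_case

    def end_interval(self, saved_tokens, wasted_tokens):
        """Add savings, remove wastage, then decide whether to gate in the next interval."""
        self.tokens += saved_tokens - wasted_tokens
        # Only gate if the bag can absorb the worst possible wastage of one interval.
        self.gating_enabled = self.tokens >= self.worst_case
        return self.gating_enabled
```

Because gating is only enabled when the bag can cover a worst-case interval, the token count never goes negative, which is what bounds the total loss.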
A further extension to [HBS+04] is proposed by Yeh et al. in [YCCY11]. The authors use
Predictive Pre-Wakeup to predict when a unit will be required in future, so that the unit can
be powered on early. As a result, instructions issued to the unit will not stall. The prediction
mechanism stores a cache containing recently visited branches, whether the branches were
taken, and the units that were used following these branches. When branch instructions are
executed and the branch direction resolved, the corresponding entry in the cache is selected, and
the information used to pre-wakeup any units that will be needed in the following instructions.
This technique shares some similarities with the compiler driven approaches described later, as
predictions are made at the basic block granularity, and the technique is limited to frequently
visited basic blocks due to the limited size and scope of a cache based approach.
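The branch-indexed prediction cache described above can be sketched as a small lookup table. This is an illustration rather than the design in [YCCY11]: the table capacity, the least-recently-used replacement policy and the unit names are assumptions.

```python
from collections import OrderedDict


class PreWakeupTable:
    """Small LRU table mapping (branch PC, taken) to the units used afterwards."""

    def __init__(self, capacity=64):
        self.capacity = capacity
        self.entries = OrderedDict()

    def train(self, branch_pc, taken, units_used):
        """Record which units were used after this branch outcome."""
        key = (branch_pc, taken)
        self.entries[key] = frozenset(units_used)
        self.entries.move_to_end(key)
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)   # evict the least recently used entry

    def predict(self, branch_pc, taken):
        """Units to wake once the branch direction is resolved (empty set if unknown)."""
        key = (branch_pc, taken)
        if key in self.entries:
            self.entries.move_to_end(key)
            return self.entries[key]
        return frozenset()
```

The limited capacity is what restricts such a scheme to frequently visited basic blocks: outcomes of rarely executed branches are evicted before they can be reused.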
The timeout techniques based on the work of Hu et al. will power-gate units on if the scheduler
attempts to issue instructions to a unit that is off. This will consume power and force the
instruction to stall until the unit is powered. If the unit is used infrequently however, it
may be advantageous to leave the unit off, and delay execution of the instruction so that the
computation can be performed on another unit that is already on. In the next chapter we
will describe how our Loop-Directed Mothballing technique achieves this, by monitoring the
utilisation of individual units.
Ikebuchi et al. fabricate Geyser-1, which is a fine-grained power-gated processor based on the
MIPS R3000 [ISK+09]. Execution units are power-gated with a single cycle delay, so units are
only powered on if an instruction detected earlier in the pipeline will need them. Units are
power-gated off after use, unless the compiler inserts an instruction indicating that the unit
will be reused soon. The short switch on delay allows power management to be applied with
complete accuracy. However, their chip operates at only 60MHz and increasing the operating
frequency will lead to switch on delays in the order of 10 cycles [ADSN06, BS02]. Hence, their
technique would be limited to processors with very low operating frequencies.
Ikebuchi et al. calculate the break-even time for power-gating, and use the compiler to detect
instances where the idle period of a unit is shorter than the break-even time and would therefore
result in increased power consumption. A fully compiler-driven approach is implemented in
[TSC06] and [RPOG02]. The control flow graph (CFG) of each application is analysed and
the execution unit requirements of each basic block are determined. Idle periods can
be predicted by comparing the requirements of consecutive blocks, and the optimal power
management choices can be communicated to the processor using special instructions. In
[TSC06] applications must be executed and profiled to augment the CFG with probabilities for
each branch target. This is used to give higher priority to the resource requirements of more
likely targets when determining the optimal power management strategy. Profiling can also be
used to identify hot blocks that have high execution frequency and apply power-gating only for
these blocks [RPOG02]. As with compiler-driven power-gating of caches, special power-gating
instructions can be placed to switch on units just before the basic block that requires them and
turn off units as soon as they have been used for the last time.
In [RRK09] a hybrid compiler/hardware approach is attempted, that only inserts special in-
structions to power units off. The authors assume a single cycle latency for powering a unit
on, so that if it is detected in the decode stage that a unit will be required, it can be powered
on without delay. However, as we have described previously, such a short power-on delay may
not be achievable at clock frequencies as high as 1GHz.
Execution units are used much more erratically than caches, so units must be power-gated
frequently and carefully. The pipeline can stall if a required unit is unavailable and must be
powered-up, and short periods of inactivity can result in overheads outweighing the savings.
Power-gating execution units is therefore much more difficult, and as a result, total energy
savings are lower. The compiler approach in [TSC06] reduces total chip energy by up to 18%,
but many benchmarks tested showed increased energy consumption. Power can also be saved
at the cost of performance, as in [MBB01], but we show later that when total energy and
performance are accounted for in the energy-delay product metric, a system with no power-
gating may perform better.
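The energy-delay product is the product of total energy and execution time, so a lower value is better. The figures in the usage note below are invented purely to show how an energy saving can be outweighed by a slowdown; they are not results from this work or from [MBB01].

```python
def energy_delay_product(energy_joules, delay_seconds):
    """Energy-delay product: lower is better."""
    return energy_joules * delay_seconds


def gating_is_worthwhile(base_energy, base_delay, gated_energy, gated_delay):
    """True only if power-gating lowers the energy-delay product."""
    return (energy_delay_product(gated_energy, gated_delay)
            < energy_delay_product(base_energy, base_delay))
```

For instance, a hypothetical 5% energy saving (10.0 J to 9.5 J) combined with a 10% slowdown (1.0 s to 1.1 s) gives `gating_is_worthwhile(10.0, 1.0, 9.5, 1.1) == False`.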
3.3.4 Compiler and profiling vs. hardware
Compiler techniques can be applied to application code to select appropriate points to power-
manage execution units. This allows decisions to be made based on knowledge of future resource
requirements, which can prevent units being power-gated off just before they will be needed
again. Gating a unit so briefly would incur all the overheads of power-gating but save little power. The break-even
time (BET) is the amount of time a unit must be off before the savings outweigh the overheads,
resulting in a net power saving.
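As a rough illustration of the break-even time, the sketch below uses a simplified model in which switching a unit off and on costs a fixed overhead energy and every gated cycle saves a fixed amount of leakage energy. This model and its parameter names are assumptions for the example and are not the equations of Section 2.4.

```python
import math


def break_even_cycles(overhead_energy_nj, leakage_power_mw, clock_mhz):
    """Cycles a unit must stay off before saved leakage covers the switching overhead."""
    leakage_per_cycle_nj = leakage_power_mw / clock_mhz   # mW / MHz = nJ per cycle
    return math.ceil(overhead_energy_nj / leakage_per_cycle_nj)


def net_saving_nj(idle_cycles, overhead_energy_nj, leakage_power_mw, clock_mhz):
    """Energy saved by gating for idle_cycles; negative if the period was too short."""
    leakage_per_cycle_nj = leakage_power_mw / clock_mhz
    return idle_cycles * leakage_per_cycle_nj - overhead_energy_nj
```

Under this model, a gating period shorter than the value returned by `break_even_cycles` yields a negative `net_saving_nj`, which is the case a power-gating policy tries to avoid.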
Analysis is often performed at the basic block level, as power-gating at a finer granularity is
unlikely to yield idle periods longer than the BET. To estimate the resource requirements of
a basic block, the data dependencies must be analysed to determine which instructions can
be executed in parallel, and therefore the maximum number of each unit that is required to
execute the basic block. Power-gating decisions are then made, taking the requirements of the
surrounding basic blocks into consideration. Profiling can also be used [TSC06, RPOG02] to
augment the basic blocks with frequency or likelihood of execution.
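The dependency analysis described above can be sketched as an as-soon-as-possible schedule over a basic block. The instruction encoding, the single-cycle latency, the unlimited issue width and the assumption that only true (read-after-write) dependences constrain issue are simplifications for illustration, and are not taken from [TSC06] or [RPOG02].

```python
from collections import Counter, defaultdict


def units_required(block):
    """Estimate the peak number of units of each type needed by a basic block.

    block is a list of (unit_type, dest_register, source_registers) tuples in
    program order. Registers defined outside the block are ready in cycle 0.
    """
    ready_cycle = {}                    # register -> cycle its value becomes available
    per_cycle = defaultdict(Counter)    # cycle -> count of each unit type issued
    for unit_type, dest, sources in block:
        # Issue as soon as every source operand has been produced.
        cycle = max((ready_cycle.get(src, 0) for src in sources), default=0)
        per_cycle[cycle][unit_type] += 1
        ready_cycle[dest] = cycle + 1
    peak = Counter()
    for counts in per_cycle.values():
        for unit_type, n in counts.items():
            peak[unit_type] = max(peak[unit_type], n)
    return dict(peak)
```

For a block with two independent ALU operations and a load that can all issue in the first cycle, followed by a dependent multiply and a further dependent ALU operation, the estimate is two ALUs, one multiplier and one memory unit.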
With a hardware approach, lack of knowledge about future instructions may mean that units
are turned off just before they are required again, resulting in a power-gating period shorter
than the BET. However, hardware techniques have access to runtime information about the
execution characteristics, which may indicate that some parallelism identified at compile-time
cannot be exploited. Hardware techniques will also not require recompilation of the code for
different architectures. This is a requirement of compiler methods, even if the architectures
support the same instruction set, as the number of available units may change.
3.4 SMT and efficient processing
Using the processor resources efficiently requires either switching off idle units, or utilising idle
units for other tasks. Power-gating achieves the former, and Simultaneous Multi-Threading
(SMT) is a mature technique that accomplishes the latter. SMT exploits thread-level paral-
lelism to increase the utilisation of the processor resources by executing instructions from two or
more instruction streams simultaneously [TEL95]. If there are insufficient issuable instructions
from one stream to fill the issue slots, instructions from the other stream (which will have no
data dependencies with the first stream) can be issued to fill the spare slots and utilise the
other execution units.
To support multiple instruction streams (threads), some resources in the pipeline must be
replicated and will increase the power consumption of the processor. Program counters, return
stacks, instruction retirement, instruction queue flush, trap mechanisms and the register file
must all be implemented separately for each thread (although using register renaming, simply
enlarging the register file suffices) [TEE+96]. SMT has five disadvantages from an efficiency
point-of-view:
• If the executing threads do not complement each other, they may not provide instructions
that match the available execution units.
• Data accesses from one thread may cause cache lines used by another to be moved to a
deeper cache level, increasing data movement and data access latencies.
• Additional pipeline resources to support multi-threading will consume extra power, and
increased complexity may limit the clock frequency or add pipeline stages [TEE+96].
• SMT is indiscriminate, and will execute multiple threads on a processor even if a single
thread has a large cache footprint or high throughput.
• Due to the limited issue width, some units will remain idle even if all the issue slots are
filled.
Power efficiency in a processor comes from high utilisation of resources while they are on and
consuming power. Efficient computation on the other hand, means that results are produced
with minimum effort. SMT increases the utilisation of resources and therefore power efficiency,
but decreases the computational efficiency by adding computation to arbitrate between threads
and move data because of cache pollution. Employing power-gating in a single-threaded pro-
cessor is more computationally efficient, as fewer extra resources need to be added and a single
thread will have exclusive access to caches. The resources that have low utilisation (including
caches) will be resized or disabled to match the specific requirements of the application, which
may result in better power efficiency as all idle units can be switched off.
3.5 Summary
This chapter compares many different approaches to power management. Section 3.1 describes
coarse-grain approaches which are applied to large devices which are required to complete
tasks as requested by software. Section 3.2 shows how some of these techniques can be adapted
to fine-grain components inside larger devices. Section 3.3 describes the architecture of a
superscalar pipeline and identifies execution units and Level 2 caches as the most power-hungry
components. Power management of different superscalar components is then discussed in
detail and compiler initiated power management is compared to the hardware only approach.
Section 3.4 compares power management to simultaneous multi-threading, which attempts to
utilise the idle periods of fine-grain components for other tasks rather than switching them off.
Devices at both coarse and fine granularities experience periods of idleness. Due to the short
idle periods of fine-grain devices (components), monitoring these periods and switching power
modes cannot be performed quickly enough by runtime software such as the operating system,
so hardware must be employed. Compiler approaches have the benefit of foresight when making
power management decisions so they can trigger the hardware to switch off components just
after they have been used for the last time and switch them on before they will be needed again.
However, hardware only techniques can use runtime information when making decisions and
generally perform better than compiler methods. The related work discussed in this chapter
is summarised in Table 3.2. Of the two most power-hungry components in the processor,
execution units and Level 2 cache, the cache is accessed in a more structured manner. Therefore
idle periods are easier to predict and power savings are normally larger than those seen for
execution units.
Simultaneous multi-threading (SMT) achieves power efficiency by using idle execution units to
execute instructions from other program threads. However, some execution units will always be
idle as the issue width is smaller than the total number of units. Power-gating can achieve better
power efficiency as it can switch off all units that are idle. Power-gating also achieves better
computational efficiency than SMT, as it avoids cache pollution and the added computation
required to arbitrate between executing threads.
The next chapter analyses the problem of power managing execution units, and develops a
new strategy to accurately predict the idle periods of the units so that they can be effectively
power-gated.
Reference Unit power-gated  Type               Notes
[BKAB03]  Issue queue       Hardware           Uses length of ROB as guide to parallelism
[FG01]    Issue queue       Hardware           Counts instructions issued from youngest part of issue queue
[JOAG05]  Issue queue       Compiler           Maximum issue queue length determined at compile time, special instructions added to modify length
[PPV02]   Register file     Hardware           Statically reduce register file size using bypass hints and decoupled renaming
[JOA+05]  Register file     Compiler           Flags single use registers, which can be powered off after use
[PYF+00]  Cache             Hardware           Monitors cache miss ratio and increases or decreases cache accordingly
[KFBM04]  I cache           Hardware           Only power on subbank containing currently executing instructions
[KFBM04]  D cache           Hardware           All data cache lines put to sleep every 4k cycles
[KHM01]   Cache             Hardware           Adaptive timeout policy based on idle period
[AGVO05]  Cache             Hardware           Adaptive timeout policy based on number of cache line accesses
[ALR02]   Cache             Hardware           New design for state preserving cache cell
[ZHD+02]  I cache           Compiler           Power-gating instructions inserted at loop granularity
[ZKKC03]  D cache           Compiler           Power-gating instructions inserted to power off cache lines containing objects that will not be accessed
[MBB01]   Execution unit    Hardware           3 mechanisms to gauge ILP and switch off cluster of units
[HBS+04]  Execution unit    Hardware           Static timeout policy
[YAE06]   Execution unit    Hardware           Adaptive timeout policy
[LBBS09]  Execution unit    Hardware           Static timeout policy with ability to disable power-gating and to add maximum power loss guarantees
[YCCY11]  Execution unit    Hardware           Static timeout policy with pre-wakeup
[ISK+09]  Execution unit    Hardware/Compiler  Actual implementation relying on single cycle power on delay
[TSC06]   Execution unit    Compiler           Profiling used to find branch probability
[RPOG02]  Execution unit    Compiler           Profiling used to identify hot blocks
[RRK09]   Execution unit    Hardware/Compiler  Instructions inserted only to power off units, hardware mechanism used to power units on
Table 3.2: Summary of related work on power-gating the superscalar pipeline
Chapter 4
Problem Analysis
The effectiveness of a power-gating strategy will be affected by the requirements of the system in
which the processor resides and many features of the instruction stream that is being executed.
This chapter analyses the application domains, analyses features of the instruction stream that
can assist in making predictions, and describes the features of superscalar architectures that
may complicate a power-gating strategy.
Using this analysis, four goals and one requirement will be defined that an effective power-
gating strategy should achieve and satisfy. We then describe our Loop-Directed Mothballing
(LDM) approach and show how it achieves these.
4.1 Fundamental power-gating goals
In Section 2.4 we provided equations to describe how energy is saved when a component is
power-gated and the energy costs of doing so. From these equations, we can derive three
fundamental goals for an execution unit power-gating strategy:
1. Long Periods - Units should remain power-gated for long periods to mitigate costs and
increase savings.
2. Performance - Units should only be power-gated when doing so will have a low impact
on performance.
3. Accuracy - Idle period prediction should be as accurate as possible.
The motivation behind the Long Periods goal stems from the power-on cost in terms of energy.
It is therefore favourable to power-gate for longer periods so that units are powered-on less often.
Performance loss is undesirable for any strategy, but it also increases energy consumption as
the other components in the processor must be powered for longer. The Performance goal is
therefore an important goal for our strategy. Accuracy is linked to the previous two goals, as
poor prediction accuracy can result in idle periods that are longer or shorter than expected,
which may mean that units are not power-gated for as long as they could be, or they are not
available when they are needed.
A complication with execution units that does not exist with most other devices is that there are
many units of each type in the superscalar architecture. A power-gating strategy has the option
of making more units available to complete requests faster, or limiting the number of available
units to save power. Inter-instruction data dependencies or delays while accessing memory may
mean that completing requests slower does not impact overall performance.
Given this complication, it is easier to think of the set of units that are required to complete a
portion of the instruction stream rather than the idle period of individual units. The set of units
that will be needed will depend on the types of the instructions that are being executed and
the amount of ILP. Both will vary greatly during program execution. Rather than predicting
the idle period of units, the set of required units can be predicted for a particular feature of
the instruction stream, such as a basic block. The Accuracy goal still holds, as inaccurate
predictions could limit performance by preventing all available ILP from being exploited, or
result in lower power savings if too many units are included in the prediction.
4.2 Analysis of application domains
Broadly speaking, computation systems can be classified as mobile systems, high performance
desktop systems and server systems. Each system type would benefit from lower power con-
sumption: Mobile devices could operate for longer after the batteries have been charged, high
performance systems would require less cooling, reducing noise and energy consumed, and the
electricity bill of server farms and data centres could be greatly reduced.
Power-saving strategies should be applied in the same way for all system types. None could
tolerate a large loss in performance, the processor architectures are similar, and each system
will normally run an operating system to enable multiprogramming. The Performance goal
states that low performance loss should be incurred with a power-gating technique, so aside
from the power savings, each system type should not experience much of a difference when
using a processor that can power-gate execution units.
The presence of an operating system creates a new goal for our power-gating strategy however.
To enable multiprogramming, the operating system will interleave execution of instructions
from two or more program threads. As each thread will need its own context (data stored in
the processor’s registers), switching between threads requires copying the context to memory
and loading the context of the new thread from memory. Due to the costs of this, switching
between threads occurs relatively infrequently.
A predictive strategy must monitor instruction execution to predict which execution units will
not be needed and can be turned off. This prediction will not be valid for a different thread,
however, so a new prediction must be made when the context is switched and a different thread
resumes execution. If the time to produce a prediction is large, this limits the opportunity
for power-gating to a potentially small period before the next context switch. Therefore, an
additional goal for our power-gating strategy is as follows:
4. Fast Predictions - Predictions should be made over as short a time period as possible
to maximise the savings that can be achieved before the next context switch.
4.3 Analysis of execution features
We have stated earlier that the set of execution units required to execute a portion of the
instruction stream depends on the types of instructions and ILP. More specifically, the set
depends on which particular types of instructions are executed together. The way in which
instructions are executed out-of-order by the superscalar pipeline can make predicting the
required set of execution units difficult. However, control flow features of the instruction stream
can be exploited to improve the accuracy of predictions. This section examines some of these
features of program execution to determine the effects they may have on a predictive power-
gating strategy.
4.3.1 Frequently executed basic blocks
In Section 3.3.3 we described an existing compiler driven strategy which power-gates execution
units only during frequently executed basic blocks (called hot blocks) [RPOG02]. An advantage
of this is that power-gating choices can be optimised for the basic blocks which account for the
majority of executed instructions. The set of units that are available during execution of other
basic blocks may be suboptimal, but these will account for a minor proportion of program
execution.
Predictive, runtime strategies can take advantage of hot blocks, as the blocks should exhibit
the same execution characteristics each time they are executed. Monitoring the execution units
used to execute a hot block will provide an accurate description of the set of execution units
required by the block. When the block is revisited in future, the execution units that were not
required last time can be power-gated off until the block has finished executing. It should be
noted that this is still a predictive strategy as the block may execute slightly differently each
time, due to out-of-order execution or memory operations missing in the cache.
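The reuse of a hot block's observed requirements can be sketched as follows. This is a minimal illustration under stated assumptions, not the mechanism of [RPOG02] or of LDM: the `BlockHistory` class, its method names and the unit labels are hypothetical, and the unit counts follow the processor model described in Chapter 5.

```python
# Hypothetical sketch: remember which execution units a basic block used on
# its last visit, and gate the remaining units when the block is revisited.
ALL_UNITS = frozenset({"ALU0", "ALU1", "ALU2", "ALU3", "MUL0", "MUL1",
                       "MEM0", "MEM1", "FPADD0", "FPMUL0"})


class BlockHistory:
    def __init__(self):
        # block start address -> set of units used on the last visit
        self._used = {}

    def record(self, block_addr, units_used):
        """Store the units observed during one execution of the block."""
        self._used[block_addr] = frozenset(units_used)

    def units_to_gate(self, block_addr):
        """Units that could be power-gated on the next visit to the block.

        With no history for the block, nothing is gated.
        """
        if block_addr not in self._used:
            return frozenset()
        return ALL_UNITS - self._used[block_addr]


history = BlockHistory()
history.record(0x4000, {"ALU0", "ALU1", "MEM0"})
gated = history.units_to_gate(0x4000)
```

The dictionary grows with the number of distinct blocks visited, which illustrates the storage cost discussed in the next paragraph.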
At runtime it is not possible to determine which blocks are hot until the blocks have been ex-
ecuted many times. Therefore, implementing this strategy requires storing execution statistics
for a potentially large number of basic blocks (as there may be many hot blocks in a large
loop). Basic blocks are also often short, which may result in power-gating units for a short
period. Although the Long Periods goal is not met, the remaining goals are, as accurate pre-
dictions are made for a block based on actual execution of the block, which should also limit
performance loss, and predictions can be made quickly. We will show in Section 4.4 how our
approach exploits hot blocks in loops to simplify the identification of hot blocks and meet the
Long Periods goal.
4.3.2 Basic block sequences and branch misprediction
To address the problem of short basic blocks, a compiler approach may consider the optimal
set of execution units for a sequence of blocks. In [TSC06], profiling information gathered at
runtime is used to augment the control flow graph (CFG) of basic blocks with probabilities of
each branch target. Using this information, the set of execution units for a group of blocks can
take into account different control flows through the blocks, assigning greater significance to
the more likely paths.
The Long Periods goal is satisfied using this approach, but the Accuracy goal is compromised,
with implications for Performance. Power-gating execution units over many basic blocks in-
creases the amount of time that units remain power-gated, but the strategy is less accurate
because of the presence of multiple control flows. Profiling information guides the technique
by giving the compiler insight into the execution of the program, but does not indicate how
sensitive the branch probabilities are to workload variation. Even if this was taken into account
by adding more information to the CFG, a static policy must be implemented by the compiler,
which will not be accurate for some workloads. Static policies also mean that the set of exe-
cution units cannot be tuned for individual control flow paths, and two likely paths with very
different requirements could result in few execution units being power-gated.
Another complication introduced by most superscalar architectures results from speculative ex-
ecution. The direction of branches is predicted at runtime, and instructions from the predicted
control flow path are executed speculatively before the actual branch target is known. This
can speed up execution by increasing the amount of parallelism, but mispredictions cause the
pipeline to flush and a delay while instructions from the correct branch target are fetched. In
[HBS+04], mispredictions are exploited to increase power savings, as the execution units will
remain idle until the correct instructions are fetched. When they are fetched, the pipeline will
stall while the required units are restarted, which will impact performance. In our approach,
we take advantage of branch mispredictions to hide some of the power-on delay when units are
switched on. This mitigates some of the performance loss that results from power-gating as no
instructions could execute on the units during this period anyway.
4.3.3 Out-of-order execution
Many superscalar architectures are able to increase the number of instructions that can be
executed each cycle by executing instructions out-of-order, as long as inter-instruction data
dependencies permit. This can have serious implications for a predictive strategy if instructions
are moved past basic block boundaries. Consider a situation where a basic block is predicted
to use no units of a particular type. If every unit of this type is switched off for the next visit
to the block, but an instruction from the preceding block that requires that type of unit has
moved past the block boundary, execution will stall, possibly indefinitely.
An additional requirement that ensures correct execution of instructions is as follows:
5. Correctness - It must be possible to detect situations where the units that are available
are insufficient to execute the types of instruction that are ready for issue, and power on
any units that are required to successfully complete program execution.
This requirement is essential for any power-gating policy that allows every unit of a particular
type to be power-gated in a superscalar architecture that implements out-of-order execution.
The effects of such a situation can still have detrimental consequences, as the pipeline may stall
while the required unit is switched on, and additional energy will be consumed in switching
on the unit. The ability to refine predictions over many instances of block execution can be
beneficial in such a scenario, so that the prediction can be modified to account for units needed
by instructions executed out-of-order.
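A minimal sketch of how the Correctness requirement might be checked at the issue stage is given below. The function name, the unit labels and the choice of which unit to wake are assumptions made for illustration only.

```python
# Hypothetical sketch of the Correctness check at the issue stage.
UNIT_TYPE = {"ALU0": "ALU", "ALU1": "ALU", "MUL0": "MUL", "MEM0": "MEM"}


def enforce_correctness(powered_on, prediction, ready_types):
    """Power on one unit for every ready instruction type that has none.

    powered_on and prediction are sets of unit names and are updated in
    place. Returns the list of units that had to be woken.
    """
    woken = []
    for t in sorted(ready_types):
        if any(UNIT_TYPE[u] == t for u in powered_on):
            continue  # at least one unit of this type is already available
        # Wake the first unit of the missing type (an arbitrary choice here).
        unit = min(u for u, ut in UNIT_TYPE.items() if ut == t)
        powered_on.add(unit)
        prediction.add(unit)  # remember the unit for later executions
        woken.append(unit)
    return woken


on = {"ALU0", "MEM0"}
pred = {"ALU0", "MEM0"}
woken = enforce_correctness(on, pred, {"ALU", "MUL"})
```

Adding the woken unit to the prediction corresponds to the refinement described above: the same stall should not recur on the next visit.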
4.4 The Loop-Directed Mothballing approach
Our runtime approach attempts to satisfy the goals and the requirement discussed previously,
by power-gating execution units only during loops. Instead of making predictions for the set of
execution units required to execute any basic block or sequence of blocks, predictions are made
for the blocks that make up a loop body. Units that are not required to execute the blocks will
remain off until the loop exits, which we can expect to take many iterations, thus satisfying
the Long Periods goal.
Achieving the Accuracy goal comes as a result of making predictions for the basic blocks based
on execution of the same basic blocks. Naive strategies that monitor ILP or unit usage over
a time interval to make a prediction for the following interval (such as those in [MBB01]) are
inherently inaccurate, as there are no assurances that the following interval will have similar
execution characteristics. In loops, hot blocks or hot sequences of blocks are executed repeatedly, so
a prediction can be made on the first loop iteration and applied to successive iterations. The
prediction does not need to be stored between loops, so large storage structures can be avoided.
In practice, it may be beneficial to cache a small number of predictions as loops will often be
revisited, especially if they are contained within another loop.
Accurate predictions cannot be made for basic blocks outside of loop bodies, as a prediction
cannot be made based on previous execution of the block without storing a history of execution
unit requirements for all blocks. As these blocks are likely to be executed infrequently however,
the power savings would be small. Using a less accurate strategy may allow these additional
savings to be achieved, but risks performance loss.
Many branch predictors, including the one used by the DEC Alpha 21264 on which our super-
scalar model is based, will not be able to accurately detect when a loop will exit. As a result,
the branch that exits a loop is normally mispredicted, resulting in a pipeline flush and a delay
while the correct instructions are fetched. In the DEC Alpha 21264 architecture this delay is 7
cycles. Our power-gating model has an 8-cycle power-on delay, which means that if the loop exit is
mispredicted, all units can be powered on for the non-loop code, and any units that
were power-gated during the loop will only be unavailable for one cycle. This, and the accuracy
of predictions, limits the performance degradation of our approach, meeting the Performance
goal.
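The one-cycle figure follows from the two delays quoted above. The short calculation below makes the arithmetic explicit; the function name is hypothetical.

```python
def exposed_power_on_cycles(power_on_delay, flush_penalty):
    """Cycles for which a waking unit is still unavailable once the
    pipeline has refilled after a mispredicted loop exit."""
    return max(0, power_on_delay - flush_penalty)


# Values from the text: 8-cycle power-on delay, 7-cycle misprediction delay.
exposed = exposed_power_on_cycles(power_on_delay=8, flush_penalty=7)
```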
The Fast Predictions goal is satisfied as predictions can be made over a single loop iteration, and
are applied immediately for the remaining iterations. The Linux kernel will switch processes
with the lowest static priority after they have been running for up to 5 ms [BC06]. After a
context switch, a prediction can be made for long loop bodies (that take thousands of cycles per
iteration), and the prediction could still be used for hundreds of iterations before the next context
switch. The Correctness requirement is satisfied by detecting an attempt to issue an instruction
of a type that has no units powered-on. A unit of the required type is powered on, and the
unit is added to the predicted set of execution units for the loop.
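The context-switch budget mentioned above can be checked with a short calculation. The clock frequency and the iteration length used here are assumptions chosen for illustration; only the 5 ms timeslice comes from the text.

```python
def iterations_per_timeslice(timeslice_us, clock_mhz, cycles_per_iteration):
    """Number of whole loop iterations that fit in one scheduling timeslice."""
    total_cycles = timeslice_us * clock_mhz  # microseconds * cycles per microsecond
    return total_cycles // cycles_per_iteration


# ASSUMPTIONS: a hypothetical 600 MHz clock and 5000 cycles per iteration.
iterations = iterations_per_timeslice(timeslice_us=5000, clock_mhz=600,
                                      cycles_per_iteration=5000)
# One iteration is spent forming the prediction; the rest can use it.
iterations_using_prediction = iterations - 1
```

Under these assumptions the prediction is formed once and then reused for several hundred iterations before the next context switch.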
4.4.1 Prediction refinement
The set of execution units necessary for execution of the loop body may be overestimated
when monitoring the execution units that are used during a loop iteration. The scheduler is
indiscriminate when issuing instructions to units, so may issue many instructions of the same
type in a particular cycle (for instance, four ALU instructions). In some cases a different order
of instructions may use fewer units, by delaying an instruction for a cycle so that it can reuse
a unit previously used by another instruction.
LDM uses a process of refinement to determine which units are necessary. Rather than power-
gating units that are not used, LDM power-gates off units that have utilisation below a certain
threshold. An assumption is made that the instructions executed on these units could be
executed at a different time on another unit without performance loss, permitting greater
power savings. LDM iteratively removes a unit of each type after each iteration if its utilisation
is below the threshold.
The assumption does not always hold however, so a mechanism is required to detect situations
where switching off a unit causes performance degradation. We use a second threshold for
the least-used unit of each type, such that if it is utilised more than this threshold during an
iteration, another unit of the same type is switched on. This refinement technique is discussed
later in Chapter 5, where it is compared to an alternative method of determining performance
loss.
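The two-threshold refinement can be sketched for a single unit type as follows. The threshold values are hypothetical, and the sketch simplifies the technique by always keeping at least one unit of the type powered on.

```python
SWITCH_OFF_THRESHOLD = 0.10  # hypothetical value
SWITCH_ON_THRESHOLD = 0.50   # hypothetical value


def refine(active, utilisation, max_units):
    """Return how many units of one type to keep on for the next iteration.

    active       -- units of this type currently powered on (at least 1)
    utilisation  -- fraction of cycles the least-used active unit was busy
    max_units    -- number of units of this type in the processor
    """
    if utilisation > SWITCH_ON_THRESHOLD and active < max_units:
        return active + 1  # least-used unit is busy: likely losing performance
    if utilisation < SWITCH_OFF_THRESHOLD and active > 1:
        return active - 1  # least-used unit is barely needed: gate it
    return active


# Refine the ALU count over four iterations of a loop.
active = 4
trace = []
for util in [0.02, 0.05, 0.30, 0.70]:
    active = refine(active, util, max_units=4)
    trace.append(active)
```

In this trace two ALUs are gated over the first two iterations, the count is stable while the least-used unit sits between the thresholds, and a unit is restored once its utilisation suggests contention.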
4.4.2 Changes in control flow
Instructions in the loop body are likely to contain some control flow from if-then-else and case
statements. Compiler driven power-gating must make static predictions for different control
flows, but LDM makes predictions at runtime, so different predictions can be made for different
control flows. In the next chapter we show that it is necessary to record the outcome of
conditional branches during a loop iteration, so that we can detect exit from a loop at runtime.
Any branch inside the loop, including the loop branch itself, may target an address outside
of the loop body, exiting the loop. We therefore assume that the loop has exited if the
internal control flow deviates from the flow seen in the previous iteration.
If the change in control flow resulted from if-then-else or case statements, the loop will still
be executing. If the change in control flow persists however, the loop will be redetected, and a
new prediction will be made for the new control flow over a few more iterations. Consequently,
different control flows through the loop will acquire their own predictions tailored to their
requirements.
In Chapter 6 the LDM approach is extended from the basic variant which considers only
innermost loops, so that execution units can be power-gated during all loops (ELDM). For
this, loop entry and exit points must be determined offline and provided to the processor. The
branch pattern is therefore not required and a single prediction is made for all control flows
within the loop body. ELDM enables power-gating to be applied to larger loops which are
part of a loop nest and are likely to have more complex control flows, so consecutive iterations
experiencing the same control flow are less common.
4.4.3 Conditions for application
For LDM to be applied to a processor, each execution unit must be connected to a virtual ground,
which can be isolated from the true ground using a power-gating transistor. LDM produces
binary signals as output, which will drive these transistors to switch them on or off, as required.
LDM is targeted at out-of-order superscalar architectures, as these allow rescheduling of in-
structions, and will offer more opportunities to power-gate units (as more units are normally
included in superscalar architectures). The technique could be applied to other processor types,
such as scalar processors or in-order processors, as units may still remain unused during loop
iterations. However, the full benefits of the technique may not be realised.
Although LDM can be applied in a multiprogramming environment provided by an operating
system, it cannot be applied in a multithreading environment. The accuracy of predictions
in LDM comes from consistent execution unit usage patterns across loop iterations. When
using multithreading this consistency is reduced, as instructions from a different thread may
be executed during an iteration. The execution unit requirements of the other thread will
introduce variations in the recorded requirements for the loop, which will reduce the accuracy
of the prediction.
Programs do not require recompilation for the basic form of LDM, as the entire LDM mecha-
nism, including loop detection, is implemented in hardware. For Extended Loop-Directed Moth-
balling recompilation is needed to identify loop entry and exit points, and insert instructions
to communicate this information to the processor. To implement these special instructions, the
instruction set architecture of the processor would need to be extended and the decode pipeline
stage would need to be modified to pass the information to the ELDM hardware.
4.5 Summary
This chapter looks at factors that might affect power-gating choices. Section 4.1 describes
three fundamental goals for an effective power-gating strategy. Section 4.2 discusses different
application domains and introduces a new goal when power-gating in a context switching
environment. Section 4.3 considers some features of program execution in detail, and how
power-gating strategies can exploit these features or must cope with their effects. In Section
4.4, our loop based approach is described, and it is shown how it satisfies all the goals and
requirements of an effective strategy.
The application domain includes mobile devices, high performance desktop systems and server
systems. As each would benefit from power savings, and as performance degradation is un-
desirable, an effective power-gating technique will be applied similarly across the application
domain.
The four goals for an effective strategy are Long Periods, Performance, Accuracy and Fast
Predictions. There is also a Correctness requirement that is introduced because of out-of-order
execution.
Our approach is called Loop-Directed Mothballing and achieves each of these goals and the
Correctness requirement. Units are power-gated during loops, so are expected to remain off for
almost the entire execution of the loop. Performance degradation is likely to be less than that
seen in naive strategies, due to accurate predictions for basic blocks made during execution of
the same blocks. Also, most of the delay of powering units on at loop exit can be hidden if the
exit is caused by a branch misprediction, as newly fetched instructions would not be ready for
execution for several cycles anyway.
A prediction for the set of units that is required can be produced over a single loop iteration and
is applied immediately, so the strategy should work well in a context switching environment.
In situations where out-of-order execution results in instructions in the issue stage that require
unavailable units, these instructions are detected, the unit is powered on and the prediction is
modified to reflect that an additional unit is required.
LDM allows units to be power-gated off even if they were used during the previous iteration.
If their utilisation falls below a threshold, it is assumed that the instructions executed on the
unit could be executed at a different time on another unit without performance loss. A second
threshold is used as an indicator of performance loss, and allows a unit to be turned back on.
The prediction is refined over several iterations using these thresholds.
Tailored predictions are made for different control flows by effectively treating different control
flows through a loop as independent loops. With larger loops this would be less effective, as
successive loop iterations are less likely to have the same control flow. Therefore the ELDM
variant of LDM presented in Chapter 6 makes a single prediction for all control flows within a
particular loop.
The next two chapters describe a basic variant of LDM (referred to as LDM) and ELDM. The
basic variant applies the LDM strategy to power-gate execution units during innermost loops,
using a runtime loop detection mechanism. ELDM allows the LDM strategy to be applied to
all loops, by determining loop entry and exit points offline.
Chapter 5
Loop-Directed Mothballing
The basic form of Loop-Directed Mothballing (LDM) is applied only to innermost loops, as
these are simple to detect at runtime and are likely to permit power-gating of more units. We
will show that despite this restriction, we can apply our power-gating strategy for large portions
of many benchmark traces. Implementing LDM requires additional components to be added
to the superscalar architecture, as well as a policy for deciding which units are unnecessary for
execution of a loop. In this chapter we describe the implementation of the LDM technique and
demonstrate its operation with examples.
LDM was first presented at COOL Chips XIV in April 2011 [CK11a], and later published in
more detail in IEEE Micro [CK11b].
5.1 Introduction
The set of execution units required to execute a loop is determined by the union of the sets
of units used on each cycle. In our model, based on the DEC Alpha 21264, we have 4 ALU
units, 2 MUL units, 2 MEM units, 1 FP ADD unit and 1 FP MUL unit. If a particular type
of instruction is issued in any cycle, at least one unit of that type must be in the required set
for the loop. In the best case scenario, exactly the same set of execution units will be used
each cycle and only four units will need to remain on for the loop. The more the sets vary
from cycle to cycle, the more execution units will be in the union of the sets that determines
the requirements of the entire loop. In the worst case, every unit of a particular type will be
used in some cycles, and the union will contain all 10 units. Figure 5.1 demonstrates the union
between the sets of units used on each cycle for different instruction schedules.
Figure 5.1: Three schedules showing possible sets of instructions issued on each cycle of a loop
and the union of these sets which determines the set of units that must be available to execute
the loop. The worst case schedule has maximum variation between the sets, resulting in an
execution unit requirement twice as large as the best case scenario.
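The union described above can be computed directly from a per-cycle record of the units used. The two schedules below are illustrative and are not the schedules shown in Figure 5.1; the unit labels are hypothetical names for the 10 units of the model.

```python
def required_units(schedule):
    """Union of the sets of units used in each cycle of a loop iteration."""
    required = set()
    for units_this_cycle in schedule:
        required |= set(units_this_cycle)
    return required


# Same four units every cycle: only those four must stay on.
steady = [{"ALU0", "ALU1", "MEM0", "MUL0"}] * 3

# A different mix each cycle: the union grows to cover every unit.
varied = [
    {"ALU0", "ALU1", "ALU2", "ALU3"},
    {"MEM0", "MEM1", "MUL0", "MUL1"},
    {"ALU0", "MEM0", "FPADD0", "FPMUL0"},
]

steady_count = len(required_units(steady))
varied_count = len(required_units(varied))
```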
To increase power savings we will use thresholds to power-gate units that are used infrequently,
rather than units that are not used at all. This allows for some flexibility when all execution
units of a type are used in a cycle, as some units of this type may still be switched off if this is
a rare occurrence.
In the basic form of LDM, we will power-gate execution units during innermost loops only.
Innermost loops are likely to be smaller, and take fewer cycles to execute an iteration. Therefore
the number of sets of execution units that contribute to the union of sets will be smaller. With
fewer contributing sets, the union between the sets is likely to contain fewer execution units,
and more units can be power-gated as a result. Another advantage of innermost loops is that
they are easy to detect at runtime. We describe the mechanism for this in Section 5.2.
To assess the proportion of executed instructions that are from inside innermost loops, and
therefore the scope of power savings from LDM, traces from 16 SPEC CPU2006 benchmarks
each containing 1 million instructions were simulated and the coverage was measured (Figure 5.2).
The benchmarks and simulation framework are described in Chapter 8. From the
coverage we can see that for 10 of the traces, innermost loop execution accounts for over 75%
of executed instructions. The remaining 6 traces have low coverage, so for these we will expect
to see minimal power savings.
[Figure 5.2 is a bar chart. Vertical axis: Coverage (%), 0 to 100. Horizontal axis: 400.perlbench,
401.bzip2, 403.gcc, 429.mcf, 433.milc, 435.gromacs, 436.cactusADM, 437.leslie3d, 445.gobmk,
456.hmmer, 458.sjeng, 462.libquantum, 464.h264ref, 470.lbm, 482.sphinx3, 998.specrand.]
Figure 5.2: Coverage of executed instructions that are from innermost loops.
5.2 LDM overview
The key tasks that must be carried out to perform LDM are outlined in Algorithms 1 and 2
and are described in this section. The EveryCycle function is completed every cycle, and most
of the key stages can be executed in parallel. The ApplyPrediction function is only called at
the end of a loop iteration.
Loop detection. As we focus on innermost loops, a loop can be detected by two successively
executed backwards branches that have the same branch address. The loop body consists of the
instructions at the addresses between the branch target and the branch address. The branch
instruction also marks the separation between iterations. Whenever a backwards branch is
detected, the utilisation counters are reset in case the branch forms part of the next loop or
iteration.
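A minimal software sketch of this detection rule is shown below. The class and method names are hypothetical, a branch is treated as backwards when its target precedes its own address, and the forward branch pattern check described later in this section is omitted.

```python
class LoopDetector:
    """Detects innermost loops from the stream of executed branches."""

    def __init__(self):
        self.previous_backwards_address = None

    def on_branch(self, address, target):
        """Return True when this branch completes an innermost-loop iteration,
        i.e. it is a backwards branch at the same address as the previous one."""
        if target >= address:
            return False  # forward branches do not delimit iterations
        detected = (address == self.previous_backwards_address)
        self.previous_backwards_address = address
        return detected


detector = LoopDetector()
branches = [
    (0x120, 0x100),  # backwards branch, first sighting
    (0x110, 0x118),  # forward branch inside the loop body
    (0x120, 0x100),  # same backwards branch again: loop detected
    (0x200, 0x180),  # a different backwards branch
    (0x120, 0x100),  # no longer successive with the earlier 0x120 branch
]
results = [detector.on_branch(a, t) for a, t in branches]
```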
Record utilisation. Counters are used to record the utilisation of execution units by incre-
menting each cycle if a unit is used. The total number of cycles for the iteration is also recorded
so the counts can be converted to percentage utilisation in ApplyPrediction.
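The counters can be modelled in software as below. This is a sketch with hypothetical names; in hardware each counter would be a fixed-width register rather than a Python integer.

```python
class UtilisationCounters:
    """Per-unit busy-cycle counters for one loop iteration."""

    def __init__(self, units):
        self.counts = {u: 0 for u in units}
        self.total_cycles = 0

    def tick(self, active_units):
        """Called once per cycle with the set of units that were used."""
        for u in active_units:
            self.counts[u] += 1
        self.total_cycles += 1

    def percentages(self):
        """Convert the counts to percentage utilisation for the iteration."""
        if self.total_cycles == 0:
            return {u: 0.0 for u in self.counts}
        return {u: 100.0 * c / self.total_cycles
                for u, c in self.counts.items()}

    def reset(self):
        """Clear all counters at the start of the next iteration."""
        for u in self.counts:
            self.counts[u] = 0
        self.total_cycles = 0


counters = UtilisationCounters(["ALU0", "ALU1"])
for active in [{"ALU0", "ALU1"}, {"ALU0"}, {"ALU0"}, {"ALU0"}]:
    counters.tick(active)
pct = counters.percentages()
```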
Algorithm 1 EveryCycle, structures and signals (LDM)
int16 utilisationCounters[NUMBER_OF_UNITS]    ▷ Storage Structures
int16 totalCycles
int previousBackwardsAddress
bool forwardBranchPattern[FORWARD_PATTERN_LENGTH]
int8 forwardBranchNumber
int8 previousForwardBranches
bool forwardBranchMatch
bool isFirstIteration
RAM int tableBranchAddress[TABLE_SIZE]    ▷ Prediction Table
RAM bool tableForwardBranchPattern[TABLE_SIZE][FORWARD_PATTERN_LENGTH]
RAM bool tablePrediction[TABLE_SIZE][NUMBER_OF_UNITS]
RAM bool tableGuaranteeOn[TABLE_SIZE][NUMBER_OF_UNITS]
bool unitActive[NUMBER_OF_UNITS]    ▷ Input Signals From Processor
bool branchExecuted
bool branchDirection
bool branchTaken
int branchAddress
bool issueFailed[NUMBER_OF_UNIT_TYPES]
bool powerState[NUMBER_OF_UNITS]    ▷ Output Signals To Processor
function EveryCycle
    for all u in NUMBER_OF_UNITS do    ▷ Record Utilisation
        utilisationCounters[u] += unitActive[u]
    end for
    totalCycles++    ▷ Record Total Number Of Cycles
    if branchExecuted and branchDirection = BACKWARDS then    ▷ Loop Detection
        if branchAddress = previousBackwardsAddress and (isFirstIteration
                or (forwardBranchMatch and previousForwardBranches = forwardBranchNumber)) then
            ApplyPrediction()    ▷ Completed A Loop Iteration
            isFirstIteration ← false
        else
            for all u in NUMBER_OF_UNITS do    ▷ May Have Exited Loop
                powerState[u] ← on
            end for
            isFirstIteration ← true
        end if
        previousBackwardsAddress ← branchAddress    ▷ Reset For Next Iteration
        for all u in NUMBER_OF_UNITS do
            utilisationCounters[u] ← 0
        end for
        totalCycles ← 0
        previousForwardBranches ← forwardBranchNumber
        forwardBranchNumber ← 0
        forwardBranchMatch ← true
    end if
    if branchExecuted and branchDirection = FORWARDS then    ▷ Loop Exit Detection
        if branchTaken ≠ forwardBranchPattern[forwardBranchNumber] then
            forwardBranchMatch ← false
            forwardBranchPattern[forwardBranchNumber] ← branchTaken
            for all u in NUMBER_OF_UNITS do    ▷ May Have Exited Loop
                powerState[u] ← on
            end for
        end if
        forwardBranchNumber++
    end if
    for all t in NUMBER_OF_UNIT_TYPES do    ▷ Check For Failed Attempts To Issue Instruction
        if issueFailed[t] then
            index ← Hash(previousBackwardsAddress, forwardBranchPattern)
            u ← min(t)    ▷ Get The First Unit Of This Type
            powerState[u] ← on
            tablePrediction[index][u] ← on
            tableGuaranteeOn[index][u] ← true
        end if
    end for
end function
Algorithm 2 ApplyPrediction (LDM)
function ApplyPrediction
    index ← Hash(branchAddress, forwardBranchPattern)
    if branchAddress = tableBranchAddress[index] and forwardBranchPattern = tableForwardBranchPattern[index] then
        if isFirstIteration then    ▷ Apply Existing Prediction
            for all u in NUMBER_OF_UNITS do
                powerState[u] ← tablePrediction[index][u]
            end for
        end if
        for all t in NUMBER_OF_UNIT_TYPES do    ▷ Update Existing Entry
            luu ← FindLeastUsed(t, utilisationCounters)    ▷ Get The Least Used Unit Of Type
            if !tableGuaranteeOn[index][luu] then
                ▷ Test Power Off Threshold And Power-Gate Unit
                if luu = min(t) and utilisationCounters[luu] = 0 then
                    powerState[luu] ← off
                    tablePrediction[index][luu] ← off
                else if luu ≠ min(t) and utilisationCounters[luu]/totalCycles < SWITCH_OFF_THRESHOLD then
                    powerState[luu] ← off
                    tablePrediction[index][luu] ← off
                end if
                ▷ Test Power On Threshold And Power-Gate Unit
                if utilisationCounters[luu]/totalCycles > SWITCH_ON_THRESHOLD and luu ≠ max(t) then
                    powerState[luu + 1] ← on    ▷ Power Gate On Next Unit Of Same Type
                    tablePrediction[index][luu + 1] ← on
                    tableGuaranteeOn[index][luu + 1] ← true
                end if
            end if
        end for
    else
        tableBranchAddress[index] ← branchAddress    ▷ Create New Entry
        tableForwardBranchPattern[index] ← forwardBranchPattern
        for all u in NUMBER_OF_UNITS do
            tablePrediction[index][u] ← on
            tableGuaranteeOn[index][u] ← false
        end for
    end if
end function
Apply thresholds. In ApplyPrediction the set of required units for the loop is predicted based
on the recorded utilisation of each unit. Unit utilisation can be used to determine the union of
the sets of units used in each cycle, as the union will contain all units with non-zero utilisation
during an iteration of the loop. To increase power savings, we use utilisation thresholds to
remove infrequently used units from this union and to add units if unit utilisation indicates
performance loss (see Section 5.3). We will only power-gate the last unit of a type if it has zero
utilisation, however. We refer to the set of units created using the thresholds as the prediction
for the loop.
Power-gate units. The prediction is applied by signalling to the processor which units should
be on or off. During a loop, units may be powered back on if the prediction changes and more
units are required than are currently active.
The record utilisation task runs every cycle, updating the counter associated with each
unit. At the end of each iteration, the apply thresholds and power-gate units tasks compare the
utilisation to two thresholds to produce a new prediction, and then power-gate units according
to the prediction.
Detect loop exit. Loop exit will either be caused by a branch inside the loop that targets
an address outside of the loop, or the loop branch itself not being taken. The latter is simple
to detect as the loop branch address is known. To detect the former, the outcome of the
condition of all conditional forward branches in the loop body is recorded in a forward branch
pattern. When detecting an innermost loop, the forward branch pattern is reset each time a
backwards branch is executed. We record the pattern until the next backwards branch, and
if the backwards branch address matches the address of the previous backwards branch, we
have found an innermost loop and have a pattern of intermediate forward branches that is
guaranteed not to exit the loop. We check forward branches during successive iterations, and
any observed deviation from the recorded pattern is assumed to be a loop exit. We describe
the consequences of false loop exits in Section 5.2.1.
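The detection scheme described above can be sketched in software. The following is a minimal illustrative model, not the hardware described in this thesis: the class name, the event strings and the example addresses are all assumptions made for the sketch, and a deviation is checked against the recorded pattern one forward branch at a time.

```python
from typing import List, Optional, Tuple


class InnermostLoopDetector:
    """Hypothetical software model of innermost loop detection (illustration only)."""

    def __init__(self) -> None:
        self.last_backward: Optional[int] = None  # address of the previous backwards branch
        self.current: List[bool] = []             # forward branch outcomes since that branch
        self.recorded: Tuple[bool, ...] = ()      # pattern known to stay inside the loop
        self.in_loop = False

    def on_branch(self, address: int, backwards: bool, taken: bool) -> str:
        """Process one executed branch; returns 'detected', 'iteration', 'exit' or 'none'."""
        if not backwards:
            self.current.append(taken)
            prefix = self.recorded[:len(self.current)]
            if self.in_loop and tuple(self.current) != prefix:
                self.in_loop = False              # any deviation is assumed to be a loop exit
                return "exit"
            return "none"

        same_branch = address == self.last_backward
        pattern = tuple(self.current)
        self.current = []                         # the pattern is reset at every backwards branch
        self.last_backward = address if taken else None

        if taken and same_branch:
            if self.in_loop:
                return "iteration"                # iteration boundary of the known loop
            self.in_loop = True
            self.recorded = pattern               # forward branches that do not exit the loop
            return "detected"

        was_in_loop = self.in_loop                # loop branch not taken, or a different branch
        self.in_loop = False
        return "exit" if was_in_loop else "none"


detector = InnermostLoopDetector()
trace = [
    (0x40, True, True),    # loop branch seen for the first time
    (0x20, False, False),  # forward branch, not taken
    (0x40, True, True),    # same backwards branch again: innermost loop found
    (0x20, False, False),  # matches the recorded pattern
    (0x40, True, True),    # next iteration boundary
    (0x20, False, True),   # deviation from the recorded pattern
]
events = [detector.on_branch(*branch) for branch in trace]
```

In this sketch the state kept after a deviation allows the loop to be detected again, with the new pattern, at the next matching backwards branch.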
Prediction Storage. The current prediction is stored in a dedicated cache called a prediction
table. The table has many entries and is indexed by both the branch address and the forward
branch pattern, which allows the prediction to be reused if the loop is revisited in future. In
ApplyPrediction an existing prediction is applied on the first iteration of the loop.
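As an illustration of how such a table might be organised, the sketch below models a direct-mapped prediction table in software. The table size, the number of units and the hash function are assumptions chosen for the example; the thesis does not specify them here.

```python
from dataclasses import dataclass
from typing import List, Optional

TABLE_SIZE = 64        # assumed capacity, for illustration only
NUMBER_OF_UNITS = 10   # assumed: 4 ALU, 2 MUL, 2 MEM, 2 FP


@dataclass
class Entry:
    branch_address: int
    forward_branch_pattern: int
    prediction: List[bool]      # True means the unit should be on
    guarantee_on: List[bool]    # True means the unit may not be switched off


def table_index(branch_address: int, pattern: int) -> int:
    """Hypothetical hash combining the loop branch address and the pattern."""
    return (branch_address ^ (pattern * 31)) % TABLE_SIZE


class PredictionTable:
    def __init__(self) -> None:
        self.entries: List[Optional[Entry]] = [None] * TABLE_SIZE

    def lookup(self, branch_address: int, pattern: int) -> Optional[Entry]:
        """Return the stored entry only if both tags match."""
        entry = self.entries[table_index(branch_address, pattern)]
        if (entry is not None and entry.branch_address == branch_address
                and entry.forward_branch_pattern == pattern):
            return entry
        return None

    def lookup_or_create(self, branch_address: int, pattern: int) -> Entry:
        """On a miss, create a new entry with every unit predicted on."""
        entry = self.lookup(branch_address, pattern)
        if entry is None:
            entry = Entry(branch_address, pattern,
                          [True] * NUMBER_OF_UNITS, [False] * NUMBER_OF_UNITS)
            self.entries[table_index(branch_address, pattern)] = entry
        return entry


table = PredictionTable()
first_flow = table.lookup_or_create(0x0FF4, 0b0011)
first_flow.prediction[9] = False                       # refine the first control flow
second_flow = table.lookup_or_create(0x0FF4, 0b101)    # same loop branch, different flow
```

Because the pattern is part of both the index and the tag, the two control flows through the same loop keep separate predictions in this model.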
Power-gate all units on. When executing code outside an innermost loop, all units are power-gated
on. In this state the processor cannot perform worse than a processor without power-gating.
Since we do not have an accurate prediction of the resource requirements of such code, power-gating
units off could be detrimental to both performance and energy consumption.
5.2.1 False loop exits
Loop exit is assumed whenever the target of a conditional forward branch inside the loop
changes. This ensures that resource requirement predictions made for the loop are not applied
inappropriately outside of the loop, but can cause false loop exit detection when if-then-else
and case statements are executed. If one of these statements changes the control flow and the
change persists, however, the loop will be redetected with the new control flow and a
new prediction will be made for the loop. Different flows will therefore yield predictions that
are specific to the instructions executed. As predictions are stored and indexed by both
branch address and forward branch pattern, the new prediction will not overwrite the prediction
for the original control flow.
[Figure: control flow graph with nodes A, B, C, D and E, and two bar-chart panels plotting Utilisation (0% to 100%) of the ALU, MUL, MEM and FPU units for the paths A,C,D,E,... and A,B,E.]
Figure 5.3: Example control flow graph, showing conditional branches inside an innermost loop.
The unit utilisation is shown for two paths.
Figure 5.3 illustrates a scenario where different predictions for different control flows would
be beneficial. The left control flow contains no instructions that require the FPUs, but the
right flow requires both FPUs. A single prediction for the loop could not accurately capture
the requirements of both flows; LDM avoids this problem by making control flow specific
predictions.
5.3 Finding the set of required execution units
The out-of-order scheduler in superscalar processors issues instructions as soon as possible,
leading to high variation in ILP due to data dependencies. For instance, before a data dependency
is resolved ILP may be low as instructions are blocked from issue, and after resolution
the blocked instructions and subsequent instructions may cause a burst of high ILP.
At runtime, a burst of high ILP may indicate that many units are required to maintain the
same performance. However, it might also be possible to delay instructions in the schedule and
achieve the same performance with fewer units. In Chapter 4 we state that low impact on
performance is an important goal for a power-gating strategy. If we simply take the union of
required units during a loop iteration and power-gate only units that are never used, we will
not compromise performance, but bursts of high ILP may limit the power savings that can be
achieved.
Our goal, therefore, is to identify units that were used in the previous iteration of the loop
and can be power-gated without causing unacceptable performance loss. The remainder of this
section considers two different approaches to achieve this goal.
5.3.1 Exhaustive approach
One solution to the problem is to progressively change the set of available units over individual
iterations of the loop and measure the time needed to execute each iteration. Any set that
increases execution time is disregarded and the remaining set with the fewest units is applied
for the remainder of the loop by power-gating the appropriate units.
Two problems affect the effectiveness of this approach:
• Exhaustively testing each configuration will require a large number of iterations and would
therefore limit savings, even if impossible or unlikely configurations are not considered.
• During our simulations (for instance the example in Section 5.5), we have observed a
variation in execution time across loop iterations when using the same set of units, due to
memory accesses hitting or missing in the cache and out-of-order instruction forwarding
causing inconsistency in the first few iterations. Good configurations coinciding with
either event would be disregarded.
5.3.2 Threshold approach
The utilisation of individual units provides information about the maximum available ILP and
the prevalence of that level of ILP during a loop iteration. Consider four identical ALUs in a
processor. If all four units are highly utilised, there exists ILP of four instructions for the majority
of the code. If all four are utilised, but one is used very infrequently, there exists ILP of four
instructions, but it may be possible to power-gate off one unit and have the other units absorb
the workload without affecting performance.
We propose a threshold-based approach that switches off a unit if it is used less than a preset
switch off threshold. A formal definition of the switch off threshold is as follows:
Definition 5.1 (switch off threshold) A switch off threshold is set such that the least-used
unit (LUU) of a particular type (ALU, MUL, MEM, FP ADD, FP MUL) is power-gated off if
its utilisation is lower than the threshold, except where the LUU is the last active unit of the
type, when the unit may only be power-gated off if it has zero utilisation.
The exception in the definition ensures that at least one unit of a type remains on if there is
at least one instruction that requires it.
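Definition 5.1 can be expressed as a small decision function. This is a minimal sketch under stated assumptions: the function and parameter names are hypothetical, utilisation is taken to be busy cycles divided by the cycles in the iteration, and the 5% threshold is only an example value.

```python
def should_switch_off(busy_cycles: int, total_cycles: int,
                      is_last_active_unit: bool,
                      switch_off_threshold: float) -> bool:
    """Decide whether the least-used unit (LUU) of a type may be power-gated off.

    busy_cycles and total_cycles are assumed to be the LUU's utilisation counter
    and the number of cycles in the iteration just completed.
    """
    if total_cycles <= 0:
        return False                      # no complete iteration has been measured yet
    if is_last_active_unit:
        return busy_cycles == 0           # the last unit of a type must be entirely unused
    return busy_cycles / total_cycles < switch_off_threshold


THRESHOLD = 0.05  # example value only

rarely_used = should_switch_off(1, 55, False, THRESHOLD)       # about 1.8% utilisation
moderately_used = should_switch_off(1, 5, False, THRESHOLD)    # 20% utilisation
last_unit_idle = should_switch_off(0, 5, True, THRESHOLD)
last_unit_used = should_switch_off(1, 55, True, THRESHOLD)
```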
We stated in the previous section that it is difficult to detect if power-gating a unit has affected
the execution time of the loop. However, high utilisation of the new least-used unit (LUU)
after power-gating a unit may be an indicator: If the second least-used unit (2LUU) has high
utilisation before power-gating off the LUU of the same type, there are few opportunities in the
schedule of the 2LUU (and 3LUU etc.) to accommodate the instructions currently executed on
the LUU. In this case there is an increased probability of performance loss. Setting a second
threshold for the new LUU, which switches on another unit of the same type if the new LUU is
highly used, may reduce performance loss. A formal definition of the switch on threshold is as follows:
Definition 5.2 (switch on threshold) A switch on threshold is set such that the least-used
unit (LUU) of a type exhibiting utilisation above the threshold initiates power-gating on another
unit of the same type (if available).
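A corresponding sketch of Definition 5.2 is given below. The names are hypothetical, and choosing the first inactive unit of the type is an assumption made for this illustration; the definition itself only requires that another unit of the same type is switched on if one is available.

```python
from typing import List, Optional


def unit_to_switch_on(active: List[bool], luu_busy_cycles: int,
                      total_cycles: int,
                      switch_on_threshold: float) -> Optional[int]:
    """Return the index of a unit of this type to power-gate on, or None.

    active[i] is True if unit i of the type is currently on; luu_busy_cycles is
    the utilisation counter of the least-used active unit.
    """
    if total_cycles <= 0:
        return None
    if luu_busy_cycles / total_cycles <= switch_on_threshold:
        return None                        # the LUU still has slack in its schedule
    for unit, is_on in enumerate(active):
        if not is_on:
            return unit                    # assumed policy: first inactive unit of the type
    return None                            # every unit of the type is already on


THRESHOLD = 0.70  # example value only

busy_single_unit = unit_to_switch_on([True, False], 7, 7, THRESHOLD)
all_units_on = unit_to_switch_on([True, True], 7, 7, THRESHOLD)
lightly_loaded = unit_to_switch_on([True, False], 3, 5, THRESHOLD)
```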
As it may be beneficial to sacrifice performance to achieve power-savings in some circumstances,
we will evaluate the effect of different thresholds on the energy-delay product in Chapter 9.
The two problems facing the exhaustive testing approach discussed earlier are solved by the
threshold technique:
• New configurations are applied and refined quickly, as many units may be power-gated
on or off in parallel after each iteration.
• Variation in execution time from instruction forwarding or memory accesses will only
have short term effects as the thresholds are continually applied to refine the prediction
of required units.
Out-of-order anomalies and flip-flopping
In Chapter 4 we state that out-of-order instruction forwarding may cause instructions to cross
an iteration boundary and result in zero utilisation for the required unit. If this unit was the
last of its type it would be powered off (see Definition 5.1), even though it will be required in
the next iteration. Any instruction requiring a unit that is off must therefore be detected at
instruction issue, and the required unit power-gated on.
Flip-flopping occurs when a unit is repeatedly power-gated on and off in successive iterations,
due to low utilisation of the LUU but high utilisation of the 2LUU. This is undesirable as
power-gating a unit on has an associated energy cost as described in Section 2.4. In the event
of flip-flopping, the unit should remain on despite its low utilisation, to err on the side of
limiting performance degradation.
A mechanism to prevent a unit being switched off due to instruction forwarding when the same
loop is visited in future, and to prevent flip-flopping, is a guarantee on flag defined as follows:
Definition 5.3 (guarantee on flag) For each detected loop, a set of guarantee on flags is
recorded (one flag per execution unit), which signify whether the unit has been restored to the on state
either due to the unit being required at instruction issue or due to the switch on threshold. Any
unit flagged to be guaranteed on may not be power-gated again due to the switch off threshold,
during the current or future executions of the loop with which the flags are associated.
The guarantee on flag also prevents units being power-gated off due to memory delays. The
example in Section 5.5 shows a memory unit being removed from the prediction when it ex-
periences low utilisation during a cache miss. The unit is returned to the prediction after the
miss, and as a result its guarantee on flag is set. During future cache misses, this unit will not
be power-gated off again, as the flag has been set.
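The effect of the guarantee on flag can be illustrated with a small model of one loop's prediction state. The class and method names are assumptions made for this sketch and do not correspond to structures named in the thesis.

```python
from typing import List


class LoopPrediction:
    """Hypothetical per-loop state: one prediction bit and one flag per unit."""

    def __init__(self, number_of_units: int) -> None:
        self.prediction: List[bool] = [True] * number_of_units
        self.guarantee_on: List[bool] = [False] * number_of_units

    def try_switch_off(self, unit: int) -> bool:
        """Apply a switch off decision unless the unit is guaranteed on."""
        if self.guarantee_on[unit]:
            return False
        self.prediction[unit] = False
        return True

    def restore(self, unit: int) -> None:
        """Unit was needed at issue, or the switch on threshold fired."""
        self.prediction[unit] = True
        self.guarantee_on[unit] = True     # never switched off again for this loop


mem_units = LoopPrediction(2)
first_miss = mem_units.try_switch_off(1)    # low utilisation during a cache miss
mem_units.restore(1)                        # unit returned to the prediction afterwards
second_miss = mem_units.try_switch_off(1)   # a later cache miss has no effect
```

Keeping the flag with the stored prediction, rather than with the transient power state, is what makes the protection persist across later visits to the same loop.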
The threshold technique described in this section will be used for the basic LDM variant and
ELDM, which is described in the next chapter.
5.4 Hardware components
The hardware components required to add LDM to our superscalar architecture are shown
in Figure 5.4. For each execution unit a counter is included to record the unit’s utilisation,
which is passed to the threshold comparison unit. To avoid the costly division in calculating
percentage utilisation, a look-up table which contains precomputed thresholds for different loop
sizes could be used. The branch direction unit looks for successive backwards branches that
bound iterations of innermost loops, and forward branches to store and compare the forward
branch pattern to detect loop exit. The prediction cache (prediction table) contains the current
prediction for the loop, plus predictions from other loops. Information from the branch direction
unit selects the current entry in the cache and the threshold comparison unit is used to update
the prediction. The prediction in the currently selected entry of the cache is then applied by
power-gating the appropriate units. When not in an innermost loop, no entry in the prediction
cache is selected, so all execution units are power-gated on. The issue failed unit is signalled
when an instruction cannot issue because no appropriate unit is available. The unit will update
the current entry in the prediction cache to include this type of unit in the prediction. The
modifications to the processor are mostly passive and only power-gating the units actively
interferes with normal operation.
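The look-up table suggested above for avoiding division can be sketched as follows. This is an illustrative assumption about how such a table might be built: it stores, for each iteration length, the smallest counter value that is no longer below the switch off threshold. The 5% threshold, the table size and the one-entry-per-cycle-count layout are all assumptions.

```python
SWITCH_OFF_PERCENT = 5   # example threshold, in percent
MAX_CYCLES = 64          # assumed longest iteration covered by the table

# SWITCH_OFF_LIMIT[c] = ceil(SWITCH_OFF_PERCENT * c / 100), computed with integers only.
SWITCH_OFF_LIMIT = [
    (SWITCH_OFF_PERCENT * cycles + 99) // 100 for cycles in range(MAX_CYCLES + 1)
]


def below_switch_off(busy_cycles: int, total_cycles: int) -> bool:
    """Equivalent to busy_cycles / total_cycles < 5%, using one lookup and one compare."""
    return busy_cycles < SWITCH_OFF_LIMIT[total_cycles]


short_iteration = below_switch_off(1, 5)    # 20% utilisation
long_iteration = below_switch_off(1, 55)    # about 1.8% utilisation
```

A hardware table would more likely group iteration lengths into ranges to limit its size; the exact table here is only meant to show that the comparison itself needs no divider.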
[Figure: block diagram of the processor pipeline (Instruction Cache, Fetch Unit, Branch Predictor, ITB, Predecode, Decode and Rename Register, Issue Queue, FP Issue, ALU, MUL, MEM and FPU units, Integer Registers, FP Regs, Retire Unit, Cache and Memory Subsystem) with the added components Resource Prediction Cache, Branch Direction, Threshold Comparison and Issue Failed, and the signals utilisationCounters, powerState, branchExecuted, branchDirection, branchTaken, branchAddress and issueFailed.]
Figure 5.4: The superscalar architecture (based on the DEC Alpha 21264) with additional
components required for LDM (dark grey).
The resource prediction cache is likely to be the largest unit that must be added for LDM. An
example cache is shown in Table 5.1. The branch address and branch pattern (forward branch
pattern) uniquely identify each prediction in the cache and associate it with a loop branch and
control flow. The prediction consists of a bit for each execution unit which is set to 1 if the unit
is required and should be on. The guarantee on field also stores a bit for each unit which, if set,
prevents the unit being power-gated off. The need for this field is explained in Section 5.3.
Branch     Branch     Prediction      Guarantee on
address    pattern
0x0FF4     0x0011     0x1110110000    0x0000010000
0x0FF4     0x101      0x1100101000    0x0000000000
0x5F20     0x10011    0x1000101100    0x0000001100
...        ...        ...             ...
Table 5.1: Example Loop-Directed Mothballing resource prediction cache.
5.5 Example loop from perlbench
This section describes an example loop from the perlbench benchmark and demonstrates how
LDM makes a power-gating prediction. First the assembly code for the loop is shown and
explained. Then the unit utilisations and predictions are shown for each iteration of the loop.
5.5.1 Description of the loop
An example loop from the perlbench benchmark is shown in Table 5.2. The first instruction is
a forward branch that exits the loop and the last instruction is the loop branch, which can also
cause loop exit. The forward branch pattern is 0x0 as the first branch must not be taken for the
loop to continue executing. All instructions require an ALU, either for the main computation
or for address calculation, and four require MEM units. The MUL, FP ADD and FP MUL
units are not needed for this loop.
5.5.2 Utilisation and predictions
Table 5.3 shows runtime statistics gathered during execution of the perlbench loop on the
simulator. The unit utilisation for each unit will be divided by the number of cycles needed to
execute the iteration to produce a utilisation percentage. This is compared to the thresholds to
determine if a unit should be power-gated on or off. For this example, the switch off threshold
is set to 5% and the switch on threshold is set to 70%. The table shows iterations where valid
counts can be made. Therefore, as one iteration must pass before two consecutive branches with
Instruction             Unit required    Instruction description
BLE 21 112              ALU              Branch if <= 0
SUBL 21 0 -> 21         ALU              Subtract longword
BIS 20 0 -> 29          ALU              Bitwise OR
LDQ_U [18] -> 28        MEM, ALU         Load quadword unaligned
EXTBL 28 18 -> 28       ALU              Extract byte low
LDQ_U [20] -> 25        MEM, ALU         Load quadword unaligned
INSBL 28 20 -> 24       ALU              Insert byte low
MSKBL 25 20 -> 25       ALU              Mask byte low
BIS 25 24 -> 25         ALU              Bitwise OR
STQ_U 25 [20]           MEM, ALU         Store quadword unaligned
LDQ_U [29] -> 28        MEM, ALU         Load quadword unaligned
EXTBL 28 29 -> 29       ALU              Extract byte low
AND 19 0 -> 23          ALU              Bitwise AND
XOR 29 23 -> 29         ALU              Bitwise XOR
LDA 20 -> 20            ALU              Load address
LDA 18 -> 18            ALU              Load address
BNE 29 -68              ALU              Branch if != 0
Table 5.2: Assembly code of an example loop from the perlbench benchmark.
the same address are observed, the first iteration in the table is actually the second iteration
of the loop.
After the first iteration, the entry in the prediction cache is set up and a new prediction is created
with all the values set to the default value of 1. After the second iteration, the prediction is
modified. The least-used MUL unit is used less than the switch off threshold, so it is flagged as
unnecessary in the prediction and switched off. As the FP units are actually two different types
of unit, one for simple arithmetic and one for complex arithmetic such as multiplication, each
is considered separately. Neither is used, so both are also switched off (according to Definition
5.1 we may only switch off the last unit of a type if it has zero utilisation). The third iteration
switches off the last MUL unit as it is not guaranteed to be on, and is not used.
At iteration 13, the utilisation increases for all the units, and the number of cycles to execute
the iteration increases to 55 cycles (equal to the memory latency). A cache miss could cause
this delay and stall the pipeline, resulting in the units appearing busy for longer than normal.
Due to this, the utilisation of the least-used memory unit has dropped below the threshold, and
the unit is switched off. On iteration 14 the unit is still utilised as instructions are allowed to
Iteration  Unit utilisation                        Cycles to  Prediction        Guarantee on
           [ALU]          [MUL]   [MEM]   [FP]     execute
1          [5,4,4,2]      [0,0]   [3,2]   [0,0]    5          0x1111 11 11 11   0x0000 00 00 00
2          [5,4,4,3]      [0,0]   [3,1]   [0,0]    5          0x1111 10 11 00   0x0000 00 00 00
3          [5,4,4,2]      [0,0]   [3,2]   [0,0]    5          0x1111 00 11 00   0x0000 00 00 00
4          [5,4,4,2]      [0,0]   [3,2]   [0,0]    5          0x1111 00 11 00   0x0000 00 00 00
5          [5,4,4,2]      [0,0]   [3,2]   [0,0]    5          0x1111 00 11 00   0x0000 00 00 00
6          [5,4,4,3]      [0,0]   [2,1]   [0,0]    5          0x1111 00 11 00   0x0000 00 00 00
7          [5,5,5,2]      [0,0]   [3,1]   [0,0]    5          0x1111 00 11 00   0x0000 00 00 00
8          [5,4,3,2]      [0,0]   [3,2]   [0,0]    5          0x1111 00 11 00   0x0000 00 00 00
9          [5,5,4,3]      [0,0]   [3,1]   [0,0]    5          0x1111 00 11 00   0x0000 00 00 00
10         [5,4,4,3]      [0,0]   [3,1]   [0,0]    5          0x1111 00 11 00   0x0000 00 00 00
11         [5,5,4,1]      [0,0]   [4,1]   [0,0]    5          0x1111 00 11 00   0x0000 00 00 00
12         [5,4,4,2]      [0,0]   [4,1]   [0,0]    5          0x1111 00 11 00   0x0000 00 00 00
13         [13,12,10,7]   [0,0]   [5,1]   [0,0]    55         0x1111 00 10 00   0x0000 00 00 00
14         [7,6,4,1]      [0,0]   [7,1]   [0,0]    7          0x1111 00 11 00   0x0000 00 01 00
15         [5,5,4,2]      [0,0]   [4,0]   [0,0]    5          0x1111 00 11 00   0x0000 00 01 00
Table 5.3: Iteration statistics from execution of the perlbench benchmark.
complete before the unit is actually power-gated off. During this iteration the utilisation
of the remaining MEM unit is above the switch on threshold, which causes the prediction to
revert to having both MEM units on. This sets the guarantee on flag for the second MEM
unit and will prevent it being switched off again during future cache misses.
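The threshold comparisons for the MEM units in iterations 12 to 14 can be checked with a few lines of arithmetic. The sketch below uses the counts from Table 5.3 and the example thresholds of 5% and 70%; the variable and function names are hypothetical.

```python
SWITCH_OFF_THRESHOLD = 0.05
SWITCH_ON_THRESHOLD = 0.70


def utilisation(busy_cycles: int, total_cycles: int) -> float:
    """Fraction of the iteration's cycles in which the unit was busy."""
    return busy_cycles / total_cycles


# Least-used MEM unit: busy 1 cycle out of 5 in iteration 12, 1 out of 55 in iteration 13.
iteration_12 = utilisation(1, 5)     # 20%, not below the switch off threshold
iteration_13 = utilisation(1, 55)    # about 1.8%, below the switch off threshold

# Remaining MEM unit in iteration 14: busy 7 cycles out of 7.
iteration_14 = utilisation(7, 7)     # 100%, above the switch on threshold

switched_off_at_12 = iteration_12 < SWITCH_OFF_THRESHOLD
switched_off_at_13 = iteration_13 < SWITCH_OFF_THRESHOLD
switched_on_at_14 = iteration_14 > SWITCH_ON_THRESHOLD
```

The long iteration alone moves the least-used MEM unit from 20% to under 2% utilisation, even though its busy count is unchanged, which is why the cache miss triggers the switch off.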
5.6 Summary
This chapter describes the implementation of the basic form of LDM, which is restricted to
innermost loops. In Section 5.1 we explain the rationale behind applying our technique only to
innermost loops and show the implications of this for 16 benchmark traces. Section 5.2 presents
the LDM algorithm, details the key stages of the implementation and describes the benefits of
false loop exit detection. Section 5.3 discusses two approaches to improving power-savings by
switching off units that are used during an iteration, and shows that our threshold approach
overcomes some shortcomings of an exhaustive search. Section 5.4 describes the additional
hardware that would need to be added to the superscalar pipeline to implement the technique.
An example demonstrating LDM in operation when executing the perlbench benchmark is given
in Section 5.5.
The set of execution units required to execute a loop depends on the union of the sets of units
required on each cycle. Innermost loops are likely to be short and have fewer sets contributing
to the union, so it is likely that more units will not be in the union and can therefore be turned
off, when compared to larger (non-innermost) loops. We can also detect innermost loops easily
at runtime. Our coverage evaluation showed that over 75% of executed instructions in 10 of
the 16 benchmark traces come from loop bodies. Given this, and the likelihood of more savings
from small loops, the innermost loop restriction should not be too detrimental to savings.
A threshold approach to identifying unnecessary execution units allows the set of available units
to be changed quickly and adapt to cache misses as well as out-of-order scheduling moving
instructions past loop branches. Potential performance degradation can also be detected and
units switched back on using a second threshold. The LDM implementation has the following
key stages:
1. Detect loop entry by monitoring backwards branches.
2. Record unit utilisation with hardware counters associated with each unit.
3. Apply thresholds to unit utilisation to create a prediction of which units are required.
4. Power-gate the units not required according to the prediction.
5. Detect loop exit by monitoring the condition of forward branches and the backwards loop
branch.
6. Store the prediction for the loop in the prediction cache so it can be retrieved for future
visits to the loop.
7. Switch all units on as we have no accurate predictions for instructions outside of loop
bodies.
The hardware required to implement the design is mostly passive, only interfering by power-
gating units. The most important component is the prediction cache, which stores predictions,
forward branch patterns and guarantee on flags for different loops and different control flows
through loops. As this is likely to be large in size, the power consumption for different cache
capacities is discussed in Chapter 8.
We have used an example loop from perlbench to demonstrate the operation of LDM, including
how it behaves during a cache miss and updates its prediction accordingly.
This chapter describes the implementation of LDM, which is limited to power-gating execution
units only during innermost loops due to the hardware loop detection mechanism. The next
chapter describes ELDM, which extends the LDM technique to apply power-gating during
non-innermost loops using loop exit and entry information that is gathered offline.
Chapter 6
Extended Loop-Directed Mothballing
The basic variant of LDM only power-gates execution units during innermost loops. This
chapter provides a method for extending LDM so that the effects of power-gating units during
all loops can be evaluated. The new method is called Extended Loop-Directed Mothballing
or ELDM. Changes to the loop detection mechanism are necessary as insufficient information
is available at runtime to accurately determine which loop the executing instructions belong
to. We provide a means of producing information offline that can be used by the processor to
determine the currently executing loops. We also describe the changes to the LDM hardware
that would be needed to implement ELDM.
6.1 Introduction
For the basic implementation of LDM we chose to limit power-gating to innermost loops, as
they are easy to detect, account for a large proportion of executed instructions, and are likely to
contain more units that can be power-gated than larger loops. In this chapter we will describe
Extended Loop-Directed Mothballing, which will allow us to examine the extent of additional
power savings if we apply LDM to all loops.
In Figure 5.2 we saw that for 6 of the 16 benchmarks, innermost loops accounted for only a small
proportion of executed instructions. Figure 6.1 shows the proportion of executed instructions
that originate from innermost loops (the proportion of execution where LDM can be applied
to power-gate units) and the proportion of executed instructions that originate from any loop
(the proportion of execution where ELDM can be applied). From the figure we can see that
almost every executed instruction from each benchmark is from a loop.
[Figure: bar chart of Coverage (%), from 0 to 100, for LDM and ELDM across the benchmarks 400.perlbench, 401.bzip2, 403.gcc, 429.mcf, 433.milc, 435.gromacs, 436.cactusADM, 437.leslie3d, 445.gobmk, 456.hmmer, 458.sjeng, 462.libquantum, 464.h264ref, 470.lbm, 482.sphinx3 and 998.specrandom.]
Figure 6.1: Proportion of executed instructions where power-gating can be applied when using
LDM and ELDM
Although detecting innermost loops is possible at runtime, detecting other loops and the loop
structure can be difficult. More specifically, it is difficult to determine exactly which loop in a
loop nest the currently executing instructions belong to: a branch that exits a loop may also
exit other outer loops, and loop nesting may not be properly detected when a loop contains
more than one backwards branch (for example from continue statements). In the latter case,
each time a different loop branch is encountered, the loop detector may assume a new inner
loop has been entered, resulting in a very deep loop nest instead of a single loop.
To overcome this when executing a benchmark, we supply the simulator with a list of loop
entry points and exit points, and the simulator maintains a modified version of the prediction
cache to keep track of the loops that the currently executing instructions belong to. The loop
entry point is an instruction address inside the loop, and if that address is visited it is added
to the prediction cache and flagged as currently executing. Loop exit points are instruction
addresses visited just after a loop exit. When these are encountered, the loop that corresponds
to the exit point is found in the prediction cache and the entry is flagged as not currently
executing. The entry will remain in the cache as it contains the prediction for the loop, which
may be used again in future visits to the loop. A loop may have many exit points and many
loops may share the same exit point, if a single branch exits many loops.
In a real implementation, loop entry and exit points could either be detected at compile time
or by profiling the application. At compile time, additional instructions would be inserted to
inform a real processor of upcoming loops with the relevant entry and exit information. Special
instructions are used in existing compiler based methods in [JOA+05, TSC06, RPOG02] to pass
similar information to the processor at runtime. Unfortunately, the Compaq C compiler we use
to compile benchmarks for the simulator is closed source, so for our purposes benchmark traces
are manually processed offline after compilation to extract the required information. Lists of
entry and exit points are then passed to the simulator along with the program instructions.
The algorithm used to extract the loops offline is explained in Section 6.3, but the reader should
note that loop detection is not the focus of this thesis. The algorithm is sufficient to provide the
required information for our simulation, but we accept that improvements could be made to the
method. It should be clear, however, that a similar method could also be implemented
in a compiler to produce the same results and insert instructions to convey the information to
the real processor implementation.
Offline loop detection allows loops to be identified an iteration earlier with ELDM, which
explains the higher coverage in Figure 6.1 for benchmark traces executing only a single innermost
loop (for example mcf and libquantum).
Larger outer loops will likely contain more complex internal control flows, which will include
the control flows of inner loops, which may iterate a different number of times during different
outer loop iterations. In the basic variant of LDM different predictions could be made for
different control flows inside a loop, as long as the same control flow was observed for a few
iterations. Large outer loops are much less likely to experience the same control flow for
successive iterations, so tailoring predictions for control flows would be ineffective. To simplify
the ELDM method and structures used in the implementation, a single prediction is made for
each loop, including innermost loops, regardless of control flow.
6.2 ELDM Overview
The ELDM method is generally the same as for LDM, but some tasks must be performed
differently. Algorithms 3 and 4 describe the ELDM method. The EveryCycle function is
performed every cycle, and the UpdatePrediction function is only performed at the end of an
iteration.
Loop detection. As explained earlier, it is necessary to communicate some information to the
processor so that it is possible to accurately determine which loops the executing instructions
belong to. The processor is provided with a list of loop entry points, which identifies each loop
by the address of an instruction contained within the loop body. In Algorithm 3 the branch
address field of each prediction table entry is preloaded with an entry point. When an
instruction with this address is executed, we can safely assume that the corresponding loop
has been entered. If a prediction already exists in the prediction cache, it will be applied
immediately after the loop is detected, at the point where the current prediction
is applied.
Record utilisation. The utilisation of each execution unit is recorded separately for each
loop. The entry point marks the boundary between two iterations.
Apply thresholds. As in LDM, utilisation thresholds are used to make a prediction of the
required units.
Power-gate units. The prediction is applied by powering units off or on. In a nest of loops,
the prediction of the innermost loop that is being executed is always applied.
Detect loop exit. A list of loop exit points is passed to the processor as described earlier.
The exit point consists of a tuple which links the address of an instruction visited just after a
loop exit to the entry point which identifies the loop. If an instruction matching an exit point
address is executed, the loop identified by the entry point has exited. Multiple loops may exit
when an address is visited, so the exit addresses of every loop in the cache must be checked.
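The use of entry and exit points can be illustrated with a small software model that tracks which loops are currently executing. This is a sketch under assumptions: the class name, the stack representation and the example addresses are hypothetical, and the model ignores the prediction data stored alongside each loop.

```python
from typing import Dict, Iterable, List, Optional, Tuple


class LoopTracker:
    """Hypothetical tracker of currently executing loops, keyed by entry point."""

    def __init__(self, entry_points: Iterable[int],
                 exit_points: Iterable[Tuple[int, int]]) -> None:
        self.entry_points = set(entry_points)
        # Map each exit address to the entry points of every loop it exits.
        self.exits: Dict[int, List[int]] = {}
        for exit_address, entry_point in exit_points:
            self.exits.setdefault(exit_address, []).append(entry_point)
        self.executing: List[int] = []      # outermost first, innermost last

    def on_instruction(self, address: int) -> None:
        for entry_point in self.exits.get(address, []):
            if entry_point in self.executing:
                self.executing.remove(entry_point)    # flag the loop as not executing
        if address in self.entry_points and address not in self.executing:
            self.executing.append(address)            # a loop has been entered

    def innermost(self) -> Optional[int]:
        """Entry point of the innermost executing loop, or None outside all loops."""
        return self.executing[-1] if self.executing else None


tracker = LoopTracker(
    entry_points=[0x100, 0x200],                      # outer loop, inner loop
    exit_points=[(0x300, 0x200),                      # leaves the inner loop only
                 (0x400, 0x200), (0x400, 0x100)],     # one address exits both loops
)
seen = []
for address in [0x100, 0x200, 0x300, 0x200, 0x400]:
    tracker.on_instruction(address)
    seen.append(tracker.innermost())
```

The prediction applied at any time would be the one belonging to the loop returned by `innermost()`, with all units on when it returns None.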
Prediction Storage. At loop exit, the loop information remains in the prediction cache and
Algorithm 3 EveryCycle, structures and signals (ELDM)
int8 currentLoopDepth    ▷ Storage Structures
RAM int tableBranchAddress[TABLE_SIZE]    ▷ Prediction Table
RAM bool tableCurrentlyExecuting[TABLE_SIZE]
RAM int8 tableLoopDepth[TABLE_SIZE]
RAM int16 tableUtilisationCounters[TABLE_SIZE][NUMBER_OF_UNITS]
RAM int16 tableTotalCycles[TABLE_SIZE]
RAM bool tablePrediction[TABLE_SIZE][NUMBER_OF_UNITS]
RAM bool tableGuaranteeOn[TABLE_SIZE][NUMBER_OF_UNITS]
RAM int loopExitBranchAddress[EXIT_TABLE_SIZE]    ▷ Loop Exit Table
RAM int loopExitExitAddress[EXIT_TABLE_SIZE]
bool unitActive[NUMBER_OF_UNITS]    ▷ Input Signals From Processor
bool branchExecuted
bool branchDirection
int branchAddress
int currentAddress[ISSUE_WIDTH]
bool issueFailed[NUMBER_OF_UNIT_TYPES]
bool powerState[NUMBER_OF_UNITS]    ▷ Output Signals To Processor
function EveryCycle
    for all t in TABLE_SIZE do    ▷ Record Utilisation
        for all u in NUMBER_OF_UNITS do
            tableUtilisationCounters[t][u] += unitActive[u]
        end for
        tableTotalCycles[t]++
    end for
    if branchExecuted and branchDirection = BACKWARDS then    ▷ Loop And Iteration Detection
        index ← SelectByAddress(branchAddress)    ▷ Select Entry In Table
        if index ≥ 0 then    ▷ If Branch Found In Table
            if tableCurrentlyExecuting[index] then
                UpdatePrediction(index)
            else
                currentLoopDepth++
                tableLoopDepth[index] ← currentLoopDepth
                tableCurrentlyExecuting[index] ← true
            end if
            for all u in NUMBER_OF_UNITS do    ▷ Reset For Next Iteration
                tableUtilisationCounters[index][u] ← 0
            end for
            tableTotalCycles[index] ← 0
        end if
    end if
    for all e in EXIT_TABLE_SIZE do    ▷ Loop Exit Detection
        for all i in ISSUE_WIDTH do
            if loopExitExitAddress[e] = currentAddress[i] then
                index ← SelectByAddress(loopExitBranchAddress[e])
                tableCurrentlyExecuting[index] ← false
                if currentLoopDepth ≥ tableLoopDepth[index] then
                    currentLoopDepth ← tableLoopDepth[index] − 1
                end if
            end if
        end for
    end for
    index ← SelectByLoopDepth(currentLoopDepth)    ▷ Only Currently Executing Entries Considered
    for all u in NUMBER_OF_UNITS do    ▷ Apply Current Prediction
        if index = 0 then    ▷ Not In A Loop
            powerState[u] ← on
        else
            powerState[u] ← tablePrediction[index][u]
        end if
    end for
    for all t in NUMBER_OF_UNIT_TYPES do    ▷ Check For Failed Attempts To Issue Instruction
        if issueFailed[t] then
            u ← min(t)    ▷ Get The First Unit Of This Type
            powerState[u] ← on
            tablePrediction[index][u] ← on
            tableGuaranteeOn[index][u] ← true
        end if
    end for
end function
6.2. ELDM Overview 75
Algorithm 4 UpdatePrediction (ELDM)
function UpdatePrediction(index)
for all t in NUMBER OF UNIT TYPES do . Update Existing Entry
luu← FindLeastUsed(t, tableUtilisationCounters[index]) . Get The Least Used Unit Of Type
if !tableGuaranteeOn[index][luu] then
totalCycles← tableTotalCycles[index]
. Test Power Off Threshold And Power-Gate Unit
if luu = min(t) and tableUtilisationCounters[index][luu] = 0 then
tablePrediction[index][luu] ← off
else if tableUtilisationCounters[index][luu]/totalCycles < SWITCH OFF THRESHOLD then
tablePrediction[index][luu] ← off
end if
. Test Power On Threshold And Power-Gate Unit
if tableUtilisationCounters[index][luu]/totalCycles > SWITCH ON THRESHOLD and luu 6= max(t) then
tablePrediction[index][luu + 1] ← on
tableGuaranteeOn[index][luu + 1] ← true
end if
end if
end for
end function
the prediction may be reused if the loop is visited again in future. The entry may be overwritten
by other loop information as more loops are executed, in which case the previous prediction
will be lost.
Power-gate units. If the outermost loop is exited, all units are powered on, as in LDM. If
the loop exits into another loop in a nest, however, the prediction for that enclosing loop is
applied. The currently executing loop can be obtained from the entries in the prediction cache
(see Section 6.4).
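The loop exit and power-gating steps can be illustrated with a minimal Python sketch. It is an illustration under assumptions rather than the simulator code: the LoopEntry record, the dictionary keyed by entry point address and the function names are hypothetical, and a depth of 0 is assumed to mean that no loop is executing. The depth update mirrors the loop exit detection in Algorithm 3.

```python
from dataclasses import dataclass, field


@dataclass
class LoopEntry:
    """One prediction cache entry (hypothetical layout for illustration)."""
    depth: int = 0              # depth of the loop in the current nest
    executing: bool = False     # "currently executing" field
    prediction: list = field(default_factory=list)  # True = unit powered on


def handle_exits(table, exit_points, depth, address):
    """Mark every loop whose exit address matches as exited.

    table: dict mapping entry point address -> LoopEntry
    exit_points: list of (exit_address, entry_address) tuples
    Returns the updated current loop depth.
    """
    for exit_address, entry_address in exit_points:
        if exit_address == address:
            entry = table[entry_address]
            entry.executing = False
            if depth >= entry.depth:
                depth = entry.depth - 1
    return depth


def select_prediction(table, depth, n_units):
    """Apply the prediction of the innermost executing loop, else power all on."""
    if depth > 0:
        for entry in table.values():
            if entry.executing and entry.depth == depth:
                return list(entry.prediction)
    return [True] * n_units
```

With an outer loop at depth 1 and an inner loop at depth 2, visiting the inner exit address drops the depth to 1, so the outer prediction is applied; visiting the outer exit address drops it to 0 and every unit is powered on.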
The next section discusses how we record resource utilisation in the record utilisation stage, as
nesting of loops means that executing instructions will belong to more than one loop.
6.2.1 Recording resource utilisation in loops
When applying the LDM strategy to all loops, some executing instructions belong to more than
one loop in a nest. There are two options for recording resource utilisation while executing loops
in nests:
1. Resource utilisation is recorded for instructions that are contained within the loop, but
not for those that are inside any other inner loops.
2. Resource utilisation for a loop includes all instructions including those in inner loops.
The motivation for option 1 is that a separate prediction will be made and applied for an inner
loop, so a more specific prediction should be applied for the instructions in the outer loop.
However, this can lead to switching many units on or off when the inner loop exits, due to a
mismatch between the requirements of the two loops. Also, the outer loop will normally execute
the inner loop on every iteration, so the prediction for the outer loop will only be applied until
its next iteration when the inner loop will be visited again. This frequent switching has an
associated cost as described in Section 2.4.
Option 2 limits frequent switching by including the utilisation of the contained loops. Often, an
inner loop will dominate the execution of the outer loop, and option 2 will bias the prediction
for the outer loop towards the requirements of the inner loop. This reduces the amount of
switching at the exit of the inner loop as the two predictions should be similar. Conversely,
an inner loop may be small and relatively insignificant to the outer loop. In this situation the
outer prediction would be naturally biased towards the best set of execution units for the outer
loop. If any type of unit is required only by the inner loop however, at least one unit of the
type would also be included in the prediction of the outer loop, removing the need to power-up
the unit on each visit to the inner loop.
We select option 2 for ELDM, to limit performance losses that could be caused by frequently
power-gating units at the boundaries between inner and outer loops.
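The difference between the two recording options can be sketched in a few lines of Python. The dictionary layout and the include_inner flag are hypothetical names introduced only for this illustration; option 2, the one selected for ELDM, corresponds to include_inner=True.

```python
def record_cycle(loops, unit_active, include_inner=True):
    """Add one cycle of unit activity to the counters of executing loops.

    loops: list of dicts with keys 'depth', 'executing', 'counts', 'cycles'
    unit_active: list of booleans, one per execution unit
    include_inner: True for option 2 (every executing loop is updated);
                   False for option 1 (only the innermost executing loop).
    """
    executing = [entry for entry in loops if entry["executing"]]
    if not executing:
        return
    if not include_inner:
        # Option 1: instructions inside an inner loop do not count
        # towards the loops that enclose it.
        executing = [max(executing, key=lambda entry: entry["depth"])]
    for entry in executing:
        entry["cycles"] += 1
        for unit, active in enumerate(unit_active):
            entry["counts"][unit] += int(active)
```

Under option 2 a unit used only by the inner loop still accumulates utilisation in the outer loop's entry, which is what biases the outer prediction towards keeping that unit on.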
6.3 Offline trace analysis
To accurately identify loop entry and exit points, offline analysis is performed on all benchmark
traces. Earlier in Section 6.1 we explained how the entry points and exit points are used to
identify which loop the currently executing instructions belong to so that the prediction for the
required set of execution units can be updated. This section provides a method that can be
used to identify these points in a benchmark trace.
Control flow graphs (CFGs) are produced for each benchmark trace by recording the basic
blocks visited during simulation. Some information in the CFG is irrelevant for our analysis,
such as if-then-else and case statements, many of which exist inside or between loops and have
no bearing on the loop entry and exit points. To hide this information and make processing
the trace for entry and exit points easier, we place basic blocks in address order and iteratively
merge blocks with the next block if they only target the next block. Figure 6.2 shows how the
basic blocks of an if-then-else statement are merged into a single block.
Figure 6.2: a) A control flow of basic blocks from an if-then-else statement. b), c) and d) Basic
blocks from a) are iteratively merged with the next block if they only branch into the next
block.
Branches specifically relevant to loops or function calls will not be merged by this process. Loop
branch instructions may target the next block, but will always either target a previous block
or the block containing the branch. If the branch targets a previous address inside the same
block, this backwards branch is protected from being removed from the CFG, so the loop can
still be detected. Function call and return branches are never merged, as they can introduce
invalid control flows.
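The merging step can be sketched in Python as follows. This is a minimal illustration, assuming that each basic block is identified by its start address (so a larger identifier means a higher address), that succs holds only ordinary branch and fall-through edges, and that blocks ending in a call or return are listed in never_merge; these names are hypothetical and do not come from the analysis tool itself.

```python
def merge_blocks(order, succs, never_merge=frozenset()):
    """Iteratively merge a block into the next one if it only targets it.

    order: basic block ids in ascending address order
    succs: dict mapping block id -> set of successor block ids
    never_merge: ids of blocks ending in a function call or return
    Returns a list of merged groups, each a list of original block ids.
    """
    groups = [[block] for block in order]
    merged = True
    while merged:
        merged = False
        owner = {b: i for i, group in enumerate(groups) for b in group}
        for i in range(len(groups) - 1):
            if any(b in never_merge for b in groups[i]):
                continue
            targets = set()
            for b in groups[i]:
                for s in succs.get(b, ()):
                    if owner[s] == i and s > b:
                        continue  # forward edge hidden inside the merged block
                    # Backward edges inside the group are kept as self-targets,
                    # so a loop branch is never merged away.
                    targets.add(owner[s])
            if targets == {i + 1}:
                groups[i:i + 2] = [groups[i] + groups[i + 1]]
                merged = True
                break
    return groups
```

For an if-then-else with blocks 0 (condition), 1 (then), 2 (else) and 3 (join), three merges collapse everything into one block, in the manner of Figure 6.2. A block with a backwards branch to itself keeps that branch as a target, so it is never merged into the block that follows it.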
After merging, the remaining branches are coloured to show backwards branches (potential
loops), function calls and function returns. As function calls and returns normally target
instruction addresses either above or below the currently executing code region, they often
cause backwards branches which could be mistaken for loops, and hence are coloured differently.
Loops are identified as a backwards branch with a control flow path back to the branch address.
Figure 6.3 shows real examples of true loops and false loops observed in the perlbench, leslie3d
and specrandom benchmark programs. At the top of the first control flow we can see a standard
loop with branch address at a) and exit address at b). These two addresses are stored in the
entry point and exit point lists as appropriate. The backwards branch from c) is not a loop
on its own, but when combined with the branch at h) it does form a loop. Either backwards
branch address could be chosen as the entry point and the exit point is at d). e) is a false loop
as there is no path back to e) after the branch is taken. f) is another small loop, which exits
at g).
The second control flow in Figure 6.3 is a function that is called from an instruction at a
higher address. i) is a true loop that exits at j), but k) is a false loop as the function returns
just after j). l) and m) are two examples that demonstrate a typical loop nesting and a loop
with multiple branch addresses to the same loop. In m) the entry point will be chosen as an
instruction address common to all paths within the loop.
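The distinction between true and false loops amounts to a reachability test on the merged CFG. The sketch below is a hedged illustration of that test only; the function name and the graph representation are assumptions, and function call and return edges are assumed to have been excluded from succs beforehand.

```python
from collections import deque


def is_true_loop(succs, branch_block, target_block):
    """Return True if control can flow from the branch target back to the branch.

    succs: dict mapping block id -> set of successor block ids
    branch_block: block containing the backwards branch
    target_block: block targeted by the backwards branch
    """
    seen = {target_block}
    queue = deque([target_block])
    while queue:
        block = queue.popleft()
        if block == branch_block:
            return True
        for nxt in succs.get(block, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return False
```

A backwards branch whose target never leads back to the branch, as at e) in Figure 6.3, is reported as a false loop and is not added to the entry point list.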
For our evaluation, CFGs from each benchmark trace are processed manually. By doing this,
we were able to identify some of the more complex loops such as those in Figure 6.3, and ensure
that valid entry and exit points were selected. Automation of this method would require further
analysis of other control flows such as those involving recursion. In our framework, automation
is further hindered by the closed source nature of the compiler.
6.4 Hardware
The hardware requirements for ELDM are shown in Figure 6.4. The main difference when
compared to the hardware required for LDM is the loop exit unit which accompanies the
prediction cache. The prediction cache contains the current prediction for each loop, and the
entry point address that identifies the loop. The exit points are stored in their own table in
Figure 6.3: Actual control flow graphs from the perlbench, leslie3d and specrandom benchmarks
after merging basic blocks.
the loop exit unit as there may be more than one exit point per loop. For our simulation, the
prediction cache and loop exit table must be large enough to hold the entry and exit points
for all loops, as they are all loaded at the start of simulation. In a real implementation, the
information would be passed to the processor through special compiler-inserted instructions, so
the cache and table can be much smaller. A discussion of the actual cache size for LDM and
ELDM will be provided in Chapter 8.
In addition to the prediction, entries in the prediction cache also contain a guarantee on field
and resource utilisation counts. The resource utilisation counts must be recorded separately
for each loop as executed instructions may contribute to the counts of many loops in a loop
nest simultaneously. A loop depth field is also included in the cache to record the depth of each
loop in the current nesting. This is used in conjunction with a register monitoring the deepest
loop currently executing to determine which prediction should be applied after loop exit. The
forward branch pattern is no longer necessary to detect loop exit, so does not need to be stored
for ELDM. Finally a currently executing field is added to show which loops are currently being
executed.
The threshold comparison unit uses the resource utilisation counts for the loop at the end of
each iteration, along with the guarantee on field to update the prediction.
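The two tests performed by the threshold comparison unit can be written as a small pure function. The sketch below transcribes the conditions of Algorithm 4 literally for a single least-used unit; the function name and parameters are hypothetical, the threshold values are left to the caller, and the guard against a zero cycle count is an addition made here to keep the example safe to run.

```python
def threshold_decision(active_cycles, total_cycles, is_first_unit, is_last_unit,
                       guaranteed_on, off_threshold, on_threshold):
    """Decide the update for the least used unit of one type.

    Returns (switch_off_this_unit, switch_on_next_unit).
    """
    # A unit with the guarantee on field set is never re-evaluated.
    if guaranteed_on or total_cycles == 0:
        return (False, False)
    utilisation = active_cycles / total_cycles
    # Power off test, as written in Algorithm 4.
    switch_off = ((is_first_unit and active_cycles == 0)
                  or utilisation < off_threshold)
    # Power on test: a heavily used unit enables the next unit of its type,
    # unless it is already the last unit of that type.
    switch_on_next = utilisation > on_threshold and not is_last_unit
    return (switch_off, switch_on_next)
```

When switch_on_next is returned, Algorithm 4 also sets the guarantee on field of the newly enabled unit, which the caller would need to record.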
[Figure 6.4 block diagram. Component labels: Instruction Cache; Fetch Unit; Branch Predictor; ITB; Predecode; Retire Unit; Decode and Rename Register; Issue Queue; FP Issue; ALU; MUL; MEM; FPU; Integer Registers; FP Regs; Cache and Memory Subsystem; Resource Prediction Cache; Branch Direction; Threshold Comparison; Issue Failed; Loop Exit. Signal labels: unitActive, powerState, branchExecuted, branchDirection, branchAddress, issueFailed, currentAddress[4].]
Figure 6.4: The superscalar architecture (based on the DEC Alpha 21264) with additional
components required for ELDM (dark grey).
The utilisation counters associated with each execution unit do not need to count the number
of active cycles for each unit, but instead flag if a unit was used each cycle. These flags are
used every cycle to update loop-specific utilisation counts in the prediction cache for all loops
that are currently executing.
6.5 Summary
This chapter shows how LDM can be extended so that we can evaluate the additional power
that can be saved by applying the strategy to all loops. Section 6.1 shows the increased coverage
of instructions over which power-gating could be applied, and describes the difficulties of
accurately determining which loop the executing instructions belong to. Due to these difficulties,
offline analysis is required to provide information on loop entry and exit points. Section
6.2 explains the steps that must be carried out when implementing ELDM, and the changes
that must be made to detect loops. An offline trace analysis method which provides loop entry
and exit points is provided in Section 6.3. The hardware that would need to be added to the
superscalar architecture to implement ELDM is compared to that needed for LDM in Section
6.4.
It is important for the ELDM implementation to be able to determine which loops the currently
executing instructions belong to, so that the correct set of execution units can be power-gated.
We manually produce loop entry and exit points from benchmark traces, which are
passed to the simulator along with the benchmark program so that simulated hardware can
accurately determine which loop is being executed. Ideally, however, this method would be
incorporated into a compiler to automate the process. Communicating the entry and exit point
information through special instructions also reduces the amount of storage needed for the
implementation, as the points are provided only when needed.
A key difference with LDM is the prediction cache, which must be augmented with fields to
support loop nesting information. The utilisation of each execution unit must be recorded for
each loop in the cache, as the executing instructions may belong to more than one loop. The
forward branch pattern, used to detect loop exit in the basic variant of LDM, is not required
however, as the exit point information is used to identify loop exits, and ELDM does not make
different predictions for different control flows inside a loop.
ELDM allows power-gating to be applied to execution units during execution of all loops, by
providing the processor with loop information that has been gathered offline. The next chapter
modifies the dynamic out-of-order scheduler to encourage reuse of units that are already on.
When applied to a processor implementing ELDM, this should make units that are unnecessary
for execution of a loop easier to identify with ELDM, and may permit a simpler ELDM
implementation.
Chapter 7
Schedule balancing
Out-of-order schedulers issue as many instructions from the issue queue as possible each cycle,
but do not take the types of instruction into account when selecting instructions. This can result
in more units being used to execute a loop body than would be necessary. In this chapter we
modify the scheduler so that it encourages reuse of units and avoids using many units of the
same type where possible. This scheduler will be incorporated into ELDM, so that the ability
of the scheduler to reveal unnecessary units can be evaluated.
7.1 Introduction
In Chapter 5 we stated that the set of execution units required to execute a loop body depends
on the union of the sets required on each individual cycle. We also noted that in the best case
scenario there would be minimal variation between the sets used on each cycle, as variation
between the sets may result in more units in the union of the sets. LDM and ELDM attempt
to increase savings by assuming that if a particular variation occurs infrequently and results
in a unit that has low utilisation, the computation can be moved to another unit without
performance loss and the unit can be power-gated.
In current out-of-order schedulers, there is no motivation for reducing the variation between the
set of instructions that are issued from cycle-to-cycle. Consider the issue queue in Figure 7.1,
Figure 7.1: Example issue queue containing 8 issuable instructions.
which contains 8 instructions that have all of their data dependencies satisfied. The scheduler
may issue up to four instructions per cycle. A traditional scheduler will process the queue from
the head, issuing the first four instructions (LOAD, LOAD, MUL, MUL) on the first cycle and the
following four instructions (ADD, ADD, ADD, ADD) on the second cycle. When using the traditional
scheduler the set of required units for the two cycles contains 8 units: 4 ALUs, 2 MEMs and
2 MULs. It is clear, however, that only half of these units are actually needed
if 2 ALU instructions, 1 MEM instruction and 1 MUL instruction are issued on each cycle. In
this chapter we propose a balancing scheduler that attempts to reduce the variation between
the instructions issued on different cycles.
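The unit counts in this example can be checked with a short Python sketch. It assumes that units of each type are filled from the first unit upwards, so the number of units of a type needed over a schedule is the largest number of instructions of that type issued in any one cycle; the function name is hypothetical.

```python
from collections import Counter


def units_required(schedule):
    """Return the number of units of each type needed to run a schedule.

    schedule: list of cycles, each a list of unit types issued that cycle
    """
    need = Counter()
    for cycle in schedule:
        for unit_type, count in Counter(cycle).items():
            need[unit_type] = max(need[unit_type], count)
    return dict(need)


# Figure 7.1 queue issued from the head (LOADs use MEM, ADDs use ALU).
traditional = [["MEM", "MEM", "MUL", "MUL"], ["ALU", "ALU", "ALU", "ALU"]]
# The same eight instructions issued as a balanced schedule.
balanced = [["MEM", "MUL", "ALU", "ALU"], ["MEM", "MUL", "ALU", "ALU"]]
```

Here units_required(traditional) gives 2 MEM, 2 MUL and 4 ALU units (8 in total), whereas units_required(balanced) gives 1 MEM, 1 MUL and 2 ALU units (4 in total).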
There may be instances where inter-instruction data dependencies prevent a different set of
instructions from being issued on a particular cycle. In this case the balancing scheduler should
not change the schedule, as this may affect performance. As the set of instructions that are
issued in this scenario is limited by the data dependencies, the threshold approach used in LDM
and ELDM would not offer any advantage. In fact, as the switch on threshold cannot accurately
identify performance loss, LDM and ELDM may switch off a unit used by this schedule and
suffer a performance loss. The key advantage of the balancing scheduler is that it has access to
the data dependency information, so it can distinguish between situations where instructions
can be moved to balance the sets (decrease the variation) and situations where dependencies
prevent this.
In an ideal situation with oracle knowledge, a scheduler could consider all of the instructions that
need to be issued on each loop iteration and reorder them given data dependence constraints to
use as few units as possible. As storing all the instructions from large loops would require large
buffers inside the processor, and data dependencies can be complex, this may be intractable
at runtime. Our approach is to balance the set of issued instructions on a cycle-to-cycle
basis.
When the schedule has been balanced, the union of the sets of units used each cycle should be
a better indicator of the unit requirements for the loop body. Therefore, only units that are
never used should be power-gated off, and no units should need to be powered on again due
to performance loss. We will evaluate the balancing scheduler later by incorporating it into
ELDM and comparing the optimal thresholds when using the new scheduler and when using
ELDM on its own.
7.2 Balancing scheduler overview
We associate a cost with each execution unit, which the scheduler can use to decide which
instructions in the issue queue should be issued. For each instruction type contained within the
loop body at least one unit of that type will need to remain on. Therefore the cost of issuing
to these units should be zero. In general, the cost of using the first unit of any type should
be zero, because if it is not used by the loop body at all, it will be switched off anyway. For
the units that are used, setting the cost to zero will encourage the balancing scheduler to reuse
these units whenever possible.
We increase the cost for each additional unit of each type as shown in Table 7.1. The costs
increase more rapidly for the MUL and MEM units. This gives some priority to using additional
ALUs as an extra ALU unit is likely to be more useful (in the example code in Figure 5.2
all instructions require an ALU). The fourth ALU has the highest cost and should only be
used if there are no other types of instruction available that could reuse another unit. Other
Unit      Costs
ALU       [0, 1, 2, 3]
MUL       [0, 2]
MEM       [0, 2]
FP ADD    [0]
FP MUL    [0]
Table 7.1: Execution unit costs for the scheduler modification. Costs increase for each successive
unit of each type.
architectures may have a different configuration of units, but as long as the costs increase for
each additional unit of each type, the scheme should encourage a balanced schedule where
possible.
We do not restrict the scheduler from issuing instructions to an available unit. Doing so could
force an instruction to be executed in the following cycle and allow the unit to be power-gated,
but the effect on performance cannot be determined in the cycle-to-cycle approach.
Each cycle, the issue queue is processed in order and each instruction is weighted by the cost
of the unit that it would need for execution. Figure 7.2 shows the example issue queue from
Section 7.1 with the cost to issue each instruction. The instructions with the lowest weight
(shown in bold in the figure) are then issued. We can see from the figure that using the
balancing scheduler, the minimum number of units will be required over the two cycles. The
issue queue is reprocessed each cycle and classical optimisations, such as promotion of memory
operations, can still be applied before the weighting.
The balancing scheduler changes the order in which instructions are executed, which may
have an effect on performance if an instruction on a critical path is delayed. To examine
the effects of the balancing scheduler, it was implemented and tested without applying power-gating
to any execution units, and an increase in execution time of less than 1.5% was observed for
all the benchmarks over the 1 million cycle traces (see Chapter 8 for details on the simulation
framework and benchmarks). For the simulations it is assumed that the scheduler modification
will not affect the clock rate of the processor.
Figure 7.3 demonstrates, for the trace of the mcf benchmark, how the average utilisation of
the units changes when the balancing scheduler is used. This particular trace is a single loop,
Figure 7.2: Example issue queue showing issue costs associated with each instruction.
so the values represent the utilisation over a single loop body. With schedule balancing, ALU
computation has moved from the third and fourth ALU to the first and second, increasing the
utilisation of the first and second to almost full capacity, and reducing the utilisation of the
fourth to almost zero. No performance degradation was observed for mcf, and it is clear that
the fourth unit is a good candidate for power-gating.
[Figure 7.3 bar chart. x-axis: execution units (ALU, ALU, ALU, ALU, MUL, MUL, MEM, MEM, FPADD, FPMUL); y-axis: unit utilisation per cycle (0.0 to 1.0); series: Original Scheduler, Modified Scheduler.]
Figure 7.3: Effect of modified scheduler on execution unit utilisation during a trace of the
integer benchmark mcf.
7.3 Processing the issue queue
In the standard scheduler the issue queue is processed sequentially, issuing up to four instructions
from the head of the queue if the required execution units are available. To apply the
weights for the modified scheduler, the queue must be processed up to four times (as there are
four ALUs). On each pass, an instruction of each type (ALU, MUL, MEM, FP ADD, FP MUL)
that is nearest to the head of the queue and has not already been weighted is weighted with the
current cost for that type. The costs are then increased according to Table 7.1 before the next
pass. After the queue has been processed, the instructions with the lowest weights are issued.
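The weighting passes can be illustrated with a minimal Python sketch that uses the costs from Table 7.1. It is one possible reading of the procedure, not the simulator implementation: the function names are hypothetical, every instruction in the queue is assumed to have its data dependencies satisfied, ties between equal weights are assumed to be broken in favour of the instruction nearest the head of the queue, and FPADD and FPMUL stand for the FP ADD and FP MUL units.

```python
# Unit costs from Table 7.1; one entry per physical unit of each type.
COSTS = {"ALU": [0, 1, 2, 3], "MUL": [0, 2], "MEM": [0, 2],
         "FPADD": [0], "FPMUL": [0]}
ISSUE_WIDTH = 4


def weight_queue(queue):
    """Weight ready instructions; queue is a list of unit types, head first.

    Returns one (cost, unit_index) pair per instruction, or None when no
    unit of that type remains for the instruction in this cycle.
    """
    weights = [None] * len(queue)
    passes = max(len(costs) for costs in COSTS.values())
    for p in range(passes):
        weighted_types = set()  # at most one instruction per type per pass
        for i, unit_type in enumerate(queue):
            if (weights[i] is None and unit_type not in weighted_types
                    and p < len(COSTS[unit_type])):
                weights[i] = (COSTS[unit_type][p], p)
                weighted_types.add(unit_type)
    return weights


def select(queue):
    """Return the queue positions issued this cycle and all weights."""
    weights = weight_queue(queue)
    ranked = sorted((w[0], i) for i, w in enumerate(weights) if w is not None)
    return sorted(i for _, i in ranked[:ISSUE_WIDTH]), weights


def run(queue):
    """Issue the whole queue; return the schedule and the set of units used."""
    queue = list(queue)
    schedule, used = [], set()
    while queue:
        chosen, weights = select(queue)
        schedule.append([queue[i] for i in chosen])
        used |= {(queue[i], weights[i][1]) for i in chosen}
        queue = [t for i, t in enumerate(queue) if i not in chosen]
    return schedule, used
```

For the queue of Section 7.1, written as unit types (MEM, MEM, MUL, MUL, ALU, ALU, ALU, ALU), the weights are 0, 2, 0, 2, 0, 1, 2, 3, so one MEM, one MUL and two ALU instructions are issued on each of the two cycles and only four units are used.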
According to [PJS97], the dynamic scheduler delay increases as a quadratic function of both
issue queue length and issue width, and can limit the clock frequency of a superscalar processor.
Therefore, modifying the scheduler to incorporate several passes of the issue queue to select
instructions to issue could be difficult to implement without impacting the performance or
reducing the size of the instruction window. In the discussion of results in Chapter 9 we will
discuss this issue and the possibility of balancing the schedule at compile time, by reordering
the assembly code such that instructions appear in the issue queue in a balanced fashion.
An alternative runtime solution approximates a balanced schedule in only two sequential
passes. The first pass attempts to fill a balanced schedule in one pass by granting issue to
a limited set of units, for example 2 ALUs, 1 MUL, 1 MEM, 1 FP ADD and 1 FP MUL. This
prioritises reusing units that may already be on. If the issue slots cannot be filled completely,
a second pass permits issue to the remaining units until the issue slots are filled. The details of
implementing the scheduler, either in the selection logic or by reordering instructions at compile
time, will be left to future work; we will focus instead on the effectiveness of such a scheduler
in revealing units that can be power-gated by reducing variation between the sets of units that
are required during each cycle of the loop.
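The two-pass approximation can be sketched in Python as follows. The first-pass limits are the example values given above and the unit totals follow the lengths of the cost lists in Table 7.1; the function name is hypothetical, all queued instructions are assumed to be ready to issue, and the selection logic shown is only one way the idea could be realised.

```python
from collections import Counter

# Units granted in the first pass (example limits from the text).
FIRST_PASS = {"ALU": 2, "MUL": 1, "MEM": 1, "FPADD": 1, "FPMUL": 1}
# All units available, used as the limit in the second pass.
ALL_UNITS = {"ALU": 4, "MUL": 2, "MEM": 2, "FPADD": 1, "FPMUL": 1}


def two_pass_select(queue, width=4):
    """Return the queue positions issued this cycle.

    queue: list of unit types for ready instructions, head first
    """
    chosen = []
    used = Counter()
    for limits in (FIRST_PASS, ALL_UNITS):
        for i, unit_type in enumerate(queue):
            if len(chosen) == width:
                break
            if i not in chosen and used[unit_type] < limits[unit_type]:
                chosen.append(i)
                used[unit_type] += 1
    return sorted(chosen)
```

For the queue of Section 7.1 the first pass alone fills the four issue slots with one MEM, one MUL and two ALU instructions. For a queue holding only ALU instructions the first pass grants two slots and the second pass fills the remaining two.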
7.4 Summary
The set of execution units that are used to execute a loop body depends on the sets of units used
on each cycle. We present a balancing scheduler that reorders instructions in an attempt to
reduce the variation between the sets of units used on different cycles. Section 7.1 introduces
the problem and potential benefits of a balancing scheduler. In Section 7.2 we describe the
operation of the scheduler and weighting system used to encourage unit reuse. Section 7.3
describes how the issue queue must be processed to implement the scheduler.
A disadvantage of LDM and ELDM is that the thresholds used to switch on or off execution
units cannot take inter-instruction data dependencies into account. As a result performance
may be compromised by switching off an infrequently used unit. As the switch on threshold is
only an indicator of potential performance loss, this unit may not be switched back on again.
The balancing scheduler has access to data dependency information and can determine which
instructions may issue each cycle. Therefore it can reorder the instructions so that fewer units
are used without causing performance loss.
We note that during a loop at least one unit must remain on for each type of instruction that
exists in the loop body. By associating a cost to each unit, we encourage reuse of these units
rather than using a second unit of the same type and increasing the number of units used during
the loop. The costs increase for each additional unit of each type.
Implementing the scheduler involves four passes through the issue queue, but this could be
reduced to two by approximating the balancing scheduler we have described.
Schedule balancing encourages reuse of execution units by prioritising balanced schedules that
issue many different types of instruction each cycle. This chapter and Chapters 5 and 6 describe
the implementation of LDM, ELDM and the balancing scheduler. The next chapter describes
the simulation and power estimation tools we will use to evaluate our techniques. We will
also discuss the power overheads of the hardware that must be added to the architecture to
implement our techniques.
Chapter 8
Simulation and power estimation
To evaluate the effectiveness of our techniques we implement them in the SimpleScalar/Alpha
simulator. Using the simulation statistics, we can then estimate the power consumption using
McPAT. This chapter starts by describing factors that influence the accuracy of processor
simulators, and describes how our evaluation should not be greatly affected by these factors. We
then describe our changes to SimpleScalar and how we account for our power-gating technique
when estimating power consumption using McPAT. Finally we describe the benchmarks we will
use in our evaluation.
8.1 Simulator accuracy
Simulators are used to estimate processor performance when evaluating changes to an architecture,
as implementing the design physically is prohibitive in terms of time and money.
Simulators allow changes to be made quickly and cheaply, but must be sufficiently accurate
so that the behaviour of the simulator represents the behaviour that would be seen in a real
processor. This section discusses recent research on simulator accuracy and how accuracy might
be affected when simulating our LDM power-gating techniques. The next section describes the
simulation parameters and modifications we have made to implement LDM, ELDM and the
balancing scheduler.
In [CLSL02], Cain et al. make a distinction between the precision and accuracy of a simulator.
Precision describes how closely the simulator matches the hardware being simulated, and covers
aspects such as latencies, protocols and structures. Precision is important when making changes
to an architecture, so that the simulated components can reflect all the effects of the change.
Accuracy relates to how closely the simulator reproduces the behaviour of the processor, and
is dependent on the benchmarks used, emulation of operating system effects and loads on the
system resources from external sources. Accuracy is important when producing estimates for
the behaviour of a real system, in terms of power or performance, for example.
In [CLSL02], the authors create a highly precise full-system simulator of the PowerPC processor,
so that they can evaluate how accuracy is affected by the operating system, direct memory
access and wrong path execution. Interestingly, as more detail is added in the workload by
including execution of library code and operating system code, the accuracy of miss rates in a
cache decreases and results in up to 5.8x overestimation of misses. Using the precise simulator,
the authors find that an instruction to fill an entire cache line with zeroes is responsible for
overestimating the cache misses, as the precise simulator emulates hardware that avoids missing
in the cache since the whole line will be rewritten anyway. They find that accounting for direct
memory accesses and wrong path execution had a low impact on simulation accuracy.
In [GKO+00], the authors compare simulation models of a multiprocessor system, FLASH, to an
actual hardware implementation. Their simulators included a low precision simulator, Mipsy,
both with and without full operating system simulation, and a high precision simulator, MXS,
with full operating system simulation. Initially, no simulators were accurate when compared to
the real implementation, and no clear advantage was seen with full operating system simulation.
Sources of error were attributed to inaccurate latencies for the TLB (even in the most precise
simulator, MXS), cache conflicts caused when ignoring the operating system, and imprecise
latencies for instruction execution in Mipsy. Precision is important for accurate simulation,
but due to the complexities of modern architectures and consequences of imprecision, a small
inconsistency between simulator and the real implementation (for instance the TLB latency)
can have a large effect on accuracy.
Simulators are often used in research to evaluate the relative improvement that results from
an architecture change. Gibson et al. show in [GKO+00] that even an imprecise simulator can
closely predict trends when scaling the number of processors in a multiprocessor system, if
the operating system is also simulated. Redstone, Eggers and Levy [REL00] also show that
modelling the operating system is particularly important for commercial workloads such as
webservers, where up to 75% of execution is operating system code and not application code.
In the absence of an absolutely precise simulator, merely choosing a more precise simulator may
be counterproductive, as the remaining lack of precision will still have an impact on accuracy.
As a result, a simple simulator that is augmented with measured latencies may be better than
one that attempts to precisely model the factors that contribute to the latency. It is important
that sufficient detail is provided, however, so that the simulated components can reflect the
effects of any architecture changes.
The SimpleScalar/Alpha simulator [BA97] is used for simulation of our power-gating approach and models
the superscalar pipeline in detail, including dynamic scheduling, register renaming and limits
on the number of instructions that can be issued and committed. The operating system is not
simulated; however, the potential effects are considered. A focus of LDM and ELDM is low
performance impact, and any changes in performance are due to hard factors (switching off a
unit), which have well defined effects. We anticipate low interference between these changes
and the function of the operating system, so the operating system should have minimal impact
on the relative performance degradation caused by LDM or ELDM. It should also be noted
that LDM and ELDM could be applied when executing operating system code as well, so the
energy savings would not be limited by the proportion of time spent executing application code.
Increased memory latencies due to misses in the cache are currently accounted for in LDM and
ELDM (see Section 5.5 for an example). An increased latency for a memory access (due to a
miss in the TLB for instance) would decrease performance for simulation of both the reference
processor and the processor implementing LDM or ELDM similarly.
8.2 Simulation and power estimation toolchain
The simulation and power estimation toolchain used to produce the results presented in the
next chapter is illustrated in Figure 8.1 and described in this section.
[Figure 8.1 diagram. Main flow: benchmarks, Compaq C Compiler, program binary (Alpha ISA), SimpleScalar/Alpha (with SimpleScalar parameters), simulation statistics, XML Printer (with McPAT parameters), XML specification, McPAT, power estimation, Results (with execution cycles). ELDM flow: execution trace, CFG Printer, control flow graph, Manual Processing, loop entry and exit points.]
Figure 8.1: Simulation and power estimation toolchain.
First the benchmark source is compiled to produce a binary that uses the DEC Alpha in-
struction set architecture. This need only be done once, as the source is never changed during
experimentation. The binary is provided to SimpleScalar along with simulation parameters that
describe the processor architecture to simulate. The specific parameter values are summarised
in Table 8.1, and are passed to SimpleScalar as command line arguments (see Appendix A).
SimpleScalar is modified to output all the required simulation statistics in a format that is
easily parsed by the XML printer.
McPAT takes an XML file as input, which provides both the architecture description and
simulation statistics. The XML printer takes a preprepared XML file that already contains the
architecture description and inserts the simulation statistics provided by SimpleScalar. The
preprepared XML file can be found in Appendix B. It should be noted that extra attributes
have been added to the XML file to communicate the extra information needed to account
for power-gating, as described in Section 8.4. McPAT produces power estimation data for a
breakdown of the processor units, and for the processor as a whole, which is lastly combined
with the execution time in cycles from SimpleScalar.
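The statistics insertion step performed by the XML printer can be sketched as follows. This is a minimal illustration rather than the tool used in this work: the reduced template, the `fill_stats` helper and the stat names are hypothetical, and the `<component>`, `<param>` and `<stat>` layout is only assumed to resemble the pre-prepared file in Appendix B.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily reduced template; the real pre-prepared file is in Appendix B.
TEMPLATE = """
<component id="root" name="root">
  <component id="system.core0" name="core0">
    <param name="clock_rate" value="1200"/>
    <stat name="total_cycles" value="0"/>
    <stat name="ialu_accesses" value="0"/>
  </component>
</component>
"""


def fill_stats(template_xml, stats):
    """Return the XML text with each <stat> value replaced from `stats`.

    `stats` maps a stat name to the value reported by the simulator.
    Stats that are absent from `stats` keep their template value, and
    the architecture description (<param> elements) is left untouched.
    """
    root = ET.fromstring(template_xml.strip())
    for stat in root.iter("stat"):
        name = stat.get("name")
        if name in stats:
            stat.set("value", str(stats[name]))
    return ET.tostring(root, encoding="unicode")


# Illustrative values only; they are not results from this work.
filled = fill_stats(TEMPLATE, {"total_cycles": 1500000, "ialu_accesses": 420000})
```

Keeping the architecture description in a fixed template means that only the statistics change between simulation runs.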
For ELDM, two steps must be added to the toolchain to produce the entry and exit point lists.
The benchmark must be simulated twice: first without power-gating enabled to produce
an execution trace, which is processed by the CFG printer to output a PostScript representation
of the control flow graph. This can be processed manually to identify loop entry and exit points
and produce the required lists. The second simulation uses these lists to apply power-gating
using ELDM.
If these lists can be produced at compile time, then these two steps will be moved into the
compile step and the relevant information will be passed to the simulator via special instructions
inserted into the binaries.
8.3 Simulation of LDM, ELDM and schedule balancing
To evaluate the performance impact of our techniques and produce statistics which will be
used to estimate power consumption, we use the SimpleScalar/Alpha [BA97] simulator. The
superscalar architecture is based on the DEC Alpha 21264 architecture, which we have slightly
modified to separate some composite execution units that execute more than one type of in-
struction into separate units. The specification of the simulated processor and other simulation
parameters are detailed in Table 8.1.
The SimpleScalar toolset is one of the most popular simulators [YL06], and was created in
1995. It has undergone many refinements and revisions, and its broad adoption allows direct
comparison to other methods. To verify that all benchmarks are executed correctly, they were
run to completion on the sim-fast simulator variant and the outputs were checked against the
reference outputs provided. The benchmarks will be described later in Section 8.6.
To simulate LDM and ELDM, the SimpleScalar/Alpha (sim-outorder) simulator was modified
to simulate power-gating of the execution units. To preserve the validity of the simulator, the
only original code that we modify is a small section of code associated with obtaining execution
unit resources during instruction issue. We add an on flag to resource objects to indicate whether
the resource is currently on. When the flag is unset the unit behaves as if busy, preventing
instruction issue to the unit.
Parameter                       Value
Decode, issue, commit width     4
Registers                       64
Load/store queue                64
ALUs                            4
MULs                            2
MEMs (address computation)      2
FP ALUs                         1
FP MULs                         1
Branch misprediction latency    7 cycles
Branch predictor                combined (bimodal and 2 level)
Branch target buffer            2048 entry
Instruction TLB                 512 entry
Data TLB                        512 entry
Data cache (L1)                 128 KB
Instruction cache (L1)          128 KB
L2 cache                        16 MB
Memory access width             128 B
TLB miss latency                30 cycles
L2 cache latency                32 cycles
Memory latency                  55 cycles
Table 8.1: DEC Alpha 21264 specification used for simulation and power estimation.
When a unit is to be turned off or on, either due to a prediction or loop exit, an FSM associated
with each unit is used to implement the necessary delays. The FSM has four states: pending on,
on, pending off, and off. If a unit is deemed unnecessary for executing a loop, the FSM for
that unit enters the pending off state where the on flag is unset to prevent further issue to
the unit. When the unit has completed any previously issued instructions it transitions to off
and we assume the unit stops consuming power. When a unit is requested to be switched on,
the FSM first enters the pending on state for a delay to simulate the unit powering on, before
transitioning to on where the on flag of the resource is set and the unit becomes available for
instruction issue once more.
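The state machine described above can be sketched as follows. This is an illustrative model and not the modified sim-outorder code: the class, method and field names are hypothetical, and a switch-on request is assumed to be acted on only when the unit is fully off.

```python
from enum import Enum


class State(Enum):
    ON = "on"
    PENDING_OFF = "pending off"
    OFF = "off"
    PENDING_ON = "pending on"


class GatedUnit:
    """One execution unit with the four-state power-gating FSM."""

    def __init__(self, wakeup_delay=8):
        self.state = State.ON
        self.on_flag = True      # issue is allowed only while this is set
        self.in_flight = 0       # instructions issued but not yet completed
        self.wakeup_delay = wakeup_delay
        self.wakeup_left = 0
        self.power_ups = 0       # counted for the power overhead estimate

    def request_off(self):
        """The unit was predicted unnecessary for the current loop."""
        if self.state is State.ON:
            self.on_flag = False  # unit now looks busy to the issue logic
            self.state = State.PENDING_OFF

    def request_on(self):
        """Loop exit or prediction refinement asks for the unit again."""
        if self.state is State.OFF:   # assumption: other states ignore the request
            self.state = State.PENDING_ON
            self.wakeup_left = self.wakeup_delay
            self.power_ups += 1

    def tick(self):
        """Advance the FSM by one cycle."""
        if self.state is State.PENDING_OFF and self.in_flight == 0:
            self.state = State.OFF    # previously issued work has drained
        elif self.state is State.PENDING_ON:
            self.wakeup_left -= 1
            if self.wakeup_left == 0:
                self.on_flag = True   # available for issue once more
                self.state = State.ON
```

The `power_ups` counter corresponds to the number of times a unit is switched on, which is the quantity used later to estimate power overheads.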
During pending on, a linear increase in both static and dynamic power is assumed as an approx-
imation to the power dissipated while the capacitances in the circuits are charged. In Section
8.4 we will show how the number of times a unit is switched on is used to estimate the power
overheads from LDM and ELDM.
The power-gating delay depends on many transistor-level design parameters, including max-
imum supply current and rush current [HII+06] affecting neighbouring units, which can be
adjusted during low level design or managed through staged power-up to adjust delay. De-
tailed analysis, low level design optimisation and low level simulation would be required to
evaluate the design tradeoffs. This is outside of the scope of our architectural evaluation, so a
conservative wakeup delay of eight cycles is selected in line with [ADSN06] and [BS02]. This
delay is used for all execution units in the simulator.
The effect of an increased power-gating delay would be that units remain unavailable for longer
when they are requested to be switched on. This only occurs at loop exit and during prediction
refinement. Performance may be reduced during this period, but the other units that are
already on would remain available for issue, so the pipeline would not stall completely. From
Table 8.1 we can see that the branch misprediction latency in our simulated architecture is 7
cycles. If loop exit is caused by a branch misprediction, instructions could not issue to any
execution unit for most of the eight-cycle wakeup delay, and performance would potentially be
affected during only one cycle. However, any increase in delay could not be hidden by the branch
misprediction latency.
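The overlap between the wakeup delay and the branch misprediction latency amounts to a simple subtraction, sketched below. The function name is hypothetical, and the sketch assumes the wakeup starts in the same cycle as the misprediction recovery.

```python
def exposed_wakeup_cycles(wakeup_delay, misprediction_latency):
    """Cycles of the wakeup delay not hidden behind a branch misprediction."""
    return max(0, wakeup_delay - misprediction_latency)


print(exposed_wakeup_cycles(8, 7))    # 1: the configuration in Table 8.1
print(exposed_wakeup_cycles(12, 7))   # 5: every extra cycle of delay is exposed
```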
To implement the balancing scheduler, the issue stage of the simulator is modified according to
the design in the previous chapter. In the original simulator code, a copy of the issue queue is
made and then the issue queue is deleted. Instructions are issued from the copy until all of the
issue slots have been filled, and then the remaining instructions are inserted into a new issue
queue for the next cycle. We implement the balancing scheduler by making an additional queue
that is ordered by the instruction costs. The scheduler then issues up to four instructions from
the head of this queue. The original issue queue copy is used to create the issue queue for the
next cycle, to ensure that unissued instructions remain in their original order. This is not the
most efficient design for the scheduler, but it was implemented in this way so that a minimal
amount of original simulator code would be modified.
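The two-queue arrangement can be sketched as follows. This is a simplified illustration, not the simulator code: `issue_cycle` and the `cost` function are hypothetical, lower cost is assumed to mean higher priority, and resource availability and data dependencies are ignored.

```python
def issue_cycle(issue_queue, cost, issue_width=4):
    """Return (issued, next_queue) for one cycle.

    issue_queue: ready instructions in their original order.
    cost: hypothetical function giving the balancing cost of an instruction.
    """
    # Additional queue ordered by cost. sorted() is stable, so instructions
    # with equal cost keep their original relative order.
    order = sorted(range(len(issue_queue)), key=lambda i: cost(issue_queue[i]))
    chosen = order[:issue_width]
    issued = [issue_queue[i] for i in chosen]
    # Build the next cycle's queue from the original copy so that unissued
    # instructions remain in their original order.
    chosen_set = set(chosen)
    next_queue = [ins for i, ins in enumerate(issue_queue) if i not in chosen_set]
    return issued, next_queue


queue = ["mul_a", "alu_a", "alu_b", "fp_a", "alu_c", "mem_a"]
costs = {"mul_a": 3, "alu_a": 1, "alu_b": 1, "fp_a": 2, "alu_c": 1, "mem_a": 2}
issued, next_queue = issue_cycle(queue, costs.get)
print(issued)       # ['alu_a', 'alu_b', 'alu_c', 'fp_a']
print(next_queue)   # ['mul_a', 'mem_a']
```

Sorting a list of indices rather than the queue itself leaves the original queue untouched, which mirrors the aim of modifying as little of the original issue logic as possible.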
8.4 Power estimation
SimpleScalar allows us to estimate the performance impact of our techniques by comparing
them to a simulated processor with the same specification that does not implement LDM or
ELDM. To estimate the power consumption of our techniques we output usage statistics from
SimpleScalar which we can use with a power estimation tool.
Wattch [BTM00] is such a tool, and it can produce fast high-level power estimates for architectures.
Unfortunately, it focuses on dynamic power dissipation, which is dominant in the
350nm technology generation that Wattch targets. It would therefore be inappropriate
for smaller generations, as it would overestimate the contributions of units that exhibit high
activity compared to large devices that dissipate more static power.
McPAT (Multicore Power Area and Timing) [LAS+10] allows the user to select the technology
generation that would be used to implement the simulated processor, ranging from 90nm to
22nm. McPAT takes a hierarchical approach to power estimation. At the technology level,
data and projections from the ITRS roadmap for different technology generations define the
characteristics of the transistors in terms of power usage and delay. The circuit level models
fundamental circuit blocks, such as arrays, complex logic and clock distribution logic, which can
be combined to model larger components at the architectural level. By breaking the modelling
process into these levels of detail, new models can be created in one level without needing to
change the models in another level.
The projections used by McPAT for technology scaling predict that the current planar bulk
CMOS devices could not be scaled beyond 36nm, so it assumes a switch to silicon-on-insulator
(SOI) technology at 32nm. It is also predicted that SOI will reach scaling limits at 25nm,
so the models for 22nm assume double gate technology. Intel’s microprocessors have been
implemented in the 32nm technology since 2009 [Int10b], however, these were implemented
using planar bulk CMOS technology. As the planar bulk CMOS technology is more mature,
and McPAT does not estimate power for 32nm planar bulk CMOS technology, we will use the
45nm technology generation for our evaluation.
McPAT takes two sets of information as input, the simulation statistics and a specification
of the architecture. Before estimating power consumption, McPAT uses the specification to
produce a realistic architectural design that meets the timing constraints and uses minimal
area. It does this by searching the design space of processor configurations and optimising
the individual components of each configuration to meet the timing constraints and minimise
area. When a configuration has been selected, the static power consumption of the processor
is calculated and the simulation statistics from SimpleScalar are used to calculate dynamic
power. McPAT was validated against real data for the Niagara, Niagara2, Alpha 21364 and
Xeon Tulsa processors [LAS+10]. The relative power consumption of individual components in
the modelled Alpha was within 3% (of total power) of the published power data for the Alpha
21364.
The operating frequency we select is 1.2GHz, as this is close to the maximum frequency supported
by the DEC Alpha 21264 [Com02]. Other low power processors also operate at around
this frequency: the Intel Atom operates at 1.1GHz to 2.1GHz, and the ARM Cortex-A8 operates
at 1.0GHz to 1.5GHz [Tex11, Int10a]. High performance architectures will often operate
at higher frequencies, so in Chapter 9 we will compare the estimated power savings at 1.2GHz
to the estimated savings at 2.0GHz and 3.0GHz.
Power-gating the ALU, MUL and FPU units is implemented in McPAT by weighting the total
static power of the units by the proportion of execution time they are on. The MEM units
communicate with the load/store queue, but are combined with the memory subsystem in
McPAT, so the effects of power-gating these units could not be included. The linear increase in
static and dynamic power over eight cycles during unit power-up is equivalent to an additional
4 cycles at full power for each unit power-up. By adjusting the total number of cycles each
execution unit is operational (for static power) and active (for dynamic power) in the simulation
statistics, this overhead can be reflected in the power estimation provided by McPAT.
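The equivalence between the linear ramp and four full-power cycles, and the resulting adjustment of the cycle counts, can be checked with a short sketch. The function names are hypothetical, and sampling the ramp at the midpoint of each cycle is an assumption made here so that the discrete sum equals the area under the ramp.

```python
def ramp_equivalent_cycles(wakeup_delay=8):
    """Full-power cycles equivalent to a linear 0% to 100% power ramp.

    Power in cycle k is taken at the midpoint of the cycle,
    (k + 0.5) / wakeup_delay, so the sum equals wakeup_delay / 2.
    """
    return sum((k + 0.5) / wakeup_delay for k in range(wakeup_delay))


def adjusted_cycles(cycles, power_ups, wakeup_delay=8):
    """Cycle count charged to a unit, including the power-up overhead."""
    return cycles + power_ups * ramp_equivalent_cycles(wakeup_delay)


def gated_static_power(static_power, on_cycles, power_ups, total_cycles):
    """Static power weighted by the proportion of execution time the unit is on."""
    return static_power * adjusted_cycles(on_cycles, power_ups) / total_cycles


print(ramp_equivalent_cycles(8))    # 4.0
print(adjusted_cycles(1000, 10))    # 1040.0
```

The same adjustment would apply to the active-cycle count used for dynamic power, since the ramp is assumed for both static and dynamic power.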
The overheads resulting from additional structures implemented for each technique are dis-
cussed in the next section. As area (more specifically, number of transistors) can be used as
a guide for the static power dissipation, and static power accounts for the majority of power
consumption, the power overheads should be negligible if we can show that the additional
structures required to implement each technique have negligible size.
8.5 Hardware overheads
For simulation of LDM, we use a resource prediction cache that can hold predictions for 64
innermost loops. To accommodate the complex control flow in a loop of the gcc benchmark
the branch pattern is 128 bits long, and the storage requirement for the cache is 1728 bytes
(equivalent to 2.6% of the L1 cache). As the L1 cache constitutes only around 5% of the total
power of the processor (see Figure 3.2), the power consumed by the prediction cache should be
negligible. Even so, we iteratively halved the number of entries to determine the effects of a
smaller cache on EDP savings. A two entry cache (requiring only 54 bytes of storage) reduced
EDP savings by less than 1% (of total EDP) for 15 benchmarks, while sjeng showed the largest EDP
increase, 2.3%. On average, the two entry cache reduced the EDP savings by 0.2% (of total
EDP) when compared to the 64-entry cache. The 64-entry cache is used, however, as the size
of the device is likely to be negligible.
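The storage figures quoted above are consistent with a fixed entry size, as the following arithmetic sketch shows. Only the totals and the 128-bit branch pattern come from the text; the per-entry size is derived from them, and the contents of the remaining bits are not specified here.

```python
TOTAL_BYTES = 1728          # storage quoted for the 64-entry cache
ENTRIES = 64
BRANCH_PATTERN_BITS = 128

bytes_per_entry = TOTAL_BYTES // ENTRIES                  # 27 bytes
other_bits = bytes_per_entry * 8 - BRANCH_PATTERN_BITS    # 88 bits for the remaining fields


def cache_bytes(entries):
    """Storage for a prediction cache with the given number of entries."""
    return entries * bytes_per_entry


# Sizes visited when the entry count is halved repeatedly.
sizes = {n: cache_bytes(n) for n in (64, 32, 16, 8, 4, 2)}
print(sizes[64], sizes[2])    # 1728 54
```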
Counters are currently included in many processors for monitoring performance, and utilisation
counters required for LDM would be similar. The counters only need to count up to the
maximum number of cycles required to execute an iteration of a loop. 16-bit counters would
be more than sufficient for our benchmark traces (see benchmark loop statistics in Table 9.1)
and the power dissipated should also be negligible when compared to other larger structures
such as the L1 cache.
To simulate ELDM we use lists of loop entry and exit points as described earlier. For our
simulation these lists must be passed to the processor in their entirety as the compiler could
not be modified, but in a real processor this information would be conveyed through special
instructions inserted into the program code. The overheads of conveying this information should
be low, as it only needs to be provided at the beginning of each loop, and consists of one address
for loop entry and a tuple of two addresses for each possible loop exit. To operate correctly,
ELDM needs to update predictions for all loops in a nest on each cycle, so the prediction cache
will need to have as many entries as there are loops in the nest, unless the outer loops are
ignored in very deep nests. As this cache is likely to have fewer entries than the LDM prediction
cache, and the entries will not contain the forward branch pattern, the ELDM cache should
consume less power than the LDM cache, and is therefore negligible in the context of overall
processor power.
8.6 Benchmarks
The benchmarks we use to evaluate our power-gating techniques are from the SPEC CPU2006
suite [Cor11]. Only 16 programs could be cross-compiled for SimpleScalar/Alpha and successfully
executed on the simulator. Nine are from the integer suite, six are from the floating
point suite, and specrandom was also included. The remaining programs were either written in
Fortran or contained system calls not supported by the simulator.
Executing some benchmarks to completion using the simplest functional simulator from the
SimpleScalar toolset (sim-fast) takes on the order of days, which would make full
simulation intractable for the more complex sim-outorder simulator used in this work. Therefore,
evaluation is divided into two stages. The first simulates all 16 benchmarks for a trace of 1 million
instructions after a warm up of 5 million instructions. The short traces allow the simulations
to be repeated for different combinations of thresholds to determine the optimal thresholds
for LDM, ELDM and ELDM when using the balancing scheduler. As the techniques power-
gate units during loops, it is important that a variety of loops exist throughout the 1 million
instruction traces. Although the traces contained only a single loop for some benchmarks, up to
148 loops were detected for others. Over 280 different loops exist in total across the benchmark
traces, and they range in size and scope. For each benchmark, all test datasets were used and
the average over the datasets is given in the results.
Benchmark    Coverage
perlbench    13.0%
sjeng        13.4%
lbm          64.3%
gromacs      54.7%
Table 8.2: Proportions of each benchmark that are similar to the chosen sample.
The second stage of the evaluation simulates a representative 10 million instruction trace for
selected benchmarks using the optimal thresholds determined in the first stage. This allows us
to estimate the results we would see from executing the benchmark to completion. Due to time
constraints, not all benchmarks can be simulated in the second stage, so four benchmarks are
selected (two integer, two floating point) using the dendrograms in [PJJ07]. The dendrograms
group the benchmarks in terms of similarity and can be used to select the most representative
groups of benchmarks from the SPEC CPU2006 suite. The integer benchmarks perlbench and
sjeng, and the floating point benchmarks lbm and gromacs are chosen for the second stage.
The 10 million instruction traces are selected using the SimPoint tool [SPHC02], which groups
similar basic block traces and provides an example trace from each group. A trace from the
largest group is then selected and used for simulation. Table 8.2 shows how much of the original
benchmarks are covered by the chosen traces. Although coverage is low for perlbench and sjeng,
achieving a high coverage for these benchmarks would involve many more simulations.
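The selection of a trace from the largest group, and the coverage it implies, can be sketched as follows. This does not use the SimPoint interface: it assumes the grouping has already been computed and is available as one hypothetical cluster label per fixed-length interval of the benchmark.

```python
from collections import Counter


def largest_cluster_coverage(labels):
    """Return (cluster, coverage) for the most populated cluster.

    labels: one cluster label per fixed-length interval of the benchmark.
    coverage: fraction of intervals that fall in the chosen cluster.
    """
    counts = Counter(labels)
    cluster, size = counts.most_common(1)[0]
    return cluster, size / len(labels)


# Illustrative labels only; they are not taken from any benchmark.
labels = [0, 1, 1, 2, 1, 0, 1, 1]
print(largest_cluster_coverage(labels))    # (1, 0.625)
```

Under this view, a coverage of 13% means that the chosen trace stands in for about one interval in eight of the full execution.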
The large simulation times for these benchmark traces are largely due to fast-forwarding, which
requires simulating around 100 billion instructions with the simple functional simulator to reach
the start of the traces. Selecting longer traces for our second stage would not significantly
increase simulation time, but would have made loop detection more difficult as we are currently
limited to detecting the loop entry and exit points manually. Given the length of the warm-up
periods, our minimum coverage of 13% (Table 8.2) means that our traces represent execution
of at least tens of billions of instructions. Moreover, our evaluation later will demonstrate that
increasing the trace size from 1 million instructions to 10 million instructions has little effect
on the results.
8.7 Summary
In this chapter we describe the simulation and power estimation tools that we will use to
evaluate LDM, ELDM and ELDM when using the balancing scheduler. We first discuss factors
that can influence simulator accuracy in Section 8.1, and describe how these should have a
low impact on our results. Then in Section 8.3 we describe the changes we have made to the
simulator to implement LDM, ELDM and schedule balancing. In Section 8.4 we describe the
McPAT power estimation tool, which attempts to produce a realistic processor architecture
model before it estimates power consumption. The overheads of structures that would need
to be added to support LDM and ELDM are discussed in Section 8.5 and the benchmarks are
described in Section 8.6.
Simulator accuracy is crucial, otherwise results produced by the simulator are meaningless.
Existing research has identified key sources of inaccuracy, which include assumptions about the
operating system, but also shows that minor assumptions in more precise simulators can have
large effects on accuracy. We use simulation to compare an architecture
that implements our technique with one that does not. Doing so lessens the requirement for absolute
accuracy as long as the effects that lead to inaccuracy affect both architectures similarly.
We choose the SimpleScalar/Alpha simulator to evaluate our techniques as it has been broadly
adopted by the research community and has sufficient precision in the superscalar pipeline to
account for the effects of power-gating execution units.
We have chosen a recent power estimation tool, McPAT, so that power estimations can be made
using data from recent technology generations which consume more static power than dynamic
power. The tool attempts to determine a realistic processor design given a specification, by
considering different configurations and selecting one that meets timing constraints while min-
imising area. Then, statistics from simulation are used in calculating the power consumption
of the processor when executing a particular benchmark trace. The structures required to im-
plement our power-gating techniques are deemed negligible when compared in size to the L1
cache.
The benchmarks we will use to evaluate our techniques are taken from the SPEC CPU2006
suite. Traces of the 16 benchmarks that are 1 million instructions long will be used in the first
stage so that optimal thresholds can be determined. Then 10 million instruction traces that
are representative of large proportions of four benchmarks will be used along with the optimal
thresholds to estimate the overall power savings that could be achieved using our techniques.
This chapter describes the simulation and power estimation framework, including the choice
of tools and benchmark traces. The next chapter will use these to evaluate LDM, ELDM and
schedule balancing, and compare them to existing research.
Chapter 9
Results and discussion
This chapter presents an evaluation of our loop-directed execution unit power-gating strategies.
We will examine the following different aspects of the techniques:
• Different combinations of utilisation thresholds.
• Performance, power consumption and EDP when implementing LDM, ELDM and sched-
ule balancing.
• Relationships between loop size and EDP savings.
• Estimation of performance, power consumption and EDP for entire benchmark execution.
• Comparison of ELDM to an oracle which has prior knowledge of idle periods.
• Effects of higher operating frequency on savings.
The final part of our evaluation compares our approach to existing research, and presents
results from simulating two existing techniques using our simulation and power estimation
methodology.
9.1 Optimal threshold selection
LDM and ELDM switch execution units off or on based on the utilisation of the units during
loops. The switch off threshold is set such that units are power-gated if their utilisation is
below the threshold, and a switch on threshold is set to switch on another unit if the units that
are currently on are highly utilised. To evaluate the optimal thresholds for both techniques,
and the effects of incorporating the balancing scheduler into ELDM, we simulated the 1 million
instruction traces of the benchmarks with combinations of different thresholds. Figures 9.1,
9.2 and 9.3 show EDP averaged (arithmetic mean) over all the traces when using different
combinations of switch on and switch off thresholds with each technique. We use EDP instead
of power because aggressive thresholds will power-gate many units, but will also detrimentally
affect performance.
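The threshold test applied after each loop iteration can be sketched as follows for the units of one type. This is an interpretation rather than the simulator code: the function name is hypothetical, "at or below" is assumed for the switch off comparison, "highly utilised" is assumed to mean that every unit currently on exceeds the switch on threshold, and at least one unit is assumed to be on.

```python
def threshold_decision(busy_cycles, iteration_cycles, off_threshold, on_threshold):
    """Decide the action for one unit type after a loop iteration.

    busy_cycles: busy-cycle counts of the units of this type that are on.
    Returns ("off", index), ("on", None) or ("keep", None).
    """
    utilisation = [b / iteration_cycles for b in busy_cycles]
    least = min(range(len(utilisation)), key=utilisation.__getitem__)
    if utilisation[least] <= off_threshold:
        return ("off", least)    # power-gate the least used unit
    if all(u > on_threshold for u in utilisation):
        return ("on", None)      # request one more unit of this type
    return ("keep", None)


# Example with a 5% switch off threshold and a 70% switch on threshold.
print(threshold_decision([2, 40, 90], 100, 0.05, 0.70))    # ('off', 0)
print(threshold_decision([80, 90], 100, 0.05, 0.70))       # ('on', None)
print(threshold_decision([30, 90], 100, 0.05, 0.70))       # ('keep', None)
```

With the inclusive comparison, a switch off threshold of 0% gates only units that were never used during the iteration.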
From Figure 9.1 we can see that the optimal thresholds for LDM are 5% and 70% (switch off
and switch on respectively). The optimal switch off threshold is low, which implies that there is
not much opportunity for moving computation to a different unit without causing performance
loss. The switch on threshold shows that although the utilisation of the next least used unit will
increase by at most 5% when a unit is switched off (assuming the 5% switch off threshold), it is
clearly difficult for the new unit to adopt the extra computation without affecting performance.
Out-of-order schedulers attempt to increase instruction-level parallelism by executing as many
instructions as possible each cycle. As a result, the data dependencies between instructions
can be the limiting factor when it comes to performance. Switching off execution units will
delay instructions in the schedule and will cause performance degradation if a delayed instruction is
on a critical path and other instructions are dependent on its result. The quite conservative
thresholds for LDM suggest that the performance of the simulated traces is limited by data
dependence in many cases. The optimal thresholds for ELDM are similar to those for LDM
(Figure 9.2), and are 2% and 70%.
The balancing scheduler attempts to reuse units that have already been used by issuing a
balanced schedule of instructions that avoids using all of a particular type of unit. We showed in
an earlier example that when using our scheduler, the utilisation of the most used units increased
and the utilisation of the lesser used units decreased. The balancing is only permitted when
the data dependencies allow, so if the balancing scheduler produces an unbalanced schedule,
switching off a unit will likely affect performance. Figure 9.3 demonstrates the effectiveness of
our scheduler, and the optimal thresholds are 0% and 99%. Only units that are never used
are switched off, and units are not switched back on due to performance loss. The optimality
of these thresholds suggests that if a unit is used, even if only rarely, the scheduler could not
move the computation because of dependencies. Therefore, switching off the unit has resulted
in lower EDP savings when the switch off threshold is increased.
Figure 9.1: EDP savings when using LDM with different combinations of thresholds.
Figure 9.2: EDP savings for ELDM when using different combinations of thresholds.
Figure 9.3: EDP savings when using ELDM and schedule balancing with different combinations
of thresholds.
The optimal thresholds when using the balancing scheduler have implications for the implementation
of ELDM. The prediction cache no longer needs to store the utilisation counts from
each unit, but rather a flag to indicate if each unit was used at least once. Threshold comparison
can also be avoided. Because our implementation is set up to use the threshold method for
making predictions, only the least used unit of each type is power-gated after each iteration. As
schedule balancing removes the need for thresholds, we could potentially switch off all unused
units after a single iteration instead of switching off only the least used unit of each type. This
would likely improve power savings only slightly, however.
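The flag-based alternative can be sketched as follows. This illustrates the idea only and is not part of the evaluated implementation; the unit names and the `unused_units` helper are hypothetical.

```python
def unused_units(used_flags):
    """Names of units whose 'used at least once' flag was never set."""
    return [unit for unit, used in used_flags.items() if not used]


# Flags recorded over one loop iteration (illustrative values).
flags = {"ALU0": True, "ALU1": True, "ALU2": False,
         "ALU3": False, "MUL0": True, "MUL1": False}
print(unused_units(flags))    # ['ALU2', 'ALU3', 'MUL1']
```

One bit per unit replaces a utilisation count per unit, and every unit in the returned list could be power-gated after a single iteration.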
We use the optimal thresholds to produce the simulation results in the next sections.
9.2 Comparison of LDM, ELDM and schedule balancing
To compare LDM, ELDM and the balancing scheduler, we present performance, power and EDP
data for each benchmark. Where multiple test datasets were available, traces were simulated
with each dataset and the results averaged. All results have been normalised against simulation
of the traces using a baseline version of the simulator where no power-gating is implemented.
From Figure 9.4 we can see that the performance impact of all three techniques is low (below
2.5%). This is to be expected as the thresholds were selected for optimal EDP, and performance
degradation has a double effect on the energy-delay product: performance features in
the product directly, but performance degradation also increases energy consumption as the
chip must remain on for longer.
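The double effect can be made explicit by writing energy as average power multiplied by execution time, so that execution time enters the product twice. The sketch below assumes power and time are both normalised against the baseline; the function name is hypothetical and the example values are illustrative rather than measured.

```python
def normalised_edp(norm_power, norm_time):
    """EDP relative to the baseline: (power * time) * time."""
    norm_energy = norm_power * norm_time    # the chip stays on for longer
    return norm_energy * norm_time          # delay features in the product directly


# A 15% power saving with a 2% slowdown gives roughly 0.884 of the baseline EDP,
# rather than the 0.85 that the power saving alone would suggest.
edp = normalised_edp(0.85, 1.02)
```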
The 7% performance improvement for the sphinx3 benchmark was unexpected. Analysis of the
LDM runtime statistics revealed a significant change in load/store queue usage when executing
this trace: the load/store queue occupancy was reduced by around 20% and the load/store
queue latency decreased by 25%. Memory units are not power-gated during the trace, but
power-gating other units may have changed the order of issued instructions and resulted in
fewer speculatively executed memory operations. Reordering instructions also has the potential
to improve performance by executing critical path instructions sooner.
In Figure 9.5 we can see that the power savings for LDM are closely linked to the coverage
of executed instructions that are inside innermost loops (shown earlier in Figure 5.2). LDM
has few opportunities to power-gate units in the 6 benchmark traces with low coverage, but
by extending the approach to include all loops in ELDM, the power savings from these traces
are increased by up to 13%. When using the ELDM implementation that incorporates schedule
balancing, the power savings are similar to those seen when not using the balancing scheduler,
despite the power-gating thresholds being different. For four benchmarks, power consumption
is reduced by 20% by all three techniques. The average EDP savings from LDM, ELDM and the
balancing scheduler are 10.3%, 12.8% and 11.7% respectively. However, if we remove sphinx3 from the
average, as the same performance improvement was not seen when using the balancing
scheduler, the difference between the average EDP for ELDM and ELDM with the balancing
scheduler is less than 0.2%.
LDM, ELDM and ELDM with the balancing scheduler achieve significant power savings while
executing the traces. When execution is dominated by innermost loops, such as in the bzip2,
mcf, milc and libquantum traces, the less complex LDM is preferable as it achieves the same
savings as ELDM and ELDM with schedule balancing. More complex loop structures clearly
benefit from ELDM, either with or without schedule balancing, and in gromacs power savings
are increased from 0% with LDM to over 13%. Adding the compile time complexity of detecting
loops may be justified in systems where a range of different application types is expected.
[Figure 9.4 plot. Vertical axis: normalised execution time (0.7 to 1.1). Horizontal axis: 400.perlbench, 401.bzip2, 403.gcc, 429.mcf, 433.milc, 435.gromacs, 436.cactusADM, 437.leslie3d, 445.gobmk, 456.hmmer, 458.sjeng, 462.libquantum, 464.h264ref, 470.lbm, 482.sphinx3, 998.specrandom, Average. Series: LDM, ELDM, Balancing Scheduler.]
Figure 9.4: Execution time for the 16 benchmark traces normalised against a baseline processor
configuration that does not power-gate execution units.
[Figure 9.5 plot. Vertical axis: normalised power (0.6 to 1.2). Horizontal axis: 400.perlbench, 401.bzip2, 403.gcc, 429.mcf, 433.milc, 435.gromacs, 436.cactusADM, 437.leslie3d, 445.gobmk, 456.hmmer, 458.sjeng, 462.libquantum, 464.h264ref, 470.lbm, 482.sphinx3, 998.specrandom, Average. Series: LDM, ELDM, Balancing Scheduler.]
Figure 9.5: Power consumption for the 16 benchmark traces normalised against a baseline
processor configuration that does not power-gate execution units.
[Bar chart: normalised EDP (y-axis, 0.6 to 1.2) for each of the 16 SPEC CPU2006 benchmark traces and their average; series: LDM, ELDM, Balancing Scheduler]
Figure 9.6: EDP for the 16 benchmark traces normalised against a baseline processor configu-
ration that does not power-gate execution units.
Lastly, we should note that the EDP of the baseline processor was never exceeded when simulat-
ing execution of the traces with our techniques. This is due to the fact that we only power-gate
units when we have an accurate prediction of the set of units that will be required, and the
choice of thresholds that optimise for EDP. We would expect similar savings for a larger set of
applications, and in the worst case no change in EDP if no loops can be detected. The following
section details some simulation statistics gathered during trace execution and discusses them
in the context of our power-saving techniques.
9.2.1 Benchmark runtime statistics
Table 9.1 lists loop statistics for the 1 million instruction traces. The traces provide a large num-
ber of loops, with h264ref contributing 148 by itself. Where multiple datasets were provided,
we chose the most representative datasets according to [PJJ07].
The loop characteristics vary in terms of loop size, number of iterations and number of separate
visits to the loop, which allows LDM and ELDM to be thoroughly evaluated under a range of
different conditions. The number of loops contained within a trace is linked to the statistics
in the other three columns. A large number of loops in a trace will likely be the result of loop
nesting, and inner loops in a nest will be restarted on each iteration of the outer loops. This
explains why benchmarks with more loops have more loop visits. If the trace only contains one
loop, it can only be visited once, and assuming it accounts for most of the trace execution, the
loop will have a high number of iterations and consume many cycles. Traces that contain more
loops do so because the loops are shorter.
Power and performance overheads are normally incurred only once for any loop (either at the
start or end when units are power-gated on). Therefore benchmarks with fewer, longer loops
will have lower overheads than those with many short loops. Figures 9.7 and 9.8 show how
our previous normalised EDP results are related to the number of iterations per loop visit and
number of cycles per visit respectively. From the figures we can see a tendency for lower EDP
results when the loops are larger or have more iterations.
Benchmark    Different loops   Total loop visits   Av. iter. per visit   Av. cycles per visit
perlbench                 41                1963                     6                    612
bzip2                      1                   1                 83334                 972328
gcc                        6                1070                     9                   1339
mcf                        1                   1                 35714                 464316
milc                       1                   1                 35714                 464371
gromacs                    4                3368                     5                    186
cactusADM                 41                2818                     8                    240
leslie3d                   5                4902                    17                    328
gobmk                      2                   2                 54370                 271880
hmmer                     11                 211                    56                   5012
sjeng                      2                  22                   132                  33777
libquantum                 1                   1                 35714                 464259
h264ref                  148                1605                    16                    738
lbm                        1                   1                  9523                 342898
sphinx3                    1                   1                  8621                 823741
specrand                   7                2524                     5                    203
Table 9.1: Loop statistics for the 1 million instruction traces, showing the number of different
loops detected in the trace, the number of times a loop is entered, the average number of
iterations loops execute for and the average number of cycles to execute a loop.
[Scatter plot: normalised EDP (y-axis, 0.82 to 0.94) against average number of iterations per loop visit (x-axis, logarithmic, 1e+01 to 1e+05)]
Figure 9.7: Scatter plot of EDP achieved during simulation and average number of iterations
per loop visit for the 16 benchmark traces.
[Scatter plot: normalised EDP (y-axis, 0.82 to 0.94) against average number of cycles per loop visit (x-axis, logarithmic, 1e+03 to 1e+06)]
Figure 9.8: Scatter plot of EDP achieved during simulation and average number of cycles per
loop visit for the 16 benchmark traces.
Tables 9.2 and 9.3 show, for the 16 traces, the total number of cycles that units are off, the
number of times each unit is switched off, and the average number of cycles each unit remains
off when it is power-gated. In Chapter 8 we stated that the time overhead of switching on a
unit is 8 cycles and a linear increase in power consumption is assumed over this period. This is
equivalent to a power overhead of 4 cycles at full power. The values in the tables for the total
number of cycles a unit is off account for these four cycles of power overhead.
For the 16 benchmarks, all units that are power-gated remain off for longer than 4 cycles on
average. The static power savings during this period should compensate for the dynamic power
costs, and therefore no unit will have consumed more power as a result of our power-gating
strategy.
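The overhead arithmetic above can be illustrated with a short sketch. It assumes that power rises linearly from zero to full power over the 8-cycle wake-up period, so the area under the ramp is half the period. The function names are hypothetical and are not part of the simulator. The example values are those reported for the second ALU of the h264ref trace in Table 9.3.

```python
WAKE_UP_CYCLES = 8  # time overhead of switching a unit on (Chapter 8)


def ramp_overhead_cycles(wake_up_cycles: int = WAKE_UP_CYCLES) -> float:
    """Full-power-equivalent cycles consumed by a linear power ramp.

    Assumption: power rises linearly from zero to full power over the
    wake-up period, so the overhead is half the period.
    """
    return wake_up_cycles / 2


def average_off_period(total_cycles_off: int, switch_offs: int) -> float:
    """Average number of cycles a unit stays off each time it is gated."""
    return total_cycles_off / switch_offs


def gating_pays_off(total_cycles_off: int, switch_offs: int) -> bool:
    """True if the average off period exceeds the equivalent overhead."""
    return average_off_period(total_cycles_off, switch_offs) > ramp_overhead_cycles()


# Second ALU of h264ref (Table 9.3): off for 15 cycles over 2 switch-offs.
print(ramp_overhead_cycles())      # equivalent overhead in cycles
print(average_off_period(15, 2))   # average off period in cycles
print(gating_pays_off(15, 2))
```

For this unit the average off period is 7.5 cycles, which exceeds the 4-cycle equivalent overhead.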
The data for mcf, milc and libquantum are very similar in the tables. We suspect that these
benchmarks are executing the same library function or the same standard implementation (such
as searching a list) in the chosen traces.
                                  Execution units
Benchmark   Statistic                  ALU     ALU     ALU     ALU     MUL     MUL     MEM     MEM   FPADD   FPMUL   Total cycles
perlbench   total cycles unit off        0      27    1184  192119  140396  841344    6133   91081  237500  841392         891093
            number of switch offs        0       3      21     319     817     359      25     176    1447     358
            av. cycles per switch        -       9      56     602     172    2344     245     518     164    2350
bzip2       total cycles unit off        0       7  972270  972270       0  972270      56  972270  972270  972270         972448
            number of switch offs        0       1       2       1       0       1       1       1       1       1
            av. cycles per switch        -       7  486135  972270       -  972270      56  972270  972270  972270
gcc         total cycles unit off        0       0     123    7220       0  502514       0     488  153658  153714         513083
            number of switch offs        0       0       1      19       0      18       0       3    1021    1021
            av. cycles per switch        -       -     123     380       -   27917       -     163     150     151
mcf         total cycles unit off        0       0       0  464224  464245  464245       0       0  464245  464245         464652
            number of switch offs        0       0       0       1       1       1       0       0       1       1
            av. cycles per switch        -       -       -  464224  464245  464245       -       -  464245  464245
milc        total cycles unit off        0       0       0  464245  464245  464245       0       0  464245  464245         464652
            number of switch offs        0       0       0       1       1       1       0       0       1       1
            av. cycles per switch        -       -       -  464245  464245  464245       -       -  464245  464245
gromacs     total cycles unit off        0       0       0     188   71713  398436       5     587  398436  398436         410922
            number of switch offs        0       0       0       4    3179      14       1      11      14      14
            av. cycles per switch        -       -       -      47      23   28460       5      53   28460   28460
cactusADM   total cycles unit off        5   25837    7483   47368  154266  472379     381   23023  335645  335820         714603
            number of switch offs        1     112      16     204     611    2226      29      67    2246    2247
            av. cycles per switch        5     231     468     232     252     212      13     344     149     149
leslie3d    total cycles unit off        0       0       0   68553  458712  458712      15      59  458712  458712         459563
            number of switch offs        0       0       0    2444      13      13       3       5      13      13
            av. cycles per switch        -       -       -      28   35286   35286       5      12   35286   35286
Table 9.2: Execution statistics for the first 8 benchmark traces while using ELDM, including
the total number of cycles each execution unit is off, the number of times each unit is switched
off, the average number of cycles each unit remains off and the total number of cycles for the
1 million instruction benchmark trace to complete.
                                  Execution units
Benchmark   Statistic                  ALU     ALU     ALU     ALU     MUL     MUL     MEM     MEM   FPADD   FPMUL   Total cycles
gobmk       total cycles unit off        0       7      26  543752      14      44      16      44  543752  543752         544054
            number of switch offs        0       1       3       2       2       2       3       3       2       2
            av. cycles per switch        -       7       9  271876       7      22       5      15  271876  271876
hmmer       total cycles unit off        0       0   10173  494586    3018  551119      26     821  551551  551551         590408
            number of switch offs        0       0       5      64      85      31       5      41      31      31
            av. cycles per switch        -       -    2035    7728      36   17778       5      20   17792   17792
sjeng       total cycles unit off        0       0       0    2845       0  377213       0     645  377213  377213         381940
            number of switch offs        0       0       0       1       0       3       0       1       3       3
            av. cycles per switch        -       -       -    2845       -  125738       -     645  125738  125738
libquantum  total cycles unit off        0       0       0  464222  464243  464243       0       0  464243  464243         464652
            number of switch offs        0       0       0       1       1       1       0       0       1       1
            av. cycles per switch        -       -       -  464222  464243  464243       -       -  464243  464243
h264ref     total cycles unit off        0      15     960  136051   88938  379890    8053   20858  457805  460393         645054
            number of switch offs        0       2      27     299     436     539     484     218     389     388
            av. cycles per switch        -       8      36     455     204     705      17      96    1177    1187
lbm         total cycles unit off        0       0       0  342774       0  342774       0       0  342774  342774         344361
            number of switch offs        0       0       0       1       0       1       0       0       1       1
            av. cycles per switch        -       -       -  342774       -  342774       -       -  342774  342774
sphinx3     total cycles unit off        0       0       0  823095       0  823095       0       0       0       0         824992
            number of switch offs        0       0       0       1       0       1       0       0       0       0
            av. cycles per switch        -       -       -  823095       -  823095       -       -       -       -
specrand    total cycles unit off        0       0      80     262   87540  392825     219   28190   35622   87540         434191
            number of switch offs        0       0       2       9    1993     515       7    1000    1494    1993
            av. cycles per switch        -       -      40      29      44     763      31      28      24      44
Table 9.3: Execution statistics for the second 8 benchmark traces while using ELDM, including
the total number of cycles each execution unit is off, the number of times each unit is switched
off, the average number of cycles each unit remains off and the total number of cycles for the
1 million instruction benchmark trace to complete.
                   perlbench (int)   sjeng (int)   gromacs (fp)   lbm (fp)
Exec. time                  100.1%        100.0%         100.0%     100.6%
EDP                          98.0%         98.8%          90.5%      88.2%
Exec. unit power             94.6%         96.8%          75.7%      64.9%
Table 9.4: LDM results when executing representative 10 million instruction traces of the
benchmarks.
                   perlbench (int)   sjeng (int)   gromacs (fp)   lbm (fp)
Exec. time                  100.2%        100.3%         100.0%     100.1%
EDP                          87.2%         89.1%          89.6%      89.9%
Exec. unit power             64.1%         69.2%          73.3%      72.3%
Table 9.5: ELDM results when executing representative 10 million instruction traces of the
benchmarks.
                   perlbench (int)   sjeng (int)   gromacs (fp)   lbm (fp)
Exec. time                  100.3%        100.9%         101.3%     100.0%
EDP                          87.3%         88.4%          94.5%      97.1%
Exec. unit power             64.6%         64.5%          79.8%      92.1%
Table 9.6: Balancing scheduler results when executing representative 10 million instruction
traces of the benchmarks.
9.3 Representative 10 million instruction traces
To estimate the savings that we would see from execution of entire benchmarks and test the
optimal thresholds we found for the 1 million instruction traces, we execute representative 10
million instruction traces on the simulator. A full description of the selection of the benchmarks
and traces can be found in Section 8.6.
In Tables 9.4, 9.5 and 9.6 we show the normalised execution time, EDP and execution unit
power consumption data for the two integer benchmark traces from perlbench and sjeng, and
the two floating point benchmark traces from gromacs and lbm.
Using the same thresholds as for the 1 million instruction traces, similar performance degra-
dation has been achieved, and execution time does not increase by more than 1.3% for any
trace when using LDM, ELDM and ELDM with schedule balancing. EDP results are also close
to the average results for the 1 million instruction traces, although low savings were seen for
the integer benchmarks when using LDM and the floating point benchmarks when using the
balancing scheduler. In all cases, the EDP still does not exceed that of the baseline processor.
The final row in each table shows the power savings for the execution units rather than the
processor as a whole. Power savings of over 35% can be seen for many traces. This is much
lower than the 70% savings described in Subsection 3.3.2 for cache memories, but the savings
are significant for execution units, as their utilisation is less predictable and changes rapidly
from cycle to cycle.
9.3.1 Applying schedule balancing oﬄine
In the results from the 1 million instruction traces the EDP results are on average similar when
using the balancing scheduler with ELDM and when using the standard scheduler (assuming we
do not include sphinx3 in the average). The representative 10 million instruction traces show
that schedule balancing reduces the effectiveness of ELDM, although these results are only
included to verify that the savings reported earlier are realistic, and the sample size is too
small to draw conclusions from.
Schedule balancing permits different thresholds to be used to achieve similar savings when
using ELDM, and this has the advantage of reducing the prediction complexity. Schedule
balancing increases the scheduler complexity, but this complexity could be absorbed at compile
time, either using static VLIW scheduling or by reordering instructions that will be issued by
a dynamic scheduler. With static scheduling for a VLIW, the very long instructions would
contain a balanced mix of operations that would reuse units. The dynamic scheduler is greedy
and will issue the instructions nearest to the head of the queue that have their operands ready.
By reordering the instruction stream so that instructions of the same type do not appear next
to each other, the instructions at the head of the queue should have a variety of instruction
types, so the dynamic scheduler will issue a more balanced set of instructions.
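As an illustration of the reordering idea, the sketch below interleaves an instruction stream by unit type using a round-robin over per-type queues. It is a simplification under stated assumptions: the (name, unit type) instruction representation and the function name are hypothetical, and data dependencies are ignored, whereas a real compiler pass would only be free to reorder independent instructions.

```python
from collections import deque


def interleave_by_type(instructions):
    """Reorder (name, unit_type) pairs so that types alternate where possible.

    Assumption: all instructions are independent. Program order is kept
    within each type, and types are visited round-robin in order of first
    appearance.
    """
    queues = {}
    for instr in instructions:
        queues.setdefault(instr[1], deque()).append(instr)

    reordered = []
    while queues:
        # Take one instruction of each remaining type per pass.
        for unit_type in list(queues):
            reordered.append(queues[unit_type].popleft())
            if not queues[unit_type]:
                del queues[unit_type]
    return reordered


stream = [("a1", "ALU"), ("a2", "ALU"), ("m1", "MEM"), ("f1", "FP"), ("a3", "ALU")]
print([name for name, _ in interleave_by_type(stream)])
```

When one type dominates the stream, as ALU does in this example, instructions of that type still end up adjacent at the tail of the reordered stream.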
Either of these options would make the schedule balancing an appealing addition to ELDM, as
the overall complexity of the hardware needed to implement ELDM is reduced as long as the
complexity of the new scheduler can be moved oﬄine. For our evaluation, we implement the
scheduler changes in the simulator, only because the compiler could not be modified.
9.3.2 Oracle results
To evaluate the effectiveness of our power-gating strategy, we will compare it to an oracle
implementation that has knowledge of all future idle periods. This will reveal the potential
power-savings that could be achieved with a perfect strategy. Our oracle implementation will
power-gate execution units during any idle period that it can, and will ensure that units are
switched on prior to their next usage so performance is not affected.
For the implementation, we run the 10 million instruction traces through the baseline simulator
and output for each execution unit a list of idle and active periods with the length of each
period. To ensure performance is not degraded, a unit can only be power-gated during idle
periods that are longer than 8 cycles. Using the lists, the number of cycles each unit will be off
and the number of times a unit is power-gated can be calculated, and this information is input
into the power estimator. The other statistics that are provided to the power estimator can
be obtained from the baseline simulation of the trace, as the oracle should not affect trace
execution in any way.
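The calculation from the idle and active period lists can be sketched as follows. This is a minimal illustration rather than the code used in our framework: the list format and function name are hypothetical, and we assume here that the whole of a qualifying idle period is counted as off, with the switching overhead charged separately by the power estimator.

```python
from typing import Iterable, Tuple

MIN_IDLE_CYCLES = 8  # a unit is only gated in idle periods longer than this


def oracle_gating(periods: Iterable[Tuple[str, int]],
                  min_idle: int = MIN_IDLE_CYCLES) -> Tuple[int, int]:
    """Return (cycles_off, times_gated) for one execution unit.

    periods is a sequence of (state, length) pairs, where state is
    "idle" or "active" and length is measured in cycles.
    """
    cycles_off = 0
    times_gated = 0
    for state, length in periods:
        if state == "idle" and length > min_idle:
            cycles_off += length
            times_gated += 1
    return cycles_off, times_gated


# Hypothetical period list for a single unit.
unit_periods = [("active", 5), ("idle", 8), ("active", 2),
                ("idle", 20), ("active", 1), ("idle", 9)]
print(oracle_gating(unit_periods))
```

In this hypothetical list the 8-cycle idle period is skipped because it is not longer than the threshold, so the unit is gated twice for a total of 29 cycles.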
Figures 9.9, 9.10, 9.11 and 9.12 show for the four traces the proportion of execution time each
unit is off and the number of times each unit is switched off when using the oracle implemen-
tation and ELDM. On the right hand side of each figure, units are normally either always on
or always off with ELDM. This is to be expected if a loop is being executed in the trace, as
power-gating decisions are made at the loop level. The oracle is able to power-gate at a finer
granularity however, so can power-gate units during a loop iteration if they are idle for long
enough. In Figure 9.11 the fourth ALU is off for longer with ELDM, because the unit has been
switched off due to low utilisation and computation must move to another unit. In Figure 9.10
ELDM uses the FP units for more than 13% of the trace, despite the oracle being able to switch
them off for almost all of the trace. This is probably due to execution of non-loop code, where
ELDM requires that all units must be switched on to prevent performance degradation.
On the left hand side of the figures we can see that the oracle switches some units on up to
1.5 million times during the trace. ELDM does not switch a unit on more than 41,000 times
during any of the four traces. The large amount of switching performed by the oracle will not
increase power consumption, but it will decrease the savings as the overhead of switching has
an associated cost.
Figure 9.9: Power-gating statistics for the 10 million instruction perlbench trace when using
ELDM and an oracle implementation.
Figure 9.10: Power-gating statistics for the 10 million instruction sjeng trace when using ELDM
and an oracle implementation.
Figure 9.11: Power-gating statistics for the 10 million instruction gromacs trace when using
ELDM and an oracle implementation.
Figure 9.12: Power-gating statistics for the 10 million instruction lbm trace when using ELDM
and an oracle implementation.
[Bar chart: normalised EDP (y-axis, 0 to 100) for the perlbench, sjeng, gromacs and lbm traces; series: ELDM, Oracle]
Figure 9.13: EDP of the processor when executing the 10 million instruction traces with ELDM
and the oracle implementation.
In Figure 9.13 we show the normalised EDP for the simulated processor when executing the traces
with the oracle implementation and ELDM. For three of the traces, ELDM achieves over 50%
of the savings realised by the oracle, and for the gromacs trace ELDM achieves 77% of the
oracle savings. Given that the oracle has access to information that is not practically available
at runtime, which allows it to power-gate units at the start of idle periods and power them on
at the optimum time to prevent performance degradation, ELDM has performed very well with
the resource requirement predictions made at the loop granularity.
9.3.3 Operating frequency
In Chapter 8 we selected an operating frequency of 1.2GHz, in line with current low-power
architectures. As high performance processors normally operate at frequencies of up to 3GHz,
we used McPAT to estimate the power that would be consumed if the processor operated at
higher frequencies. In Figure 9.14, we show the power consumption when executing our 10
million instruction traces on the baseline processor that does not implement execution unit
power-gating.
At 3GHz the static power consumption of the processor increases only slightly, but the dynamic
power consumption is three times higher and accounts for around a third of the total power
consumption. The lbm trace has a considerably lower average throughput than the other traces,
so the dynamic contribution is less. In Figure 9.15, we can see that the increase in dynamic
power consumption is closely related to the power savings that we achieve. However, at 3GHz
power savings of between 7% and 10% are achieved for all four benchmark traces.
[Bar chart: power in Watts (y-axis, 0 to 20) for the perlbench, sjeng, gromacs and lbm traces; series: 1.2GHz, 2.0GHz, 3.0GHz]
Figure 9.14: Power consumption of the baseline processor with different operating frequencies.
[Bar chart: normalised power reduction (y-axis, 0.00 to 0.15) for the perlbench, sjeng, gromacs and lbm traces; series: 1.2GHz, 2.0GHz, 3.0GHz]
Figure 9.15: Normalised power savings for the 10 million instruction traces at different operating
frequencies when using ELDM.
9.4 Discussion of related work
Maro et al. apply power-gating to clusters of execution units in the DEC Alpha architecture
[MBB01], which contains one cluster for floating point and two identical integer clusters. Three
proposed techniques monitor the level of parallelism and power-gate an integer cluster, and are
summarised as follows:
• Functional unit usage is monitored for each cluster by recording the number of units used
each cycle. Shift registers record the history of each cluster’s utilisation and a consistently
under-used cluster is power-gated off. If the remaining cluster is heavily used, the second
cluster is switched back on.
• IPC (instructions committed per clock) is monitored over 512 cycle intervals to measure
the global trend in instruction-level parallelism. Thresholds indicate if a cluster should
be power-gated off or on.
• Lastly, a cluster is power-gated off when there are many input dependencies between the
instructions in the instruction window. A cluster is power-gated back on as in the first
technique.
For all the integer techniques above, half of the floating point cluster is power-gated off when
an integer cluster is turned off, and the entire floating point cluster is power-gated off if no
floating point instruction is fetched for 3 cycles. The cluster is switched on if new floating point
instructions are fetched.
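The IPC-based technique can be sketched as below. The 512-cycle interval is taken from the description above; the threshold values, the hysteresis between them and all names are assumptions made for illustration, and are not taken from [MBB01].

```python
INTERVAL_CYCLES = 512  # monitoring interval from the description above


def ipc_gating_states(committed_per_cycle, low_ipc, high_ipc,
                      interval=INTERVAL_CYCLES):
    """Return the on/off state of the second integer cluster after each interval.

    committed_per_cycle lists the instructions committed in each cycle.
    low_ipc and high_ipc are hypothetical thresholds: the cluster is gated
    off when the interval IPC falls below low_ipc and switched back on when
    it rises above high_ipc.
    """
    cluster_on = True
    states = []
    for start in range(0, len(committed_per_cycle) - interval + 1, interval):
        window = committed_per_cycle[start:start + interval]
        ipc = sum(window) / interval
        if cluster_on and ipc < low_ipc:
            cluster_on = False
        elif not cluster_on and ipc > high_ipc:
            cluster_on = True
        states.append(cluster_on)
    return states


# Hypothetical trace: three intervals with IPC of 1, 3 and 2.
trace = [1] * 512 + [3] * 512 + [2] * 512
print(ipc_gating_states(trace, low_ipc=1.5, high_ipc=2.5))
```

Because the decision uses only the IPC of the previous interval, it cannot distinguish which types of unit were in use, which is the limitation discussed later in this section.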
The first and second methods were implemented in our simulation and power estimation frame-
work and compared to ELDM. These two methods were chosen as they showed the largest power
savings and smallest performance degradation respectively in the original work. For brevity,
the first method described above will be referred to as Shift, and the second as IPC. Although
ELDM requires information produced at compile time to detect loops, power-gating decisions
are made at runtime using hardware similarly to Shift and IPC.
[Bar chart: normalised execution time (y-axis, 0.6 to 1.4) for each of the 16 SPEC CPU2006 benchmark traces and their average; series: ELDM, IPC, Shift]
Figure 9.16: Execution time for the 16 benchmark traces normalised against a baseline processor
configuration that does not power-gate execution units.
[Bar chart: normalised power (y-axis, 0.6 to 1.2) for each of the 16 SPEC CPU2006 benchmark traces and their average; series: ELDM, IPC, Shift]
Figure 9.17: Power consumed during the execution of the 16 benchmark traces normalised
against a baseline processor configuration that does not power-gate execution units.
[Bar chart: normalised EDP (y-axis, 0.0 to 2.0) for each of the 16 SPEC CPU2006 benchmark traces and their average; series: ELDM, IPC, Shift]
Figure 9.18: EDP for the 16 benchmark traces normalised against a baseline processor config-
uration that does not power-gate execution units.
As can be seen in Figure 9.17, these techniques produce similar average power savings to ELDM
of around 13% and exceed the savings of ELDM in 7 of the benchmarks. Shift offers the greater
power savings of the two, which is consistent with the findings in the original work. Figure 9.16
compares the execution time of IPC and Shift to ELDM, showing that the methods have
sacrificed on average 5.5% and 20.7% performance in achieving these savings. Although IPC
has similar performance to ELDM for 11 of the benchmarks, large degradation in benchmarks
such as sphinx3 has skewed the average. Shift has increased execution time by over 18% for
11 of the benchmarks.
The importance of low performance degradation can be seen in the results for the combined
EDP metric in Figure 9.18. Although IPC and Shift offered similar power savings to ELDM,
the performance impact of the two techniques has resulted in average EDP that is similar or
even worse than the baseline processor with no power-gating at all.
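The effect of execution time on EDP can be illustrated with the averages quoted above. The sketch assumes the usual definition of EDP as energy multiplied by delay, with energy taken as average power multiplied by execution time, and it applies the approximate 13% power saving to both techniques. Combining averages in this way only approximates the per-benchmark averages plotted in Figure 9.18.

```python
def normalised_edp(norm_power: float, norm_time: float) -> float:
    """Normalised EDP, assuming EDP = energy x delay = (power x time) x time."""
    return norm_power * norm_time ** 2


NORM_POWER = 1.0 - 0.13       # approximate average power saving of 13%
IPC_NORM_TIME = 1.0 + 0.055   # 5.5% average increase in execution time
SHIFT_NORM_TIME = 1.0 + 0.207 # 20.7% average increase in execution time

print(round(normalised_edp(NORM_POWER, IPC_NORM_TIME), 2))
print(round(normalised_edp(NORM_POWER, SHIFT_NORM_TIME), 2))
```

Under these assumptions IPC gives a normalised EDP of roughly 0.97 and Shift roughly 1.27, because delay enters the metric squared and so outweighs a similar power saving.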
Whereas ELDM has fine-grain control over execution units, clustering limits the ability of IPC
and Shift to match available units to specific resource requirements. For Shift, in a situation
where only a single unit in a cluster is required, the entire cluster would be power-gated off
to save power, at the expense of performance. The clustering also limits the IPC method, as
trends in types of units that are used over an interval cannot be reflected by power-gating the
clusters. Another restriction is that control flow is not taken into account in either technique,
resulting in observed usage that may not represent the characteristics of the intervals where
power-gating is actually applied. ELDM overcomes this by making predictions only for loops.
The accuracy of the prediction made over an iteration of the loop means that when it is
applied by power-gating units for the remainder of the loop, the performance loss should be
low.
Rosner et al. take control flow into account by storing traces of frequently executed micro-ops in
a trace cache [RAM+04]. The most frequent traces are then optimised dynamically at runtime
to improve performance. Energy is reduced as a consequence, although no power-gating is
implemented. This method could be used to power-gate units, by making predictions for entire
traces (which include loops) after online analysis. However, as static power becomes a more
dominant contributor to power dissipation, the power overhead of the highly complex design
may reduce its efficacy.
The Geyser-1 chip fabricated by Ikebuchi et al. [ISK+09] demonstrates leakage energy reduction
of up to 24% when using their method. The chip operates at 60MHz with a 10ns wake-up delay
(less than a clock cycle) and as a result instructions in the fetch stage can trigger power-up
of execution units, which will be ready without delay. Assuming a similar power-gating delay
(10ns) at the 1200MHz operating frequency of the DEC Alpha, this delay is 12 cycles (which
is in keeping with [ADSN06]). Due to the short pipeline of Geyser-1, instructions could not be
detected early enough to hide these increased power-up delays, making the design infeasible for
such a clock frequency. LDM and ELDM can tolerate increased delay as predictions anticipate
required resources before instructions enter the pipeline.
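The cycle counts quoted above follow directly from the wake-up delay and the clock frequency, as the short sketch below shows. The function name is hypothetical.

```python
def delay_in_cycles(delay_ns: float, frequency_mhz: float) -> float:
    """Convert a delay in nanoseconds to clock cycles at a given frequency."""
    return delay_ns * frequency_mhz / 1000.0


print(delay_in_cycles(10, 60))    # Geyser-1: less than one clock cycle
print(delay_in_cycles(10, 1200))  # the same delay at 1200MHz
```

At 60MHz the 10ns delay is a fraction of a cycle, whereas at 1200MHz it spans 12 cycles.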
The compiler-directed approach in [TSC06] reduces overall processor energy by up to 18%
(the maximum EDP savings achieved by ELDM was 18.7%) by determining execution unit
requirements oﬄine and inserting special instructions to power-gate units. Their results show
negligible energy savings for some benchmarks however, and significant energy increase for
others, resulting in average savings around 4%. This variation in savings suggests that the
compiler approach is not appropriate as a general approach to power-gating, although it can
offer good savings in some cases. LDM, ELDM and ELDM with the scheduler modification
consistently decrease EDP and can achieve savings greater than the compiler approach.
In [LBBS09], Lungu et al. identify intervals where power-gating a unit has been harmful and
disable power-gating to prevent an increase in power consumption. They also limit the amount
by which power could be increased by disabling power-gating after a threshold amount of power
wastage has been observed. This technique is independent of the mechanism chosen to power-
gate units, and could be applied in conjunction with LDM and ELDM. Lungu et al. apply their
techniques on top of a time-out based power-gating mechanism [HBS+04], which increases
power consumption significantly for some benchmarks.
9.5 Summary
This chapter evaluates LDM, ELDM and schedule balancing and compares the LDM approach
to power-gating units to existing research. Section 9.1 explores different combinations of thresh-
olds for the techniques such that the optimal thresholds can be found. The optimal thresholds
are used in Section 9.2 to assess the effectiveness of LDM and evaluate the effects of ELDM
and the balancing scheduler. Runtime statistics from the simulation are also presented and
discussed. In Section 9.3 we simulate 10 million instruction traces that represent the execution
of a large proportion of four benchmarks. This allows us to estimate the power savings we
would expect when executing the entire benchmark, and allows us to verify that the optimal
thresholds are appropriate for different traces. We also compare execution of these traces using
ELDM to execution on an oracle implementation that has knowledge of all execution unit idle
periods in advance. Section 9.4 compares our techniques to other similar research discussed
previously in Chapter 3.
The optimal thresholds for LDM and ELDM are 2-5% and 70% for the switch off and switch on
thresholds respectively. Such conservative thresholds mean that moving computation between
units causes performance degradation in many cases, which suggests many data dependencies
between instructions in the instruction window. The balancing scheduler is able to identify
instructions that can be moved in the schedule and balance the types of instruction that are
issued each cycle. As a result, the optimal thresholds are 0% and 99% when the scheduler is
used with ELDM. Moving computation from an infrequently used unit produces inferior EDP,
as the scheduler would have avoided using this unit during scheduling if it was possible to do
so without performance loss.
The accuracy of the resource requirement predictions means that LDM and ELDM (with and
without schedule balancing) show very low performance degradation. Power savings are high
for LDM, mainly in cases where innermost loops account for most of the executed instructions.
In cases where innermost loops do not, ELDM (with or without schedule balancing) can be
used to apply the power-gating methodology to outer loops. Because of this, ELDM performs
better than LDM on average. When a benchmark trace that experienced a performance increase
with LDM and ELDM (using the standard scheduler) is removed from the average, ELDM
achieves roughly the same EDP savings when using the balancing scheduler. None of our
techniques result in EDP greater than that of the baseline processor when simulating traces.
The balancing scheduler could be implemented oﬄine at compile-time, either through static
scheduling for a VLIW, or by reordering the instruction stream such that a dynamic scheduler
is presented with a variety of instruction types at the head of the instruction window. Doing so
would make the balancing scheduler approach more appealing, as thresholds for the scheduler
make the hardware implementation less complex.
Simulation of the representative 10 million instruction traces demonstrates that the savings seen
when simulating the 1 million instruction traces could be expected when executing the entire
benchmark. Using an oracle implementation we show that despite having no prior knowledge
of idle periods, ELDM is able to achieve over 50% of the EDP savings achieved by the oracle
for three out of the four traces, and in one case achieves 77% of the oracle savings.
No existing research that power-gates execution units can consistently reduce power consump-
tion while maintaining low performance impact. Clustering limits a technique’s ability to match
the resource requirements of an application, and predictions based on observations over an in-
terval cannot accurately predict the requirements of the following interval unless control flow
is considered. Compiler and time-out based methods have previously shown increases in power
consumption, although techniques exist to limit these losses.
Chapter 10
Conclusion
Technology scaling is driving the power consumption of CMOS processors higher, by adding
more transistors and increasing the operating frequency. Due to the energy requirement and
heat dissipation challenges of current processing environments, architectural power reduction
techniques are needed to allow further scaling. Recent feature sizes have seen a shift from
dynamic power to static power as the dominant contributor to chip power, resulting in new
strategies that power off large area devices to eliminate static power leakage. This chapter sum-
marises our approach to power saving, which power-gates execution units by making predictions
about their usage during loops.
10.1 Summary of thesis achievements
This thesis presents three techniques for power-gating execution units: Loop-Directed Mothballing,
Extended Loop-Directed Mothballing and a balancing scheduler. The three main thesis
achievements are as follows:
• A new method is described (LDM), which achieves a 10.3% reduction in energy-delay
product on average during simulation of 16 benchmark traces. LDM monitors execution
unit utilisation over innermost loop bodies, and powers off units that are predicted to be
unnecessary for the remaining iterations. The threshold approach allows predictions to be
made quickly, and allows the prediction to adapt to variation in execution characteristics
caused by out-of-order execution or cache misses.
• An extension to LDM (ELDM) allows the LDM technique to be applied to all loops by
gathering information on loop entry and exit points offline. Doing so increases the EDP
savings for benchmark traces where few executed instructions come from innermost loops
by up to 13%. On average, ELDM increases EDP savings to 12.8%.
• A balancing scheduler is introduced, which achieves similar EDP savings when combined
with ELDM, and offers a less complex hardware implementation if the scheduling is
performed offline.
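The threshold mechanism behind LDM and ELDM can be illustrated with a minimal Python sketch. The names (`UnitState`, `end_of_iteration`) and the exact meaning given to the two thresholds are assumptions made for illustration only: here a unit is gated when its utilisation over an iteration falls below a switch-off threshold, and woken when more than a switch-on threshold of instructions were delayed by its absence. The actual policy is defined in the earlier chapters.

```python
from dataclasses import dataclass


@dataclass
class UnitState:
    """Per-unit bookkeeping for one loop (all fields are illustrative)."""
    powered_on: bool = True
    busy_cycles: int = 0      # cycles the unit was executing during this iteration
    stalled_issues: int = 0   # instructions that wanted this unit while it was off


def end_of_iteration(units, iteration_cycles, switch_off_thresh, switch_on_thresh):
    """Update power states at the end of a loop iteration and reset the counters.

    Assumed policy (hypothetical, for illustration):
      - a powered-on unit whose utilisation is below switch_off_thresh is gated;
      - a gated unit that delayed more than switch_on_thresh instructions is woken.
    """
    for unit in units.values():
        utilisation = unit.busy_cycles / iteration_cycles if iteration_cycles else 0.0
        if unit.powered_on and utilisation < switch_off_thresh:
            unit.powered_on = False
        elif not unit.powered_on and unit.stalled_issues > switch_on_thresh:
            unit.powered_on = True
        # Start the next iteration with fresh counters.
        unit.busy_cycles = 0
        unit.stalled_issues = 0
    return {name: unit.powered_on for name, unit in units.items()}


# One heavily used integer ALU and one unused floating point multiplier.
units = {"ialu0": UnitState(busy_cycles=90), "fpmult0": UnitState(busy_cycles=0)}
states = end_of_iteration(units, iteration_cycles=100,
                          switch_off_thresh=0.05, switch_on_thresh=4)
```

Because the decision is re-evaluated at every iteration boundary, a unit that was gated too eagerly can be woken again as soon as the delayed-instruction counter indicates performance degradation.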
Our selection of thresholds for the techniques optimises for maximum EDP savings. Perfor-
mance is incorporated into the EDP metric, and consequently the thresholds that are selected
result in performance loss below 2.5% in all simulated traces. We test our implementations
and optimal thresholds on 10 million instruction traces which are representative of large pro-
portions of four benchmarks and show that savings are consistent with those achieved for the
1 million instruction traces used for the main evaluation. Using an oracle implementation, we
demonstrate that our best technique, ELDM, achieves up to 77% of the savings that would be
achieved by power-gating units at the optimal time using advance knowledge of idle periods.
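The relationship between the EDP metric, the savings relative to the baseline, and the share of the oracle savings can be made concrete with a short Python sketch. The function names and all numeric values below are illustrative assumptions, not results from the evaluation.

```python
def edp(energy_joules, delay_seconds):
    """Energy-delay product: lower is better."""
    return energy_joules * delay_seconds


def edp_saving(baseline_edp, technique_edp):
    """Fractional EDP reduction relative to the baseline processor."""
    return 1.0 - technique_edp / baseline_edp


def fraction_of_oracle(technique_saving, oracle_saving):
    """Share of the oracle's saving that a technique recovers."""
    return technique_saving / oracle_saving


# Illustrative numbers only (normalised to the baseline).
baseline = edp(energy_joules=1.00, delay_seconds=1.00)
technique = edp(energy_joules=0.86, delay_seconds=1.01)  # less energy, slightly slower
oracle = edp(energy_joules=0.80, delay_seconds=1.00)     # ideal gating, no slowdown

saving = edp_saving(baseline, technique)
share = fraction_of_oracle(saving, edp_saving(baseline, oracle))
```

Since delay multiplies energy, a technique that saves energy but slows execution is penalised twice, which is why optimising thresholds for EDP also keeps performance loss small.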
Our loop-based approach offers accuracy, speed, and adaptability. Accuracy is achieved by
recording the usage of execution units during an iteration of a loop and using that to make pre-
dictions for future iterations. These previous iterations are likely to be representative of future
iterations, but the prediction can also be adapted during execution of the loop if performance
degradation is indicated. As a prediction is available after a single iteration of the loop, units
can be power-gated quickly, and will be switched off for most of their idle periods. Another
feature of the loop-based approach is that it restricts power-gating to situations where long idle
periods are likely to occur, so power savings are large when compared to the power overheads
of power-gating.
Existing research into power-gating processor components such as execution units uses the
time-out approach or predictions made over fixed intervals. Time-out techniques delay power-gating
units until a long idle period has been observed, which reduces the power-saving potential.
Fixed interval prediction techniques assume that the execution characteristics of the next in-
terval are similar to the previous interval, and as such suffer performance degradation when
the assumption does not hold. Compiler methods can be used to detect idle periods and insert
special power-gating instructions, but runtime events such as cache misses cannot be accounted
for.
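The cost of waiting for a time-out can be seen in a small Python sketch that counts the cycles a unit actually spends power-gated. The function names and the idle period lengths are hypothetical, and warm-up and switching overheads are ignored for simplicity.

```python
def gated_cycles_timeout(idle_periods, timeout):
    """Cycles spent gated when a unit is switched off only after `timeout` idle cycles."""
    return sum(max(0, length - timeout) for length in idle_periods)


def gated_cycles_oracle(idle_periods):
    """Upper bound: the unit is gated for the whole of every idle period."""
    return sum(idle_periods)


# Hypothetical idle period lengths (in cycles) for one execution unit.
idle_periods = [20, 500, 64, 3000]
with_timeout = gated_cycles_timeout(idle_periods, timeout=64)
upper_bound = gated_cycles_oracle(idle_periods)
```

Every idle period shorter than the time-out is lost entirely, and every longer period is shortened by the time-out, whereas a prediction available after the first loop iteration can gate a unit near the start of its idle period.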
10.2 Assumptions
In our evaluation we assume 45nm planar bulk CMOS technology for our power estimations.
Alternative technologies are being developed to reduce static power leakage, such as the FinFET
[HLK+00] and high-k dielectrics [KAB+03]. The latter addresses gate leakage only, which is
currently a small component of static power leakage (around 2.5% in our power estimations),
but is predicted to become as significant as subthreshold leakage if not addressed. FinFET
transistors exhibit less subthreshold leakage, but continued scaling of any technology will result
in increasing static power, so FinFETs will only delay the need for architectural methods to
reduce static power.
Another assumption is that the power required to switch on a unit can be modelled by a linear
increase in both static and dynamic power over the 8-cycle warm-up period. We anticipate
that the effect of a longer warm-up period would be small, as our techniques only power-gate
units during idle periods that are likely to be long and we have shown in Section 9.3.2 that the
number of times each unit is switched on is relatively low.
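The linear warm-up model can be sketched in Python as follows. The discretisation of the ramp (cycle k draws k/8 of full power) and the unit power value are assumptions for illustration; only the 8-cycle period and the 1200 MHz clock rate of the McPAT configuration in Appendix B come from our setup.

```python
WARM_UP_CYCLES = 8  # warm-up period assumed in the evaluation


def warm_up_energy(full_power_watts, cycle_time_seconds, warm_up_cycles=WARM_UP_CYCLES):
    """Energy spent while a unit ramps linearly from off to full power.

    Assumption: in cycle k (1-based) the unit draws k / warm_up_cycles of its
    full static-plus-dynamic power.
    """
    ramp = sum(k / warm_up_cycles for k in range(1, warm_up_cycles + 1))
    return full_power_watts * cycle_time_seconds * ramp


cycle_time = 1.0 / 1.2e9   # 1200 MHz clock
unit_power = 0.5           # hypothetical full power of one unit, in watts

energy_8 = warm_up_energy(unit_power, cycle_time)
energy_16 = warm_up_energy(unit_power, cycle_time, warm_up_cycles=16)
```

Under this model the warm-up energy grows roughly linearly with the length of the warm-up period, so its contribution stays small as long as units are switched on rarely.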
10.3 Applications
Most major computing applications suffer in some way from the power requirements of proces-
sors:
• Mobile devices must offer high performance to support computationally expensive media
and gaming applications, but are constrained by the limited amount of energy stored in
the battery.
• High performance machines are provided with many high frequency processor cores inside
a single chip, but as the transistor density increases the area through which to remove
the large amount of heat generated decreases.
• High volume data centres must grow in size to meet the increasing data demands of large
companies and websites, and the associated energy costs increase accordingly.
The modifications required to implement LDM, ELDM and schedule balancing are architecture
independent, and the techniques can be applied to microprocessor cores for use in all computing
systems. Instruction sets do not need to be modified, although additional instructions will
be required to communicate loop information for ELDM. Power-gating predictions are made
after a small number of loop iterations, so the method can be applied in a context switching
environment as predictions will be recreated quickly after a context switch. When performing
repetitive tasks such as media processing and batch requests to a database, the techniques
should be particularly effective as these tasks are likely to be dominated by loop execution.
10.4 Future work
Reducing the EDP of a processor requires either reducing the power consumption or increasing
performance. To achieve the former, we show in the next subsection how our loop-based
approach could be used to accurately predict the usage of other components so that power-
gating can be applied to different parts of the superscalar pipeline. The latter can be achieved
by widening the architecture to increase performance where high ILP exists, and using power-
gating to eliminate the power consumption of the additional logic where it does not. This
strategy will be discussed further in Section 10.4.2.
10.4.1 Increasing power savings
Our loop-based approach could be used to estimate the extent to which other pipeline
resources are used during execution of the loop. In addition, power-gating execution units will
have a direct impact on other resources which either supply operands to the units or consume
their results. In the following, we describe some pipeline resources and how they could be
power-gated alongside LDM or ELDM:
• Issue and commit width defines how many instructions can be issued to execution units
and committed to the architectural state each cycle. By monitoring the maximum re-
quired issue width during a loop iteration, we can match the issue and commit width
to the width required for the loop by powering off some of the issue and commit logic.
If many execution units are switched off, the issue and commit width will be naturally
limited by the number of execution units that are on.
• The instruction window (issue queue) stores fetched instructions and flags if they are
issuable. Instructions may be issued out of order from any position in the queue, if the
data dependencies and memory access dependencies are satisfied. During a loop, we can
monitor the maximum depth in the queue that instructions are issued from and reduce
the length of the queue to this depth until the loop exits.
• Bypass logic forwards results from an execution unit directly back into the inputs of any
execution unit as an operand. The complexity of this logic relates to the issue width of
the processor, as this determines the maximum number of instructions that may require
operands to be forwarded through the bypass logic. If the issue width is decreased for a
particular loop, some of the bypass logic may be power-gated off accordingly.
• Similarly, the register file must be able to supply enough operands for all issued instruc-
tions, so the number of active register file ports could be decreased for a loop if the issue
width has been decreased. In addition, if the register usage is monitored over loop iter-
ations, the size of the register file could be reduced dynamically by switching off either
whole sections of the file, or individual entries.
• The floating point register file could be power-gated off (or to a state preserving low-power
mode) if both floating point units are power-gated off.
• Load/store units access the data cache by queueing requests in a load/store queue, but
if both load/store units are power-gated off, no requests can be sent to the queue. Once
the queue is empty, it can therefore be power-gated off. Consequently, the data
caches themselves could be placed in a low-power state-preserving mode as there will be
no more accesses until the load/store units are powered on again.
The power consumption of some of these units is far below that of Level 2 caches and execution
units, but when power-gating is applied successfully to caches and execution units the propor-
tion of power consumed by the remaining units will increase accordingly. Linking power-gating
of these additional units to loop iterations offers the same benefits as LDM and ELDM for
execution units, in that accurate utilisation predictions can be made, performance loss is small
and power-gating modes are likely to persist for long periods.
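The dependencies between execution units and the other pipeline resources listed above can be summarised in a minimal Python sketch. The function name, the unit naming convention and the returned fields are hypothetical; the sketch only encodes the rules described in the list, not an evaluated design.

```python
def dependent_gating(unit_on, issue_width, lsq_occupancy):
    """Derive gating decisions for resources that depend on execution units.

    unit_on maps a unit name to its power state. Names starting with "fp" are
    treated as floating point units and names starting with "mem" as
    load/store units (a naming convention assumed for this sketch).
    """
    active_units = sum(1 for on in unit_on.values() if on)
    fp_active = any(on for name, on in unit_on.items() if name.startswith("fp"))
    mem_active = any(on for name, on in unit_on.items() if name.startswith("mem"))
    # The load/store queue can only be gated once it has drained.
    lsq_off = (not mem_active) and lsq_occupancy == 0
    return {
        "issue_width": min(issue_width, active_units),  # limited by powered-on units
        "fp_regfile_on": fp_active,                      # off when all FP units are off
        "lsq_on": not lsq_off,
        "dcache_low_power": lsq_off,                     # state-preserving mode
    }


unit_on = {
    "ialu0": True, "ialu1": True, "ialu2": False, "ialu3": False,
    "imult0": False, "fpalu0": False, "fpmult0": False,
    "mem0": False, "mem1": False,
}
decision = dependent_gating(unit_on, issue_width=4, lsq_occupancy=0)
```

In this example only two integer ALUs remain on, so the effective issue width drops to two and the floating point register file, load/store queue and data cache all become candidates for a low-power mode.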
10.4.2 Improving performance
The specific set of processor resources that an application would require to achieve optimal
performance varies greatly, so many existing architectures choose generic resources to provide
reasonable performance across many different applications while maintaining acceptable power
consumption. However, power-gating can be used to dynamically change the architecture at runtime,
and could be used to achieve the optimal set of resources for executing instructions.
LDM and ELDM currently change the architecture by powering units off to save power, but
the idea could be extended to dynamically increase the width of the architecture. Additional
execution units and wider issue and commit stages would support higher performance where
more instruction level parallelism (ILP) exists. Wide architectures have been achieved previ-
ously by partitioning the execution units and register file into independent clusters [CDN92].
This reduces the complexity of the bypass logic and the number of register file ports that are
required.
Both [BYP+91] and [PGTM99] find instruction level parallelism in the order of 10-1000 in-
structions for SPEC benchmarks. For such applications with high levels of ILP, the optimal
combination of units will be powered on to increase performance. Although the additional units
and issue/commit logic will dissipate static power when powered on, this should be offset by
the fact that the remainder of the chip will not need to be powered for as long due to the lower
execution time.
Applications with a small amount of ILP will power-gate off all the unnecessary units and
issue/commit logic, and the processor should not consume much more power than an equivalent
static processor design. The amount of ILP and the optimal resources that would be required
can be determined dynamically at runtime using LDM and ELDM, or during static scheduling
for a VLIW processor.
The dominance of dynamic power in previous technology generations may have made such an
architecture infeasible due to the energy required to perform the extra computation in a wider
architecture, but with the growing contribution of static power to overall power consumption
and the availability of power-gating, such an architecture is now a viable option which should
be studied in future work.
Appendix A
SimpleScalar/Alpha command line arguments
The command line arguments passed to the SimpleScalar/Alpha simulator are provided in
Table A.1 for reference.
Argument            Value                                 Notes
-fastfwd            Varies per benchmark and trace
-max:inst           Normally either 1000000 or 10000000
-fetch:mplat        7
-bpred              comb
-bpred:bimod        1024
-bpred:2lev         1 1024 10 0
-bpred:comb         4096
-bpred:ras          32
-bpred:btb          1024 2
-ruu:size           64
-lsq:size           64
-cache:dl1          dl1:4096:16:2:l
-cache:dl1lat       3
-cache:dl2          dl2:131072:16:8:l
-cache:dl2lat       32
-cache:il1          il1:4096:16:2:l
-cache:il1lat       2
-cache:il2          dl2
-cache:il2lat       32
-mem:lat            55 2
-mem:width          128
-tlb:itlb           itlb:128:4096:4:l
-tlb:dtlb           dtlb:128:4096:4:l
-tlb:lat            30
-res:ialu           4
-res:imult          2
-res:memport        2
-res:fpalu          1
-res:fpmult         1
-LDM                (user defined)                        Apply LDM (0 or 1)
-bal                (user defined)                        Apply schedule balancing (1 or 0)
-input loop fn      (user defined)                        File containing loop entry points
-input exit fn      (user defined)                        File containing loop exit points
-switch on thresh   (user defined)                        Switch on threshold
-switch off thresh  (user defined)                        Switch off threshold
Table A.1: SimpleScalar/Alpha command line arguments.
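As an illustration of how the arguments in Table A.1 combine into a single simulator invocation, the following Python sketch assembles an argument list. The helper name `build_command`, the simulator path, the benchmark name and the fast-forward value are all hypothetical, and only a subset of the fixed arguments is shown.

```python
def build_command(simulator, benchmark, fastfwd, max_inst, ldm, bal):
    """Assemble an argument list for one simulation run (names are illustrative)."""
    args = [
        simulator,
        "-fastfwd", str(fastfwd),      # varies per benchmark and trace
        "-max:inst", str(max_inst),    # normally 1000000 or 10000000
        "-res:ialu", "4",
        "-res:imult", "2",
        "-res:memport", "2",
        "-res:fpalu", "1",
        "-res:fpmult", "1",
        "-LDM", "1" if ldm else "0",   # apply LDM
        "-bal", "1" if bal else "0",   # apply schedule balancing
    ]
    # The benchmark executable and its own arguments follow the simulator options.
    return args + list(benchmark)


cmd = build_command("./sim-outorder", ["benchmark.exe"], fastfwd=0,
                    max_inst=1_000_000, ldm=True, bal=False)
```

The resulting list can be passed to a process launcher such as `subprocess.run`; the remaining arguments of Table A.1 would be appended in the same way.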
Appendix B
McPAT preprepared XML file
The preprepared XML file containing the architecture specification is based on a reference
DEC Alpha XML file provided by McPAT. The file along with the fields to be replaced with
simulation statistics (values all in capitals) is provided for reference as follows:
<?xml version="1.0" ?>
<component id="root" name="root">
<component id="system" name="system">
<!--McPAT will skip the components if number is set to 0 -->
<param name="number_of_cores" value="1"/>
<param name="number_of_L1Directories" value="0"/>
<param name="number_of_L2Directories" value="1"/>
<param name="number_of_L2s" value="1"/> <!-- This number means how many L2 clusters in each cluster there can
be multiple banks/ports -->
<param name="Private_L2" value="0"/><!--1 Private, 0 shared/coherent -->
<param name="number_of_L3s" value="0"/> <!-- This number means how many L3 clusters -->
<param name="number_of_NoCs" value="1"/>
<param name="homogeneous_cores" value="1"/><!--1 means homo -->
<param name="homogeneous_L2s" value="1"/>
<param name="homogeneous_L1Directorys" value="1"/>
<param name="homogeneous_L2Directorys" value="1"/>
<param name="homogeneous_L3s" value="1"/>
<param name="homogeneous_ccs" value="1"/><!--cache coherence hardware -->
<param name="homogeneous_NoCs" value="1"/>
<param name="core_tech_node" value="45"/><!-- nm -->
<param name="target_core_clockrate" value="1200"/><!--MHz -->
<param name="temperature" value="380"/> <!-- Kelvin -->
<param name="number_cache_levels" value="2"/>
<param name="interconnect_projection_type" value="0"/><!--0: aggressive wire technology;
1: conservative wire technology -->
<param name="device_type" value="0"/><!--0: HP(High Performance Type); 1: LSTP(Low standby power)
2: LOP (Low Operating Power) -->
<param name="longer_channel_device" value="0"/><!-- 0 no use; 1 use when appropriate -->
<param name="machine_bits" value="64"/>
<param name="virtual_address_width" value="64"/>
<param name="physical_address_width" value="52"/>
<param name="virtual_memory_page_size" value="4096"/>
<!-- address width determines the tag_width in Cache, LSQ and buffers in cache controller
default value is machine_bits, if not set -->
<stat name="total_cycles" value="NUMBER_OF_SIMULATION_CYCLES"/>
<stat name="idle_cycles" value="0"/>
<stat name="busy_cycles" value="NUMBER_OF_SIMULATION_CYCLES"/>
<!--This page size(B) is completely different from the page size in Main memo section. this page size is the
size of virtual memory from OS/Archi perspective; the page size in Main memo section is the actual
physical line in a DRAM bank -->
<!-- *********************** cores ******************* -->
<component id="system.core0" name="core0">
<!-- Core property -->
<param name="clock_rate" value="1200"/>
<!-- for cores with unknown timing, set to 0 to force off the opt flag -->
<param name="opt_local" value="1"/>
<param name="instruction_length" value="32"/>
<param name="opcode_width" value="7"/>
<param name="x86" value="0"/>
<param name="micro_opcode_width" value="8"/>
<param name="machine_type" value="0"/>
<!-- inorder/OoO; 1 inorder; 0 OOO-->
<param name="number_hardware_threads" value="1"/>
<!-- number_instruction_fetch_ports(icache ports) is always 1 in single-thread processor,
it only may be more than one in SMT processors. BTB ports always equals to fetch ports since
branch information in consecutive branch instructions in the same fetch group can be read out from
BTB once.-->
<param name="fetch_width" value="4"/>
<!-- fetch_width determines the size of cachelines of L1 cache block -->
<param name="number_instruction_fetch_ports" value="1"/>
<param name="decode_width" value="4"/>
<!-- decode_width determines the number of ports of the
renaming table (both RAM and CAM) scheme -->
<param name="issue_width" value="4"/>
<param name="peak_issue_width" value="4"/>
<!-- issue_width determines the number of ports of Issue window and other logic
as in the complexity effective processors paper; issue_width==dispatch_width -->
<param name="commit_width" value="4"/>
<!-- commit_width determines the number of ports of register files -->
<param name="fp_issue_width" value="2"/>
<param name="prediction_width" value="1"/>
<!-- number of branch instructions can be predicted simultaneously-->
<!-- Current version of McPAT does not distinguish int and floating point pipelines
These parameters are reserved for future use.-->
<param name="pipelines_per_core" value="1,1"/>
<!--integer_pipeline and floating_pipelines, if the floating_pipelines is 0, then the pipeline is
shared-->
<param name="pipeline_depth" value="7,7"/>
<!-- pipeline depth of int and fp, if pipeline is shared, the second number is the average cycles of
fp ops -->
<!-- issue and exe unit-->
<param name="ALU_per_core" value="4"/>
<!-- contains an adder, a shifter, and a logical unit -->
<param name="MUL_per_core" value="2"/>
<!-- For MUL and Div -->
<param name="FPU_per_core" value="2"/>
<!-- buffer between IF and ID stage -->
<param name="instruction_buffer_size" value="32"/>
<!-- buffer between ID and sche/exe stage -->
<param name="decoded_stream_buffer_size" value="16"/>
<param name="instruction_window_scheme" value="0"/><!-- 0 PHYREG based, 1 RSBASED-->
<!-- McPAT support 2 types of OoO cores, RS based and physical reg based-->
<param name="instruction_window_size" value="20"/>
<param name="fp_instruction_window_size" value="15"/>
<!-- the instruction issue Q as in Alpha 21264; The RS as in Intel P6 -->
<param name="ROB_size" value="80"/>
<!-- each in-flight instruction has an entry in ROB -->
<!-- registers -->
<param name="archi_Regs_IRF_size" value="32"/>
<param name="archi_Regs_FRF_size" value="32"/>
<!-- if OoO processor, phy_reg number is needed for renaming logic,
renaming logic is for both integer and floating point insts. -->
<param name="phy_Regs_IRF_size" value="80"/>
<param name="phy_Regs_FRF_size" value="72"/>
<!-- rename logic -->
<param name="rename_scheme" value="1"/>
<!-- can be RAM based(0) or CAM based(1) rename scheme
RAM-based scheme will have free list, status table;
CAM-based scheme have the valid bit in the data field of the CAM
both RAM and CAM need RAM-based checkpoint table, checkpoint_depth=# of in_flight instructions;
Detailed RAT Implementation see TR -->
<param name="register_windows_size" value="0"/>
<!-- how many windows in the windowed register file, sun processors;
no register windowing is used when this number is 0 -->
<!-- In OoO cores, loads and stores can be issued whether inorder(Pentium Pro) or
(OoO)out-of-order(Alpha), They will always try to execute out-of-order though. -->
<param name="LSU_order" value="inorder"/>
<param name="store_buffer_size" value="32"/>
<!-- By default, in-order cores do not have load buffers -->
<param name="load_buffer_size" value="32"/>
<!-- number of ports refer to sustainable concurrent memory accesses -->
<param name="memory_ports" value="2"/>
<!-- max_allowed_in_flight_memo_instructions determines the # of ports of load and store buffer
as well as the ports of Dcache which is connected to LSU -->
<!-- dual-pumped Dcache can be used to save the extra read/write ports -->
<param name="RAS_size" value="32"/>
<!-- general stats, defines simulation periods;require total, idle, and busy cycles for sanity check -->
<!-- please note: if target architecture is X86, then all the instructions refer to (fused) micro-ops -->
<stat name="total_instructions" value="NUMBER_OF_INSTRUCTIONS_SIMULATED"/>
<stat name="int_instructions" value="NUMBER_OF_INT_INSTRUCTIONS_SIMULATED"/>
<stat name="fp_instructions" value="NUMBER_OF_FP_INSTRUCTIONS_SIMULATED"/>
<stat name="branch_instructions" value="NUMBER_OF_BRANCH_INSTRUCTIONS_SIMULATED"/>
<stat name="branch_mispredictions" value="NUMBER_OF_BRANCH_MISPREDICTIONS_SIMULATED"/>
<stat name="load_instructions" value="NUMBER_OF_LOAD_INSTRUCTIONS_SIMULATED"/>
<stat name="store_instructions" value="NUMBER_OF_STORE_INSTRUCTIONS_SIMULATED"/>
<stat name="committed_instructions" value="NUMBER_OF_COMMITTED_INSTRUCTIONS_SIMULATED"/>
<stat name="committed_int_instructions" value="NUMBER_OF_COMMITTED_INT_INSTRUCTIONS_SIMULATED"/>
<stat name="committed_fp_instructions" value="NUMBER_OF_COMMITTED_FP_INSTRUCTIONS_SIMULATED"/>
<stat name="pipeline_duty_cycle" value="1"/><!--<=1, runtime_ipc/peak_ipc; averaged for all cores if
homogenous -->
<!-- the following cycle stats are used for heterogeneous cores only,
please ignore them if homogeneous cores -->
<stat name="total_cycles" value="100000"/>
<stat name="idle_cycles" value="0"/>
<stat name="busy_cycles" value="100000"/>
<!-- instruction buffer stats -->
<!-- ROB stats, both RS and Phy based OoOs have ROB
performance simulator should capture the difference on accesses,
otherwise, McPAT has to guess based on number of committed instructions. -->
<stat name="ROB_reads" value="NUMBER_OF_ROB_READS_SIMULATED"/>
<stat name="ROB_writes" value="NUMBER_OF_ROB_WRITES_SIMULATED"/>
<!-- RAT accesses -->
<stat name="rename_reads" value="NUMBER_OF_RENAME_READS_SIMULATED"/> <!--lookup in renaming logic -->
<stat name="rename_writes" value="NUMBER_OF_RENAME_WRITES_SIMULATED"/><!--update dest regs. renaming
logic -->
<stat name="fp_rename_reads" value="NUMBER_OF_FP_RENAME_READS_SIMULATED"/>
<stat name="fp_rename_writes" value="NUMBER_OF_FP_RENAME_WRITES_SIMULATED"/>
<!-- decode and rename stage use this, should be total ic - nop -->
<!-- Inst window stats -->
<stat name="inst_window_reads" value="NUMBER_OF_INSTRUCTION_WINDOW_READS_SIMULATED"/>
<stat name="inst_window_writes" value="NUMBER_OF_INSTRUCTION_WINDOW_WRITES_SIMULATED"/>
<stat name="inst_window_wakeup_accesses" value="NUMBER_OF_INSTRUCTION_WINDOW_WAKEUP_ACCESSES_SIMULATED"/>
<stat name="fp_inst_window_reads" value="NUMBER_OF_FP_INSTRUCTION_WINDOW_READS_SIMULATED"/>
<stat name="fp_inst_window_writes" value="NUMBER_OF_FP_INSTRUCTION_WINDOW_WRITES_SIMULATED"/>
<stat name="fp_inst_window_wakeup_accesses"
value="NUMBER_OF_FP_INSTRUCTION_WINDOW_WAKEUP_ACCESSES_SIMULATED"/>
<!-- RF accesses -->
<stat name="int_regfile_reads" value="NUMBER_OF_INT_REGFILE_READS_SIMULATED"/>
<stat name="float_regfile_reads" value="NUMBER_OF_FP_REGFILE_READS_SIMULATED"/>
<stat name="int_regfile_writes" value="NUMBER_OF_INT_REGFILE_WRITES_SIMULATED"/>
<stat name="float_regfile_writes" value="NUMBER_OF_FP_REGFILE_WRITES_SIMULATED"/>
<!-- accesses to the working reg -->
<stat name="function_calls" value="NUMBER_OF_FUNCTION_CALLS_SIMULATED"/>
<!-- Number of Windows switches (number of function calls and returns)-->
<!-- Alu stats by default, the processor has one FPU that includes the divider and
multiplier. The fpu accesses should include accesses to multiplier and divider -->
<stat name="ialu_accesses" value="NUMBER_OF_ALU_ACCESSES_SIMULATED"/>
<stat name="fpu_accesses" value="NUMBER_OF_FPU_ACCESSES_SIMULATED"/>
<stat name="mul_accesses" value="NUMBER_OF_MUL_ACCESSES_SIMULATED"/>
<stat name="cdb_alu_accesses" value="NUMBER_OF_ALU_ACCESSES_SIMULATED"/>
<stat name="cdb_mul_accesses" value="NUMBER_OF_MUL_ACCESSES_SIMULATED"/>
<stat name="cdb_fpu_accesses" value="NUMBER_OF_FPU_ACCESSES_SIMULATED"/>
<stat name="ALU_power_on" value="AMOUNT_OF_TIME_ALU_POWERED_ON (added as part of LDM)"/>
<stat name="ALU_switch_on_count" value="NUMBER_OF_ALU_POWER_ONS (added as part of LDM)"/>
<stat name="MUL_power_on" value="AMOUNT_OF_TIME_MUL_POWERED_ON (added as part of LDM)"/>
<stat name="MUL_switch_on_count" value="NUMBER_OF_MUL_POWER_ONS (added as part of LDM)"/>
<stat name="FPU_power_on" value="AMOUNT_OF_TIME_FPU_POWERED_ON (added as part of LDM)"/>
<stat name="FPU_switch_on_count" value="NUMBER_OF_FPU_POWER_ONS (added as part of LDM)"/>
<!-- multiple cycle accesses should be counted multiple times,
otherwise, McPAT can use internal counter for different floating point instructions
to get final accesses. But that needs detailed info for floating point inst mix -->
<!-- currently the performance simulator should
make sure all the numbers are final numbers,
including the explicit read/write accesses,
and the implicit accesses such as replacements and etc.
Future versions of McPAT may be able to reason the implicit access
based on param and stats of last level cache
The same rule applies to all cache access stats too! -->
<!-- following is AF for max power computation.
Do not change them, unless you understand them-->
<stat name="IFU_duty_cycle" value="1"/>
<stat name="LSU_duty_cycle" value="1"/>
<stat name="MemManU_I_duty_cycle" value="1"/>
<stat name="MemManU_D_duty_cycle" value="1"/>
<stat name="ALU_duty_cycle" value="1"/>
<stat name="MUL_duty_cycle" value="0.3"/>
<stat name="FPU_duty_cycle" value="1"/>
<stat name="ALU_cdb_duty_cycle" value="1"/>
<stat name="MUL_cdb_duty_cycle" value="0.3"/>
<stat name="FPU_cdb_duty_cycle" value="1"/>
<param name="number_of_BPT" value="2"/>
<component id="system.core0.predictor" name="PBT">
<!-- branch predictor; tournament predictor see Alpha implementation -->
<param name="local_predictor_size" value="10,3"/>
<param name="local_predictor_entries" value="1024"/>
<param name="global_predictor_entries" value="4096"/>
<param name="global_predictor_bits" value="2"/>
<param name="chooser_predictor_entries" value="4096"/>
<param name="chooser_predictor_bits" value="2"/>
<!-- These parameters can be combined like below in next version
<param name="load_predictor" value="10,3,1024"/>
<param name="global_predictor" value="4096,2"/>
<param name="predictor_chooser" value="4096,2"/>
-->
</component>
<component id="system.core0.itlb" name="itlb">
<param name="number_entries" value="128"/>
<stat name="total_accesses" value="NUMBER_OF_ITLB_ACCESSES_SIMULATED"/>
<stat name="total_misses" value="NUMBER_OF_ITLB_MISSES_SIMULATED"/>
<stat name="conflicts" value="0"/>
<!-- there is no write requests to itlb although writes happen to itlb after miss,
which is actually a replacement -->
</component>
<component id="system.core0.icache" name="icache">
<!-- there is no write requests to itlb although writes happen to it after miss,
which is actually a replacement -->
<param name="icache_config" value="65536,16,2,1,1,2,16,0"/>
<!-- the parameters are capacity,block_width, associativity, bank, throughput w.r.t. core clock,
latency w.r.t. core clock,output_width, cache policy, -->
<!-- cache_policy;//0 no write or write-though with non-write allocate;1 write-back with
write-allocate -->
<param name="buffer_sizes" value="16, 16, 16,0"/>
<!-- cache controller buffer sizes: miss_buffer_size(MSHR),fill_buffer_size,prefetch_buffer_size,
wb_buffer_size-->
<stat name="read_accesses" value="NUMBER_OF_ICACHE_ACCESSES_SIMULATED"/>
<stat name="read_misses" value="NUMBER_OF_ICACHE_MISSES_SIMULATED"/>
<stat name="conflicts" value="0"/>
</component>
<component id="system.core0.dtlb" name="dtlb">
<param name="number_entries" value="128"/><!--dual threads-->
<stat name="total_accesses" value="NUMBER_OF_DTLB_ACCESSES_SIMULATED"/>
<stat name="total_misses" value="NUMBER_OF_DTLB_MISSES_SIMULATED"/>
<stat name="conflicts" value="0"/>
</component>
<component id="system.core0.dcache" name="dcache">
<!-- all the buffer related are optional -->
<param name="dcache_config" value="65536,16,2,1,1,3,16,0"/>
<param name="buffer_sizes" value="16, 16, 16, 16"/>
<!-- cache controller buffer sizes: miss_buffer_size(MSHR),fill_buffer_size,prefetch_buffer_size,
wb_buffer_size-->
<stat name="read_accesses" value="NUMBER_OF_DCACHE_READ_ACCESSES_SIMULATED"/>
<stat name="write_accesses" value="NUMBER_OF_DCACHE_WRITE_ACCESSES_SIMULATED"/>
<stat name="read_misses" value="NUMBER_OF_DCACHE_READ_MISSES_SIMULATED"/>
<stat name="write_misses" value="NUMBER_OF_DCACHE_WRITE_MISSES_SIMULATED"/>
<stat name="conflicts" value="0"/>
</component>
<param name="number_of_BTB" value="2"/>
<component id="system.core0.BTB" name="BTB">
<!-- all the buffer related are optional -->
<param name="BTB_config" value="6144,4,2,1, 1,3"/> <!--48Kbits -->
<!-- the parameters are capacity,block_width,associativity,bank, throughput w.r.t. core clock,
latency w.r.t. core clock,-->
<stat name="read_accesses" value="NUMBER_OF_BTB_READ_ACCESSES_SIMULATED"/> <!--See IFU code for
guideline -->
<stat name="write_accesses" value="NUMBER_OF_BTB_WRITE_ACCESSES_SIMULATED"/>
</component>
</component>
<component id="system.L2Directory0" name="L2Directory0">
<param name="Directory_type" value="0"/>
<!--0 cam based shadowed tag. 1 directory cache -->
<param name="Dir_config" value="512,4,0,1,1, 1"/>
<!-- the parameters are capacity,block_width, associativity,bank, throughput w.r.t. core clock,
latency w.r.t. core clock,-->
<param name="buffer_sizes" value="16, 16, 16, 16"/>
<!-- all the buffer related are optional -->
<param name="clockrate" value="1200"/>
<param name="ports" value="1,1,1"/>
<!-- number of r, w, and rw search ports -->
<param name="device_type" value="0"/>
<!-- although there are multiple access types,
Performance simulator needs to cast them into reads or writes
e.g. the invalidates can be considered as writes -->
<stat name="read_accesses" value="NUMBER_OF_L2DIR_READ_ACCESSES_SIMULATED"/>
<stat name="write_accesses" value="NUMBER_OF_L2DIR_WRITE_ACCESSES_SIMULATED"/>
<stat name="read_misses" value="NUMBER_OF_L2DIR_READ_MISSES_SIMULATED"/>
<stat name="write_misses" value="NUMBER_OF_L2DIR_WRITE_MISSES_SIMULATED"/>
<stat name="conflicts" value="NUMBER_OF_L2DIR_CONFLICTS_SIMULATED"/>
</component>
<component id="system.L20" name="L20">
<!-- all the buffer related are optional -->
<param name="L2_config" value="1835008,16, 8, 16, 32, 32, 12, 1"/>
<!-- the parameters are capacity,block_width, associativity, bank, throughput w.r.t. core clock,
latency w.r.t. core clock,output_width, cache policy -->
<param name="buffer_sizes" value="16, 16, 16, 16"/>
<!-- cache controller buffer sizes: miss_buffer_size(MSHR),fill_buffer_size,prefetch_buffer_size,
wb_buffer_size-->
<param name="clockrate" value="1200"/>
<param name="ports" value="1,1,1"/>
<!-- number of r, w, and rw ports -->
<param name="device_type" value="0"/>
<stat name="read_accesses" value="NUMBER_OF_L2CACHE_READ_ACCESSES_SIMULATED"/>
<stat name="write_accesses" value="NUMBER_OF_L2CACHE_WRITE_ACCESSES_SIMULATED"/>
<stat name="read_misses" value="NUMBER_OF_L2CACHE_READ_MISSES_SIMULATED"/>
<stat name="write_misses" value="NUMBER_OF_L2CACHE_WRITE_MISSES_SIMULATED"/>
<stat name="conflicts" value="0"/>
<stat name="duty_cycle" value="1.0"/>
</component>
<!--**********************************************************************-->
<component id="system.NoC0" name="noc0">
<param name="clockrate" value="1200"/>
<param name="type" value="1"/>
<!--0:bus, 1:NoC , for bus no matter how many nodes sharing the bus
at each time only one node can send req -->
<param name="horizontal_nodes" value="1"/>
<param name="vertical_nodes" value="1"/>
<param name="has_global_link" value="1"/>
<!-- 1 has global link, 0 does not have global link -->
<param name="link_throughput" value="1"/><!--w.r.t clock -->
<param name="link_latency" value="1"/><!--w.r.t clock -->
<!-- throughput >= latency -->
<!-- Router architecture -->
<param name="input_ports" value="8"/>
<param name="output_ports" value="7"/>
<!-- For bus the I/O ports should be 1 -->
<param name="virtual_channel_per_port" value="2"/>
<param name="input_buffer_entries_per_vc" value="128"/>
<param name="flit_bits" value="40"/>
<param name="chip_coverage" value="1"/>
<!-- When multiple NOC present, one NOC will cover part of the whole chip.
chip_coverage <=1 -->
<param name="link_routing_over_percentage" value="1.0"/>
<!-- Links can route over other components or occupy whole area.
by default, 50% of the NoC global links routes over other
components -->
<stat name="total_accesses" value="100000"/>
<!-- This is the number of total accesses within the whole network not for each router -->
<stat name="duty_cycle" value="1"/>
</component>
<!--**********************************************************************-->
<component id="system.mem" name="mem">
<!-- Main memory property -->
<param name="mem_tech_node" value="180"/>
<param name="device_clock" value="200"/><!--MHz, this is clock rate of the actual memory device,
not the FSB -->
<param name="peak_transfer_rate" value="6400"/><!--MB/S-->
<param name="internal_prefetch_of_DRAM_chip" value="4"/>
<!-- 2 for DDR, 4 for DDR2, 8 for DDR3...-->
<!-- the device clock, peak_transfer_rate, and the internal prefetch decide the DIMM property -->
<!-- above numbers can be easily found from Wikipedia -->
<param name="capacity_per_channel" value="4096"/> <!-- MB -->
<!-- capacity_per_Dram_chip=capacity_per_channel/number_of_dimms/number_ranks/Dram_chips_per_rank
Current McPAT assumes single DIMMs are used.-->
<param name="number_ranks" value="2"/>
<param name="num_banks_of_DRAM_chip" value="8"/>
<param name="Block_width_of_DRAM_chip" value="64"/> <!-- B -->
<param name="output_width_of_DRAM_chip" value="8"/>
<!--number of Dram_chips_per_rank=" 72/output_width_of_DRAM_chip-->
<param name="page_size_of_DRAM_chip" value="8"/> <!-- 8 or 16 -->
<param name="burstlength_of_DRAM_chip" value="8"/>
</component>
<component id="system.mc" name="mc">
<!-- Memory controllers are for DDR(2,3...) DIMMs -->
<!-- current version of McPAT uses published values for base parameters of memory controller
improvements on MC will be added in later versions. -->
<param name="mc_clock" value="800"/><!--MHz-->
<param name="peak_transfer_rate" value="1600"/><!--MB/S-->
<param name="llc_line_length" value="16"/><!--B-->
<param name="number_mcs" value="2"/>
<!-- current McPAT only supports homogeneous memory controllers -->
<param name="memory_channels_per_mc" value="2"/>
<param name="number_ranks" value="2"/>
<!-- # of ranks of each channel-->
<param name="req_window_size_per_channel" value="32"/>
<param name="IO_buffer_size_per_channel" value="32"/>
<param name="databus_width" value="32"/>
<param name="addressbus_width" value="32"/>
<!-- McPAT will add the control bus width to the addressbus width automatically -->
<!-- McPAT does not track individual mc, instead, it takes the total accesses and calculates
the average power per MC or per channel. This is sufficient for most applications.
Further trackdown can be easily added in later versions. -->
</component>
<!--**********************************************************************-->
</component>
</component>
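The `stat` entries in the file above whose values have the form `NUMBER_OF_..._SIMULATED` are placeholders that must be replaced with event counts from a performance simulation before the file is passed to McPAT. The following is a minimal sketch of one way to perform that substitution, not the tooling used in this thesis: the function name `fill_placeholders`, the `counters` dictionary and the shortened `TEMPLATE` fragment are hypothetical and serve only as illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical, shortened fragment in the same shape as the template above.
TEMPLATE = """<component id="system.L20" name="L20">
  <param name="clockrate" value="1200"/>
  <stat name="read_accesses" value="NUMBER_OF_L2CACHE_READ_ACCESSES_SIMULATED"/>
  <stat name="write_accesses" value="NUMBER_OF_L2CACHE_WRITE_ACCESSES_SIMULATED"/>
  <stat name="conflicts" value="0"/>
</component>"""


def fill_placeholders(xml_text, counters):
    """Return xml_text with every NUMBER_OF_..._SIMULATED stat value replaced.

    counters maps a placeholder name to the count reported by the simulator
    (an assumed interface). Stats with literal values are left untouched.
    """
    root = ET.fromstring(xml_text)
    missing = []
    for stat in root.iter("stat"):
        value = stat.get("value", "")
        if value.startswith("NUMBER_OF_") and value.endswith("_SIMULATED"):
            if value in counters:
                stat.set("value", str(counters[value]))
            else:
                missing.append(value)
    if missing:
        # Fail loudly rather than hand McPAT a file with unfilled statistics.
        raise KeyError("no simulated count for: " + ", ".join(missing))
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    counts = {
        "NUMBER_OF_L2CACHE_READ_ACCESSES_SIMULATED": 1200,
        "NUMBER_OF_L2CACHE_WRITE_ACCESSES_SIMULATED": 300,
    }
    print(fill_placeholders(TEMPLATE, counts))
```

Raising an error on an unfilled placeholder is a design choice of this sketch: a missing counter is then caught before the power model is run rather than producing a silently incorrect estimate.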
Bibliography
[ADSN06] K. Agarwal, H. Deogun, D. Sylvester, and K. Nowka. Power gating with multiple
sleep modes. In Quality Electronic Design, 7th International Symposium on, pages
633–637. IEEE, 2006.
[AGVO05] Jaume Abella, Antonio González, Xavier Vera, and Michael F. P. O’Boyle. IATAC:
a smart predictor to turn-off L2 cache lines. In Architecture and Code Optimization,
Transactions on, volume 2, pages 55–77. ACM, 2005.
[ALR02] A. Agarwal, Hai Li, and K. Roy. DRG-cache: a data retention gated-ground cache
for low power. In 39th Design Automation Conference, pages 473–478. IEEE, 2002.
[BA97] D. Burger and T.M. Austin. The SimpleScalar tool set, version 2.0. In SIGARCH
Computer Architecture News, volume 25, pages 13–25. ACM, 1997.
[BB95] T.D. Burd and R.W. Brodersen. Energy efficient CMOS microprocessor design. In
System Sciences, 28th Annual Hawaii International Conference on, pages 288–297.
IEEE, 1995.
[BBDM00] L. Benini, A. Bogliolo, and G. De Micheli. A survey of design techniques for system-
level dynamic power management. In Very Large Scale Integration (VLSI) Systems,
Transactions on, volume 8, pages 299–316. IEEE, 2000.
[BC06] D.P. Bovet and M. Cesati. Understanding the Linux kernel. O’Reilly Media, 3rd
edition, 2006.
[BKAB03] A. Buyuktosunoglu, T. Karkhanis, D.H. Albonesi, and P. Bose. Energy efficient
co-adaptive instruction fetch and issue. In Computer Architecture, 30th Annual
International Symposium on, pages 147–156. IEEE, 2003.
[BS02] J.A. Butts and G.S. Sohi. A static power model for architects. In Microarchitecture,
33rd Annual IEEE/ACM International Symposium on, pages 191–201. IEEE, 2002.
[BTM00] D. Brooks, V. Tiwari, and M. Martonosi. Wattch: a framework for architectural-
level power analysis and optimizations. In Computer Architecture, 27th Annual
International Symposium on, pages 83–94. ACM, 2000.
[BYP+91] M. Butler, T.Y. Yeh, Y. Patt, M. Alsup, H. Scales, and M. Shebanow. Single
instruction stream parallelism is greater than two. In Computer Architecture, 18th
Annual International Symposium on, pages 276–286. ACM, 1991.
[CDN92] Andrea Capitanio, Nikil Dutt, and Alexandru Nicolau. Partitioned register files
for VLIWs: a preliminary analysis of tradeoffs. In Microarchitecture, 25th Annual
International Symposium on, pages 292–300. IEEE, 1992.
[CGZ09] Jie Chen, Deyuan Gao, and Qiaoshi Zheng. A research on an optimized adaptive
dynamic power management. In Computer Science and Information Technology,
2nd International Conference on, pages 52–55. IEEE, 2009.
[CK11a] C.A. Court and P.H.J. Kelly. Loop-directed mothballing: Power-gating execution
units using fast analysis of inner loops. In Cool Chips XIV. IEEE, 2011.
[CK11b] C.A. Court and P.H.J. Kelly. Loop-directed mothballing: Power-gating execution
units using runtime analysis of loops. In Micro, volume 31, pages 29–38. IEEE,
2011.
[CLSL02] H.W. Cain, K.M. Lepak, B.A. Schwartz, and M.H. Lipasti. Precise and accurate
processor simulation. In Computer Architecture Evaluation using Commercial
Workloads, Workshop on, volume 8, 2002.
[Com02] Compaq Computer Corporation, Shrewsbury, Massachusetts. 21264/EV68CB and
21264/EV68DC Hardware Reference Manual, 2002.
[Cor11] Standard Performance Evaluation Corporation. SPEC CPU2006.
http://www.spec.org/cpu2006/, 2011.
[Dav96] B. Davari. CMOS technology scaling, 0.1µm and beyond. In International Electron
Devices Meeting, pages 555–558. IEEE, 1996.
[DR06] Gaurav Dhiman and Tajana Simunic Rosing. Dynamic power management using
machine learning. In Computer-Aided Design, IEEE/ACM International Confer-
ence on, pages 747–754. ACM, 2006.
[eco08] Down on the server farm. In The Economist. The Economist Newspaper Limited,
2008.
[FG01] Daniele Folegnani and Antonio González. Energy-effective issue logic. In Computer
Architecture, 28th Annual International Symposium on, pages 230–239. ACM, 2001.
[GKO+00] J. Gibson, R. Kunz, D. Ofelt, M. Horowitz, J. Hennessy, and M. Heinrich. Flash
vs. (simulated) flash: Closing the simulation loop. In Architectural Support for
Programming Languages and Operating Systems, Ninth International Conference on,
pages 49–58. ACM, 2000.
[HBS+04] Z. Hu, A. Buyuktosunoglu, V. Srinivasan, V. Zyuban, H. Jacobson, and P. Bose.
Microarchitectural techniques for power gating of execution units. In Low Power
Electronics and Design, International Symposium on, pages 32–37. ACM, 2004.
[Hew10] Hewlett-Packard Corp., Intel Corp., Microsoft Corp., Phoenix Technologies Ltd. and
Toshiba Corp. Advanced Configuration and Power Interface Specification, 2010.
http://www.acpi.info/.
[HII+06] T. Hattori, T. Irita, M. Ito, E. Yamamoto, H. Kato, G. Sado, T. Yamada,
K. Nishiyama, H. Yagi, T. Koike, Y. Tsuchihashi, M. Higashida, H. Asano,
I. Hayashibara, K. Tatezawa, S. Shimazaki, N. Morino, Y. Yasu, T. Hoshi,
Y. Miyairi, K. Yanagisawa, K. Hirose, S. Tamaki, S. Yoshioka, T. Ishii, Y. Kanno,
H. Mizuno, Tetsuy. Yamada, N. Irie, R. Tsuchihashi, N. Arai, T. Akiyama, and
K. Ohno. Hierarchical power distribution and power management scheme for a sin-
gle chip mobile processor. In 43rd Design Automation Conference, pages 292–295.
ACM/IEEE, 2006.
[HLK+00] D. Hisamoto, W.C. Lee, J. Kedzierski, H. Takeuchi, K. Asano, C. Kuo, E. Anderson,
T.J. King, J. Bokor, and C. Hu. FinFET-a self-aligned double-gate MOSFET
scalable to 20nm. In Electron Devices, Transactions on, volume 47, pages 2320–
2325. IEEE, 2000.
[IM02] A. Iyer and D. Marculescu. Microarchitecture-level power management. In Very
Large Scale Integration (VLSI) Systems, Transactions on, volume 10, pages 230–
239. IEEE, 2002.
[Int10a] Intel Corp. Intel Atom Processor Z5xx Series, 2010. http://www.intel.com.
[Int10b] Intel Corp. Introduction to Intel’s 32nm Process Technology, 2010.
http://www.intel.com.
[Int11] Intel Corp. 2nd Generation Intel Core Processor Family Desktop, Intel Pentium
Processor Family Desktop, and Intel Celeron Processor Family Desktop Datasheet,
Volume 1, 2011. http://www.intel.com.
[ISG03] Sandy Irani, Sandeep Shukla, and Rajesh Gupta. Online strategies for dynamic
power management in systems with multiple power-saving states. In Embedded
Computing Systems, Transactions on, volume 2, pages 325–346. ACM, 2003.
[ISK+09] D. Ikebuchi, N. Seki, Y. Kojima, M. Kamata, L. Zhao, H. Amano, T. Shirai,
S. Koyama, T. Hashida, Y. Umahashi, et al. Geyser-1: A MIPS R3000 CPU core
with fine grain runtime power gating. In Asian Solid-State Circuits Conference,
pages 281–284. IEEE, 2009.
[ITR10] ITRS. International technology roadmap for semiconductors. http://www.itrs.net/,
2010.
[JOA+05] T.M. Jones, M.F.P. O’Boyle, J. Abella, A. González, and O. Ergin. Compiler
directed early register release. In Parallel Architectures and Compilation Techniques,
14th International Conference on, pages 110–122. IEEE, 2005.
[JOAG05] T.M. Jones, M.F.P. O’Boyle, J. Abella, and A. González. Software directed issue
queue power reduction. In High-Performance Computer Architecture, 11th
International Symposium on, pages 144–153. IEEE, 2005.
[KAB+03] N.S. Kim, T. Austin, D. Blaauw, T. Mudge, K. Flautner, J.S. Hu, M.J. Irwin,
M. Kandemir, and V. Narayanan. Leakage current: Moore’s law meets static power.
In Computer, volume 36, pages 68–75. IEEE, 2003.
[KFBM04] N.S. Kim, K. Flautner, D. Blaauw, and T. Mudge. Circuit and microarchitectural
techniques for reducing cache leakage power. In Very Large Scale Integration (VLSI)
Systems, Transactions on, volume 12, pages 167–184. IEEE, 2004.
[KHM01] S. Kaxiras, Z. Hu, and M. Martonosi. Cache decay: exploiting generational behavior
to reduce cache leakage power. In Computer Architecture, 28th Annual International
Symposium on, pages 240–251. IEEE, 2001.
[KSS08] C.M. Kumar, M. Sindhwani, and T. Srikanthan. Profile-based technique for dy-
namic power management in embedded systems. In Electronic Design, International
Conference on, pages 1–6. IEEE, 2008.
[KTYZ06] Fei Kong, Pin Tao, Shi Qiang Yang, and Xiao Li Zhao. Genetic algorithm based
idle length prediction scheme for dynamic power management. In Computational
Engineering in Systems Applications, Multiconference on, pages 1437–1443. IEEE,
2006.
[LAS+10] S. Li, J.H. Ahn, R.D. Strong, J.B. Brockman, D.M. Tullsen, and N.P. Jouppi. McPAT:
an integrated power, area, and timing modeling framework for multicore and
manycore architectures. In Microarchitecture, 42nd Annual IEEE/ACM International
Symposium on, pages 469–480. IEEE, 2010.
[LBBS09] A. Lungu, P. Bose, A. Buyuktosunoglu, and D.J. Sorin. Dynamic power gating
with quality guarantees. In Low Power Electronics and Design, 14th ACM/IEEE
International Symposium on, pages 377–382. ACM, 2009.
[MBB01] R. Maro, Y. Bai, and R. Bahar. Dynamically reconfiguring processor resources
to reduce power consumption in high-performance processors. In Lecture Notes in
Computer Science, pages 97–111. Springer-Verlag, 2001.
[PGTM99] Matthew A. Postiff, David A. Greene, Gary S. Tyson, and Trevor N. Mudge. The
limits of instruction level parallelism in SPEC95 applications. In SIGARCH Com-
puter Architecture News, volume 27, pages 31–34. ACM, 1999.
[PJJ07] Aashish Phansalkar, Ajay Joshi, and Lizy K. John. Analysis of redundancy and
application balance in the SPEC CPU2006 benchmark suite. In Computer Architecture,
34th Annual International Symposium on, pages 412–423. ACM, 2007.
[PJS97] Subbarao Palacharla, Norman P. Jouppi, and J. E. Smith. Complexity-effective
superscalar processors. In Computer Architecture, 24th Annual International Sym-
posium on, pages 206–218. ACM, 1997.
[PPV02] I. Park, M.D. Powell, and T.N. Vijaykumar. Reducing register ports for higher speed
and lower energy. In Microarchitecture, 35th Annual IEEE/ACM International
Symposium on, pages 171–182. IEEE, 2002.
[PYF+00] M. Powell, S.H. Yang, B. Falsafi, K. Roy, and T.N. Vijaykumar. Gated-Vdd: a
circuit technique to reduce leakage in deep-submicron cache memories. In Low
Power Electronics and Design, International Symposium on, pages 90–95. ACM,
2000.
[RAM+04] R. Rosner, Y. Almog, M. Moffie, N. Schwartz, and A. Mendelson. Power awareness
through selective dynamically optimized traces. In Computer Architecture, 31st
Annual International Symposium on, pages 162–173. IEEE, 2004.
[REL00] J.A. Redstone, S.J. Eggers, and H.M. Levy. An analysis of operating system be-
havior on a simultaneous multithreaded architecture. In Architectural Support for
Programming Languages and Operating Systems, Ninth International Conference
on, pages 245–256. ACM, 2000.
[RKM05] Z. Ren, B.H. Krogh, and R. Marculescu. Hierarchical adaptive dynamic power
management. In Computers, Transactions on, volume 54, pages 409–420. IEEE,
2005.
[RPOG02] S. Rele, S. Pande, S. Onder, and R. Gupta. Optimizing static power dissipation by
functional units in superscalar processors. In Compiler Construction. Lecture Notes
in Computer Science, pages 261–275. Springer, 2002.
[RRK09] S. Roy, N. Ranganathan, and S. Katkoori. A framework for power-gating functional
units in embedded microprocessors. In Very Large Scale Integration (VLSI) Systems,
Transactions on, volume 17, pages 1640–1649. IEEE, 2009.
[SDM08] A. Sesic, S. Dautovic, and V. Malbasa. Dynamic power management of a system
with a two-priority request queue using probabilistic-model checking. In Computer-
Aided Design of Integrated Circuits and Systems, Transactions on, volume 27, pages
403–407. IEEE, 2008.
[SPHC02] Timothy Sherwood, Erez Perelman, Greg Hamerly, and Brad Calder. Automatically
characterizing large scale program behavior. In Architectural Support for Program-
ming Languages and Operating Systems, 10th International Conference on, pages
45–57. ACM, 2002.
[TEE+96] D.M. Tullsen, S.J. Eggers, J.S. Emer, H.M. Levy, J.L. Lo, and R.L. Stamm. Ex-
ploiting choice: Instruction fetch and issue on an implementable simultaneous mul-
tithreading processor. In Computer Architecture, 23rd Annual International Sym-
posium on, pages 191–202. ACM, 1996.
[TEL95] D.M. Tullsen, S.J. Eggers, and H.M. Levy. Simultaneous multithreading: Maxi-
mizing on-chip parallelism. In Computer Architecture, 22nd Annual International
Symposium on, pages 392–403. ACM, 1995.
[Tex11] Texas Instruments Inc. AM3894, AM3892 Sitara ARM Microprocessors (MPUs)
(Rev. B), 2011. http://www.ti.com.
[TPB98] S. Thompson, P. Packan, and M. Bohr. MOS scaling: Transistor challenges for the
21st century. In Intel Technology Journal, Q3. Intel Corp., 1998.
[TSC06] S. Talli, R. Srinivasan, and J. Cook. Compiler-directed functional unit shutdown for
microarchitecture power optimization. In International Performance, Computing,
and Communications Conference, pages 372–379. IEEE, 2006.
[Vee84] H.J.M. Veendrick. Short-circuit dissipation of static CMOS circuitry and its impact
on the design of buffer circuits. In Solid-State Circuits, Journal of, volume 19, pages
468–473. IEEE, 1984.
[WH05] N.H.E. Weste and D. Harris. CMOS VLSI Design. Pearson/Addison Wesley, 3rd
edition, 2005.
[YAE06] A. Youssef, M. Anis, and M. Elmasry. Dynamic standby prediction for leakage
tolerant microprocessor functional units. In Microarchitecture, 39th Annual Inter-
national Symposium on, pages 371–384. IEEE/ACM, 2006.
[YCCY11] C. Yeh, K. Chang, T. Chen, and C. Yeh. Maintaining performance on power gating
of microprocessor functional units by using a predictive pre-wakeup strategy. In
Architecture and Code Optimization, Transactions on, volume 8, pages 16:1–16:27.
ACM, 2011.
[YL06] J.J. Yi and D.J. Lilja. Simulation of computer architectures: simulators, bench-
marks, methodologies, and recommendations. In Computers, Transactions on, vol-
ume 55, pages 268–280. IEEE, 2006.
[ZHD+02] W. Zhang, J.S. Hu, V. Degalahal, M. Kandemir, N. Vijaykrishnan, and M.J. Ir-
win. Compiler-directed instruction cache leakage optimization. In Microarchitecture,
35th Annual IEEE/ACM International Symposium on, pages 208–218. IEEE, 2002.
[ZKKC03] W. Zhang, M. Karakoy, M. Kandemir, and G. Chen. A compiler approach for reduc-
ing data cache energy. In Supercomputing, 17th Annual International Conference
on, pages 76–85. ACM, 2003.
