ROPE: Reducing the Omni-kernel Power Expenses by Karlberg, Jan-Ove
Faculty of Science and Technology
Department of Computer Science
ROPE: Reducing the Omni-kernel Power Expenses
Implementing power management in the Omni-kernel architecture
Jan-Ove A. Karlberg
INF-3990 Master thesis in Computer Science, May 2014


“That’s it man, game over man, game over!
What the fuck are we gonna do now?
What are we gonna do?”
–Hudson, Aliens (1986)
Abstract
Over the last decade, power consumption and energy efficiency have arisen
as important performance metrics for data center computing systems hosting
cloud services. The incentives for reducing power consumption are several,
and often span economic, technological, and environmental dimensions. Be-
cause of the vast number of computers currently employed in data centers, the
economy of scale dictates that even small reductions in power expenditure on
machine level can amount to large energy savings on data center scale.
Clouds commonly employ hardware virtualization technologies to allow for
higher degrees of utilization of the physical hardware. The workloads en-
capsulated by virtual machines constantly compete for the available physical
hardware resources of their host machines. To prevent execution of one work-
load from seizing resources that are intended for another, absolute visibility
and opportunity for resource allocation is necessary. The Omni-kernel ar-
chitecture is a novel operating system architecture especially designed for
pervasive monitoring and scheduling. Vortex is an experimental implementation
of this architecture.
This thesis describes rope (Reducing the Omni-kernel Power Expenses),
which introduces power management functionality to the Vortex implementa-
tion of the Omni-kernel architecture. rope reduces the power consumption
of Vortex, and does so while limiting performance degradation.
We evaluate the energy savings and performance impacts of deploying rope
using both custom tailored and industry standard benchmarks. We also dis-
cuss the implications of the Omni-kernel architecture with regards to power




I would like to thank my supervisor, Dr. Åge Kvalnes, for his great advice, ideas,
and highly valued input throughout this project. Your passion and knowledge
are unmatched. The realization of this project would not have been possible
without you.
Further I would like to thank Robert Pettersen for his invaluable help. Your
incredible insight into the Omni-kernel architecture project has been a life-
saver in many a situation. Many thanks for putting up with all my requests
for new convenience tools.
I would also like to thank some of my fellow students whose help has been
crucial throughout this project. In no particular order: Kristian Elsebø, thank
you so much for all help with diskless servers, HammerDB, and general sysadmin
stuff. Erlend Graff, your knowledge, kindness, and enthusiasm are simply
incredible. Only he who has seen the red zone can understand my gratitude.
Einar Holsbø, my harshest literary critic: thank you so much for your valued
advice, and for proof-reading this entire text. To all of the above, and all the
rest of you whose names are too many to list here: Thank you so much, I am
truly blessed to be able to call you my friends.
To my family, parents and brothers: without your continuous support none
of this would have been possible.
Finally, to my girlfriend Martine Espeseth. Thank you so much for all your love





Contents
List of Figures
List of Tables
List of Listings
List of Abbreviations
1 Introduction
1.1 Problem Definition
1.2 Scope and Limitations
1.3 Interpretation
1.4 Methodology
1.5 Context
1.6 Outline
2 Power Management
2.1 Introduction to Power Management
2.2 Evaluating Power Management Policies
2.3 Approaches to Power Management
2.3.1 Heuristic Power Management Policies
2.3.2 Stochastic Power Management Policies
2.3.3 Machine Learning Based Power Management Policies
2.4 Power Management Mechanisms
2.4.1 Device Detection and Configuration
2.4.2 Power Management of Devices
2.4.3 Power Management of CPUs
3 ROPE Architecture and Design
3.1 ROPE Power Management Architecture
3.2 Design
3.2.1 CPU Core Power State Management
3.2.2 CPU Core Performance State Management
3.2.3 Energy Efficient Scheduling
4 Implementation
4.1 ACPI
4.1.1 Supporting ACPI in Vortex
4.1.2 CPU Performance and Power Management using ACPI
4.2 Intel Specific Advanced Power Management
4.2.1 The CPUID Instruction
4.2.2 The MONITOR/MWAIT Instruction Pair
4.3 The Share Algorithm
4.3.1 Definition
4.4 Per-core Power State Management
4.4.1 Aggressive Entry of C-states
4.4.2 Static Timeout Based C-state Entry
4.4.3 Select Best Performing C-state
4.5 Per-core Performance State Management
4.5.1 Quantifying CPU Utilization
4.5.2 Global Phase History Table Predictor
4.5.3 Naive Forecaster
4.6 Energy Efficient Scheduling
4.6.1 Topology Agnostic Dynamic Round-Robin
5 Evaluation
5.1 Methodology
5.1.1 Experimental Platform
5.1.2 Measuring Power Consumption
5.1.3 ApacheBench Workload Traces
5.1.4 Prediction Test Workload
5.1.5 HammerDB and TPC-C Benchmark
5.1.6 Fine Grained Workload
5.2 CPU Power State Management
5.2.1 Aggressively Entering C-states
5.2.2 Static Timeout Based C-state Entry
5.2.3 Select Best Performing C-state
5.2.4 Comparison of C-state Management Policies
5.3 CPU Performance Management
5.3.1 GPHT Predictor
5.3.2 Comparison of Prediction Accuracy
5.3.3 Comparison of P-state Management Policies
5.4 Core Parking Scheduler
5.4.1 Internal Measures
5.4.2 Comparison with Standard Vortex Scheduler
5.4.3 Effects of Energy Efficient Dynamic Round-Robin Scheduling
5.5 Performance Comparison of Power Management Policies
5.5.1 Summary
6 Related Work
6.1 Power Management of CPUs
6.1.1 Share Algorithm
6.2 Energy Efficient Scheduling
6.3 Power Management in Data Centers and Cloud
6.4 Reduction of CO2 Emissions
7 Discussion and Concluding Remarks
7.1 Discussion
7.1.1 Findings
7.1.2 Power Management in the Omni-kernel Architecture
7.2 Contributions and Achievements
7.3 Future Work
7.4 Concluding Remarks
A ACPI Objects and Namespace
A.1 The ACPI Namespace
A.2 ACPI Objects
A.2.1 The _OSC Object
A.2.2 The _PSS Object
A.2.3 The _PPC Object
A.2.4 The _PCT Object
A.2.5 The _PSD Object




List of Figures
3.1 Architecture of PM functionality implemented in Vortex.
3.2 Design of a policy for aggressively entering C-states.
3.3 Design of policy for entering a C-state following a static timeout.
3.4 Design of policy dynamically selecting best C-state.
3.5 Design of policy for managing CPU performance states.
3.6 Design of energy efficient scheduler.
4.1 Architecture of ACPI.
4.2 Interaction between ACPICA-subsystem and host OS.
4.3 Structure of ACPICA-subsystem.
4.4 Organization of CPU power-, performance-, and throttling states.
4.5 Cost of handling spurious timer interrupt.
4.6 Static Timeout Policy Overheads - Latency distributions.
4.7 Implementation of the policy selecting the best performing C-state.
4.8 Problem with periodic sampling of CPU activity.
4.9 Cost of sending RLE C-state trace.
4.10 Computational overhead of Share Algorithm.
4.11 Selection of best performing C-state.
4.12 Implementation of Core Performance State Manager.
4.13 PHT operation costs.
4.14 Implementation of GPHT.
4.15 Idle function overhead attributable to per-core performance management.
4.16 Runtime selection of P-states.
4.17 Overhead of naive forecaster.
4.18 Implementation of energy efficient scheduler.
4.19 CPU utilization over time under dynamic round-robin scheduling.
4.20 Energy efficient scheduling - Cost of operations.
5.1 Screenshot of running TPC-C with HammerDB.
5.2 Power consumption of aggressively entering different C-states.
5.3 Summed latency of aggressively entering different C-states.
5.4 Instantaneous Latency of aggressively entering different C-states.
5.5 Power consumption of entering C-state following static timeout.
5.6 Excess completion time resulting from entering C-state after static timeout.
5.7 Timer cancel rates for static timeout policies.
5.8 User-perceived latency using static timeout policies.
5.9 Comparisons of Share Algorithm Configurations.
5.10 Comparison of power consumption using different CPU C-state management policies.
5.11 Comparison of user experienced latency when employing different CPU C-state management policies.
5.12 GPHT accuracy and hitrate.
5.13 MASE of different prediction algorithms.
5.14 Power consumption of using only P-states.
5.15 Excess completion times using P-states for power management.
5.16 Instantaneous latency of using P-states for power management.
5.17 Properties of energy efficient DRR scheduler.
5.18 Power consumption of synthetic loads using energy efficient DRR scheduler.
5.19 Comparison of power consumption using different CPU PM policies.
5.20 Excess completion time resulting from use of PM policy combinations.
5.21 User-perceived latency using PM policy combinations.
A.1 ACPI namespace object.
A.2 Evaluation of _PSS object.
A.3 Entering of CPU P-state.
A.4 Evaluation of _CST object.
List of Tables
4.1 “mem/𝜇-ops” to execution phase mappings.
4.2 Summary of Core Performance Management Implementation costs.
5.1 Summary of AB workload trace generation.
5.2 Workload relative energy savings when entering C-states directly.
5.3 Summary of performance degradation due to direct entry of C-states.
5.4 Summary of performance degradation due to entering C-state following a static timeout.
5.5 Share Master Configurations.
5.6 Summary of performance of Share Algorithm with different configurations.
5.7 Comparison of C-state management policies.
5.8 Accuracy of implemented P-state prediction algorithms.
5.9 Accuracy and MASE of P-state prediction algorithms.
5.10 Comparison of P-state management policies.
5.11 Relative energy saving when using energy efficient DRR scheduler.
5.12 Comparison of ROPE PM policies.




List of Listings
4.1 Usage of MONITOR/MWAIT instruction pair.
4.2 Idle function accommodating aggressive entry of C-states.
4.3 Idle function with code for entering C-states after static timeout.
4.4 Idle loop modified to support entry of best performing C-state.
4.5 Idle function modified to support P-state selection.
4.6 Performance State Management Thread.






List of Abbreviations
abi application binary interface
acpi Advanced Configuration and Power Interface
acpica acpi Component Architecture
aml acpi Machine Language
apm advanced power management
cdn content delivery network
cmos Complementary metal-oxide-semiconductor
cpu central processor unit
cpuid cpu Identification
dc data center
dpm dynamic power management
drr dynamic round-robin
dvfs dynamic voltage and frequency scaling
dvs dynamic voltage scaling
em enforcement mechanism
ema exponential moving average
ffh functional fixed hardware
gphr Global Phase History Register
gpht Global Phase History Table
http HyperText Transfer Protocol
i/o input/output
ict information and communication technologies
ipmi Intelligent Platform Management Interface
lkm loadable kernel module
llc last-level cache
lru least recently used
mase mean absolute scaled error
mwait Monitor Wait
nopm new-order transactions per minute
oka omni-kernel architecture
oltp online transaction processing
os operating system
osl OS Service Layer
ospm Operating System-directed Power Management
pci Peripheral Component Interconnect
pcie Peripheral Component Interconnect Express
pht Pattern History Table
pm power management
pmc performance monitoring counter
pmi performance monitoring interrupt
psm Policy State Manager
qos quality of service
rle run-length encoding
rope Reducing the Omni-kernel Power Expenses
sla service level agreement
slo service level objective
sma simple moving average
soc system on chip
tpc Transaction Processing Performance Council
tpm transactions per minute




vmm virtual machine monitor
vpc Virtual Private Cloud




1
Introduction
Over the last decade, power consumption and energy efficiency have arisen as
important performance metrics for data center (dc) computing systems host-
ing cloud services. The incentives for reducing power consumption are several,
and often span economic, technological, environmental, and even political
dimensions. Already in the mid nineties, the concept of "green computers"
surfaced as a response to increased CO2 emissions and growing costs [85].
Emissions and costs have only increased with time, and in 2007, the infor-
mation and communication technologies (ict) sector was estimated to have
been responsible for approximately 2% of all global carbon emissions, 14% of
this being attributable to data center operation [87]. In 2012, Facebook’s dcs
alone had a carbon footprint of 298 000 tons of CO2 [8], roughly the same as
emissions originating from the electricity use of 24 000 homes [1].
Emissions originating from dcs are expected to double by 2020 [87]. Thus, dc
energy consumption is no longer only a business concern, but also a national
energy security and policy concern [90]. For example, in the United States, the
U.S. Congress is taking the issue of dc energy efficiency to the national level
[4]. Equivalently, the European Commission is considering the introduction
of a voluntary code of conduct regarding energy efficiency in dcs [5]. It is
expected that other regions and countries will introduce similar regulations
and legislation within the foreseeable future [90].
Technologically, reducing power consumption is important mainly for two rea-
sons. First, high power consumption is directly correlated with increased heat
dissipation. As computers are getting both smaller and packed more densely,
the necessary cooling solutions become increasingly expensive. For instance,
a cooling system for a 30,000 square foot, 10 MW data center can cost $2–$5
million [71]. According to the same source, every watt spent by computing
equipment necessitates the consumption of an additional 0.5 to 1.0W oper-
ating the cooling infrastructure. This amounts to $4–$8 million in annual
operating costs [76]. In data centers, where availability is essential, failures
within the cooling infrastructure can be a major problem as this can result in
reduced mean time between failure, and even service outages [64].
Towards the turn of the millennium, huge data centers started emerging
across the globe. With as many as tens of thousands of servers [64], energy
consumption again surfaced as a key design constraint. For some facilities
the electricity expenses are one of the dominant costs of operating the dc,
accounting for 10% of the total cost of ownership [49]. Although the rate at
which power consumption in dcs increases is slowing down [48], trends of
increased energy consumption in dcs are likely to continue.
Because of the vast number of computers currently employed in dcs, the
economy of scale dictates that even small reductions in power expenditure on
machine and device level can amount to large energy savings on data center
scale.
At the machine level, clouds commonly employ hardware virtualization tech-
nologies to allow for higher degrees of utilization of the physical hardware.
However, this approach is not without problems. The workloads encapsulated
by virtual machines constantly compete for the available physical hardware
resources of their host machines, and interference from this resource shar-
ing can cause unpredictable performance. Even when virtual machine (vm)
technologies are employed, some operating system (os) will have to serve
as a foundation, enabling the allocation of resources to the vms. Virtual ma-
chine monitor (vmm) functionality is implemented as an extension to such
an operating system.
The omni-kernel architecture (oka) is a novel operating system architecture
especially designed for pervasive monitoring and scheduling [52], [53]. The
architecture ensures that all resource consumption is measured, that the re-
source consumption resulting from a scheduling decision is attributed to the
correct activity, and that scheduling decisions are fine-grained.
Vortex [52], [53], [68], [69], [67] is an experimental implementation of the
oka providing a novel and light weight approach to virtualization. While
many conventional vmms expose virtualized device interfaces to their guest
vms [11], [78, p. 254–258], Vortex offers high-level commodity os abstractions.
Examples of such abstractions are files, memory mappings, network
connections, processes, and threads. By doing this, Vortex aims to offload
common os functionality from its guest oss, reducing both the resource foot-
print of a single vm and the duplication of functionality across all the vm oss.
Vortex does not currently provide a complete virtualization environment ca-
pable of hosting ports of commodity oss. Instead, a light weight approach
that targets compatibility at the application level is used; thin vm oss similar
to compatibility layers emulate the functionality of existing oss, retaining
their application binary interfaces (abis). Currently, a vm os capable of running
Linux applications such as Apache, MySQL, and Hadoop exists [68],
[69], while efforts to do the same for Windows applications are under way
[36].
This thesis describes rope (Reducing the Omni-kernel Power Expenses),
which introduces several dynamic power management algorithms to the Vor-
tex implementation of the oka.
1.1 Problem Definition
This thesis shall design, deploy, and evaluate power management functionality
within the Vortex operating system kernel. Focus will be to conserve energy while
limiting performance degradation.
1.2 Scope and Limitations
Power management is a multifaceted problem. In the context of data centers,
a multitude of approaches and techniques are currently being employed. In
order to save as much energy as possible, all aspects of power consumption
must be analyzed, understood, and managed cleverly. In this thesis, the focus
lies on power management on a per-node basis. We do not focus on tech-
niques involving multiple nodes, for instance through consolidation of virtual
machines [55], [43], [81], intelligent placement of workloads [32], [64], or
scheduling of work based on the availability of green power [10], [13]. Fur-
ther, we do not focus on the use of exotic or mobile technology in order to
save power [59], [34]. Rather, we specifically target software techniques for




1.3 Interpretation
Our problem concerns effectuating pm while limiting performance degrada-
tions. If tradeoffs between energy savings and performance must be made,
the evaluation of the implemented functionality should enable a user to make
informed decisions. This requires that the system be tested with relevant
benchmarks and workloads. Specifically, the implemented system should ad-
here to the following principles and requirements:
Design Goals and Principles
• Simplicity: Where possible, simple solutions should be adopted. This
is especially important as Vortex is under continuous development by
faculty and students.
• Flexibility: Flexibility is preferable over rigidity: Vortex is constantly
changing. If rope is to remain relevant its implementation must be
both flexible and robust.
• Orthogonality: rope should target the issues of energy efficiency along
orthogonal axes. Solutions and policies should complement rather than
rely on each other, making future additions and modifications as un-
complicated as possible.
Requirements
• The system should reduce the power consumption of Vortex.
• The system should make use of existing standards for power manage-
ment (pm) wherever possible.
• The implementation of the system should focus on performance. For all
implemented functionality, performance implications should be evalu-
ated.
1.4 Methodology
According to the final report of the acm Task Force on the Core of Com-
puter Science, the discipline of computing can be divided into three major
paradigms. These paradigms are theory, abstraction, and design.
The first paradigm is theory. Rooted in mathematics, the theoretical approach
is characterized by first defining problems, then proposing theorems, and
finally seeking to prove them in order to determine potentially new relation-
ships and make progress in computing.
Abstraction is rooted in the experimental scientific method. Following this
approach one seeks to investigate a phenomenon by constructing a hypothesis,
and then making models or performing simulations to challenge this hypothesis.
Finally, the results are analyzed.
Design is the last paradigm, and is rooted in engineering. Within this paradigm,
one seeks to construct a system for solving a given problem. First, the require-
ments and specifications of said system are defined. The system should then
be designed and implemented according to these requirements and specifica-
tions. Upon completion of the system, it should be tested and evaluated.
For this thesis, the design paradigm seems to be the most fitting. We have
stated a problem, its requirements, and specifications. Now, a prototype solv-
ing the problem according to the specification must be designed and imple-
mented. The testing of the system will amount to quantitatively evaluating
its performance and cost of operation.
1.5 Context
This project is written as a part of the Information Access Disruption (iad)
center. The iad center focuses on research into fundamental concepts and
structures for large-scale information access. The main focus areas are tech-
nologies related to sport, cloud computing, and analytic runtimes.
Specifically, this project is related to the Omni-kernel architecture project,
where a novel os architecture providing total control over resource alloca-
tion is under development. It is in Vortex, the current instantiation of this
architecture, that rope has been implemented.
1.6 Outline
Chapter 2 covers some basics of power management, introduces some key
concepts, and outlines often employed classes of pm algorithms. It also
introduces different power management mechanisms.
Chapter 3 describes the architecture and design of rope, and how power
management functionality has been introduced to the Vortex operating
system kernel. Further, this chapter explains how different pm
policies coexist and complement each other.
Chapter 4 covers the implementation specific details of the pm policies de-
scribed in chapter 3. It also covers acpi and how it is incorporated into
the Vortex kernel. Intel-specific pm capabilities are also discussed.
Chapter 5 evaluates the power management policies implemented in rope,
and contains our experiments and results.
Chapter 6 presents related work.
Chapter 7 discusses some of our findings and results in greater detail, and
covers how energy efficiency can be accommodated in the Omni-kernel
architecture. Finally, it concludes this thesis.
2
Power Management
This chapter covers the basic principles of power management (pm). First,
key terms and concepts are introduced. Following this, different approaches
to pm as well as common classes of algorithms are outlined.
2.1 Introduction to Power Management
Dynamic power management (dpm) [70], [19], [77] is a design methodology
for dynamic configuration of systems, components, or software, that aims to
minimize the power expenditure of these under the current workload, and
constraints of service and performance levels requested by the user.
dpm can be applied by different techniques that selectively deactivate or
power down components when they are idle or unexploited. The goal of all
such techniques is to achieve more energy efficient computation, and a central
concept of dpm is the ability to trade off performance for power savings in
a controlled fashion while also taking the workload into account. The main
function of a power management scheme, or policy, is to determine when to
initiate transitions between component power states, and to decide which
transitions to perform. For instance, if a device supports more than one pm
state, one of these states may be more suitable to reduce energy consumption
under a specific workload than another. For example, if the workload is memory
bound, it might be possible to lower the frequency of a central processor unit
(cpu), thereby reducing power consumption, while still managing to perform
all computations in a timely fashion.
Exactly what decision is (or should be) made at any time will be highly de-
pendent on the active pm policy. Any performance constraints will play a
fundamental role when making power management decisions [14], [70], re-
gardless of whether the policy makes use of historical-, current-, or predicted
future workloads to find the best course of action.
2.2 Evaluating Power Management Policies
Throughout this text, the term workload, as defined in [41], describes the
physical usage of the system over time. This usage can be measured in any
unit suitable as a metric. For instance, when implementing a policy that will
power up or down a network interface card, average bandwidth and latency
might be suitable metrics for capturing the characteristics of the workload.
When deciding to spin down a hard drive, the number of read and write
operations might be of more interest. Often, consecutive measurements of
the workload characteristics are stored in a trace.
When attempting to conserve power by means of dpm policies, a problem that
needs to be considered is the possibly detrimental effects of said policies on
system performance. More precisely, it is desirable that the negative impact on
metrics such as latency and job throughput is reduced to the extent possible
[65], [70], [19], [77]. Because this is a critical issue, a means of comparing
the amount of performance degradation of different policies or algorithms for
dynamic power management is necessary. One way of approaching this is to
define some baseline that different policies and algorithms can be compared
to.
An important consideration of any power management system is that transitions
between different power states of a device most often come with a
cost. If a transition is not instantaneous, and if the device is non-operational
during this transition, performance can be lost when a transition is performed.
Whenever transitions are non-instantaneous, there might also be a power cost
associated with the transition. If the costs in performance, power, or both, are
large, a power state might become close to useless as amortizing the transition
costs becomes difficult [14].
Given a trace corresponding to some workload, an optimal baseline can be
created in hindsight. Note that because no system can be all knowing of past,
present, and future workloads, the optimal baseline can only be generated
offline. We define this optimal baseline, an all knowing oracle, as having the
following characteristics:
• The device should not be powered down if this will affect performance
negatively.
• The device should not change power states if the state transition will
result in a reduction of saved power, i.e., the state transition will con-
sume more power than is saved by entering the state and remaining in
it throughout the idle period.
• The device should be powered down at all points in time not covered
by the above cases, thereby maximizing the amount of saved power.
In [14], the authors survey several different approaches to dpm. We will adopt
their notation when explaining applicability of dynamic power management.
Switching a device into an inactive (sleep) state results in a period
of inactivity. We denote the duration of a generic period at time 𝑛 by 𝑇𝑛. This is
the sum of the time spent in a state 𝑆, as well as the time spent transitioning
into and out of the state. The break-even time for a given state, 𝑆, is denoted
𝑇𝐵𝐸,𝑆 , and corresponds to the amount of time that must be spent in the state
𝑆 in order to compensate for the power spent transitioning to and from it.
Most often, lower power states have higher associated performance penalties.
In the case that 𝑇𝑛 < 𝑇𝐵𝐸,𝑆 , there is typically not enough time to both enter
and leave the inactive (sleep) state, or the time spent in the low-power state
is not long enough to amortize the cost of transitioning to and from it. This
means that if all idle periods corresponding to a given workload are shorter
than 𝑇𝐵𝐸,𝑆, this power state cannot be used to save any power. A transition
into a low-power state during an idle period that is shorter than the
break-even time for this state is guaranteed to waste energy. Because of this, it is
often desirable to filter out idle times of very short duration [84].
𝑇𝐵𝐸 is the sum of two terms. The first of these, 𝑇𝑇𝑅, is the total transition
time, meaning the time necessary to both enter and leave the low-power state.
The second is the additional time that must be spent in the low-power state
to compensate for the power spent during the transition phases; it is
determined by the transition power, 𝑃𝑇𝑅. 𝑇𝑇𝑅 and 𝑃𝑇𝑅 are given by equations
2.1 and 2.2.
\[ T_{TR} = T_{On,Off} + T_{Off,On} \qquad (2.1) \]
where 𝑇On,Off and 𝑇Off,On are the times necessary to perform a transition from
the On state to the Off state and from the Off state to the On state, respectively.









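\[ T_{BE} = \begin{cases} T_{TR} + T_{TR}\,\dfrac{P_{TR} - P_{On}}{P_{On} - P_{Off}}, & \text{if } P_{TR} > P_{On} \\ T_{TR}, & \text{if } P_{TR} \le P_{On} \end{cases} \qquad (2.3) \]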
Further, the energy saved by entering a state 𝑆 for an idle period 𝑇idle, with
a duration longer than the break-even time for this state, 𝑇𝐵𝐸,𝑆, is given by
the following equation:
\[ E_S(T_{idle}) = (T_{idle} - T_{TR})(P_{On} - P_S) + T_{TR}(P_{On} - P_{TR}) \qquad (2.4) \]
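The break-even time and energy-saving expressions can be evaluated directly. The following is a minimal sketch, assuming that the low-power state 𝑆 is the Off state of equation 2.3; the function names and the numeric device parameters are hypothetical and chosen only for illustration. The hindsight decision covers the energy criterion of the oracle only, not its performance criterion.

```python
def break_even_time(t_tr, p_tr, p_on, p_off):
    """Break-even time T_BE following equation 2.3."""
    if p_tr <= p_on:
        return t_tr
    return t_tr + t_tr * (p_tr - p_on) / (p_on - p_off)


def energy_saved(t_idle, t_tr, p_tr, p_on, p_s):
    """Energy saved E_S(T_idle) following equation 2.4."""
    return (t_idle - t_tr) * (p_on - p_s) + t_tr * (p_on - p_tr)


def oracle_sleeps(t_idle, t_be):
    """Hindsight decision: sleep only when the idle period exceeds T_BE."""
    return t_idle > t_be


# Hypothetical device parameters (seconds and watts), not measured values.
T_TR, P_TR, P_ON, P_OFF = 2.0, 3.0, 2.0, 0.5
t_be = break_even_time(T_TR, P_TR, P_ON, P_OFF)
for t_idle in (1.0, 10.0):
    if oracle_sleeps(t_idle, t_be):
        saved = energy_saved(t_idle, T_TR, P_TR, P_ON, P_OFF)
        print(t_idle, "sleep, energy saved:", saved)
    else:
        print(t_idle, "stay on")
```

With these parameters, an idle period exactly as long as the break-even time yields zero saved energy, which is the defining property of 𝑇𝐵𝐸.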
In [35], the authors introduce two concepts useful when evaluating the qual-
ity of power management policies. The first of these—external measures—
gives a quantitative measure of the interference between the PM policy and
other applications. Any profit resulting from the use of the policy will also be
quantified with external measures. Examples of metrics covered by external
measures are power consumption, latency, and system resource consumption.
Internal measures, on the other hand, can be used to quantify the accuracy of
prediction algorithms. While the internal measures are useful when deciding
on which prediction algorithms to use, the external measures are the ones
that really matter for a running system.
2.3 Approaches to Power Management
In general, approaches to pm are either heuristic or stochastic. Although their
goals are similar, the two classes of policies differ widely in their implemen-
tation as well as their theoretical foundations. Stochastic policies are theoret-
ically optimal, and use statistical distributions to model workloads. Heuristic
approaches, on the other hand, are not theoretically optimal, but often based
on experimental results and practical engineering. Because of this, heuris-
tic approaches such as static timeout policies—albeit often used with great
success—may lead to bad performance and unnecessary power consumption
when the request rate for a service is highly variable [77].
pm policies can be further divided into discrete-time or event-driven vari-
ants. With discrete-time policies, decisions are re-evaluated at every time
slice. Event-driven policies, on the other hand, only evaluate what action to
take when some event, such as the arrival or completion of a request, occurs.
The fact that discrete-time solutions continuously re-evaluate the power
settings of the system means that power is wasted if the device could otherwise
be placed in a lower power state [84]. Further, pm policies can be split into
stationary and non-stationary policies. A stationary policy is constant in that it
remains the same for all points in time, while a non-stationary policy changes
over time according to the experienced workload.
2.3.1 Heuristic Power Management Policies
Heuristic pm algorithms can in broad terms be split into two classes:
timeout-based and rate-based. Common to all heuristic policies is that results
are not guaranteed to be optimal.
Timeout-based Policies
The most common heuristic policies are variants of timeout policies [84].
Whenever an idle period occurs, a timer is started. A state transition is
initiated if the idle period is longer than some pre-defined timeout value.
Timeout-based policies have the disadvantage that they result in unnecessary
power expenditure while waiting for the timeout to expire [85]. A static
timeout-based policy always employs the same timeout value, whereas adaptive
timeout policies modify their timeout parameters in order to better meet some
performance goal [35]. Simple adaptive policies can result in larger power
savings for a given level of latency [39].
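The cost of waiting for a timeout to expire can be made concrete by replaying a trace of idle periods. The sketch below is an illustration only: the function name and the trace are hypothetical, and transition times and transition energy are ignored for simplicity.

```python
def simulate_static_timeout(idle_periods, timeout):
    """Replay a trace of idle-period lengths against a static timeout.

    Returns (time_asleep, time_waiting, transitions), where time_waiting is
    the idle time spent powered on while the timer is running.
    """
    asleep = 0.0
    waiting = 0.0
    transitions = 0
    for idle in idle_periods:
        if idle > timeout:
            # The timer expires: the device sleeps for the remainder.
            waiting += timeout
            asleep += idle - timeout
            transitions += 1
        else:
            # The idle period ends before the timer expires.
            waiting += idle
    return asleep, waiting, transitions


# Hypothetical trace of idle periods, in seconds.
trace = [1.0, 5.0, 0.5, 10.0]
print(simulate_static_timeout(trace, timeout=2.0))
```

Lowering the timeout reduces the time spent waiting, but increases the number of transitions triggered by short idle periods.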
Rate-based Policies
With rate-based policies, some prediction unit or algorithm calculates an es-
timate (prediction) of future load for the power-managed device or system.
Typical examples are estimation of idle periods [35], the number of incoming
requests, or the bandwidth consumption of a webserver. Rate-based policies
make the system or component perform a state transition as soon as an idle
period begins, given that it is likely that the idle period will be long enough
to save power [85]. The efficiency of an algorithm therefore depends on the
ability to estimate the duration of the idle periods to a satisfactory degree,
as failure to do so can cause performance penalties and also result in waste
of energy. Normally, thresholds are used to determine when to power a de-
vice up or down. Examples of rate-based predictors are different variants of
moving averages.
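A rate-based policy built on a moving average could look roughly as follows. This is a minimal sketch under stated assumptions: the class and function names are hypothetical, the predictor estimates the next idle period as the mean of the most recent observations, and the threshold would in practice be related to the break-even time of the target state.

```python
from collections import deque


class MovingAveragePredictor:
    """Predicts the next idle period as the mean of the last `window` ones."""

    def __init__(self, window):
        self.samples = deque(maxlen=window)

    def observe(self, idle_length):
        # Record the length of an idle period that has just ended.
        self.samples.append(idle_length)

    def predict(self):
        if not self.samples:
            return 0.0
        return sum(self.samples) / len(self.samples)


def should_power_down(predictor, threshold):
    """Power down at the start of an idle period if it is predicted to be long."""
    return predictor.predict() > threshold


predictor = MovingAveragePredictor(window=3)
for idle in (1.0, 2.0, 3.0, 4.0):
    predictor.observe(idle)
print(predictor.predict(), should_power_down(predictor, threshold=2.5))
```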
2.3.2 Stochastic Power Management Policies
Stochastic approaches to pm differ from heuristic techniques in that they
guarantee optimal results as long as the system to be managed is modeled
well enough [84]. All stochastic models use statistical distributions to describe
the system that is to be power managed. For instance, while one distribution
describes the time spent transitioning to and from different power states, an-
other might describe the time spent by some resource to service a client’s
request, while a third might model the inter-arrival times of these requests.
The distribution used to model the different parts of the system varies, and
one typically tries to find a distribution that closely matches the real system.
Although theoretically optimal, the quality of these models is highly
dependent on the system being sufficiently well-modeled. Further, as the optimal
solution is calculated from traces, the model quality will always be constrained
by the trace quality, and the degree to which distributions fit the actual usage
patterns [75].
2.3.3 Machine Learning Based Power Management Policies
Machine learning revolves around the creation and study of computer systems
and machines that can learn from data presented to them. Many examples of
such systems exist, for example speech recognition systems, spam filters, and
robot controlled cars. A formal definition of machine learning was presented
by Tom M. Mitchell in his 1997 book on the topic:
A computer program is said to learn from experience 𝐸 with re-
spect to some class of tasks 𝑇 and performance measure 𝑃 , if
its performance at tasks in 𝑇 , as measured by 𝑃 , improves with
𝐸. [60]
For example, a computer that learns to play a card game might improve its
performance with regard to winning that game, as it gains experience by
playing it against itself or others. The quality of a machine learning system is
dependent on multiple design choices, such as which training data is used,
the features extracted from this data, and the chosen algorithm for learning
the target function based on the training data.
Machine learning has been used extensively in the context of pm. Examples
include powering down idle hard drives and wireless network interfaces [28],
[39], [86], putting cpus to sleep and adjusting core frequencies [54], [9],
[61], as well as energy efficient scheduling of tasks [93].
There are many possible ways to train a machine learning system. With
supervised learning, pre-labeled training data is used to approximate the target
function. The key point of supervised learning is that a human, for example
some expert in the field of the classification problem, has to correctly label
the training data so that it can be used to improve the performance of the
algorithm. In contrast, unsupervised learning relies on the ability to uncover
some hidden structure in raw unlabeled data. Semi-supervised learning
combines these approaches. Labeled training data is used as a starting point
for automatically labeling the remaining data. This has been found to
improve learning accuracy considerably in some cases where the generation of a
fully labeled training set is infeasible, while unlabeled data can be obtained
cheaply. Reinforcement learning is a category of machine learning algorithms
that attempt to find ways of maximizing the reward (minimizing the cost) of
a system by repeatedly selecting one among a set of predefined actions.
In the context of power management, machine learning approaches are often
divided into two classes of systems, namely pre-trained systems, and systems
employing online learning.
Pre-trained Systems
Pre-trained systems commonly employ supervised or semi-supervised learning
to train the employed pm algorithm. Typically, example workloads are
used to obtain traces used for training. Unless a representative set of train-
ing workloads is used, the algorithm may become unable to generalize when
experiencing real loads. Because of the dynamic nature of pm problems, pre-
trained systems are less common than online ones. Nonetheless, pre-trained
systems have been used successfully for pm, for instance to change cpu fre-
quencies [9], [61], [47].
Online Learning
Online learning encompasses machine learning methods in which no set of
separate training samples is used. Instead, the algorithm tries to produce the
best prediction possible as examples arrive, starting with the first example that
it receives. After making a prediction, the algorithm is told whether or not this
prediction was correct, and uses this information to improve its prediction-
hypothesis. Each example-prediction pair is called a trial. The algorithm will
continue to produce predictions as long as new examples arrive, and the
learning and improvement of the hypothesis is thus continuous. Because
the learning process never stops, it is sensible to employ an algorithm that
is capable of computing successive prediction-hypotheses incrementally, and
avoiding the potentially excessive costs of recalculating every hypothesis from
scratch (thus having to store the examples indefinitely) [56], [57].
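The trial structure of online learning can be illustrated with the classic Weighted Majority algorithm, which keeps one weight per expert and updates the weights incrementally after every trial. It is used here purely as an example of an incrementally updated hypothesis, and is not claimed to be the algorithm employed in rope; the class and function names, the experts, and the trial data are hypothetical.

```python
class WeightedMajority:
    """Binary online learner that combines the votes of several experts."""

    def __init__(self, n_experts, beta=0.5):
        self.weights = [1.0] * n_experts
        self.beta = beta  # penalty factor applied to experts that err

    def predict(self, votes):
        # Weighted vote; ties are resolved in favour of True.
        yes = sum(w for w, v in zip(self.weights, votes) if v)
        no = sum(w for w, v in zip(self.weights, votes) if not v)
        return yes >= no

    def update(self, votes, outcome):
        # Incremental update: no past examples need to be stored.
        for i, vote in enumerate(votes):
            if vote != outcome:
                self.weights[i] *= self.beta


def run_trials(learner, trials):
    """Each trial is (expert_votes, true_outcome). Returns the mistake count."""
    mistakes = 0
    for votes, outcome in trials:
        if learner.predict(votes) != outcome:
            mistakes += 1
        learner.update(votes, outcome)
    return mistakes


# Hypothetical trials: two experts voting on "will the idle period be long?"
trials = [((True, False), True), ((True, False), True), ((False, True), False)]
learner = WeightedMajority(n_experts=2)
print(run_trials(learner, trials), learner.weights)
```

Only the weight vector is carried from one trial to the next, so the cost per trial is independent of the number of examples seen so far.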
Online statistical machine learning has been used to control cpu frequency
[45], while online reinforcement learning has been employed in the domain of
energy efficient task scheduling and for managing wireless local area network
(wlan) interfaces [93], [86]. Online machine learning has also been used
to spin down hard drives in order to conserve energy [39].
2.4 Power Management Mechanisms
pm can be performed on a wealth of devices. Notable examples are powering
down network cards, spinning down hard drives, adjusting fan-speeds, etc.
To enable software—for example an os—to take advantage of available pm
capabilities of the hardware, some mechanisms are necessary. First, it must
be possible to detect the presence of any pm capable hardware. Second, the
configuration of devices (and possibly the software itself) must be performed.
Third, the software must have some method to communicate pm decisions
to the hardware. The following sections give a brief overview of different
pm mechanisms. Thorough descriptions of these mechanisms are deferred to
chapter 4.
2.4.1 Device Detection and Configuration
With myriads of different hardware vendors and rapid technological develop-
ment, the existence of multiple standards for detection and configuration of
devices is not surprising. Examples are the use of vendor- and device-specific
bios code, and advanced power management (apm) apis and description
tables. Advanced Configuration and Power Interface (acpi) evolved from the
need to combine and refine existing power management techniques such as
those mentioned above, into a robust and well defined specification for device
configuration and power management. Using this interface, an os can man-
age any acpi-compliant hardware in a simple and efficient manner.
2.4.2 Power Management of Devices
Through the use of, for example, acpi, software can detect and configure
pm capable devices in the system. acpi provides information about devices
through objects. While the objects normally contain tables of information,
they can also contain methods that can be used to communicate software
capabilities and initiate state transitions. Consider a pm-capable rotating hard
drive: through acpi it presents tables listing its power/performance
states. Such tables would likely include the power consumption, bandwidth,
and expected read/write latency for each state. They would probably also state
how long it takes to enter the different states, and their associated spin-up
time, that is, the time spent spinning the disk back up when leaving them. Using
the information obtained from such tables, and acpi accessible methods for
the device, software can decide on what state the disk should reside in at any
given time, and then make it enter the chosen state. Figure A.3 in appendix
A illustrates the concepts described here, but for entry of cpu performance
states.
Although acpi provides a well defined and robust way of interacting with
devices, not all devices support this specification. Legacy devices, and also
relatively new hardware, can lack support for acpi¹. Sometimes, such devices
provide pm capabilities by means of some other specification or technology.
Common examples are devices connected via pci and pcie interfaces, which
define their own methods for performing pm. In addition, some specialized
devices have functionality and properties not easily representable through
acpi. For instance, Intel employs both acpi and its own custom mechanisms
for detection, configuration, and power and performance management of
its cpus [25], [26].
2.4.3 Power Management of CPUs
acpi provides several mechanisms for controlling the power consumption and
performance of cpus. The two most commonly used (and the ones employed
in this work) are power states (C-states), and performance states (P-states).
C-states correspond to different cpu sleep-states, that is, states when no
instructions can be executed and energy is conserved by powering down
parts of the processor. P-states correspond to combinations of cpu frequency
and voltage; higher frequencies result in increased performance but also
consume more power.
1. On our hardware, the Gbit nics are pm capable but via pci instead of acpi. In fact, our
entire platform is severely lacking in terms of acpi support.
Dynamic Voltage Scaling
Dynamic voltage scaling (dvs) enables software to dynamically adjust both
the cpu voltage and frequency. Typical use of dvs can result in substantial
power savings due to the dependency between cpu voltage, frequency, and
power dissipation [40]. Consider a chip implemented using static cmos
gates: its switching power dissipation is given by equation 2.5, where 𝑃 is the
power, 𝐶 is the capacitance being switched per clock cycle, 𝑉 is the voltage,
and 𝑓 is the switching frequency.
\[ P = C V^{2} f \qquad (2.5) \]
As can be seen from equation 2.5, power dissipation is proportional to the
square of the cpu voltage². Lowering the operating voltage therefore saves
power, but the frequency of the cpu must be reduced along with it. As
such, dvs trades performance for energy reductions. In modern cpus, a set
of frequency/voltage pairs is often exposed through technologies such as
Enhanced Intel® SpeedStep® [27], or AMD's Cool'n'Quiet™/PowerNow!™ [20],
[21]. These technologies simplify power management, and ensure that the
paired frequencies and voltages are compatible. For this reason, dvs is often
also called dynamic voltage and frequency scaling.
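The effect of scaling voltage and frequency together follows directly from equation 2.5. The sketch below compares two frequency/voltage pairs; the capacitance value and both pairs are hypothetical and do not describe any particular processor.

```python
def switching_power(capacitance, voltage, frequency):
    """Switching power dissipation P = C * V^2 * f (equation 2.5)."""
    return capacitance * voltage ** 2 * frequency


# Hypothetical values: 1 nF switched capacitance, two frequency/voltage pairs.
C = 1e-9
high = switching_power(C, voltage=1.2, frequency=2.0e9)
low = switching_power(C, voltage=0.9, frequency=1.0e9)

# The capacitance cancels in the ratio, leaving (0.9/1.2)^2 * (1.0/2.0).
print("power ratio low/high:", low / high)
```

In this example, halving the frequency alone would halve the power, while the accompanying voltage reduction lowers it further to roughly 28% of the original value.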
2. For a relatively non-theoretical explanation as to why this is the case, the following






This chapter covers the architecture and design of rope, which contains all
pm functionality in Vortex. First, the architectural foundations and basic ideas
are established. Following this, designs for various pm policies are introduced
and discussed in more detail.
3.1 ROPE Power Management Architecture
Power management (pm) of the cpu is critical for saving power. Despite mod-
ern cpus having become considerably more energy efficient in recent years
[12], the cpu is still responsible for 25%–60% of the total power consumption
of computers [59], [88], [12]. To reduce power consumption, the cpu must
be managed with regards to both power- and performance states. In broader
terms, this means putting the cpu to sleep and adjusting its frequency.
rope introduces pm functionality to the Vortex kernel. In rope, management
of power and performance states (C- and P-states, respectively) is
performed independently of each other. The architecture of rope is founded
on two central concepts: policies, which are rules describing the use of power
and performance states; and enforcement, which effectuates the policy with hardware. As
outlined in section 2.1, different policies can vary significantly in their nature,
i.e., they can be static or dynamic, pre-configured or based on runtime metrics.
Thus, they may rely on different data structures, libraries, instrumentation¹,
etc. Further, different policies might share a method of enforcement. For
instance, a set of timeout-based policies, each with a distinct timeout value, can
rely on the same method for enforcing the timeout, while a more complex
policy might need a completely different enforcement.
The design goals for rope include simplicity and flexibility. Guided by these,
we base the architecture and design of our solutions on the basic ideas of poli-
cies and enforcers. An os kernel is a delicate and complex piece of software,
and our solutions are influenced by this environment. The necessary logic,
instrumentation, and data structures, must often be incorporated into the
Vortex kernel in a scattered and fragmented fashion, leveraging existing code
paths and infrastructure. Based on these observations, we deem it appropriate
not to establish any kind of rigid framework for the implementation of our pm
functionality, as such a framework may quickly become an impediment with
regards to flexibility. Figure 3.1 outlines the described architecture.
3.2 Design
This section describes several different designs of pm policies. Some of these
provide alternative solutions to the same problem, while other policies have
different objectives. Policies with different objectives are intended to comple-
ment each other, augmenting the Vortex pm system as a whole. Although all
cpu pm features will be somewhat intertwined, we propose policies along
three orthogonal axes:
• cpu power state management: These policies control when to turn a
cpu-core off, and what to do when it is not executing instructions, e.g.
which cpu C-state to enter.
• cpu performance state management: These are policies governing
the performance of the cpu core when it is powered on. Typically, this
means selecting one of the available cpu P-states.
• Energy efficient scheduling: When the individual cpu cores are powered
down and clocked according to the workload, it becomes interesting
to avoid unnecessary wake-ups. Using energy efficient scheduling,
it is possible to assign tasks to cores such that as many cores as possible
are kept inactive, and thus sleeping, at any given time.
Figure 3.1: Architecture of pm functionality implemented in Vortex.
A single pm policy is realized through an enforcement mechanism (em),
and the structures and logic necessary to support this em. These auxiliary
structures and pieces of logic provide the em with necessary information
when effectuating pm decisions. Examples of such information could be how
long the cpu must remain idle before entering a given C-state, which C-state
to enter, and so on. We treat this logic, its data structures, and the information
it contains, as a black box, and call it a Policy State Manager (psm). As long
as the psm is paired with its matching em, the policy implemented by the
two will be effectuated, and the rest of the Vortex kernel will remain entirely
independent from the introduced pm policy.
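The pairing of a psm with its em can be sketched as two small objects, where the em only asks the psm for a decision and then acts on it. This is an illustration of the design idea and not the Vortex implementation (which is written in C); all class, method, and parameter names are hypothetical, and the hardware transition is replaced by a callback.

```python
class StaticTimeoutPSM:
    """Policy State Manager: a black box holding the policy's state and logic."""

    def __init__(self, timeout_us, target_cstate):
        self.timeout_us = timeout_us
        self.target_cstate = target_cstate

    def decision(self, idle_us):
        """Return the C-state to enter, or None to remain awake."""
        if idle_us >= self.timeout_us:
            return self.target_cstate
        return None


class CStateEnforcer:
    """Enforcement mechanism: effectuates whatever its paired psm decides."""

    def __init__(self, psm, enter_cstate):
        self.psm = psm
        self.enter_cstate = enter_cstate  # stand-in for the hardware transition

    def on_idle(self, idle_us):
        state = self.psm.decision(idle_us)
        if state is not None:
            self.enter_cstate(state)
        return state


entered = []  # records the states "entered" in place of real hardware
em = CStateEnforcer(StaticTimeoutPSM(timeout_us=100, target_cstate=3),
                    enter_cstate=entered.append)
em.on_idle(50)
em.on_idle(150)
print(entered)
```

Because the enforcer depends only on the `decision` method, a different psm can be substituted without changing the enforcement code, which mirrors the independence described above.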
Modern cpus commonly provide the possibility of adjusting the voltage and
frequency of sub-components, or voltage/frequency islands (vfis) within a
single chip. For example, recent AMD Opteron processors support indepen-
dent frequency control of all four cores, the shared L3 cache and on-chip
northbridge, the DDR interface, and four HyperTransport links [46], [40]. In
accordance with this trend, all pm functionality implemented in rope oper-
ates at a per-core level. This design choice results in a high degree of flexibility,
which is especially desirable in the context of a vmm. For example, in the
current version of the Vortex kernel all interrupts related to network traffic
are handled by a single cpu core². For some application scenarios, it might be
necessary to ensure maximum performance with regard to network latency.
With the per-core pm approach, a less aggressive policy³ can be configured
for the core handling the network traffic. Further, different tenants can be
configured to run on separate cores, which in turn are configured to use a pm
policy matching the customer’s desired performance/carbon footprint.
The complexity of different psms will vary with the policies they support.
In the following paragraphs, different pm policies available in Vortex are
described. Some of these are relatively straight-forward and have very simple
psms, while others are more sophisticated.
3.2.1 CPU Core Power State Management
rope supports three different policies for managing cpu C-states. This sec-
tion covers their design.
Aggressive Entry of C-states
The simplest pm policy in Vortex is to enter a given C-state aggressively
whenever a core is idle. With this policy, the enforcer simply needs to know
which C-state to enter, and then execute a state transition. The design for this
policy is illustrated in figure 3.2.
Static Timeout based C-state Entry
With static timeout-based policies, a constant timeout value is used to deter-
mine whether or not a C-state should be entered. The device being managed
has to remain inactive throughout the duration of the timeout for the low
power state to be entered. In this policy, the psm is responsible for providing
the static timeout value to the enforcement module. Figure 3.3 illustrates the
design of this policy.
2. Using only a single core for handling network interrupts is a configuration choice. On the
current hardware, performance is best when the same core both inserts packets
into the transmit queue and removes packets from the receive queue. This is because
both of these queues are protected by the same lock, and using a single core thus avoids
lock contention. Experiments have shown a 5.5% increase in cpu consumption if more
than one core is used [52].






Figure 3.2: Design of a policy that aggressively enters C-states.





Figure 3.3: Design of a policy for entering a C-state following a static timeout.
Select Best Performing C-state
This policy employs an algorithm that dynamically selects which C-state to
enter at runtime. This is done at the granularity of individual cpu-cores. The
psm for this policy is composed of two main sub-components, each serving
a distinct purpose:
• Power state logger: Performs runtime logging of the current power
state of a cpu-core.
• C-state selection module: Calculates the power saved and latency
added as a result of aggressively entering different C-states. Based on
these calculations, the currently best performing C-state is chosen.
Figure 3.4 illustrates the design of the policy where the best performing C-
state is selected.












Figure 3.4: The design of a policy where the best performing C-state is chosen.
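How a C-state selection module might weigh saved power against added latency is sketched below. The text does not specify how "best performing" is defined, so this example assumes one possible criterion: maximize the estimated energy saved subject to a budget on the total added wake-up latency. The data class, the function names, the C-state parameters, and the logged idle periods are all hypothetical, and transition energy is ignored.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CState:
    name: str
    power_w: float          # power drawn while residing in the state
    exit_latency_us: float  # latency added each time the state is left


def evaluate(cstate, idle_log_us, active_power_w):
    """Estimate (energy saved, latency added) had this state been entered
    aggressively for every logged idle period."""
    saved = sum(idle * (active_power_w - cstate.power_w) for idle in idle_log_us)
    added_latency = len(idle_log_us) * cstate.exit_latency_us
    return saved, added_latency


def select_best(cstates, idle_log_us, active_power_w, latency_budget_us):
    """Pick the state saving the most energy within the latency budget."""
    best, best_saved = None, 0.0
    for cstate in cstates:
        saved, added = evaluate(cstate, idle_log_us, active_power_w)
        if added <= latency_budget_us and saved > best_saved:
            best, best_saved = cstate, saved
    return best


# Hypothetical C-states and logged idle periods (microseconds).
states = [CState("C1", 10.0, 1.0), CState("C3", 5.0, 50.0), CState("C6", 1.0, 200.0)]
idle_log = [100.0, 300.0, 50.0]
print(select_best(states, idle_log, active_power_w=20.0, latency_budget_us=200.0))
```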
3.2.2 CPU Core Performance State Management
While the previous section outlined different policies for conserving energy
using cpu C-states, this section considers the design of policies for controlling
performance states on a per-core basis. These policies and their components
are responsible for adjusting the operating frequency of an individual core
according to the workload. We employ two workload prediction techniques
for selecting which P-state to use at any given time, both of which share the
same general design.
The psm for supporting dynamic selection of P-states consists of two key
components: a Performance State Logger and a Prediction Module. Figure 3.5
illustrates the design of our P-state management solution.
• Performance state logger: Performs runtime logging of the current
performance state of a cpu-core.
• Prediction module: Generates predictions about which P-state to use.
Relies on records of previously experienced workload patterns in order

















Figure 3.5: The design of a policy capable of managing cpu performance states.
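The interplay between logged workload and P-state prediction can be illustrated as follows. This is a sketch under stated assumptions and not one of the two predictors used in rope: it assumes that utilization is logged as a fraction of the capacity at the highest frequency, uses an exponentially weighted moving average as the predictor, and picks the slowest P-state that still covers the predicted demand plus some headroom. All names and values are hypothetical.

```python
def predict_utilization(history, alpha=0.5):
    """Exponentially weighted moving average over logged utilization samples."""
    if not history:
        return 0.0
    estimate = history[0]
    for sample in history[1:]:
        estimate = alpha * sample + (1 - alpha) * estimate
    return estimate


def select_pstate(frequencies_mhz, predicted_util, headroom=0.1):
    """Return the index of the slowest P-state covering the predicted demand.

    frequencies_mhz is ordered from P0 (fastest) downwards.
    """
    demand_mhz = min(1.0, predicted_util + headroom) * frequencies_mhz[0]
    choice = 0
    for index, freq in enumerate(frequencies_mhz):
        if freq >= demand_mhz:
            choice = index  # later entries are slower, so keep the last fit
    return choice


# Hypothetical P-state table and utilization log.
pstates = [2000, 1500, 1000, 800]
util = predict_utilization([0.25, 0.5])
print(util, "-> P%d" % select_pstate(pstates, util))
```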
3.2.3 Energy Efficient Scheduling
The third and final branch of pm algorithms implemented in rope is that of
energy efficient scheduling. In essence, this problem boils down to keeping
the maximum number of cores idle for as long as possible. There are several
issues that must be considered when designing an energy efficient scheduling
solution:
• Multiple cores per chip: Modern cpus often have multiple cores per
chip, or package. Although some cpus, such as the previously mentioned
AMD Opteron, feature voltage/frequency islands (vfis) per core, many
do not. On our platform, all cores on a package share the circuitry
providing the voltage. This means that the voltage must be maintained
at a level corresponding to the needs of the most voltage-hungry core
on the chip. If the maximum amount of power is to be saved, the cpu-
topology must be taken into account. For example, under light load,
maximum power savings can be achieved when all the cores in a physical
package are completely loaded before distributing the load to another
idle package [83].
• Cache warmness: To keep caches warm, it is undesirable to unneces-
sarily move threads between cpu-cores. This results in an interesting
tradeoff: distribute the total load evenly to all active cores, making sure
that the lowest possible voltage can be drawn; or optimize for cache
warmness and performance, risking that a single core increases the
voltage requirements of the entire package.
• Cache Contention: Although maximum power savings can be achieved
by loading a single package as much as possible before recruiting an
inactive one, this can lead to suboptimal performance. The reason for
this is that cores on the same package often share lower level caches.
If many threads are scheduled on the same package, contention for
these shared resources will occur. The performance impact of this will
be highly dependent on the behavior of the scheduled tasks: if the tasks
are not memory or cache intensive, performance impact will be mini-
mal [83].
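The considerations above can be combined into a simple placement rule: fill packages that are already active before waking an idle one, and spread load evenly within a package. The sketch below illustrates that rule only; it is not the scheduler implemented in rope, it ignores cache contention, and the function name, the capacity limit, and the utilization figures are hypothetical.

```python
def select_core(packages, task_load, capacity=1.0):
    """Choose (package_index, core_index) for a new task, or None if full.

    packages[p][c] is the current utilization of core c on package p,
    with 0.0 meaning that the core is idle.
    """
    def package_load(p):
        return sum(packages[p])

    indices = range(len(packages))
    active = [p for p in indices if package_load(p) > 0]
    idle = [p for p in indices if package_load(p) == 0]

    # Most heavily loaded active packages first, idle packages last.
    for p in sorted(active, key=package_load, reverse=True) + idle:
        fits = [c for c, util in enumerate(packages[p])
                if util + task_load <= capacity]
        if fits:
            # Within a package, pick the least loaded core to even out load.
            core = min(fits, key=lambda c: packages[p][c])
            return p, core
    return None


# Hypothetical system with two packages of two cores each.
print(select_core([[0.5, 0.0], [0.0, 0.0]], task_load=0.3))
print(select_core([[0.9, 0.9], [0.0, 0.0]], task_load=0.3))
```

In the first call the second package stays asleep; in the second call it is woken only because no core on the active package has room for the task.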
We do not wish to break our design principle of orthogonality, and have our
energy efficient scheduler directly alter the C- or P-states of the different
cores. This would certainly complicate the coexistence with existing policies
managing cpu-states. Instead, we opt for a design where the scheduler is
decoupled from the hardware, and attempts to save power while working
with abstract proxies of the physical cpu-cores. A Core Selector implements a
scheduling policy while interacting with a ProxyManager. The ProxyManager
holds the state of the physical cpu-cores, such as their current utilization,
whether they are active or sleeping, and so on. Together, these components
constitute the psm. This psm is hooked directly into the existing scheduling


































In this chapter, we describe the implementation of several pm policies along
the three orthogonal axes of C-states, P-states, and energy efficient scheduling.
First, we describe some of the mechanisms that are used in rope to control
the hardware. We then explain an online machine learning algorithm used in
one of our P-state predictors. Lastly, we describe the implementation of each
of the pm policies available in rope, and provide insights into the costs of
operating these.
4.1 ACPI
acpi is a method of describing hardware interfaces in terms abstract enough
to allow flexible hardware implementations, while being concrete enough to
allow standardized software to interact with such hardware interfaces [22].
The acpi specification was created to establish common interfaces that enable
robust os-directed motherboard device configuration, and power manage-
ment of devices and systems. acpi is a fundamental component for Operating
System-directed Power Management (ospm).
acpi evolved from the need to combine and refine existing power manage-
ment bios code, advanced power management (apm) apis, and description
tables, etc., into a robust and well-defined specification for device configura-
tion and power management. Using an acpi-specified interface, an os can
manage any acpi-compliant hardware in a simple and efficient manner.
At the heart of the acpi specification are the acpi System Description Tables.
These describe a specific platform’s hardware, and the role of acpi firmware
is primarily to supply these acpi tables (rather than a native instruction api)
to pm software [22]. The System Description Tables describe the interfaces
to the hardware, and some of these tables include “definition blocks” that
contain acpi Machine Language (aml): an interpreted, abstract machine
language. Using this language, hardware vendors can describe their
hardware, certain that any acpi-capable os can interpret it correctly. The
acpi System Firmware, in addition to supplying description tables, also imple-
ments interfaces that can be used by the os. For example, if the description
tables contain an acpi-object for restarting a device, this object (which the
os can interact with) will be backed by firmware that controls the actual
hardware.
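To make the notion of a System Description Table more concrete, the following is a minimal sketch of the common table header and of checksum validation. The field layout and the rule that all bytes of a table sum to zero modulo 256 follow the acpi specification as we understand it; the names `acpi_sdt_header` and `acpi_table_checksum_ok` are illustrative assumptions and are not taken from Vortex or acpica.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Common header that starts every ACPI System Description Table.
 * Struct and field names are illustrative, not ACPICA's. */
struct acpi_sdt_header {
    char     signature[4];      /* table identifier, e.g. "DSDT" or "SSDT" */
    uint32_t length;            /* length of the whole table, header included */
    uint8_t  revision;
    uint8_t  checksum;          /* chosen so that all table bytes sum to 0 mod 256 */
    char     oem_id[6];
    char     oem_table_id[8];
    uint32_t oem_revision;
    uint32_t creator_id;
    uint32_t creator_revision;
};

/* The fields are naturally aligned, so no packing attribute is needed. */
_Static_assert(sizeof(struct acpi_sdt_header) == 36, "SDT header is 36 bytes");

/* Returns 1 if the bytes of the table sum to zero modulo 256, else 0. */
int acpi_table_checksum_ok(const void *table, size_t length)
{
    const uint8_t *bytes = table;
    uint8_t sum = 0;

    assert(table != NULL);
    for (size_t i = 0; i < length; i++)
        sum = (uint8_t)(sum + bytes[i]);
    return sum == 0;
}
```

An os would typically run such a check on each table before handing its definition blocks to the aml interpreter.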
Figure 4.1 shows the architecture of acpi. acpi itself (shown in red) is inde-
pendent of both the underlying hardware and the os. Any os that wishes
to support acpi must provide drivers, and an aml parser and interpreter. If
these components are in place, any interfaces and information presented by
the platform can be accessed and used. Similarly, the hardware is not a part
of acpi per se. However, if the hardware wishes to expose functionality to
the os, it can do so through acpi-objects accessible through the acpi system
description tables.
4.1.1 Supporting ACPI in Vortex
The acpi standard consists of multiple components that allow the setup and
power management of various devices. These components include acpi namespace
management, acpi table and device support, acpi event handling capabilities,
and aml parser/interpreter functionality. The acpi Component
Architecture (acpica), developed by Intel Corporation, defines and imple-
ments these software components in an open-source package. One of the
major goals of this architecture is to isolate the operating system from the
code necessary to support acpi. Most of the acpica code base is indepen-
dent of the host os in which it exists, using only a thin translation layer to
access necessary os functionality. To port acpica to a new operating system,
only this os specific layer must be implemented. This is necessary because
every operating system is different, and the way an os provides necessary
services such as synchronization, input/output, or memory allocation, can
differ enormously [24].
Figure 4.1: Architecture of acpi.
Communication between the host operating system and the os-independent acpica
components is performed through the thin service layer. This is illustrated in figure 4.2. In
our case, the OS Service Layer (osl) is implemented with approximately
1000 lines of Vortex-specific C code.
The osl consists of two interfaces. These interfaces are used to support
access to the host os functionality as explained above, but also allow the host
os access to the functionality of the acpica subsystem, for example setup
and power management of devices. A sketch of this can be seen in figure
4.3.
Figure 4.2: The figure illustrates how the acpica subsystem resides within and
interacts with the host operating system.
Figure 4.3: The figure illustrates how the operating system service layer of the
acpica subsystem is structured into two interfaces, together facilitating
the necessary bindings between the subsystem and the host os.
4.1.2 CPU Performance and Power Management using ACPI
acpi provides three mechanisms for controlling the performance states of
processors. These are:
• Processor power states, named C0, C1, C2, ..., C𝑛.
• Processor clock throttling.
• Processor performance states, named P0, P1, P2, ..., P𝑛.
If supported by the platform, the os can use these mechanisms to save power
while attempting to meet performance constraints. Central to managing both
performance and power are the processor power states. These states, denoted
C0 through C𝑛, are used to conserve power by halting execution and shutting
down internal processor components. The C0 state is the only active state
(where execution occurs), and all other power states (from C1 up to C𝑛) are
sleep states. While in a sleep state, the processor is unable to execute
instructions. Higher-numbered sleep states conserve more energy. Each sleep
state also has an associated entry and exit latency, which typically increases
with energy savings. All processors that meet acpi standards must support
the C1 state (which corresponds to the hlt instruction), whereas all other
processor power states are optional. Further, the C1 (and C2 if supported)
states rely on the platform maintaining state and keeping caches coherent. In
addition, the latency of the C1 state must be kept sufficiently low that “ospm
does not consider the latency aspect of the state when deciding whether to use
it” [22]. Aside from putting the processor in a lower power state, entering C1
should have no software-visible consequences. The same goes for the optional
C2 state, differing only from the C1 state by its higher latency and greater
energy savings. The even deeper sleep states—C3 and beyond—rely on the
os to maintain cache coherency, for example by flushing and invalidating
the caches before entering the C3 state. When in the C0 power state, the use
of throttling and processor performance states enables fine-grained control of
performance and energy expenditure.
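The trade-off between energy savings and wake-up latency can be illustrated with a small sketch that picks the deepest sleep state whose exit latency fits a given bound. This is an assumption-laden illustration rather than part of rope: the type `cstate_info`, the function `deepest_allowed_cstate`, and any latency values passed to it are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One sleep state as seen by the policy (hypothetical type). */
struct cstate_info {
    unsigned index;            /* n in Cn, with n >= 1 */
    uint32_t exit_latency_us;  /* wake-up latency reported by the platform */
};

/* Returns the index of the deepest state whose exit latency is within the
 * bound. The table is ordered from shallow to deep, and its first entry
 * (C1, which ACPI makes mandatory) serves as the fallback. */
unsigned deepest_allowed_cstate(const struct cstate_info *states, size_t n,
                                uint32_t max_latency_us)
{
    assert(states != NULL && n > 0);
    unsigned chosen = states[0].index;

    for (size_t i = 1; i < n; i++)
        if (states[i].exit_latency_us <= max_latency_us)
            chosen = states[i].index;
    return chosen;
}
```

With a table listing C1, C2, and C3 at 1, 20, and 100 microseconds (made-up figures apart from C1), a bound of 50 microseconds would select C2.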
The relationship between processor power and performance states, as well
as clock throttling, is illustrated in figure 4.4. While in the C0 state, cpu
P-states and clock throttling can be used to adjust the performance level. From
C0, C1–C𝑛 can be entered using entry methods defined in acpi objects. Note
that the C1 state is available through the hlt instruction. When in a sleep
state, C0 can again be entered if an interrupt occurs or, for higher-numbered
C-states, when a bus-master access occurs.
Processor performance states are—as their name implies—used to adjust the
performance of a processing unit. As they all allow execution of instructions,
they reside within the C0 power state. The performance states are named P0
through P𝑛, where P0 is the state offering the highest level of performance.
As with C-states, a variable latency is associated with entering and exiting a
P-state. P-states offer ospm the opportunity to alter the operating frequency
for the processor. For instance, the Intel Xeon x5355 processor offers three
P-states (P0, P1, and P2), at 2.66, 2.33, and 1.99 GHz respectively.
Figure 4.4: The figure illustrates how the processor power and performance
states, as well as throttling capabilities, are organized. This figure is
loosely based on figure 8-39 of the acpi specification version 5.0.
To facilitate the above mentioned functionality, a myriad of different acpi
objects must be presented by the platform and used by the software imple-
menting ospm. Appendix A gives a brief introduction to some of the most
essential of these objects.
4.2 Intel Specific Advanced Power Management
acpi provides a general framework and specification for power management
and device configuration. Alas, this generality comes at the cost of reduced
flexibility and lack of the precision necessary to efficiently manage the avail-
able pm features of modern cpus. On Intel platforms, these shortcomings
are remedied by the introduction of vendor-specific instructions. The use of
such instructions has enabled considerable power savings in mature oss.
One recent example of this is the shift away from the traditionally acpi-based
cpu governors, towards vendor-specific cpu drivers within the Linux kernel
[7]. Similarly, AMD-specific patches have been added to Linux cpu governors
to increase power savings¹. In the following sections, a brief introduction
will be given to the Intel-specific instructions used in the implementation of
rope.
1. http://comments.gmane.org/gmane.linux.kernel.cpufreq/9938
4.2.1 The CPUID Instruction
The cpu Identification (cpuid) instruction returns processor identification
information. It is important w.r.t. pm as it provides information about what pm
features are available for a given processor. Sometimes, information retrieved
via the cpuid instruction can be obtained through acpi tables as well, but
this is not always the case. Further, information obtained via cpuid and
acpi might be similarly named, but their values might differ². Examples of
pm-relevant information available through cpuid are listed below:
• Processor-specific information regarding the monitor/mwait instruction
pair, e.g. the number of available C-states and the size of monitor
lines.
• Information about architectural performance monitoring, for example
the number of available general purpose performance monitoring registers,
and availability of different performance monitoring events³.
• Information regarding the platform topology, for example which logical
processors share the same cache.
Because it conveys a lot of information, the cpuid instruction is valuable
when attempting to save energy on platforms employing Intel cpus. In the
following section, the monitor/mwait instruction pair will be discussed in
more detail.
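As an illustration of the first item in the list above, the sketch below decodes the register values returned by cpuid leaf 05H (the monitor/mwait leaf). The bit layout follows the Intel manuals as we understand them. The struct `cpuid_regs` and the helper names are assumptions made for this example; on GCC or Clang the register values could be obtained with `__get_cpuid(5, &eax, &ebx, &ecx, &edx)` from `<cpuid.h>`.

```c
#include <assert.h>
#include <stdint.h>

/* Register values returned by CPUID leaf 05H (hypothetical container). */
struct cpuid_regs {
    uint32_t eax, ebx, ecx, edx;
};

/* EAX[15:0]: smallest monitor-line size in bytes. */
uint32_t monitor_line_min(const struct cpuid_regs *r)
{
    return r->eax & 0xFFFFu;
}

/* EBX[15:0]: largest monitor-line size in bytes. */
uint32_t monitor_line_max(const struct cpuid_regs *r)
{
    return r->ebx & 0xFFFFu;
}

/* ECX bit 0: enumeration of MONITOR/MWAIT extensions is supported. */
int mwait_extensions_supported(const struct cpuid_regs *r)
{
    return (r->ecx & 1u) != 0;
}

/* EDX[4n+3:4n]: number of Cn sub-states supported through MWAIT. */
unsigned mwait_substates(const struct cpuid_regs *r, unsigned cstate)
{
    assert(cstate < 8);
    return (r->edx >> (4 * cstate)) & 0xFu;
}
```

An EDX value of 0x2220 would, for example, describe two sub-states each for C1, C2, and C3, that is, six states in total.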
4.2.2 The MONITOR/MWAIT Instruction Pair
mwait relies on the use of hints to allow the processor to enter
implementation-specific optimized states. There are two usages for the mwait
instruction, namely address-range monitoring and advanced power management.
Both of these rely on the use of the monitor instruction [25].
The monitor instruction arms monitoring of a given address range, and a
subsequent mwait allows the cpu to wait until a store to that range (or another
wake-up event, such as an interrupt) occurs. If the necessary extensions are
provided by the platform, the cpu can enter different C-states by passing hints
to mwait. The availability of mwait extensions for pm can be verified through
cpuid [26]. In addition to being used for pm, the monitor/mwait pair can also
be used for other purposes, for instance spinlocks⁴. In fact, thread
synchronization was mwait’s original purpose [6].
2. On our platform, the acpi tables list only the C1 and C3 power states, while cpuid
provides information about six states (C1 through C3, each with two sub-states).
3. The number of llc misses used for regulating cpu P-states is one example of such events.
According to the acpi specification, the C1 power state must always be avail-
able. This power state can be entered either by executing the hlt instruction
or, where available, by monitor/mwait with the correct hint. However,
using the hlt instruction in Vortex is problematic. As will be explained in
section 4.4.2, interrupts must be enabled in the idle function. Because of this,
an interrupt may alter the idleflag at any given time, for example, directly
following it being checked to verify whether the hlt instruction should be
issued. If this happens, the result will be that the core receives work (i.e. it is
interrupted), but instead of breaking out of the idle function, the cpu-core
is halted. This is disastrous as no new interrupt will arrive⁵. Rather, we use
monitor/mwait with hints to enter all C-states, including C1.
Code listing 4.1 shows the function implementing entry of C-states using the
monitor/mwait instruction pair. While looping on the boolean value of
the cpu-core’s idleflag, a monitor is armed on the address of this flag. Next,
mwait is performed on the address. The second hint argument determines
which C-state is entered.
The loop is necessary because of a phenomenon called false wakeups. moni-
tor/mwait works by monitoring a given memory location. However, it may
also trigger whenever a cache line containing this address is evicted, even
when the monitored address remains unchanged. To avoid such false wake-
ups, it is common to add padding around variables that are to be monitored.
Recall that the size of the monitor lines (and thus the size of the pads) can
be retrieved using the cpuid instruction.
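The padding technique described above can be sketched as follows. The monitor-line size of 64 bytes is an assumption for the sake of the example (a real system would query it via cpuid), and the names `monitored_flag` and `same_monitor_line` are hypothetical.

```c
#include <assert.h>
#include <stdalign.h>
#include <stddef.h>
#include <stdint.h>

#define MONITOR_LINE_SIZE 64  /* assumed; query the real size via CPUID */

/* A flag that occupies a whole monitor line on its own, so that writes to
 * neighbouring variables cannot trigger the monitor. */
struct monitored_flag {
    alignas(MONITOR_LINE_SIZE) volatile uint32_t idleflag;
    uint8_t pad[MONITOR_LINE_SIZE - sizeof(uint32_t)];
};

_Static_assert(sizeof(struct monitored_flag) == MONITOR_LINE_SIZE,
               "flag plus padding must fill exactly one monitor line");

/* Returns 1 if both addresses fall within the same monitor line. */
int same_monitor_line(const volatile void *a, const volatile void *b)
{
    assert(a != NULL && b != NULL);
    return ((uintptr_t)a / MONITOR_LINE_SIZE) ==
           ((uintptr_t)b / MONITOR_LINE_SIZE);
}
```

Padding removes wakeups caused by writes to unrelated data in the same line. It does not remove every false wakeup, which is why the idle flag is still re-checked in a loop.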
inline void mwait_idle_with_hints(uint32 extensions, uint32 hints)
{
    while (IDLEFLAG == TRUE) {
        if (IDLEFLAG == TRUE) {
            // Arm the monitor on the address of this core's idle flag
            asm_monitor((vaddr_t)&cpulocal[CPU_ID].cl_idleflag, NEXTENSION, NHINT);
            // [...] mwait on the monitored address, using the given hints
        }
    }
}
Listing 4.1: Usage of MONITOR/MWAIT instruction pair.
4. https://blogs.oracle.com/dave/resource/mwait-blog-final.txt
5. It is possible to work around this. One solution is to inspect the previous instruction
pointer (on the stack) when entering the interrupt handler. If the interrupted instruction
lies within the code region performing the testing of the idleflag and execution of the
hlt, the interrupt handler could advance the instruction pointer such that the hlt
instruction is bypassed when returning from the interrupt handler.
4.3 The Share Algorithm
In this section we introduce the machine learning approach that we use to
select the best among a population of C-states. The algorithm belongs to the
multiplicative weight algorithmic family, which has proven itself to provide
good performance for several different on-line problems [35], [56], [57].
Algorithms in this family are all based on the same basic premise: a master
algorithm relies on a set of subordinate algorithms, experts, for input. Every
trial, each of these experts makes a prediction, and the master’s goal is to
combine these predictions in the best way possible, minimizing some cost or error
over a sequence of observations. The master associates a weight with each of
the experts, and these weights represent the quality of each of the experts’
predictions. Using these weights, the master can select the best performing
expert, or produce a weighted average of the set of predictions.
After a trial has been completed, the weights of the experts are updated ac-
cording to the quality of their individual predictions. If the expert’s prediction
was incorrect, the weight of the expert is reduced. If the prediction was cor-
rect, the weight is not changed. Normally, this method causes the predictions
of the master algorithm to converge towards those of the best performing ex-
pert. However, this can be problematic. Across a series of sequences, the error
of an expert can be arbitrarily large, and because of this, its weight arbitrarily
small. This results in the master algorithm being unable to respond quickly
to changes.
The Share Algorithm, as defined by Herbster and Warmuth [42], derives its
name from a solution to the above mentioned problem. It allows small weights
to be recovered quickly, and does so through an additional update step called
the Share Update. In this step, each expert shares a fraction of its weight with
the other experts. This fraction can either be fixed, or variable. In the Variable
Share algorithm, the fraction of weight that is shared is proportional to the
error of the expert in the current trial. If an expert makes no error, it shares
no weight.
4.3.1 Definition
Let 𝑥1 to 𝑥𝑛 denote the predictions of 𝑛 experts. The weights of the experts,
denoted 𝜔1 to 𝜔𝑛, are initially set to 1/𝑛. For each trial, we use the notation
𝐿𝑜𝑠𝑠(𝑥𝑖) to refer to the loss, or error, of expert 𝑖. The share algorithm takes
two additional arguments: the learning rate, 𝜂 > 1, which controls the rate
at which the weights of poorly performing experts are reduced, and the share
parameter, 0 < 𝛼 < 1, which controls at which rate a poorly performing
expert is able to recover its weight when it starts performing well.
We use the Share Algorithm to select the best out of a set of experts, saving
energy by powering down cpu-cores. We borrow notation from [54], but our
definition contains a loss function tailored to our use-case. For each trial, the
algorithm completes the following steps:
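1. Select the expert \(x_i\) with the highest weight \(\omega_i\).
2. Compute the loss of each expert \(x_i\) at time \(t\)
\[ \mathit{Loss}(x_i, t) = c_{\mathrm{wasting}} E_{\mathrm{wasted}} + c_{\mathrm{latency}} t_{\mathrm{unpredicted}}, \tag{4.1} \]
where \(E_{\mathrm{wasted}}\) is the total amount of energy wasted by using only expert
\(x_i\) compared to the optimal combination of all available experts, and
\(t_{\mathrm{unpredicted}}\) is the total unpredicted power-up time for the core under the
policy of expert \(x_i\). The weighting parameters, \(c_{\mathrm{wasting}}\) and \(c_{\mathrm{latency}}\), are
used to quantify the preference for avoiding latency and for being in
lower power states.
3. Reduce the weights of experts that are performing poorly
\[ \omega'_i = \omega_i e^{-\eta \mathit{Loss}(x_i)}. \tag{4.2} \]
4. Share some of the remaining weight among all experts
\[ \mathit{pool} = \sum_{i=1}^{n} \omega'_i \left( 1 - (1 - \alpha)^{\mathit{Loss}(x_i)} \right), \tag{4.3} \]
\[ \omega^{\mathrm{final}}_i = (1 - \alpha)^{\mathit{Loss}(x_i)} \omega'_i + \frac{1}{n} \mathit{pool}. \]
The new weights, \(\omega^{\mathrm{final}}_i\), are used in the next trial. The above-stated algorithm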
runs with time complexity of 𝑂(𝑛), where 𝑛 is the number of experts. Also,
the experts could be executed in parallel [57]. Essentially, the bulk of the
cost of running the Share Algorithm can be attributed to the subordinate
expert algorithms; as long as these are efficient, the Share Algorithm will be
efficient.
4.4 Per-core Power State Management
This section covers the implementation of pm policies and Policy State Man-
agers (psms) governing the use of cpu power states (C-states). We describe
how these policies and their supporting structures are incorporated into the
Vortex kernel, and present implementation specific costs.
4.4.1 Aggressive Entry of C-states
Typically, whenever an os is out of work to perform, some kind of idle state
is entered. In Microsoft Windows, this functionality is implemented through
the System Idle Process [78, Ch. 5]. Linux offers similar functionality through
what is called idle, or swapper tasks. These are simply tasks with the lowest
possible priority, which means that they will only be scheduled if no other
tasks are runnable.
In Vortex, a cpu-core receives work through message queues. When no new
messages are present, the core is out of work and goes idle by invoking the
idle() function. As soon as a new message is presented to the core, it will
leave the idle function and start executing the received task.
The aggressive entry of a C-state is a very simple pm policy. In fact, which
C-state to enter is simply read from cpu-core local data structures, and this
variable forms the psm in its entirety. The only overhead from this policy
is that of reading the C-state, and transitioning in and out of it. The transi-
tion into the selected C-state is accomplished through the use of the moni-
tor/mwait instruction pair (see section 4.2.2). Listing 4.2 shows an excerpt
of the idle function modified to aggressively enter a chosen C-state.
void idle(void)
{
    // [...]

    // Enable interrupts
    _INTERRUPT_ENABLE_UNCONDITIONAL();

    // Loop till something happens
    while (IDLEFLAG == TRUE) {

        // Ensure power management components activated
        if (cpulocal[CPU_ID].cl_pm_active) {

            // [...]
        }
    }

    // Disable interrupts
    _INTERRUPT_DISABLE_UNCONDITIONAL();

    // Dispatch CPUMUX thread
    arch_thread_dispatch(&cpumux_tcbref[CPU_ID]);

    while (1) ;
}
Listing 4.2: Idle function accommodating aggressive entry of C-states.
4.4.2 Static Timeout Based C-state Entry
In previous work, increasing timeout values for dynamic timeout-based pm
has been used to reduce the latency experienced by the user [54], [39]. It is
easy to create a scenario where this approach will be correct: whenever the
idle function is entered, we busy-wait for the duration of the timeout before
entering a deeper C-state. However, this wastes energy. According to the acpi
specification [22], the C1 power state should be cheap enough to enter and
leave that the os can ignore any costs associated with using it. Our platform
indicates that the cost is one microsecond.
By using the C1 power state, we can alleviate the energy issue of busy-waiting.
Instead of busy-waiting, we can start by entering C1 (which should incur little
cost), and then enter the deeper C-state when the timeout expires. This way,
we provide low latency close to that of busy-waiting while saving more energy.
We can implement this using timers. As soon as the idle function is entered,
a timer for the configured C-state is started and the cpu-core halted. If the
core is awoken due to the arrival of a message, the C-state timer is aborted.
If this does not happen and the C-state timer fires, the corresponding C-state
is entered.
Although conceptually simple, the problem is more complex when studied
closely. Within the idle function, interrupts must be turned on. If interrupts
were turned off, incoming work would be unable to trigger the reactivation of
the core. As mentioned above, as soon as the idle function is invoked, a timer
object will be created and activated. Following this, the cpu-core is halted
with the monitor/mwait instruction pair. At this time, there are only two
possible ways for the core to leave the halted state.
1. Work arrives, and the core is awoken to process the task.
2. No work arrives before the timer fires, and the core enters a sleep state.
In the first case, the timer must be canceled before the core can go on to
performing calculations, otherwise it will trigger an interrupt when it fires.
To do this, interrupts must be disabled. An interesting situation occurs if the
timer fires while the core is in the process of canceling it (interrupts still
being disabled). In this case, the timer object is correctly removed, but since
the x86_64 platform lacks support for canceling requested interrupts, the
interrupt from the timer is queued. When interrupts are finally turned on,
this timer interrupt must be handled. In Vortex, such spurious interrupts are
simply ignored as no pending timeout (timer object) can be associated with
it. As long as these spurious interrupts do not happen very often, the cost in
terms of lost cpu-cycles can be ignored. However, when the timeout values
become very short, this situation will occur increasingly often. This is
somewhat similar to, although not as dramatic as, the livelock situation described in
[Plot: Cost of handling a spurious timer interrupt (99th percentiles); x-axis: number of cycles, 42500 to 45500.]
Figure 4.5: The distribution of cycles necessary to handle a spurious timer interrupt.
Note that only values within the 99th percentile are included. This is
done for the sake of visualization.
The actual cost of handling a spurious interrupt in Vortex is shown in figure 4.5.
Note that to avoid overplotting, only measurements within the 99th percentile
are included. Further, these figures are based on 10 000 measurements.
In addition to decreasing the opportunity of conserving power, the overhead
of the pm algorithm can affect the user-perceived performance of the system.
From the perspective of the user, additional overhead from entering the idle
state might go unnoticed, since the system is not actively used. However, any
overhead when leaving the idle loop is critical, as this will add to the user
perceived delay. As can be seen in the plots of figure 4.6, the added latency
of our solution is low. On average, it adds only 10³ to 10⁴ cycles, or
about 0.5 μs–5 μs if the cpu is run at 2 GHz.
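The conversion from cycle counts to time used above is simple arithmetic: a clock of one MHz completes one cycle per microsecond, so dividing a cycle count by the frequency in MHz yields microseconds. The helper below is a small illustrative sketch; the name `cycles_to_usec` is an assumption.

```c
#include <assert.h>
#include <stdint.h>

/* Converts a cycle count to microseconds for a core clocked at freq_mhz.
 * One MHz corresponds to one cycle per microsecond. */
double cycles_to_usec(uint64_t cycles, uint32_t freq_mhz)
{
    assert(freq_mhz > 0);
    return (double)cycles / (double)freq_mhz;
}
```

At 2 GHz (2000 MHz), a thousand cycles correspond to 0.5 microseconds and ten thousand cycles to 5 microseconds. By the same arithmetic, a spurious timer interrupt costing around 44 000 cycles (the range shown in figure 4.5) takes about 22 microseconds.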
Code listing 4.3 shows an excerpt from the idle function modified to support
the static timeout based policy. Error handling and initialization code have
been left out for the sake of brevity.
4.4.3 Select Best Performing C-state
We use the Share Algorithm, as explained in section 4.3, to select the best
performing C-state at runtime. Although the algorithm is computationally
efficient and simple to implement, running it in an os kernel complicates matters.
For example, floating point arithmetic can only be performed in the context
of a thread⁶. Further, since the code is run frequently, and in a
time-sensitive environment, we have applied optimizations to reduce both the memory
footprint and cpu-consumption. The implementation of the pm policy and
psm for the selection of the best performing C-state is illustrated in figure
4.7. A thread updates the experts’ weights and selects the one performing
the best for use. This information is then stored in a per-core data structure,
and read from the idle function whenever an idle period occurs. Within the
idle function, C-states are entered and a residency log is maintained. This
log is periodically communicated to the thread that selects the best expert.
Together, these components form a continuous feedback loop.
Measuring CPU-idleness
As mentioned in section 4.4.1, whenever a cpu-core is out of runnable tasks,
it enters the idle function. As a consequence, this function is ideal for in-
strumenting and logging the activity of cpu-cores. By logging timestamps
indicating when cpu power states have been entered or left, a high-fidelity
trace of the cpu activity can be generated. This approach prevents one critical
issue that can occur if for instance statistical sampling is used, and in addition,
avoids the extra overhead of having to run a thread or process performing
this sampling.
6. Why this is, is covered later in this section.
void idle(void)
{
    // [...]

    // Enable interrupts
    _INTERRUPT_ENABLE_UNCONDITIONAL();

    // Loop till something happens
    while (IDLEFLAG == TRUE) {

        // Ensure power management components activated
        if (cpulocal[CPU_ID].cl_pm_active) {

            // Post a C3 ready timer with a static timeout
            vxerr = timer_post(&cpulocal[CPU_ID].cl_C3_timerref,
                               cpulocal[CPU_ID].cl_C3_timeout,
                               set_C3_ready_flag, "", NULL);

            // Now enter C1, awaiting work arriving or timer to fire
            // [...]

            // When returning from the halt, this can either be because of...
            // 1) The C3 timer having fired -> enter C3.
            // 2) Because some other thread is ready for work -> abort the timer
            if (cpulocal[CPU_ID].cl_C3_ready && (IDLEFLAG == TRUE)) {
                cpulocal[CPU_ID].cl_C3_ready = (bool_t)FALSE;
                c3_entered = arch_timestamp_usec();

                mwait_idle_with_hints(NEXTENSIONS, C3_HINT);
            } else {

                // Abort the C3 timer
                // If the timer has fired in between our check and now, set the C3 ready flag to false
                if (!VxO_REFLOCK(&timer_lock, &cpulocal[CPU_ID].cl_C3_timerref))
                    cpulocal[CPU_ID].cl_C3_ready = (bool_t)FALSE;
                else {
                    // [...]
                }
            }
        }
    }

    syslog(LOG_DEBUG, "Idle on core %d resuming cpumux", CPU_ID);

    // Dispatch CPUMUX thread
    arch_thread_dispatch(&cpumux_tcbref[CPU_ID]);

    while (1) ;
}
Listing 4.3: Idle function with code for entering C-states after static timeout.
[Five plots; x-axis of each: number of cycles. Surviving panel title: Cost of canceling timer (99th percentiles).]
Figure 4.6: The above plots display the added overheads within the idle function
originating from the static timeout based policies. For the sake of
visualization, only values within the 99th percentile are included. All
graphs are based on 100 000 measurements.
[Diagram. Function idle(), per-cycle power state monitoring: performs power state
sampling on a per-core basis, performs power state transitions according to
timeouts and power state advice, and periodically posts run-length encodings to
the update thread through a request queue. Thread update_expert(): dequeues the
run-length encoded trace of power states, updates the Share Algorithm expert
weights (floating point arithmetic), and selects the best expert. The chosen
expert is kept in core local information.]
Figure 4.7: The implementation of the policy selecting the best performing C-state.
One of the problems with sampling is being sure that samples are collected
frequently enough. If a work item issued to a processor is sufficiently small, the
core can finish executing the task within the time-window of a single sample.
This situation is illustrated in figure 4.8. In essence, this means that the trace
generated from the sampling process will be incomplete, and no record of
the execution of the task exists. This, in turn, might lead to significant errors
when evaluating the efficiency of different power saving algorithms working
on such traces, in effect rendering the experts’ weights meaningless.
Instead, we log timestamps of entry and exit of C-states in the idle function
itself. This way, we are able to maintain a high fidelity log of time spent by
the core in all C-states.




Figure 4.8: The figure illustrates the problems that can occur if sampling is used to
measure cpu activity.
Run Length Encoding
To reduce the memory overhead of the above-mentioned log keeping, we
employ a technique called run-length encoding (rle). This is a lossless
compression algorithm that works well for streams of data where sequences
of the same value (so-called runs) occur. A simple example is the string
"1111111222222333333". Using rle, this string would be represented as "716263",
indicating that the stream consists of seven ones followed by six twos and
then six threes. When we generate our traces, the occupied C-state is used
as the value, and the time spent in the power state is the sequence length.
Updating the C-state trace of a cpu-core thus amounts to the constant time
operation of appending these two numbers to a buffer. Whenever the buffer is
full, it is copied and sent to the C-state selection/update thread using message
passing. We employ cpu-state traces with microsecond granularity. When
compared to storing a non-compressed stream of integers at microsecond
fidelity, the use of rle results in a compression rate of over 99% for real
traces corresponding to low, normal, and heavy load. The cost of
sending rle C-state traces is shown in figure 4.9, and increases linearly with
trace length.
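The encoding itself can be sketched in a few lines. The example below compresses an uncompressed stream with one sample per time unit into (value, length) runs. It only illustrates the technique: the type `rle_run` and the function `rle_encode` are hypothetical names, and the idle function described above does not need this loop, since it appends one run directly per power state transition.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One run: a value (the occupied C-state) and how long it lasted. */
struct rle_run {
    uint8_t  value;
    uint32_t length;
};

/* Encodes in[0..n) into runs and returns the number of runs written.
 * The caller must provide room for enough runs (at most n are needed). */
size_t rle_encode(const uint8_t *in, size_t n,
                  struct rle_run *out, size_t out_cap)
{
    assert(in != NULL && out != NULL);
    size_t used = 0;

    for (size_t i = 0; i < n; i++) {
        if (used > 0 && out[used - 1].value == in[i]) {
            out[used - 1].length++;      /* extend the current run */
        } else {
            assert(used < out_cap);      /* sketch: no flushing on overflow */
            out[used].value = in[i];     /* start a new run */
            out[used].length = 1;
            used++;
        }
    }
    return used;
}
```

Applied to the 19 samples of the string example above, the function produces three runs: (1, 7), (2, 6), and (3, 6).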
Selecting the Best C-state
The C-state expert update thread is the component that calculates which C-
state to use at a cpu-core at any given time. There is only one such thread,
and it manages the weights of all experts for all available cpu-cores in the
system. It receives the run-length encoded power state traces for the different
cores asynchronously via message passing. When this happens, it will trigger
the recalculation of the weights and best performing C-state for the core that
sent the trace.
[Plot: Cost of Sending RLE C-state Trace; axis ticks from 10 to 100.]
Figure 4.9: Average cost of sending rle C-state traces of varying lengths. The
error bars show the standard deviation. All numbers are calculated from
10 000 measurements.
The reason for handling weight updates in a separate thread is twofold. First,
the complexity of the idle function is kept low, and it is only loosely coupled to
the expert selection algorithm. Second, and more important: floating-point
arithmetic can only be performed in the context of a thread. Vortex uses
lazy saving of floating point registers, instead of saving them on each context
switch⁷. This means that the sse unit is set up to throw an exception on the
first sse instruction that is executed after each context switch. The exception
can only be handled if a thread context is available, because the exception
handler will save and restore floating point registers using this context.
The calculation of the best performing C-state and the updating of experts’
weights are performed according to the formal description of the Share Algorithm
given in section 4.3. In our implementation, each of the experts corresponds
to using a single C-state only. Selecting the best performing expert
thus implies selecting the best performing C-state. As described in section
4.3, the behavior of the Share Algorithm is tunable. Four different parameters
provide flexibility with regards to how aggressively power is saved. For
instance, the learning rate and share parameter make it possible to adjust
how sensitive the algorithm should be to changes in the workload, while the
weighting parameters for wasted energy and added latency allow for adjusting
which property is more important. This flexibility is desirable in several
situations. For instance, in the face of a power failure, it could be possible to
forsake latency service level objectives (slos) while maintaining availability
by tuning the weights from a low latency setting over to an aggressive
power saving setting.
7. Lazy saving of registers is an optimization that makes the common case, i.e. that floating
point context is unnecessary, fast.
To facilitate the calculation of weights, support for mathematical functions
such as pow() and exp() had to be added to the kernel. Typically, the exponen-
tial functions provided by standard math libraries are highly accurate. This
accuracy, however, often comes at the cost of performance. In machine learn-
ing tasks, such as weight calculation in the Share Algorithm, approximations
to the exponential function are often sufficient [80]. This allows much time
to be saved. Because of this, we have implemented the approximation created
by Schraudolph [80], and later modified by Cawley [17], and use it for weight
calculation in our implementation of the Share Algorithm. The approximation
works by manipulating the components within the ieee-754 floating-point
representation. This approach has the advantage of avoiding the need for
a memory-consuming lookup table, as is commonly used in approximations
to the exponential function. It is also significantly faster than lookup based
implementations [80]. Our tests indicate a 12%–16% performance increase
w.r.t. execution speed, when compared to the C standard math library.
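As an illustration of the bit-manipulation idea, the following is a minimal
sketch of a Schraudolph-style approximation in portable C. It is not the code
used in Vortex: the function name fast_exp and the use of memcpy are
assumptions made for this example, the constants are those published in [80],
and the modifications by Cawley [17] are not reproduced. The result is accurate
to within a few percent, which is the kind of accuracy traded away for speed.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: approximates e^y for roughly -700 < y < 700 by writing
 * a scaled and shifted copy of y into the upper 32 bits of an IEEE-754 double
 * (sign, exponent, and the top 20 mantissa bits). */
static double fast_exp(double y)
{
    const double a = 1048576.0 / 0.6931471805599453; /* 2^20 / ln(2) */
    const double b_minus_c = 1072693248.0 - 60801.0; /* 1023 * 2^20 - c */
    int32_t hi;
    uint64_t bits;
    double result;

    assert(y > -700.0 && y < 700.0);

    hi = (int32_t)(a * y + b_minus_c);   /* upper half of the double */
    bits = (uint64_t)(uint32_t)hi << 32; /* lower half is left as zero */
    memcpy(&result, &bits, sizeof result);
    return result;
}
```

No lookup table is involved: the whole computation is one multiplication, one
addition, and a reinterpretation of the resulting bits.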
The overhead of any pm algorithm is critical, as an overly expensive algorithm
might result in reduced energy savings. We have measured the cost of
executing a weight update with our implementation of the Share Algorithm,
and concluded that the computational overhead (in the order of 5–20
microseconds) is negligible. The length of the rle trace, as well as the number
of experts, are the key factors determining this cost. Figure 4.10 shows the
measured overhead for varying rle trace lengths and different numbers of
experts. The figure shows that the necessary calculation time increases
linearly with trace length. The cost of additional experts is moderate for small
numbers. According to [39], the Share Algorithm performs nearly as well
with 10 experts as with 100, so measurements with more than 10 experts
are deferred. Further, no more than a single expert is necessary per C-state
presented by the platform, so in practice the number of experts is low.
When leaving the idle function, a critical issue w.r.t. latency is that of sending
rle C-state traces to the update thread. How often this happens is a trade-off
between responsiveness and the cost of sending the trace. If the traces
are shorter, a new best C-state can be computed more often. The penalty is
that the frequency of relatively expensive send operations increases as well.
The common case of not sending a trace is inexpensive: a constant time
operation updating the rle C-state trace is all that is necessary. Figure 4.11
shows the dynamic selection of best performing C-state for different cores in
the system. Because the different cores execute different loads, they select
their appropriate (best performing) C-state independently. Code listing 4.4
contains an excerpt of the idle function modified to support the entering of
the best performing C-state, as well as the generation and sending of rle
C-state traces.
Figure 4.10: The computational overhead of recalculating the Share Algorithm
expert weights. Notice that these numbers include the receiving of the
rle trace structures, as well as the cost of freeing all associated buffers.
The graphs show means and standard deviations of 100 000 measurements.
void idle(void)
{
    // ...
    // Enable interrupts
    _INTERRUPT_ENABLE_UNCONDITIONAL();

    // Loop till something happens
    while (IDLEFLAG == TRUE) {

        // Ensure power management components activated
        if (cpulocal[CPU_ID].cl_pm_active) {

            // Update RLE C-state trace
            rle_add_sequence(cpulocal[CPU_ID].cl_core_state_rle, (int64)CSTATE_TYPE_C0,
                (int64)(arch_timestamp_usec() - cpulocal[CPU_ID].cl_previous_state_change));
            if (rle_is_full(cpulocal[CPU_ID].cl_core_state_rle)) {

                rle_copy = copy_rle(cpulocal[CPU_ID].cl_core_state_rle);
                enqueue_rle(CPU_ID, rle_copy);
                // ...
            // Sample timestamp so we can log how long the core was halted
            cpulocal[CPU_ID].cl_previous_state_change = arch_timestamp_usec();

            // Enter C-state according to what performs best
            // ...
            // Update RLE C-state trace
            rle_add_sequence(cpulocal[CPU_ID].cl_core_state_rle, (int64)cstate,
                (int64)(arch_timestamp_usec() - cpulocal[CPU_ID].cl_previous_state_change));
            cpulocal[CPU_ID].cl_previous_state_change = arch_timestamp_usec();
            if (rle_is_full(cpulocal[CPU_ID].cl_core_state_rle)) {

                rle_copy = copy_rle(cpulocal[CPU_ID].cl_core_state_rle);
                enqueue_rle(CPU_ID, rle_copy);
                // ...
    syslog(LOG_DEBUG, "Idle on core %d resuming cpumux", CPU_ID);

    // Dispatch CPUMUX thread
    arch_thread_dispatch(&cpumux_tcbref[CPU_ID]);

    while (1) ;
}
Listing 4.4: Idle loop modified to support entry of best performing C-state.
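The listing relies on a run-length encoded trace with an "add" and an
"is full" operation. The following is a minimal sketch of what such a
structure could look like. It is an assumption made for illustration, not the
Vortex implementation: the type and function names, the fixed capacity of 64
runs, and the choice to merge consecutive runs of the same state are all
hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define RLE_CAPACITY 64 /* assumed maximum number of runs per trace */

typedef struct {
    int64_t state; /* C-state identifier of this run */
    int64_t usec;  /* accumulated residency of this run, in microseconds */
} rle_run_t;

typedef struct {
    rle_run_t runs[RLE_CAPACITY];
    int length; /* number of runs currently stored */
} rle_trace_t;

static void rle_trace_init(rle_trace_t *t)
{
    t->length = 0;
}

static int rle_trace_is_full(const rle_trace_t *t)
{
    return t->length == RLE_CAPACITY;
}

/* Appends a run in constant time. Consecutive runs with the same state are
 * merged. Returns 1 on success and 0 if the trace has no room left. */
static int rle_trace_add(rle_trace_t *t, int64_t state, int64_t usec)
{
    assert(usec >= 0);

    if (t->length > 0 && t->runs[t->length - 1].state == state) {
        t->runs[t->length - 1].usec += usec;
        return 1;
    }
    if (rle_trace_is_full(t))
        return 0;

    t->runs[t->length].state = state;
    t->runs[t->length].usec = usec;
    t->length++;
    return 1;
}
```

With a layout like this, the common case in the idle loop is a single
constant time update, and only a full trace triggers the more expensive copy
and send.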













Figure 4.11: The figure shows how different C-states are found to be performing the
best at runtime. Notice that during the generation of this trace, only C1
and C3 were used.
4.5 Per-core Performance State Management
Performance management of cpus concerns the problem of selecting the
correct cpu frequency for the current workload. In essence, this means re-
ducing the core operating frequency when the load is low, and increasing it
during periods of higher load. Because of the quadratic relationship between
power consumption, voltage, and frequency described in section 2.4.3, pre-
viously proposed solutions for reducing power consumption have involved
running jobs as slowly as possible while still respecting quality of service
(qos) [82], [72]. While this kind of approach has proven itself useful—and
certainly still can be—modern cpus differ significantly in architecture and
design from those only a few years old. On more modern cpus, the power
consumed when in the idle state is significantly lower than on older models.
If the power consumption in the idle state is low enough, net power savings
can be achieved by running the cpu at full speed when there is work to be
done. By finishing the work quickly, it is possible to maximize the time spent idle.
This phenomenon is called “race to halt”, or “race to idle”. In the Linux 3.9
kernel, Intel cpu P-state drivers were introduced to take advantage of exactly
this [7]. Throughout the rest of this section we describe the implementation
of our per-core performance state management algorithms.
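The trade-off behind "race to idle" can be made explicit with a simple energy
model. This is an illustrative sketch under stated assumptions, not a model
taken from the cited works: the workload is assumed to be purely cpu-bound, so
that W cycles of work take W/f seconds at frequency f; the core draws
P_active(f) while busy and P_idle while idle; transition costs are ignored;
and the period T is long enough to complete the work. All symbols are
introduced here for illustration only.

```latex
% Energy consumed over a period T when W cycles are executed at frequency f
E(f) = P_{\mathrm{active}}(f)\,\frac{W}{f}
       + P_{\mathrm{idle}}\left(T - \frac{W}{f}\right)
     = P_{\mathrm{idle}}\,T
       + W\,\frac{P_{\mathrm{active}}(f) - P_{\mathrm{idle}}}{f}

% Racing at f_max beats running at a lower frequency f_low exactly when
\frac{P_{\mathrm{active}}(f_{\max}) - P_{\mathrm{idle}}}{f_{\max}}
  < \frac{P_{\mathrm{active}}(f_{\mathrm{low}}) - P_{\mathrm{idle}}}{f_{\mathrm{low}}}
```

If P_active(f) is written as a frequency-dependent part plus a constant part
P_0, the contribution (P_0 - P_idle)/f decreases with f. The lower the idle
power is relative to P_0, the more the comparison favors the highest frequency.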
4.5.1 Quantifying CPU Utilization
Quantifying the utilization of a cpu-core can be difficult. As pointed out
in previous work, task execution speed is often to some extent limited by
i/o accesses or the speed of the memory bus [51], [50]. While keeping the
cpu frequency high (“racing to idle”) is efficient when the workload is cpu-
intensive, this is not true if the workload is, for instance, bound by the speed
of the memory bus. This situation is sometimes referred to as performance
saturation, which simply means that the system is unable to make use of all
the available clock cycles [51].
In our implementations, we follow the approach taken by Isci et al. [45] to
measure the “memory boundedness” of the currently executing workload.
More specifically, we use pmcs to obtain the number of last-level cache (llc)
misses and the number of 𝜇-ops⁸ retired per cpu-core. The ratio between
these, hereafter referred to as “mem/𝜇-ops”, is used to classify the workload
into different phases. The choice of this metric is based on it being invariant
to the frequency at which the core is running [45]. This property is strictly
necessary, as our performance adjustment algorithm will employ dynamic
voltage scaling (dvs) to increase and decrease the performance level of the
cores, thus altering their frequency in the process.
We use “mem/𝜇-ops” to classify the workload into different phases similarly
to [92] and [45]. While the first of these works calculates dvs settings for
different application regions based on a formulation of performance loss, the
latter translates these measures to “mem/𝜇-ops” rates. We borrow from [45]
for our phase classifications, but make necessary changes to accommodate our
hardware. Specifically, we determine the thresholds for phase classifications
experimentally by running various workloads, both memory and cpu-bound.
The “mem/𝜇-ops” to phase mappings are listed in table 4.1. Phase 1 corre-
sponds to a highly cpu-bound phase of execution where it is desirable to “race
to idle”. Phase 2 matches mixed loads, while phase 3 corresponds to a highly
memory bound phase where performance saturation is occurring, and the
cpu frequency can be reduced without significant loss in performance.
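The mapping in table 4.1 is simple enough to express directly. The sketch
below is illustrative only: the names phase_t, classify_ratio, and
phase_to_pstate are hypothetical, and the function takes the “mem/𝜇-ops” value
as already computed, since the scaling of the ratio is not restated here.

```c
#include <assert.h>

/* Execution phases as listed in table 4.1. */
typedef enum {
    PHASE_CPU_BOUND    = 1, /* race to idle */
    PHASE_MIXED        = 2,
    PHASE_MEMORY_BOUND = 3  /* performance saturation */
} phase_t;

/* Classifies a mem/uops ratio using the thresholds from table 4.1. */
static phase_t classify_ratio(double mem_to_uops_ratio)
{
    assert(mem_to_uops_ratio >= 0.0);

    if (mem_to_uops_ratio < 1.25)
        return PHASE_CPU_BOUND;
    if (mem_to_uops_ratio <= 1.75)
        return PHASE_MIXED;
    return PHASE_MEMORY_BOUND;
}

/* Table 4.1 maps phases 1, 2, 3 to P-states 0, 1, 2. */
static int phase_to_pstate(phase_t phase)
{
    return (int)phase - 1;
}
```

Because the ratio does not depend on the core frequency, the same thresholds
can be applied regardless of the P-state the core is currently in.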
“mem/𝜇-ops”      Phase Number                P-state
< 1.25           1 (race to idle)            0
[1.25, 1.75]     2                           1
> 1.75           3 (highly memory-bound)     2
Table 4.1: Mappings from “mem/𝜇-ops” ratios to execution phases.
8. In computer cpus, micro-operations (𝜇-ops) are detailed low-level instructions used to
implement complex machine instructions [33].
4.5.2 Global Phase History Table Predictor
We implement our first per-core performance management functionality based
on the use of a Global Phase History Table (gpht) predictor. This approach
is similar to that of [45], and based on a branch prediction technique which
has proven to perform very well [94]. The technique, called Two-level Adaptive
Branch Prediction, uses two levels of history to make decisions: the 𝑛 last
branches that have been encountered, and the recent behavior for the specific
patterns of these 𝑛 branches. This is an online statistical machine learning
technique, and it relies on the existence of recurrent patterns in execution.
Although designed for branch prediction in cpus, the concepts used can be
directly translated to the power management domain; instead of capturing
history regarding the recent branches and which of these to take, information
about recent performance states and what frequency to use next can be
stored. One of the key properties of the gpht predictor is that it is completely
agnostic to the workloads being executed. No analysis of the workloads or
training of the algorithm has to be performed. It thus lends itself naturally to
environments where workloads can exhibit high variability and change over
time.
The Global Phase History Table is illustrated in figure 4.12, and consists of the
following components:
• Global Phase History Register (gphr): A global shift register that
  tracks the last 𝑛 observed phases (as determined by their “mem/𝜇-ops”
  ratios). The length of the history is given by the gphr depth. At each
  sampling point, the gphr is updated, and it functions as a sliding window
  containing the 𝑛 last samples. At each sample point, the contents
  of the gphr are extracted and used as a key to index a Pattern History
  Table.
• Pattern History Table (pht): The pht contains a number of previously
  observed patterns (keys from the gphr) and their predicted next
  phase. These patterns and predictions are updated using a least recently
  used (lru) scheme to avoid having the table grow indefinitely. If a key
  from the gphr is missing (it has not been encountered before or has
  been evicted from the table), the last observed phase is stored as the
  prediction. At the end of each sampling period, the actual phase that
  has been observed is inserted in the pht. In this manner, the Global
  Phase History Table predictor will do a lookup on the recent history, and
  make a prediction corresponding to the phase that was last observed
Figure 4.12: The implementation of the Core Performance State Manager.
All components of the gpht predictor are implemented within the Vortex
kernel. The gphr is implemented using a circular buffer, while the implemen-
tation of the pht is more involved. To achieve 𝑂(1) complexity for retrieving
and updating predictions, a dictionary is used to allow efficient mappings
from gphr obtained keys to predictions. To allow for efficient lru eviction
from the pht, the actual predictions are stored in an lru-queue implemented
using linked data structures. This is illustrated in figure 4.14. As the retrieval
and updating of predictions are operations that are performed frequently, it is
critical that they are inexpensive. Figure 4.13 displays the cost of performing
these actions. The plots are created from 10 000 data samples. However, note
that for the sake of visualization, only values within the 99th percentile are
included. As can be seen from these plots, the costs of the operations are low;
a lookup and update operation costing on average only ~2600 and ~4000
cycles respectively.
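The following is a minimal sketch of the predictor logic described above. It
is not the Vortex implementation: all names (gpht_t, gpht_predict,
gpht_update), the history depth of 8, and the table size are assumptions. For
brevity, a linear search with timestamp-based lru eviction stands in for the
dictionary and linked lru-queue, so lookups in this sketch cost O(table size)
rather than 𝑂(1).

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define GPHR_DEPTH  8   /* assumed history length n */
#define PHT_ENTRIES 128 /* assumed table capacity */
#define GPHR_MASK   ((1u << (2 * GPHR_DEPTH)) - 1u)

typedef struct {
    uint32_t key;        /* packed gphr contents (2 bits per phase) */
    uint8_t  prediction; /* phase that followed this pattern last time */
    uint64_t last_used;  /* logical clock value used for lru eviction */
    int      valid;
} pht_entry_t;

typedef struct {
    uint32_t    gphr;       /* shift register, newest phase in the low bits */
    uint8_t     last_phase; /* fallback prediction on a pht miss */
    uint64_t    clock;
    pht_entry_t pht[PHT_ENTRIES];
} gpht_t;

static void gpht_init(gpht_t *g)
{
    memset(g, 0, sizeof *g);
    g->last_phase = 1;
}

static pht_entry_t *pht_find(gpht_t *g, uint32_t key)
{
    for (int i = 0; i < PHT_ENTRIES; i++)
        if (g->pht[i].valid && g->pht[i].key == key)
            return &g->pht[i];
    return NULL;
}

/* Predicts the next phase from the current history. */
static uint8_t gpht_predict(gpht_t *g)
{
    pht_entry_t *e = pht_find(g, g->gphr);
    if (e == NULL)
        return g->last_phase; /* unknown pattern: repeat the last phase */
    e->last_used = ++g->clock;
    return e->prediction;
}

/* Records the phase actually observed and shifts it into the history. */
static void gpht_update(gpht_t *g, uint8_t observed)
{
    pht_entry_t *e = pht_find(g, g->gphr);

    assert(observed >= 1 && observed <= 3);

    if (e == NULL) {
        /* Use a free slot if there is one, else evict the lru entry. */
        e = &g->pht[0];
        for (int i = 0; i < PHT_ENTRIES; i++) {
            if (!g->pht[i].valid) { e = &g->pht[i]; break; }
            if (g->pht[i].last_used < e->last_used) e = &g->pht[i];
        }
        e->key = g->gphr;
        e->valid = 1;
    }
    e->prediction = observed;
    e->last_used = ++g->clock;

    g->gphr = ((g->gphr << 2) | observed) & GPHR_MASK;
    g->last_phase = observed;
}
```

For a recurring pattern such as 1, 1, 3, 1, 1, 3, ... this predictor learns the
phase that follows each history once the pattern has repeated, whereas simply
repeating the last observed phase would mispredict every transition.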
Figure 4.13: The above plots show the cost distribution, in number of cycles, for
the retrieval and updating of predictions from the pht. 10 000 measurements
were used as basis for each of the plots, but for the sake of visualization
only values within the 99th percentile are included.
The sampling, phase prediction retrieval, updating of predictions, and
performance state transitions are all controlled from a thread (see code listing 4.6).
There are several reasons for this choice:
• Since performance management is performed on a per cpu-core basis,
  using threads is a simple way to avoid having to protect the gpht data
  structures with locks. Because each core runs a separate thread, there
  can exist no race conditions.
• The classification of phases relies on high precision floating point
  operations. Because of the need for accuracy, fixed point approaches cannot
  be used⁹. As explained in section 4.4.3, only threads have the necessary
  context to perform floating point arithmetic.
• Sampling of the pmcs that define different execution phases must be
  done periodically. If, for instance, the pmcs were sampled when entering
  or exiting the idle state (a scheme similar to that used for per-core
  power management), it would become impossible to increase the cpu
  frequency when utilization is growing. A steady increase in the amount
  of work would result in the core always being busy, never entering the
  idle loop. This, in turn, would keep the core operating at the frequency
  it was running at before the influx of work occurred. Threads lend
  themselves naturally to running periodic tasks, as they can simply be
  suspended in between sampling points.
However, the choice of using threads is not without problems. The fact that
the thread will perform work periodically can result in spoiled opportunities
for energy savings if not handled properly; there are situations where the
performance management threads might wake up halted or sleeping cpu
cores.
9. Fixed point arithmetic libraries can be used to emulate floating point arithmetic. Such
libraries are often used either because the processor lacks support for floating point
operations, or to increase performance [89]. However, this performance comes at the
cost of accuracy.
void idle(void)
{
    // ...
    // Ensure power management components activated
    // If not, busy wait
    if (pm_is_active) {

        // Start by suspending execution of performance state selection thread
        if (thread_active) {

            if (VxO_REFLOCK(&pstate_timer_lock, &cpulocal[CPU_ID].cl_pstate_selector_thread_ref)) {

                tcb = (tcb_t *) VxO_REFGETOBJ_NOCHECK(&cpulocal[CPU_ID].cl_pstate_selector_thread_ref);

                timer = (timer_t *) VxO_REFGETOBJ_NOCHECK(&tcb->tcb_resumetimer);
                scheduled_timeout = timer->ti_timeout;
                vxerr = thread_timer_clear(&cpulocal[CPU_ID].cl_pstate_selector_thread_ref);
                // ...
    } else {

        // Busy wait with interrupts enabled
        _INTERRUPT_ENABLE_UNCONDITIONAL();
        // ...
    // Resume execution of performance state selection thread
    if (pm_is_active && thread_active) {
        if (VxO_REFLOCK(&pstate_timer_lock, &cpulocal[CPU_ID].cl_pstate_selector_thread_ref)) {

            tcb = (tcb_t *) VxO_REFGETOBJ_NOCHECK(&cpulocal[CPU_ID].cl_pstate_selector_thread_ref);

            timeout_value = MAX(scheduled_timeout - arch_timestamp_usec(), 1);
            vxerr = timer_post_fmt(&tcb->tcb_resumetimer, timeout_value, thread_resume, "rI",
                (objref_t *)&cpulocal[CPU_ID].cl_pstate_selector_thread_ref,
                THREAD_FLAG_TIMEOUT | THREAD_FLAG_SUSPENDBALANCE);
            // ...
    // Dispatch CPUMUX thread
    arch_thread_dispatch(&cpumux_tcbref[CPU_ID]);

    while (1) ;
}
Listing 4.5: Idle function modified to support P-state selection.
Figure 4.14: The figure illustrates the implementation of the gpht predictor. Due to
the use of both a dictionary and lru-queue, 𝑂(1) complexity is achieved
for both prediction lookups and updates.
To avoid such unnecessary wakeups, we add code on entry to and exit from
the idle function that simply suspends and reschedules the thread (see code
listing 4.5). This of course adds latency, but our tests show the overhead to
be negligible (see figure 4.15). In addition, there is a cost associated with
running the thread periodically. This cost is in the order of ~5000 to
~30 000 cycles, with an average cost of ~15 000. Out of these, the costs of
updating and querying the gpht predictions amount to 26% and 17% of the
cycles on average, respectively. The remaining 57%, or approximately 9000
cycles, include the reading of pmcs, phase classification, and loop overhead.
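One detail that deserves care when the thread is rescheduled on exit from the
idle function is the computation of the remaining timeout. If the timestamps
are unsigned (an assumption; their types are not shown in the excerpt),
subtracting a later "now" from an earlier deadline wraps around to a very
large value instead of becoming negative. A defensive helper could look as
follows. The function name and the choice of 64-bit microsecond timestamps are
hypothetical and not taken from Vortex.

```c
#include <assert.h>
#include <stdint.h>

/* Returns the time left until `deadline`, clamped to at least 1 so that an
 * already expired timer fires as soon as possible instead of wrapping. */
static uint64_t remaining_timeout_usec(uint64_t deadline, uint64_t now)
{
    uint64_t remaining = (deadline > now) ? deadline - now : 1;

    assert(remaining >= 1);
    return remaining;
}
```

Comparing before subtracting keeps the result well defined even when the core
has been idle past the originally scheduled wakeup.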
The enforcement module is the final component of the per-core performance
management scheme. The transitions between different cpu P-states are
performed using acpi-objects presented by the platform. A detailed description
of this process is given in appendix A.
Figure 4.16 illustrates how different P-states are chosen by cores according
to their experienced load. For a closer description of this workload, see sec-
tion 5.1.4. Under this workload ApacheBench (ab) is run with a workload
trace corresponding to high load (see section 5.1.3 for details). This results
in Apache worker threads being dispatched to all cores. This is a memory
bound task, as indicated by the cores seldom entering P0. Core 5, in addition
to serving Apache worker threads, runs a compute intensive task that
repeatedly calculates prime numbers and sleeps. This is clearly visible by the
core entering P0 whenever the compute intensive task is scheduled. Core 3
runs the same cpu intensive task as core 5, but for shorter durations. This
vxerr_t pstate_selector_thread(void)
{
    // ...
        // Make prediction and clock CPU core according to this
        predicted_phase = gpht_predictor_get_prediction(cpulocal[CPU_ID].cl_gpht_predictor);
        if (predicted_phase != gpht_predictor_get_current_phase())
        {
            vxerr = gpht_predictor_change_performance_state(predicted_phase);
            // ...
        // Put the thread to sleep for desired interval
        thread_suspend_timeout(&cpulocal[CPU_ID].cl_pstate_selector_thread_ref, PSTATE_SELECTOR_SLEEPTIME);

        // Read out performance counter registers and update predictor accordingly
        mem_to_uops_ratio = llc_miss_to_uops_ratio();

        observed_phase = predictor_classify_ratio(mem_to_uops_ratio);

        vxerr = gpht_predictor_update_predictor(cpulocal[CPU_ID].cl_gpht_predictor, observed_phase);
        // ...
Figure 4.15: The above plots show the overhead, in number of cycles, added to the
entry and exit of the idle function following the introduction of per-core
performance state management. The plots show averages over all cores, where
1000 measurements have been obtained for each. For the sake of visualization
only values within the 99th percentile are included.
                                                Time (𝜇s)
Description                    Avg. # Cycles    2.0ghz    2.33ghz    2.66ghz
Get prediction from pht        2628             1.31      1.13       0.99
Update prediction in pht       4000             2.00      1.71       1.50
One iteration in gpht          15456            7.73      6.63       5.82
Idle function entry overhead   4214             2.11      1.81       1.58
Idle function exit overhead    12434            6.22      5.34       4.67
Table 4.2: Summary of Core Performance Management costs and overheads. The
table shows averages over all cores, where 10 000 measurements have
been obtained for each. For the sake of visualization only values within
the 99th percentile are included.
results in the core only seldom reaching a combined workload that is cpu













Figure 4.16: The figure illustrates how cores experiencing different loads select their
P-states at runtime.
4.5.3 Naive Forecaster
In addition to the gpht predictor, we also implement what is known as a
naive forecaster. A naive forecaster simply predicts that the next phase will be
identical to the previous one.
We implement our naive forecaster with a single thread per core. This thread
stores the observed phase, and uses it as the prediction for the next interval.
When the interval is over, it observes the actual mem/𝜇-ops ratio from the
interval, classifies it as a phase, and stores it. A code excerpt showing this
thread is given in listing 4.7. Similarly to the gpht predictor, the idle loop
is modified to stop the thread whenever a C-state is entered, and resume it
when execution continues.
Figure 4.17 shows a plot of the cost of running the naive forecaster loop one
iteration. The cost is substantially lower than for the gpht predictor.






    // ...
    while (1) {

        predicted_phase = observed_phase;
        if (predicted_phase != gpht_predictor_get_current_phase()) {
            vxerr = gpht_predictor_change_performance_state(predicted_phase);
        }

        // Put the thread to sleep for desired interval
        thread_suspend_timeout(&cpulocal[CPU_ID].cl_pstate_selector_thread_ref, PSTATE_SELECTOR_SLEEPTIME);

        // Read out performance counter registers and update predictor accordingly
        mem_to_uops_ratio = llc_miss_to_uops_ratio();
        observed_phase = predictor_classify_ratio(mem_to_uops_ratio);
    }
}
Listing 4.7: Naive Forecaster.
4.6 Energy Efficient Scheduling
Energy efficient scheduling is the last of the three axes we defined for cpu
pm. Scheduling algorithms have been used to reduce power consumption in
various domains, for instance to schedule vms on different physical hosts
in clouds [55]. Much work has also gone into energy efficient scheduling in
real time systems [38], [72], [95], where dvs is employed while still meeting
deadlines. Other approaches exploit knowledge about the platform topology,
by attempting to keep as many components as possible powered down at
all times [83]. In this section, we describe the implementation of the energy
efficient scheduling algorithm employed in rope.













Figure 4.17: Overhead of naive forecaster (cost of one iteration in the naive
forecaster loop, in number of cycles).
4.6.1 Topology Agnostic Dynamic Round-Robin
This section describes the implementation of a topology agnostic algorithm
for placing work on cpu-cores. The intuition behind the algorithm is that if
as few cores as possible are active at any given time, power can be saved. The
algorithm is naive in that it does not consider the topology of the platform.
This means that no effort is made to keep caches warm, or to limit the number
of packages with active cores.
The algorithm is hooked directly into the existing scheduling framework in
Vortex. More specifically, it is executed as a load balancing action, which is
invoked periodically. The algorithm attempts to limit the number of cores
in use by filling one core at a time, only adding new cores to the current
working set whenever the average load reaches a given threshold. In dynamic
round-robin (drr), cpu-cores can be in one of three states:
• Active: Tasks are scheduled on active cores in a round-robin fashion.
• Retiring: Cores that are retiring cannot receive new tasks, but will
continue running tasks that have been assigned to the core already. Over
time, tasks will be migrated away from cores that are in the retiring
state. After being in the retiring state some predefined amount of time,
a core will enter the inactive state.
• Inactive: Tasks are not scheduled on inactive cores.
In [55], the authors use a similar classification of vms to perform scheduling
of vm instances. For convenience, we chose to use the same terminology,
although our solution differs from theirs in how we enforce and choose our
thresholds.
Figure 4.18 shows an illustration of the structure of the dynamic round-robin
(drr) scheduler. As mentioned, tasks are scheduled on the active cores round-
robin. If no cores are active, or the average utilization of all active cores is
greater than some threshold Θrecruit, additional cores are added to the set of
active cores. When additional cores must be recruited, cores in the retiring
state are targeted first. This is because they are less likely to have entered a
sleep state, and waking a sleeping core has a greater associated latency than
using one that is already active. Only in the case that no cores are retiring,
will inactive cores be awoken and added to the set of active cores.
Cores enter the retiring state whenever their utilization drops below some
threshold, Θretiring. Cores remain in this state a predefined amount of time,
𝑇retiring, before entering the inactive state.
Using the three parameters Θrecruit, Θretiring, and 𝑇retiring, the behavior of the
scheduler can be tuned. The value of Θrecruit controls the number of cores
in use: an additional core is recruited when the average utilization of the
active cores exceeds the threshold,

\[ \frac{1}{n} \sum_{i=1}^{n} C_i > \Theta_{\mathrm{recruit}}, \qquad C_i \in \text{Active Queue} \tag{4.7} \]

where 𝑛 is the number of cores in the active queue and 𝐶𝑖 is the utilization of
core 𝑖. For instance, a value of Θrecruit = 90 means that the average load over
all active cores must reach 90% before another core is moved into the active
set. The Θretiring parameter determines how sensitive the active set is to
fluctuations in the workload. A high value will result in cores moving between
the active and retiring state frequently, while a low value will keep the set of
active cores more stable. Finally, 𝑇retiring will determine how long cores remain
in the retiring state. If the value is too large, cores are less likely to enter the
inactive state and power may be wasted. However, too small values might
also be wasteful as the result will be frequent state transitions.
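The state transitions described above can be summarized in a short sketch of
one load balancing pass. It is an illustration under stated assumptions rather
than the Vortex scheduler: the type and function names are hypothetical,
utilization is expressed as an integer percentage, and task migration away
from retiring cores is left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef enum { CORE_INACTIVE, CORE_RETIRING, CORE_ACTIVE } core_state_t;

typedef struct {
    core_state_t state;
    unsigned     util;           /* utilization in percent (0..100) */
    uint64_t     retiring_since; /* time at which the core began retiring */
} core_t;

typedef struct {
    unsigned theta_recruit;  /* recruit when average active load exceeds this */
    unsigned theta_retiring; /* retire a core whose load drops below this */
    uint64_t t_retiring;     /* time spent retiring before going inactive */
} drr_params_t;

/* Index of the first core in state s, or -1 if there is none. */
static int first_core_in_state(const core_t *cores, size_t n, core_state_t s)
{
    for (size_t i = 0; i < n; i++)
        if (cores[i].state == s)
            return (int)i;
    return -1;
}

/* One load balancing pass over all cores at time `now`. */
static void drr_balance(core_t *cores, size_t n, const drr_params_t *p,
                        uint64_t now)
{
    unsigned sum = 0, active = 0;

    assert(p->theta_retiring <= p->theta_recruit);

    for (size_t i = 0; i < n; i++) {
        core_t *c = &cores[i];

        if (c->state == CORE_ACTIVE && c->util < p->theta_retiring) {
            c->state = CORE_RETIRING;
            c->retiring_since = now;
        } else if (c->state == CORE_RETIRING &&
                   now - c->retiring_since >= p->t_retiring) {
            c->state = CORE_INACTIVE;
        }
        if (c->state == CORE_ACTIVE) {
            sum += c->util;
            active++;
        }
    }

    /* Recruit when nothing is active or the average load is too high.
     * Retiring cores are preferred, as they are less likely to be asleep. */
    if (active == 0 || sum > p->theta_recruit * active) {
        int idx = first_core_in_state(cores, n, CORE_RETIRING);
        if (idx < 0)
            idx = first_core_in_state(cores, n, CORE_INACTIVE);
        if (idx >= 0)
            cores[idx].state = CORE_ACTIVE;
    }
}
```

In this sketch at most one core is recruited per pass, so the active set grows
gradually as the load increases.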
Figure 4.19 shows the utilization of different cores over time as the total
system-wide load increases. The figure shows how cores are recruited grad-
ually, and only when cores reach a level of utilization that warrants this
(Θrecruit). The very low levels that are observable for some of the cores
originate from interrupt and message processing¹⁰. Our scheduler only considers
threads, and does not control these activities.
10. Recall that all communication in Vortex is via asynchronous message passing.
Figure 4.18: Implementation of energy efficient scheduler.



















Figure 4.19: cpu utilization over time under dynamic round-robin scheduling.














Figure 4.20: The above plots display overheads, in number of cycles, for different
operations within the energy efficient scheduler implementation, including
the cost of unretiring cores. Note that for the sake of visualization, only
values within the 99th percentile are included. All the graphs are based
on 10 000 measurements.
Figure 4.20 contains plots detailing the cost of various operations in the topol-
ogy agnostic round-robin scheduler. All operations are cheap, and the most
expensive, unretiring cores, costs less than 1200 cycles on average.
5 Evaluation
In this chapter we evaluate rope and its pm policies. We start by describ-
ing our experimental methodology, platform, and workloads. Following this,
we will evaluate C- and P-state management policies, before turning our
attention to energy efficient scheduling.
5.1 Methodology
Throughout the evaluation of our pm-policies, we will use both internal and
external measures. While internal measures, such as predictor accuracy, are
certainly interesting, external measures such as power consumption and user-
perceived performance are the ones that really matter. Based on this, we
evaluate our policies and algorithms using workloads designed to mimic real-
life usage, but also run experiments that highlight the behavior of our policies
through internal measures.
When evaluating the efficiency of pm-policies in the setting of cloud comput-
ing, the amount of power saved is not the only important factor: different
slos govern how much performance is necessary according to different
metrics such as latency, throughput, and availability. A typical cloud service
whose qos is reliant on exactly these metrics is web hosting. As a
consequence, this is one of our services of choice when benchmarking our pm
algorithms. In addition, we measure the performance characteristics of our
solutions using an industry standard database benchmark, tpc-c.
5.1.1 Experimental Platform
We run our experiments on a set of Hewlett-Packard ProLiant BL460c G1
blade servers. These blades are equipped with twin Intel Xeon 5355 processors
running at a peak frequency of 2.66ghz, have 16gb of pc2-5300 ddr2 ram,
and a single 10K sas hard drive. Note that seemingly identical blades might
contain different hardware¹. Thus, care has been taken to ensure identical
hardware for all test systems. In addition, we use a set of similarly specced Dell
PowerEdge M600 blade servers as compute nodes for generating load for our
test systems. Each of the HP test systems is connected via a 1Gbit Ethernet
link to an HP 1:10G Ethernet blade switch, while the Dell blades are connected
to an Ethernet Pass-Through switch from the same vendor. In turn, these two
switches are interconnected with 4 × 1Gbit connections. To avoid network
bandwidth being an inhibiting factor, we limit the number of simultaneously
benchmarked systems to 4, each of these being granted exclusive access to
one of the 1Gbit links via the use of Virtual LANs (vlans).
If not otherwise stated, all experiments including power consumption mea-
surements are performed using a set of workload traces detailed in the fol-
lowing sections. For all web-based benchmarks, we run the server software
within a lightweight vm os providing the Linux 3.2.0 abi [68], [69]. This, in
turn, is run on top of Vortex. When executing workloads consisting of http
requests, we run an instance of Apache Web Server version 2.4. Database
transaction workloads are run against a mysql Community Server 5.6.17
instance.
5.1.2 Measuring Power Consumption
Even with homogeneous hardware, inter-node variability is ever present. We
measure the power consumption every second on a per node basis using the
Intelligent Platform Management Interface (ipmi) protocol. The different
blade servers used as test systems exhibit power consumptions of 182W–196W
(a difference of over 5%), even when completely idle. When ram from differ-
ent vendors was used, the differences were even larger. Our tests indicate that
the variance in power consumption is closely related to the physical device
bays of the server enclosure. Thermal issues are unlikely to be the culprit as
fan speeds throughout the entire enclosure vary only slightly.
1. For instance, our blades contain ram from two different vendors. In addition, one of
these was “green”, consuming considerably less energy than the other.
To compensate for this variability, all power consumption measurements are
averaged over a minimum of 15 runs. These runs are in turn spread over at
least 3 different nodes. By using this approach we avoid the scenario where
different algorithms and configurations are given an unfair advantage or
impediment simply as the result of node assignment. Averages also serve as a
good measure for performance and energy consumption when working with
large numbers of servers, as can be expected in a dc setting.
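The averaging procedure above can be sketched as follows. This is a minimal illustration, not the tooling used for the thesis: the function name `summarize_runs` and the input layout (one list of per-second power samples per run, tagged with the node it ran on) are assumptions made for the example. Only the limits of 15 runs and 3 nodes come from the text.

```python
import statistics

MIN_RUNS = 15   # minimum number of runs per configuration (from the text)
MIN_NODES = 3   # minimum number of distinct nodes (from the text)


def summarize_runs(runs):
    """Summarize power measurements for one configuration.

    runs: list of (node_name, samples), where samples holds one power
    reading in watts per second of the run (hypothetical layout).

    Returns (mean, standard deviation) of the per-run average power.
    """
    nodes = {node for node, _ in runs}
    if len(runs) < MIN_RUNS or len(nodes) < MIN_NODES:
        raise ValueError("need at least 15 runs spread over at least 3 nodes")
    # Average each run first, then average over runs, so that long and
    # short runs contribute equally.
    per_run = [statistics.fmean(samples) for _, samples in runs]
    return statistics.fmean(per_run), statistics.stdev(per_run)
```

Averaging per run before averaging over runs is a design choice of this sketch; it keeps a single long run on a power-hungry node from dominating the result.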
5.1.3 ApacheBench Workload Traces
To ensure that fair comparisons of different algorithms are possible, we evaluate
these by executing a set of workload traces. These workloads are generated
using ApacheBench (ab, see footnote 2), which is a program for testing web servers.
ab allows us to measure performance in terms of throughput, availability,
response times at different percentiles, and several additional metrics. To
measure web server performance, ab can be run with different arguments.
The arguments employed when generating our traces are as follows:
• url: The url to a file to be downloaded by ab.
• Number of requests: The number of requests ab should perform towards
the web server.
• Concurrency: The concurrency level of the requests, that is, the number
of simultaneous requests towards the web server.
In addition, we produce a spiky behavior by sleeping a randomly selected
amount of time from within a predefined range. To ensure that our pm algorithms
are tested for varying workloads, we create three different traces
corresponding to low, normal, and high load. For all of these, the files to be
requested are selected randomly from a set containing files with sizes 1kb,
2kb, 4kb, 16kb, 32kb, 2mb, and 4mb. The number of requests, concurrency
level, and sleep time are all selected randomly from within predefined ranges,
as listed in table 5.1.
After these traces have been generated, they are stored so that they can be
replayed. Replay is performed by a simple Python script (see footnote 3) which reads
each entry, sleeps the required amount of time, and executes ab with the
parameters corresponding to the current entry.
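To make the trace format concrete, the following is a minimal sketch of how such traces could be generated and replayed. It is not the script used in the thesis. The file names, the base URL, the fixed number of entries, and all function names are assumptions made for the example. The value ranges follow table 5.1, and `-n` and `-c` are the standard ab options for the number of requests and the concurrency level.

```python
import random
import subprocess
import time

# Ranges per workload, following table 5.1. Sleep times are in seconds.
WORKLOADS = {
    "low":    {"requests": (1, 50),     "min_conc": 1,   "sleep": (1.0, 30.0)},
    "normal": {"requests": (100, 500),  "min_conc": 50,  "sleep": (0.1, 1.0)},
    "high":   {"requests": (500, 1000), "min_conc": 100, "sleep": (0.01, 0.1)},
}

# Hypothetical names for the files of the sizes listed in the text.
FILES = ["1kb", "2kb", "4kb", "16kb", "32kb", "2mb", "4mb"]


def generate_trace(workload, entries, base_url, rng):
    """Generate a list of trace entries for the given workload."""
    spec = WORKLOADS[workload]
    trace = []
    for _ in range(entries):
        requests = rng.randint(*spec["requests"])
        # Concurrency can never exceed the number of requests (R in table 5.1).
        concurrency = rng.randint(min(spec["min_conc"], requests), requests)
        trace.append({
            "url": f"{base_url}/{rng.choice(FILES)}",
            "requests": requests,
            "concurrency": concurrency,
            "sleep": rng.uniform(*spec["sleep"]),
        })
    return trace


def ab_command(entry):
    """Build the ab command line for one trace entry."""
    return ["ab", "-n", str(entry["requests"]),
            "-c", str(entry["concurrency"]), entry["url"]]


def replay(trace):
    """Replay a stored trace: sleep, then run ab, for each entry in turn."""
    for entry in trace:
        time.sleep(entry["sleep"])
        subprocess.run(ab_command(entry), check=False)
```

Passing a seeded `random.Random` instance as `rng` makes a generated trace reproducible, which matches the goal of replaying identical traces for every algorithm under test.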
2. http://httpd.apache.org/docs/2.2/programs/ab.html
3. https://www.python.org/
Workload  Description               Concurrency  Requests     Sleep time
Low       Very low and spiky load.  [1, 𝑅]       [1, 50]      [1, 30]
Normal                              [50, 𝑅]      [100, 500]   [0.1, 1.0]
High                                [100, 𝑅]     [500, 1000]  [0.01, 0.1]
Table 5.1: Summary of ab workload trace generation. All sleep times are listed in
seconds. The 𝑅 symbol is used to illustrate that the maximum possible
concurrency is limited by the randomly chosen number of requests.
5.1.4 Prediction Test Workload
For some of our experiments, the ab traces described above are not suitable.
In these cases, we use a somewhat different workload which is more variable.
The reason we do this is that the ab traces are highly memory bound, and thus
almost exclusively contain phases corresponding to the lowest performance
state. This makes for a bad test when attempting to compare certain algo-
rithms, for example ones that predict P-states. Instead we use a population of
compute intensive threads (calculating prime numbers) with different charac-
teristics spread over some of the cores, with a background load corresponding
to the most work-intensive ab trace.
5.1.5 HammerDB and TPC-C Benchmark
We use the tpc-c database benchmark to measure intrusiveness and per-
formance degradation resulting from our pm algorithms. tpc-c is an on-
line transaction processing (oltp) benchmark featuring multiple transaction
types, and complex database- and execution structures.
tpc-c is approved by the Transaction Processing Performance Council (tpc),
and simulates a complete environment where a population of terminal oper-
ators executes transactions against a database. The benchmark is centered
around the principal activities (transactions) of an order-entry environment.
These transactions include entering and delivering orders, recording pay-
ments, checking the status of orders, and monitoring the level of stock at
multiple warehouses [3].
We use HammerDB (see footnote 4) to execute tpc-c against our test servers. HammerDB is
a tool that provides a simple graphical interface for configuration and benchmarking
of databases (see footnote 5). Figure 5.1 shows a screenshot from a benchmarking
session. HammerDB provides two measures of performance, the number of
transactions per minute (tpm), and the number of new-order transactions
per minute (nopm), which is the performance metric of tpc-c. For all tests
using tpc-c and HammerDB, we use a configuration of 5 warehouses and
5 virtual users. We sample data from at least 3 separate physical hosts, and
calculate averages and standard deviations from a minimum of 15 measure-
ments.
Figure 5.1: Screenshot of running tpc-c with HammerDB.
5.1.6 Fine Grained Workload
Our final workload allows fine-grained control over the total load in the sys-
tem. This is especially useful when evaluating our energy efficient scheduling
algorithm.
4. http://hammerora.sourceforge.net/
5. HammerDB currently supports Oracle, Microsoft sql Server, mysql, PostgreSQL, and
Redis databases.
The workload is generated by spawning new cpu-intensive threads, each cor-
responding to a configurable amount of cpu consumption. This is done until
a global (also configurable) limit for cpu load is reached. For example, new
threads, each using 5% of the compute resources of a core, can be spawned
until the total system-wide cpu load is 500% (corresponding to five cores
being fully utilized).
The threads generating the actual load are busy waiting and sleeping repeat-
edly in a pattern that sums to the total configured cpu consumption.
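The load generation described above can be sketched as follows. This is an illustration under assumptions, not the Vortex implementation: the function names and the period length are hypothetical, and shares are expressed as whole percentages of one core.

```python
import time


def plan_workers(per_worker_pct, total_pct):
    """Number of load generators needed to reach the system-wide limit."""
    return total_pct // per_worker_pct


def duty_cycle(per_worker_pct, period_s):
    """Split one period into (busy, sleep) seconds for the given CPU share."""
    busy = period_s * per_worker_pct / 100.0
    return busy, period_s - busy


def generate_load(per_worker_pct, period_s, duration_s):
    """Busy-wait and sleep repeatedly so that the average consumption is
    roughly per_worker_pct percent of one core."""
    busy, idle = duty_cycle(per_worker_pct, period_s)
    end = time.monotonic() + duration_s
    while time.monotonic() < end:
        spin_until = time.monotonic() + busy
        while time.monotonic() < spin_until:
            pass  # busy wait: consume CPU for the busy part of the period
        time.sleep(idle)
```

With the numbers from the example in the text, `plan_workers(5, 500)` gives 100 generators. In CPython the generators would have to run as separate processes rather than threads to load several cores at once, since threads share the global interpreter lock.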
5.2 CPU Power State Management
In this section, we describe our experiments involving policies for managing
cpu C-states. When measuring power consumption and total completion time,
we sum the samples over the entire trace. This is done to better illustrate the
differences between different policies. We measure instantaneous latency,
as perceived by a user, per request. For each workload we also measure the
power consumed and experienced latency when not using any pm at all. These
measurements serve as baselines for comparing the different policies.
5.2.1 Aggressively Entering C-states
Aggressive entry of different C-states is the simplest of our policies for man-
aging cpu C-states. Our platform supports a total of six different C-states (as
enumerated by the use of the cpuid instruction). These are C1, C2, and C3—
each with two sub-states. We were unable to measure any difference between
sub-states belonging to the same major C-state. Further, neither cpuid nor
any freely available documentation on our cpus provides any information
about the latencies of entering these states (see footnote 6). For simplicity, we use
only the deepest C-states. Notice that for C1, this corresponds to the C1E state, which
is entered automatically by our cpus if all cores on a chip are halted.
Figure 5.2 shows the measured power consumption of entering various C-states
as soon as an idle period occurs. As is clear from the figure, enabling
direct entry of any C-state results in significant energy savings. Because of
the large standard deviations, it is impossible to separate entering C1 or C2
from each other with any degree of certainty. However, entering C3 directly
seems to consume less energy on average.
6. acpi does provide some latencies, but these do not correspond with the ones used by
Ubuntu Linux 12.04 when installed on our platform. We use the largest values, assuming
that these correspond to the deepest sub-states.

Figure 5.2: Mean power consumed over 1 hour long ab web traffic traces. The
amount of power consumed is summed over the entire trace to better
show differences between the different C-states. The error bars show
standard deviations.
Although a substantial amount of energy is saved for all workloads, the relative
difference compared to using no pm is especially large for the low load.
This is because this workload trace features relatively long idle periods. As
described in section 2.2, such workloads allow for considerable reductions in
energy consumption as the cost of transitioning in and out of different power
states is by far outweighed by the energy savings. The relative savings for
different workloads are displayed in table 5.2. As can be seen, the relative
energy savings are roughly 4 to 6 percentage points higher when entering C3
directly than when entering C1 or C2.
          Relative Energy Savings
Workload  C1     C2     C3
Low       24.4%  23.3%  28.2%
Normal    13.5%  12.6%  17.8%
High      12.2%  11.3%  17.4%
Table 5.2: Relative energy savings for different workloads when entering C-states
directly. The percentages are calculated from the means shown in figure
5.2.
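As an illustration of how such percentages follow from the means, the sketch below computes relative savings from a baseline and a policy value. The function name and the input numbers in the usage note are hypothetical; the measured means themselves are only available in figure 5.2.

```python
def relative_savings(baseline, policy):
    """Relative energy savings in percent.

    baseline: mean energy summed over a trace when using no pm.
    policy:   mean energy summed over the same trace with a pm policy.
    """
    return 100.0 * (baseline - policy) / baseline
```

For example, a hypothetical baseline of 200 units and a policy value of 150 units corresponds to savings of 25%.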
The loss of performance resulting from the use of pm policies is also important
when deciding on a policy. From logs obtained from runs of our ab workloads,
we calculate the mean excess completion time for different policies. For each
of the 1 hour traces, we sum the time before 99% of requests are served. Again,
the summation is performed to make any difference easier to detect. Figure 5.3
shows the mean excess completion time when entering C-states directly, i.e.,
the difference between the use of a policy and no pm. Interestingly, entering
C1 seems to increase the completion time slightly on average, while it is
difficult to say anything in particular for entering C2 or C3. Our conclusion is
that entering C-states directly results in close to no increase in total
completion time, but does induce some variability. Note that this summed completion
time does not correspond to added instantaneous latency, as experienced by
a user issuing a request. Table 5.3 shows the relative increase in completion
time and variability for entering C-states directly, as opposed to using no
pm.

Figure 5.3: The figure shows the added completion time induced by aggressively
entering different C-states. The measure of added completion time is
time before 99% of requests are served, and the plots show the difference
between entering C-states aggressively and no pm.
To evaluate instantaneous latency, we examine the ab logs on a per-ab-run
basis. For each workload and policy, we calculate the mean for each of the
individual ab runs (over all the traces), and calculate the difference from
the mean obtained when using no pm policy. Because the different ab traces
contain a varying number of requests, we normalize the difference per request.
These differences are shown in figure 5.4. Notice that even though the workloads
differ in length in terms of ab runs, they all have the same
1 hour duration. Negative values would correspond to execution with pm
being faster than running the system at maximum performance. Clearly, such
values are measurement artifacts. For each ab run, all outliers, |𝑥| > 2𝜎, 𝜎 being the
standard deviation, are removed to reduce the amount of noise. We see that
the added latency only very rarely exceeds one millisecond, and most of the
time is close to zero.
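The processing of a single ab run can be sketched as below. Two points are assumptions of this sketch rather than statements from the text: outliers are taken to be samples that deviate more than 2𝜎 from the mean of the run, and the normalization divides the difference in means by the number of requests in the run. The function names are hypothetical.

```python
import statistics


def drop_outliers(samples, k=2.0):
    """Remove samples more than k standard deviations from the mean."""
    if len(samples) < 2:
        return list(samples)
    mean = statistics.fmean(samples)
    sigma = statistics.pstdev(samples)
    return [x for x in samples if abs(x - mean) <= k * sigma]


def added_latency_per_request(policy_ms, baseline_ms, requests):
    """Difference in mean latency (ms) of one ab run with and without pm,
    normalized by the number of requests in the run."""
    policy = statistics.fmean(drop_outliers(policy_ms))
    baseline = statistics.fmean(drop_outliers(baseline_ms))
    return (policy - baseline) / requests
```

A negative return value corresponds to the artifact mentioned above, where the run with pm appears faster than the baseline.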

Figure 5.4: The figure shows added latency by aggressively entering different C-
states. The latency is given in ms, and normalized per request. Notice
that the different workloads contain varying numbers of ab runs, but all
have the same 1 hour duration.
Considering the amount of power saved as well as the added latency, di-
rectly entering C3 seems to be the most desirable of the policies examined so
far.
          Excess completion time   Added variability
Workload  C1     C2      C3        C1     C2      C3
Low       7.3%   7.4%    0.4%      0.0%   13.0%   38.9%
Normal    0.7%   0.4%    0.5%      0.0%   27.6%   29.2%
High      0.7%   −0.2%   −0.4%     0.0%   12.4%   47.3%
Table 5.3: Summary of performance degradation due to entering C-states directly.
Percentages are calculated using means and standard deviations, and are
relative to using no pm.
5.2.2 Static Timeout Based C-state Entry
Next, we consider entering a C-state following a period of inactivity. In the
light of our findings in the previous section, we limit our experiments to only
consider entering the C3 power state on timeouts, rather than examine all
possible permutations. We perform the same measurements as in the previous
section, and do this for timeout values from 1𝜇𝑠–1000𝜇𝑠. This choice of values
is motivated by real data center (dc) workloads seldom containing long idle
periods, but rather very many and short ones [59]. If any energy is to be saved,
the timeouts must be relatively short.
The power consumption of entering C3 after a static timeout is shown in
figure 5.5. All timeouts exhibit considerable power savings when compared
to using no pm. Although the standard deviations are prohibitively large
to say anything with certainty, the power consumption seems to increase
slightly with the timeout duration. As explained in section 2.2, this should be
expected as the amount of potential sleep time that is wasted increases with
the timeout. The slight decrease for a timeout of 1000𝜇𝑠 could be attributable

Figure 5.5: Mean power consumed over 1 hour long ab web traffic traces. The
amount of power consumed is summed over the entire trace to better
show differences between the different timeout values.
We next go on to consider excess completion time resulting from the use of
different timeout values. Figure 5.6 shows the mean excess latency over the
1 hour ab traces. For all timeout values, the variability is significantly lower
than when using no pm, meaning that timeout-based policies are consistently
slower. As can be seen, the excess completion time increases dramatically
with increasing timeout values. These results were surprising. Several works
mention increasing timeout values as a means to reduce the extra latency
generated by a pm policy [54], [39]. Recall from section 4.4.2 that with our
implementation of entering C-states following a static timeout, a timer is
created before the core is halted. The only possible ways for the core to leave
this halted state are:
1. The arrival of work, meaning that the pending timer should be canceled.
2. The timer fires, meaning that a sufficiently long period of inactivity has
occurred, and a deeper C-state (C3) should be entered.
Measuring the ratio of canceled timers for our different timeout values, it is
apparent that as the timeout value increases, so does the number of canceled
timers. This is illustrated in figure 5.7.
Another key point is that when the timeout value increases, more and more
energy is wasted waiting for the timeout to expire, i.e., the benefit of the pm
policy diminishes. At the same time, we reap more and more of the negative
effects. If the timeout value is short, the cost of handling the timer is hidden
from the users because the system is inactive. The cost of canceling a timer is
not hidden, as the core has already been awoken by the arrival of work. Thus,
as the number of canceled timers increases, so does the latency experienced
by the users.
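The relationship between timeout value, canceled timers, and wasted idle time can be illustrated with a small offline model. This is a simplified sketch, not the rope implementation: it takes a list of idle period lengths as input, ignores the latency of entering and leaving C3, and uses hypothetical names throughout.

```python
def simulate_static_timeout(idle_periods_us, timeout_us):
    """Classify each idle period under a static timeout policy.

    A timer is posted when the core halts. If work arrives before the
    timer fires, the timer is canceled. Otherwise the core enters C3 for
    the remainder of the idle period.
    """
    canceled = fired = 0
    waiting_us = 0.0   # idle time spent waiting for the timer, not in C3
    c3_us = 0.0        # idle time spent in C3
    for idle in idle_periods_us:
        if idle < timeout_us:
            canceled += 1
            waiting_us += idle
        else:
            fired += 1
            waiting_us += timeout_us
            c3_us += idle - timeout_us
    total = canceled + fired
    return {
        "canceled_ratio": canceled / total if total else 0.0,
        "waiting_us": waiting_us,
        "c3_us": c3_us,
    }
```

For a fixed set of idle periods, raising the timeout in this model can only increase the ratio of canceled timers and the time spent waiting, and can only decrease the time spent in C3, which is the trend described above.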
In [59], the authors claim that adding delays smaller than 1ms when exiting
the idle loop does not result in significant impacts on response times. As
can be seen in the plots of figure 4.6, the added latency of our static timeout
implementation never approaches 1 ms even in the worst case, and on average,
adds only 10^3 to 10^4 cycles, or about 0.5𝜇𝑠 to 5𝜇𝑠 if the cpu is run at 2 GHz.
Nonetheless, we observe severe effects on latency for even small increases in
the timeout value.

Figure 5.6: Mean excess completion time from entering C3 following a period of
inactivity. The measure of completion time is the time before 99% of
requests are served, and the plots show the difference between entering
C3 after a timeout, and using no pm.
Finally, we turn to the instantaneous user-perceived latencies. For the sake
of visualization, we plot only the latency for entering C3 after timeouts of
5𝜇𝑠, 500𝜇𝑠, and 1000𝜇𝑠. As can be seen in figure 5.8, longer timeouts incur
considerable extra latency. This is contrary to results from [59], where the
authors claim that delays less than 1ms when exiting the idle loop are of little
consequence. We conclude that when attempting to save power by putting
cpu-cores to sleep under normal data center loads using static timeouts,
shorter timeout values are preferable.

          Excess Completion Time
Workload  1𝜇𝑠    5𝜇𝑠    25𝜇𝑠   50𝜇𝑠   250𝜇𝑠   500𝜇𝑠   1000𝜇𝑠
Low       5.8%   2.9%   2.6%   3.6%   4.9%    9.0%    14.0%
Normal    2.3%   1.1%   0.2%   3.8%   10.1%   20.1%   31.9%
High      2.0%   1.9%   1.7%   5.2%   13.7%   24.1%   44.3%
Table 5.4: Summary of performance degradation due to entering a C-state following
a static timeout. Percentages are calculated from means, and are relative
to using no pm.
5.2.3 Select Best Performing C-state
The last C-state management policy we consider is that of selecting the best
performing C-state at runtime. As mentioned in section 4.3, the Share Algorithm
is tunable via several parameters, namely:
• 𝑐_latency: The weighting of added latency in the cost function. A high
value for 𝑐_latency means that latency will be avoided.
• 𝑐_wasting: The weighting of “wasted” energy. A high value for 𝑐_wasting will
result in deeper C-states being entered.
• 𝛼: The share parameter, which controls how fast a poorly performing
expert is able to recover its weight when it starts performing well.
• 𝜂: The learning rate, which controls the rate at which the weights of
poorly performing experts are reduced.
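To illustrate how these four parameters interact, the sketch below shows one weight update in the style of the variable-share update commonly associated with the Share Algorithm. It is an assumption-laden illustration, not the rope implementation described in section 4.3: the function names, the normalization of the loss to the range [0, 1], and the final renormalization step are choices made for this example.

```python
import math


def expert_loss(latency, wasted, c_latency, c_wasting):
    """Weighted cost in [0, 1] for one expert.

    latency and wasted are assumed to be normalized to [0, 1] already.
    """
    return (c_latency * latency + c_wasting * wasted) / (c_latency + c_wasting)


def share_update(weights, losses, eta, alpha):
    """One variable-share style update of the expert weights.

    eta:   learning rate, controls how hard a lossy expert is penalized.
    alpha: share parameter, controls how much weight is pooled and
           redistributed so that poor experts can recover later.
    """
    n = len(weights)
    penalized = [w * math.exp(-eta * l) for w, l in zip(weights, losses)]
    keep = [(1.0 - alpha) ** l for l in losses]
    pool = sum(p * (1.0 - k) for p, k in zip(penalized, keep))
    updated = [k * p + pool / n for p, k in zip(penalized, keep)]
    total = sum(updated)
    # Renormalize so the weights stay well scaled (added for numerical
    # convenience in this sketch).
    return [u / total for u in updated]
```

With a high learning rate, an expert that incurs a large loss loses almost all of its weight in a single update, while the shared pool guarantees that no weight ever reaches exactly zero, so an expert can regain the lead once it starts performing well.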
We want our solution to react quickly to spiky workloads, and thus set a high
learning rate of 𝜂 = 25, while 𝛼 is kept low with a value of 0.001. The Share
Algorithm is robust with regard to the 𝜂 and 𝛼 parameters, so in the case
that our choice is non-optimal, this will have only a limited effect [39]. We
test our solution with constant 𝜂 and 𝛼, but vary the relative weighting of
excess latency and wasted energy in three different configurations. The first
of these weighs excess latency double that of wasted energy (see table 5.5).
The second has equal weights for the two, while the third weighs wasted
energy double that of excess latency. Figure 5.9 contains plots comparing the
three configurations.

Figure 5.7: Ratio of posted timers that are canceled before firing. Note that this is
not a problem for the lowest ab load, and it is thus not included. Error
bars show standard deviations.

Figure 5.8: Mean user-perceived latency added by using static timeout policies. The
latency is given in ms, and normalized per request. Notice that the different
workloads contain varying numbers of ab runs, but all have the
same one hour duration.

Table 5.5: Description of Share Master configurations. Configuration A corresponds
to weighing latency and wasted power 2:1. Configuration B weighs the
two equally, while C has a weighting ratio of 1:2.
As illustrated by the plots, and numbers in table 5.6, the differences in power
consumption are small. However, when we turn to latency it becomes clear
that configuration A, which weighs added latency double that of wasted en-
ergy, results in significantly lower total completion times. Likewise, the instan-
taneous latency is lowest for configuration A. Based on these observations,
configuration A is used in the rest of the experiments.
          Power Savings          Excess Completion Time
Workload  A      B      C        A     B     C
Normal    16.6%  15.6%  15.9%    2.0%  5.1%  4.2%
Table 5.6: Summary of Share Algorithm performance with different configurations.
Configuration A corresponds to weighing latency and wasted power 2:1.
Configuration B weighs the two equally, while C has a weighting ratio of
1:2. Percentages are calculated from means, and are relative to using no
pm.
5.2.4 Comparison of C-state Management Policies
For comparison, we plot the total average power consumption of the best per-
forming C-state policies together in figure 5.10. The figure shows clearly that
any of the policies implemented in rope will reduce the power consumption
significantly. Also, it appears that entering C3 directly is the best solution,
as this policy results in the lowest average power consumption over all the
different workload traces. This can be seen more clearly in table 5.7, which
lists mean relative reduction (when compared to using no pm) in power con-
sumption, and the amount of average excess completion time for the same
policies as plotted in figure 5.10. The Share Algorithm selection of the best
performing C-state, on average, performs the worst. This is true both with
regards to power consumption, and excess completion time. Also, entering
C3 directly conserves almost as much energy as the best performing static
policy for both the low and normal load ab traces (difference is 0.2%), and the most
for the high load trace. When looking at the excess total completion time, it
is clear that entering C3 directly performs the best on average.

Figure 5.9 (a): Power consumption. Average power consumption, summed over 1h trace, normal load.
Figure 5.9 (b): Excess completion time. Average added time before 99% of requests served, summed over 1h trace, normal load.
Figure 5.9: Comparisons of Share Algorithm configurations. The three configurations
correspond to weighing latency and wasted energy 2:1, 1:1, and 1:2.

Figure 5.10: Mean power consumed over 1 hour long ab web traffic traces. The
amount of power consumed is summed over the entire trace to better
show differences between the different pm policies.
When looking at the instantaneous per-request latencies (see figure 5.11), it
is clear that entering C3 directly results in the lowest additional latencies.
Critically, in several parts of the trace, both the Share Algorithm and the
static timeout policy with a 5𝜇𝑠 timeout result in latency spikes, while entering
C3 directly does not.
We conclude this section by noting that entering C3 directly is the C-state
management policy best fit for our workload traces. It proves to be the least
intrusive w.r.t. latency, and is one of the top candidate policies for all workloads
when power savings are considered. In addition, it is both conceptually
and implementation-wise the simplest of the C-state management policies in
rope, and thus fits with our design principles. As a result, we choose entering
C3 directly as the default C-state management policy.
          Power Savings
Workload  C3 Directly  Select Best (A)  Static, 5𝜇𝑠  Static, 25𝜇𝑠
Low       28.2%        26.9%            28.4%        28.4%
Normal    17.8%        16.6%            17.7%        18.0%
High      17.4%        13.3%            16.3%        16.5%

          Excess Completion Time
Workload  C3 Directly  Select Best (A)  Static, 5𝜇𝑠  Static, 25𝜇𝑠
Low       0.4%         1.3%             2.9%         2.6%
Normal    0.5%         2.0%             1.1%         0.2%
High      −0.4%        3.5%             1.9%         1.7%
Table 5.7: Comparison of C-state management policies. All percentages are calculated
from means, and are relative to using no pm.

Figure 5.11: Comparison of user experienced latency when employing different cpu
C-state management policies. The latency is given in ms, and normal-
ized per request. Notice that the different workloads contain varying
numbers of ab runs, but all have the same one hour duration.
5.3 CPU Performance Management
As with C-states, we measure our P-state management policies with regard
to power consumption, total completion time, and instantaneous latency. We
compare these to each other to determine which solution performs the best.
When estimating the prediction quality of our solutions, we use the workload
defined in section 5.1.4.
5.3.1 GPHT Predictor
We start by evaluating the implementation-specific properties of our gpht
predictor, and do this using internal measures such as the prediction accuracy,
and Pattern History Table (pht) hitrate. We define these internal measures
in the following way:
• Prediction accuracy: The percentage of predictions made by the gpht
predictor that proved to be correct.
• pht hitrate: The number of lookup operations that resulted in a hit,
divided by the total number of lookup operations, i.e., Lookups_Hit / Lookups_Total.
We measure both the accuracy and hitrate for different cores where we gen-
erate a synthetic load as described in section 5.1.4.
In [45], and [94], the authors study the properties of a gpht predictor for
varying pht-sizes. We verify our implementation by measuring the hitrate
and accuracy for varying pht-sizes. Our results are shown in figure 5.12, and
corroborate the findings in [94], that accuracy increases with the hitrate. This
is especially visible for core 5. We also see that the accuracy increases only
slightly with pht-size, which matches the findings in [45].
The above results indicate a tradeoff: increase the number of pht entries to
achieve higher accuracy, or keep the pht-size small to avoid using unnecessary
memory. We select a pht-size of 128 entries, as it seems a fair tradeoff.
Throughout the rest of the experiments, we use this size, and a Global Phase
History Register (gphr)-depth of 8.
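To make the two internal measures concrete, the following is a simplified sketch of a gpht-style predictor. It is not the rope implementation: the class and method names are hypothetical, the pht is modeled as a dictionary with least-recently-used replacement, and a miss is assumed to fall back to predicting the most recently observed phase. The defaults match the gphr depth of 8 and the pht size of 128 entries selected above.

```python
from collections import OrderedDict, deque


class GPHTPredictor:
    """Simplified global phase history table predictor (illustrative)."""

    def __init__(self, gphr_depth=8, pht_size=128):
        self.gphr = deque(maxlen=gphr_depth)  # most recent observed phases
        self.pht = OrderedDict()              # history -> next phase, in LRU order
        self.pht_size = pht_size
        self.lookups = self.hits = 0
        self.predictions = self.correct = 0
        self._pending = None                  # (history, predicted phase)

    def observe(self, phase):
        """Record the phase of the interval that just ended and return the
        prediction for the next interval."""
        if self._pending is not None:
            history, predicted = self._pending
            self.predictions += 1
            self.correct += (predicted == phase)
            self.pht[history] = phase         # learn what followed this history
            self.pht.move_to_end(history)
            while len(self.pht) > self.pht_size:
                self.pht.popitem(last=False)  # evict the least recently used entry
        self.gphr.append(phase)
        if len(self.gphr) < self.gphr.maxlen:
            self._pending = None
            return phase                      # warm-up: predict the last value
        history = tuple(self.gphr)
        self.lookups += 1
        if history in self.pht:
            self.hits += 1
            self.pht.move_to_end(history)
            prediction = self.pht[history]
        else:
            prediction = phase                # miss: fall back to the last value
        self._pending = (history, prediction)
        return prediction

    def hitrate(self):
        return self.hits / self.lookups if self.lookups else 0.0

    def accuracy(self):
        return self.correct / self.predictions if self.predictions else 0.0
```

On a strictly periodic phase sequence, this sketch misses only while the table is being filled, after which both the hitrate and the accuracy approach 100%.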
5.3.2 Comparison of Prediction Accuracy
Next, we compare the naive forecaster to the gpht predictor described in
the previous section. As shown in table 5.8, the naive forecaster matches or
outperforms the more complex gpht predictor for all loads. We find this result
surprising, and investigate the issue further in the following paragraphs.

Figure 5.12: Evaluation of gpht accuracy and hitrate. Both plots created using synthetic
workloads similar to those listed in table 5.8, and plotted in figure
4.16.

                         Accuracy
Load                     Naive Forecast  gpht Predictor
Low synthetic            39.6%           36.6%
High synthetic (bursty)  85.8%           84.7%
ab High load             99.9%           99.9%
High Synthetic (stable)  99.6%           99.5%
Table 5.8: Prediction accuracies of P-state prediction algorithms. We find it surprising
that the naive forecaster performs so well.

Many different accuracy metrics have been proposed for evaluating algorithms
that produce forecasts based on historical values. However, several of
the commonly used metrics are ill-suited for use with real data, as pointed
out by Hyndman and Koehler in [44]. The reason why many common measures
are unfit for evaluating accuracy in real situations is that occurrences
of zero and small values result in division by zero problems and very large
values. Some examples of commonly used measures that exhibit problematic
behavior for real data are:
• Mean Squared Error/Root Mean Squared Error: sensitivity to outliers.
• Measures based on percentages: infinite or undefined for occurrences
of zero, extremely skewed distributions when the observed value is close to
zero.
• Measures based on relative error: cannot be used across data series.
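Instead, the authors of [44] suggest the use of mean absolute scaled error
(mase), which is suitable for all situations, also real data where zero, negative,
or small values occur. mase is defined as:
\mathrm{MASE} = \frac{\frac{1}{n}\sum_{t=1}^{n} |e_t|}{\frac{1}{n-1}\sum_{i=2}^{n} |Y_i - Y_{i-1}|}
(5.1)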
Where 𝑌𝑖 is the observed value at time 𝑖, and 𝑒𝑡 is the error made by the fore-
caster being evaluated, compared to the forecast given by the naive forecaster
at the time 𝑡.
When mase < 1, the algorithm being evaluated results in, on average, smaller
errors than the naive forecaster, i.e., predicting the next interval to be equal
to the previous one. If mase > 1, no improvement over the naive forecast is
achieved. The only circumstance under which the mase can be either undefined
or infinite is when all the historical data points being used in the estimation
are equal (see footnote 7). Other key properties are the ease with which results
can be compared, and that it is applicable across data series with different scales.
Figure 5.13 shows a plot of the mase values of different prediction algorithms for our
prediction trace. Note that only the data from a single core was used to
generate this plot.
7. We use this result when deciding on benchmarks for evaluating predictor accuracy.
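The computation of the mase can be sketched as follows, using the definition of Hyndman and Koehler in which the mean absolute error of the evaluated forecaster is scaled by the mean absolute error of the naive forecast over the same series. The function name and the choice of returning `None` for the undefined case are assumptions of this example.

```python
def mase(observed, forecasts):
    """Mean absolute scaled error of forecasts against observed values.

    observed[t] is the value at time t and forecasts[t] is the value the
    evaluated algorithm predicted for time t. Returns None when all
    observations are equal, since the scaling term is then zero.
    """
    if len(observed) != len(forecasts) or len(observed) < 2:
        raise ValueError("need two equally long series with at least 2 points")
    n = len(observed)
    # Mean absolute error of the naive forecast (next value = previous value).
    naive_mae = sum(abs(observed[i] - observed[i - 1]) for i in range(1, n)) / (n - 1)
    if naive_mae == 0:
        return None
    mae = sum(abs(y - f) for y, f in zip(observed, forecasts)) / n
    return mae / naive_mae
```

The `None` case is the one noted above: when every observation is equal, as for a trace that stays in a single P-state, the measure is undefined.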
Figure 5.13: The figure shows the mase of various prediction algorithms. Notice
that this plot only contains data from one core, which runs a compute
intensive and bursty synthetic load. The mase is calculated with 𝑛 =
2000.
Using logs collected from runs with our workload for prediction quality testing,
we also evaluate the merits of moving average and exponential moving average
predictors offline. These are common statistical prediction algorithms, and
their definitions are given in equations 5.2 and 5.3, respectively.
The simple moving average (sma) is given by:
\mathrm{SMA} = \frac{p_m + p_{m-1} + \dots + p_{m-(n-1)}}{n}
(5.2)
where 𝑛 is the number of data points to consider. The exponential moving
average (ema) is defined recursively as:
S_1 = Y_1, \qquad S_t = \alpha Y_{t-1} + (1 - \alpha) S_{t-1} \quad \text{for } t > 1
(5.3)
where 𝛼 represents the degree of weighting decrease (0 < 𝛼 < 1), 𝑌𝑡 is the
value at time 𝑡, and 𝑆𝑡 is the ema at time 𝑡.
For all our experiments, we use a window length equal to the gphr depth in our
gpht predictor, i.e., 𝑛 = 8. For the exponential moving average, we use 𝛼 = 0.1.
Results for cores running different loads are given in table 5.9.
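The two statistical predictors can be sketched directly from equations 5.2 and 5.3. The function names are hypothetical, and the defaults correspond to the window length of 8 and the weighting of 0.1 used in our experiments. How a real-valued forecast is mapped to a discrete P-state is left out of this sketch.

```python
def sma_forecast(history, n=8):
    """Forecast the next value as the mean of the last n observations
    (equation 5.2). Uses fewer points if the history is shorter than n."""
    window = history[-n:]
    return sum(window) / len(window)


def ema_series(values, alpha=0.1):
    """Return S_1..S_T as defined in equation 5.3:
    S_1 = Y_1 and S_t = alpha * Y_{t-1} + (1 - alpha) * S_{t-1} for t > 1."""
    smoothed = [values[0]]
    for t in range(1, len(values)):
        smoothed.append(alpha * values[t - 1] + (1 - alpha) * smoothed[-1])
    return smoothed
```

A small value of 𝛼 makes the exponential moving average react slowly to changes, which is consistent with its weak results on bursty loads in table 5.9.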
The table shows that the naive forecaster performs the best w.r.t. accuracy
in three out of four cases. The only workload where any of the statistical
approaches seems to have any merit is the low utilization synthetic load,
which corresponds to very short bursts of prime number generation, and short
but frequent sleep-times (1ms and 1ms, respectively).

                         Naive Forecast  gpht Predictor       Moving Avg.          Exp. Moving Avg.
Load                     Accuracy        Accuracy  mase       Accuracy  mase       Accuracy  mase
Low synthetic (Bursty)   39.6%           36.6%     1.05       46.8%     0.86       51.0%     0.70
High synthetic (Bursty)  85.8%           84.7%     1.03       58.0%     3.70       51.0%     4.17
ab High load             99.9%           99.9%     Undefined  99.9%     Undefined  99.9%     Undefined
High Synthetic (Stable)  99.6%           99.5%     Undefined  99.5%     Undefined  99.5%     Undefined
Table 5.9: Prediction accuracies and mase values for P-state prediction algorithms.
Notice that mase is undefined only if all historical data points are equal.
This means that the ab trace is unfit for testing predictors, as it represents
the trivial case. mase calculated with 𝑛 = 1000.
5.3.3 Comparison of P-state Management Policies
We now turn to the external measures. As shown in figure 5.14, there is
little difference in the power consumption when using the different P-state
management policies. However, the gpht predictor performs better when
we consider the total completion time. As shown by figure 5.15, the naive
forecast results in considerably larger completion times. The same is true for
the instantaneous per-request latencies, as shown in figure 5.16. Table 5.10
summarizes these observations.
          Power Savings          Excess Completion Time
Workload  Naive Forecast  gpht   Naive Forecast  gpht
Low       15.9%           15.9%  9.0%            0.6%
Normal    15.1%           13.4%  10.0%           0.5%
High      15.4%           15.2%  7.5%            −1.7%
Table 5.10: Comparison of P-state management policies. All percentages calculated
from means, and are relative to using no pm.
Although the gpht predictor performed worse than the naive forecaster
accuracy-wise, it results in much better latencies. Recall that the naive forecaster
has substantially lower overhead, 3752 compared to 15456 cycles on
average. It appears that the circumstances under which predictions are correct
are more important than the average accuracy, i.e., if a prediction is wrong
when the load is low, this is less detrimental to performance than if the wrong
prediction is made under high load. We speculate that the gpht predictor is
able to react to patterns in the workload, reducing the latency, perhaps at the
expense of wrong predictions during periods of lower load. This could explain
the gpht predictor using slightly more power than the naive forecaster.

Figure 5.14: Power consumed over 1 hour long ab web traffic traces. The amount
of power consumed is summed over the entire trace to better show
differences between the use of only P-states and no pm.

Figure 5.15: Excess latency from using available P-states to save energy. The measure
of latency is time before 99% of requests are served, and the plots show
the difference between using P-states, and no pm.

Figure 5.16: Instantaneous latency of using P-states for power management.
To conclude, there is little difference in the energy savings between the naive
forecaster and the gpht predictor. However, we select the gpht predictor as
the default P-state management policy in rope, as it results in significantly
lower latencies and total completion times for our workload traces.
5.4 Core Parking Scheduler
Finally, we turn to evaluating our energy efficient dynamic round-robin (drr)
scheduler. Recall that the main goal of our energy efficient scheduling
algorithm is to maximize the number of cores that are sleeping. drr
accomplishes this by packing tasks onto a subset of the cores until a threshold,
𝜃recruit, is reached for the active cores. Further, cores whose utilization is below
𝜃retire are moved into the retiring state, where they remain for a given
time, 𝑇retire, before ending up as inactive cores. If a new task arrives, it is
assigned to an already active core if this is possible. If the total utilization
is greater than 𝜃recruit, a core will:
1. Be unretired, if any cores are currently in the retiring state.
2. Otherwise, be reactivated and awoken from the inactive state.
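The transitions described above can be sketched as a small state machine. This is an illustrative sketch only, not the Vortex implementation: the class and function names are hypothetical, utilization is assumed to be a percentage per core, and the "total utilization" compared against 𝜃recruit is assumed here to be the mean utilization of the active cores.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Optional


class CoreState(Enum):
    ACTIVE = "active"
    RETIRING = "retiring"
    INACTIVE = "inactive"


@dataclass
class Core:
    core_id: int
    state: CoreState = CoreState.INACTIVE
    utilization: float = 0.0     # percent, 0-100 (assumed unit)
    retiring_since: float = 0.0  # time at which the core entered RETIRING


def recruit_core(cores: List[Core], theta_recruit: float) -> Optional[Core]:
    """Bring one more core into service if the active cores are saturated."""
    active = [c for c in cores if c.state is CoreState.ACTIVE]
    if active:
        load = sum(c.utilization for c in active) / len(active)
        if load <= theta_recruit:
            return None  # the active cores can absorb the new task
    # 1. Prefer a retiring core, which has not yet gone to sleep.
    for core in cores:
        if core.state is CoreState.RETIRING:
            core.state = CoreState.ACTIVE
            return core
    # 2. Otherwise wake an inactive core.
    for core in cores:
        if core.state is CoreState.INACTIVE:
            core.state = CoreState.ACTIVE
            return core
    return None  # every core is already active


def update_retirement(cores: List[Core], now: float,
                      theta_retire: float, t_retire: float) -> None:
    """Retire under-utilized cores and deactivate cores whose timer expired."""
    for core in cores:
        if core.state is CoreState.ACTIVE and core.utilization < theta_retire:
            core.state = CoreState.RETIRING
            core.retiring_since = now
        elif (core.state is CoreState.RETIRING
              and now - core.retiring_since >= t_retire):
            core.state = CoreState.INACTIVE
```

Preferring a retiring core over an inactive one avoids a wakeup from a C-state, which is the event the internal measure below counts.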
Our implementation of an energy efficient drr scheduler only considers
threads. This is important, as the omni-kernel architecture (oka) is based on
communication through message passing. Because our scheduler is not used
for message processing, the ab workload traces are unfit to test how well our
scheduler packs work onto cores: only a tiny fraction of the total
workload (the Apache user-level threads) is under the scheduler’s control. We
have experimented with using the drr scheduler for message processing and
other Omni-kernel resources as well, but currently this branch of Vortex is not
stable enough to support our long-running experiments. Conceptually, however,
the oka is designed with the idea that plug-in schedulers, for instance
energy efficient ones, can be used for all resources.
Instead of the ab workload traces, we use a synthetic trace as described in
section 5.1.6 when evaluating the properties of our drr implementation.
Note, however, that we do use the ab traces to evaluate the intrusiveness of
the scheduler.
5.4.1 Internal Measures
We use the number of wakeups, i.e. the number of times a core is awoken
from a C-state, C𝑛, 𝑛 > 0, as an internal measure for how well our drr
scheduler is able to pack tasks onto cores. If the number of wakeups is low,
this means that many cores are idle, and that the active cores are used to
execute arriving tasks. We run a synthetic workload where we fork off a new
thread consuming 2% of a cpu-core’s compute resources every two seconds.
We do this until we reach a total system utilization of 300%, at which point
we read how many wakeups have occurred.
We vary 𝜃recruit and 𝑇retire, and as shown in figure 5.17 a), the number of
wakeups is highly dependent on 𝜃recruit. As 𝜃recruit increases, the number of
wakeups decreases. Also visible in a) is a tendency of diminishing returns: at
𝜃recruit > 80, only slight decreases are obtained. However, the latency of using
the drr scheduler will continue to increase, as more and more threads are
queued for execution on the subset of active cores. How long threads are
queued depends both on the properties of the thread population and on the
configured maximum time slice. For example, if 10 threads are queued, and
each of these uses its 5 ms time slice fully, this results in 10 · 5 ms = 50 ms
of queuing time. Plot b) shows that the number of wakeups is less dependent
on the retirement time threshold, and there does not seem to be any general
pattern in how 𝑇retire alters the number of wakeups.
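The queuing example above generalizes to threads that do not all use their full time slice. The sketch below is an illustration under simplifying assumptions: the function name is hypothetical, each queued thread is assumed to run once before the newly queued thread, and scheduling overhead is ignored.

```python
from typing import Sequence


def queue_delay_ms(bursts_ms: Sequence[float], time_slice_ms: float) -> float:
    """Time a newly queued thread waits while each thread ahead of it runs once.

    A thread runs for its burst length, capped by the maximum time slice.
    """
    return sum(min(burst, time_slice_ms) for burst in bursts_ms)
```

With ten threads that each use a full 5 ms slice, the function returns the 50 ms from the example; shorter bursts reduce the delay accordingly.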
(a) Varying 𝜃recruit, constant 𝑇retire = 500000 and 𝜃retire = 50. Plotted series:
Energy Efficient Scheduler (DRR), Vortex Standard Scheduler.
(b) Varying 𝑇retire, constant 𝜃recruit = 85 and 𝜃retire = 50. Plotted series:
Energy Efficient DRR with Θrecruit = 85, Energy Efficient DRR with Θrecruit = 70,
Vortex Standard Scheduler.
Figure 5.17: Properties of energy efficient drr scheduler. All values are means
calculated for at least 15 measurements over four different physical hosts.
5.4.2 Comparison with Standard Vortex Scheduler
In this section we compare our drr implementation to the default thread
scheduler in Vortex. This scheduler works by assigning a thread to a core in
a round-robin fashion the first time it is run. The thread will then continue
to run on this core until it is finished or destroyed.
Recall that the energy efficient drr scheduler is intended to complement
existing C- and P-state management policies. When testing our scheduler, we
use the following common configuration:
• C-state policy: Enter C3 directly.
• P-state policy: gpht predictor with gphr depth = 8 and pht size = 128.
We run three different workloads, each created by forking off a new thread
consuming 2% of a cpu-core’s compute resources every two seconds. This is
done for varying global utilizations of 150%, 350%, and 500%.
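To make the two gpht parameters in the configuration above concrete, the following is a minimal sketch of a global phase history table predictor. It is an illustration, not the rope implementation: the class and method names are hypothetical, phases are assumed to be small integer identifiers, and least-recently-used replacement in the pattern history table is an assumption. Only the default values (gphr depth 8 and pht size 128) are taken from the configuration above.

```python
from collections import OrderedDict, deque
from typing import Deque, Tuple


class GphtPredictor:
    """Predicts the next phase from the most recent phase history."""

    def __init__(self, gphr_depth: int = 8, pht_size: int = 128) -> None:
        # Global phase history register: the last gphr_depth observed phases.
        self.gphr: Deque[int] = deque(maxlen=gphr_depth)
        # Pattern history table: history pattern -> phase that followed it.
        self.pht: "OrderedDict[Tuple[int, ...], int]" = OrderedDict()
        self.pht_size = pht_size

    def observe(self, phase: int) -> int:
        """Record the observed phase and return a prediction for the next one."""
        history = tuple(self.gphr)
        if len(history) == self.gphr.maxlen:
            # Learn that the previous history was followed by this phase.
            self.pht[history] = phase
            self.pht.move_to_end(history)
            if len(self.pht) > self.pht_size:
                self.pht.popitem(last=False)  # evict least recently used entry
        self.gphr.append(phase)
        key = tuple(self.gphr)
        if key in self.pht:
            self.pht.move_to_end(key)
            return self.pht[key]
        return phase  # no matching pattern: fall back to the last value
```

On a miss the sketch behaves like the naive forecaster, so any benefit comes only from history patterns that recur in the workload.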
















Figure 5.18: Power consumption of synthetic loads using energy efficient drr
scheduler. The plot shows average power consumption, summed over the 1 h trace,
for the normal scheduler and the energy efficient scheduler (DRR).
Figure 5.18 shows the power consumption with the default and drr schedulers.
As can be seen, the energy efficient drr scheduler consumes less power on
average for all three workloads. Table 5.11 lists these differences in percent-
ages: relative to using the default scheduler, energy efficient drr consumes
3.5%–4.2% less power on average.
Table 5.11: Relative energy saving when using energy efficient scheduler. Percentages
are calculated from means, and are relative to using the Vortex standard
scheduler.
5.4.3 Effects of Energy Efficient Dynamic Round-Robin Scheduling
We now describe the effects of introducing the energy efficient drr to Vortex.
Figure 5.19 shows the amount of power consumed over the one hour long ab
traces for three combinations of pm policies that are facilitated by rope. Also
plotted is the power consumption if no pm is performed. As expected, when
serving ab workload traces, there is no decrease in power consumption when
introducing energy efficient drr scheduling. As mentioned, this is because
most of the workload is outside the scheduler’s control.
Figure 5.19: Mean power consumed over 1 hour long ab web traffic traces. The
amount of power consumed is summed over the entire trace to better
show differences between the different pm policies.
With regard to total completion time, energy efficient drr does not seem
to have a significant impact when compared to using only a combination of
entering C3 directly and the gpht P-state predictor. A plot of this is given
in figure 5.20. The same is true for instantaneous latency, which is plotted in
figure 5.21.
Figure 5.20: Mean excess completion time from using different combinations of pm
policies. The measure of completion time is the time before 99% of
requests are served.
Table 5.12 shows the relative power savings and added completion times for
the three combinations of pm policies. The numbers do not indicate any
detrimental effect of introducing energy efficient drr when serving ab workload
traces. Also, as shown in table 5.11, energy can indeed be saved by the
scheduler. When interpreting these numbers, it is important to notice that ab is
a closed-loop benchmark: if ab is run with a concurrency level of 𝑛, a new
request is initiated only when an outstanding one has finished. Consequently,
very small delays impact the throughput directly, and can thus sum
to relatively large numbers. In effect, the measure of excess
completion time might exaggerate the actual performance implications to
some extent.
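The closed-loop effect can be illustrated with a small model. This is a sketch under simplifying assumptions, not a model of ab itself: all requests are assumed to have the same latency, the 𝑛 connections are assumed to be perfectly independent, and the function names and example numbers are hypothetical.

```python
def closed_loop_completion_s(total_requests: int, concurrency: int,
                             latency_ms: float) -> float:
    """Completion time when each connection issues its next request only
    after the previous one has finished."""
    requests_per_connection = total_requests / concurrency
    return requests_per_connection * latency_ms / 1000.0


def excess_completion_pct(base_latency_ms: float, added_latency_ms: float) -> float:
    """Relative growth in completion time caused by a fixed per-request delay."""
    return 100.0 * added_latency_ms / base_latency_ms
```

In this model, adding 0.1 ms to a 1 ms request extends a run of 100000 requests at concurrency 10 from 10 s to 11 s, so a delay that is invisible to an individual client appears as 10% excess completion time.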
Figure 5.21: Mean user-perceived latency added by using various pm policies. The
latency is given in ms, and normalized per request. Notice that the
different workloads contain varying numbers of ab runs, but all have
the same one hour duration. Plotted series: C3 Directly + GPHT + Scheduler (DRR),
C3 Directly + GPHT, C3 Directly.
Power Savings
Workload   C3 Directly   C3 Directly + gpht Predictor   C3 Directly + gpht + drr
Low        28.2%         27.8%                          27.5%
Normal     17.8%         22.7%                          22.1%
High       17.4%         22.0%                          22.3%

Excess Completion Time
Workload   C3 Directly   C3 Directly + gpht Predictor   C3 Directly + gpht + drr
Low        0.4%          20.9%                          7.8%
Normal     0.5%          16.9%                          14.3%
High       −0.4%         6.0%                           7.8%

Table 5.12: Comparison of selected pm policies in rope. All percentages calculated
from means, and are relative to using no pm.
5.5 Performance Comparison of Power Management Policies
While we use ab traces to measure power consumption and performance in
terms of latency, we turn to tpc-c and HammerDB for measuring throughput.
As described in section 5.1.5, tpc-c is a benchmark for relational databases
that mimics the usage patterns of operators making queries and updates in a
series of warehouses.
Table 5.13 contains the results from the tpc-c benchmark run with Ham-
merDB. There are some key points in this table that we want to empha-
size:
• Employing no pm results in the highest performance.
• When using only P-states to conserve energy, the naive forecaster
performs better than the gpht predictor. This is to be expected, as both
tpm and nopm are measures of throughput, and the overhead of the
naive forecaster is significantly lower than that of the gpht predictor.
That is, while the gpht predictor manages to provide better response
times, the naive forecaster gives higher throughput.
• Both of the P-state management policies result in lower levels of
performance than the C-state management policies. We believe this is because
of the intervals at which we adjust the frequencies of cpu cores, which
is every 100ms. By only entering C3 directly, the C-state management
policies are able to race-to-halt, finishing off any work as quickly as
possible. The P-state policies, on the other hand, risk remaining at a
prohibitively low frequency for the duration of the sampling interval.
We think that simply reducing the time between P-state management
predictions will alleviate this issue.
• When comparing the C-state management policies to each other, it
is difficult to make any claims with regard to performance, but the
standard deviations, and thus the variability, seem to be larger when
entering C3 directly (see also the two combinations of C- and P-states).
• The introduction of the energy efficient drr scheduler reduces
performance considerably: compared to the same C- and P-state policies
without the scheduler, tpm drops from 663692 to 381512, or roughly 43%.
This is not unexpected, as the scheduler is naive
in its nature. No action is taken to keep caches warm, and threads risk
being shuffled between cores frequently. Further, because the threshold
for recruiting additional cores is relatively high (𝜃recruit = 85), threads
can experience increased queue times. In addition, the implementation
itself is not optimal. For instance, a global lock is used to ensure
mutual exclusion when obtaining scheduling decisions. As system-wide
utilization increases, this lock will be prone to contention.
pm-policy                                     tpm                nopm
No pm                                         732615 ± 108647    11208 ± 1660
P-states Only (gpht)                          630601 ± 81778     9133 ± 2376
P-states Only (naive forecast)                653254 ± 69413     9991 ± 1058
C3 Directly                                   690642 ± 130245    10554 ± 1982
Select Best Performing                        698429 ± 88799     11208 ± 1660
C3 Directly + P-states (gpht)                 663692 ± 120747    10155 ± 1840
C3 Directly + P-states (naive forecast)       635293 ± 82986     9723 ± 1273
C3 Directly + P-states (gpht) and Scheduler   381512 ± 32747     5834 ± 500

Table 5.13: Summary of tpc-c performance for various pm algorithms (mean ± std.
deviation).
5.5.1 Summary
In this chapter we have given detailed descriptions of our experiments and the
properties of the pm policies available in rope. We have seen that some of
our results contradict those presented in previous work. In chapter 7, we will
discuss our findings in more detail.
6
Related Work
This chapter describes related work. Power management of computer systems
is a vast field, and we limit our review to works regarding power management
(pm) of cpus, and energy efficient scheduling. We also discuss some of the
other approaches to tackling the issues of energy efficiency in cloud computing
and data centers (dcs).
6.1 Power Management of CPUs
We start by discussing a selection of previous work focusing on pm of cpus.
When adjusting cpu core frequencies, we use the “mem/𝜇-ops” ratio to
classify workloads into different execution phases, similarly to Isci et al. [45]. In
their work, the authors perform statistical machine learning by employing
a gpht predictor. This predictor is instantiated within Linux as a loadable
kernel module (lkm), and is used to make dvs decisions. While we focus on
multi-core x86, Isci et al. target a uni-core mobile platform. Further, the entire
operation of the solution proposed by Isci et al. takes place within a performance
monitoring interrupt (pmi) handler.
Likewise, Moeng and Melhem [61] use statistical machine learning to adjust
cpu frequency and voltage. Using a decision tree, their machine learning
algorithm maximizes energy efficiency. As in our solution, each core
independently sets its frequency and voltage. Further, cache access and miss rates
obtained from pmcs are used to classify phases of execution. The obtained
data is then manipulated and compressed into a decision tree, which is used
to decide frequency and voltage at runtime. Unlike our gpht solution, their
algorithm is not trained online. Instead, the system is trained prior to
deployment by running different workloads and having cores run with random
frequencies and voltages.
A somewhat similar approach is taken by AbouGhazaleh et al. [9]. pmcs
are used to sample metrics such as the number of cache accesses and last-
level cache (llc) misses. These are used to split the workload into execution
phases, which in turn are used to select frequency and voltage. Similarly to
[61], a supervised machine learning technique is used to train the system prior
to deployment. Apart from being pre-trained, their solution differs from ours
in two distinct ways. First, their solution relies on a power-aware compiler
in order to characterize execution phases. Second, they focus on embedded
processors with multiple clock domains. In contrast, all policies in rope are
independent of the applications being executed.
In [74], Rajmani et al. employ a scheme very similar to our gpht predictor
for adjusting performance. A three-part solution built on monitoring, prediction,
and effectuation is used to select cpu P-states obtained via acpi tables.
Also, pmcs are used to characterize the workload. Their solution differs from
ours in that the employed prediction scheme is based on a static set of equations.
Thus, it is unable to adapt to the workload over time. Further, their
entire solution is controlled by a user-level application that accesses drivers
to monitor workload behavior and adjust the cpu frequency. rope, on the
other hand, resides entirely within the Vortex kernel.
Finally, in [47], Jung and Pedram present a pm framework for multi-processor
systems that predicts a performance state, and then extracts an optimal pm
decision from a pre-computed policy table. The predictions are made using
a Bayesian classifier, which has been pre-trained using supervised learning.
The authors’ main rationale for choosing this solution is that it minimizes
overhead related to the classification process. The rope gpht predictor is
an online algorithm, and sacrifices latency for the ability to dynamically adapt
to any workload.
6.1.1 Share Algorithm
The Share Algorithm has been employed for power management on multiple
occasions. In [39], an approach similar to ours is used to perform adaptive
disk spin-down for mobile computers. The authors find the algorithm to be capable
of successfully adapting to the recent disk activity, performing better than
other previously known approaches.
In [54], the authors target power management of the cpu by employing the
Share Algorithm. They also employ rle to minimize the memory footprint,
but their solution differs from ours in that they only target power manage-
ment for mobile computers at the cpu-package level. They also use statistical
sampling when generating their rle traces. We focus on power management
on a per-core basis, enabling finer granularity in the pm decision making
process. Both [39] and [54] use the Share Algorithm to create a weighted
sum of expert outputs—each being a static timeout value. Our experiments
indicate this solution to be unfit for our problem, as the reasoning behind
their cost functions implies naively busy-waiting for timeouts to expire. This
would increase power consumption unnecessarily.
A solution much like ours is used in [28] to support dynamic power
management of both hard disk drives and a wlan card. As in our policy for
selecting the best performing C-state, the authors use the Share Algorithm
to select a single best expert to use at any point in time. The same approach
is also used in [37] to select one among several available caching policies.
Dhiman and Rosing use the Share Algorithm to select the best performing
frequency/voltage pair [29]. One expert is used to represent each of these
pairs, and pmc readings similar to ours are used when calculating the loss of
each of the experts. However, the pmc readings are logged and maintained
per-task. This enables the selection of frequency and voltage based on the
currently executing task’s observed properties.
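The use of the Share Algorithm to select a single best expert can be sketched as follows. This is a generic fixed-share formulation given for illustration; it is not taken from rope or from the cited works, whose loss functions and update rules may differ. The names, the learning rate eta, and the share rate alpha are assumptions.

```python
import math
from typing import List, Sequence


def share_update(weights: Sequence[float], losses: Sequence[float],
                 eta: float, alpha: float) -> List[float]:
    """One round of a fixed-share weight update over a set of experts."""
    n = len(weights)
    if n < 2 or len(losses) != n:
        raise ValueError("need at least two experts and one loss per expert")
    # Loss update: penalize each expert exponentially in its loss.
    updated = [w * math.exp(-eta * loss) for w, loss in zip(weights, losses)]
    total = sum(updated)
    # Share update: each expert keeps (1 - alpha) of its weight and receives
    # an equal part of what the other experts give up, so that no expert's
    # weight decays to zero and the selection can follow workload changes.
    shared = [(1.0 - alpha) * w + alpha * (total - w) / (n - 1) for w in updated]
    norm = sum(shared)
    return [w / norm for w in shared]


def best_expert(weights: Sequence[float]) -> int:
    """Index of the expert with the largest weight."""
    return max(range(len(weights)), key=lambda i: weights[i])
```

With one expert per C-state or per frequency/voltage pair, the index returned by best_expert would identify the state to use in the next interval.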
6.2 Energy Efficient Scheduling
Next, we look at energy efficient scheduling.
In [91], Weissel and Bellosa propose an energy-aware scheduling policy based
on the use of event counters. By exploiting the information present in pmcs,
the scheduler is able to determine the appropriate dvs setting. This is done
on a per-thread basis, and the scheduler scales the cpu voltage each time
a thread is scheduled for execution. The chosen frequency is calculated by
evaluating the event rates in the recent history of the thread. In contrast,
our energy efficient drr scheduler only considers the utilization of the cpu-
cores in order to assign tasks. It is also completely decoupled from the dvs
management.
Cai et al. [16] use two techniques, (i) meeting point thread characterization
and (ii) thread delaying, to reduce energy consumption in parallel
applications. The first technique is used to characterize threads as either critical or
non-critical. Critical threads—the ones that run slowest—should be run at the
maximum available frequency to avoid performance degradation, while dvs
can be used to scale down cores executing non-critical threads. In the ideal
situation, threads on different cores are executed at the frequencies that allow
them to reach their synchronization point simultaneously. The authors extend
this in [73], where threads with similar criticality are scheduled onto the same
core via thread migration. This increases the chances that threads reach the
synchronization point at the same time, allowing even greater power sav-
ings. Again, the tight coupling between scheduling and dvs is fundamentally
different from what we have implemented in rope.
In [93], Ye and Xu propose a machine learning based dpm framework for
scheduling tasks on multi-core processors. The scheduler, which is based on
reinforcement learning, assigns tasks to cores such that a tradeoff between
idle times and performance is achieved. Lin et al. [55] present a system
for energy efficient scheduling of vms that is similar to our energy efficient
topology agnostic drr. Their vm-scheduler relies on the same structure of
active, retiring, and inactive hardware. They also use similar mechanisms
for determining when to move hosts between the different states. In many
ways, our scheduler can be seen as a re-purposing of this work, although the
massive difference in timescales (their vms are load balanced between hosts
in the order of minutes and run for many hours) leads us to enforce and choose
our thresholds differently. Many of the ideas and concepts used in our energy
efficient scheduler are also discussed by Siddha et al. [83] in their proposed
power saving Linux scheduler.
A large body of work targets how to best run tasks in time sensitive envi-
ronments while reducing the power consumption. A dvs capable real-time
scheduler is presented by Pillai and Shin in [72]. Their scheduler provides the
energy savings of conventional dvs approaches, while preserving deadline
guarantees. In [38], Gruian provides hard deadline guarantees by employing
both online and offline decision making. Yuan and Nahrstedt present
GRACE-OS, an energy efficient soft real-time cpu scheduler, in [95].
GRACE-OS integrates dvs in a soft real-time scheduler that controls when, how fast,
and how long to execute tasks. Scheduling decisions are made according to
probability distributions of cycle demands. These distributions are obtained
via online profiling.
6.3 Power Management in Data Centers and Cloud
Much work has gone into designing solutions that intelligently place work-
loads at different physical hosts within data centers. In [64], Moore et al.
leverage information about hot and cold locations within dcs to create
temperature aware scheduling algorithms. Their proposed solution places
incoming workloads intelligently throughout a dc, and is able to reduce cooling
costs by a factor of two when compared to location-agnostic solutions. In
[13], Bash and Forman study the effect of placing tasks at cooling efficient
locations within a dc. They propose a method for ranking servers according
to cooling efficiency, and experimentally validate that there is substantial
potential for energy savings if the cooling characteristics of the dc are taken into
consideration.
In [43], the authors explore power-efficient consolidation and distribution
of vms. They claim that maximum power-efficiency can be obtained if the
attributes of applications running within vms are taken into consideration,
and that maximum power-efficiency is achieved when vms are consolidated
such that resources available at a physical host machine are fully utilized.
They propose a load balancing algorithm that combines jobs that consume
different resources on the same physical hosts.
Sharifi et al. propose an energy-aware vm scheduling algorithm in [81]. This
algorithm is formulated using a set of objective functions describing a con-
solidation fitness metric. Their proposed solution minimizes the total energy
consumption of physical hosts in a whole dc while only incurring marginal
performance degradation.
In [32], Femal and Freeh employ a distributed algorithm to allocate power, as
opposed to work, to different physical hosts. Their goal is to increase aggregate
performance while distributing the available global power under a set of
operating constraints. One such constraint could be the globally available
power, as limited by the actual circuits. In this case, the globally available
power is allocated to hosts according to their contribution to processing the
aggregate workload. This way, maximum throughput can be achieved while
respecting upper bounds for power consumption.
Kotla et al. [51] use execution characteristics of the work currently running
to predict the achievable performance at available frequency settings. They
slow the nodes non-uniformly in response to their performance demands,
essentially running each cpu at the lowest possible frequency able to meet
performance requirements. Using this approach, they are able to maintain
user-perceived performance while responding to fluctuations in the amount of
available energy. Similarly, Sharma et al. [82] provide power-aware qos in dc
web servers. Their solution minimizes energy consumption while meeting per-class
delay constraints. Using a feedback loop, they regulate cpu frequency
and voltage to keep the resource utilization around a schedulability bound. By
enforcing this bound, they ensure that deadlines are met, effectively running
the tasks as slowly as possible while still preserving qos.
With VirtualPower [66], the authors present a system that integrates pm with
virtualization technologies commonly employed in dcs. By exposing “soft”
versions of the underlying hardware power states, each guest os is allowed to
implement its own pm policy. These decisions are then globally coordinated
by VirtualPower, and the updates made to soft vm power states are mapped to
actual power states or allocation of virtualized hardware. Using this scheme,
power consumption is minimized while the ability to meet application re-
quirements is retained.
In [34], Goiri et al. present Parasol: a prototype green energy data center, and
GreenSwitch: a model-based framework for dynamically scheduling work-
loads and selecting which energy source to use. Parasol features multiple
power sources, namely a set of solar panels, a battery bank, and a standard
grid connection. GreenSwitch predicts future workloads and renewable en-
ergy production, which in turn are used together with current battery levels to
generate energy- and workload schedules by a solver. These schedules deter-
mine how much energy should be used in the future. Using GreenSwitch and
a set of MapReduce workloads, the authors demonstrate that intelligent
management of energy sources and clever placement of workloads can result in
significant reductions in energy costs. Aksanali et al. also leverage predictions
about future available green energy in [10]. Their work presents an adaptive
dc job scheduler which, based on short-term predictions of available wind
and solar energy, scales the number of scheduled jobs. In addition to reducing
the average task completion time for batch jobs, the consumed amount of
non-renewable energy is reduced.
In [58], Mathew et al. propose a technique to power down content delivery
network (cdn) servers during periods of low load. Their solution seeks to
maximize power savings while minimizing both client perceived performance
degradation and wear and tear on hardware resulting from excessive on-off
server transitions. Their experiments show great promise, reducing energy
consumption by 55% while still maintaining service level agreements (slas)
and minimizing server wear and tear.
Chihi et al. [18] also scale the number of active vms in a cloud up and down
automatically. This scaling is based on a prediction module built on a neural
network. By auto-scaling the available cloud resources according to the
actual workload, server utilization is increased and power consumption is
reduced.
In [59], Meisner et al. introduce the PowerNap server architecture. Instead
of focusing on traditional and complex solutions for fine-grained control of
power- and performance states, load balancing, and so on, PowerNap aims
to reduce energy consumption by the use of non-traditional hardware. By
responding to instantaneous load, the system rapidly transitions between a
normal high-performance state, and a near-zero power idle state. As soon as
a server exhausts its work queue, it transitions into the nap state, consuming
almost no power. When work arrives, processing is started in the nap state
and a transition to the high-performance state is initiated instantly. Using
simulations with real traces of normal server loads, the authors show results
superior to traditional dvs approaches with respect to both energy savings
and response times.
Using Elastic Tandem Machine Instances, Dürr couples low power system on
chip (soc) machines and high powered vm instances [31]. The low power
machine is always on, serving low load and ensuring availability. When load
rises, the high performance vm is activated, and the workload is transferred
to it using software defined networking techniques. The approach is shown to be
viable using a real prototype.
6.4 Reduction of CO2 Emissions
With the Stratus system [30], the authors use Amazon ec2 cloud instances to
model load balancing between different dcs in order to reduce CO2 emissions
and electricity costs. They leverage the fact that the cost of electricity as well
as the greenhouse gas emissions will vary over time and with type of power
plant. The authors use graph algorithms to guide routing of requests towards
different dcs based on the priorities of the cloud operator.
In [62] Moghaddam et al. utilize intelligent live migration of vms in Virtual
Private Clouds (vpcs) to reduce the carbon footprint of cloud operation. They
leverage that the energy used by different dcs comes from different sources, and
thus results in varying greenhouse gas emissions. Using a genetic algorithm for
consolidating vms across dcs, they minimize the carbon footprint by placing




In this chapter, we start by discussing some of our findings from chapter 5.
We then turn our attention to the omni-kernel architecture (oka), and de-
scribe how we believe pm and energy efficiency can be accommodated within
this architecture. Following our discussion, we will list our contributions and
achievements. We then conclude the thesis.
7.1 Discussion
7.1.1 Findings
During our experiments and evaluations, we have encountered two situations
where our results differ from those presented in previous work. In this section,
we will discuss each of these in more detail.
Shallow Power States Are Easier to Use
Benini et al. argue in [14] that shallow power states with low associated entry
and exit costs are easier to use and more important than deeper power states
whose associated costs are higher. We do not dispute the correctness of this
conjecture in general, but rather note that our experiments do not suggest
this for our cpus. Specifically, our experiments indicate that entering C3—the
most expensive of our platform’s cpu C-states—results in both larger power
savings and better performance than using the shallower (and cheaper) C1
and C2 states (see Section 5.2.1 for numbers).
We have been unable to find any work documenting why this is, but observe
that Microsoft Windows aggressively enters the deepest available C-state in
its core parking functionality [79, Ch. 8]. We doubt that this choice has
been made arbitrarily.
One possible reason that entering deep C-states is often avoided might be
that the acpi specification demands that the ospm keep state and flush
caches when states deeper than C2 are entered [22]. However, on modern
cpus, this is no longer necessary as the platform manages such state itself
[2]. Also, the effects of flushing caches when entering deep C-states might
not be very severe for typical web-based workloads [59].
Latencies Below 1ms Are OK
In [59], the authors claim that excess latencies of duration less than 1ms are
of little concern if they occur while exiting the idle function. Our experiments
contradict this. We find that when leaving the idle loop, even very short excess
latencies (for instance the 1–2 μs it takes to cancel a timer) have a significant
detrimental effect on both total completion time and latency experienced by
clients (see Figure 5.6 and Table 5.4 for numbers).
7.1.2 Power Management in the Omni-kernel Architecture
The oka provides unprecedented control over resource allocation and con-
sumption. Just as meticulous accounting is applied to cpu cycles, consumed
memory, throughput, etc., resource records can be kept for any metric
relevant to pm. By developing models that allow mappings from metrics to
consumed energy, the oka implicitly provides total knowledge of:
1. Where energy has been consumed.
2. Who is responsible for the power consumption.
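As a rough illustration of these two kinds of knowledge, the sketch below attributes modelled energy to resources ("where") and tenants ("who") from a list of resource records. The record layout, the metric names, and the per-unit energy coefficients are all hypothetical; we do not claim that the oka or Vortex uses this particular model.

```python
from collections import defaultdict

# Hypothetical linear energy model: nanojoules charged per unit of each metric.
# The coefficients are placeholders, not measured values.
NANOJOULES_PER_UNIT = {"cpu_cycles": 2, "bytes_transferred": 5}


def attribute_energy(records, model=NANOJOULES_PER_UNIT):
    """Sum modelled energy per resource ("where") and per tenant ("who")."""
    by_resource = defaultdict(int)
    by_tenant = defaultdict(int)
    for rec in records:
        energy = sum(model[metric] * value for metric, value in rec["metrics"].items())
        by_resource[rec["resource"]] += energy
        by_tenant[rec["tenant"]] += energy
    return dict(by_resource), dict(by_tenant)


# Hypothetical resource records for two tenants.
records = [
    {"resource": "cpu", "tenant": "vm-a", "metrics": {"cpu_cycles": 4_000_000_000}},
    {"resource": "nic", "tenant": "vm-a", "metrics": {"bytes_transferred": 200_000_000}},
    {"resource": "cpu", "tenant": "vm-b", "metrics": {"cpu_cycles": 1_000_000_000}},
]
where, who = attribute_energy(records)
```

A real model would have to be calibrated against power measurements; the linear form is chosen here only because it keeps the mapping from metrics to energy easy to follow.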
Schedulers can leverage this information to make informed decisions about
how a task should be executed, for instance by adjusting the frequency of
cpus according to the properties of the currently executing thread [91]. Sim-
ilarly, information present in resource consumption records can be used to
determine energy efficient schedules by reducing the number of state transi-
tions, e.g. by scheduling tasks with similar frequency needs following each
other. By using metrics such as the number of llc-misses, tasks can be placed
by the scheduler so that minimal contention of shared resources occurs.
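The idea of reducing state transitions by running tasks with similar frequency needs back to back can be sketched as follows. The task list and frequency values are illustrative assumptions, and the sketch deliberately ignores fairness, deadlines, and cache effects that a real scheduler would have to weigh.

```python
def count_transitions(freqs):
    """Number of frequency changes when tasks run in the given order."""
    return sum(1 for a, b in zip(freqs, freqs[1:]) if a != b)


def group_by_frequency(tasks):
    """Reorder tasks so that tasks with the same frequency need run back to back.

    sorted() is stable, so tasks with equal needs keep their arrival order.
    """
    return sorted(tasks, key=lambda task: task[1])


# (task name, preferred frequency in MHz); hypothetical values.
arrival = [("a", 2666), ("b", 1999), ("c", 2666), ("d", 1999), ("e", 2332)]
grouped = group_by_frequency(arrival)

before = count_transitions([freq for _, freq in arrival])
after = count_transitions([freq for _, freq in grouped])
```

In this example the arrival order requires four frequency changes, while the grouped order requires two.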
The oka is built around the notion of resources, and each of these is gov-
erned by a scheduler. As such, different resources can employ energy efficient
schedulers tailored specifically to accommodate the needs of the resource.
For example, some resources might be i/o bound, while others are compute
intensive. In this case, the scheduler governing the i/o bound resource could
reduce cpu frequencies, while a scheduler for a computationally expensive re-
source might increase the frequency—racing to halt. The necessary profiling
of the resources could be based on the resource records, and be performed
online.
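A minimal sketch of such a per-resource policy is given below. It classifies a resource from a resource record and picks a frequency accordingly. The record fields (`busy_us`, `wait_us`) and the 0.5 threshold are assumptions made for illustration, not part of rope.

```python
def classify(record):
    """Label a resource as "io" or "compute" bound from its resource record.

    The 0.5 threshold on the fraction of time spent waiting is an arbitrary assumption.
    """
    total = record["busy_us"] + record["wait_us"]
    if total == 0:
        return "io"  # nothing ran; no reason to request a high frequency
    return "io" if record["wait_us"] / total > 0.5 else "compute"


def pick_frequency(record, available_mhz):
    """Lowest frequency for i/o bound resources, highest for compute bound ones."""
    if classify(record) == "io":
        return min(available_mhz)
    return max(available_mhz)  # race to halt


available = [2666, 2332, 1999]
disk = {"busy_us": 200, "wait_us": 800}    # hypothetical i/o bound resource
crypto = {"busy_us": 900, "wait_us": 100}  # hypothetical compute bound resource
disk_mhz = pick_frequency(disk, available)
crypto_mhz = pick_frequency(crypto, available)
```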
7.2 Contributions and Achievements
In this thesis we have described the design and implementation of rope, a
system for ospm in the Vortex os kernel. We have added acpi support, and
also functionality for Intel-specific pm to the Vortex kernel.
rope contains a selection of pm policies for cpus that use cpu C- and P-
states. We have evaluated these policies, and determined which performs
the best. Our evaluations have uncovered discrepancies between claims in
previous work and our experimental observations. We also implemented a
naive energy efficient scheduler. The implemented functionality is modular,
and policies can be chosen according to the expected workload.
Through evaluation of our policies, and the scheduler in particular, we have
shown that the oka is dependent on multiple power aware schedulers if
maximum energy efficiency is to be achieved. This is due to different resources
being governed by different schedulers, and normal workloads rarely being
constrained to a single resource.
With rope, we have achieved power savings in the order of 17%–28% for
realistic web-based workloads. While we have achieved a 17% reduction with
no detectable performance degradation, the higher energy savings come at
the cost of increased latency. However, we argue that our numbers might
exaggerate the actual performance impact.
7.3 Future Work
rope has now been deployed in the Vortex kernel, and we are already saving
energy in our server racks. However, we would like to continue development of
the functionality necessary to do energy efficient scheduling of more than just
threads. As was demonstrated by running our ab workloads, the approach of
only scheduling threads is futile when the workloads result in high resource
consumption inside the kernel.
Another natural step would be to implement a topology-aware scheduler that
is able to keep caches warm, and the number of active cpu packages to a
minimum. Such a scheduler could also be expanded in the direction of using
resource records to monitor energy consumption of tasks and tenants. This
could open up for new and interesting approaches to pm in Vortex.
Through the evaluation of pm policies, we have identified that a key aspect
of achieving energy efficiency in the oka is to provide power-aware sched-
ulers for all resources. We particularly wish to look into which resources
are cpu/memory bound, and whether batch processing of messages can be
used together with scheduler-directed dvs to reduce energy consumption
further.
7.4 Concluding Remarks
We have successfully deployed rope, a system for ospm in the Vortex kernel.
rope has been designed and implemented according to our design principles.
We have evaluated all the pm policies available in rope and quantified their
impact on both power consumption and performance. As such, the user can
make informed choices about energy/performance tradeoffs.
Finally, as part of the Vortex kernel, rope will continue to be in active use
by faculty and students.
A
ACPI Objects and Namespace
This appendix details some of the central acpi objects used throughout the
Vortex pm functionality. Also, a brief introduction to the acpi namespace is
included.
A.1 The ACPI Namespace
acpi is presented by the platform to the os as a single hierarchical namespace
of objects (see figure A.1). These objects allow device detection and configu-
ration, and several classes exist. The contents of objects are varied, but most
refer to data variables, control methods, or functions provided by platform
bios. In the following section, various objects used in the implementation of
rope are explained.
Figure A.1: Example of ACPI namespace. Figure borrowed from the acpi specification
v. 5 [22].
A.2 ACPI Objects
A.2.1 The _OSC Object
The operating system capabilities (_OSC) object is an optional control method
that can be used by ospm to notify the platform of support for different
features. In the case of processors, such information could be whether ospm
wishes to handle various features such as P-state coordination itself or rely on
the platform and hardware to handle this for it.
The _OSC object is used as follows:
1. The _OSC object for a device is located in the acpi-namespace.
2. The arguments are set, including a uuid and a capabilities buffer
formatted specifically to match the device.
3. The method is evaluated, and the status can be verified by the returned
values. If the bits in the capabilities buffer were set in such a way that
ospm code indicated the capability of handling any features, the plat-
form may respond by generating new acpi objects for such features in
the namespace.
It should be noted that the _OSC object is used for the same purpose as the
_PDC object which it replaced in acpi version 3.0. A detailed description of
the _OSC method and its usage can be found in [22, p. 282–291]. For details
regarding the format of the capabilities buffer, device specific documentation
must be consulted.
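The steps above can be sketched as follows. The two status bits follow the _OSC description in [22]; the feature bits, the placeholder uuid, and the `mock_evaluate_osc` function that stands in for the platform are hypothetical. A real kernel would locate the object in the namespace and evaluate it through its aml interpreter.

```python
import struct
import uuid

# Status bits in the first DWORD of the capabilities buffer.
QUERY_SUPPORT = 1 << 0        # set by ospm to query without committing
CAPABILITIES_MASKED = 1 << 4  # set by the platform if it cleared requested bits

# Hypothetical feature bits and uuid; real values are device specific.
FEATURE_PSTATE_COORDINATION = 1 << 0
FEATURE_EXAMPLE_B = 1 << 1
PLACEHOLDER_UUID = uuid.UUID(int=0)


def build_capabilities(features, query=False):
    """Step 2: pack a status DWORD followed by one DWORD of feature bits."""
    return struct.pack("<II", QUERY_SUPPORT if query else 0, features)


def mock_evaluate_osc(device_uuid, revision, capabilities, platform_features):
    """Step 3, simulated: the platform grants only the features it supports."""
    _, requested = struct.unpack("<II", capabilities)
    granted = requested & platform_features
    status = CAPABILITIES_MASKED if granted != requested else 0
    return struct.pack("<II", status, granted)


request = build_capabilities(FEATURE_PSTATE_COORDINATION | FEATURE_EXAMPLE_B)
reply = mock_evaluate_osc(PLACEHOLDER_UUID, 1, request,
                          platform_features=FEATURE_PSTATE_COORDINATION)
status, granted = struct.unpack("<II", reply)
```

Here ospm asks for two features, and the simulated platform grants only one and reports that the request was masked.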
A.2.2 The _PSS Object
The performance supported states (_PSS) object is an optional object indicating
the number of supported processor performance states. When evaluated,
this object returns a list containing information about available performance
states including the internal cpu core frequency, power dissipation, control
register values needed to transition into performance states, and status regis-
ter values enabling the ospm code to verify that transitions were successful.
An example of an acpi-package returned by the _PSS object is illustrated in
figure A.2.
A.2.3 The _PPC Object
The performance present capabilities (_PPC) object is a dynamic method used
by the ospm code to obtain information about which performance states
can be used at any given point in time. When evaluated, the object
returns a single integer describing the highest (lowest numbered) performance
state that is currently available. For instance, if the value 1 is returned, P0 is
unavailable, while P1 through Pn are available.
To perform a transition, the _PPC object must first be used to obtain informa-
tion regarding which P-states can be entered. The ospm software can then
choose any available performance state, and obtain the information necessary
to transition into it from the _PSS and _PCT objects, that is, which control
values to write and which status values to read, and in which registers. This
process is illustrated in figure A.3.
A.2.4 The _PCT Object
The performance control (_PCT) object is an optional object that can be used by
ospm to transition a processor into a given P-state. Evaluation of this object
returns an acpi package containing a control register and a status register.
The ospm software can enter a desired performance state by writing a value
specific to said state into a performance control register. When doing this,






























Figure A.2: The figure shows an example of an acpi aml package returned when
evaluating a _PSS object. A detailed description of the different fields
can be found in the acpi specification version 5.0 [22, p. 409].
Evaluate _PPC object: obtain highest available performance state (Ex. 1)
Consult _PSS entries: chose P1; obtain CR1 and SR1

CoreFrequency  Power   Latency  BM Latency  Control  Status
2666           103000  100      100         144      144
2332           90618   100      100         145      145
1999           79209   100      100         146      146

Control Register  Status Register
CR0               SR0
CR1               SR1
CR2               SR2
Figure A.3: The figure illustrates the process of entering a performance state.
ospm must evaluate the _PPC object described above to obtain the currently
available performance states. The control value to write can be obtained by
evaluating the _PSS object. To validate that the transition was performed
successfully, the returned status register must be read. If the transition was
indeed successful, the value read will match the one present in the status
field of the _PSS entry corresponding to the performance state.
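The transition procedure can be sketched as follows. The _PSS values are those shown in figure A.3, while the `FakeRegisters` class is a hypothetical stand-in for the control and status registers returned by _PCT. Real code would access model-specific registers or i/o ports, and may need to wait for the transition to complete before reading the status.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PState:
    core_frequency_mhz: int
    power_mw: int
    control: int
    status: int


# _PSS entries in the order P0, P1, P2; values taken from figure A.3.
PSS = [
    PState(2666, 103000, 144, 144),
    PState(2332, 90618, 145, 145),
    PState(1999, 79209, 146, 146),
]


class FakeRegisters:
    """Simulated _PCT control/status registers (an assumption, not real hardware)."""

    def __init__(self):
        self.status = None

    def write_control(self, value):
        # Assume the hardware immediately reports the state it was asked for.
        self.status = value

    def read_status(self):
        return self.status


def enter_pstate(target, ppc, pss, regs):
    """Enter P<target> if _PPC allows it, then verify via the status register."""
    if not ppc <= target < len(pss):
        raise ValueError(f"P{target} is not available (_PPC = {ppc})")
    regs.write_control(pss[target].control)
    return regs.read_status() == pss[target].status


regs = FakeRegisters()
ok = enter_pstate(target=1, ppc=1, pss=PSS, regs=regs)
```

With _PPC returning 1, a request for P0 is rejected, while P1 and P2 can be entered.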
A.2.5 The _PSD Object
The P-state dependency (_PSD) object is used by ospm code to obtain
information regarding P-state dependencies across logical processors. This
information is crucial since different processors and platforms exhibit different
dependency characteristics. For instance, different cores on a multi-core chip may
or may not be dependent on each other.
A.2.6 The _CST Object
The optional _CST object provides one of two alternative methods for ospm
software to transition a processor into different processor C-states. This
alternative method allows the platform designers to support more than three
C-states. When evaluating the _CST object, an acpi aml package is returned
that contains an integer denoting the number of supported C-states, followed
by the same number of subpackages, each describing one of these C-states. This is
illustrated in figure A.4.
The C1 state is mandatory, while all other C-states are optional. Each of the
subpackages detailing a C-state contains relevant information such as the
type (C1, C2, ..., Cn) determining the entry semantics, the worst-case latency
when entering/exiting the state, and the average power consumption of the
processor when placed in the power state. It also contains a register descriptor
field, which is used to determine the entry method for the C-state, for instance
which registers to use, whether these are functional fixed hardware (ffh)
or system i/o registers, and exactly how to use them. A detailed description




















Figure A.4: The figure displays an example package returned from evaluation of the _CST
object. Note that the C1 power state is mandatory, while all other C-states
are optional.
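To illustrate how ospm might use the information in a _CST package, the sketch below parses a package of the shape described above and selects the deepest C-state whose worst-case latency fits a given bound. The register descriptor is omitted, the numbers are invented, and the selection rule is a simple assumption rather than the policy used by rope.

```python
# A _CST-like package: a count followed by that many subpackages.
# Each subpackage is (type, worst-case latency in microseconds, average power in milliwatts).
# All numbers are invented for illustration.
CST_PACKAGE = [3, (1, 1, 1000), (2, 20, 500), (3, 100, 100)]


def parse_cst(package):
    """Split the package into its subpackages and check the leading count."""
    count, states = package[0], package[1:]
    if count != len(states):
        raise ValueError("count does not match number of subpackages")
    return states


def deepest_allowed(package, max_latency_us):
    """Type of the deepest C-state whose latency fits the bound.

    C1 is mandatory, so it is the fallback when no state fits.
    """
    fitting = [ctype for ctype, latency, _ in parse_cst(package)
               if latency <= max_latency_us]
    return max(fitting, default=1)


choice = deepest_allowed(CST_PACKAGE, max_latency_us=50)
```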
Bibliography
[1] Greenhouse Gas Equivalencies Calculator. http://carbonfootprint360.
com/p/Greenhouse-Gas-Equivalencies-Calculator.html. Accessed 2014-04-
01.
[2] Linux Cross Reference, Intel ACPI C-state driver. http://lxr.free-electrons.
com/source/arch/x86/kernel/acpi/cstate.c. Accessed 2014-05-14.
[3] Overview of the TPC Benchmark C: The Order-Entry Benchmark. http:
//www.tpc.org/tpcc/detail.asp. Accessed 2014-05-01.
[4] H.R.5646 - To study and promote the use of energy efficient com-
puter servers in the United States. http://beta.congress.gov/bill/109th-
congress/house-bill/5646, July 2006. Accessed 2014-03-23.
[5] EU to Study Energy Use by Data Centers. http://www.pcworld.com/
article/129322/article.html, February 2007. Accessed 2014-03-23.
[6] Monitor/mwait. http://blog.andy.glew.ca/2010/11/httpsemipublic.html,
November 2010. Accessed 2014-05-01.
[7] Add P state driver for Intel Core Processors. https://lwn.net/Articles/
536017/, February 2013. Accessed 2014-03-19.
[8] Facebooks Carbon & Energy Impact. https://www.fb-carbon.com/pdf/
FB_carbon_enegergy_impact_2012.pdf, June 2013. Accessed 2014-04-
01.
[9] Nevine AbouGhazaleh, Alexandre Ferreira, Cosmin Rusu, Ruibin Xu,
Frank Liberato, Bruce Childers, Daniel Mosse, and Rami Melhem. Inte-
grated CPU and L2 Cache Voltage Scaling Using Machine Learning. In
Proceedings of the 2007 ACM SIGPLAN/SIGBED Conference on Languages,
Compilers, and Tools for Embedded Systems, LCTES ’07, pages 41–50, New
York, NY, USA, 2007. ACM.
[10] Baris Aksanli, Jagannathan Venkatesh, Liuyi Zhang, and Tajana Rosing.
Utilizing Green Energy Prediction to Schedule Mixed Batch and Service
Jobs in Data Centers. In Proceedings of the 4th Workshop on Power-Aware
Computing and Systems, HotPower ’11, pages 5:1–5:5, New York, NY, USA,
2011. ACM.
[11] Paul Barham, Boris Dragovic, Keir Fraser, Steven Hand, Tim Harris, Alex
Ho, Rolf Neugebauer, Ian Pratt, and Andrew Warfield. Xen and the Art
of Virtualization. In Proceedings of the Nineteenth ACM Symposium on
Operating Systems Principles, SOSP ’03, pages 164–177, New York, NY,
USA, 2003. ACM.
[12] Luiz André Barroso and Urs Hölzle. The Case for Energy-Proportional
Computing. Computer, 40(12):33–37, December 2007.
[13] Cullen Bash and George Forman. Cool Job Allocation: Measuring the
Power Savings of Placing Jobs at Cooling-efficient Locations in the Data
Center. In 2007 USENIX Annual Technical Conference on Proceedings of the
USENIX Annual Technical Conference, ATC’07, pages 29:1–29:6, Berkeley,
CA, USA, 2007. USENIX Association.
[14] Luca Benini, Alessandro Bogliolo, and Giovanni De Micheli. A survey of
design techniques for system-level dynamic power management. IEEE
TRANSACTIONS ON VLSI SYSTEMS, 8(3):299–316, 2000.
[15] Pat Bohrer, Elmootazbellah N. Elnozahy, Tom Keller, Michael Kistler,
Charles Lefurgy, Chandler McDowell, and Ram Rajamony. The Case
for Power Management in Web Servers, 2002.
[16] Qiong Cai, José González, Ryan Rakvic, Grigorios Magklis, Pedro Cha-
parro, and Antonio González. Meeting Points: Using Thread Criticality
to Adapt Multicore Hardware to Parallel Regions. In Proceedings of the
17th International Conference on Parallel Architectures and Compilation
Techniques, PACT ’08, pages 240–249, New York, NY, USA, 2008. ACM.
[17] Gavin C. Cawley. On a Fast, Compact Approximation of the Exponential
Function. Neural Comput., 12(9):2009–2012, September 2000.
[18] Hanen Chihi, Walid Chainbi, and Khaled Ghedira. An energy-efficient
self-provisioning approach for cloud resources management. SIGOPS
Oper. Syst. Rev., 47(3):2–9, November 2013.
[19] Eui-Young Chung, Luca Benini, Alessandro Bogliolo, Ro Bogliolo, Yung-
Hsiang Lu, and Giovanni De Micheli. Dynamic Power Management for
Nonstationary Service Requests. In Design Automation and Test in
Europe, pages 77–81, 2002.
[20] AMD Corporation. Cool ’n’ Quiet™ Technology Installation Guide for
AMD Athlon™ 64 Processor Based Systems, Revision 0.04, Jun. 2004.
[21] AMD Corporation. AMD PowerNow!™ Technology - Dynamically Manages
Power And Performance, Informational White Paper, Revision A,
Nov. 2000.
[22] Compaq Computer Corporation and Revision B. Advanced Configuration
and Power Interface Specification, Revision 5, 2011.
[23] Intel Corporation. Intel Processor Vendor-Specific ACPI - Interface Spec-
ification, Revision 005, 2006.
[24] Intel Corporation. ACPI Component Architecture User Guide and Pro-
grammer Reference, Revision 5.17, 2013.
[25] Intel Corporation. Intel™ 64 and IA-32 Architectures Software Developer’s
Manual, Volume 2A Instruction Set Reference, A-M, 2014.
[26] Intel Corporation. Intel™ 64 and IA-32 Architectures Software Developer’s
Manual, Volume 3B System Programming Guide, 2014.
[27] Intel Corporation. Enhanced Intel® SpeedStep® Technology for the
Intel® Pentium® M Processor, Mar. 2004.
[28] Gaurav Dhiman and Tajana S. Rosing. Dynamic power management
using machine learning. In Proceedings of the 2006 IEEE/ACM interna-
tional conference on Computer-aided design, ICCAD ’06, pages 747–754,
New York, NY, USA, 2006. ACM.
[29] Gaurav Dhiman and Tajana Simunic Rosing. Dynamic Voltage Frequency
Scaling for Multi-tasking Systems Using Online Learning. In Proceedings
of the 2007 international symposium on Low power electronics and design,
pages 207–212. ACM, 2007.
[30] Joseph Doyle, Robert Shorten, and Donal O’Mahony. Stratus: Load
Balancing the Cloud for Carbon Emissions Control. IEEE Transactions
on Cloud Computing, 1(1):1, 2013.
[31] Frank Durr. Improving the efficiency of cloud infrastructures with elas-
tic tandem machines. In Cloud Computing (CLOUD), 2013 IEEE Sixth
International Conference on, pages 91–98. IEEE, 2013.
[32] Mark E. Femal and Vincent W. Freeh. Boosting Data Center Performance
Through Non-Uniform Power Allocation. In ICAC, pages 250–261. IEEE
Computer Society, 2005.
[33] Agner Fog. The microarchitecture of Intel, AMD and VIA CPUs. An op-
timization guide for assembly programmers and compiler makers. Copen-
hagen University College of Engineering, 2011.
[34] Íñigo Goiri, William Katsak, Kien Le, Thu D. Nguyen, and Ricardo Bian-
chini. Parasol and GreenSwitch: Managing Datacenters Powered by
Renewable Energy. In Proceedings of the Eighteenth International Confer-
ence on Architectural Support for Programming Languages and Operating
Systems, ASPLOS ’13, pages 51–64, New York, NY, USA, 2013. ACM.
[35] Richard Golding, Peter Bosch, and Carl Staelin. Idleness is Not Sloth,
1995.
[36] Erlend Graff. Initial Design and Implementation of a Windows VM OS
for Vortex. Bachelor thesis, Department of Computer Science, University
of Tromsø, 2014.
[37] Robert B. Gramacy, Manfred K. Warmuth, Scott A. Brandt, and Ismail Ari.
Adaptive Caching by Refetching. In Advances in Neural Information
Processing Systems 15, pages 1465–1472. MIT Press, 2002.
[38] Flavius Gruian. Hard Real-time Scheduling for Low-energy Using
Stochastic Data and DVS Processors. In Proceedings of the 2001 Inter-
national Symposium on Low Power Electronics and Design, ISLPED ’01,
pages 46–51, New York, NY, USA, 2001. ACM.
[39] David P. Helmbold, Darrell D. E. Long, Tracey L. Sconyers, and Bruce
Sherrod. Adaptive disk spin-down for mobile computers. Mobile Net-
works and Applications, page 297, 2000.
[40] Sebastian Herbert and Diana Marculescu. Variation-Aware Dynamic
Voltage/Frequency Scaling. In High Performance Computer Architecture,
2009. HPCA 2009. IEEE 15th International Symposium on, pages 301–312.
IEEE, 2009.
[41] Nikolas Roman Herbst, Nikolaus Huber, Samuel Kounev, and Erich Am-
rehn. Self-adaptive Workload Classification and Forecasting for Proac-
tive Resource Provisioning. In Proceedings of the 4th ACM/SPEC Interna-
tional Conference on Performance Engineering, ICPE ’13, pages 187–198,
New York, NY, USA, 2013. ACM.
[42] Mark Herbster and Manfred Warmuth. Tracking the Best Expert. In
Machine Learning, pages 286–294. Morgan Kaufmann, 1995.
[43] Courtney Humphries and Paul Ruth. Towards Power Efficient Consoli-
dation and Distribution of Virtual Machines. In Proceedings of the 48th
Annual Southeast Regional Conference, ACM SE ’10, pages 75:1–75:6, New
York, NY, USA, 2010. ACM.
[44] Rob J Hyndman and Anne B Koehler. Another look at measures of
forecast accuracy. International Journal of Forecasting, pages 679–688,
2006.
[45] Canturk Isci, Gilberto Contreras, and Margaret Martonosi. Live, Runtime
Phase Monitoring and Prediction on Real Systems with Application to Dy-
namic Power Management. In Proceedings of the 39th Annual IEEE/ACM
International Symposium on Microarchitecture, MICRO 39, pages 359–
370, Washington, DC, USA, 2006. IEEE Computer Society.
[46] M. Ciraula E. Fang S. Johnson Bujanos R. Kumar D. Wu M. Braganza
J. Dorsey, S. Searles and S. Meyers. An Integrated Quad-Core Opteron
Processor, 2007.
[47] Hwisung Jung and Massoud Pedram. Supervised Learning Based Power
Management for Multicore Processors. Trans. Comp.-Aided Des. Integ.
Cir. Sys., 29(9):1395–1408, September 2010.
[48] Jonathan G. Koomey. Growth in Data Center Electricity Use 2005 to
2010, 2011.
[49] Jonathan G. Koomey, Christian Belady, Michael Patterson, and Anthony
Santos. Assessing trends over time in performance, costs, and energy
use for servers, 2009.
[50] Ramakrishna Kotla, Anirudh Devgan, Soraya Ghiasi, Tom Keller, and
Freeman Rawson. Characterizing the Impact of Different Memory-Intensity
Levels. In IEEE 7th Annual Workshop on Workload Characterization
(WWC-7), 2004.
[51] Ramakrishna Kotla, Soraya Ghiasi, Tom Keller, and Freeman Rawson.
Scheduling Processor Voltage and Frequency in Server and Cluster Systems.
In Proceedings of the 19th IEEE International Parallel and Distributed
Processing Symposium (IPDPS’05) - Workshop 11 - Volume 12, IPDPS ’05,
pages 234.2–, Washington, DC, USA, 2005. IEEE Computer Society.
[52] Åge Kvalnes. The Omni-Kernel Architecture: Scheduler Control Over All
Resource Consumption in Multi-Core Computing Systems. PhD thesis.
[53] Åge Kvalnes, Dag Johansen, Robbert van Renesse, Fred B. Schneider, and
Steffen Viken Valvåg. Omni-Kernel: An Operating System Architecture
for Pervasive Monitoring and Scheduling. Technical Report IFI-UiT 2013-
75, Department of Computer Science, University of Tromsø, 2013.
[54] Branislav Kveton and Shie Mannor. Adaptive timeout policies for fast
finegrained power management. In Proceedings of the 19th Conference
on Innovative Applications of Artificial Intelligence, 2007.
[55] Ching-Chi Lin, Pangfeng Liu, and Jan-Jan Wu. Energy-efficient Virtual
Machine Provision Algorithms for Cloud Systems. In Proceedings of the
2011 Fourth IEEE International Conference on Utility and Cloud Comput-
ing, UCC ’11, pages 81–88, Washington, DC, USA, 2011. IEEE Computer
Society.
[56] Nick Littlestone. Learning quickly when irrelevant attributes abound:
A new linear-threshold algorithm. In Machine Learning, pages 285–318,
1988.
[57] Nick Littlestone and Manfred K. Warmuth. The Weighted Majority Al-
gorithm. Inf. Comput., 108(2):212–261, February 1994.
[58] Vimal Mathew, Ramesh K Sitaraman, and Prashant Shenoy. Energy-
aware Load Balancing in Content Delivery Networks. 2011.
[59] David Meisner, Brian T. Gold, and Thomas F. Wenisch. The PowerNap
Server Architecture. ACM Trans. Comput. Syst., 29(1):3:1–3:24, February
2011.
[60] Thomas M. Mitchell. Machine Learning. McGraw-Hill, Inc., New York,
NY, USA, 1 edition, 1997.
[61] Michael Moeng and Rami Melhem. Applying Statistical Machine Learning
to Multicore Voltage & Frequency Scaling. In Proceedings of the
7th ACM International Conference on Computing Frontiers, CF ’10, pages
277–286, New York, NY, USA, 2010. ACM.
[62] Fereydoun Farrahi Moghaddam, Mohamed Cheriet, and Kim Khoa
Nguyen. Low carbon virtual private clouds. In Cloud Computing
(CLOUD), 2011 IEEE International Conference on, pages 259–266. IEEE,
2011.
[63] Jeffrey C. Mogul and K. K. Ramakrishnan. Eliminating Receive Livelock
in an Interrupt-driven Kernel. ACM Trans. Comput. Syst., 15(3):217–252,
August 1997.
[64] Justin Moore, Jeff Chase, Parthasarathy Ranganathan, and Ratnesh
Sharma. Making Scheduling "Cool": Temperature-aware Workload
Placement in Data Centers. In Proceedings of the Annual Conference
on USENIX Annual Technical Conference, ATEC ’05, pages 5–5, Berkeley,
CA, USA, 2005. USENIX Association.
[65] Ripal Nathuji and Karsten Schwan. Reducing system level power
consumption for mobile and embedded platforms. In Proceedings of the
International Conference on Architecture of Computing Systems (ARCS),
2005.
[66] Ripal Nathuji and Karsten Schwan. Virtualpower: Coordinated Power
Management in Virtualized Enterprise Systems. In Proceedings of
Twenty-first ACM SIGOPS Symposium on Operating Systems Principles,
SOSP ’07, pages 265–278, New York, NY, USA, 2007. ACM.
[67] Audun Nordal, Åge Kvalnes, and Dag Johansen. Balava: Federating
Private and Public Clouds. In 2011 IEEE World Congress on Services,
pages 569–577, 2011.
[68] Audun Nordal, Åge Kvalnes, and Dag Johansen. Paravirtualizing TCP. In
6th international workshop on Virtualization Technologies in Distributed
Computing, pages 3–10, 2012.
[69] Audun Nordal, Åge Kvalnes, Robert Pettersen, and Dag Johansen.
Streaming as a Hypervisor Service. In 7th international workshop on
Virtualization Technologies in Distributed Computing, 2013.
[70] G. A. Paleologo, L. Benini, A. Bogliolo, and G. De Micheli. Policy Opti-
mization for Dynamic Power Management. In Proceedings of the 35th
Annual Design Automation Conference, DAC ’98, pages 182–187, New York,
NY, USA, 1998. ACM.
[71] C. D. Patel, C. E. Bash, R. Sharma, and M. Beitelmal. Smart cooling of
data centers. In Proceedings of IPACK, 2003.
[72] Padmanabhan Pillai and Kang G. Shin. Real-time Dynamic Voltage
Scaling for Low-power Embedded Operating Systems. SIGOPS Oper.
Syst. Rev., 35(5):89–102, October 2001.
[73] Cai Qiong, José González, Grigorios Magklis, Pedro Chaparro, and Antonio
González. Thread Shuffling: Combining DVFS and Thread Migration
to Reduce Energy Consumptions for Multi-core Systems. In Low Power
Electronics and Design (ISLPED) 2011 International Symposium on, pages
379–384. IEEE, 2011.
[74] Karthick Rajamani, Heather Hanson, Juan Rubio, Soraya Ghiasi, and
Freeman Rawson. Application-Aware Power Management. In Workload
Characterization, 2006 IEEE International Symposium on, pages 39–48.
IEEE, 2006.
[75] Dinesh Ramanathan and Rajesh Gupta. System Level Online Power
Management Algorithms, 2000.
[76] Parthasarathy Ranganathan, Phil Leech, David Irwin, and Jeffrey Chase.
Ensemble-level Power Management for Dense Blade Servers. In Proceedings
of the 33rd Annual International Symposium on Computer Architecture,
ISCA ’06, pages 66–77, Washington, DC, USA, 2006. IEEE Computer
Society.
[77] Zhiyuan Ren, Bruce H. Krogh, and Radu Marculescu. Hierarchical Adap-
tive Dynamic Power Management. In Proceedings of the Conference on
Design, Automation and Test in Europe - Volume 1, DATE ’04, pages 10136–,
Washington, DC, USA, 2004. IEEE Computer Society.
[78] Mark Russinovich, David A. Solomon, and Alex Ionescu. Windows Inter-
nals, Part 1. Microsoft Press, Redmond, WA, USA, 6th edition, 2012.
[79] Mark Russinovich, David A. Solomon, and Alex Ionescu. Windows Inter-
nals, Part 2. Microsoft Press, Redmond, WA, USA, 6th edition, 2012.
[80] Nicol N. Schraudolph. A Fast, Compact Approximation of the Exponen-
tial Function. Neural Computation, 11:11–4, 1998.
[81] Mohsen Sharifi, Hadi Salimi, and Mahsa Najafzadeh. Power-efficient
distributed scheduling of virtual machines using workload-aware con-
solidation techniques. The Journal of Supercomputing, 61(1):46–66, 2012.
[82] Vivek Sharma, Arun Thomas, Tarek Abdelzaher, Kevin Skadron, and Zhi-
jian Lu. Power-aware QoS Management in Web Servers. In Proceedings
of the 24th IEEE International Real-Time Systems Symposium, RTSS ’03,
pages 63–, Washington, DC, USA, 2003. IEEE Computer Society.
[83] Suresh Siddha, Venkatesh Pallipadi, and Asit Mallick. Chip Multi Pro-
cessing aware Linux Kernel Scheduler. In Linux Symposium, page 329,
2006.
[84] Tajana Simunic, Luca Benini, Peter Glynn, and Giovanni De Micheli.
Event-Driven Power Management. IEEE TRANS. COMPUTER-AIDED DE-
SIGN, 20:840–857, 2001.
[85] Mani B. Srivastava, Anantha P. Chandrakasan, and R. W. Brodersen. Pre-
dictive system shutdown and other architectural techniques for energy
efficient programmable computation. IEEE Trans. Very Large Scale Integr.
Syst., 4:42–55, March 1996.
[86] Carl W. Steinbach. A Reinforcement-Learning Approach to Power Man-
agement, 2002.
[87] The Climate Group. SMART 2020: Enabling the low carbon economy in
the information age. Technical report, 2008.
[88] Dimitris Tsirogiannis, Stavros Harizopoulos, and Mehul A. Shah. An-
alyzing the Energy Efficiency of a Database Server. In Proceedings of
the 2010 ACM SIGMOD International Conference on Management of Data,
SIGMOD ’10, pages 231–242, New York, NY, USA, 2010. ACM.
[89] Vincent Vanhoucke, Andrew Senior, and Mark Z Mao. Improving the
speed of neural networks on cpus. In Proc. Deep Learning and Unsuper-
vised Feature Learning NIPS Workshop, 2011.
[90] David Wang. Meeting green computing challenges. In Electronics Pack-
aging Technology Conference, 2008. EPTC 2008. 10th, pages 121–126. IEEE,
2008.
[91] Andreas Weissel and Frank Bellosa. Process Cruise Control: Event-driven
Clock Scaling for Dynamic Power Management. In Proceedings of the
2002 International Conference on Compilers, Architecture, and Synthesis
for Embedded Systems, CASES ’02, pages 238–246, New York, NY, USA,
2002. ACM.
[92] Qiang Wu, Margaret Martonosi, Douglas W. Clark, V. J. Reddi, Dan Con-
nors, Youfeng Wu, Jin Lee, and David Brooks. A Dynamic Compilation
Framework for Controlling Microprocessor Energy and Performance. In
Proceedings of the 38th Annual IEEE/ACM International Symposium on
Microarchitecture, MICRO 38, pages 271–282, Washington, DC, USA, 2005.
IEEE Computer Society.
[93] Rong Ye and Qiang Xu. Learning-Based Power Management for Multi-
Core Processors via Idle Period Manipulation.
[94] Tse-Yu Yeh and Yale N. Patt. Alternative Implementations of Two-level
Adaptive Branch Prediction. In Proceedings of the 19th Annual Inter-
national Symposium on Computer Architecture, ISCA ’92, pages 124–134,
New York, NY, USA, 1992. ACM.
[95] Wanghong Yuan and Klara Nahrstedt. Energy-efficient soft real-time
cpu scheduling for mobile multimedia systems. In Proceedings of the
Nineteenth ACM Symposium on Operating Systems Principles, SOSP ’03,
pages 149–163, New York, NY, USA, 2003. ACM.


