A Workload Adaptive Voltage Scaling Multiple Clock Domain Architecture by Biermann, David
A WORKLOAD ADAPTIVE VOLTAGE SCALING
MULTIPLE CLOCK DOMAIN ARCHITECTURE
A Dissertation
Presented to the Faculty of the Graduate School
of Cornell University
in Partial Fulfillment of the Requirements for the Degree of
Doctor of Philosophy
by
David Alan Biermann
January 2007
© 2007 David Alan Biermann
ALL RIGHTS RESERVED
A WORKLOAD ADAPTIVE VOLTAGE SCALING
MULTIPLE CLOCK DOMAIN ARCHITECTURE
David Alan Biermann, Ph.D.
Cornell University 2007
This thesis presents a comprehensive system for allowing a Multiple Clock
Domain (MCD) processor to adapt to its workload in an eﬃcient manner. We
present adaptive techniques at both the architecture and software levels. These
techniques allow our system to either meet speciﬁed throughput demands while
consuming as little energy as possible, or to stay within an average power budget
while providing the highest possible throughput.
We ﬁrst present an architecture-level adaptive system. This system adapts
the voltage/frequency conﬁguration of the MCD processor to meet the workload
of a single application eﬃciently. As its input, this system can take either a
throughput goal to meet using the least possible energy, or an average power
level to remain below. We also show that our adaptive system can give an MCD
processor increased tolerance to changes in performance and power dissipation due
to variations.
Next, we extend this adaptivity to multiprogrammed workloads. We present
a scheduling algorithm that considers the throughput goals of each running ap-
plication. Using this feedback, it schedules the applications in such a way as to
minimize total energy consumption without altering the throughput of the
individual applications.
Finally, we present a system that allows individual applications to determine
their throughput by comparing their actual progress to their desired progress rate.
This system acts as a bridge between our architectural and inter-program adaptive
systems. Each application’s desired throughput is used in two ways. First, this
throughput becomes the target throughput for our architecture-level system. Sec-
ond, this throughput information provides the feedback that allows our scheduler
to determine how it should schedule the application workload to minimize energy.
BIOGRAPHICAL SKETCH
David Biermann was born on February 27, 1978 at Durham General Hospital
in Durham, NC. His parents are Dr. Alan Biermann and Dr. Alice Gordon. Alan is
a retired Duke University professor and former Chair of Duke’s Computer Science
Department. Alice is a four-term county commissioner and former member of the
Psychology Department at the University of North Carolina. David grew up in
Chapel Hill with his parents and older sister Jennifer, who is a Duke alum and
practicing physician assistant.
After attending high school at Durham Academy, David enrolled at Duke Uni-
versity in the fall of 1996 in the School of Engineering. Despite spending most
of his time fanatically supporting the basketball team, he managed to graduate
Magna Cum Laude and with honors, receiving a B.S.E. in Electrical Engineering
and Computer Science in May 2000. In the fall of 2000, David came to Cornell and
joined the Computer Systems Lab in the Department of Electrical and Computer
Engineering. Under the supervision of Rajit Manohar he received the M.S. degree
in 2002 for work on an asynchronous data cache controller. In September 2006 he
received his Ph.D., working under Prof. Manohar and Prof. David Albonesi.
For my parents.
ACKNOWLEDGEMENTS
Thanks ﬁrst and foremost to Rajit Manohar, who has advised and mentored me
for my six years at Cornell. Special thanks to David Albonesi who co-advised me
for much of the work in this thesis. Also thanks to my committee members, José
Martínez and Martin Burtscher. Great thanks to my parents, who are my oldest
and most ardent supporters. Clint and Avneesh, the ﬁrst people who befriended
me at Cornell, thanks for all the good times. I should also thank Mark Heinrich,
who basically convinced me to come here. Thanks, of course, to the rest of the
AVLSI group, past and present: Chris, Fang, Filipp, Sandra, Song, Teifel, and
Virantha. And last but certainly not least, Kyna, my partner in life, thank you
for waiting for me to finish this thing.
TABLE OF CONTENTS
1 Introduction
1.1 Architectural Adaptivity
1.2 OS Level Adaptivity
1.3 Thesis Structure
2 Related Work
2.1 Multiple Clock Domain Processors
2.2 Architectural Techniques for Variations
2.3 OS Dynamic Voltage and Frequency Scaling
3 Modeling MCD Processors
3.1 Energy Model
3.2 Throughput Model
3.2.1 Execution Time
3.2.2 Critical Path Model
3.2.3 Critical Path Predictor
3.3 Summary
4 Adaptive MCD Architecture
4.1 Best Step Algorithm
4.2 Simulation Framework
4.3 Experimental Setup
4.4 Finding the Optimal DVFS Interval
4.5 Adapting to Meet a Throughput Target
4.6 Adapting to Meet an Energy Target
4.7 Adapting to Variations
4.8 State Space Study
4.8.1 Local Minima Search
4.8.2 Comparing BestStep to the optimal
5 Modeling Multiprogrammed Workloads
5.1 Problem Formulation
5.2 Energy Consumed by a Single Process
5.3 Minimizing Inter-Application Energy
5.3.1 Intuitive Meaning of the Scheduling Algorithm
6 Inter-Program Adaptivity
6.1 Rate-Matching Throughput Control
6.1.1 API Description
6.1.2 Implementation Considerations
6.2 Operating System Scheduler
6.3 Bochs Results
6.3.1 Calibration
6.3.2 Single Application Performance
6.3.3 Multiprogrammed Workloads
6.4 SESC Results
6.5 Summary
7 Conclusion
A SESC Simulator
A.1 Multiprogramming
A.2 GALS Processor Support
A.3 Critical Path Modeling
A.4 Dynamic Voltage Scaling
A.5 Leakage Modeling
Bibliography
LIST OF TABLES
3.1 Summary of Dependence Edges
3.2 Average Prediction Accuracy
3.3 Critical Node Predictor Accuracy versus Trace Length
4.1 Architectural Parameters
4.2 Average Et^2 Improvement with one leaky domain versus Monolithic.
4.3 Summary of Local Minima
6.1 New application state introduced by the throughput-adaptive system.
6.2 New commands utilized by the throughput-adaptive system.
6.3 List of Benchmarks
6.4 Audio files used as input for benchmarks.
6.5 Energy Savings versus Baseline Scheduler
6.6 3 Application Workloads
LIST OF FIGURES
1.1 MCD Domain Configuration
3.1 Critical Path Model, from Fields et al. [15]
3.2 Predicted versus Actual Throughput for a Sample Application
3.3 Predicted Count / Actual Count versus Trace Length (for vpr)
3.4 Critical Path Predictor
4.1 Et^2 Improvement for varying interval lengths.
4.2 Et^2 Improvement versus Monolithic with high throughput.
4.3 Et^2 Improvement versus Monolithic with low throughput.
4.4 Throughput Improvement versus Monolithic, meeting an energy target.
4.5 Et^2 Improvement with one leaky domain versus Monolithic.
4.6 mesa Energy Space (left) and Throughput Space (right)
4.7 swim Energy Space (left) and Throughput Space (right)
4.8 gcc Energy Space (left) and Throughput Space (right)
4.9 Energy Increase of BestStep algorithm compared to optimal
6.1 Results of calibration against measured data.
6.2 Performance of DVS Algorithms on a Single Application
6.3 RMTA with two applications running simultaneously.
6.4 RMTA with three applications running simultaneously.
6.5 mpg123 (left) and go (right)
6.6 Go/mpg123 (left) and 2 GSM encoders (right)
6.7 toast, mpg123 and Go
6.8 Two applications running with K1/K2 = 1
6.9 Two applications running with K1/K2 = 2
6.10 Two applications running with K1/K2 = 3
6.11 Results for K1/K2 = 1
6.12 Results for K1/K2 = 2
6.13 Results for K1/K2 = 3
6.14 Energy Savings for 3 Application Workloads
Chapter 1
Introduction
This thesis presents a Multiple Clock Domain (MCD) architecture that is able
to adapt to meet its workload demands in an eﬃcient manner. By eﬃcient, we
mean that the processor can either meet its throughput demand in a way that
minimizes energy consumption, or stay within an energy budget while providing
the greatest possible throughput. We ﬁrst present an architecture-level adaptive
system that allows the processor to adapt its voltage/frequency conﬁguration to
meet the demands of a single application in the most eﬃcient way possible. We
then extend this work to multiprogrammed workloads. We present an adaptive
scheduling algorithm that considers the throughput demands of the running appli-
cations and schedules them in such a way as to minimize the energy consumption
without altering the throughput. In both cases, we ﬁrst present the theoretical
background that leads us to our solution, then present the completed system and
its performance.
This kind of adaptivity has become increasingly important in modern micropro-
cessor architectures. Energy consumption has long been a concern for embedded
applications where battery life is essential. As device sizes shrink and power dis-
sipation per unit area increases dramatically, energy has now become a ﬁrst class
concern for all computer architects [49]. In addition, modern microprocessors are
expected to perform a wide variety of applications, from data driven applications
like DVD or MPEG players to control-intensive applications like compilers. The
optimal architecture for one category of application will differ from that of
another. Even within a single application, different phases of execution may place
different demands on the processor. This problem is further exacerbated in a
multiprogrammed environment where the processor will be expected to meet the
requirements of a wide variety of applications dynamically. In this context, it makes
sense to design processors that are able to quickly adapt to meet their workload
requirements in an energy eﬃcient manner.
1.1 Architectural Adaptivity
As advances in semiconductor technology bring smaller feature sizes and higher
performance, they also bring new challenges. One emerging problem is the increase
in process parameter variations in the newest generation of microprocessors [4],
[34]. This only serves to exacerbate the existing problems of increasing power
density and much greater leakage current, which are already ﬁrst class design
concerns at both the circuit and microarchitectural level. In addition to dealing
with leakage current and power dissipation, microprocessor designers will now need
to take on the additional burden of dealing with within-die variations, which can
cause frequency and leakage currents to vary signiﬁcantly within a single chip.
Multiple Clock Domain (MCD) processors have been studied extensively,
primarily as a means of managing design complexity and reducing energy consumption
[31], [57], [58]. The elimination of a global clock network removes one of the
greatest challenges in modern circuit design, namely that of distributing a higher and
higher frequency clock across an increasing number of transistors. MCD processors
also have advantages in terms of energy efficiency. The ability to run each domain
at a different voltage and frequency means that less critical domains can be run at
lower frequencies. This finer-grained frequency and voltage scaling offers
opportunities for greater energy savings than in a processor with a single
global clock. The drawback of MCD processors is the overhead of communicating
between domains. The necessity of synchronization between domains means that
an additional latency may be incurred on any communication action that crosses
a domain boundary.
More recently, MCD processors have been viewed as a means of addressing
some of the problems created by within-die variations. If one domain has reduced
performance, either in terms of frequency or leakage, it might be possible to com-
pensate for this deﬁciency by either raising or lowering the voltage of that domain,
isolating the problem from aﬀecting the entire processor.
We propose an adaptive MCD architecture that is able to reconﬁgure its volt-
age and frequency to meet a speciﬁc energy or throughput goal. We base our
architecture on the work presented by Semeraro et al. in [58], shown in Figure 1.1.
This MCD conﬁguration divides the processor into four domains: Front-End, In-
teger, Floating-Point and Load/Store. The Front-End contains all the fetch and
issue logic, the ROB and the L1 I-Cache. Integer, Floating-Point and Load/Store
domains contain the functional units, and associated issue queues as well as the
register ﬁles in the case of the Integer and Floating-Point domains. Both the L1
D-Cache and uniﬁed L2 cache are within the Load/Store unit.
Optimal dynamic voltage and frequency scaling (DVFS) decisions require feed-
back on processor energy and throughput. To this end, we present a model of MCD
processor performance that allows us to predict the impact of changing a single
domain on the overall system throughput. We also present a model of per-domain
energy usage. We use the feedback from these models in two ways: to meet a
throughput target with the least possible energy or to meet an energy constraint
with the best possible throughput. We also show that our system is inherently
tolerant of variations. Because energy and throughput information is tracked
dynamically, changes in domain performance due to variations can be accounted
for when making reconfiguration decisions.
[Figure 1.1: MCD Domain Configuration. Block diagram of the four clock domains: Front-End (Fetch; ROB, Rename, Dispatch; L1 I-Cache), Integer (Int Queue; Int FUs/Regs), Floating-Point (FP Queue; FP FUs/Regs) and Load-Store (LS Queue; L1 D-Cache; L2 Cache), with external Main Memory.]
This thesis makes four primary contributions to the design of MCD processors.
• MCD Throughput Modeling: We present a simple model for predicting the
throughput of an MCD processor. Our model uses a critical path predictor
to assign a portion of the execution time to each clock domain. Using this
information it can predict the change in overall throughput due to changing
the frequency of one domain. We show that this model is highly eﬀective for
predicting throughput when changes in frequency are small.
• Inter-Domain Interaction Modeled Dynamically: All previous online DVFS
algorithms for MCD processors use per-domain information for making fre-
quency scaling decisions. Although oﬄine algorithms have made use of ex-
tensive analysis of inter-domain interaction, no online algorithm has yet been
proposed that makes use of this type of information. Our algorithm captures
information on the interaction between domains dynamically in the form of
critical path information, and uses this information in the application of our
DVFS algorithm.
• Adaptation to a Speciﬁc Frequency or Energy Target: In any DVFS algo-
rithm, some throughput is sacriﬁced in the interest of reduced energy con-
sumption. However, architectural DVFS systems have typically selected an
arbitrary small performance penalty to be acceptable [39], [57], [70]. We
expand upon this prior work by creating a system that is designed to meet a
speciﬁed throughput goal in the most eﬃcient way possible. Previous work
has shown how the operating system can attempt to determine the correct
system throughput for the purposes of DVFS [3], [18]. Such approaches could
be used in concert with our system, with the architectural DVFS algorithm
attempting to meet the throughput goal set by the operating system. This
thesis will also present the details of one such system: an API for allowing
applications to specify their throughput need. In addition to meeting a spe-
ciﬁc throughput goal, our system can also be set to stay below a maximum
power budget, while providing the greatest possible throughput.
• Tolerance to Variations: In addition to adapting to a speciﬁc throughput or
energy goal, our algorithm is also sensitive to changes in performance and
energy consumption due to process variations. Since our algorithm takes
into account the dynamic throughput and per-domain energy consumption,
any change in the performance of any domain due to process variations is
factored into the choice of voltage/frequency configuration.
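To make the first contribution concrete, the following is a minimal sketch of the kind of first-order prediction described above. It is not the model developed in Chapter 3. It assumes that a critical path predictor yields a share of execution time for each domain, that each share scales inversely with that domain's frequency, and that the critical path itself does not shift, which is plausible only for small frequency changes. The function name, domain labels and numbers are hypothetical.

```python
def predict_execution_time(domain_time, old_freq, new_freq):
    """Return the predicted execution time after per-domain frequency changes.

    domain_time: execution time attributed to each domain, in seconds
                 (assumed to come from a critical path predictor).
    old_freq, new_freq: per-domain clock frequencies in any consistent unit.
    Assumption: time attributed to a domain scales as old_freq / new_freq.
    """
    return sum(
        t * old_freq[d] / new_freq[d] for d, t in domain_time.items()
    )


# Hypothetical attribution of a 1.0 s run across the four domains.
domain_time = {"front_end": 0.2, "integer": 0.5, "fp": 0.1, "load_store": 0.2}
old_freq = {d: 1.0 for d in domain_time}   # every domain at 1.0 GHz
new_freq = dict(old_freq, integer=0.8)     # slow only the Integer domain

predicted = predict_execution_time(domain_time, old_freq, new_freq)
relative_throughput = sum(domain_time.values()) / predicted
```

Under these assumptions, slowing the Integer domain by 20% lengthens only the 0.5 s attributed to it, so the predicted time rises from 1.0 s to 1.125 s and the relative throughput falls to about 0.89.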
1.2 OS Level Adaptivity
We will also extend adaptivity to the level of multiprogrammed workloads.
This adaptive system makes the following contributions. First, we present a model
of multiprogrammed workloads that shows how applications can be scheduled to
minimize the overall energy consumption without impacting performance. We then
present a system that allows individual applications to specify their throughput
goals. Under this system, applications can compare their actual progress with their
desired rate of progress. They are then able to raise or lower their throughput goal
accordingly. This throughput goal is used by the architectural adaptive system
mentioned in the previous section. The throughput goal is also passed as feedback
to the operating system. This throughput feedback allows us to build a scheduler
based on our model of multiprogrammed workloads. We show that this scheduler is
able to save 11.8% energy when compared to an ordinary priority scheduler across
a diverse set of workloads.
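The rate-matching idea can be illustrated with a small sketch. This is not the API presented in Chapter 6; the function name, the step size and the bounds on the throughput goal are assumptions chosen only for illustration.

```python
def adjust_throughput_goal(goal, actual_progress, desired_progress,
                           step=0.05, lowest=0.1, highest=1.0):
    """Raise or lower a normalized throughput goal by comparing progress.

    goal: current throughput goal as a fraction of peak (hypothetical unit).
    actual_progress, desired_progress: work completed so far versus the work
        the application wanted to have completed by now (e.g. frames decoded).
    """
    if actual_progress < desired_progress:
        # Falling behind: ask for more throughput.
        goal = min(highest, goal + step)
    elif actual_progress > desired_progress:
        # Running ahead: give throughput back to save energy.
        goal = max(lowest, goal - step)
    return goal


# Hypothetical decoder that wants 100 frames done but has finished only 90.
new_goal = adjust_throughput_goal(0.5, actual_progress=90, desired_progress=100)
```

In this sketch the returned goal would serve both as the target for the architecture-level system and as the feedback passed to the scheduler.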
1.3 Thesis Structure
The rest of this thesis is structured as follows. In Chapter 2, we discuss related
work on MCD architectures as well as relevant work on dynamic voltage scaling.
We also discuss how our work builds upon this prior work. In Chapter 3 we present
a model of the throughput and energy consumption for an MCD processor that
is simple yet accurate. In Chapter 4 we present an adaptive architectural system
based on this modeling, and present results for this system. In Chapter 5 we extend
our modeling to multiprogrammed workloads. In Chapter 6 we demonstrate a
system for allowing applications to specify their throughput needs, and discuss a
scheduler based on the model presented in Chapter 5. Finally, in Chapter 7 we
discuss our results and conclusions.
Chapter 2
Related Work
In this chapter we present work relevant to the architecture and inter-program
adaptive systems presented in this thesis. We divide this work into three categories.
(1) Work related to MCD processors and the use of DVFS in MCD processor
design. (2) Work discussing the architectural impact of process variations. (3)
Work related to DVS at the application and inter-program level.
2.1 Multiple Clock Domain Processors
Multiple Clock Domain processors were proposed by Semeraro et al. [58]
and Iyer et al. [31]. They propose a four-domain partition, with the front-end in
one domain and the primary functional blocks of the processor (Integer, FP and
Load/Store) each separated into their own domains. Further work by Semeraro
et al. [60] demonstrates the feasibility of MCD style processors by showing that
many of the synchronization delays introduced can be hidden and do not greatly
decrease processor performance. Talpes et al. [64] present an analysis of how dif-
ferent clock domain conﬁgurations and circuit design styles can impact the energy
and throughput performance of an MCD processor. A number of diﬀerent MCD
conﬁgurations have also been proposed. Iyer et al. proposed a ﬁve domain conﬁg-
uration, with the front end split into two domains [31]. More recently, Zhu et al.
have proposed a conﬁguration with the L2 cache and reorder buﬀer separated into
their own domains as well [75].
Previous work on frequency- and voltage-adaptive MCD systems has fallen
into two broad categories: offline schemes that use profiling or compile-time
information to do detailed program analysis, and online schemes that are driven with
dynamic program information. Magklis et al. propose a proﬁle based dynamic
voltage scaling scheme for MCD processors [39]. Oﬄine proﬁling identiﬁes major
phases within the application. For each of these phases, a cycle accurate simulator
is used to create a directed acyclic graph (DAG) of primitive events within the
processor. This graph is analyzed to identify slack along the edges of the graph.
This analysis attempts to determine the minimum frequency that the domains of
the MCD processor can be run at without dropping below a speciﬁed performance
level. Notable online methods include work by Semeraro et al. [57] and Wu et
al. [70], [71]. In all of this work, queue utilization is used as the primary input
for making DVFS decisions. Semeraro et al. use a heuristic algorithm for making
DVFS decisions [57]. This algorithm steps up the frequency and voltage of
a domain if the queue occupancy is increasing. If the queue occupancy is stable,
the voltage and frequency are allowed to slowly step down. Wu et al. expand upon
this technique [70]. They set target queue occupancies for each of the execution
domains. Their algorithm uses two input signals: (1) The diﬀerence between the
target queue occupancy and the actual queue occupancy. (2) The rate of change
of the queue occupancy. For each domain, these signals are combined with the
current throughput of the domain to determine the new throughput. In later work
they proposed a revised version of this algorithm that adapts the interval length
between voltage changes [71]. A slightly diﬀerent approach is taken by Semeraro
et al. [14]. They propose an adaptive MCD processor that can trade speed for
complexity in each domain by resizing various on-chip hardware structures, such
as caches, issue queues and the branch predictor. Initially each domain begins
operating at maximum frequency, with minimum sized structures. If the program
requires larger caches, or exhibits high levels of parallelism, the hardware struc-
tures can be upsized, at the cost of reduced frequency. The overall goal is to
maximize the throughput of each domain.
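As an illustration of the queue-driven online schemes discussed above, the sketch below follows the description given here of the heuristic of Semeraro et al. [57]: raise a domain's frequency when its queue occupancy is rising, otherwise let it decay slowly. The thresholds, step sizes and frequency bounds are hypothetical, and treating a falling occupancy like a stable one is an assumption of this sketch rather than a detail taken from [57].

```python
def next_frequency(freq, prev_occupancy, curr_occupancy, *,
                   rise_threshold=1.0, step_up=0.06, decay=0.01,
                   f_min=0.25, f_max=1.0):
    """Choose a domain's next frequency from its issue-queue occupancy.

    freq: current frequency in GHz (hypothetical range f_min..f_max).
    prev_occupancy, curr_occupancy: average queue entries in the previous
        and current interval.
    """
    if curr_occupancy - prev_occupancy > rise_threshold:
        # Occupancy is increasing: the domain is falling behind, step up.
        freq *= 1.0 + step_up
    else:
        # Occupancy is stable (or falling, by assumption): decay slowly.
        freq *= 1.0 - decay
    # Clamp to the supported range; voltage is assumed to track frequency.
    return max(f_min, min(f_max, freq))


rising = next_frequency(0.5, prev_occupancy=4, curr_occupancy=8)
stable = next_frequency(0.5, prev_occupancy=4, curr_occupancy=4)
```

Note that a rule of this form looks only at the domain's own queue, which is the per-domain limitation that the critical-path-based approach in this thesis is meant to address.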
Our work diﬀers from this prior work in several respects. All of the online
techniques proposed use domain-speciﬁc information for making DVFS decisions.
Only the oﬄine, proﬁle-based technique uses inter-domain information [39]. We
propose an online technique that utilizes inter-domain information in the form of
critical path information. A number of the online schemes use queue utilization
as the primary input to their DVFS algorithms [57], [70], [71]. As a result, only
the frequencies of the execution units are scaled. By capturing the interaction
between the front-end and the execution units using critical path information, we
are able to make informed decisions about scaling the frequency and voltage of the
front-end as well.
2.2 Architectural Techniques for Variations
Although much work has been devoted to circuit level solutions to deal with
process variations, relatively little work has been done on using architectural tech-
niques to address this problem. Borkar et al. [5] and Bowman et al. [6] discuss
how variations will aﬀect processor frequencies in future generations. Skadron et
al. discuss how within-die variations will impact future generations of multi-core
processors [24]. Their analysis suggests that random within-die variations will
not have a significant impact on architectural design, because they will affect all
architectural components equally. However, systematic within-die variations may
cause a significant difference in the performance of different on-chip components.
Marculescu et al. specifically address how MCD processors may have advantages
in terms of dealing with variability [42]. Because MCD processors allow diﬀerent
domains to operate at diﬀerent speeds, the impact of a single critical path that
is slowed due to variations can be isolated. This means that the expected aver-
age frequency of each domain is higher than the expected frequency of an entire
synchronous processor. Marculescu et al. study this advantage and quantify the
potential beneﬁts.
2.3 OS Dynamic Voltage and Frequency Scaling
For any dynamic voltage adaptation scheme to be feasible, the hardware
must support operation at multiple voltage levels. Current commercial proces-
sors support a small number of discrete voltage/frequency adjustment options.
Intel’s Mobile Pentium III with SpeedStep has two levels of operation [25] and
AMD's Mobile Athlon 4 with PowerNow! [1] has five levels. This small range of
voltage adjustment can only support very coarse-grain voltage adjustments, such
as lowering the voltage when a system switches to battery power. Transmeta’s
Crusoe processor is one of the few commercial processors to support ﬁne-grain
voltage/frequency adjustments. However, unlike our voltage-adaptive scheme the
Crusoe’s voltage/frequency adjustments are not directly driven by the application.
Crusoe’s power management software monitors power consumption by sampling
CPU sleep states and using heuristics to adjust the voltage and frequency of the
processor [20]. Some low-power embedded processors, such as the Intel 80200 [26],
also support ﬁne-grain voltage/frequency adjustments. The existence of these ca-
pabilities in modern processors has spurred researchers to examine algorithms for
voltage/frequency management.
Previous work on voltage adaptation has focused on operating system techniques
that choose an operating voltage which minimizes idle time. Early work
by Weiser et al. showed the potential beneﬁts of voltage scaling [69]. They looked
at two schemes, FUTURE and PAST, that examine idle time in scheduling win-
dows to determine the voltage setting for the next epoch and compared these to
OPT, the optimal strategy. This work was further extended by Govil et al., who
examined many other candidate strategies for voltage scheduling using the same
framework as Weiser [21]. They showed that when evaluating a range of appli-
cations with a single scheduling policy, simple strategies achieved energy savings
that were comparable to those obtained by more sophisticated strategies. Martin
examined the effect of non-ideal battery behavior and memory performance, and
used this to formulate a more sophisticated model of the effect of voltage/frequency
scaling on system lifetime [47]. Grunwald et al. experimentally evaluate diﬀerent
voltage scaling policies on Itsy, a prototype hand-held computer [22]. They con-
clude that none of the policies proposed to date work well in the general case.
These papers share a set of common characteristics: (i) They investigate single
system heuristics that have to operate well across a wide range of applications;
(ii) They are all coarse grained and interval based. The system re-evaluates the
voltage setting only when the scheduler is invoked. The choice of the interval
is determined by the scheduler, independent of application needs; (iii) They are
driven entirely by system idle time, which is not directly related to application
needs. Inferring computation needs from idle time measurements is complicated
by phenomena like deceptive idle times [29], where applications might remain idle
due to outstanding I/O requests. These three characteristics have limited the ef-
ﬁciency of such schemes. In contrast, our rate-matching based approach enables
application-controlled voltage adaptation, facilitates scheduling at finer
granularity, and is directly driven by application progress.
Several recent papers have presented OS level DVS systems that automati-
cally detect program needs and set performance levels accordingly. Flautner et al.
present a system that analyzes program behavior from the scheduler's perspective
to find trends and predict future performance [18], [19]. They also consider user
input events to ensure that interactive applications are sufficiently responsive. Lorch
et al. present a similar system [38] that also incorporates PACE, their method for
calculating a future performance schedule based on a probabilistic model of past
behavior. One of the interesting diﬀerences in their approach is that it starts with
a low voltage and then increases performance as necessary to meet program de-
mands [37]. This diﬀers from most papers, which attempt to maintain a constant
voltage level.
There has been much work (both theoretical and simulation based) on opti-
mal voltage scheduling policies based on complete information about application
deadlines, arrival times, and computation workload. Hong et al. look at the volt-
age scheduling problem given full information about periodic tasks [23]. Pering
et al. describe the design of a low-power microprocessor system that incorporates
dynamic voltage scaling [51]. They build on a real-time OS infrastructure, and
assume that application deadlines and computation needs are available to their
scheduler. Ishihara et al. make the same assumptions and approach the optimal
voltage scheduling problem through linear programming [28]. Manzak et al. pro-
vide techniques to compute the optimal task voltages for a number of tasks that
have a common, global deadline [40],[41]. More recent work makes similar as-
sumptions when implementing their real-time dynamic voltage scaling algorithms
[52],[74]. Pouwelse et al. ([53],[54]) present a scheduling algorithm that uses a
heuristic algorithm to match the throughputs of the running processes. This body
of work relies on explicit application deadline information, and total application
execution time for at least one reference voltage. In practice, these two metrics, es-
pecially the latter, may be diﬃcult to obtain. Furthermore, having the application
compute an accurate estimate of future workload is likely to incur an unaccept-
able performance penalty. In contrast, our rate-matching based approach does not
require that the application make any estimates of its future behavior.
Simunic et al. present techniques that use a change-point detection algorithm
to detect the diﬀerence in arrival and service rates for an MPEG player and MP3
player [62]. They assume the presence of a power manager that can monitor these
rates and the number of frames decoded by the two players. Their technique also
uses an oﬀ-line calculation to determine thresholds for the change-point detection
algorithm. Our rate-matching based approach, while similar in spirit, relies only on
run-time information and does not require an oﬀ-line calculation phase. Further, it
is not limited to applications with input and output queues that can be monitored
by dedicated hardware.

Chapter 3
Modeling MCD Processors
In this chapter we present a general ﬁrst-order model of the energy and through-
put of a multiple clock domain (MCD) processor. If we are to make rational
decisions about how to best trade oﬀ energy and performance, we must ﬁrst un-
derstand how each domain impacts the overall performance of the processor as a
whole. This model must be both instructive enough to provide insight into the pro-
cessor’s behavior, and simple enough to be analyzed rapidly at run-time. To meet
this goal we use a simple yet accurate first-order model of energy and throughput.
We ﬁrst discuss the energy model, which is both simpler and more general than
the throughput model. We then discuss the throughput model, which is highly
accurate for small shifts in frequency around a speciﬁc operating point, but does
not hold for any arbitrary mix of domain frequencies.
3.1 Energy Model
We assume that the total energy consumed by the processor performing a
specific amount of work will be equal to the sum of the energies consumed by each
domain. The energy for a single domain operating at voltage $V$ for time $T$ is
considered to consist of a dynamic component that is proportional to $V^2$ and a
static component that is proportional to both $V$ and $T$ [10]. So for domain $i$, the
energy consumed performing a unit of work can be expressed as follows, where $e_i$
and $l_i$ are constants:
\[ E_i = e_i V_i^2 + l_i V_i T_i \tag{3.1} \]
The total energy is then expressed as:
\[ E_{processor} = E_1 + E_2 + \dots + E_i \tag{3.2} \]
In order to make use of this energy model, we must be able to determine the
$e_i$ and $l_i$ parameters. The parameter $l_i$, which is the nominal leakage current for
a domain, must be determined by offline testing, and is assumed to be constant
for the purposes of the model. The parameter $e_i$ can be determined dynamically.
The dynamic component of the energy equation is expressed as:
\[ E_{dynamic}(i) = e_i V_i^2 \]
where $V_i$ is known. Previous work has shown that the dynamic energy can be
predicted accurately using existing on chip counters [27]. Once we have both the
dynamic energy consumed and the voltage, we can determine $e_i$.
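As a minimal sketch of how Equations 3.1 and 3.2 might be evaluated in software, the following Python fragment solves the dynamic term for the coefficient e_i and then sums the per-domain energies. The function names and all numeric values are illustrative assumptions and do not come from the measurements in this work.

```python
def dynamic_coefficient(e_dynamic, voltage):
    """Solve E_dynamic(i) = e_i * V_i^2 for e_i."""
    return e_dynamic / voltage ** 2


def domain_energy(e_i, l_i, voltage, time):
    """Equation 3.1: dynamic term plus leakage term for one domain."""
    return e_i * voltage ** 2 + l_i * voltage * time


def processor_energy(domains):
    """Equation 3.2: sum of the per-domain energies.

    `domains` is a list of (e_i, l_i, voltage, time) tuples.
    """
    return sum(domain_energy(*d) for d in domains)


# Hypothetical example: counters report 2.0 units of dynamic energy at 1.0 V.
e_int = dynamic_coefficient(e_dynamic=2.0, voltage=1.0)
total = processor_energy([(e_int, 0.5, 1.0, 2.0), (1.0, 0.0, 2.0, 1.0)])
```

Here the leakage coefficient passed to `domain_energy` stands in for the offline-measured l_i, while e_i is recomputed whenever a new dynamic energy estimate is available.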
3.2 Throughput Model
Throughput is much more diﬃcult to quantify, since domains have complicated
interactions which are diﬃcult to model analytically. However, we present here an
analytic model that we will show can accurately predict the throughput when we
make small changes in the current domain operating frequencies.
3.2.1 Execution Time
We assume that a fraction of the execution time can be assigned to each domain
and that the total execution time is the sum of these times. Thus, the total time
can be expressed as follows:
\[ Time = T_1 + T_2 + \dots + T_i \tag{3.3} \]
The time assigned to each domain is the number of cycles assigned to that domain
divided by the operating frequency of that domain:
\[ Time_i = \frac{Cycles_i}{Freq_i} \tag{3.4} \]
The relationship between frequency and voltage is taken to be [66]:
\[ Freq \propto \frac{(V_{dd} - V_{th})^{\alpha}}{V_{dd}} \tag{3.5} \]
where $V_{dd}$ is the operating voltage, $V_{th}$ is the threshold voltage and $\alpha$ is a
technology dependent factor that is usually close to 2 [12].
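The frequency/voltage relationship of Equation 3.5 can be sketched as follows. Since the equation is only a proportionality, the function returns a relative frequency. The bisection helper that inverts the relation, and the threshold voltage and exponent used in the example, are assumptions made for illustration.

```python
def relative_frequency(vdd, vth, alpha):
    """Right-hand side of Equation 3.5, up to the constant of proportionality."""
    return (vdd - vth) ** alpha / vdd


def voltage_for_frequency(target, vth, alpha, v_lo, v_hi, tol=1e-9):
    """Bisection search for the lowest vdd in [v_lo, v_hi] reaching `target`.

    Assumes vth <= v_lo and alpha >= 1, so that relative_frequency is
    increasing over the interval.
    """
    lo, hi = v_lo, v_hi
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if relative_frequency(mid, vth, alpha) < target:
            lo = mid
        else:
            hi = mid
    return hi


# Hypothetical technology parameters: Vth = 0.3 V, alpha = 2.
f_full = relative_frequency(1.2, 0.3, 2.0)
v_needed = voltage_for_frequency(0.4, 0.3, 2.0, v_lo=0.3, v_hi=1.2)
```

A run-time system would more likely use a precomputed table of voltage/frequency pairs; the search is shown only to make the direction of the mapping explicit.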
In order to implement this model at run-time, we must be able to determine
both the number of cycles and the energy consumption that can be attributed
to each domain. Next, we discuss in detail how execution time can be divided
between domains.
3.2.2 Critical Path Model
Implicit in our model of the execution time is the assumption that the total
execution time can be divided into a series of sequential events, each of which can
be assigned to a speciﬁc domain. This section will discuss how we accomplish this
using critical path information.
We can view the execution of a program as a critical path of events which
occur in sequence. If each of these events could be considered to have occurred
exclusively within a single domain of the processor, we could assign to each domain
the total number of cycles that were required to execute those events, and thereby
divide the total execution time between the domains. In order to do this, we need
both a critical path model and a means for tracking that critical path within the
processor.

Figure 3.1: Critical Path Model, from Fields et al. [15]
(Node rows: Dispatch, Execution, Commit. Horizontal axis: Time, with instructions
i, i+1, i+2. Legend: Front-End, Int, FP, LS, ROB, Dependence, Critical Path.)
We begin with the critical path model outlined by Fields et al. [15],[16],[17]
and illustrated in Figure 3.1. In this model, nodes represent events within the
execution of an instruction. In Figure 3.1, each vertical column of three nodes
corresponds to a single instruction. Each instruction is broken down into three
events, corresponding to that instruction being dispatched (D), executed (E), and
committed (C). Edges between nodes represent dependencies between the events
they connect. For example, each commit node has a dependence edge to the
commit node of the following instruction, representing the constraint of in-order
retirement. Edges between execution nodes represent data dependencies between
those instructions. The complete description of edges is in Table 3.1.
Each node in the graph has at least one incoming edge. For any node, the
“critical” edge is the edge which arrives last. To find the overall critical path
through a sequence of instructions, we must trace back from the commit of the
final instruction along the last arriving edge of each node.

Table 3.1: Summary of Dependence Edges
Target Node   Edge                  Description
Dispatch      $E_{i-1} \to D_i$     Preceding instruction was a branch miss.
              $C_{i-w} \to D_i$     ROB was stalled on prev cycle.
              $D_{i-1} \to D_i$     In-order dispatch.
Exec          $D_i \to E_i$         Exec follows dispatch.
              $E_j \to E_i$         j produced value consumed by i.
Commit        $E_i \to C_i$         Commit follows exec.
              $C_{i-1} \to C_i$     In-order commit.
We can easily map this critical path model onto our proposed MCD processor
with each node representing an event that occurs exclusively within a domain.
The critical path model can be mapped onto any MCD architecture that utilizes a
front-end domain, with the ROB either included in the front-end, or in a separate
domain, and an arbitrary number of execution domains. Later in this section we
will discuss how critical path nodes can be tracked and counted at run-time.
Now that we have established a critical path of execution, and assigned the
sequential events within that path to speciﬁc domains, we can express the time
that is assigned to each domain. This time is taken to be the number of critical path
events (CP Count) that fall within the domain, multiplied by the nominal delay
(in cycles) of each such event (DelayFactor), divided by the operating frequency
of the domain.
\[ Time_i = \frac{DelayFactor_i \times CP\ Count_i}{Freq_i} \tag{3.6} \]
We can now write the equation for total execution time as:
\[ Time = \sum_i Time_i \tag{3.7} \]
Using this equation we can now predict the eﬀect on execution time of varying
the frequencies of the domains. Of course this prediction assumes that shrinking a
section of the critical path will have a commensurate eﬀect on reducing the overall
execution time. In reality, reducing the length of one critical path will expose
other critical paths. For this reason, the time estimates are accurate only when
the domain frequency changes are small. We demonstrate the accuracy of this
model below.
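To illustrate how Equations 3.6 and 3.7 might be used to predict the effect of a frequency change, the sketch below recomputes the total time after changing one domain's frequency. The dictionary layout and the example values are assumptions chosen for illustration only.

```python
def domain_time(delay_factor, cp_count, freq):
    """Equation 3.6: time assigned to one domain."""
    return delay_factor * cp_count / freq


def total_time(domains):
    """Equation 3.7: sum of the per-domain times."""
    return sum(
        domain_time(d["delay_factor"], d["cp_count"], d["freq"])
        for d in domains.values()
    )


def predicted_time(domains, name, new_freq):
    """Predicted total time if only domain `name` moves to `new_freq`."""
    changed = dict(domains)
    changed[name] = dict(domains[name], freq=new_freq)
    return total_time(changed)


# Hypothetical interval: two domains with made-up counts and frequencies.
domains = {
    "int": {"delay_factor": 2.0, "cp_count": 100, "freq": 1.0},
    "fp": {"delay_factor": 1.0, "cp_count": 50, "freq": 0.5},
}
before = total_time(domains)
after = predicted_time(domains, "int", 2.0)
```

Doubling a frequency, as in this example, is far outside the small-change regime in which the model is accurate; the large step is used only to keep the arithmetic easy to follow.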
The last obstacle comes in determining the DelayFactor for each domain. This
is, in eﬀect, an estimate of the average number of cycles of delay of each critical
node within the domain. Ideally, this number could be calculated statically. For
example, the delay factor to perform an addition is a constant number of cycles.
However, diﬀerent actions performed by a speciﬁc domain require a diﬀerent num-
ber of cycles to perform. In addition, this delay also captures many subtle eﬀects;
for example the average latency of an addition could increase due to contention
for the functional units. For this reason, the most practical way to determine the
delay factor for each domain is to observe it experimentally, by varying the speed of
each domain individually and observing the change in performance. The execution
time with respect to a single domain, i, can be written as:
\[ Time = \frac{DelayFactor_i \times CP\ Count_i}{Freq_i} + T_0 \tag{3.8} \]
where $T_0$ represents all of the execution time other than that which is assigned
to the domain of interest. By running with domain $i$ at a certain speed, then
running at a slightly different speed, we can observe the $Time$, $CP\ Count_i$ and
$Freq_i$ for two intervals. Using this information we can solve for the $DelayFactor_i$
for this domain. We need only perform this calibration step periodically, as the
DelayFactor does not vary dramatically over the course of execution on the bench-
marks we examined.
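One way to carry out this calibration is to write Equation 3.8 once for each of the two intervals and eliminate T_0. The sketch below does so under the assumption, made here for illustration, that both DelayFactor_i and T_0 are unchanged between the two intervals; the function name and the numbers are hypothetical.

```python
def calibrate_delay_factor(time_a, cp_a, freq_a, time_b, cp_b, freq_b):
    """Solve two instances of Equation 3.8 for DelayFactor_i and T_0.

    time_a = DF * cp_a / freq_a + T0
    time_b = DF * cp_b / freq_b + T0
    Assumes DF and T0 are the same in both intervals.
    """
    denom = cp_a / freq_a - cp_b / freq_b
    if denom == 0:
        raise ValueError("the two intervals give no information about the domain")
    delay_factor = (time_a - time_b) / denom
    t0 = time_a - delay_factor * cp_a / freq_a
    return delay_factor, t0


# Hypothetical observations from two consecutive intervals.
df, t0 = calibrate_delay_factor(
    time_a=250.0, cp_a=100, freq_a=1.0,
    time_b=450.0, cp_b=100, freq_b=0.5,
)
```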
We have veriﬁed this approach experimentally, by using it to predict the change
in execution time as we sweep the frequency of a single domain through a wide
range. We chose a mix of eight SPEC2000 benchmarks for this test: applu, swim,
vpr, wupwise, bzip2, gcc, mesa and vortex. Each benchmark was warmed up
for 1 billion cycles. Each domain was set to 50% of the max frequency, with the
exception of the domain being studied. For this domain, we set the domain to
10% of the max frequency, then increased in steps to 90%. This experiment was
repeated, sweeping the frequency from 90% down to 10%. The length of each
step was set to 10 million cycles. Note that at each step, we can recalculate the
DelayFactor, based on the previous two intervals. After each step we compare the
actual run-time with the time predicted by the model. A plot of these frequency
sweeps for each domain for a single benchmark (vpr) is shown in Figure 3.2. The
average accuracy across all applications is shown in Table 3.2. For the four domains,
Front-End, Int, FP and LS, the worst case errors were 1.87%, 2.54%, 0.44%
and 1.99% respectively.
Table 3.2: Average Prediction Accuracy
Domain      Front-End   Int    FP     LS     Overall
Error (%)   1.43        2.05   0.39   1.96   1.45

Figure 3.2: Predicted versus Actual Throughput for a Sample Application
(Four panels, each with Actual and Prediction series.)
3.2.3 Critical Path Predictor
In order to implement our model we assume that we are able to accurately
count the number of critical path nodes that lie within each domain. In order
to do this, we must be able to record the dependencies between instructions, and
then trace backwards to determine the critical path. To precisely determine the
critical path, we must record every node until the program ﬁnishes execution and
then trace back to the beginning. Clearly, it is not feasible in practice to record
the dependencies between billions of instructions dynamically. The critical path
can be approximated by recording a smaller number of instructions, and tracing
back periodically to identify the critical nodes, and then incrementing the total
number of critical nodes for each domain based on that trace. We then ask the
following question: How long must our incremental traces be in order to approach
the correct critical path counts? In order to answer this question, we ran the same
eight benchmarks mentioned above. We varied the length of the periodic traces
from 128 to 8192 instructions and compared the results against the actual results,
as recorded from a trace of the entire program length. The ratio between the
estimated critical path node counts and the actual values is shown in Figure 3.3,
for one application (vpr). For most of the applications, the estimated counts begin
to converge with the actual counts as the trace size exceeds 1024 instructions. The
average accuracy across all benchmarks is recorded in Table 3.3, for trace sizes of
1024 through 8192.
Figure 3.3: Predicted Count / Actual Count versus Trace Length (for vpr)
(Four panels. Horizontal axis: Trace Length, with values 128, 256, 512, 1024, 2048,
4096, 8192, Inf.)
We believe that keeping traces of between 2048 and 8192 entries is feasible
in hardware using a similar hardware overhead to the criticality predictor used
by Fields et al. [15]. Their critical path predictor consists of additional entries in
the ROB which contain critical edge information, plus an external predictor. As
instructions retire, the critical edge information is passed to the external predictor.
We can utilize the same mechanism, but with a different external predictor. They
are trying to identify specific critical instructions, whereas we are interested in
how often the critical path flows through certain domains.

Table 3.3: Critical Node Predictor Accuracy versus Trace Length
Trace Length   1024     2048    4096    8192
Front-End      11.86%   4.44%   2.34%   1.42%
Int            12.85%   6.01%   3.18%   2.04%
FP              5.56%   1.09%   0.89%   0.62%
LS              6.59%   3.51%   2.78%   1.96%
In order to construct the critical path graph shown in Figure 3.1, we need
a record of the critical edges between the graph nodes. Recall that each node
represents an event, with each instruction consisting of three events corresponding
to that instruction being dispatched, executed and committed. Each event must
be preceded by several other events: for example the execution of an instruction
must be preceded by the dispatch of that instruction, and by the execution of the
instruction or instructions which generate its operands. These causal relationships
are represented by the edges of the graph. For each event the critical edge is the
last arriving edge. Each of the three nodes within a single instruction has exactly
one critical edge. It is these edges that must be recorded within the
ROB. Each instruction’s ROB entry contains three subﬁelds, one for each node.
These subﬁelds each contain an index representing the last arriving edge for that
node. This index speciﬁes the instruction and node from which the last arriving
edge originates. So for an n entry ROB, the index ﬁeld would have log2(n)+2 bits:
log2(n) bits to specify the instruction from which the critical edge originates, and
2 bits to specify the node within that instruction. For example, if the execution
of instruction i stalls because one of its operands is being produced by instruction
j, the index ﬁeld associated with the execution of instruction i would contain a
reference to the execution of instruction j.
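The storage cost of these index fields can be worked out directly. The sketch below computes the width of one field and the total number of added bits for an n-entry ROB; rounding the logarithm up for sizes that are not powers of two is an assumption made here, and the function names are illustrative.

```python
def critical_edge_field_bits(rob_entries):
    """Bits in one index field: ceil(log2(n)) for the instruction plus 2 for the node."""
    instruction_bits = (rob_entries - 1).bit_length()  # equals ceil(log2(n)) for n >= 1
    return instruction_bits + 2


def rob_overhead_bits(rob_entries):
    """Total added bits: three fields (dispatch, execute, commit) per ROB entry."""
    return 3 * critical_edge_field_bits(rob_entries) * rob_entries


# Example with a 128-entry ROB.
field_bits = critical_edge_field_bits(128)
total_bits = rob_overhead_bits(128)
```

For a 128-entry ROB this gives 9 bits per field, 27 bits per entry, and 3456 bits in total.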
As an instruction retires, the critical edge information associated with that
instruction is passed to the external predictor. This information is recorded in a
hardware structure as a trace. When the hardware structure ﬁlls, the predictor
performs a “trace-back.” This consists of tracing backwards from the commit of
the last committed instruction, along the last arriving edges stored in the trace. We
must also keep a counter associated with each domain. As the trace-back occurs,
at each node we increment the counter of the domain associated with that node.
So each dispatch node that we pass through increments the front-end counter. For
execute nodes, the counter that is incremented depends on the instruction type
(integer, ﬂoating-point or load/store). Since instructions will continue to commit
as the trace back occurs, we will need either two structures to store traces, or a
multiported structure that allows the information from the committing instructions
to be written into the entries freed by the trace-back. The entire critical path
predictor is summarized in Figure 3.4.
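The trace-back step can be sketched in software as follows. The trace representation (a list of dictionaries with a last-arriving edge per node) is an assumption made for illustration, as is the choice to charge commit nodes to the front-end counter, which corresponds to a configuration with the ROB placed in the front-end domain. The hardware structure itself would of course differ.

```python
from collections import Counter

DISPATCH, EXECUTE, COMMIT = "D", "E", "C"


def trace_back(trace):
    """Walk the last-arriving edges backwards from the final commit.

    Each trace entry is a dict with:
      "type":  "int", "fp" or "ls" (selects the counter for the execute node)
      "edges": maps each node to (source_index, source_node), or to None when
               the last-arriving edge originates outside the recorded trace.
    Returns per-domain counts of critical path nodes.
    """
    counts = Counter()
    if not trace:
        return counts
    index, node = len(trace) - 1, COMMIT
    while True:
        entry = trace[index]
        if node == EXECUTE:
            counts[entry["type"]] += 1
        else:
            # Assumption: dispatch and commit nodes belong to the front-end.
            counts["front_end"] += 1
        source = entry["edges"][node]
        if source is None:
            break
        index, node = source
    return counts


# Hypothetical two-instruction trace.
trace = [
    {"type": "int", "edges": {DISPATCH: None, EXECUTE: (0, DISPATCH), COMMIT: (0, EXECUTE)}},
    {"type": "fp", "edges": {DISPATCH: (0, DISPATCH), EXECUTE: (0, EXECUTE), COMMIT: (1, EXECUTE)}},
]
counts = trace_back(trace)
```

In this example the critical path runs from the commit of the second instruction through its execute node, then through the execute and dispatch nodes of the first instruction.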
Figure 3.4: Critical Path Predictor
(Each ROB entry holds Dispatch, Execution and Commit critical edge fields of
log2(n) + 2 bits each. Critical path information from the retiring instruction and a
2-bit instruction type feed the external critical path predictor, which contains the
trace of critical path edges, the trace logic, and the Front-End, Int, FP and L/S
counters.)

3.3 Summary
In this chapter we have presented a first-order model of MCD processor energy
and throughput that is simple yet effective. We have demonstrated that throughput
can be accurately predicted for small changes in frequency by assuming that
the execution time can be divided between the domains and that changing the
frequency of one domain will only affect the fraction of the execution time assigned to
that domain. We then demonstrated how we can perform this division of execution
time using a critical path model. Finally, we discussed how a hardware predictor can
accurately predict the critical path.

Chapter 4
Adaptive MCD Architecture
In this chapter, we will discuss how the model presented in the previous chapter
can be used to improve the efficiency of an MCD system. We will present an adaptive
algorithm that makes use of the model, and then evaluate how effective it is at
running real-world applications. We will also compare its performance against a
simple, baseline DVS scheme, and one of the best published systems for DVS on
an MCD processor.
4.1 Best Step Algorithm
We now propose an algorithm that uses our model’s dynamic energy and
throughput predictions to select a voltage and frequency conﬁguration that eﬃ-
ciently meets a speciﬁc energy or throughput goal. The external input to this
algorithm is a throughput or power target that the algorithm attempts to follow.
As shown in [3] and [18], throughput targets can be dynamically determined by
the operating system. For the rest of the paper we will refer to our algorithm as
BestStep, as it attempts to determine the domain that is most critical to meeting
our performance demand, and then steps that domain up or down.
The algorithm ﬁrst determines whether the system is currently running above
or below its performance target. The algorithm behaves identically whether the
target it is attempting to meet is a power or throughput goal. In either case it will
attempt to either raise or lower the performance of the system to meet the target
in the most energy-eﬃcient manner possible.
Once it has determined that it is taking a step up or down, the BestStep
algorithm selects a small change in throughput towards the target. The targeted
change in throughput must be small, because our throughput model can only
accurately predict the change in throughput for small changes in frequency. Recall
from the previous section that we can predict the new execution time (and therefore
the throughput) given the frequency change of a single domain. Therefore, given a
new throughput goal, we can determine the change in frequency for each domain
that will meet that throughput goal.
After determining the change in frequency (and thus voltage) for each domain
that will meet the throughput target, the algorithm selects the domain that can
meet this goal with the least amount of energy and then scales this domain’s
voltage and frequency. As the algorithm takes successive steps toward the
throughput/energy goal, domains which have a larger impact on performance with
relatively low power consumption will run faster, while more power hungry domains
or domains with little impact on performance will slow down. The desired result
is a voltage/frequency conﬁguration that will meet the targeted performance level
in an energy eﬃcient manner. Note that since we only scale a single domain, the
DelayFactor for that domain can be re-calculated at the beginning of the next
interval, thus the delay factors are continuously being updated.
By considering leakage as well as dynamic power, the BestStep algorithm has
an advantage over other DVFS schemes, especially at low throughput/energy tar-
get points. Because of the high leakage in modern microprocessors, it is possible
that lowering the throughput could actually increase energy consumption, due to
the fact that leakage power dissipation does not decrease with lower voltage nearly
as much as dynamic power. If the BestStep algorithm predicts that lowering
the voltage will actually increase the total energy consumed, it will maintain the
current performance level, instead of stepping down.
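A single BestStep decision, as described above, might be sketched as follows. This is an illustrative approximation and not the implementation evaluated in this work: the target is expressed as an execution time for a fixed unit of work, leakage is charged over the total execution time, the frequency-to-voltage mapping is passed in as a function, and the step size and frequency limits are hypothetical parameters.

```python
def total_time(domains):
    """Sum of DelayFactor * CP Count / Freq over all domains."""
    return sum(d["delay_factor"] * d["cp_count"] / d["freq"] for d in domains.values())


def total_energy(domains, voltage_for):
    """Dynamic plus leakage energy; leakage accrues over the total time (assumption)."""
    t = total_time(domains)
    energy = 0.0
    for d in domains.values():
        v = voltage_for(d["freq"])
        energy += d["e"] * v ** 2 + d["l"] * v * t
    return energy


def best_step(domains, target_time, voltage_for, step=0.02, f_min=0.25, f_max=1.0):
    """Return (domain, new_freq) for the cheapest single-domain step, or None to hold."""
    now = total_time(domains)
    if now == target_time:
        return None
    speed_up = now > target_time
    goal = now * (1 - step) if speed_up else now * (1 + step)
    best = None
    for name, d in domains.items():
        t_i = d["delay_factor"] * d["cp_count"] / d["freq"]
        new_t_i = t_i + (goal - now)  # this domain alone absorbs the change
        if new_t_i <= 0:
            continue
        new_freq = d["delay_factor"] * d["cp_count"] / new_t_i
        if not f_min <= new_freq <= f_max:
            continue
        trial = dict(domains)
        trial[name] = dict(d, freq=new_freq)
        energy = total_energy(trial, voltage_for)
        if best is None or energy < best[2]:
            best = (name, new_freq, energy)
    if best is None:
        return None
    # When stepping down, hold if the model predicts no energy saving.
    if not speed_up and best[2] >= total_energy(domains, voltage_for):
        return None
    return best[0], best[1]


def linear_voltage(freq):
    """Hypothetical normalized mapping: voltage equals frequency."""
    return freq


# Hypothetical domains: "b" has five times the dynamic energy coefficient of "a".
dynamic_heavy = {
    "a": {"delay_factor": 1.0, "cp_count": 100, "freq": 0.5, "e": 1.0, "l": 0.0},
    "b": {"delay_factor": 1.0, "cp_count": 100, "freq": 0.5, "e": 5.0, "l": 0.0},
}
up = best_step(dynamic_heavy, target_time=300.0, voltage_for=linear_voltage)
down = best_step(dynamic_heavy, target_time=500.0, voltage_for=linear_voltage)
```

With these numbers the cheaper domain is sped up and the power hungry domain is slowed down. If the domains are instead dominated by leakage, the same call returns None on a step down, which corresponds to maintaining the current performance level.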
Because of the relative complexity of this algorithm, we assume that it will be
run in software. We discuss in the results section how we account for the additional
overhead incurred.
4.2 Simulation Framework
Our simulation framework is based on a heavily modiﬁed version of the SESC su-
perscalar simulator [56]. For modeling of dynamic power, SESC includes a version
of the Wattch power estimation tool [7]. We have modiﬁed the energy counters
used by Wattch to account for the changes in voltage in our DVFS system. We
also consider the logic blocks in the processor to be clock-gated.
As SESC does not include any type of static energy modeling, we have imple-
mented a leakage model that calculates the energy consumed due to subthreshold
leakage in each domain. We base our leakage calculation on the model presented
in Tsai et al. [67]. This model ﬁrst presents a transistor level model of leakage
current. Using the transistor leakage equations, it then calculates the leakage of a
number of common circuits present in modern microprocessors. Finally, the num-
ber of each type of circuit is estimated for each of the major architectural blocks
of the processor, thereby providing a leakage estimate for the entire processor.
We have also modiﬁed the SESC simulator to model a Multiple Clock Domain
microprocessor with a four domain conﬁguration similar to that presented by Se-
meraro et al. [58], with each domain able to operate at any frequency within the
allowed range. In addition to allowing each domain to operate at a diﬀerent volt-
age and frequency, each communication action between domains incurs a 1-cycle
synchronization penalty. This results in a 6.1% degradation in performance versus
a fully synchronous processor operating at the same clock frequency. The general
architectural parameters are given in Table 4.1.

Table 4.1: Architectural Parameters
Parameter          Value
Br. Pred.          Bimodal/2-level PAg
Level 1            1024 entries, 10-bit hist.
Level 2            1024 entries
Bimodal Size       1024
Comb. Pred. Size   4096
Br. Mis. Pen.      17
Decode Width       4
Issue Width        6
Retire Width       6
Phys. Reg. File    72 int, 72 FP
L1 D-Cache         64kB, 2-Way Assoc.
L1 I-Cache         64kB, 2-Way Assoc.
L1 Cache Lat.      2
L2 Cache           1MB, dir. mapped
L2 Cache Lat.      12
Integer ALUs       4+1 mult/div
FP ALUs            2+1 mult/div
Int. Queue         32 entries
FP Queue           32 entries
L/S Queue          64 entries
ROB Size           128
Since our adaptation system requires a small amount of additional hardware
and periodically runs an algorithm in software, we must also account for this
additional overhead. In order to do this, we charge an energy cost for each access
to the table required in our critical path predictor. Since our adaptation algorithm
does not actually run in software on SESC, we must also account for its energy
consumption and running time. To do this, we compiled the algorithm and ran
it through SESC to determine its energy cost and execution time in cycles. This
energy cost and time are added each time the algorithm is performed dynamically.
With the algorithm operating at a granularity of 100k cycles, the average energy
overhead was 0.4% and the average time overhead was 0.3%. We discuss in the
results section why this interval length was selected. Note that the energy estimate
is conservative, since our test run of the algorithm was performed with all domains
set to the maximum voltage.
4.3 Experimental Setup
We test our algorithm on a broad mix of SPEC2000 integer and ﬂoating
point applications as well as an MPEG encoder and decoder. The applications
selected for these tests are: mesa, applu, vortex, crafty, wupwise, twolf, gzip,
equake, swim, vpr, bzip2, gcc, MPEG-encode and MPEG-decode. For the SPEC
benchmarks, we used the MinneSPEC Large input data sets [35], run to completion
with no fast-forwarding.
For comparison, we use as a baseline a DVFS algorithm that varies all of
the domains monolithically. This algorithm monitors the throughput and energy
consumption dynamically, as does our BestStep algorithm. However, it simply
raises or lowers the voltage of all domains in lock-step, depending on whether
the processor is currently above or below its energy/throughput target. In the
subsequent sections and in all results, this algorithm is labeled Monolithic.
We also compare against one of the best DVFS systems for MCD architectures
presented in the literature by Wu et al. [70]. We have implemented this algorithm
to work on our simulation framework. Using the same nomenclature as in Wu et al.
[70], we subsequently refer to this algorithm as Analytic. The Analytic algorithm
attempts to vary the voltage and frequency of its execution domains to maintain a
target queue occupancy in the input queues of each domain (this reference queue
occupancy is called $q_{ref}$). In order to do this, the algorithm monitors the average
queue occupancy over an interval of execution. For interval $k$, the occupancy of a
queue is referred to as $q_k$. At the end of each interval, the algorithm adjusts the
throughput of each execution domain. The algorithm utilizes two input signals:
(1) The difference between the current queue occupancy and the reference queue
occupancy ($q_k - q_{ref}$). (2) The rate of change of the queue occupancy ($q_k - q_{k-1}$).
It uses a linear combination of these input signals, along with the current domain
throughput, to calculate the throughput for the next interval. If we define the
new throughput selected at the end of interval $k$ to be $\mu_k$, we can express this
throughput as follows:
\[ \mu_k = \mu_{k-1} + K_I (q_k - q_{ref}) + K_P (q_k - q_{k-1}) \]
where $K_I$ and $K_P$ are constant parameters of the algorithm, chosen so that the
algorithm is stable as it converges to a new throughput.

Figure 4.1: $Et^2$ Improvement for varying interval lengths.
(Vertical axis: $Et^2$ Improvement (%), 0 to 10. Horizontal axis: Interval Length,
with values 10k, 20k, 50k, 100k, 500k, 1M.)
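The update rule of the Analytic algorithm can be written as a one-line function. The gain values in the example are hypothetical and are not the constants used by Wu et al. [70]; only the form of the update follows the equation above.

```python
def next_throughput(mu_prev, q_k, q_prev, q_ref, k_i, k_p):
    """mu_k = mu_{k-1} + K_I * (q_k - q_ref) + K_P * (q_k - q_{k-1})."""
    return mu_prev + k_i * (q_k - q_ref) + k_p * (q_k - q_prev)


# Hypothetical interval: occupancy rose from 6 to 8 entries with a reference of 6.
mu = next_throughput(mu_prev=1.0, q_k=8, q_prev=6, q_ref=6, k_i=0.01, k_p=0.02)
```

A queue that is fuller than the reference, or that is filling up, raises the domain throughput for the next interval; when the occupancy sits at the reference and is not changing, the throughput is left unchanged.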
In our results, we primarily use the $Et^2$ metric (energy times delay squared).
We chose this metric over energy-delay product or any other proposed metric as
it is approximately voltage independent to the first order. In experiments where
the throughput is almost identical for each algorithm on a specific benchmark, we
report only $Et^2$. In the experiment in which the throughputs vary significantly
but energy remains constant, we report throughput.
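The metric and its first-order voltage independence can be illustrated with a short sketch. It assumes, for the purpose of the example only, that energy scales as the square of the voltage and that delay scales inversely with the voltage; the percentage-improvement formula is likewise an assumed definition, since the exact formula is not spelled out in the text.

```python
def et2(energy, time):
    """Energy times delay squared."""
    return energy * time ** 2


def improvement_percent(baseline, candidate):
    """Relative reduction of the metric versus a baseline, in percent (assumed definition)."""
    return 100.0 * (baseline - candidate) / baseline


# Under the idealized scaling E ~ V^2 and t ~ 1/V, changing the voltage by a
# factor s leaves the metric unchanged.
energy, time, s = 4.0, 3.0, 0.8
unscaled = et2(energy, time)
scaled = et2(energy * s ** 2, time / s)
```

Because the voltage scaling factor cancels under these assumptions, a difference in the metric between two algorithms reflects how well each distributes voltage among domains rather than how far each has simply scaled down.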
4.4 Finding the Optimal DVFS Interval
As our BestStep algorithm periodically changes the voltage/frequency con-
ﬁguration, we must ﬁrst determine the interval length that provides the best en-
ergy/throughput trade-oﬀ.
For each application we set a target IPC of 80% of the maximum IPC (the IPC
of running the entire application with each domain set to the maximum frequency).
We then ran the BestStep algorithm with an interval of 10k, 20k, 50k, 100k, 500k
and 1M instructions. For each interval length, we compare average $Et^2$
improvement over Monolithic for all benchmarks. The results are presented in Figure 4.1.
An interval of 100k cycles gave the best performance for the test benchmarks;
therefore, we will use this interval length in all subsequent experiments.
Figure 4.2: $Et^2$ Improvement versus Monolithic with high throughput.
(Vertical axis: $Et^2$ Improvement (%), 0 to 20. Series: Analytic, BestStep.
Benchmarks: mesa, applu, vortex, crafty, wupwise, twolf, gzip, equake, swim, vpr,
bzip, gcc, mpeg-encode, mpeg-decode, average.)

Figure 4.3: $Et^2$ Improvement versus Monolithic with low throughput.
(Vertical axis: $Et^2$ Improvement (%), -30 to 30. Series: Analytic, BestStep.
Benchmarks: mesa, applu, vortex, crafty, wupwise, twolf, gzip, equake, swim, vpr,
bzip, gcc, mpeg-encode, mpeg-decode, average.)
4.5 Adapting to Meet a Throughput Target
We ﬁrst present results of the BestStep algorithm adapting its energy con-
sumption to meet a throughput goal in the most energy-eﬃcient way possible.
This presents a problem when comparing against the Analytic algorithm, as it was
not designed to meet a speciﬁc throughput goal. In order to make a fair compari-
son, we ﬁrst ran the Analytic algorithm with the same conﬁguration presented in
Wu et al. [70]. The Qref values used by the INT, FP and LS domains are 6, 5 and
3 respectively, with the frequency adjusted every 10,000 cycles. We then targeted
both the Monolithic and BestStep algorithms to run at the same throughput as
Analytic for each application. For each benchmark, the overall throughput of the
three algorithms differed by less than 1%. In our results we report the improvement
in $Et^2$ for the Analytic and BestStep algorithms relative to Monolithic. We
report $Et^2$, rather than simply energy, to take into account any small differences
in throughput. The Analytic and BestStep algorithms perform similarly,
with Analytic returning a slightly better 7.3% improvement in $Et^2$ versus 7.2%
for BestStep. The results are shown in Figure 4.2. We denote this as the “high
throughput” case, because the target throughputs are very close to the maximum
throughputs.
Two primary factors degrade the performance of the BestStep algorithm rela-
tive to the Analytic algorithm. First, BestStep tends to perform better at lower
throughputs, where it has more ﬂexibility to ﬁnd diﬀerent voltage conﬁgurations
to meet the throughput goal. The closer a particular benchmark runs to its max
throughput, the less flexibility the BestStep algorithm has to adapt. Second, the
BestStep algorithm depends on the accuracy of its model predictions. If a bench-
mark is simply unpredictable, the BestStep algorithm’s performance will suﬀer.
An example of this type of unpredictability would be a benchmark in which the
DelayFactor parameters varied signiﬁcantly.
The BestStep algorithm performs well for benchmarks with predictable
DelayFactors. Also, the BestStep algorithm performs particularly well when the optimal
operating point consists of values that are very disparate. For benchmarks in which
the diﬀerent domains have widely diﬀering voltages, the BestStep algorithm has
an advantage over both the Monolithic and Analytic algorithms. The Monolithic
algorithm in particular must run all domains at the same voltage, and therefore
is at a signiﬁcant disadvantage in this case. Though the Analytic algorithm can
run its execution domains at independent voltages, it does not have the ﬂexibility
to change the voltage of its Front-End. The twolf benchmark in particular is an
example of an application where the good performance of the BestStep algorithm
is particularly attributable to the widely varying domain voltages.
Next, we wish to evaluate the BestStep algorithm adapting to a lower throughput
target. Because the Analytic algorithm is designed to perform very close to the
maximum throughput, there is relatively little latitude for the BestStep algorithm
to adjust its conﬁguration. The BestStep algorithm is designed to function most
eﬀectively when it is known that the processor can be scaled to a lower through-
put target. While the Analytic algorithm is excellent very close to the maximum
throughput, it is at a fundamental disadvantage at very low throughputs, because
it cannot scale down the power hungry front-end domain.
In order to evaluate this scenario, we target the performance of each application
to be approximately 50% of its maximum throughput. Since the Analytic algo-
rithm is not designed to target a speciﬁc throughput, we reduced its throughput
by extending the length of the Qref target queue occupancies until each bench-
mark was slowed to approximately 50% of its maximum throughput. In order to
keep the comparison fair, we targeted the BestStep and Monolithic algorithms to
run at the same throughput as the Analytic algorithm. In the results presented
in Figure 4.3, for each benchmark all three algorithms are running at the same
throughput, with the Monolithic results used as the baseline for comparison. For
this experiment, BestStep improvement in Et2 jumps to 15.1% over Monolithic
versus only 1.3% for Analytic.
4.6 Adapting to Meet an Energy Target
We now evaluate the BestStep algorithm adapting to achieve the maximum
throughput while maintaining a speciﬁed average energy consumption. In order to
do this, we target an average energy consumption of 50% of energy consumed when
running at maximum throughput. As the Analytic algorithm was not intended
to target average energy consumption, we only compare BestStep to Monolithic
for this experiment.
Figure 4.4: Throughput Improvement versus Monolithic, meeting an energy target.
[Bar chart. Y-axis: Throughput Improvement (%), 0 to 20. X-axis: mesa, applu, vortex, crafty, wupwise, twolf, gzip, equake, swim, vpr, bzip, gcc, mpeg-encode, mpeg-decode, average. Series: BestStep.]
We report the improvement in throughput of the BestStep
algorithm versus Monolithic in Figure 4.4. Throughput with BestStep is 7.6%
higher than Monolithic on average. Not surprisingly, improvement in throughput
is not as great as the improvement in energy shown in the previous experiment,
because a small shift in throughput corresponds to a relatively large change in
energy consumption.
4.7 Adapting to Variations
Next, we show that our algorithm is able to adapt to the case in which one or
more domains dissipate more power due to the increased leakage current caused
by process variations. Recent work addressing variations at the architectural level
has suggested that systematic rather than random within-die variations will have
the greatest impact at the architectural level [24]. Since random variations affect
each transistor independently, they tend to reduce average performance, but
affect architectural blocks equally. Systematic variations, on the other hand, affect
large sections of the chip, and therefore may cause a significant difference in
performance between architectural components. Change in the effective length
of transistors (Leff) has been considered as one of the major sources of systematic
within-die variations [24], [63]. Recent work has examined different likely scenarios
for how greatly Leff is likely to vary within a single die. Srivasta et al. consider
the case in which the 3σ for the variation of Leff is 10-20%. Cao et al. estimate
16.7% as the 3σ variation of Leff in fabrication technologies down to 70nm [11]. In
our results, we will consider 3σ variations of Leff that are between 10% and 20%.
Figure 4.5: Et2 Improvement with one leaky domain versus Monolithic.
[Bar chart. Y-axis: 0% to 25%. X-axis (leaky domain): Front-End, Integer, Floating Point, Load-Store, Average. Series: Base Et2 Improvement, Leakage = +21.8%, Leakage = +41.1%, Leakage = +79.1%.]
With a 3σ variation of 10%, the standard deviation of the leakage current for a
single transistor is 21.8%. However, as the 3σ of Leff increases to 20%, the stan-
dard deviation of leakage current jumps to 79.1% [63]. If we consider it plausible
that a single domain’s leakage current could diﬀer by one standard deviation from
the nominal value for the rest of the processor, then we will examine the possibil-
ity that a single domain’s per-transistor leakage current could be between 21.8%
and 79.1% greater than that of the rest of the processor. In our experiments we
will consider cases where the leakage of a single domain is 21.8%, 41.1% or 79.1%
greater than the nominal value for the rest of the processor (41.1% corresponds
to the case where 3σ for Leff = 15%). Figure 4.5 shows our results for these
experiments. For each scenario, we consider the case in which each of the four
domains has higher leakage than the rest of the processor. The BestStep algo-
rithm shows the greatest improvement when the Load/Store or Front-End domain
has higher leakage. Since these domains contain the caches, and therefore would
have a greater impact on overall leakage energy, this result is not surprising. The
average energy savings for each leakage data point are summarized in Table 4.2.
Table 4.2: Average Et2 Improvement with one leaky domain versus Monolithic.
Base      Leakage = +21.8%   Leakage = +41.1%   Leakage = +79.1%
15.1 %    15.3 %             17.1 %             19.6 %
We observe that the BestStep algorithm achieves higher energy savings versus
Monolithic in the presence of variations. However, the improvement is modest.
In the worst case, where one domain’s leakage is 79.1% greater than the nominal
leakage, the energy savings only increases from 15.1% to 19.6%. The advantage
that the BestStep algorithm has is that it can consider the increased leakage of one
domain when making DVFS decisions. As a result, it may choose to run the leaky
domain at a lower voltage than it would otherwise. However, it then becomes
necessary to raise the voltages of other domains to meet the throughput goal.
The result is that there is a signiﬁcant, but not dramatic, improvement in energy
consumption.
4.8 State Space Study
In this section we study the energy and throughput space of our benchmarks.
By better understanding how applications behave at diﬀerent operating points, we
hope to better understand the theoretical basis of our algorithm.
4.8.1 Local Minima Search
The BestStep algorithm is essentially performing a walk through the energy
space, trying to ﬁnd the lowest energy point that satisﬁes the throughput require-
ment. If the algorithm is trying to move to a lower energy point, but it calculates
that a move in any direction would result in an increase in energy, it will remain
at its current operating point. This leaves the possibility that we could get stuck
in a local minimum. In order to evaluate the likelihood of this scenario, we ran a
state-space study of a large selection of benchmarks. We used the following SPEC
2000 benchmarks with the MinneSpec [35] Large data sets: mesa, applu, vortex,
crafty, mgrid, bzip2, vpr, gcc, wupwise, gzip, equake, gap, swim, art, mcf,
apsi, and parser. We also used the same MPEG encoder and decoder mentioned
in the previous section. For each benchmark we evaluated a block of 100 million
instructions, after 1 billion instructions of warm-up. For this block of instructions
we performed a study of every possible voltage configuration, with the voltage of
each domain ranging from 0.5 V to 1.5 V in discrete steps of 0.1 V. For each con-
ﬁguration, we recorded the throughput and energy. A visual representation of the
throughput and energy spaces for 3 applications (Mesa, Swim and GCC) is shown in
Figures 4.6, 4.7 and 4.8. In order to represent the spaces in a 3-D plot, we reduced
the set of data points to only those in which the three back-end domains (Int, FP
and L/S) have the same voltage. In essence, these plots show the energy space for
these applications running on a two-domain MCD processor. Although these plots
do not give us any quantitative information about the presence of local minima
in the energy space, they provide some intuitive insight into the behavior of the
applications.
Figure 4.6: mesa Energy Space (left) and Throughput Space (right)
More importantly, we have analyzed the data gathered from the energy-spaces
of each benchmark. For the energy-space of each benchmark, we searched
the entire space for local minima. In the majority of benchmarks, we found only
a single minimum, at the minimum operating voltage: the point at which each
domain is operating at 0.5 V. However, in the case of three applications (bzip2,
vpr, and apsi), a local minimum was found other than the global minimum.
Figure 4.7: swim Energy Space (left) and Throughput Space (right)
Figure 4.8: gcc Energy Space (left) and Throughput Space (right)
A summary of these local minima is presented in Table 4.3. For a fourth benchmark,
gzip, the global minimum was not at the minimum operating voltage, but
rather at a slightly higher operating point (also included in Table 4.3). Although
this does not present a problem for our algorithm, we thought it worth noting, as
every other application has its global minimum at the minimum operating point.
Of course, this analysis is not exhaustive. The real processor has an inﬁnite
number of operating points; as we can only evaluate a ﬁnite number, we must limit
ourselves to discrete operating points. Due to the computational complexity of
evaluating the entire space of each application (14,641 simulations per application)
we have limited ourselves to evaluating a relatively short section of code (100M
instructions). Although significant, 100 million instructions cannot be considered
representative of the application as a whole. However, we consider these results to
be indicative that the BestStep algorithm is unlikely to frequently be caught in
local minima. Out of 19 benchmarks evaluated, we found only three that exhibited
a local minimum, and in the case of each of these, only a single minimum other
than the global minimum was found.
Table 4.3: Summary of Local Minima
Application Local Minima [Front,Int,FP,LS]
bzip2 [0.7, 0.7, 0.5, 0.7]
vpr [0.7, 0.7, 0.5, 0.7]
apsi [0.7, 0.7, 0.5, 0.5]
gzip [0.5, 0.6, 0.5, 0.5]
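The local-minima search described above can be sketched as follows. This is only an illustration under stated assumptions: a grid point is treated as a local minimum when every axis-aligned, single-step neighbour has strictly higher energy, and `toy_energy` is a hypothetical stand-in for the simulated energy of each configuration (the actual study used simulation results, which are not reproduced here).

```python
from itertools import product


def find_local_minima(energy, levels, n_domains=4):
    """Return every grid point whose energy is strictly lower than that of
    all axis-aligned, single-step neighbours (assumed neighbourhood)."""
    minima = []
    last = len(levels) - 1
    for idx in product(range(len(levels)), repeat=n_domains):
        here = energy(tuple(levels[i] for i in idx))
        is_min = True
        for d in range(n_domains):
            for step in (-1, 1):
                j = idx[d] + step
                if 0 <= j <= last:
                    nb = list(idx)
                    nb[d] = j
                    if energy(tuple(levels[i] for i in nb)) <= here:
                        is_min = False
                        break
            if not is_min:
                break
        if is_min:
            minima.append(tuple(levels[i] for i in idx))
    return minima


# 0.5 V to 1.5 V in steps of 0.1 V: 11 levels per domain, 11**4 points.
LEVELS = [round(0.5 + 0.1 * i, 1) for i in range(11)]


def toy_energy(voltages):
    # Hypothetical energy surface, NOT simulation data: sum of V^2.
    return sum(v * v for v in voltages)


minima = find_local_minima(toy_energy, LEVELS)
print(len(LEVELS) ** 4, minima)
```

With this toy surface the only minimum is the point where all four domains run at 0.5 V, which mirrors the behaviour reported for the majority of benchmarks.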
4.8.2 Comparing BestStep to the optimal
As the BestStep algorithm attempts to ﬁnd the lowest energy operating point for
a speciﬁc throughput, there is no guarantee that it will ﬁnd the optimal operating
point for that throughput. This is because the algorithm stops at the ﬁrst oper-
ating point that meets its throughput goal. Ideally, we would like to compare the
performance of the BestStep algorithm to an “optimal” algorithm that would give
us the optimal operating point at each reconﬁguration point. However, this would
require us to run an exhaustive search to ﬁnd the optimal conﬁguration for each
interval during the course of each benchmark. While this is not feasible, the state
space study we have performed does give us a detailed look at a single interval of
each benchmark. Using this information, we can get an idea of how the BestStep
algorithm could perform relative to the optimal.
Figure 4.9: Energy Increase of BestStep algorithm compared to optimal
[Bar chart. Y-axis: Energy Increase (%), 0 to 25. X-axis: applu, apsi, art, bzip2, crafty, equake, gap, gcc, gzip, mcf, mesa, mgrid, parser, swim, vortex, vpr, wupwise, mpeg-encode, mpeg-decode, average.]
Because we have recorded the critical path counts at each point in the state
space, we have all the information we need for the BestStep algorithm to
perform a walk through the state space we have
recorded to ﬁnd an operating point that meets a speciﬁc throughput goal. We can
then compare the operating point selected by the BestStep algorithm to the best
operating point at that throughput. This is an idealized experiment, because under
real operating conditions the algorithm is attempting to predict the behavior of
the next interval based on the previous interval, whereas in this experiment, every
point in the state space represents the same interval of the benchmark. However,
this will give us some idea of the upper bound on performance we can expect using
this approach. For this experiment we select a throughput goal that is 50% of the
maximum throughput. The results are shown in Figure 4.9. For these benchmarks
the BestStep algorithm uses 10.2% more energy than the optimal. Although our
algorithm gives no guarantee of optimality, it does reach an operating point fairly
close to the optimal. Of course, during the real execution of a program, the gap
between optimal and BestStep will likely be larger. It appears there is still room
for future algorithms to expand and improve upon these results.
Chapter 5
Modeling Multiprogrammed Workloads
In this chapter we extend our study of adaptivity to a processor adapting to a
multiprogrammed workload. This presents a challenge, when each application is
attempting to adapt to its own workload, oblivious to the requirements of other
applications running simultaneously in the system. Without intervention from the
operating system, individual applications may make choices which are not optimal
in the global context. It then becomes the responsibility of the operating system
to schedule applications in a way that is globally optimal. We will discuss how the
operating system can adapt its scheduling for the best performance with regard to
energy and throughput. We begin with the simple throughput and energy models
presented in Chapter 3. We then discuss how these can be extended to model
multiprogrammed workloads. Finally, using this model of inter-program behavior,
we show how a scheduler can minimize the total energy used by all applications.
Our intuition tells us that our scheduling algorithm should give more processor
time to applications with high throughput demands, and take processor share away
from lower throughput applications. In response, the lower throughput applica-
tions will be forced to increase their throughput goal to meet their performance
needs, while higher throughput applications can slow down while still maintaining
their overall throughput. This section will explain how the scheduler can achieve
this in such a way as to minimize overall energy consumption.
5.1 Problem Formulation
The job of the operating system scheduler is to assign a fraction of the processor’s
execution time to each running process. We will define these slices as $s_1, s_2, \ldots, s_n$,
where there are n processes running in the system and $s_i$ is the fraction of the
processor time that is given to process i, i.e.
\[ \sum_i s_i = 1. \]
We wish to ﬁnd the set of slices that allows each task to meet its throughput
goal, while minimizing global dynamic power consumption. We will solve this
problem in two steps. First we will derive an expression for the energy consumed
by a process completing K units of work in time T. Then, using this expression,
we will derive the slice values that will minimize the overall energy used by all the
processes running in the system.
5.2 Energy Consumed by a Single Process
Recall that we deﬁned the dynamic energy of a MCD processor in Chapter 3 as
follows:
\[ E = \sum_i e_i V_i^2 \]
And the fraction of the execution time assigned to each domain as:
\[ T_i = \frac{\mathit{DelayFactor}_i \cdot \mathit{CPCount}_i}{F_i} \]
So the slice value for application i is:
\[ s_i = \frac{T_i}{\sum_j T_j} \]
Recall also that the frequency of each domain is deﬁned as:
\[ F_i = \frac{f_i (V_i - V_{th})^{\alpha}}{V_i} \tag{5.1} \]
where $f_i$ and $\alpha$ are properties of the processor.
The overall throughput of the processor is equal to the amount of work performed
divided by the time required to perform it:
\[ F = \frac{K}{T} \tag{5.2} \]
where F is the throughput, K is the work and T is the time.
We now introduce a new parameter, $t_i$, that represents the fraction of the total
execution time T that is assigned to domain i, so that the time can now be expressed
as:
\[ T_i = t_i T \]
We define the total work being performed by the process as K, and the fraction of
the work assigned to each domain as $k_i$. Therefore, the amount of work assigned
to domain i is:
\[ K_i = \mathit{DelayFactor}_i \cdot \mathit{CPCount}_i = k_i K \]
To simplify the analysis in order to find a closed-form expression for the time slice
values, we will assume the following approximation:
\[ F_i \approx f_i V_i^{\alpha-1} \]
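To get a feel for what this approximation discards, the two frequency models can be compared numerically. The sketch below is illustrative only: the parameter values (`F_I`, `V_TH`, `ALPHA`) are hypothetical and are not taken from the processor model used in this thesis.

```python
def domain_frequency(v, f_i, v_th, alpha):
    """Equation 5.1: F_i = f_i * (V_i - V_th)**alpha / V_i."""
    return f_i * (v - v_th) ** alpha / v


def approx_frequency(v, f_i, alpha):
    """Simplified model used for the closed form: F_i ~= f_i * V_i**(alpha - 1)."""
    return f_i * v ** (alpha - 1)


# Hypothetical parameters chosen only for illustration.
F_I, V_TH, ALPHA = 1.0, 0.3, 2.0

for v in (0.5, 1.0, 1.5):
    exact = domain_frequency(v, F_I, V_TH, ALPHA)
    approx = approx_frequency(v, F_I, ALPHA)
    print(f"V={v:.1f}  exact={exact:.3f}  approx={approx:.3f}")
```

The two expressions coincide when the threshold voltage is zero. For a positive threshold voltage the simplified form overestimates the frequency, and the gap narrows as the supply voltage rises.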
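For a process performing K units of work, we can express the voltage of each
domain as follows:
\[ V_i = \left(\frac{F_i}{f_i}\right)^{\frac{1}{\alpha-1}} = \left(\frac{k_i K}{f_i t_i T}\right)^{\frac{1}{\alpha-1}} \]
Substituting, we can now express the dynamic energy consumed by a process
performing one unit of work.
\[ E_{unit} = \sum_i e_i \left(\frac{k_i K}{f_i t_i T}\right)^{\frac{2}{\alpha-1}} = \left(\frac{K}{T}\right)^{\frac{2}{\alpha-1}} \sum_i e_i \left(\frac{k_i}{f_i t_i}\right)^{\frac{2}{\alpha-1}} \]
So for a process performing K units of work the energy consumed will be:
\[ E = E_{unit} \cdot K = K \left(\frac{K}{T}\right)^{\frac{2}{\alpha-1}} \sum_i e_i \left(\frac{k_i}{f_i t_i}\right)^{\frac{2}{\alpha-1}} \]
Now we have the following equation for energy consumed by a process completing
K units of work:
\[ E = K^{\frac{\alpha+1}{\alpha-1}} T^{\frac{-2}{\alpha-1}} \sum_i e_i \left(\frac{k_i}{f_i t_i}\right)^{\frac{2}{\alpha-1}} \]
Since every term inside the sum is a constant, for notational simplicity we will replace
the entire sum with the constant ε.
\[ E = \varepsilon T^{\frac{-2}{\alpha-1}} K^{\frac{\alpha+1}{\alpha-1}} \tag{5.3} \]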
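Equation 5.3 is easy to evaluate directly, which makes the trade-off between execution time and energy concrete. The following is a minimal sketch; the function name and the numeric values are hypothetical, and α = 2 is assumed purely to keep the arithmetic simple.

```python
def process_energy(eps, work, time, alpha):
    """Equation 5.3: E = eps * T**(-2/(alpha-1)) * K**((alpha+1)/(alpha-1))."""
    exp_time = -2.0 / (alpha - 1.0)
    exp_work = (alpha + 1.0) / (alpha - 1.0)
    return eps * time ** exp_time * work ** exp_work


# Hypothetical values: with alpha = 2 the model reduces to E = eps * K**3 / T**2.
e_fast = process_energy(eps=1.0, work=2.0, time=1.0, alpha=2.0)
e_slow = process_energy(eps=1.0, work=2.0, time=2.0, alpha=2.0)
print(e_fast, e_slow)
```

Under these assumptions, doubling the time available for the same amount of work cuts the dynamic energy by a factor of four.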
5.3 Minimizing Inter-Application Energy
We now wish to adjust the slice values ($s_1, s_2, \ldots, s_n$) of each application in order
to minimize the overall energy consumed by all applications.
We first consider the case of two applications, A and B. Using Equation 5.3,
we can express the total energy consumed by both applications as follows:
\[ E = E_A + E_B = \varepsilon_A T_A^{\frac{-2}{\alpha-1}} K_A^{\frac{\alpha+1}{\alpha-1}} + \varepsilon_B T_B^{\frac{-2}{\alpha-1}} K_B^{\frac{\alpha+1}{\alpha-1}} \]
We will define A’s time slice as s. If the total time is T, the execution time available
to A is:
\[ T_A = T \cdot s \]
The time slice for application B is (1 − s), therefore B’s execution time is:
\[ T_B = T \cdot (1 - s) \]
We now re-write the total energy as:
\[ E = \varepsilon_A (Ts)^{\frac{-2}{\alpha-1}} K_A^{\frac{\alpha+1}{\alpha-1}} + \varepsilon_B (T(1-s))^{\frac{-2}{\alpha-1}} K_B^{\frac{\alpha+1}{\alpha-1}} \]
We wish to find the value of s that minimizes the energy, E. To do this we set the
derivative of E with respect to s to 0.
\[ \frac{\partial E}{\partial s} = 0 \]
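For readers who want the intermediate step, the derivative can be written out explicitly. This is a worked expansion of the two-application energy expression, assuming α > 1 and 0 < s < 1 so that every power is well defined:

```latex
\[
\frac{\partial E}{\partial s}
  = \frac{-2}{\alpha-1}\, T^{\frac{-2}{\alpha-1}}
    \left[
      \varepsilon_A K_A^{\frac{\alpha+1}{\alpha-1}}\, s^{-\frac{\alpha+1}{\alpha-1}}
      - \varepsilon_B K_B^{\frac{\alpha+1}{\alpha-1}}\, (1-s)^{-\frac{\alpha+1}{\alpha-1}}
    \right]
\]
```

The prefactor is nonzero, so the derivative vanishes exactly when the bracketed term is zero.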
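Which yields the following expression:
\[ \frac{\varepsilon_A K_A^{\frac{\alpha+1}{\alpha-1}}}{s^{\frac{\alpha+1}{\alpha-1}}} = \frac{\varepsilon_B K_B^{\frac{\alpha+1}{\alpha-1}}}{(1-s)^{\frac{\alpha+1}{\alpha-1}}} \]
For notational simplicity, we make the following substitutions:
\[ \varepsilon_A = \epsilon_A^{\frac{\alpha+1}{\alpha-1}} \qquad \varepsilon_B = \epsilon_B^{\frac{\alpha+1}{\alpha-1}} \]
Which allows us to re-write the equality as:
\[ \left(\frac{\epsilon_A K_A}{s}\right)^{\frac{\alpha+1}{\alpha-1}} = \left(\frac{\epsilon_B K_B}{1-s}\right)^{\frac{\alpha+1}{\alpha-1}} \]
We can now solve for the optimal s value.
\[ s = \frac{\epsilon_A K_A}{\epsilon_A K_A + \epsilon_B K_B} \]
For an arbitrary number of applications, we can write the time slice for an
application i as:
\[ s_i = \frac{\epsilon_i K_i}{\sum_j \epsilon_j K_j} \tag{5.4} \]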
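Equation 5.4 can be checked numerically against a brute-force scan of the two-application energy expression. The sketch below is illustrative: the function names, the ε values and the work values are hypothetical, and α = 2 is assumed for simplicity.

```python
def optimal_slices(eps, work, alpha):
    """Equation 5.4. eps holds the constants of Equation 5.3; each is first
    converted to epsilon_i = eps_i ** ((alpha - 1) / (alpha + 1))."""
    root = (alpha - 1.0) / (alpha + 1.0)
    weights = [e ** root * k for e, k in zip(eps, work)]
    total = sum(weights)
    return [w / total for w in weights]


def total_energy(slices, eps, work, total_time, alpha):
    """Sum of Equation 5.3 over all applications, with T_i = T * s_i."""
    exp_work = (alpha + 1.0) / (alpha - 1.0)
    exp_time = -2.0 / (alpha - 1.0)
    return sum(e * (total_time * s) ** exp_time * k ** exp_work
               for s, e, k in zip(slices, eps, work))


# Hypothetical two-application workload.
EPS, WORK, ALPHA, T = [1.0, 8.0], [3.0, 1.0], 2.0, 1.0

s_formula = optimal_slices(EPS, WORK, ALPHA)[0]

# Brute-force scan of s over (0, 1) in steps of 0.001.
grid = [i / 1000 for i in range(1, 1000)]
s_brute = min(grid, key=lambda s: total_energy([s, 1 - s], EPS, WORK, T, ALPHA))
print(s_formula, s_brute)
```

For these values both approaches give application A a slice of about 0.6, so the closed form agrees with the exhaustive scan.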
In order to solve this equation, we must be able to find $K_i$ and $\epsilon_i$ for each
process. $K_i$ can be easily derived from Equation 5.2, because we can observe both
the execution time and throughput of each application. We can now solve for $\varepsilon_j$
(and therefore $\epsilon_j$) because in the following expression, $E_j$, $T_j$ and $K_j$ can all be
determined:
\[ E_j = \varepsilon_j T_j^{\frac{-2}{\alpha-1}} K_j^{\frac{\alpha+1}{\alpha-1}} \]
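Inverting this expression gives the per-process constants from observed quantities. The helper names below are hypothetical, and the sketch assumes that energy, time and work have already been measured for the process; how those measurements are obtained is outside its scope.

```python
def estimate_eps(energy, time, work, alpha):
    """Solve E = eps * T**(-2/(alpha-1)) * K**((alpha+1)/(alpha-1)) for eps."""
    return energy * time ** (2.0 / (alpha - 1.0)) / work ** ((alpha + 1.0) / (alpha - 1.0))


def scheduling_weight(eps, alpha):
    """Convert eps to the per-process term used in Equation 5.4."""
    return eps ** ((alpha - 1.0) / (alpha + 1.0))


# Hypothetical observation: E = 2, T = 2, K = 2 with alpha = 2.
eps_j = estimate_eps(energy=2.0, time=2.0, work=2.0, alpha=2.0)
weight_j = scheduling_weight(eps_j, alpha=2.0)
print(eps_j, weight_j)
```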
5.3.1 Intuitive Meaning of the Scheduling Algorithm
What is the intuitive meaning of our scheduling algorithm? We answer this
question by ﬁrst re-examining the energy consumption of each process.
The energy consumed by process i is optimal when si is equal to the value
determined by our algorithm. We can express the energy used by process i in this
case as follows:
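\[ E_i = \varepsilon_i F_i^{\frac{\alpha+1}{\alpha-1}} (T \cdot s_i) \]
And therefore the power consumption of process i as:
\[ P_i = \varepsilon_i F_i^{\frac{\alpha+1}{\alpha-1}} \]
Recall that we can express the throughput as:
\[ F_i = \frac{K_i}{T \cdot s_i} \]
Substituting for the optimal value of $s_i$, we re-write the throughput as:
\[ F_i = \frac{K_i \sum_j \epsilon_j K_j}{T K_i \epsilon_i} = \frac{\sum_j \epsilon_j K_j}{T \epsilon_i} \]
We can now express the power as:
\[ P_i = \varepsilon_i \left(\frac{\sum_j \epsilon_j K_j}{T \epsilon_i}\right)^{\frac{\alpha+1}{\alpha-1}} \]
Recall that $\varepsilon_i$ is defined as $\epsilon_i^{\frac{\alpha+1}{\alpha-1}}$.
\[ P_i = \epsilon_i^{\frac{\alpha+1}{\alpha-1}} \left(\frac{\sum_j \epsilon_j K_j}{T \epsilon_i}\right)^{\frac{\alpha+1}{\alpha-1}} \]
Which gives us the optimal power dissipation as:
\[ P_i = \left(\frac{\sum_j \epsilon_j K_j}{T}\right)^{\frac{\alpha+1}{\alpha-1}} \tag{5.5} \]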
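Equation 5.5 can be verified numerically: at the optimal slices, every process should dissipate the same power. The sketch below uses hypothetical ε and work values with α = 2, and the function name is an assumption made for illustration.

```python
def power_at_optimal_slices(eps, work, total_time, alpha):
    """Return (per-process powers at the Equation 5.4 slices,
    power predicted by Equation 5.5)."""
    p = (alpha + 1.0) / (alpha - 1.0)
    small = [e ** (1.0 / p) for e in eps]           # epsilon_i terms
    total = sum(se * k for se, k in zip(small, work))
    slices = [se * k / total for se, k in zip(small, work)]
    throughput = [k / (total_time * s) for k, s in zip(work, slices)]
    powers = [e * f ** p for e, f in zip(eps, throughput)]
    predicted = (total / total_time) ** p
    return powers, predicted


# Hypothetical two-application workload.
powers, predicted = power_at_optimal_slices([1.0, 8.0], [3.0, 1.0], 1.0, 2.0)
print(powers, predicted)
```

For this example both processes run at a power of about 125 (in the model's arbitrary units), matching the value predicted by Equation 5.5.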
Notice that the optimal power dissipation value for process i is completely in-
dependent of any parameters that are speciﬁc to i. Therefore, the optimal energy
consumption for all applications occurs when the power is equal across all applica-
tions. This ﬁts nicely with our intuition that operating at a constant power level
is desirable. In the next chapter we will see the eﬀectiveness of this scheduling
algorithm in action. From here on we will refer to this method of scheduling as
Power-Matching scheduling.
Chapter 6
Inter-Program Adaptivity
In this chapter we will present an adaptive operating system scheduler, based on
the Power-Matching algorithm presented in the previous chapter to make energy-
eﬃcient scheduling decisions. Because our scheduler assumes that applications will
be able to indicate their desired throughput, we also present a simple mechanism for
allowing a single application to provide throughput feedback. Using this feedback,
the operating system uses our modiﬁed scheduling algorithm to run all applications
at the optimal power level. This throughput also serves as the throughput target
used by our adaptive algorithm presented in Chapter 4.
The rest of this chapter is organized as follows. First we will describe our
system for allowing applications to give throughput feedback, which we will call the
rate-matching throughput adaptive system (RMTA). We will describe in detail this
system’s API and how it can be used in practice. Then we will discuss the imple-
mentation of our Power-Matching scheduling algorithm within the Linux operating
system. Finally we will present two sets of results gathered from experiments on
two diﬀerent simulation platforms. First, we will present results from the Bochs
full system simulator, which can run a full Linux operating system. Then we
will show results from the SESC cycle-accurate simulator that was introduced
in Chapter 4. In the results section we will discuss in detail the advantages and
disadvantages of each simulation platform.
6.1 Rate-Matching Throughput Control
There is a wealth of research on voltage scaling algorithms [8], [21], [22], [40],
[51], [62], [69]. This work has mostly focused on operating system (OS) techniques
for selecting a globally optimal voltage setting. The primary driving factor in this
class of selection algorithms has been the total system idle time. The operating
system typically scales the voltage (and frequency) down in response to idle periods
and increases it during bursts of activity to try and ﬁnd the lowest possible voltage
setting that eliminates idleness. Such schemes are compelling because they only
require minor changes to the operating system scheduler and no application-level
modiﬁcations. However, heuristic-based, operating system driven algorithms tend
not to exhibit stable behavior, nor do they robustly converge to a single optimal
operating point. This has led to a recent experimental study that thoroughly
evaluated many previous voltage scaling schemes to conclude that “No heuristic
policy that we examined achieved [the optimal voltage and frequency]” [22]. Part
of the reason why these heuristic approaches are limited is because they are driven
solely by system idle time and have no application-speciﬁc information.
Other work has examined how to select the optimal voltage given complete
information about application start times, deadlines, and computation needs [28],
[51]. Given complete application knowledge, these omniscient schemes can opti-
mally pick the operating voltage to minimize energy requirements while meeting
application deadlines. However, while such schemes can provide lower bounds on
energy requirements, they are hard to use in practice because they require com-
plete application information. Due to data-dependent execution and hardware
eﬀects such as cache misses, estimating future execution time for an application is
a daunting task.
We contend that the problem with these two extremes is the lack of application-
speciﬁc information. OS-directed schemes do not take any application-speciﬁc
deadline information into account, while omniscient schemes assume an impractical
level of application knowledge. The problem stems from the lack of an interface
by which application writers can inform the hardware of relevant information for
making energy-optimal decisions.
In this section, we propose a new interface through which applications can
independently express throughput needs. Any interface for this type of system
should exhibit the following properties:
• Simplicity. It should be practical and intuitive to use. In particular, its
use should not be predicated on detailed knowledge of future application
behavior.
• Eﬃciency. It should provide suﬃcient information for the hardware to make
optimal or near-optimal voltage scheduling decisions with minimal run-time
overhead.
• Protection. It should allow the operating system to make per-process voltage
scheduling decisions. Applications should not be able to override energy
limits imposed on them by the operating system.
• Flexibility. It should enable applications to implement any voltage selection
algorithm. Variations in application execution bursts necessitate diﬀering
voltage adaptation schemes.
• Compatibility. It should not preclude legacy applications from being executed
without modification.
We propose a system that achieves these goals via ﬁne-grained rate-matching.
Our approach relies on extracting explicit progress information from the applica-
tion. This progress is then compared with the desired rate of progress, allowing
the application to raise or lower its throughput target.
6.1.1 API Description
Our API achieves application-driven adaptivity by rate-matching. We call this
API RMTA, for rate-matching throughput adaptivity. We begin with the minimal
set of operations that provide a mechanism by which an application can inform the
system about its progress. The RMTA system can then pick a throughput level
that will meet the demands of the application.
The throughput control system is centered around a counter, TCount, which
captures the application’s progress. This counter is periodically incremented by
the application via the Progress operation, and decremented by the system at a
rate specified by the TRate field. Equilibrium is achieved when the increment
and decrement rates balance and keep the counter at a near-constant value.
If the program runs too slowly, then the application-controlled increments will
occur less often than the system-controlled decrements and the counter will even-
tually underﬂow. Likewise, if the application runs too quickly, then the counter
will eventually overﬂow. These conditions are signals to adjust the throughput up
or down respectively.
Such conditions are reﬂected to the application through exceptions. Through-
out this chapter, we will refer to these exceptions as counter exceptions. They
are handled by an exception handler that picks the new operating throughput
for the application. The exception handler does so by writing the application’s
Throughput value, which in turn updates the actual throughput goal used by the
adaptive algorithm.
The OS has the ability to control an application’s usage of resources by setting
an upper bound on its throughput via the TMax ﬁeld. This bound could be the
physical limits at which the processor can operate, or the OS may wish to restrict
the throughput level further. For example, to extend battery life when a laptop is
not plugged in to a wall socket, the OS could set the maximum throughput of all
applications to a level lower than it would if there were more power available.
If the counter hits zero while an application is running and Throughput is al-
ready equal to TMax, then the application cannot meet its current performance
goal. Depending on the nature of the application, it may then want to exit, con-
tinue running at the highest allowable throughput level, or modify its performance
requirement, possibly performing a quality-of-service adjustment. For example, a
scalable video-decoder that cannot meet its frame rate goal, even at TMax, may
choose to use fewer colors, lower resolution, or a slower frame rate.
The amount of hysteresis in the system can be controlled by limiting the range
of the Counter value. To accomplish this, we provide a new ﬁeld CMax that
bounds the maximum possible counter value. Note that a CMin value is unneces-
sary, as the amount of hysteresis only depends on the range of values the counter
can take, not the absolute value of the counter itself. With this modiﬁcation, ex-
ceptions are reported when the counter hits zero or CMax. To keep the hysteresis
symmetric, TCount is typically initialized to CMax/2.
To more quickly arrive at the equilibrium voltage level (i.e. the level at which
the rate of increments equals the rate of decrements), a counter exception handler
could make use of the number of instructions and the total number of counter
increments since the last exception. These values are stored in TInsts and TIncs,
respectively. After the exception handler has changed the voltage, it typically
resets the counter to CMax/2.
By default, an application begins with TRate set to zero. This has the eﬀect
of turning oﬀ the voltage control mechanism if the application does not contain
any Progress instructions. This allows legacy applications to execute with no
modiﬁcation.
Finally, there are also times when the processor is truly idle and just needs to
wait until it receives an interrupt (from a timer, for instance). In this case, the
HALT instruction causes the processor to wait until an external interrupt occurs.
The HALT instruction has already been adopted and implemented in many modern
ISAs, and is simply included here for completeness.
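The counter mechanism described in this section can be modeled in a few lines. This is a toy sketch and not the Linux implementation: the class and method names are hypothetical, counter exceptions are represented as returned strings rather than real exceptions, and each call to `timer_tick` stands for one decrement at the rate given by TRate.

```python
class RateMatchCounter:
    """Toy model of the RMTA progress counter (field names follow the text)."""

    def __init__(self, cmax, trate=1.0):
        self.cmax = cmax          # CMax: upper bound on the counter
        self.trate = trate        # TRate: a value of 0 disables the mechanism
        self.tincs = 0            # TIncs: increments since the last exception
        self.tcount = cmax // 2   # TCount starts mid-range for symmetric hysteresis

    def reset(self):
        """What a handler typically does after choosing a new throughput."""
        self.tcount = self.cmax // 2
        self.tincs = 0

    def progress(self):
        """Application-side increment; signals "overflow" when CMax is reached."""
        if self.trate == 0:
            return None
        self.tcount += 1
        self.tincs += 1
        return "overflow" if self.tcount >= self.cmax else None

    def timer_tick(self):
        """System-side decrement; signals "underflow" when zero is reached."""
        if self.trate == 0:
            return None
        self.tcount -= 1
        return "underflow" if self.tcount <= 0 else None


counter = RateMatchCounter(cmax=8)
# Application running too quickly: increments with no intervening decrements.
fast_events = [counter.progress() for _ in range(4)]
counter.reset()
# Application running too slowly: decrements with no intervening increments.
slow_events = [counter.timer_tick() for _ in range(4)]
print(fast_events, slow_events)
```

Starting from CMax/2, four unmatched increments reach CMax and four unmatched decrements reach zero, so a larger CMax means more imbalance is tolerated before the throughput is adjusted.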
6.1.2 Implementation Considerations
OS Extensions. The throughput matching system we have proposed in the pre-
vious section is implemented using extensions to the Linux operating system. The
only piece of hardware that is assumed is a processor which is capable of dynamic
voltage/frequency scaling. A summary of all the extensions added to the operating
system is listed below.
• Process State - The parameters used by the RMTA system are stored inside
the Linux process structures, as additional program state. A summary of
these values, as described in the API section, is provided in Table 6.1.
• OS Timer - The counter decrements are handled by an OS timer. When
the timer goes off, the TCount is decremented. Only the timer associated
with the running application is active.
• System Calls - The Progress command is implemented as a Linux system
call. There is an additional system call that allows the counter exception
handler to specify the new requested throughput. By implementing this as a
system call, instead of letting the application directly update its throughput
goal, the OS has the opportunity to override or modify the request. Another
system call lets the application specify the decrement rate, TRate (effectively
the desired rate of progress).
• Exception Handlers - When a counter exception occurs, the user-defined
handler is invoked. The counter exception handlers are segments of user-level
code included in the application executable. When an application wishes to
make use of the RMTA system, it registers the call address of the exception
handler with the operating system. Then, when an exception occurs, the OS
traps to the user-defined handler.
Table 6.1: New application state introduced by the throughput adaptive system.
Reg         Description
CMax        Maximum counter level
TIncs       Increments since last voltage exception
TInsts      Instructions since last voltage exception
Throughput  Current throughput goal
TMax        Maximum allowable throughput
TRate       Counter decrement rate (Hz)
TCount      Counter
Table 6.2: New commands utilized by the throughput-adaptive system.
Instruction  Description
PROGRESS     Increments counter by one
HALT         Stops processor until interrupt
Throughput Calculation. As stated above, counter exception handlers are user-defined,
allowing the program authors to write an algorithm that adjusts the
throughput in an application-specific way. Although it is up to the application
author to determine the best algorithm, any such algorithm will probably have the
following characteristics:
• Increase the throughput if the counter has underﬂowed.
• Decrease the throughput if the counter has overﬂowed.
• Reset all counters.
The simplest such algorithm, which also proves to be highly effective, is an
“Increment/Decrement” algorithm. This simply raises the throughput by a fixed
amount if the counter drops to 0 and lowers it by a fixed amount if the counter
reaches CMax. More sophisticated algorithms could make use of the number of
instructions (TInsts) and number of increments (TIncs) since the last exception.
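An Increment/Decrement handler along these lines might look as follows. This is a sketch under stated assumptions: the function name, the `step` parameter, the lower bound `tmin`, and the string encoding of the two counter exceptions are all hypothetical, and the clamp to TMax reflects the upper bound imposed by the operating system.

```python
def increment_decrement(event, throughput, step, tmax, tmin=0.0):
    """Return the new throughput goal after a counter exception.

    event: "underflow" (application too slow) or "overflow" (too fast).
    """
    if event == "underflow":
        # Counter hit zero: progress is lagging, so raise the goal up to TMax.
        return min(throughput + step, tmax)
    if event == "overflow":
        # Counter hit CMax: progress is ahead of schedule, so lower the goal.
        return max(throughput - step, tmin)
    return throughput


# Hypothetical throughput values, expressed as a fraction of the maximum.
raised = increment_decrement("underflow", throughput=0.5, step=0.1, tmax=0.55)
lowered = increment_decrement("overflow", throughput=0.5, step=0.1, tmax=0.55)
print(raised, lowered)
```

After computing the new goal, a real handler would also reset the counters, as listed in the characteristics above.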
6.2 Operating System Scheduler
In the previous chapter we have proposed a Power-Matching scheduling algo-
rithm for adapting to a multiprogrammed workload. This algorithm takes into ac-
count the independent throughput demands of all programs running on the system
and then schedules them in such a way as to achieve maximum energy efficiency.
We implement our system by modifying a standard Linux distribution based
on a 2.4 kernel. We modified the scheduler according to Equation 5.4 to use
per-process throughput information to adjust scheduling priorities of the
processes in the system. Once the scheduler priorities have been correctly
adjusted, each application will automatically adjust its throughput based on
the amount of processor time available to it.
Note that as opposed to previous work on idle time minimization, the through-
put level in our system is determined by a combination of application information
and operating system scheduling. We allow each application to select its own
throughput, and use the information at the OS level to schedule applications in
order to minimize energy consumption.
The scheduling that results is quite diﬀerent from what one would observe with
a Weiser-style idle time scheduler. For instance, consider a single, CPU-bound
application. An idle time scheduler would always run this application at a high
voltage, because each scheduling interval has no idle time. The insertion of PROGRESS
instructions in the application gives the hardware additional information about
the actual needs of the application which may not always be reﬂected in the idle
time. This enables us to save energy without loss in application performance.
6.3 Bochs Results
We wanted to show the eﬀectiveness and feasibility of our RMTA system run-
ning on a real operating system running a realistic application workload. In order
to do this we used Bochs, an open-source, x86 simulator that includes models
for the network, disk, and other devices and can boot the Linux operating sys-
tem. By using Bochs, we were able to implement the entire RMTA system and
Power-Matching scheduler described in the previous section within an otherwise
unmodified Linux operating system. We also modified the simulator to model
dynamic voltage/frequency scaling. Voltage scaling affects only the energy and
performance of the processor, and Bochs was modified to correctly account for a
selective slowdown/speedup of the processor. We also augmented Bochs to record
the energy consumption of the processor, including a constant level of static
power dissipation. The simulator supports voltage levels ranging from 1.5V
down to 0.3V in 0.1V steps. Our simulator also takes the non-linear dependence
between voltage and throughput into account, and accounts for those intervals
that do not scale with voltage (cache misses, disk access, etc.). Delay is
modeled according to Equation 5.1, with α = 2. The simulator also calculates
the optimal energy that an application could operate at if it had perfect
knowledge of future behavior based on the arrival times of tasks. In other
words, optimal energy corresponds to the energy that would be consumed if the
application were able to run at the minimum constant voltage level for the
entire duration of its execution.
Table 6.3: List of Benchmarks
Benchmark   Description
toast       GSM encoder
untoast     GSM decoder
mpg123      mp3 player
go          simulation of the game of go
ehgml       ray tracer
abyss       web server
Benchmarks. We used a set of six benchmark programs (shown in Table 6.3),
attempting to use a range of different application types to illustrate the
applicability of the proposed API. Toast and untoast are audio codecs, and
mpg123 is an MP3 player. Go is a game-tree search that we have modified to
generate moves at a fixed rate. Ehgml is a ray tracer that has been modified so
that it renders scenes at a fixed frame rate. Finally, abyss is a web server
that was modified to respond to web traffic at a fixed rate. The multimedia
benchmarks used a variety of input data sets that are described in Table 6.4.
Modifying each benchmark was a simple task, and, with the exception of abyss,
it took us less than an hour per benchmark to perform the necessary
modifications. Unlike other related work where the quality of service was
varied to meet performance requirements [51], our benchmarks provide a fixed
quality of service; i.e., each run of a benchmark corresponds to the same
amount of work.
Table 6.4: Audio files used as input for benchmarks.
Name         Description
austin       clip from Austin Powers
bach         clip from Bach
godfather    clip from the Godfather
hal2001      clip from 2001
jesse        Jesse by Joshua Kadison
kennedy      clip of John F. Kennedy
lastresort   clip of Last Resort
mozart       Eine kleine Nachtmusik by Mozart
pachebel     Canon in D by Pachelbel
rebecca      Rebecca by Pat McGee
fear         clip of Roosevelt’s “nothing to fear” speech
infamy       clip of Roosevelt’s “Pearl Harbor” speech
6.3.1 Calibration
We calibrated the time and energy reported by Bochs against the time and energy
we measured from a 400 MHz Pentium II-based system with 128MB memory, 4GB
disk, and the Intel 440BX chipset. The simulator parameters were tuned to match
the real system. The time taken by each application was measured using the
Unix time command on the real machine. On Bochs, the time was measured
using the simulator’s internal timer. The real-world runtimes of the applications
ranged from 4.43 seconds to 24 seconds. For calibration purposes, we used
four of the longer audio data sets (jesse, mozart, pachebel, rebecca). For energy
calibration, we charged a diﬀerent energy cost for each instruction type (integer,
ﬂoating-point, memory). The results of the energy reported by the simulator were
compared against measurements from the Pentium II system. We measured the
current being drawn by the processor by attaching a probe to the voltage regulator
on the motherboard.
Figure 6.1 shows the results of our calibration runs. The y-axis shows the
ratio of the metric reported by the simulator to the measured metric. Both
energy and time calibration results are reported per benchmark. The largest
error we observed in energy measurements was an underestimate of 13.9%, and the
largest error we observed in timing measurements was an overestimate of 4.2%.
The average of the absolute values of the error percentages was 6.1% for the
energy and 1.9% for the time.
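The two error statistics quoted above (largest signed error and average absolute error) can be computed as in the following sketch. The function names and the sample values are hypothetical and are not the thesis measurements; they only show how a simulated/measured ratio turns into an error percentage.

```python
def signed_error_pct(simulated: float, measured: float) -> float:
    """Signed simulator error in percent; negative means an underestimate."""
    return (simulated / measured - 1.0) * 100.0


def mean_abs_error_pct(pairs) -> float:
    """Average of the absolute error percentages over (simulated, measured) pairs."""
    errors = [abs(signed_error_pct(sim, meas)) for sim, meas in pairs]
    return sum(errors) / len(errors)


# Hypothetical (simulated, measured) energy values, in arbitrary units.
sample_pairs = [(8.61, 10.0), (10.3, 10.0), (9.8, 10.0)]
worst = min(signed_error_pct(sim, meas) for sim, meas in sample_pairs)
average = mean_abs_error_pct(sample_pairs)
```

Averaging absolute values, as the text does, prevents overestimates and underestimates from cancelling each other out.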
6.3.2 Single Application Performance
Figure 6.2 shows the results of RMTA and two other DVS schemes on the six
benchmarks from Table 6.3. For benchmarks with multiple input sets, we took the
sum of the energy per input set which corresponds to a workload that executes
each clip from Table 6.4 once. For each benchmark, we normalize the reported
energy against the energy required by the application when no voltage scaling is
performed (i.e., the normal energy requirement for each application would be 1.0
in Figure 6.2). Each application has four bars: one corresponding to using RMTA,
one corresponding to each of our comparison algorithms, and the last corresponding
to optimal voltage scheduling according to Section 4.
Figure 6.1: Results of calibration against measured data.
We compare RMTA versus a version of the algorithm presented by Weiser
et al., which estimates workload based on idle time [69]. We also enhance the
Weiser algorithm using the PACE methodology [37] (for PACE we used the aged-
k/Gamma version of the algorithm). We expect that RMTA will have an advantage
since it leverages continuous throughput feedback from the application.
For all our applications we compare against optimal energy. We believe the
comparison against optimal to be more meaningful since the total energy reduc-
tion could be improved simply by running the application on a faster (simulated)
machine. RMTA uses 10% more energy than the optimal voltage scaling strategy
in the worst case (abyss), and 5.3% more energy on average. Compared to not
applying any form of voltage scaling, RMTA saves 43% of the total energy re-
quired on average. The Weiser algorithm, which lacks the ﬁne-grained throughput
feedback of RMTA, uses 34.4% more energy than optimal on average. It can be
improved considerably using PACE; Weiser+PACE uses 18.4% more than optimal
on average. In overall energy consumption, RMTA uses 21.3% less than Weiser on
average and 9.8% less than Weiser+PACE.
Figure 6.2: Performance of DVS Algorithms on a Single Application
6.3.3 Multiprogrammed Workloads
In the next section we will present a much more detailed analysis of
multiprogrammed workloads. Here we demonstrate that our RMTA system functions
as expected in a real operating system environment when multiple RMTA
applications are run simultaneously. For each multiprogrammed workload, we keep
track of the per-application energy as well as the optimal per-application
energy. We report results from RMTA runs for three different multiprogrammed
workloads. Workload go+mp3 corresponds to running the go benchmark and the
mpg123 benchmark simultaneously. Workload gsm-multi corresponds to two runs of
untoast, the GSM decoder. Finally, 3app corresponds to go, mpg123, and untoast
running simultaneously.
A summary of the results for the two-application workloads is provided in
Figure 6.3. Results from the three-application workload are shown in Figure 6.4.
For each workload, the per-application energy consumption using RMTA is compared
against the energy used by the application under the optimal throughput level.
Our implementation of RMTA performs to within 3.4% of the optimal on these
three workloads, and reduces the energy requirements of the workload by 42% on
average, versus the maximum throughput case.
Figure 6.3: RMTA with two applications running simultaneously.
Figure 6.4: RMTA with three applications running simultaneously.
Next we will present a series of voltage-level plots, generated by our simulator.
These plots give some insight into the behavior of the RMTA system on several
single application and multiprogrammed workloads.
The first plot in Figure 6.5 shows the voltage as a function of time for mpg123
with the Bach dataset. The voltage curve shows the effect of using an incremental
adjustment in the voltage. For this particular run, the optimal voltage lies between
1.0V and 1.1V. The discrete nature of the voltage adaptation causes the voltage
to periodically increase by 0.1V before stabilizing at 1.0V for a further interval.
Figure 6.5: mpg123 (left) and go (right)
The second plot in Figure 6.5 shows the voltage as a function of time for the
benchmark go. The application workload passes through four distinct phases where
the workload keeps decreasing as time progresses. This is consistent with the
structure of the application where a more open-ended initial board setup requires
more work to determine the best next move, while this task becomes simpler as
the game progresses.
The ﬁrst plot in Figure 6.6 shows the voltage as a function of time when both go
and mpg123 are executing. The mp3 player is playing a clip that ends at 12 seconds.
The combination of the two applications causes the processor to operate at 1.5
volts until the mp3 player completes. Notice how both applications independently
chose the same voltage to operate at due to the modified scheduler. Once the mp3
player completes, go is allowed to use a larger fraction of the processor,
immediately lowering its operating voltage and proceeding along the same phases
as before.
Figure 6.6: Go/mpg123 (left) and 2 GSM encoders (right)
The second plot in Figure 6.6 shows two GSM encoders running at the same
time on different audio streams. Notice that the two applications independently
converge to the same operating point. Even though the applications are identical,
they converge to their operating point at different rates, due to the differing
workloads. Once they reach a stable voltage, the two applications fluctuate
between 0.9 and 1.0 volts, because only a discrete number of voltage levels are
available. The stable operating voltage for this workload lies somewhere between
0.9 and 1.0 volts.
Figure 6.7 shows a more complicated mix of three applications; two that start
at approximately the same time and a third application that begins later. We
see the GSM encoder rapidly converge to its stable operating point, while the Go
simulation converges more slowly. The most interesting behavior occurs after the
MP3 player starts running. We see the voltage of the GSM encoder rise in response
to the heavier load on the system. The encoder and MP3 player quickly converge to
the same operating voltage. After the MP3 player exits, the GSM encoder returns
to its original voltage level. In the final phase of the run, the two applications still
running both operate at the same voltage. Once again we see the oscillation that
is caused by voltage discretization.
Figure 6.7: toast, mpg123 and Go
6.4 SESC Results
In this section we will present a much more detailed study of multiprogrammed
workloads using results from simulations run on our modiﬁed SESC simulator.
Although it is not possible to run an actual operating system on SESC, it does
have two significant advantages over Bochs: (1) Accuracy: as a cycle-accurate
simulator, SESC provides far more accurate estimates of energy consumption and
timing; (2) MCD simulation: because we have implemented support for modeling
MCD processors within SESC, we can observe the adaptive system presented in
Chapter 4 working in concert with our Power-Matching scheduler. In lieu of a
full operating system scheduler, we have implemented a scheduler within SESC
to give running processes a fraction of the total processor time in accordance with
our Power-Matching scheduling algorithm. Our scheduler uses 50ms epochs; each
50ms interval is divided amongst the running processes according to the time slice
values determined by the scheduling algorithm. To ensure that the performance
of our algorithm is not heavily dependent on the epoch length, we also evaluated
epochs of 10ms and 100ms. The performance varied less than 2% from the baseline
interval of 50ms. This is due to the fact that the  values remain quite constant
at this granularity, and therefore the scheduling priorities change very little when
the epoch length is changed.
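The epoch mechanism described above amounts to splitting a fixed interval in proportion to the slice values. A minimal sketch follows, assuming the slice values arrive as a name-to-fraction mapping; the function name and the normalization step are illustrative choices, not details of the SESC scheduler.

```python
EPOCH_MS = 50.0  # baseline epoch length from the text


def divide_epoch(slices, epoch_ms=EPOCH_MS):
    """Split one epoch among processes in proportion to their slice values.

    `slices` maps a process name to its slice value. The values are
    normalized here so that they need not sum exactly to 1 (an assumption).
    """
    total = sum(slices.values())
    if total <= 0:
        raise ValueError("slice values must sum to a positive number")
    return {name: epoch_ms * s / total for name, s in slices.items()}
```

For example, `divide_epoch({"a": 0.7, "b": 0.3})` gives process a roughly 35 ms and process b roughly 15 ms of each 50 ms epoch; changing `epoch_ms` to 10 or 100 changes only the granularity, not the proportions.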
A typical operating system scheduler, like the original Linux 2.4 scheduler used
on Bochs in the previous section, will give equal CPU time to two processor-bound
jobs running simultaneously with equal priority in the system. Thus, in the
terminology of the previous chapter, by default, for two jobs running at the same
time, s1 = 0.50 and s2 = 0.50. However, as we demonstrated in the previous
chapter, this schedule is only optimal if the two applications are already
dissipating the same power. Put in different terms, this schedule is optimal if
the “weighted work” (K) that each task is trying to complete is equal.
Intuitively, we can say that the greater the disparity between the work that
each process is trying to accomplish, the greater the opportunity for our
scheduler to save energy by altering the slice values. For this reason, we
examine the performance of our scheduler based on how great the disparity is
between the work demands of each process. Because our adaptive system presented
in Chapter 4 allows us to explicitly set the throughput that each application
runs at, we can also explicitly set the work that each process is trying to
complete. Therefore, if we have two processes trying to complete K1 and K2
amounts of work respectively, we can explicitly specify the ratio between the
two workloads (K1/K2). If we consider K1 to be the greater of the two workloads,
then we can say that the greater the ratio between the two workloads, the more
energy our scheduler should be able to save.
We have evaluated the performance of our scheduler running a set of ten
two-application workloads with workload ratios of K1/K2 = 1, K1/K2 = 2, and
K1/K2 = 3. For the two applications we will define s1 = s and s2 = (1−s). In
order to determine the effectiveness of our scheduler, we determined the optimal
energy consumption by exhaustively searching the range of possible values for s
from s = 0.10 to s = 0.90 in steps of 0.05. In addition, we also measure the
energy consumed when s is set to the value predicted by our algorithm. We
compare the energy consumed using our scheduler and the optimal energy against
the baseline, which is considered to be the energy consumed when s = 0.50 (note
that this is different from our Bochs results, where energy savings were
measured against the full-throughput energy values). In Figures 6.8, 6.9, and
6.10 we have shown example plots of total energy consumed versus s value for
three different two-application workloads. These plots show workloads with
K1/K2 equal to 1, 2, and 3, respectively. In each plot, there is an additional
data point at the s value predicted by our scheduler.
Figure 6.8: Two applications running with K1/K2 = 1 (predicted s = 0.53, optimal s = 0.55).
Figure 6.9: Two applications running with K1/K2 = 2 (predicted s = 0.68, optimal s = 0.70).
Figure 6.10: Two applications running with K1/K2 = 3 (predicted s = 0.72, optimal s = 0.65).
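The exhaustive search over s described above can be sketched as follows. The energy function is a toy stand-in and not the model from the previous chapter: it assumes process i must run at speed K_i / s_i and that power grows with the cube of speed. Only the search range, the step size, and the s = 0.50 baseline come from the text; all function names are illustrative.

```python
def toy_energy(s, k1, k2):
    """Hypothetical energy model (an assumption, not the thesis model).

    Process 1 gets share s and process 2 gets share (1 - s). Each must run
    at speed K_i / s_i, power is assumed cubic in speed, and energy is power
    times the share of time used, giving K_i**3 / s_i**2 per process.
    """
    return k1 ** 3 / s ** 2 + k2 ** 3 / (1.0 - s) ** 2


def search_grid(energy, start=0.10, stop=0.90, step=0.05):
    """Return the s on the grid start, start+step, ..., stop with least energy."""
    n = int(round((stop - start) / step))
    grid = [round(start + i * step, 2) for i in range(n + 1)]
    return min(grid, key=energy)


def savings_vs_baseline(energy, s, baseline_s=0.50):
    """Fractional energy saving of slice value s relative to the equal split."""
    base = energy(baseline_s)
    return (base - energy(s)) / base


best = search_grid(lambda s: toy_energy(s, 2.0, 1.0))
saving = savings_vs_baseline(lambda s: toy_energy(s, 2.0, 1.0), best)
```

With equal work the toy model places the minimum at the equal split, so the baseline cannot be improved; with unequal work the minimum moves toward the process with more work, which is the qualitative behavior the experiments below examine.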
In the first set of results, we run our set of ten workloads with K1/K2 = 1. With
equal amounts of work for each application, the default scheduler should perform
well, as the slice values should be fairly close to 0.50. The only opportunity for
energy improvement comes in the disparity between the  of the two applications.
Figure 6.11 shows the results for this experiment. As we predicted, there is very
little room for improvement over the baseline scheduler. In a few cases, the
Power-Matching scheduler actually performs slightly worse than the default
scheduler. This happens when the Power-Matching scheduler gives too much
processor time to the more power-hungry process and “overshoots” the optimal
operating point. If we move too far past the optimal operating point we can
reach an operating point with higher energy consumption than would be achieved
using the default scheduler. This is also more likely to happen when the  values
of the two applications are very close. The optimal energy savings in this case
is only 2.8%, with our scheduler yielding only a 1.2% improvement.
Figure 6.11: Results for K1/K2 = 1
[Bar chart. Y-axis: Throughput Improvement, -5% to 10%. Series: Power-Matching,
Optimal. Workloads: mesa/applu, swim/mpeg, mesa/mgrid, wupwise/vortex,
bzip/twolf, vpr/vortex, swim/gcc, crafty/equake, mesa/wupwise, applu/gzip,
average.]
Table 6.5: Energy Savings versus Baseline Scheduler
                           K1/K2 = 1   K1/K2 = 2   K1/K2 = 3   Average
Power-Matching Scheduler   1.2%        7.0%        14.2%       7.5%
Optimal                    2.8%        9.2%        17.1%       9.7%
Next, in Figure 6.12, we see results for the case where the first application in
the pair has a workload twice as high as the second (K1/K2 = 2). In this case,
we begin to see a significant opportunity for energy savings. The optimal energy
savings across these workloads is 9.2%, with the Power-Matching scheduler saving
7.0% over the baseline.
Figure 6.12: Results for K1/K2 = 2
[Bar chart. Y-axis: Throughput Improvement, 0% to 25%. Series: Power-Matching,
Optimal. Workloads: mesa/applu, swim/mpeg, mesa/mgrid, wupwise/vortex,
bzip/twolf, vpr/vortex, swim/gcc, crafty/equake, mesa/wupwise, applu/gzip,
average.]
Finally, we examine the case where K1/K2 = 3, with results shown in
Figure 6.13. Here we see opportunity for ample energy savings. The optimal energy
saving over the baseline is 17.1%, with our scheduler yielding a 14.2% improvement.
The results from these experiments are summarized in Table 6.5. On average
across all three experiments, our scheduler saved 7.5% energy versus the default
scheduler. As the optimal energy was 9.7% lower than the default scheduler, our
algorithm achieved 77.3% of the energy savings available.
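The averages in Table 6.5 and the fraction of available savings can be re-derived from the three per-ratio figures. The short check below uses only the values in the table; the variable names are illustrative, and the rounding to one decimal place is an assumption about how the reported figures were obtained.

```python
# Energy savings in percent for K1/K2 = 1, 2, 3 (values from Table 6.5).
power_matching = [1.2, 7.0, 14.2]
optimal = [2.8, 9.2, 17.1]


def mean(values):
    return sum(values) / len(values)


pm_avg = round(mean(power_matching), 1)        # 7.5
opt_avg = round(mean(optimal), 1)              # 9.7
captured = round(100.0 * pm_avg / opt_avg, 1)  # 77.3
```

The 77.3% figure is reproduced when the ratio is taken between the rounded averages, 7.5 / 9.7.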
Figure 6.13: Results for K1/K2 = 3
[Bar chart. Y-axis: Throughput Improvement, 0% to 25%. Series: Power-Matching,
Optimal. Workloads: mesa/applu, swim/mpeg, mesa/mgrid, wupwise/vortex,
bzip/twolf, vpr/vortex, swim/gcc, crafty/equake, mesa/wupwise, applu/gzip,
average.]
We now run an additional set of experiments with a set of three-application
workloads. As we did not run an exhaustive search of all possible schedules for
these more complicated workloads, we only show the energy savings versus the
default priority scheduler. Table 6.6 shows a summary of the selected workloads
and the division of work between the applications. The energy savings of the
Power-Matching scheduler are shown in Figure 6.14. We can see a clear trend: the
workloads with more disparate workload division tend to have higher energy
savings. In the case of crafty/gcc/equake, where each application has the same
amount of work to perform, there is very little opportunity to save energy. We also
see that for applu/bzip2/twolf the energy savings are particularly high. This is
due not only to the variations in work between the applications, but also to the
widely varying  values between these applications. The average savings across all
of these workloads is 11.8%.
Figure 6.14: Energy Savings for 3 Application Workloads
[Bar chart. Y-axis: Throughput Improvement, 0% to 30%. Workloads:
mesa/mgrid/wupwise, applu/bzip2/twolf, vortex/vpr/gzip, crafty/gcc/equake,
MPEG/bzip2/gzip, applu/vpr/equake, vortex/gcc/swim, crafty/mgrid/twolf,
Average.]
6.5 Summary
In this chapter, we have presented a scheduler that considers the throughput
and energy demands of the running applications and schedules them for maximum
energy efficiency, according to the theoretical models presented in the previous
chapter. We have shown that this scheduling algorithm, when compared to a
traditional scheduling algorithm, can reduce energy consumption by 11.8% on a
selection of 3-application workloads, with no reduction in system throughput.
As our scheduler requires feedback on the desired throughput of applications,
we have also introduced a Rate-Matching Throughput Adaptive system (RMTA) that
allows individual applications to indicate their desired performance level. We
demonstrated that this RMTA system can achieve energy consumption within 5.3%
of optimal on average across a diverse selection of benchmarks.
Table 6.6: 3 Application Workloads
Workload             Work division (%)
mesa/mgrid/wupwise   60/20/20
applu/bzip2/twolf    50/30/20
vortex/vpr/gzip      40/40/20
crafty/gcc/equake    33/33/33
MPEG/bzip2/gzip      70/15/15
applu/vpr/equake     60/25/15
vortex/gcc/swim      50/25/25
crafty/mgrid/twolf   40/30/30
Chapter 7
Conclusion
In this thesis we have presented an MCD architecture that is able to eﬃciently
adapt to its workload. This adaptivity occurs at two levels. At the architectural
level, our MCD processor uses the BestStep algorithm to meet either a throughput
or energy goal by adjusting the voltage/frequency configuration of the processor.
Then, considering multiprogrammed workloads, we have demonstrated a Power-
Matching scheduler that uses throughput feedback to schedule applications for
reduced energy consumption. In each case, we ﬁrst presented a theoretical model,
then demonstrated the feasibility of the adaptive system built upon it. We have
also presented a rate matching throughput adaptive (RMTA) system, allowing
individual applications to indicate their desired throughput. This system serves
as a link between the architectural and multiprogrammed adaptive systems. The
throughputs speciﬁed by the RMTA system serve as both the target throughputs
for the BestStep and as the feedback for the Power-Matching scheduler.
Our model of MCD architecture allows predicting the change in energy
consumption and overall throughput, given small changes in the frequencies of
individual domains. We have shown this model to be highly accurate within the
given constraints. Utilizing this model, we have proposed an algorithm that
allows the processor to meet either a specific throughput or energy goal in an
efficient manner. Our algorithm is unique in that it uses global feedback, in
the form of critical path information, to make online DVFS decisions. We have
shown that our algorithm is able to perform 7.21% better in terms of Et^2 when
compared to Monolithic (a synchronous DVFS algorithm) when targeted at a high
throughput goal. This performance is similar to that of one of the best
performing MCD DVFS systems in the literature [70]. Our system performs even
better at lower target throughputs, where the system has more flexibility to
adapt the voltage of the individual domains. At a lower throughput goal, our
system achieves a 15.1% improvement in Et^2 over Monolithic. We also show that
our system can improve Et^2 when it is set to keep the overall system energy
consumption from exceeding a maximum value. In this case we show a 17.2%
improvement in Et^2 versus Monolithic. We also show that our system has an
increased tolerance to variations, as compared to other DVFS algorithms. In the
case where one domain has higher leakage, our improvement over Monolithic
increases from 15.1% to between 15.3% and 19.6% depending on the degree of
leakage assumed.
Our multiprogrammed adaptive system expands our modeling to consider mul-
tiprogrammed workloads. Using our models of throughput and energy, we deter-
mined how a scheduler could apportion processor time between applications to
achieve optimal energy savings, without aﬀecting throughput. We showed that
the optimal schedule also equalizes power dissipation between running processes.
We demonstrated that this scheduling algorithm can reduce total energy consump-
tion by 11.8% as compared to an ordinary priority scheduler across a selection of
3-application multiprogrammed workloads, without impacting the throughput of
the individual applications.
Our RMTA system provides both the throughput targets for the BestStep
algorithm and the throughput feedback used by the Power-Matching scheduler.
On a full-system simulator, running a customized Linux kernel and realistic
workload, we demonstrated that this system can achieve within 5.3% of optimal
energy consumption across a diverse array of benchmarks.
When used together, our architectural and multiprogrammed adaptive systems
provide a comprehensive, full-system approach for allowing an MCD architecture
to meet its workload in an efficient manner.
Appendix A
SESC Simulator
This appendix addresses the changes made to the SESC simulator [56]. Most of
the simulation results presented in this thesis (with the exception of the non-MCD
multiprogram results) were generated using a heavily modiﬁed version of the SESC
Simulator. SESC is a fast and highly ﬂexible cycle-accurate simulator from the
University of Illinois, primarily authored by Jose Renau and Luis Ceze. For the
research in this thesis, we required a great deal of added functionality in the basic
SESC implementation. A summary of these extensions is presented below.
A.1 Multiprogramming
In order to study multiprogrammed workloads, we implemented a multiprogrammed
version of SESC. By default SESC does have the ability to spawn multiple threads,
but can only load a single executable into memory. We modiﬁed the memory man-
agement system and thread classes to use multiple address spaces. We also re-wrote
portions of the executable loader to load more than one user-speciﬁed executable.
We then added a mini-scheduler to schedule threads on the available processors.
SESC already provides functions to swap processes on and oﬀ of processors, so our
scheduler simply makes use of this existing capability.
A.2 GALS Processor Support
We modiﬁed SESC to support multiple clock and voltage domains. In order to do
this, we implemented a ClockDomain class, which contains domain-specific
information, such as a domain’s operating voltage and frequency. Each major
component of the processor contains a pointer to the ClockDomain to which it belongs.
The user can deﬁne the clock domains in the SESC conﬁguration ﬁle and specify
which processor structures will belong to each domain. Subordinate structures are
automatically placed in the same domain as their parent, unless a separate domain
is speciﬁed. For example, if the Integer ALU is placed in a clock domain, all of the
individual functional units (adders, multipliers, etc.) as well as the integer issue
queue will by default be placed in the same domain. Below is a list of structures
within SESC that can be placed in a unique domain:
• Instruction Fetch
• Rename/Dispatch
• Reorder Buﬀer/Retirement Logic
• ALU Instruction Window
• Individual Functional Units
• L1 Instruction/Data Caches
• L2 Cache
• Memory System
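The domain-inheritance rule described above (a subordinate structure uses its parent's domain unless one is specified for it) can be illustrated with a small sketch. SESC itself is written in C++; the Python classes below, including the Component type and the effective_domain method, are hypothetical and do not reproduce the actual SESC data structures or configuration syntax.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ClockDomain:
    """Domain-specific information, as described in the text."""
    name: str
    voltage: float         # operating voltage in volts
    frequency_mhz: float   # operating frequency in MHz


@dataclass
class Component:
    """A processor structure; a domain of None means "inherit from parent"."""
    name: str
    domain: Optional[ClockDomain] = None
    parent: Optional["Component"] = None
    children: List["Component"] = field(default_factory=list)

    def add(self, child: "Component") -> "Component":
        child.parent = self
        self.children.append(child)
        return child

    def effective_domain(self) -> ClockDomain:
        # Walk up the hierarchy until an explicitly assigned domain is found.
        node: Optional[Component] = self
        while node is not None:
            if node.domain is not None:
                return node.domain
            node = node.parent
        raise ValueError(f"{self.name} has no clock domain assigned")
```

For instance, an adder added under an integer ALU with no domain of its own resolves to the ALU's domain, while a structure given an explicit domain keeps it.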
A.3 Critical Path Modeling
We also added a critical path predictor to SESC, to gather the critical path infor-
mation used by our adaptive MCD system. For each instruction, the “last-arriving
edge” information described in our critical path model is tracked by the simulator
and stored in the data structures SESC uses to track individual instructions. As
each instruction retires, this information is passed to the critical path model, which
builds a graph of the dynamic critical path dependences. This model periodically
performs the “trace backs” and records the total critical path counts.
A.4 Dynamic Voltage Scaling
We added a dynamic voltage scaling system to SESC that can perform an arbitrary
DVS algorithm and then set each clock domain to the appropriate voltage. We
then expanded the energy tracking functionality of SESC to account for varying
voltages. SESC tracks dynamic power using a set of energy counters based
on Wattch [7]. At startup, each energy counter is added to the ClockDomain
to which it belongs. Each ClockDomain also keeps a running total of its
dynamic power consumption. Each time the voltage of a domain changes, the
dynamic power is updated based on the value of each counter in that domain and
its voltage.
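One way to picture this bookkeeping is to accumulate counter values at a nominal voltage and fold them into the running total whenever the voltage changes. The sketch below assumes that dynamic energy scales with the square of the supply voltage, which is the standard CMOS approximation and is not quoted from the thesis; the DomainEnergy class and its methods are hypothetical.

```python
class DomainEnergy:
    """Per-domain dynamic energy accounting under voltage changes (a sketch)."""

    def __init__(self, nominal_v: float, voltage: float):
        self.nominal_v = nominal_v   # voltage at which counters are calibrated
        self.voltage = voltage       # current operating voltage
        self.pending = 0.0           # nominal-voltage energy since last change
        self.total = 0.0             # running total of scaled dynamic energy

    def record(self, nominal_energy: float) -> None:
        # Counters accumulate energy as if running at the nominal voltage.
        self.pending += nominal_energy

    def _flush(self) -> None:
        # Assumed scaling: dynamic energy is proportional to V squared.
        scale = (self.voltage / self.nominal_v) ** 2
        self.total += self.pending * scale
        self.pending = 0.0

    def set_voltage(self, new_voltage: float) -> None:
        # Fold in the energy spent at the old voltage before switching.
        self._flush()
        self.voltage = new_voltage

    def total_energy(self) -> float:
        self._flush()
        return self.total
```

Flushing at each voltage change means each block of activity is charged at the voltage that was in effect when it occurred.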
A.5 Leakage Modeling
We have incorporated a leakage model into SESC based on the model presented
by Tsai et al. [67]. This model estimates leakage power by first considering the
leakage equations for a single transistor. Using these equations, the authors
calculate the estimated leakage current for a number of common circuit families
found in modern microprocessors. They extend this to the architectural level by
presenting equations for predicting how many of each type of circuit are likely
to appear in the major architectural components of a microprocessor.
BIBLIOGRAPHY
[1] AMD, Mobile AMD Athlon 4 Processor Model 6 CPGA DataSheet, Sep 2001.
[2] A. C. Bavier, A. B. Montz, and L. L. Peterson. Predicting MPEG Execution
Times. Proceeding of the ACM SIGMETRICS 98, pp. 131-140, June 1998.
[3] D. Biermann, E. Gun Sirer, and R. Manohar. A Rate Matching-based Ap-
proach to Dynamic Voltage Scaling. In Proceedings of the First Watson
Conference on the Interaction between Architecture, Circuits, and Compilers
(PAC2), October 2004.
[4] S. Borkar, T. Karnik, S. Narendra, J. Tschanz, A. Keshavarzi, and V. De.
Parameter Variations and Impact on Circuits and Microarchitecture. In Pro-
ceedings of 36th Design Automation Conference (DAC), June 2003.
[5] S. Borkar, T. Karnik, S. Narendra, J. Tschanz, A. Keshavarzi, and V. De.
Design and reliability challenges in nanometer technologies. In Proceedings of
37th Design Automation Conference (DAC), June 2004.
[6] K. Bowman, S.G. Duvall, and J.D. Meindl. Impact of Die-to-Die and Within-
Die Parameter Fluctuations on the Maximum Clock Frequency Distribution
for Gigascale Integration. IEEE Journal of Solid-State Circuits, Vol. 37, No.
2. February 2002.
[7] D. Brooks, V. Tiwari, M. Martonosi. Wattch: A Framework for Architectural-
Level Power Analysis and Optimization. In Proceedings of the International
Symposium on Computer Architecture, June 2000.
[8] T.D. Burd, T.A. Pering, A.J. Stratakos, and R. Brodersen. Dynamic Voltage
Scaled Microprocessor System. IEEE Journal of Solid-State Circuits, vol. 35,
pp. 1571-1580, Nov. 2000.
[9] T. D. Burd and R. Brodersen. Design issues for dynamic voltage scaling.
Proceedings of the International Symposium on Low Power Electronics and
Design, pp. 9-14, 2000.
[10] J.A. Butts and G. Sohi. A Static Power Model for Architects. Proceedings of
the 33rd International Symposium on Microarchitecture (MICRO). December
2000.
[11] Y. Cao, P. Gupta, A. B. Kahng, D. Sylvester and J. Yang. Design Sensitivities
to Variability: Extrapolation and Assessments in Nanometer VLSI. IEEE
ASIC/SoC Conference, September 2002.
[12] K. Chen and C. Hu. Performance and Vdd Scaling in deep submicrometer
circuits. In IEEE Journal of Solid State Circuits, Vol. 33. No. 10. October
1998.
[13] J. Douceur and W. Bolosky. Progress-based Regulation of Low importance
Processes. Proceedings of the Seventeenth ACM Symposium on Operating Sys-
tems Principles, pp. 247-258, December 1999.
[14] S. Dropsho, G. Semeraro, D.H. Albonesi, G. Magklis, and M.L. Scott. Dy-
namically Trading Frequency for Complexity in a GALS Microprocessor. 37th
International Symposium on Microarchitecture, pp. 157-168, December 2004.
[15] B. Fields, S. Rubin, and R. Bodik. Focusing Processor Policies via Critical-
Path Prediction. In 28th International Symposium on Computer Architecture.
June 2001.
[16] B. Fields, R. Bodik, and M.D. Hill. Slack: Maximizing Performance Under
Technological Constraints. In 29th International Symposium on Computer Ar-
chitecture. May 2002.
[17] B. Fields, R. Bodik, M.D. Hill, and C.J. Newburn. Using Interaction Costs
for Microarchitectural Bottleneck Analysis. In 36th International Symposium
on Microarchitecture. December 2003.
[18] K. Flautner, S. Reinhardt, and T. Mudge. Automatic Performance-Setting for
Dynamic Voltage Scaling. In Proceedings of the International Conference on
Mobile Computing and Networking (MOBICOM-7), May 2001.
[19] K. Flautner and T. Mudge. Vertigo: Automatic Performance-Setting for
Linux. In 5th Symposium on Operating Systems Design and Implementation,
December 2002.
[20] M. Fleischmann. LongRun Power Management - Dynamic Power Management
for Crusoe Processors. Transmeta Corporation, 2001.
[21] K. Govil, E. Chan, and H. Wasserman. Comparing algorithms for dynamic
speed-setting of a low-power CPU. Proceedings of ACM Int’l Conf. on Mobile
Computing and Networking, pp. 13-25, Nov. 1995.
[22] D. Grunwald, P. Levis, K. Farkas, C. B. Morrey III, and M. Neufeld. Policies
for Dynamic Clock Scheduling. Proceedings of Operating Systems Design and
Implementation, pp. 73-86, Oct. 2000.
[23] I. Hong, M. Potkonjak, and M. Srivastava. On-line scheduling of hard real-
time tasks. In Proceedings of the International Conference on Computer-Aided
Design, pp. 653-656, November 1998.
[24] E. Humenay, D. Tarjan, K. Skadron. Impact of Parameter Variations on Multi-
Core Chips. In Workshop on Architectural Support for Gigascale Integration
(AGSI) held in Conjunction with ISCA-33. June 2006.
[25] Intel. Intel SpeedStep Technology, Jan 2000.
[26] Intel. Intel 80200 Processor based on Intel XScale Microarchitecture, Nov
2000.
[27] C. Isci and M. Martonosi. Runtime Power Monitoring in High-End Processors:
Methodology and Empirical Data. In Proceedings of the 36th International
Symposium on Microarchitecture (MICRO), December 2003.
[28] T. Ishihara and H. Yasuura. Voltage Scheduling Problem for Dynamically
Variable Voltage Processors. In Proceedings of the International Symposium
on Low Power Electronics and Design, pp. 197-202, August 1998.
[29] S. Iyer and P. Druschel. Anticipatory scheduling: A disk scheduling frame-
work to overcome deceptive idleness in synchronous I/O. In Proceedings of the
18th ACM Symposium on Operating Systems Principles, pp. 117-130, October
2001.
[30] A. Iyer and D. Marculescu. Power Eﬃciency of Multiple Clock, Multiple Volt-
age Cores. In Proc. IEEE/ACM Intl. Conference on Computer-Aided Design
(ICCAD), San Jose, CA, Nov. 2002.
[31] A. Iyer and D. Marculescu. Power-Performance Evaluation of Globally Asyn-
chronous, Locally Synchronous Processors. In Proceedings of the 26th Inter-
national Symposium on Computer Architecture (ISCA), May 2002.
[32] A. Joshi, A. Phansalkar, L. Eeckhout, and L.K. John. Measuring Benchmark
Similarity Using Inherent Program Characteristics. In IEEE Transactions on
Computers, Vol. 55, No. 6, June 2006.
[33] G. Kane and J. Heinrich. MIPS RISC Architecture. Prentice-Hall, 1992.
[34] A. Keshavarzi, J.W. Tschanz, S. Narendra, V. De, K. Roy, C.F. Hawkins,
W.R. Daasch, M. Sachdev. Leakage and Process Variation Eﬀects in Current
Testing on Future CMOS Circuits. In IEEE Design and Test of Computers,
2002.
[35] A.J. KleinOsowski and D.J. Lilja. MinneSPEC: A New SPEC Benchmark
Workload for Simulation-Based Computer Architecture Research. In Com-
puter Architecture Letters, Volume 1, May, 2002.
[36] Kuroda et al. Variable supply-voltage scheme for low-power high-speed
CMOS. IEEE Journal of Solid-State Circuits, vol. 33, no. 3, pp. 454-462,
March 1998.
[37] J. Lorch and A.J. Smith. Improving dynamic voltage scaling algorithms with
PACE. In Proceedings of the ACM SIGMETRICS 2001 Conference, pp. 50-61,
June 2001.
[38] J. Lorch and A.J. Smith. Operating system modiﬁcations for task-based speed
and voltage scheduling. In Proceedings of the First International Conference
on Mobile Systems, Applications, and Services (MobiSys), pp. 215-229, May
2003.
[39] G. Magklis, M.L. Scott, G. Semeraro, D.H. Albonesi, and S. Dropsho. Proﬁle-
based dynamic voltage and frequency scaling for a multiple clock domain
microprocessor. In Proceedings of the 30th Annual International Symposium
on Computer Architecture, June 2003.
[40] A. Manzak and C. Chakrabarti. Variable voltage task scheduling algorithms
for minimizing energy. In Proceedings of International Symposium on Low
Power Electronics and Design, August 2001.
[41] A. Manzak and C. Chakrabarti. Variable Voltage Task Scheduling Algorithms
for Minimizing Energy/Power. IEEE Transactions on VLSI, Vol. 11, No. 2,
April 2003.
[42] D. Marculescu and E. Talpes. Variability and Energy Awareness: A
Microarchitecture-Level Perspective. In Proceedings of the ACM/IEEE De-
sign Automation Conference (DAC), June 2005.
[43] D. Marculescu. Power Eﬃcient Processors Using Multiple Supply Voltages.
In Proc. Workshop on Compilers and Operating Systems for Low Power, in
conjunction with International Conference on Parallel Architectures and Com-
pilation Techniques (PACT), Philadelphia, Oct. 2000.
[44] D. Marculescu. Application Adaptive Energy Eﬃcient Clustered Architec-
tures. In Proc. ACM/IEEE Intl. Symposium on Low Power Electronics and
Design (ISLPED), Newport Beach, CA, Aug. 2004.
[45] R. Marculescu, D. Marculescu, and L. Pileggi. Toward an Integrated Design
Methodology for Fault-Tolerant, Multiple Clock/Voltage Integrated Systems.
In Proc. IEEE Intl. Conference on Computer Design (ICCD), San Jose, CA,
Oct. 2004.
[46] A. J. Martin, A. Lines, R. Manohar, M. Nystrom, P. Penzes, R. Southworth,
U. V. Cummings, and T.K. Lee. The Design of an Asynchronous MIPS R3000.
In Proceedings of the 17th Conference on Advanced Research in VLSI, pp. 164-
174, September 1997.
[47] Thomas Martin. Balancing Batteries, Power and Performance: System Issues
in CPU Speed-Setting for Mobile Computing. Ph.D. thesis, Carnegie Mellon
University, 1999.
[48] M. McKusick, K. Bostic, M. Karels, and J. Quarterman. The Design and
Implementation of the 4.4 BSD Operating System. Addison-Wesley, 1996.
[49] T. Mudge. Power: A First Class Architectural Design Constraint. IEEE Com-
puter Vol. 34, No. 4. April 2001.
[50] K. Niyogi, D. Marculescu. Speed and Voltage Selection for GALS Systems
Based on Voltage/Frequency Islands. In Proc. ACM/IEEE Asian-South Pa-
ciﬁc Design Automation Conference (ASPDAC), Shanghai, China, Jan. 2005.
[51] T. Pering and R. Brodersen. Energy Eﬃcient Voltage Scheduling for Real-
Time Operating Systems. In 4th IEEE Real-Time Technology and Applications
Symposium, Works In Progress Session, 1998.
[52] P. Pillai and K. G. Shin. Real-Time Dynamic Voltage Scaling for Low-Power
Embedded Operating Systems. In Proceedings of the 18th ACM Symposium
on Operating Systems Principles, pp. 89-102. 2001.
[53] J. Pouwelse, K. Langendoen, and H. Sips. Dynamic voltage scaling on a low-
power microprocessor. In Proceedings of the 7th Int. Conference on Mobile
Computing and Networking, pp. 251-259, July 2001.
[54] J. Pouwelse, K. Langendoen, and H. Sips. Energy Priority Scheduling for
Variable Voltage Processors. In Proceedings of the International Symposium
on Low Power Electronics and Design, August 2001.
[55] G. Qu and M. Potkonjak. Energy Minimization with Guaranteed Quality of
Service. In Proceedings of the International Symposium on Low Power Elec-
tronics and Design, pp. 43-49, 2000.
[56] J. Renau, B. Fraguela, J. Tuck, W. Liu, M. Prvulovic, L. Ceze, S.
Sarangi, P. Sack, K. Strauss, and P. Montesinos. SESC Simulator.
http://sesc.sourceforge.net, 2005.
[57] G. Semeraro, D.H. Albonesi, S.G. Dropsho, G. Magklis, S. Dwarkadas, and
M.L. Scott. Dynamic Frequency and Voltage Control for a Multiple Clock
Domain Microarchitecture. In 35th Annual International Conference on Mi-
croarchitecture, November 2002.
[58] G. Semeraro, G. Magklis, R. Balasubramonian, D.H. Albonesi, S. Dwarkadas,
and M.L. Scott. Energy Eﬃcient Processor Design Using Multiple Clock Do-
mains with Dynamic Voltage and Frequency Scaling. In Proceedings of the
8th International Symposium on High-Performance Computer Architecture
(HPCA), February 2002.
[59] G. Semeraro, D.H. Albonesi, S. Dropsho, G. Magklis, S. Dwarkadas, and M.L.
Scott. Improving Application Performance by Dynamically Balancing Speed
and Complexity in a GALS Microprocessor. Workshop on Application Speciﬁc
Processors, December 2003.
[60] G. Semeraro, D.H. Albonesi, G. Magklis, M.L. Scott, S.G. Dropsho, and S.
Dwarkadas. Hiding Synchronization Delays in a GALS Processor Microarchi-
tecture. 10th International Symposium on Asynchronous Circuits and Sys-
tems, pp. 159-169, April 2004.
[61] Semtech. Power Supply Controller for Portable Pentium II & III SpeedStep
Processors, August 2000.
[62] T. Simunic, L. Benini, A. Acquaviva, P. Glynn, and G. DeMicheli. Dynamic
Voltage Scaling and Power Management for Portable Systems. In Proceedings
of the 38th Design Automation Conference, pp. 524-529, 2001.
[63] A. Srivastava, R. Bai, D. Blaauw, and D. Sylvester. Modeling and Analy-
sis of Leakage Power Considering Within-Die Process Variations. In Proceed-
ings of the International Conference on Low Power Electronics and Designs
(ISLPED), August, 2002.
[64] E. Talpes and D. Marculescu. A Critical Analysis of Application-Adaptive
Multiple Clock Processors. In Proc. ACM/IEEE Intl. Symposium on Low
Power Electronics and Design (ISLPED), Seoul, Korea, Aug. 2003.
[65] E. Talpes and D. Marculescu. Increased Scalability and Power Eﬃciency
through Multiple Speed Pipelines. In Proc. ACM Intl. Symposium on Com-
puter Architecture (ISCA), Madison, WI, June 2005.
[66] E. Talpes and D. Marculescu. Toward a Multiple Clock/Voltage Domain Island
Style for Power-Aware Processors. In IEEE Transactions on VLSI, May 2005.
[67] Y.F. Tsai, A. Hegde, N. Vijaykrishnan, M.J. Irwin. ChipPower: An
Architecture-Level Leakage Simulator. In Proceedings of the International
Systems-on-Chip Conference (SoCC), September 2004.
[68] R. Wahbe, S. Lucco, T. E. Anderson, and S. L. Graham. Eﬃcient Software-
Based Fault Isolation. In Proceedings of the 14th ACM Symposium on Oper-
ating Systems Principles, pp. 203-216, December 1993.
[69] M. Weiser, B. Welch, A. J. Demers, and S. Shenker. Scheduling for Reduced
CPU Energy. In Proceedings of the 1st Symposium on Operating System Design
and Implementation, pp.13-23, November 1994.
[70] Q. Wu, P. Juang, M. Martonosi, and D.W. Clark. Formal Online Methods
for Voltage/Frequency Control in Multiple Clock Domain Microprocessors. In
Proceedings of the 11th International Conference on Architectural Support for
Languages and Operating Systems (ASPLOS), October 2004.
[71] Q. Wu, P. Juang, M. Martonosi, and D.W. Clark. Voltage and Frequency Con-
trol With Adaptive Reaction Time in Multiple-Clock-Domain Processors. In
Proceedings of the 11th International Symposium on High-Performance Com-
puter Architecture (HPCA), February 2005.
[72] F. Xie, M. Martonosi, S. Malik. Compile-Time Dynamic Voltage Scaling Set-
tings: Opportunities and Limits. In Proceedings of the 2003 PLDI Conference,
June 2003.
[73] W. Yuan and K. Nahrstedt. Energy-Eﬃcient Soft Real-Time CPU Schedul-
ing for Mobile Multimedia Systems. In Proceedings of the 19th Symposium on
Operating System Principles, October 2003.
[74] Y. Zhu and F. Mueller. Feedback EDF Scheduling Exploiting Dynamic Voltage
Scaling. In Proceedings of Real-Time and Embedded Technology and Applica-
tions Symposium, pp. 203-212, May 2004.
[75] Y. Zhu, D.H. Albonesi, and A. Buyuktosunoglu. A High Performance, Energy
Eﬃcient GALS Processor Microarchitecture with Reduced Implementation
Complexity. In Proceedings of the International Symposium on Performance
Analysis of Systems and Software (ISPASS), March 2005.
[76] Y. Zhu and D.H. Albonesi. Localized Microarchitecture-Level Voltage Man-
agement. International Symposium on Circuits and Systems, pp. 37-40, May
2006.