Member of HiPEAC by John Cavazos et al.
A Predictive Model for Dynamic Microarchitectural Adaptivity Control
Christophe Dubach, Timothy M. Jones
Members of HiPEAC
University of Edinburgh
Edwin V. Bonilla
NICTA &
Australian National University
Michael F. P. O’Boyle
Member of HiPEAC
University of Edinburgh
Abstract—Adaptive microarchitectures are a promising solu-
tion for designing high-performance, power-efﬁcient micropro-
cessors. They offer the ability to tailor computational resources
to the speciﬁc requirements of different programs or program
phases. They have the potential to adapt the hardware cost-
effectively at runtime to any application’s needs. However, one
of the key challenges is how to dynamically determine the
best architecture conﬁguration at any given time, for any new
workload.
This paper proposes a novel control mechanism based on a
predictive model for microarchitectural adaptivity control. This
model is able to efﬁciently control adaptivity by monitoring
the behaviour of an application’s different phases at runtime.
We show that using this model on SPEC 2000, we double the
energy/performance efﬁciency of the processor when compared
to the best static conﬁguration tuned for the whole benchmark
suite. This represents 74% of the improvement available if we
knew the best microarchitecture for each program phase ahead
of time. In addition, we show that the overheads associated with
the implementation of our scheme have a negligible impact on
performance and power.
I. INTRODUCTION
Adaptive superscalar microarchitectures are a promising
solution to the challenge of designing high-performance,
power-efﬁcient microprocessors. They offer the ability to
tailor computational resources to the speciﬁc requirements
of an application, providing performance when the appli-
cation needs it. At other times, hardware structures can
be reorganised or scaled down for a signiﬁcantly reduced
energy cost. These architectures have the potential to cost-
effectively adapt the hardware at runtime to any application’s
needs.
The amount of adaptation available directly determines
the level of performance and power-savings achievable. With
high adaptivity the processor is able to vary many different
microarchitectural parameters. This maximises the degree of
ﬂexibility available to the hardware, allowing adaptation of
the computational resources to best ﬁt the varying structure
of the running program. Although previous work has quanti-
ﬁed the theoretical beneﬁts of high adaptivity [1], predicting
and delivering this adaptation is still an open and challenging
problem. The key question is how to dynamically determine
the right hardware conﬁguration at any time, for any unseen
program.
In order to achieve the potential efﬁciencies of high adap-
tivity we require an effective control mechanism that predicts
the right hardware conﬁguration in time. Simple feedback
mechanisms that predict the future occupancy requirements
of a resource based on the recent past [2], [3] will not scale
to a large number of conﬁgurations. Other prior works have
used statistical machine learning to construct models which
estimate the performance and/or power as a function of the
microarchitectural conﬁguration [4], [5], [6], [7]. However,
these approaches are not practical in a dynamic setting. We
wish to predict the best microarchitectural parameter values
rather than the performance of any given conﬁguration.
Prior work would require online searching and evaluation
of the microarchitectural conﬁguration space which is not
realistic for anything other than trivial design spaces. What
we require are light-weight, runtime control mechanisms.
This paper develops a runtime resource management
scheme that predicts the best hardware conﬁguration for
any phase of a program to maximise energy efﬁciency. We
use a soft-max machine learning model based on runtime
hardware counters to predict the best level of resource
adaptation. Our model is constructed empirically by iden-
tifying optimal designs on training data. Optima from off-
line training quickly guide the model to runtime optima for
each adaptive interval. We show that determining the right
hardware counters is critical in accurately predicting the
right hardware conﬁguration. We also show that predicting
the right conﬁguration is an unusually difﬁcult learning
problem which explains the lack of progress in this area.
Whenever the program enters a new phase of execution,
our technique proﬁles the application to gather a new type
of temporal histogram hardware counter. These are fed into
our model which dynamically predicts the best hardware
conﬁguration to use for that phase and enables us to double
the average energy/performance efficiency over the best pos-
sible static design. This represents 74% of the improvement
available from knowing the best microarchitecture for each
program phase from our sample space ahead of time.
The rest of this paper is structured as follows. Section II
motivates the use of machine learning for adaptivity. Sec-
tion III then describes our approach to dynamic adapta-
tion using a model explained in section IV. Section V
presents the experimental setup and section VI evaluates
our approach. Section VII investigates model accuracy and
section VIII describes implementation details. Section IX
describes related work and finally section X concludes.
[Figure 1, six panels, each plotting the required structure size against time with one line for an 8-wide and one for a 4-wide pipeline. Panels (a) Gap IQ, (b) Applu IQ and (c) Mcf IQ show the required IQ size (axis 0 to 80). Panels (d) Gap RF, (e) Applu RF and (f) Mcf RF show the required integer registers (axis 0 to 160).]
Figure 1. How the optimal size of two processor structures varies with time for pipeline widths 8 and 4 for three applications.
II. THE NEED FOR ML-BASED CONTROL
This paper proposes a novel technique for dynamic mi-
croprocessor adaptation that differs substantially from prior
work. Existing schemes, described in section IX, have either
focused on adapting only a few microarchitectural parame-
ters at a time, or proposed techniques for efﬁcient searching
of the design space at runtime. However, these schemes are
not suited for adapting an entire processor’s resources due to
the complex interactions that exist between hardware struc-
tures. Furthermore, runtime searching is undesirable since
it would inevitably visit poorly-performing conﬁgurations,
reducing overall efﬁciency. We require a control mechanism
that can quickly identify the optimal global hardware conﬁg-
uration to minimise power consumption whilst maintaining
high performance.
To illustrate this point, consider ﬁgure 1 where we show
the changing requirements of two hardware structures for
three applications over time, in order to maximise efﬁciency.
The ﬁrst line in each graph shows the size required for best
efﬁciency when the pipeline width is 8 instructions. The
second line shows the desired size when this is reduced to
4 instructions.
It is clear from this ﬁgure that the sizes of the issue queue
and register ﬁle leading to the best efﬁciency vary over time.
Furthermore, they are different when the width is ﬁxed to 4
compared to a width of 8. For example, in gap the optimal
register ﬁle size is initially 113 in both cases, but quickly
needs to be adjusted to 67 when the width is 4. Conversely,
for applu the desired size does not depend on the width.
Furthermore, looking at the required issue queue size for
each application is not enough to ﬁnd the desired register
ﬁle size. In other words, the structures’ optimal sizes change
over time and these changes are not necessarily correlated
with one another.
This motivates the need for machine learning based con-
trol mechanisms to learn how to adapt each structure and
determine the optimal conﬁguration for the entire processor.
The next section discusses our approach, then section IV
gives a formal description of our model.
III. MACHINE LEARNING FOR ADAPTIVITY CONTROL
Our approach to microarchitectural adaptivity control uses
a machine learning model to automatically determine the
best hardware conﬁguration for each phase of a program.
Our model predicts the best parameters for the entire pro-
cessor design space with only one attempt. To do this we
gather hardware counters that can be used to characterise
the phase and then provide them as an input to our model
to guide its predictions. We ﬁrst give an overview of how
our scheme works, then describe the counters that we gather
through dynamic proﬁling of each program phase.
A. Overview
Figure 2 shows an overview of how our technique works.
In stage 1 the application is monitored so that we can
detect when the program enters a new phase of execution.
We then proﬁle the application on a pre-deﬁned proﬁling
conﬁguration in stage 2 to gather characteristics of the new
phase. These are fed as an input into our machine learning
model which gives us a prediction of the best conﬁguration
to use (stage 3). After the processor has been reconﬁgured
we continue running the application until the next phase
change is detected.
Figure 2. Overview of our technique. The hardware detects phase changes,
then profiles the application on a pre-defined configuration to extract
hardware counters. These are used as an input to our model that predicts
the optimal microarchitectural parameters for the phase. The hardware is
then reconfigured and execution continues.
Table I shows the conﬁgurable microarchitectural param-
eters that we have considered. It represents the design space
of a high-performance out-of-order superscalar processor
and is similar to spaces that other researchers have con-
sidered [1]. We vary fourteen different microarchitectural
parameters across a range of values, giving a total design
space of 627 billion points. The prior analysis column
cites papers that have developed techniques to resize each
of the structures we consider. We discuss this further in
section VIII.
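As a quick check of the design-space size quoted above, the sketch below multiplies the number of values each parameter can take. The counts are copied from the "Num" column of Table I; the dictionary and variable names are only illustrative.

```python
from math import prod

# Number of values each parameter can take ("Num" column of Table I).
VALUES_PER_PARAMETER = {
    "Width": 4,
    "ROB size": 17,
    "IQ size": 10,
    "LSQ size": 10,
    "RF sizes": 16,
    "RF rd ports": 8,
    "RF wr ports": 8,
    "Gshare size": 6,
    "BTB size": 3,
    "Branches allowed": 4,
    "L1 Icache size": 5,
    "L1 Dcache size": 5,
    "L2 Ucache size": 5,
    "Depth (FO4 delay)": 10,
}

# Every combination of parameter values is one point in the design space.
design_space_size = prod(VALUES_PER_PARAMETER.values())
print(len(VALUES_PER_PARAMETER), design_space_size)  # 14 626688000000
```

The product is roughly 627 billion, which matches the total given in Table I.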
The main contribution of this work is a machine learning
model that can accurately predict the best microarchitectural
conﬁguration to use for each program phase. We therefore
focus solely on stages 2 and 3 from ﬁgure 2 in this
paper. Section V describes the experimental methodology
and execution environment in more detail.
B. Dynamic Proﬁling
To characterise each application phase we extract hard-
ware counters from the running program. These are used as
an input to our machine learning model to allow it to predict
the best hardware conﬁguration for the phase.
1) Proﬁling Conﬁguration: One of the main problems
with extracting hardware counters at runtime is the risk
of the internal processor resources saturating: the resources
can become full, causing bottlenecks in the processor. This,
in turn, can hide the real resource requirements making it
difﬁcult to extract accurate information about the program’s
runtime behaviour. To overcome this problem we need to
extract counters on a conﬁguration that makes saturation
unlikely. We therefore brieﬂy use the microarchitectural
conﬁguration with the largest structures and the highest level
of branch speculation (named the proﬁling conﬁguration).
Table I
MICROARCHITECTURAL DESIGN PARAMETERS THAT WERE VARIED
WITH THEIR RANGE, STEPS AND THE NUMBER OF DIFFERENT VALUES
THEY CAN TAKE.
Ranges read "min → max : step", where n+ is an additive step and n∗ a multiplicative step.
Parameter            Value Range        Num     Prior Analysis
Width                2, 4, 6, 8         4       [8], [9], [10]
ROB size             32 → 160 : 8+      17      [11], [12]
IQ size              8 → 80 : 8+        10      [11], [12], [13]
LSQ size             8 → 80 : 8+        10      [3], [12]
RF sizes             40 → 160 : 8+      16      [11], [12]
RF rd ports          2 → 16 : 2+        8
RF wr ports          1 → 8 : 1+         8
Gshare size          1K → 32K : 2∗      6
BTB size             1K, 2K, 4K         3
Branches allowed     8, 16, 24, 32      4
L1 Icache size       8K → 128K : 2∗     5       [14], [15]
L1 Dcache size       8K → 128K : 2∗     5       [14], [15]
L2 Ucache size       256K → 4M : 2∗     5       [14], [15]
Depth (FO4 delay)    9 → 36 : 3+        10      [16], [17], [18]
Total                                   627bn
For each program phase we gather hardware counters
on the proﬁling conﬁguration. We then reconﬁgure to the
conﬁguration predicted by our model and run the application
for that phase. Section VIII demonstrates that the cost of
gathering these counters is negligible. The next section now
describes the counters gathered during this proﬁling phase.
2) Hardware Counters: Table II gives a summary of the
counters that we gather for each processor structure. They
monitor the usage of each structure and the events that occur
during the proﬁle gathering phase and would therefore be
simple to extract in a real implementation. We discuss their
implementation in section VIII, showing that they can be
gathered with low overhead.
One key aspect of our counters is the notion of a temporal
histogram. This shows the distribution of events over time
and is vital to capture the exact requirements of each
structure. Each bin of the histogram stores the number of
cycles that the structure has a particular usage (e.g., 100
cycles with 16 entries used, 200 cycles with 32 entries used,
etc.).
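The temporal histogram described above can be illustrated with a small software sketch. The paper gathers these counters in hardware; the function name, the bin width of 8 entries and the capacity of 80 entries are assumptions chosen here to mirror the queue sizes in Table I.

```python
from collections import Counter

def temporal_histogram(occupancy_per_cycle, bin_width=8, capacity=80):
    """Count how many cycles a structure spent in each occupancy bin.

    occupancy_per_cycle: iterable giving the number of entries in use on each cycle.
    The bin with key b covers occupancies in (b - bin_width, b], so key 16
    covers 9..16 entries. The binning scheme is an assumption of this sketch.
    """
    histogram = Counter({b: 0 for b in range(bin_width, capacity + 1, bin_width)})
    for used in occupancy_per_cycle:
        # Round the occupancy up to the next multiple of bin_width;
        # an empty structure is counted in the first bin.
        upper = max(bin_width, -(-used // bin_width) * bin_width)
        histogram[min(upper, capacity)] += 1
    return dict(histogram)

# 100 cycles with 16 entries used, 200 cycles with 32 entries used, 10 cycles with 5.
trace = [16] * 100 + [32] * 200 + [5] * 10
hist = temporal_histogram(trace)
print(hist[8], hist[16], hist[32])  # 10 100 200
```

Unlike an average occupancy, the histogram keeps the shape of the distribution, so a structure that is mostly empty but occasionally full is distinguishable from one that is always half full.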
Width: For the pipeline width we build a temporal
histogram that keeps track of the usage frequency of each
functional unit type. The histogram bins correspond directly
to the number of units in use.
Queues: We use temporal histograms to collect the
number of entries used in the queue on each cycle. In addi-
tion to this we add information about the average number of
speculative instructions present in the queue and the number
that were mis-speculated. Since our proﬁling conﬁguration
performs a high level of speculation, it is important to know
how many of the instructions are really useful.
Register File: We use temporal histograms to sum-
marise the number of the integer and ﬂoating point registers
used. In addition, temporal histograms are used to store the
usage of the read and write ports.
Table II
HARDWARE COUNTERS USED AS AN INPUT TO OUR MACHINE
LEARNING MODEL.
Width
    ALU usage (histogram)
    Memory port usage (histogram)
Queues
    Queue usage (histogram)
    Speculative instructions (%)
    Mis-speculated instructions (%)
Register File
    Register usage (histogram)
    Read port usage (histogram)
    Write port usage (histogram)
Caches
    Stack distance (histogram)
    Block reuse distance (histogram)
    Set reuse distance (histogram)
    Reduced set reuse distance (histogram)
Branch predictor
    BTB reuse distance (histogram)
    Branch mis-prediction rate (%)
Pipeline depth
    Cycles per instruction
Caches: We use temporal histograms representing
stack distance [19], [20] and reuse distance. Each bin corre-
sponds to a speciﬁc distance. Intuitively the stack distance
is important since it characterises the capacity usage of the
cache. We also estimate the potential conﬂicts that could
arise if the cache size were smaller in the Reduced set reuse
distance histogram. To do this we map the sets to those of
the smallest cache size (as though “emulating” the smallest
cache size available).
Branch Predictor: We use the access reuse distance
within the BTB, which is similar to the block reuse distance
in the caches. The second counter corresponds to the branch
mis-prediction rate which is useful to control the degree of
speculation within the processor.
Pipeline Depth: We only need the average number of
instructions executed per cycle over the entire phase.
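To make the cache counters above more concrete, the following sketch computes a stack distance histogram in software using the common LRU definition: the number of distinct blocks touched since the previous access to the same block. This is an illustrative assumption; the exact definition and the hardware implementation used in the paper follow [19], [20] and section VIII, and the function name and the "cold" key are hypothetical.

```python
from collections import Counter

def stack_distance_histogram(block_addresses):
    """Histogram of LRU stack distances for a sequence of cache-block addresses.

    First-time accesses have no previous use and are recorded under "cold".
    """
    lru_stack = []          # most recently used block is at the end
    histogram = Counter()
    for block in block_addresses:
        if block in lru_stack:
            position = lru_stack.index(block)
            # Blocks above this one in the stack were touched more recently.
            distance = len(lru_stack) - 1 - position
            histogram[distance] += 1
            lru_stack.pop(position)
        else:
            histogram["cold"] += 1
        lru_stack.append(block)
    return histogram

hist = stack_distance_histogram(["A", "B", "C", "A", "A", "B"])
print(hist["cold"], hist[0], hist[2])  # 3 1 2
```

Under this definition an access with distance d hits in a fully associative LRU cache only if the cache holds more than d blocks, which is why the histogram characterises capacity usage.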
C. Example
This section gives an example of how the hardware coun-
ters are used to determine the size of the load/store queue
that will lead to the best energy efﬁciency value. Figure 3
shows the efﬁciency values and counters extracted from
phases within four different programs. For each ﬁgure, the
top graph shows the relative efﬁciency of the processor when
the load/store queue size is varied. By choosing the best
conﬁguration for this phase from our training data (described
in section V-C), we can determine the optimal values for
all other parameters. To obtain maximum efﬁciency, the
size of the load/store queue for mgrid should be 32, swim
72, parser 16 and vortex 16. Underneath are the counters
gathered. The queue usage histogram on the left has bins
corresponding to queue sizes. On the right is the average
number of speculative instructions in the queue and the
fraction that were mis-speculated.
For mgrid and swim we see the best queue size directly
corresponds to the observed usage during the proﬁling
phase. For these applications there are few mis-speculated
instructions (mis-spec) present in the queue during the phase.
Now consider parser and vortex which both have a
signiﬁcant number of mis-speculated instructions. This time
the largest bin in the queue usage histogram counter is 8
which does not directly correspond to the size of the queue
that maximises efﬁciency. Instead, the best size of the queue
is 16 entries in both cases. Since these programs have similar
counters and the same desired queue size, our model can
“learn” this information. So, after training on parser, it can
make the correct prediction when it sees the same counters
again in vortex.
The next section shows how these counters can be used
to build a model that makes a single prediction of the best
hardware conﬁguration to use for this phase.
IV. MODELLING GOOD MICROARCHITECTURAL
CONFIGURATIONS ACROSS PROGRAM PHASES
In order to build a model that predicts good microar-
chitectural conﬁgurations across program phases we require
examples of various microarchitectural configurations on dif-
ferent program phases and their corresponding performance
metrics (e.g., their energy-efﬁciency values). Additionally,
we require a program phase to be characterised by a set of
hardware counters described in the previous section.
Let $\{X^{(j)}\}_{j=1}^{M}$ be the set of training program phases and
$\{x^{(j)}\}_{j=1}^{M}$ be their corresponding D-dimensional vectors of
counters. For each of these program phases we record the
performance on a set of N distinct microarchitectural config-
urations $\{y^{(i)}\}_{i=1}^{N}$. Each component of a microarchitectural
configuration $y$ is a single microarchitectural parameter $y_a$
with a = 1,...,A, with A representing the number of
architectural parameters (14 in this paper). Given a new
program phase $X^*$ described by a set of counters $x^*$, we
aim to predict a set of (good) microarchitectural parameters
$y^*$ that are expected to lead to the highest energy-efficiency.
A. The Model
Our goal is to build a model that correctly captures the
relationship between program phases' hardware counters and
good microarchitectural configurations. In other words, we
aim to learn a mapping $f : \mathcal{X} \to \tilde{\mathcal{Y}}$ from the space of
program phase counters $\mathcal{X}$ to the space of good microarchi-
tectural configurations $\tilde{\mathcal{Y}}$.
[Figure 3, four panels: (a) Mgrid, (b) Swim, (c) Parser, (d) Vortex. In each panel the top graph plots relative efficiency against load/store queue size (8 to 80 in steps of 8). Underneath are the queue usage histogram (%, one bin per queue size) and the percentages of speculative and mis-speculated (mis-spec) instructions.]
Figure 3. Load/store queue counters for four phases from different programs. We also show the relative efficiency achieved when varying the load/store
queue parameters on the best configuration found (higher is better).
In order to achieve this we model the conditional distri-
bution $P(\tilde{y} \mid x)$ of good microarchitectural configurations $\tilde{y}$
given a program phase's counters $x$. In our approach we
consider each microarchitectural parameter to be condition-
ally independent given the counters:
$$P(\tilde{y} \mid x) = \prod_{a=1}^{A} P(\tilde{y}_a \mid x). \qquad (1)$$
It is important to note that there are dependencies between
microarchitectural parameters. However, our model assumes
that good parameters are conditionally independent given
the program phase’s counters, rather than assuming marginal
independence between parameters.
B. Predictions
Given the learnt model, we can predict a set of expected
good microarchitectural parameters $y^*$ for a new program
phase with counters $x^*$ by determining the most likely configuration under
the learnt distribution:
$$y^* = \arg\max_{\tilde{y}} P(\tilde{y} \mid x^*), \qquad (2)$$
where we note that, due to conditional independence, this
reduces to computing the value of each $\tilde{y}_a$ that maximises
each single distribution $P(\tilde{y}_a \mid x^*)$.
C. Model Parametrisation
In our model the conditional distribution of each microar-
chitecture parameter $\tilde{y}$ (where we omit the subindex a for
clarity) given a set of counters $x$ is described by a soft-max
function:
$$P(\tilde{y} = s_k \mid x) = \sigma_k(x, W) = \frac{\exp(w_k^T x)}{\sum_{j=1}^{K} \exp(w_j^T x)}, \qquad (3)$$
where $P(\tilde{y} = s_k \mid x)$ denotes the probability of microarchi-
tectural parameter $\tilde{y}$ having the value $s_k$ (out of K possible
values) given the program phase's counters $x$; and the D × K
matrix of weights $W$ are the model parameters where each
column $\{w_k\}_{k=1}^{K}$ corresponds to a set of weights, one for
each value $\tilde{y}$ can take on¹.
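Equation (3) can be evaluated with a few lines of code. The sketch below is only an illustration: the function name, the toy weights and the counter values are made up, and the weight matrix is stored as a list of its K columns.

```python
import math

def softmax_probabilities(weight_columns, counters):
    """Return P(y = s_k | x) for k = 1..K, following equation (3).

    weight_columns: K lists of D weights (the columns w_k of W).
    counters:       list of D counter values (the vector x).
    """
    scores = [sum(w * x for w, x in zip(w_k, counters)) for w_k in weight_columns]
    # Subtracting the largest score leaves the ratios unchanged
    # but avoids overflow in exp().
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Toy example with D = 2 counters and K = 3 candidate values (made-up numbers).
W_columns = [[0.5, -1.0], [0.1, 0.2], [-0.3, 0.8]]
x = [1.0, 2.0]
probs = softmax_probabilities(W_columns, x)
print(probs.index(max(probs)))  # 2
```

The probabilities sum to one over the K candidate values, so each parameter's model is a distribution over its own settings.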
D. Model Learning
In order to learn the parameters of the model our approach
is based upon likelihood maximisation. For clarity, we focus
on a single microarchitectural parameter y which can take
one out of K possible values as we can learn the model
parameters for each architectural parameter independently.
¹ Other approaches were tried and we found that a soft-max model led
to the best results.
The data likelihood is given by:
$$L(W) = \prod_{n=1}^{\tilde{N}} \prod_{k=1}^{K} P(\tilde{y}^{(n)} = s_k \mid x^{(n)})^{\delta(\tilde{y}^{(n)} = s_k)}, \qquad (4)$$
where $x^{(n)}$ is the vector of counters corresponding to archi-
tecture configuration $\tilde{y}^{(n)}$ and $\delta(\tilde{y}^{(n)} = s_k)$ is an indicator
function that is 1 only when the particular architecture
parameter on data-point n ($\tilde{y}^{(n)}$) takes on the value $s_k$
and zero otherwise. Additionally, we have introduced a
new symbol $\tilde{N}$ denoting the number of good architecture
configurations. In our experiments we have selected the set
of good configurations to be those that are within 5% of the
best empirical performance.
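The selection of good configurations can be sketched as a simple filter. This assumes that "within 5%" is measured relative to the best efficiency observed for the phase and that higher values are better; the function name and the sample values are hypothetical.

```python
def select_good_configurations(efficiencies, tolerance=0.05):
    """Return the indices of configurations whose efficiency is within
    `tolerance` (a fraction) of the best value observed for this phase."""
    best = max(efficiencies)
    threshold = best * (1.0 - tolerance)
    return [i for i, value in enumerate(efficiencies) if value >= threshold]

# Made-up efficiency values for five sampled configurations of one phase.
phase_efficiencies = [0.80, 1.00, 0.97, 0.90, 0.96]
good = select_good_configurations(phase_efficiencies)
print(good)  # [1, 2, 4]
```

Each selected configuration then contributes one training example per microarchitectural parameter, all paired with the same counter vector for the phase.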
By taking the logarithm of equation (4) and using equation
(3) the expression for the data log-likelihood $\mathcal{L}$ that we aim to
maximise is:
$$\mathcal{L} = \sum_{n=1}^{\tilde{N}} \sum_{k=1}^{K} \delta(\tilde{y}^{(n)} = s_k) \log \sigma_k(x^{(n)}, W). \qquad (5)$$
We note that a naive maximum likelihood approach can
lead to severe over-fitting. Hence we have considered a
regularised version of the data log-likelihood $\mathcal{L}$ of equation (5) that
penalises large weights, preventing over-fitting:
$$\mathcal{L}_{POST} = \mathcal{L} - \lambda \, \mathrm{tr}(W^T W), \qquad (6)$$
where tr(·) denotes the trace operator and λ is the regu-
larisation parameter. The penalty is subtracted because $\mathcal{L}_{POST}$ is maximised.
Thus, the optimal solution to the weight parameters is
obtained with:
$$W_{Reg} = \arg\max_{W} \, \mathcal{L}_{POST}. \qquad (7)$$
Training our model means finding the solution for $W_{Reg}$.
This can be done by using conjugate gradient optimisation
with a deterministic initialisation of all the weights to 1 and
with λ = 0.5. See [21] for more information.
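The sketch below illustrates the training objective on toy data. It is not the authors' implementation: plain fixed-step gradient ascent is swapped in for conjugate gradient to keep the example short, the penalty is subtracted so that maximising the objective shrinks the weights, and the step size, iteration count, function names and data are all assumptions. The weights are stored as K rows of D values (the transpose of W).

```python
import math

def softmax(scores):
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def penalised_log_likelihood(W, X, labels, lam):
    """Data log-likelihood minus lam * tr(W^T W)."""
    ll = 0.0
    for x, k in zip(X, labels):
        probs = softmax([sum(w * v for w, v in zip(row, x)) for row in W])
        ll += math.log(probs[k])
    return ll - lam * sum(w * w for row in W for w in row)

def train(X, labels, K, lam=0.5, step=0.05, iterations=500):
    D = len(X[0])
    W = [[1.0] * D for _ in range(K)]  # all weights initialised to 1, as in the text
    for _ in range(iterations):
        # Gradient of the penalty term -lam * tr(W^T W).
        grad = [[-2.0 * lam * w for w in row] for row in W]
        # Gradient of the log-likelihood: sum over samples of (indicator - probability) * x.
        for x, k in zip(X, labels):
            probs = softmax([sum(w * v for w, v in zip(row, x)) for row in W])
            for j in range(K):
                coeff = (1.0 if j == k else 0.0) - probs[j]
                for d in range(D):
                    grad[j][d] += coeff * x[d]
        W = [[w + step * g for w, g in zip(row, g_row)] for row, g_row in zip(W, grad)]
    return W

# Toy data: D = 2 counters, K = 2 candidate values (made-up numbers).
X = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
labels = [0, 0, 1, 1]
W_init = [[1.0, 1.0], [1.0, 1.0]]
W_reg = train(X, labels, K=2)
print(penalised_log_likelihood(W_reg, X, labels, 0.5) >
      penalised_log_likelihood(W_init, X, labels, 0.5))  # True
```

Because the penalised objective is concave in the weights, any reasonable gradient-based optimiser reaches the same maximum; conjugate gradient simply gets there in fewer iterations.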
E. Prediction
To make predictions, only equations (2) and (3) need to
be considered because the training is performed off-line. Let
us assume that we are concerned with making predictions
on a single architecture parameter and that this parameter may
take on one out of K possible values. Additionally, let us say
that the corresponding model parameters are denoted by the
D × K matrix $W$. Hence, the computations involved for a
new program phase characterised by the D × 1 vector of
counters $x^*$ are:
$$b = W^T x^* \qquad (8)$$
$$y^* = \arg\max_{k} \, (b_1, \ldots, b_K), \qquad (9)$$
where we have avoided the exponentiation in equation (3)
by realising that, at prediction time, we can make a hard
decision without computing the probabilities explicitly. This is
valid because the exponential is monotonically increasing and the
denominator of equation (3) is the same for every k.
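A minimal sketch of this prediction step follows, with one independent predictor per parameter as in equation (1). The function names, the dictionary layout and all weights are hypothetical; each weight matrix is stored as a list of its K columns.

```python
def predict_value(weight_columns, counters, candidate_values):
    """Pick the candidate value s_k whose score b_k = w_k^T x is largest
    (equations (8) and (9)); no exponentiation is needed."""
    scores = [sum(w * x for w, x in zip(w_k, counters)) for w_k in weight_columns]
    best_k = max(range(len(scores)), key=scores.__getitem__)
    return candidate_values[best_k]

def predict_configuration(models, counters):
    """Apply one independent predictor per microarchitectural parameter."""
    return {name: predict_value(W, counters, values)
            for name, (W, values) in models.items()}

# Made-up weights for two parameters and D = 2 counters.
models = {
    "Width": ([[0.2, 0.1], [0.4, -0.3], [0.1, 0.5], [-0.2, 0.9]], [2, 4, 6, 8]),
    "BTB size": ([[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]], ["1K", "2K", "4K"]),
}
config = predict_configuration(models, [1.0, 0.5])
print(config)  # {'Width': 6, 'BTB size': '1K'}
```

The cost per parameter is one D × K matrix-vector product followed by a maximum over K values.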
V. EXPERIMENTAL METHODOLOGY
This section presents the simulator and benchmarks used.
We also describe how we gathered our training data and the
methodology used to evaluate our technique.
A. Simulator and Benchmarks
Our cycle-accurate simulator is based on Wattch [22], an
extension to SimpleScalar [23]. We updated the circuit parameters
of Wattch's underlying Cacti [24] models. We also
removed the SimpleScalar RUU and added a reorder buffer,
issue queue and register files. To make our simulations as
realistic as possible we used Cacti to accurately model the
latencies of the microarchitectural components as they varied
in size. To avoid errors resulting from cold structures, we
warmed the caches and branch predictor for 10 million
instructions before performing each detailed simulation.
To evaluate our technique, we used all 26 SPEC CPU
2000 benchmarks [25] compiled with the highest optimisa-
tion level. We ran each benchmark using the reference input
set. We extracted 10 phases per program using SimPoint
with an interval size of 10 million instructions.
B. Performance Metric
We have evaluated the results of our predictor using en-
ergy efficiency as a metric, measured as [ips^3/Watt] where
[ips] is the number of instructions executed per second and
[Watt] is the power consumption in Watts. This metric
represents the trade-offs between power and performance, or
the efﬁciency of each design point. It is widely used within
the architecture community [26] to indicate how efﬁcient a
conﬁguration is at converting energy into processing speed.
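As a worked illustration of the metric, the sketch below computes ips^3/Watt from an instruction count, an execution time and an energy figure. The function name and all numbers are made up for the example.

```python
def energy_efficiency(instructions, seconds, joules):
    """Return ips^3 / Watt for one simulated interval."""
    ips = instructions / seconds   # instructions per second
    watts = joules / seconds       # average power
    return ips ** 3 / watts

# Same work and same energy, but the second configuration finishes in half the time.
baseline = energy_efficiency(instructions=10_000_000, seconds=0.010, joules=0.20)
faster = energy_efficiency(instructions=10_000_000, seconds=0.005, joules=0.20)
ratio = faster / baseline
print(round(ratio, 6))  # 4.0
```

Halving the execution time at constant energy doubles both ips and power, so the metric improves by 2^3 / 2 = 4: the cubic term weights performance more heavily than power.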
C. Gathering the Training Data
As seen in section IV, we need to gather data to train our
model and ﬁnd good solutions within our design space. To
achieve this we ﬁrst searched the design space by uniformly
sampling 1000 random conﬁgurations. We found the best
conﬁguration for each phase, then randomly chose 200
local neighbour conﬁgurations. Finally, we repeated this
by choosing the best out of the 1,200 for each phase and
altered each parameter one at a time to each of its possible
values. This totals 1,298 simulations per phase, or more
than 300,000 in total. In addition, the results of the search
were also used to approximate the best possible performance
achievable per phase.
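The total number of simulations can be checked from the figures given in section V-A, assuming every benchmark contributes exactly 10 phases. The variable names are illustrative.

```python
benchmarks = 26              # SPEC CPU 2000 programs (section V-A)
phases_per_benchmark = 10    # SimPoint phases extracted per program (section V-A)
simulations_per_phase = 1298 # figure quoted in the paragraph above

total_phases = benchmarks * phases_per_benchmark
total_simulations = total_phases * simulations_per_phase
print(total_phases, total_simulations)  # 260 337480
```

This gives about 337,000 simulations, consistent with the "more than 300,000" quoted above.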
D. Evaluation Methodology
With this data we built our model and evaluated it using
leave-one-out cross-validation. This is standard machine
learning methodology that ensures that when we present
results for a speciﬁc program, our model has never been
trained with it.
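The cross-validation scheme can be sketched as follows, holding out all phases of one program at a time. The data layout, function name and sample values are assumptions made for illustration.

```python
def leave_one_program_out(samples):
    """Yield (held_out_program, training_samples, test_samples) splits.

    samples: list of (program_name, counters, good_value) tuples, one per phase.
    """
    programs = sorted({program for program, _, _ in samples})
    for held_out in programs:
        # No phase of the held-out program may appear in the training set.
        train = [s for s in samples if s[0] != held_out]
        test = [s for s in samples if s[0] == held_out]
        yield held_out, train, test

# Made-up phases from three programs.
samples = [
    ("gap", [0.1, 0.9], 32), ("gap", [0.2, 0.8], 32),
    ("mcf", [0.7, 0.3], 16), ("swim", [0.5, 0.5], 72),
]
splits = list(leave_one_program_out(samples))
print([(name, len(train), len(test)) for name, train, test in splits])
# [('gap', 2, 2), ('mcf', 3, 1), ('swim', 3, 1)]
```

Splitting by program rather than by phase matters here: phases of the same program tend to resemble each other, so holding out single phases would leak information into training.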
To evaluate our technique, we proceed in three stages. We
first characterise the current program phase by running part
of it on the profiling configuration in order to gather the
hardware counters. We use our model to make a prediction
and then continue execution of the current phase with the
configuration supplied by our model. We repeat this process
for all the program's phases extracted.
[Figure 4: bar chart of relative efficiency (axis 0 to 7) for each SPEC CPU 2000 benchmark (bzip2_source, crafty, eon_rushmeier, gap, gcc_integrate, gzip_graphic, mcf, parser, perlbmk_704, twolf, vortex_lendian1, vpr_route, ammp, applu, apsi, art_1, equake, facerec, fma3d, galgel, lucas, mesa, mgrid, sixtrack, swim, wupwise) and the mean. Series: Prediction (basic features), Prediction (advanced features).]
Figure 4. Energy-efficiency [ips^3/Watt] achieved by our model com-
pared to the best overall static configuration for SPEC CPU 2000 (higher is
better). Two different sets of hardware counters were used with our model:
the basic counters are made of the standard performance counters available
on current processors while the advanced ones use the new temporal
histogram counters.
VI. RESULTS
This section presents the results of our technique, com-
pared against a baseline static processor conﬁguration.
A. Baseline Conﬁguration
In order to determine a suitable baseline, we examined
all the architecture conﬁgurations in our sample space and
selected the static conﬁguration that led to the best energy-
efﬁciency on average across the benchmarks. This represents
the best achievable with a single ﬁxed static hardware
conﬁguration and is an aggressive baseline. Table III shows
its conﬁguration.
B. Results with two Hardware Counter Sets
In this section we evaluate the gains achievable with
our technique across the benchmark suite for two sets
of hardware counters. The ﬁrst is composed of standard
performance counters available in current processors. This
includes average queue occupancy, number of ALU oper-
ations, average register ﬁle usage, cache access and miss
rates, branch predictor access and miss rates, and average
number of instructions per cycle. The second set of counters
corresponds to the more advanced features presented in
section III-B2 that includes temporal histograms.
Figure 4 shows the energy-efﬁciency improvement
achieved by our approach relative to the baseline conﬁg-
uration for the two counter sets. When compared to the
best static hardware we achieve on average a 2x
improvement in energy-efficiency with the advanced counter
set. In some cases we achieve over 4x the efficiency of the
best static hardware (vortex, art and equake) and up to 6.5x
for mcf. Only in two cases is the best static configuration
slightly better than our approach: eon and lucas.
[Figure 5: bar chart of metric impact (%), axis 0 to 200, for each SPEC CPU 2000 benchmark and the mean. Series: Performance, Energy.]
Figure 5. Performance and energy breakdown for our model when using
the advanced features compared to the best overall static configuration. On
average performance is improved by 15% and energy reduced by 21%.
With the basic counter set, our model only achieves 1.3x
average improvement over the best overall static conﬁgu-
ration. For several benchmarks, the performance is signiﬁ-
cantly below that of the advanced counters. This shows that
the more advanced set of counters is necessary in order to
achieve good performance.
C. Breakdown in Performance and Energy
Having seen the results for the combined efﬁciency met-
ric, we now look at the breakdown in terms of performance
[ips] and energy [Joules]. Figure 5 shows these two metrics
individually compared to the best overall static conﬁgura-
tion. On average we observe a 15% increase in performance
and a 21% decrease in energy. For some benchmarks such as
crafty, the model achieves a remarkable 48% cut in energy
while maintaining the same performance as the baseline
configuration. The model detects that the L2 cache and
the register file are not being fully utilised and reduces
their corresponding sizes to 256K and 64 respectively. In
other cases, such as art, the model decreases the energy
consumption by 15% while at the same time increasing
performance by a factor of 2. In this case, the model increases
the issue width and the number of read/write ports to the
register files and at the same time decreases the size of
the instruction cache to achieve lower energy consumption.
This clearly shows that our approach of driving adaptivity
with a predictive model can offer large benefits to these
applications. They would otherwise exhibit poor energy-
efficiency had we used a fixed static configuration tuned for
the average case.
VII. ANALYSIS OF THE ACCURACY OF THE MODEL
In this section we evaluate the accuracy of our approach
in predicting the best conﬁguration for each phase of the
applications. We also present an analysis of the model
performance at a phase level and show how architectural
configurations vary with program phases.

Table III
THE CONFIGURATION OF OUR BASELINE ARCHITECTURE.
Width  ROB  IQ  LSQ  RF   RF rd  RF wr  Gshare  BTB  Branches  Icache  Dcache  Ucache  Depth
4      144  48  32   160  4      1      16K     1K   24        64K     32K     1M      12

[Figure 6 plot: one group of bars per SPEC CPU 2000 benchmark (bzip2_source to wupwise) plus the MEAN; y-axis: Relative efficiency, 0 to 7; series: Best static per program, Best dynamic, Prediction.]
Figure 6. Energy-efﬁciency achieved by our model for all of SPEC CPU
2000 compared to the best static conﬁguration tailored for each program and
compared with the best dynamic conﬁguration tailored for each program’s
phase. All the values are normalised by the best overall static conﬁguration
(higher is better).
A. Comparison Against Specialised Static Conﬁgurations
Although our approach clearly outperforms any ﬁxed
static conﬁguration, having different specialised static con-
ﬁgurations for each program may be considered an attractive
alternative. This approach is used for domain speciﬁc pro-
cessors such as DSPs and GPUs. Figure 6 shows the perfor-
mance of our technique relative to the best specialised static
conﬁguration found in our sample space for that program.
Clearly such an approach cannot be applied to “unseen”
programs and is not viable for general-purpose computing.
Nonetheless, it gives an important limit evaluation of our
approach.
On average, a specialised static configuration gives a
1.5x improvement compared to the 2x of our
approach. It is guaranteed never to perform worse than
the best average static conﬁguration so does not suffer
performance loss in lucas and eon. Conversely, it is unable
to exploit those cases where there is signiﬁcant improvement
available, e.g., mcf and equake, due to the large intra-
program dynamic phase variation.
B. Comparison Against Ideal Dynamic Conﬁgurations
We now wish to determine how far our model is from
the upper bound on efﬁciency. For this purpose we consider
a scheme that has the ability to adapt the microarchitecture
on a per-phase basis with full knowledge about how the
application and architecture will perform. Therefore we se-
lected, ofﬂine, the best conﬁguration from the sample space
for each phase of each program and then ran each phase with
its corresponding ideal conﬁguration (best dynamic) leading
to maximum energy-efﬁciency.
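The three reference points used in this section (best overall static, best static per program, and best dynamic per phase) can be illustrated with a minimal Python sketch. The efficiency table, the program and configuration names, and the use of a plain arithmetic mean over phases are all assumptions made for illustration; they are not the data or the exact averaging procedure of this paper.

```python
from collections import defaultdict

# Hypothetical efficiency values: eff[(program, phase)][config].
# Higher is better; the numbers are made up for illustration only.
eff = {
    ("progA", 0): {"c0": 1.0, "c1": 2.0, "c2": 0.5},
    ("progA", 1): {"c0": 1.0, "c1": 0.5, "c2": 3.0},
    ("progB", 0): {"c0": 2.0, "c1": 1.0, "c2": 1.0},
    ("progB", 1): {"c0": 2.0, "c1": 1.0, "c2": 1.0},
}
configs = ["c0", "c1", "c2"]


def mean(values):
    values = list(values)
    return sum(values) / len(values)


def best_overall_static(eff, configs):
    """One configuration shared by every phase of every program."""
    return max(configs, key=lambda c: mean(p[c] for p in eff.values()))


def best_static_per_program(eff, configs):
    """One configuration per program, shared by all of its phases."""
    by_prog = defaultdict(list)
    for (prog, _), phase_eff in eff.items():
        by_prog[prog].append(phase_eff)
    return {prog: max(configs, key=lambda c: mean(p[c] for p in phases))
            for prog, phases in by_prog.items()}


def best_dynamic(eff):
    """Oracle: the best sampled configuration for each individual phase."""
    return {key: max(phase_eff, key=phase_eff.get)
            for key, phase_eff in eff.items()}


baseline = best_overall_static(eff, configs)
per_prog = best_static_per_program(eff, configs)
oracle = best_dynamic(eff)

# Mean efficiency of each scheme, normalised by the overall static baseline.
base_mean = mean(p[baseline] for p in eff.values())
static_rel = mean(p[per_prog[k[0]]] for k, p in eff.items()) / base_mean
dynamic_rel = mean(p[oracle[k]] for k, p in eff.items()) / base_mean
```

In this toy table the per-program static choice reaches 1.25x and the per-phase oracle 1.5x of the baseline, which mirrors the ordering discussed above: the oracle can only match or exceed a per-program choice, which in turn can only match or exceed the single overall choice.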
As can be seen in ﬁgure 6, on average this ideal setup
gives an improvement of 2.7x over the best ﬁxed static
conﬁguration. In some cases, like mcf, this improvement
is more than 7x. Even in the worst case, eon, there is
an improvement of 1.5x over the static baseline. As seen
earlier, our technique gives an average improvement of 2x, thus
achieving 74% of the available improvement. Generally the
performance of our approach tracks the maximum available.
In the case of galgel, however, there is a 4x improvement
available, yet we achieve only a factor 2x, showing there is
still room for improvement.
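The percentages quoted here follow from simple ratios of improvement factors. The short Python sketch below reproduces them, assuming that "fraction of the available improvement" means the achieved factor divided by the ideal factor; the function name is ours.

```python
def fraction_of_ideal(achieved: float, ideal: float) -> float:
    """Share of the ideal (best dynamic) improvement factor that is achieved."""
    return achieved / ideal


# Average factors quoted in the text: 2x for the model, 2.7x for the ideal.
average_share = fraction_of_ideal(2.0, 2.7)

# galgel: 2x achieved out of the 4x available.
galgel_share = fraction_of_ideal(2.0, 4.0)

print(f"{average_share:.0%}")  # prints "74%"
print(f"{galgel_share:.0%}")   # prints "50%"
```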
C. Accuracy of Our Approach on a Phase Basis
This section evaluates the accuracy of the predictive
model on a per phase basis. Figure 7(a) shows two graphs
overlaid. The ﬁrst is a histogram representing the distribution
of the efﬁciency values for the 260 phases. The x-axis shows
the improvement achieved for a particular phase relative
to the baseline. The y-axis represents the percentage of
phases with a specific efficiency value. So, for example,
the largest bin has an efﬁciency between 1x and 1.5x of
the baseline and corresponds to approximately 30% of the
phases. As in the previous section, the efﬁciency values are
normalised according to the baseline (i.e., the best overall
static conﬁguration).
To determine how often we are better (or worse) than the
baseline and by how much, we can look at the continuous
line on the graph which is the ECDF (Estimated Cumulative
Distribution Function). It shows how often our approach
achieves at least a certain efﬁciency improvement. For
example we see that our model predicts a conﬁguration
better than the baseline for 80% of the phases. We also notice
that for approximately 33% of the phases the predicted
conﬁguration has an efﬁciency of at least two times that
of the baseline. There are even a small number of phases
that achieve improvement of 32 times the baseline.
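Reading the cumulative line as "the fraction of phases that achieve at least a given efficiency" can be sketched in a few lines of Python. The per-phase values below are invented for illustration and are not the 260 measured phases.

```python
def fraction_at_least(values, threshold):
    """Fraction of phases whose efficiency is >= threshold.

    This is a cumulative distribution accumulated from the right.
    """
    return sum(v >= threshold for v in values) / len(values)


# Hypothetical per-phase efficiencies, normalised by the baseline.
phase_eff = [0.8, 0.9, 1.1, 1.2, 1.4, 1.6, 2.1, 2.5, 3.0, 32.0]

at_least_baseline = fraction_at_least(phase_eff, 1.0)
at_least_double = fraction_at_least(phase_eff, 2.0)
```

With these made-up values, 80% of the phases reach the baseline and 40% reach at least twice the baseline; the same query on the measured data gives the 80% and 33% read off the graph.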
Although it is important to evaluate our approach relative
to the best static conﬁguration, it is equally important to
compare its accuracy against the best dynamic conﬁgurations
found in the sample space for each phase as shown in
ﬁgure 7(b). The best conﬁguration has a value of 1. If the
performance of the predicted conﬁguration is lower than 1,
it means that it is less efﬁcient. A value greater than 1,
although surprising at ﬁrst, indicates that the prediction is
actually better than the best found in the sample space. This
can occur because the best was not established by using an
exhaustive search of the entire space.
[Figure 7 plots: histograms with an ECDF line; y-axis: % of phases, 0 to 100. (a) Baseline vs. Predicted; x-axis: Efficiency (relative to baseline), 0 to 10, with outliers up to 32; annotations: baseline, 80%, 33%. (b) Best vs. Predicted; x-axis: Efficiency (relative to best found), 0.0 to 1.4; annotations: best, 9%, 0.74, 50%.]
Figure 7. Histograms showing the distribution of energy-efﬁciency values
for the 260 different phases extracted from SPEC 2000 when compared
to the baseline (a) and the best (b). In addition the ECDF (estimated
cumulative distribution function) is represented by the solid line. The values
are accumulated from the right.
We notice that 50% of the phases achieved at least 74%
of the efﬁciency of the best conﬁguration. In other words, on
average, we expect our model to achieve 74% of the max-
imum available (conﬁrming earlier results). Interestingly,
for about 9% of the phases, the predicted conﬁguration
actually performs better than the best found using a thousand
samples. This provides evidence that our model can actually
predict very efﬁcient parameters.
D. Architecture Conﬁguration Variation
We now want to show how architectural conﬁgurations
affect the efﬁciency of the overall processor design. Due to
space considerations we only present results for three out of
the fourteen microarchitectural parameters.
Figure 8 shows the distribution of efﬁciency values for
our 260 phases as violin diagrams for the width, instruction
queue, and instruction cache. These graphs show what
happens when the considered parameter is ﬁxed to a speciﬁc
value and all others are allowed to vary in order to ﬁnd
the highest-efﬁciency conﬁguration for each phase. This best
efﬁciency value is recorded on the graph for each phase and
the distribution of these values represented by the violin (the
thicker the violin, the more phases are concentrated around
that value). The % value on top shows the percentage of
phases for which that fixed hardware parameter is best. For
instance, in the case of processor width (figure 8(a)), a width
of 2 is best in 22% of cases, while a width of 4 is best in
32% of cases.

Table IV
NUMBER OF SETS SAMPLED FOR EACH CACHE PER FEATURE TYPE.
Feature Type        Insn. cache  Data cache  L2 cache
Set reuse distance  256          4           16
Blk reuse distance  16           128         32

By observing these graphs it is clear that there is no single
parameter value that is good for all phases. Considering the
issue queue for instance (ﬁgure 8(b)), we see that a size of
72 is only optimal for 34% of the phases. However, for 25%
of the phases, those below the black quartile line, fixing this
value would mean that the best achievable efficiency is 0.6 times
the optimal (i.e., 40% less efficient). In addition we see that
the efficiency of some phases can drop to 0.3, the extreme
lower point of the violin's distribution.
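The data behind each violin can be derived from per-phase samples as sketched below in Python. The sample layout (a list of configuration and efficiency pairs per phase), the parameter names and the numbers are hypothetical and serve only to show the computation described above.

```python
# Hypothetical samples: for each phase, a list of (config, efficiency) pairs,
# where a config maps parameter names to values.
samples = {
    "phase0": [({"width": 2, "iq": 8}, 1.0), ({"width": 4, "iq": 8}, 2.0),
               ({"width": 2, "iq": 16}, 1.5), ({"width": 4, "iq": 16}, 1.0)],
    "phase1": [({"width": 2, "iq": 8}, 3.0), ({"width": 4, "iq": 8}, 1.5),
               ({"width": 2, "iq": 16}, 2.0), ({"width": 4, "iq": 16}, 0.9)],
}


def relative_best_when_fixed(samples, param, value):
    """Per phase: best efficiency with `param` fixed to `value`,
    divided by the best efficiency found for that phase."""
    out = {}
    for phase, runs in samples.items():
        best = max(e for _, e in runs)
        best_fixed = max(e for cfg, e in runs if cfg[param] == value)
        out[phase] = best_fixed / best
    return out


def share_where_best(samples, param, value):
    """Fraction of phases whose best configuration uses `param == value`."""
    hits = 0
    for runs in samples.values():
        best_cfg, _ = max(runs, key=lambda r: r[1])
        hits += best_cfg[param] == value
    return hits / len(samples)


width2 = relative_best_when_fixed(samples, "width", 2)
width4 = relative_best_when_fixed(samples, "width", 4)
width2_share = share_where_best(samples, "width", 2)
```

Each dictionary returned by `relative_best_when_fixed` holds the values whose distribution one violin would display, and `share_where_best` corresponds to the percentage printed on top of it.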
Looking at the instruction cache in ﬁgure 8(c) we see
that a small size (64 sets) is optimal for 28% of the phases.
It is also the value that gives the highest median (white
dot) at about 0.9 of the optimal. So if a designer were to
choose a static architecture, this could be a good candidate.
However, the smallest size is also the one that corresponds
to the lowest efficiency for some phases. We conclude that
there is no one-size-fits-all configuration, which shows the
challenges in building predictors for microarchitectural adaptivity.
VIII. IMPLEMENTATION ANALYSIS
This section describes how our technique could be im-
plemented in an actual processor design. We have evaluated
the costs of gathering our hardware counters and performing
reconﬁguration to demonstrate that our approach can be
implemented at low cost and with few overheads.
Gathering Hardware Counters: The construction of our
temporal histograms is the main overhead when gathering
our hardware counters. However, an efﬁcient implementation
is feasible. Since the caches contain the most complex his-
tograms and consume the largest fraction of total processor
power, they represent an upper bound on the overheads
necessary to characterise program behaviour. The block and
set reuse histograms are the most costly to gather. For each
block the former requires two timestamps (to record the time
the block was brought into the cache and the last hit), and
a hit counter. The latter requires a hit counter per set.
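The bookkeeping described above (two timestamps and a hit counter per block, one hit counter per set) can be pictured with the following Python sketch. The class and method names are hypothetical, and the step that turns these counters into histograms is deliberately left out, since it depends on the histogram definition given earlier in the paper.

```python
from dataclasses import dataclass, field


@dataclass
class BlockStats:
    fill_time: int      # timestamp when the block was brought into the cache
    last_hit_time: int  # timestamp of the most recent hit
    hits: int = 0       # hit counter for the block


@dataclass
class CacheMonitor:
    """Hypothetical monitor; names and interface are illustrative only."""
    blocks: dict = field(default_factory=dict)    # (set, tag) -> BlockStats
    set_hits: dict = field(default_factory=dict)  # set index -> hit counter

    def on_fill(self, set_idx: int, tag: int, now: int) -> None:
        # A newly filled block starts with no hits.
        self.blocks[(set_idx, tag)] = BlockStats(fill_time=now,
                                                 last_hit_time=now)

    def on_hit(self, set_idx: int, tag: int, now: int) -> None:
        stats = self.blocks[(set_idx, tag)]
        stats.last_hit_time = now
        stats.hits += 1
        self.set_hits[set_idx] = self.set_hits.get(set_idx, 0) + 1


monitor = CacheMonitor()
monitor.on_fill(set_idx=3, tag=0xAB, now=100)
monitor.on_hit(set_idx=3, tag=0xAB, now=140)
monitor.on_hit(set_idx=3, tag=0xAB, now=180)
block = monitor.blocks[(3, 0xAB)]
```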
We have used dynamic set sampling [27] to reduce the
number of sets and blocks that need monitoring in order to
build these histograms. We ran the proﬁling conﬁguration
on all program phases and determined the optimum number
of sets that need to be sampled to maintain high prediction
accuracy. The results are shown in table IV. For example,
to gather the data cache's set reuse distance histogram we
only need to sample four sets.

[Figure 8 plots: violin diagrams; x-axis: Parameter value (fixed); y-axis: Efficiency (relative to best found). (a) Width: values 2, 4, 6, 8, best for 22%, 32%, 28%, 18% of phases. (b) Issue Queue: values 8 to 80 in steps of 8, best for 1%, 5%, 2%, 12%, 5%, 18%, 10%, 10%, 34%, 3% of phases. (c) Instruction Cache: values 64, 128, 256, 512, 1024, best for 28%, 7%, 27%, 26%, 12% of phases.]
Figure 8. Distribution of the highest energy-efficiency achievable for the 260 phases when the value of one parameter is fixed and the rest of the parameters
are allowed to vary. For each parameter's value the white central dot represents the median efficiency value achievable in the phases and the black rectangle
shows the two quartiles, where 50% of the data lies. The % value on top shows the percentage of phases for which that fixed hardware parameter is best.

[Figure 9 plots: dynamic and static energy overhead (%), 0.0 to 2.0, for the ICache, DCache and UCache; one plot each for Set reuse distance and Block reuse distance.]
Figure 9. Energy overheads of extracting the set and block reuse distance
for each cache. The maximum overhead for the dynamic energy is 1.55%
in the case of the data cache when extracting the block reuse distance. For
the static energy, a maximum overhead of 1.4% is reached.

Figure 9 shows how this translates into energy overheads.
The maximum dynamic energy overhead is 1.6% when
extracting the block reuse histogram from the data cache,
which also incurs a leakage energy overhead of 1.4%.
However, these overheads are only required when running
the proﬁling conﬁguration. We have veriﬁed experimen-
tally that reconﬁguration occurs once every 10 intervals,
on average. Therefore, the overall overheads of gathering
these counters become almost insigniﬁcant. These results
show that gathering our hardware counters is cost-effective
considering the efﬁciency savings that our model achieves.
Resource Reconﬁguration: Adaptation can be achieved
through the use of simple bitline segmentation of processor
structures [12], [13]. This allows partitions to be turned off in
isolation. We have modelled this within our simulator, allow-
ing a 200ns delay to power up 1.2 million transistors [28].
In addition, we have accurately modelled the delays required
to ﬂush caches and stall the pipeline when resources need
reconﬁguration. Table V shows the results.
We see that the branch predictor is the quickest to reconfigure
at 154 cycles whereas the L2 cache takes the longest
at almost 20,000 cycles. However, the majority of this time is
hidden as transistors can be powered up and down whilst the
resource is still being used. Our results show that the overall
performance penalty when reconfiguration occurs is just 3%
for one interval and the energy overheads are also 3%.
However, since reconfiguration only occurs once every 10
intervals, the overheads for the whole phase are significantly
reduced. This shows that reconfiguring processor resources
can be achieved with very few overheads that are amortized
over the execution of the whole phase.

Table V
OVERHEADS OF RECONFIGURING EACH STRUCTURE IN CYCLES.
Processor structure  Cycle overhead
Width                443
RF                   487
Bpred                154
ROB                  255
IQ / LSQ             234 / 275
ICache / DCache      478 / 620
UCache               18322

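The amortisation argument can be made concrete with a small Python sketch. It assumes that an overhead is paid in exactly one interval out of every ten and that all intervals contribute equally; both are simplifications of ours, and the function name is hypothetical.

```python
def amortised_overhead(per_event_pct: float, intervals_between_events: int) -> float:
    """Average overhead (%) when a per-interval overhead is paid once
    every `intervals_between_events` equally weighted intervals."""
    return per_event_pct / intervals_between_events


# 3% penalty in the interval where reconfiguration occurs, once every 10.
reconfiguration = amortised_overhead(3.0, 10)

# 1.6% maximum dynamic energy overhead while profiling, once every 10.
profiling = amortised_overhead(1.6, 10)
```

Under these assumptions the reconfiguration penalty averages out to roughly 0.3% and the largest counter-gathering overhead to roughly 0.16%.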
Model: Work by Jiménez and Lin [29] has shown
how to build a perceptron-based neural branch predictor.
At prediction time, our technique can be seen as a multi-class
generalisation of the perceptron. We can therefore
use a low-overhead version of their proposed circuit-level
implementation, since our approach does not need to be
trained online. This can be achieved, for example, by using
8-bit signed integers for the weights (W). Since we have
approximately 2000 of these, this would require about 2KB of
storage. Given that the model is only employed once every
10 intervals, on average, we estimate the runtime overheads
to be insignificant.
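As a rough illustration of a multi-class perceptron at prediction time, the Python sketch below scores each class with an integer dot product over 8-bit signed weights and picks the highest score. The weight values, the feature vector and the function names are invented; this is a sketch of the general idea, not the circuit of [29] or the exact model used in this paper.

```python
def predict(weights, features):
    """Return the index of the class with the highest integer dot product.

    weights: one row of signed 8-bit integers per class.
    features: integer feature vector (e.g. quantised counter values).
    """
    for row in weights:
        assert all(-128 <= w <= 127 for w in row), "weights must fit in 8 bits"
    scores = [sum(w * x for w, x in zip(row, features)) for row in weights]
    return max(range(len(scores)), key=scores.__getitem__)


def storage_bytes(weights):
    """One byte per 8-bit weight."""
    return sum(len(row) for row in weights)


# Hypothetical 3-class, 3-feature model.
W = [[12, -7, 3],
     [-4, 9, 1],
     [5, 5, -128]]
x = [2, 1, 4]

chosen = predict(W, x)      # class scores are 29, 5 and -497
size = storage_bytes(W)     # 9 weights, hence 9 bytes
```

The same accounting gives the storage estimate in the text: about 2000 one-byte weights occupy about 2000 bytes, i.e. roughly 2KB.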
IX. PRIOR WORK ON MICROARCHITECTURAL
ADAPTIVITY
Recently, Lee and Brooks [1] showed that it is possible to
signiﬁcantly increase processor energy efﬁciency by adapt-
ing it as a program is running. Our work takes this a step
further and shows that it is possible to build a model that
can automatically drive the adaptation process.
Adaptive Processor Structures: Many researchers have
examined how processor structures can be made adaptive. The last column of table I summarises this information. In
particular the issue queue [2], [3], [11], [12], [13], [30], re-
order buffer [11], [12], register ﬁles [11], [12], pipeline [16],
[8] and caches [14], [30] have been studied.
Dhodapkar and Smith [31] focused on control mecha-
nisms by assessing the use of working set signatures to
detect changes in behaviour of the program. Liang et al.
and Tiwari et al. separately proposed variable latency
architectures where additional stages can be added to the
pipeline to combat process variations [17], [18].
However, these studies considered only a limited adap-
tivity scope and looked at each of the components of
the processor in isolation using control mechanisms based
on simple heuristics. More recently a table-driven tech-
nique [32] was proposed to reduce peak power in an adaptive
processor. In comparison, our work considers varying all
these parameters together and uses a machine learning model
to control the adaptation process.
Multicore Adaptivity: For multicore processors,
Mai et al. illustrated an adaptive memory substrate named
“Smart Memories” [15] and its flexibility when implementing
very different architectures. Later Sankaralingam et al.
proposed the TRIPS architecture [33], Ipek et al. “Core
Fusion” [9] and Tarjan et al. “Core Federation” [10]. These
last two approaches merge simple cores together in order
to create a wide superscalar processor.
Software-Controlled Adaptivity: Several researchers
have looked at adaptivity control from the software side.
Hughes et al. [8] looked at multimedia applications characterised
by repeated frame processing. Hsu and Kremer [34]
implemented a compiler algorithm that adapts the voltage
and frequency based on the characteristics of the code.
Later Wu et al. [35] looked at adapting the voltage within
the context of a dynamic compilation framework which
can monitor and transform the program as it is running.
Huang et al. [36] proposed using subroutines as a natural
way to decide when to reconfigure the processor. Finally
Isci et al. [37] developed a real system framework that
predicts program phases on the fly to guide dynamic voltage
and frequency scaling.
Heuristic-Driven Schemes: Prior work [3] has also
considered controlling adaptivity by looking at hardware
counters extracted at runtime. However, they make use of
a heuristic for search at runtime whereas we directly predict
the best conﬁguration using a machine learning model.
Furthermore, they only focus on three processor queues
whereas we consider 14 parameters at once.
Runtime Exploration: Other researchers looked at
learning the space at runtime [38], [39]. In our context it is
undesirable to perform any sort of runtime exploration since
this would inevitably result in visiting poorly-performing
conﬁgurations and reduce the overall efﬁciency.
Predictive Models: Recently, Ipek et al. [5], Lee
and Brooks [7] and Joseph et al. [6] proposed predictive
modelling (i.e., machine learning) for architectural design
space exploration. These models predict the design space
of a whole program for various architecture conﬁgurations,
thus enabling the efﬁcient exploration of large design spaces.
However, these are limited to whole program modelling and
must ﬁrst be trained for each application needing prediction.
Furthermore, they are not directly usable within the context
of dynamic adaptation since they would require a search of
the design space at runtime.
Phase Detection: Phase detection techniques are at
the core of any dynamic adaptive system and have been
extensively studied previously. The work from Dhodapkar
and Smith [40] offers a good comparison between many
proposed techniques. There are a number of examples of
online phase detection techniques in the literature that rely
on basic block vectors [41], instruction working sets [31]
or conditional branch counts [14], for example. Wavelet
analysis has also gained some attention [42], [43].
X. CONCLUSION AND FUTURE DIRECTIONS
This paper has proposed a novel technique for dynamic
microprocessor adaptation that differs substantially from
prior work. We built a machine-learning model to predict
the best conﬁguration that uses hardware counters collected
at runtime. We have introduced the notion of a temporal
histogram and shown that our model is able to perform
much better using these than conventional performance
counters. By using our model to drive adaptivity we were
able to double the energy-efﬁciency over the best overall
static configuration. This represents 74% of the best that is
achievable within our sampled space.
In this work we have assumed a ﬁxed proﬁling period
and that all resources are adapted at the same time. Given a
hardware substrate capable of reconﬁguring itself at different
frequencies for each resource, the challenge will be to ﬁnd
the degree of adaptation suitable for each hardware structure.
Finally, this paper has targeted a uniprocessor design.
However, the technique presented is directly applicable
in the context of a multicore processor. If each of the cores
could implement our scheme and dynamically adapt to its own
workload, this would lead to true heterogeneity, the key to
high energy-efficiency. In this scenario a possible extension
to this work could be to look at the implications of resource
sharing when driving adaptivity.
ACKNOWLEDGEMENTS
This work was supported by the Royal Academy of
Engineering and EPSRC. It has made use of the resources
provided by the Edinburgh Compute and Data Facility
(ECDF) (http://www.ecdf.ed.ac.uk/). The ECDF is partially
supported by the eDIKT initiative (http://www.edikt.org.uk).
NICTA is funded by the Australian Government as repre-
sented by the Department of Broadband, Communications
and the Digital Economy and the Australian Research Council
through the ICT Centre of Excellence program.
REFERENCES
[1] B. Lee and D. Brooks, “Efﬁciency trends and limits from
comprehensive microarchitectural adaptivity,” in ASPLOS,
2008.
[2] D. Folegnani and A. Gonzalez, “Energy-effective issue logic,”
in ISCA, 2001.
[3] D. Ponomarev, G. Kucuk, and K. Ghose, “Reducing power
requirements of instruction scheduling through dynamic allo-
cation of multiple datapath resources,” in MICRO, 2001.
[4] C. Dubach, T. Jones, and M. O’Boyle, “Microarchitectural
design space exploration using an architecture-centric ap-
proach,” in MICRO, 2007.
[5] E. Ipek, S. A. McKee, B. de Supinski, M. Schulz, and R. Caruana,
“Efficiently exploring architectural design spaces via
predictive modeling,” in ASPLOS, 2006.
[6] P. Joseph, K. Vaswani, and M. J. Thazhuthaveetil, “A pre-
dictive performance model for superscalar processors,” in
MICRO, 2006.
[7] B. Lee and D. Brooks, “Accurate and efﬁcient regression
modeling for microarchitectural performance and power pre-
diction,” in ASPLOS, 2006.
[8] C. Hughes, J. Srinivasan, and S. Adve, “Saving energy
with architectural and frequency adaptations for multimedia
applications,” in MICRO, 2001.
[9] E. Ipek, M. Kirman, N. Kirman, and J. Martinez, “Core
fusion: Accommodating software diversity in chip multipro-
cessors,” in ISCA, 2007.
[10] D. Tarjan, M. Boyer, and K. Skadron, “Federation: Repurpos-
ing scalar cores for out-of-order instruction issue,” in DAC,
2008.
[11] J. Abella and A. González, “On reducing register pressure
and energy in multiple-banked register ﬁles,” in ICCD, 2003.
[12] S. Dropsho, A. Buyuktosunoglu, R. Balasubramonian, D. H.
Albonesi, S. Dwarkadas, G. Semerano, G. Magklis, and M. L.
Scott, “Integrating adaptive on-chip storage structures for
reduced dynamic power,” University of Rochester, Tech. Rep.,
2002.
[13] A. Buyuktosunoglu, D. Albonesi, S. Schuster, D. Brooks,
P. Bose, and P. Cook, “A circuit level implementation of an
adaptive issue queue for power-aware microprocessors,” in
GLSVLSI, 2001.
[14] R. Balasubramonian, D. Albonesi, A. Buyuktosunoglu, and
S. Dwarkadas, “Memory hierarchy reconﬁguration for energy
and performance in general-purpose processor architectures,”
in MICRO, 2000.
[15] K. Mai, T. Paaske, N. Jayasena, R. Ho, W. Dally, and
M. Horowitz, “Smart memories: A modular reconﬁgurable
architecture,” in ISCA, 2000.
[16] A. Efthymiou and J. Garside, “Adaptive pipeline structures
for speculation control,” in ASYNC, May 2003.
[17] X. Liang, G.-Y. Wei, and D. Brooks, “Revival: Variation
tolerant architecture using voltage interpolation and variable
latency,” in ISCA, 2008.
[18] A. Tiwari, S. Sarangi, and J. Torrellas, “Recycle: Pipeline
adaptation to tolerate process variation,” in ISCA, 2007.
[19] K. Beyls and E. H. D’Hollander, “Reuse distance as a metric
for cache behavior,” in PDCS, 2001.
[20] C. Ding and Y. Zhong, “Predicting whole-program locality
through reuse distance analysis,” in PLDI, 2003.
[21] C. M. Bishop, Pattern Recognition and Machine Learning
(Information Science and Statistics). Springer-Verlag New
York, Inc., 2006.
[22] D. Brooks, V. Tiwari, and M. Martonosi, “Wattch: A frame-
work for architectural-level power analysis and optimiza-
tions,” in ISCA, 2000.
[23] D. Burger and T. Austin, “The simplescalar tool set, version
2.0.” University of Wisconsin, Tech. Rep. TR-1342, 1997.
[24] D. Tarjan, S. Thoziyoor, and N. P. Jouppi, “Cacti 4.0,” HP
Laboratories Palo Alto, Tech. Rep. HPL-2006-86, 2006.
[25] J. Henning, “SPEC CPU2000: Measuring CPU performance in
the new millennium,” IEEE Computer, 2000.
[26] A. Hartstein and T. R. Puzak, “Optimum power/performance
pipeline depth,” in MICRO, 2003.
[27] M. K. Qureshi, D. N. Lynch, O. Mutlu, and Y. N. Patt, “A
case for mlp-aware cache replacement,” in ISCA, 2006.
[28] P. Royannez, H. Mair, F. Dahan, M. Wagner, M. Streeter,
L. Bouetel, J. Blasquez, H. Clasen, G. Semino, J. Dong,
D. Scott, B. Pitts, C. Raibaut, and U. Ko, “90nm low leakage
soc design techniques for wireless applications,” in ISSCC,
2005.
[29] D. A. Jiménez and C. Lin, “Neural methods for dynamic
branch prediction,” ACM Trans. on Computer Systems,
vol. 20, 2002.
[30] D. Albonesi, “Dynamic ipc/clock rate optimization,” in ISCA,
1998.
[31] A. Dhodapkar and J. Smith, “Managing multi-conﬁguration
hardware via dynamic working set analysis,” in ISCA, 2002.
[32] V. Kontorinis, A. Shayan, D. M. Tullsen, and R. Kumar,
“Reducing peak power with a table-driven adaptive processor
core,” in MICRO, 2009.
[33] K. Sankaralingam, R. Nagarajan, H. Liu, C. Kim, J. Huh,
D. Burger, S. W. Keckler, and C. R. Moore, “Exploiting
ilp, tlp, and dlp with the polymorphous trips architecture,”
in ISCA, 2003.
[34] C.-H. Hsu and U. Kremer, “The design, implementation, and
evaluation of a compiler algorithm for cpu energy reduction,”
in PLDI, 2003.
[35] Q. Wu, M. Martonosi, D. W. Clark, V. J. Reddi, D. Connors,
Y. Wu, J. Lee, and D. Brooks, “A dynamic compilation
framework for controlling microprocessor energy and perfor-
mance,” in MICRO, 2005.
[36] M. Huang, J. Renau, and J. Torrellas, “Positional adaptation of
processors: Application to energy reduction,” in ISCA, 2003.
[37] C. Isci, G. Contreras, and M. Martonosi, “Live, runtime phase
monitoring and prediction on real systems with application to
dynamic power management,” in MICRO, 2006.
[38] R. Bitirgen, E. Ipek, and J. F. Martinez, “Coordinated man-
agement of multiple interacting resources in chip multipro-
cessors: A machine learning approach,” in MICRO, 2008.
[39] S. Choi and D. Yeung, “Learning-based smt processor re-
source distribution via hill-climbing,” in ISCA, 2006.
[40] A. S. Dhodapkar and J. E. Smith, “Comparing program phase
detection techniques,” in MICRO, 2003.
[41] T. Sherwood, S. Sair, and B. Calder, “Phase tracking and
prediction,” in ISCA, 2003.
[42] C.-B. Cho, W. Zhang, and T. Li, “Informed microarchitec-
ture design space exploration using workload dynamics,” in
MICRO, 2007.
[43] X. Shen, Y. Zhong, and C. Ding, “Locality phase prediction,”
in ASPLOS, 2004.