Informed microarchitecture design space exploration using workload dynamics by Chang-burm Cho et al.
Informed Microarchitecture Design Space Exploration using Workload Dynamics  
 
Chang-Burm Cho, Wangyuan Zhang and Tao Li 
Intelligent Design of Efficient Architecture Lab (IDEAL) 
http://www.ideal.ece.ufl.edu 
Department of ECE, University of Florida 
choreno@ufl.edu; zhangwy@ufl.edu; taoli@ece.ufl.edu 
 
 
Abstract 
 
Program runtime characteristics exhibit significant 
variation. As microprocessor architectures become more 
complex, their efficiency depends on the capability of 
adapting to workload dynamics. Moreover, with the 
approaching billion-transistor microprocessor era, it is not 
always economical or feasible to design processors with 
thermal cooling and reliability redundancy capabilities that 
target an application’s worst case scenario. Therefore, 
analyzing complex workload dynamics early, at the 
microarchitecture design stage, is crucial to forecast 
workload runtime behavior across architecture design 
alternatives and evaluate the efficiency of workload scenario-
based architecture optimizations. Existing methods focus 
exclusively on predicting aggregated workload behavior. In 
this paper, we propose accurate and efficient techniques and 
models to reason about workload dynamics across the 
microarchitecture design space without using detailed cycle-
level simulations. Our proposed techniques employ wavelet-
based multiresolution decomposition and neural network 
based non-linear regression modeling. We extensively 
evaluate the efficiency of our predictive models in forecasting 
performance, power and reliability domain workload 
dynamics that the SPEC CPU 2000 benchmarks manifest on 
high-performance microprocessors with a microarchitecture 
design space that consists of 9 key parameters. Our results 
show that the models achieve high accuracy in revealing 
workload dynamic behavior across a large microarchitecture 
design space. We also demonstrate that the proposed 
techniques can be used to efficiently explore workload 
scenario-driven architecture optimizations. 
 
1. Introduction 
It is well known to the processor design community that 
program runtime characteristics exhibit significant variation. 
As microprocessor architectures become more complex, 
architects increasingly rely on exploiting workload dynamics 
to achieve cost- and complexity-effective design. Therefore, there is a 
growing need for methods that can quickly and accurately 
explore workload dynamic behavior at early 
microarchitecture design stages. Such techniques can quickly 
bring architects insight on application execution scenarios 
across a large design space without resorting to detailed case-
by-case simulations. The workload dynamic profiles are also 
useful in guiding the deployment of scenario-oriented 
architecture optimizations. For example, in light of increasing 
processor power and thermal dissipation, instead of designing 
packaging that can meet the cooling capacity for worst-case 
scenarios, architects can examine how the workload thermal 
dynamics behave across different architecture configurations 
and deploy appropriate dynamic thermal management (DTM) 
policies [1] to mitigate thermal emergencies for their design. 
To obtain the dynamic behavior that programs manifest on 
complex microprocessors and systems, architects resort to 
detailed, cycle-accurate, simulations. Figure 1 illustrates the 
variation in workload dynamics for SPEC CPU 2000 
benchmarks  gap, crafty and vpr, within one execution 
interval. The results show the time-varying behavior of the 
workload performance (gap), power (crafty) and reliability 
(vpr) metrics across simulations with different 
microarchitecture configurations. As can be seen, the 
manifested workload dynamics vary widely across 
processors with different configurations while executing the 
same code base. As the number of parameters in design space 
increases, such variation in workload dynamics cannot be 
captured without using slow, detailed, simulations. However, 
using the simulation-based methods for architecture design 
space exploration where numerous design parameters have to 
be considered is prohibitively expensive. 
 
[Figure 1: three panels plotting CPI (gap), Power (W) (crafty) and AVF (vpr) against Samples (0 to 140).]
Figure 1 Variation of workload performance (gap), 
power (crafty) and reliability (vpr) dynamics 
In the past, phase analysis techniques [2, 3, 4, 5] have been 
proposed to detect and classify program execution that shows 
similar time-varying behavior. These techniques, however, do 
not explicitly reveal workload dynamics and studies [6] show 
that program execution classified as the same phase still 
exhibits drastically different dynamic complexities. Recently, 
researchers proposed several predictive models [7, 8, 9, 10, 11, 
12] to reason about aggregated workload behavior at the 
architecture design stage. Among them, linear regression and 
neural network models have been the most used approaches. 
Linear models are straightforward to understand and provide 
accurate estimates of the significance of parameters and their 
interactions. However, they are usually inadequate for 
modeling the non-linear dynamics of real-world workloads 
which exhibit widely different characteristics and complexity. 
Of the non-linear methods, neural network models can 
accurately predict the aggregated program statistics (e.g. CPI 
of the entire workload execution). Such models are termed as 
global models as only one model is used to characterize the 
measured programs. The monolithic global models are 
incapable of capturing and revealing program dynamics 
which contain interesting fine-grain behavior. On the other 
hand, a workload may produce different dynamics when the 
underlying architecture configurations have changed. 
Therefore, new methods are needed for accurately predicting 
complex workload dynamics. 
To overcome the problems of monolithic, global predictive 
models, we propose a novel scheme that incorporates 
wavelet-based multiresolution decomposition techniques, 
which can produce a good local representation of the 
workload behavior in both time and frequency domains. The 
proposed analytical models, which combine wavelet-based 
multiscale data representation and neural network based 
regression prediction, can efficiently reason about program 
dynamics without resorting to detailed simulations. With our 
schemes, the complex workload dynamics are decomposed 
into a series of wavelet coefficients. In the transform domain, 
each individual wavelet coefficient is modeled by a separate 
neural network. We evaluate the efficiency of using wavelet 
neural networks for predicting the dynamics that the SPEC 
CPU 2000 benchmarks manifest on high performance 
microprocessors with a microarchitecture design space that 
consists of 9 key parameters. Our results show that the models 
achieve high accuracy in forecasting workload dynamics 
across a large microarchitecture design space. 
The contributions of this paper are:  
•  We propose the use of wavelet neural networks to build 
accurate predictive models for workload dynamic driven 
microarchitecture design space exploration. We show that 
wavelet neural networks can be used to accurately and 
cost-effectively capture complex workload dynamics 
across different microarchitecture configurations. 
•  We evaluate the efficiency of using the proposed 
techniques to predict workload dynamic behavior in 
performance, power and reliability domains. We perform 
extensive simulations to analyze the impact of wavelet 
coefficient selection and sampling rate on prediction 
accuracy and identify microarchitecture parameters that 
significantly affect workload dynamic behavior. 
•  We present a case study of using workload dynamic aware 
predictive models to quickly estimate the efficiency of 
scenario-driven architecture optimizations across different 
domains. Experimental results show that the predictive 
models are highly efficient in rendering workload 
execution scenarios. 
The rest of this paper is organized as follows. In the next 
section, we briefly describe the wavelet transform and the 
multiresolution analysis. The principles of neural networks 
are also introduced. Section 3 presents our wavelet based 
neural networks for workload dynamics prediction and system 
details. Section 4 describes our experimental setup. Section 5 
presents our experimental results on workload dynamics 
prediction and analyzes the tradeoffs between model 
complexity, configuration, and prediction accuracy. Section 6 
demonstrates the efficiency of using the proposed techniques 
in evaluating workload dynamic-driven architecture 
optimizations. Section 7 discusses related work. Section 8 
summarizes this work and presents future directions. 
 
2. Background 
 
To familiarize the reader with the general methods used in 
this paper, we provide a brief overview of wavelet-based 
multiresolution analysis and neural network-based regression 
prediction in this section. For more details on wavelets 
and neural networks, the reader is referred to [13, 14]. 
2.1. Wavelet Theory and Multiresolution Analysis 
 
Wavelets are mathematical tools that use a prototype 
function (called the analyzing or mother wavelet) to 
transform data of interest into different frequency 
components, and then analyze each component with a 
resolution matched to its scale. Therefore, the wavelet 
transform provides a compact and effective mathematical 
representation of data. In contrast to Fourier transforms, 
which only offer frequency representations, wavelets are 
capable of providing time and frequency localizations 
simultaneously. Wavelet analysis employs two functions, 
often referred to as the scaling filter and the wavelet filter, to 
generate a family of functions that break down the original 
data. The scaling filter is similar in concept to an 
approximation function, while the wavelet filter quantifies the 
differences between the original data and the approximation 
generated by the scaling function. Wavelet analysis allows 
one to choose the pair of scaling and wavelet filters from 
numerous functions. In this section, we provide a quick 
primer on wavelet analysis using the Haar wavelet, which is 
the simplest form of wavelets [15].  
The Haar discrete wavelet transform (DWT) works by 
averaging two adjacent values on a series of data at a given 
scale to form smoothed, lower-dimensional data (i.e. 
approximations), and by computing the resulting coefficients 
(i.e. details), which are the differences between the values 
and their averages. Figure 2 illustrates the procedure of using 
the Haar-based DWT to transform a series of data {3, 4, 20, 
25, 15, 5, 20, 3}. As can be seen, scale 1 is the finest 
representation of the data. At scale 2, the approximations 
{3.5, 22.5, 10, 11.5} 
are obtained by taking the average of {3, 4}, {20, 25}, {15, 
5} and {20, 3} at scale 1 respectively. The details {-0.5, -2.5, 
5, 8.5} are the differences of {3, 4}, {20, 25}, {15, 5} and 
{20, 3} divided by 2 respectively. The process continues by 
decomposing the scaling coefficient (approximation) vector 
using the same steps, and completes when only one 
coefficient remains.  
Original Data: 3, 4, 20, 25, 15, 5, 20, 3
Scaling Filter (G0): 3.5, 22.5, 10, 11.5 | Wavelet Filter (H0): -0.5, -2.5, 5, 8.5
Scaling Filter (G1): 13, 10.75 | Wavelet Filter (H1): -9.5, -0.75
Scaling Filter (G2): 11.875 | Wavelet Filter (H2): 1.125
Transform: 11.875 (Approximation, Level 0) | 1.125 (Detail, Level 1) | -9.5, -0.75 (Detail Coefficients, Level 2) | -0.5, -2.5, 5, 8.5 (Detail Coefficients, Level 3)
Figure 2 An example of Haar wavelet transform on 
data {3, 4, 20, 25, 15, 5, 20, 3} 
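The averaging and differencing steps of Figure 2 can be sketched in a few lines of Python. This is a minimal illustration that assumes the unnormalized convention of the example (pairwise average, and pairwise difference divided by 2); the function name `haar_dwt` is chosen here for illustration and does not come from the paper.

```python
def haar_dwt(data):
    """Haar decomposition by repeated pairwise averaging and differencing.

    Returns the overall average followed by the detail coefficients,
    ordered from the coarsest level to the finest level.
    """
    if len(data) == 0 or len(data) & (len(data) - 1):
        raise ValueError("input length must be a power of two")
    approx = [float(v) for v in data]
    details = []
    while len(approx) > 1:
        pairs = list(zip(approx[0::2], approx[1::2]))
        level_details = [(a - b) / 2 for a, b in pairs]  # wavelet filter
        approx = [(a + b) / 2 for a, b in pairs]         # scaling filter
        details = level_details + details                # coarser levels go first
    return approx + details


coeffs = haar_dwt([3, 4, 20, 25, 15, 5, 20, 3])
print(coeffs)  # [11.875, 1.125, -9.5, -0.75, -0.5, -2.5, 5.0, 8.5]
```

The printed list matches the bottom row of Figure 2: the level 0 approximation followed by the detail coefficients of levels 1 to 3.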
As a result, the wavelet decomposition is the collection of 
average and detail coefficients at all scales. In other words, 
the wavelet transform of the original data is the single 
coefficient representing the overall average of the original 
data, followed by the detail coefficients in order of increasing 
resolutions. Different resolutions can be obtained by adding 
difference values back or subtracting differences from the 
averages. For instance, {13, 10.75} = {11.875+1.125, 11.875-
1.125} where 11.875 and 1.125 are the first and the second 
coefficient respectively. This process can be performed 
recursively until the finest scale is reached. Therefore, 
through an inverse transform, the original data can be 
recovered from wavelet coefficients. The original data can be 
perfectly recovered if all wavelet coefficients are involved. 
Alternatively, an approximation of the time series can be 
reconstructed using a subset of wavelet coefficients. Using a 
wavelet transform gives time-frequency localization of the 
original data. As a result, the time domain signal can be 
accurately approximated using only a few wavelet 
coefficients since they capture most of the energy of the input 
data. 
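The reconstruction described above can be sketched as follows. The code assumes the same unnormalized averaging and differencing convention as the example, with coefficients ordered as the overall average followed by details from coarse to fine; `haar_idwt` and `keep_first` are illustrative names, not part of the original work.

```python
def haar_idwt(coeffs):
    """Invert the averaging/differencing Haar transform.

    Each approximation value a and its detail d expand into the
    pair (a + d, a - d) at the next finer scale.
    """
    approx = [float(coeffs[0])]
    pos = 1
    while pos < len(coeffs):
        details = coeffs[pos:pos + len(approx)]
        pos += len(approx)
        finer = []
        for a, d in zip(approx, details):
            finer.extend([a + d, a - d])
        approx = finer
    return approx


def keep_first(coeffs, k):
    """Keep the first k coefficients and replace the rest with 0."""
    return list(coeffs[:k]) + [0.0] * (len(coeffs) - k)


coeffs = [11.875, 1.125, -9.5, -0.75, -0.5, -2.5, 5.0, 8.5]
print(haar_idwt(coeffs))                 # [3.0, 4.0, 20.0, 25.0, 15.0, 5.0, 20.0, 3.0]
print(haar_idwt(keep_first(coeffs, 2)))  # [13.0, 13.0, 13.0, 13.0, 10.75, 10.75, 10.75, 10.75]
```

With all eight coefficients the original series is recovered exactly; with only the first two, each half of the series is replaced by its average, which is the coarse approximation {13, 10.75} mentioned in the text.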
The wavelet transform provides a natural hierarchy 
structure for multiresolution data analysis. This property is 
very useful in capturing complex workload dynamics. Figure 
3 shows the sampled time domain workload execution 
behavior (The y-axis represents sampled workload IPC) on 
the benchmark gcc within one execution interval. In this 
example, the program execution is represented by 64 sampled 
data points. Since variation of program execution over time 
can be viewed as signals or time series, we apply discrete 
wavelet analysis to time series and the generated wavelet 
coefficients capture the characteristics that programs manifest 
at different scales. We then reconstruct workload time 
domain behavior using a subset of wavelet coefficients. 
Figure 4 shows the synthesized time domain workload 
dynamics using various numbers of wavelet coefficients. In 
Figure 4 (a)-(e), the first 1, 2, 4, 8, and 16 wavelet 
coefficients were used to approximate (i.e. using inverse 
wavelet transforms) program time domain behavior with 
increasing fidelity. As shown in Figure 4 (f), when all (i.e. 
64) wavelet coefficients are used, the original signal can be 
completely restored. Since a small set of wavelet coefficients 
provides a concise yet informative description of workload 
dynamics, we use predictive models (i.e. neural networks) to 
relate them individually with various microarchitecture 
design parameters. 
2.2. Neural Network 
 
An Artificial Neural Network (ANN) is an information 
processing paradigm that is inspired by the way biological 
nervous systems process information. It is composed of a set 
of interconnected processing elements working in unison to 
solve problems. The most common type of neural network 
(shown as Figure 5) consists of three layers of units: a layer 
of input units is connected to a layer of hidden units, which is 
connected to a layer of output units. The input is fed into the 
network through input units. Each hidden unit receives the 
entire input vector and generates a response. The output of a 
hidden unit is determined by the input-output transfer 
function that is specified for that unit. The commonly used 
transfer functions include sigmoid, linear threshold function 
and Radial Basis Function (RBF) [16]. The ANN output, 
which is determined by the output unit, is computed using the 
responses of the hidden units and the weights between the 
hidden and output units. Neural networks outperform linear 
models in capturing the complex, non-linear relation between 
input and output, which makes them promising techniques for 
tracking and forecasting workload dynamics. 
[Figure 5: inputs x1 ... xn in the input layer feed a hidden layer, whose responses are combined through weights w1, w2, ..., wn into the output f(x) in the output layer.]
Figure 5 Basic architecture of a neural network 
In this study, we use the RBF transfer function to model 
and estimate workload dynamic characteristics on unexplored 
design spaces because of its superior capability in 
approximating complex functions. The basic architecture of 
an RBF network with n-inputs and a single output is shown in 
Figure 5. The nodes in adjacent layers are fully connected. 
Such a network can be represented by the following 
parametric model: 
\[ f(x) = \sum_{i=1}^{r} w_i \, \phi_i(\|X - \mu_i\|, \theta_i), \]
where $X$ is an input vector, $\phi_i$ is the basis function of the 
network, the $w_i$'s are the weights of the network, 
$\mu_i = (\mu_{i1}, \mu_{i2}, \ldots, \mu_{in})^T$ is called the center vector of the $i$th 
node, $\theta_i = (\theta_{i1}, \theta_{i2}, \ldots, \theta_{in})^T$ is called the radius vector of the 
$i$th node, and $\|\cdot\|$ denotes the Euclidean norm. 
 
Figure 3 Sampled time domain program behavior 
Figure 4 Workload dynamics can be synthesized using a set of wavelet coefficients: (a) 1, (b) 2, (c) 4, (d) 8, (e) 16 and (f) 64 wavelet coefficients 
 
In our study, a Gaussian function is used as the basis 
function of the network, i.e., 
\[ \phi_i(\|X - \mu_i\|, \theta_i) = \exp\left( -\frac{\|X - \mu_i\|^2}{(\theta_{i1}, \theta_{i2}, \ldots, \theta_{in})} \right) \]
 
The above RBF function has the highest response when the 
input coincides with the center vector, and the response 
decreases monotonically as the input moves away from the 
center, at a rate controlled by the radius vector. The training of the RBF 
network involves selecting the center locations and radii 
(which are eventually used to determine the weights) using a 
regression tree. A regression tree recursively partitions the 
input data set into subsets with decision criteria. As a result, 
there will be a root node, non-terminal nodes (having sub 
nodes) and terminal nodes (having no sub nodes) which are 
associated with input dataset. Each node contributes one unit 
to RBF network’s center and radius vectors. In our study, the 
selection of RBF centers is performed by recursively parsing 
regression tree nodes using a strategy proposed in [16]. 
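The forward computation of such a network can be sketched as below. This is a minimal illustration, not the authors' implementation: it assumes one common reading of the Gaussian basis in which each input dimension is scaled by its own radius, and the weights, centers and radii in the example are hypothetical values rather than ones derived from a regression tree.

```python
import math


def gaussian_rbf(x, center, radius):
    """Gaussian basis response, assuming per-dimension radius scaling:
    exp(-sum_k ((x_k - mu_k) / theta_k)^2)."""
    return math.exp(-sum(((xk - mk) / rk) ** 2
                         for xk, mk, rk in zip(x, center, radius)))


def rbf_network(x, weights, centers, radii):
    """Weighted sum of the hidden-unit responses: f(x) = sum_i w_i * phi_i."""
    return sum(w * gaussian_rbf(x, c, r)
               for w, c, r in zip(weights, centers, radii))


# Hypothetical two-unit network over a two-dimensional input.
weights = [2.0, -1.0]
centers = [[0.0, 0.0], [1.0, 1.0]]
radii = [[1.0, 1.0], [0.5, 2.0]]

print(gaussian_rbf([0.0, 0.0], centers[0], radii[0]))  # 1.0 at the center
print(rbf_network([0.5, 0.5], weights, centers, radii))
```

The response of a unit equals 1 when the input sits on its center and decays as the input moves away, faster along dimensions with a small radius.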
2.3. Combining Wavelets and Neural Networks for 
Workload Dynamics Prediction 
We view workload dynamics as a time series produced by 
the processor which is a nonlinear function of its design 
parameter configuration. Instead of predicting this function at 
every sampling point, we employ wavelets to approximate it. 
Previous work [8, 10, 12] shows that neural networks can 
accurately predict aggregated workload behavior during 
design space exploration. Nevertheless, the monolithic global 
neural network models lack the capability of revealing 
complex workload dynamics. To overcome this disadvantage, 
we propose using wavelet neural networks that incorporate 
multiscale wavelet analysis into a set of neural networks for 
workload dynamics prediction. The wavelet transform is a 
very powerful tool for dealing with dynamic behavior since it 
captures both workload global and local behavior using a set 
of wavelet coefficients. The short-term workload 
characteristics are decomposed into the lower scales of 
wavelet coefficients (high frequencies) which are utilized for 
detailed analysis and prediction, while the global workload 
behavior is decomposed into higher scales of wavelet 
coefficients (low frequencies) that are used for the analysis 
and prediction of slow trends in the workload execution. 
Collectively, these coordinated scales of time and frequency 
provide an accurate interpretation of workload dynamics. 
Our wavelet neural networks use a separate RBF neural 
network to predict individual wavelet coefficients at different 
scales. The predictions of the individual wavelet coefficients 
proceed independently. Predicting each wavelet 
coefficient by a separate neural network simplifies the 
training task of each sub-network. The prediction results for 
the wavelet coefficients can be combined directly by the 
inverse wavelet transform to predict the workload dynamics. 
Figure 6 shows our hybrid neuro-wavelet scheme for 
workload dynamics prediction. Given the observed workload 
dynamics on training data, our aim is to predict workload 
dynamic behavior under different architecture configurations. 
The hybrid scheme basically involves three stages. In the first 
stage, the time series is decomposed by wavelet 
multiresolution analysis. In the second stage, each wavelet 
coefficient is predicted by a separate ANN. In the third stage, 
the approximated time series is recovered from the predicted 
wavelet coefficients. Each RBF neural network receives the 
entire microarchitectural design space vector and predicts a 
wavelet coefficient. The training of an RBF network involves 
determining the center point and a radius for each RBF and 
the weights of each RBF, which determine the wavelet 
coefficients. 
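The three stages can be sketched end to end as follows. This is an illustrative sketch under several assumptions that are not taken from the paper: the RBF regressor places one Gaussian unit on every training point with a single shared radius (instead of regression-tree-based center selection), the coefficients to model are ranked by mean magnitude over the training set, and `toy_trace` is a hypothetical stand-in for simulator output.

```python
import numpy as np


def haar_dwt(x):
    """Stage 1: averaging/differencing Haar transform, coarsest first."""
    approx, details = np.asarray(x, dtype=float), []
    while approx.size > 1:
        a, b = approx[0::2], approx[1::2]
        details.insert(0, (a - b) / 2)
        approx = (a + b) / 2
    return np.concatenate([approx] + details)


def haar_idwt(c):
    """Stage 3: inverse transform, expanding (a, d) into (a + d, a - d)."""
    c = np.asarray(c, dtype=float)
    approx, pos = c[:1], 1
    while pos < c.size:
        d = c[pos:pos + approx.size]
        pos += approx.size
        finer = np.empty(2 * approx.size)
        finer[0::2], finer[1::2] = approx + d, approx - d
        approx = finer
    return approx


class RBFRegressor:
    """Gaussian RBF network with one unit per training point (assumption)."""

    def __init__(self, radius=1.0):
        self.radius = radius

    def _design(self, X):
        d2 = ((X[:, None, :] - self.centers[None, :, :]) ** 2).sum(axis=2)
        return np.exp(-d2 / self.radius ** 2)

    def fit(self, X, y):
        self.centers = np.asarray(X, dtype=float)
        H = self._design(self.centers)
        self.weights, *_ = np.linalg.lstsq(H, y, rcond=None)
        return self

    def predict(self, X):
        return self._design(np.asarray(X, dtype=float)) @ self.weights


def fit_models(configs, traces, k):
    """Stage 2 (training): one regressor per retained wavelet coefficient."""
    coeffs = np.array([haar_dwt(t) for t in traces])
    keep = np.argsort(-np.abs(coeffs).mean(axis=0))[:k]
    models = {int(j): RBFRegressor().fit(configs, coeffs[:, j]) for j in keep}
    return models, coeffs.shape[1]


def predict_trace(models, n_coeffs, config):
    """Predict the kept coefficients, zero the rest, then invert."""
    c = np.zeros(n_coeffs)
    x = np.atleast_2d(np.asarray(config, dtype=float))
    for j, model in models.items():
        c[j] = model.predict(x)[0]
    return haar_idwt(c)


def toy_trace(cfg, n=16):
    """Hypothetical workload trace as a function of a 2-parameter config."""
    t = np.arange(n)
    return 1.0 + 0.2 * cfg[0] + 0.1 * cfg[1] * np.sin(2 * np.pi * t / n)


configs = np.array([[a, b] for a in range(4) for b in range(4)], dtype=float)
traces = [toy_trace(c) for c in configs]
models, n_coeffs = fit_models(configs, traces, k=16)
print(predict_trace(models, n_coeffs, [1.5, 2.5]).round(3))
```

Because each coefficient has its own small regressor, the models can be trained and evaluated independently, and a smaller `k` trades reconstruction fidelity for fewer models.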
[Figure 6: the time-domain workload dynamics pass through the wavelet decomposition filters (G0/H0, G1/H1, ..., Gk/Hk) to yield wavelet coefficients; a set of RBF neural networks, each taking the microarchitecture design parameters as input, outputs predicted wavelet coefficients 1 to n; the wavelet reconstruction filters (G*0/H*0, G*1/H*1, ..., G*k/H*k) turn the predicted wavelet coefficients into the synthesized time-domain workload dynamics.]
Step 1. Workload dynamics is decomposed into a series of wavelet coefficients using discrete wavelet transform. 
Step 2. Each wavelet coefficient is predicted by a separate ANN. 
Step 3. Workload dynamics is reconstructed by an inverse wavelet transform on predicted wavelet coefficients. 
Figure 6. Using wavelet neural networks for workload dynamics prediction 

3. Experimental Methodology 
We evaluate the efficiency of using wavelet neural 
networks to explore workload dynamics in performance, 
power and reliability domains during microarchitecture 
design space exploration. We use a unified, detailed 
microarchitecture simulator in our experiments. Our 
simulation framework, built using a heavily modified and 
extended version of the Simplescalar tool set [17], models 
pipelined, multiple-issue, out-of-order execution 
microprocessors with multi-level caches. Our framework uses a 
Wattch-based power model [18]. In addition, we built the 
Architecture Vulnerability Factor (AVF) analysis methods 
proposed in [19, 20] to estimate processor microarchitecture 
vulnerability to transient faults. A microarchitecture 
structure’s AVF refers to the probability that a transient fault 
in that hardware structure will result in incorrect program 
results. The AVF metric can be used to estimate how 
vulnerable the hardware is to soft errors during program 
execution. Table 1 summarizes the baseline machine 
configurations of our simulator. 
Table 1. Simulated machine configuration 
Parameter | Configuration 
Processor Width | 8-wide fetch/issue/commit 
Issue Queue | 96 
ITLB | 128 entries, 4-way, 200 cycle miss 
Branch Predictor | 2K entries Gshare, 10-bit global history 
BTB | 2K entries, 4-way 
Return Address | 32 entries RAS 
L1 Instruction Cache | 32K, 2-way, 32 Byte/line, 2 ports, 1 cycle access 
ROB Size | 96 entries 
Load/Store | 48 entries 
Integer ALU | 8 I-ALU, 4 I-MUL/DIV, 4 Load/Store 
FP ALU | 8 FP-ALU, 4 FP-MUL/DIV/SQRT 
DTLB | 256 entries, 4-way, 200 cycle miss 
L1 Data Cache | 64KB, 4-way, 64 Byte/line, 2 ports, 1 cycle 
L2 Cache | unified 2MB, 4-way, 128 Byte/line, 12 cycle access 
Memory Access | 64 bit wide, 200 cycles access latency 
 
We perform our analysis using twelve SPEC CPU 2000 
benchmarks: bzip2, crafty, eon, gap, gcc, mcf, parser, 
perlbmk, twolf, swim, vortex and vpr. We use the Simpoint 
tool [2] to pick the most representative simulation point for 
each benchmark (with full reference input set) and each 
benchmark is fast-forwarded to its representative point before 
detailed simulation takes place. Each simulation contains 
200M instructions. In this study, we consider a design space 
that consists of 9 microarchitectural parameters (see Table 2) 
of the superscalar architecture. These microarchitectural 
parameters have been shown to have the largest impact on 
processor performance [8]. The ranges for these parameters 
were set to include both typical and feasible design points 
within the explored design space. Using detailed cycle-
accurate simulations, we measure processor performance, 
power and reliability characteristics on all design points 
within both training and testing data sets. We build a separate 
model for each program and use the model to predict 
workload dynamics in performance, power and reliability 
domains at unexplored points in the design space. The 
training data set is used to build the wavelet-based neural 
network models. An estimate of the model’s accuracy is 
obtained by using the design points in the testing data set. 
To build a representative design space, one needs to ensure 
that the sample data set disperses points throughout the design 
space while remaining small enough to keep the model 
building cost low. To achieve this goal, we use a variant of 
Latin Hypercube Sampling (LHS) [21] as our sampling 
strategy since it provides better coverage compared to a naive 
random sampling scheme. We generate multiple LHS 
matrices and evaluate each with a space-filling metric called 
the L2-star discrepancy [22]; the matrix with the lowest 
L2-star discrepancy is chosen as the representative design 
space. We use a randomly 
and independently generated set of test data points to 
empirically estimate the predictive accuracy of the resulting 
models. In this paper, we used 200 training points and 50 test 
points for workload dynamic prediction since our study shows 
that this offers a good tradeoff between simulation time and 
prediction accuracy for the design space we considered. In our study, 
each workload dynamic trace is represented by 128 samples. 
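The sampling procedure can be sketched as follows. This is an illustrative sketch only: the paper uses an unspecified variant of LHS, whereas the code below uses the textbook construction in the unit hypercube, scores each candidate with Warnock's closed form of the L2-star discrepancy, and maps unit coordinates onto the discrete training levels of Table 2 by equal-width binning. The binning rule, the function names, and the small sizes in the example (40 points, 5 candidates) are assumptions made here.

```python
import math
import random

# Training-set levels from Table 2 (cache sizes in KB).
LEVELS = {
    "Fetch_width": [2, 4, 8, 16],
    "ROB_size": [96, 128, 160],
    "IQ_size": [32, 64, 96, 128],
    "LSQ_size": [16, 24, 32, 64],
    "L2_size": [256, 1024, 2048, 4096],
    "L2_lat": [8, 12, 14, 16, 20],
    "il1_size": [8, 16, 32, 64],
    "dl1_size": [8, 16, 32, 64],
    "dl1_lat": [1, 2, 3, 4],
}


def latin_hypercube(n, d, rng):
    """n points in [0, 1)^d with exactly one point per stratum per dimension."""
    columns = []
    for _ in range(d):
        strata = list(range(n))
        rng.shuffle(strata)
        columns.append([(s + rng.random()) / n for s in strata])
    return [[col[i] for col in columns] for i in range(n)]


def l2_star_discrepancy(points):
    """Warnock's formula for the L2-star discrepancy (lower is more uniform)."""
    n, d = len(points), len(points[0])
    single = sum(math.prod(1 - x * x for x in p) for p in points)
    pairs = sum(math.prod(1 - max(x, y) for x, y in zip(p, q))
                for p in points for q in points)
    value = 3.0 ** -d - 2.0 ** (1 - d) / n * single + pairs / n ** 2
    return math.sqrt(max(value, 0.0))  # guard against rounding below zero


def to_design_point(u):
    """Map a unit-cube point onto discrete parameter levels (assumed binning)."""
    return {name: lv[min(int(x * len(lv)), len(lv) - 1)]
            for (name, lv), x in zip(LEVELS.items(), u)}


def best_lhs(n, candidates, seed=0):
    """Generate several LHS matrices and keep the one with lowest discrepancy."""
    rng = random.Random(seed)
    designs = [latin_hypercube(n, len(LEVELS), rng) for _ in range(candidates)]
    return min(designs, key=l2_star_discrepancy)


unit_points = best_lhs(n=40, candidates=5)
design = [to_design_point(u) for u in unit_points]
print(design[0])
```

The same routine with `n=200` would produce a training set of the size used in the study; the test points would be drawn separately and at random.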
Table 2. Microarchitectural parameter ranges used for generating train/test data 
Parameter | Ranges (Train) | Ranges (Test) | # of Levels 
Fetch_width | 2, 4, 8, 16 | 2, 8 | 4 
ROB_size | 96, 128, 160 | 128, 160 | 3 
IQ_size | 32, 64, 96, 128 | 32, 64 | 4 
LSQ_size | 16, 24, 32, 64 | 16, 24, 32 | 4 
L2_size | 256, 1024, 2048, 4096 KB | 256, 1024, 4096 KB | 4 
L2_lat | 8, 12, 14, 16, 20 | 8, 12, 14 | 5 
il1_size | 8, 16, 32, 64 KB | 8, 16, 32 KB | 4 
dl1_size | 8, 16, 32, 64 KB | 16, 32, 64 KB | 4 
dl1_lat | 1, 2, 3, 4 | 1, 2, 3 | 4 

Predicting each wavelet coefficient by a separate neural 
network simplifies the learning task. Since complex workload 
dynamics can be captured using a limited number of wavelet 
coefficients, the total size of wavelet neural networks can be 
small. Because small magnitude wavelet 
coefficients contribute less to the reconstructed data, 
we opt to only predict a small set of important wavelet 
coefficients. Specifically, we consider the following two 
schemes for selecting important wavelet coefficients for 
prediction: (1) magnitude-based: select the largest k 
coefficients and approximate the rest with 0 and (2) order-
based: select the first k coefficients and approximate the rest 
with 0.  In this study, we choose to use the magnitude-based 
scheme since it always outperforms the order-based scheme. 
To apply the magnitude-based wavelet coefficient selection 
scheme, it is essential that the significance of the selected 
wavelet coefficients does not change drastically across the 
design space. Figure 7 illustrates the magnitude-based ranking 
(shown as a color map where red indicates high ranks and 
blue indicates low ranks) of a total of 128 wavelet coefficients 
(decomposed from benchmark gcc dynamics) across 50 
different microarchitecture configurations. As can be seen, the 
top ranked wavelet coefficients largely remain consistent 
across different processor configurations. 
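The two selection schemes can be sketched as below, using the coefficients of the eight-point Haar example from Section 2.1 as input. The function names are illustrative and not taken from the paper.

```python
def select_by_magnitude(coeffs, k):
    """Keep the k coefficients with the largest absolute value, zero the rest."""
    ranked = sorted(range(len(coeffs)), key=lambda i: abs(coeffs[i]), reverse=True)
    keep = set(ranked[:k])
    return [c if i in keep else 0.0 for i, c in enumerate(coeffs)]


def select_by_order(coeffs, k):
    """Keep the first k coefficients (coarsest scales), zero the rest."""
    return [c if i < k else 0.0 for i, c in enumerate(coeffs)]


coeffs = [11.875, 1.125, -9.5, -0.75, -0.5, -2.5, 5.0, 8.5]
print(select_by_magnitude(coeffs, 3))  # [11.875, 0.0, -9.5, 0.0, 0.0, 0.0, 0.0, 8.5]
print(select_by_order(coeffs, 3))      # [11.875, 1.125, -9.5, 0.0, 0.0, 0.0, 0.0, 0.0]
```

In this small example the magnitude-based scheme retains the fine-scale detail 8.5, which the order-based scheme discards in favor of the much smaller coefficient 1.125.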
 
 
 
[Figure 7: color map with Wavelet Coefficient Index on the x-axis and Processor Configuration on the y-axis.]
Figure 7. Magnitude-based ranking of 128 wavelet 
coefficients 
4. Evaluation and Result 
In this section, we present detailed experimental results on 
using wavelet neural networks to predict workload dynamics 
in performance, power and reliability domains. The workload 
dynamic prediction accuracy measure is the mean square 
error (MSE) defined as follows: 
\[ MSE = \frac{1}{N} \sum_{k=1}^{N} \left( x(k) - \hat{x}(k) \right)^2, \]
where $x(k)$ is the actual value, $\hat{x}(k)$ is the predicted value 
and $N$ is the total number of samples. As prediction accuracy 
increases, the MSE becomes smaller. 
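The definition translates directly into code. The sketch below computes the plain MSE exactly as defined above; how the percentage figures reported later are normalized is not specified in this section, so no scaling is applied here, and the sample values are made up for illustration.

```python
def mse(actual, predicted):
    """Mean square error between an actual and a predicted trace."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("traces must be non-empty and of equal length")
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


# Made-up three-sample traces: only the last sample differs, by 2.
print(mse([1.0, 2.0, 3.0], [1.0, 2.0, 5.0]))  # (0 + 0 + 4) / 3
```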
The workload dynamics prediction accuracies in 
performance, power and reliability domains are plotted as 
boxplots in Figure 8. Boxplots are graphical displays that 
measure location (median) and dispersion (interquartile 
range), identify possible outliers, and indicate the symmetry 
or skewness of the distribution. The central box shows the 
data between “hinges” which are approximately the first and 
third quartiles of the MSE values. Thus, about 50% of the 
data are located within the box and its height is equal to the 
interquartile range. The horizontal line in the interior of the 
box is located at the median of the data; it shows the center of 
the distribution for the MSE values. The whiskers (the dotted 
lines extending from the top and bottom of the box) extend to 
the extreme values of the data or a distance 1.5 times the 
interquartile range from the median, whichever is less. The 
outliers are marked as circles. In Figure 8, the line with 
diamond shape markers indicates the statistical average of 
MSE across all test cases. 
Figure 8 shows that the performance model achieves 
median errors ranging from 0.5 percent (swim) to 8.6 percent 
(mcf) with an overall median error across all benchmarks of 
2.3 percent. As can be seen, even though the maximum error 
at any design point for any benchmark is 30%, most 
benchmarks show MSE less than 10%. This indicates that our 
proposed neuro-wavelet scheme can forecast the dynamic 
behavior of program performance characteristics with high 
accuracy. Figure 8 shows that power models are slightly less 
accurate with median errors ranging from 1.3 percent (vpr) to 
4.9 percent (crafty) and overall median of 2.6 percent. The 
power prediction has a high maximum error of 35%. The 
errors are much smaller in the reliability domain. 
[Figure 8: boxplots of MSE (%) per benchmark (bzip, crafty, eon, gap, gcc, mcf, parser, perl, swim, twolf, vortex, vpr) in three panels: (a) CPI, with the MSE axis from 0 to 30; (b) Power, from 0 to 35; (c) AVF, from 0 to 3.]
Figure 8. MSE boxplots of workload dynamics prediction in (a) performance, (b) power, and (c) reliability domains 
In general, the workload dynamic prediction accuracy 
increases when more wavelet coefficients are involved. 
However, the complexity of the predictive models is 
proportional to the number of wavelet coefficients. 
Cost-effective models should provide high prediction accuracy 
while maintaining low complexity. Figure 9 shows the trend 
of prediction accuracy (averaged over all 
benchmarks) when various numbers of wavelet coefficients are 
used. As can be seen, for the programs we studied, a set of 
16 wavelet coefficients combines good accuracy 
with low model complexity; increasing the number of wavelet 
coefficients beyond this point reduces the error at a lower rate. 
This is because wavelets provide good time and locality 
characterization capability, and most of the energy is 
captured by a limited set of important wavelet coefficients. 
Using fewer parameters than other methods, the coordinated 
wavelet coefficients provide an interpretation of the series 
structures across scales in the time and frequency domains. 
The capability of a limited set of wavelet coefficients to 
capture workload dynamics varies with resolution level. 
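The energy-compaction argument above can be illustrated with a minimal sketch. It assumes an orthonormal Haar wavelet and a trace whose length is a power of two; the function names (haar_forward, haar_inverse, keep_largest, mse) and the synthetic trace are hypothetical, and this is not the authors' implementation. The sketch keeps only the k largest-magnitude coefficients and measures the reconstruction error.

```python
import math


def haar_forward(signal):
    """Full orthonormal Haar decomposition of a length-2^n sequence."""
    coeffs = list(signal)
    n = len(coeffs)
    assert n > 0 and n & (n - 1) == 0, "length must be a power of two"
    s = math.sqrt(2.0)
    while n > 1:
        half = n // 2
        approx = [(coeffs[2 * i] + coeffs[2 * i + 1]) / s for i in range(half)]
        detail = [(coeffs[2 * i] - coeffs[2 * i + 1]) / s for i in range(half)]
        coeffs[:n] = approx + detail  # approximations first, then details
        n = half
    return coeffs


def haar_inverse(coeffs):
    """Invert haar_forward."""
    out = list(coeffs)
    s = math.sqrt(2.0)
    n = 1
    while n < len(out):
        merged = []
        for a, d in zip(out[:n], out[n:2 * n]):
            merged.append((a + d) / s)
            merged.append((a - d) / s)
        out[:2 * n] = merged
        n *= 2
    return out


def keep_largest(coeffs, k):
    """Zero every coefficient except the k with the largest magnitude."""
    order = sorted(range(len(coeffs)), key=lambda i: abs(coeffs[i]), reverse=True)
    keep = set(order[:k])
    return [c if i in keep else 0.0 for i, c in enumerate(coeffs)]


def mse(a, b):
    """Mean squared error between two equal-length sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


# Synthetic 64-sample trace with one phase change (illustrative data only).
demo_trace = [1.0 + (0.8 if 20 <= i < 40 else 0.0) + 0.01 * (i % 7) for i in range(64)]
demo_coeffs = haar_forward(demo_trace)
for k in (4, 16, 64):
    approx_trace = haar_inverse(keep_largest(demo_coeffs, k))
    print(k, mse(demo_trace, approx_trace))
```

Because the transform is orthonormal, the reconstruction error equals the energy of the discarded coefficients divided by the trace length, so it cannot grow as k increases. How quickly it falls depends on the trace.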
[Figure 9 plot: MSE (%) versus Number of Wavelet Coefficients (16, 32, 64, 96, 128) for CPI, Power and AVF]
Figure 9. The trends of MSE with increased number of 
wavelet coefficients 
Figure 10 illustrates the MSE (averaged over all 
benchmarks) of predictive models that use 16 wavelet 
coefficients when the number of samples varies from 64 to 
1024. As the sampling frequency increases, the same number 
of wavelet coefficients captures workload dynamic behavior 
less accurately. As can be seen, however, the increase in 
MSE is not significant. This suggests that the proposed 
schemes can continue to capture workload dynamic behavior 
as its complexity increases. 
[Figure 10 plot: MSE (%) versus Number of Samples (64, 128, 256, 512, 1024) for CPI, Power and AVF]
Figure 10. MSE trends with increased sampling 
frequency 
As mentioned in Section 2.2, our RBF neural networks 
were built using a regression tree based method. In the 
regression tree algorithm, all input microarchitecture 
parameters were ranked based on either split order or split 
frequency. The microarchitecture parameters that cause the 
most output variation tend to be split earliest and most often 
in the constructed regression tree. Therefore, the 
microarchitecture parameters that largely determine the value 
of a wavelet coefficient are located higher in the regression 
tree than others and have a larger number of splits. We 
present in Figure 11 (shown as star plots) the initial and 
most frequent splits within the regression trees that model 
the most significant wavelet coefficients. A star plot [23] is a 
graphical data analysis method for representing the relative 
behavior of all variables in a multivariate data set. The star 
plot consists of a sequence of equi-angular spokes, called 
radii, with each spoke representing one of the variables. The 
data length of a spoke is proportional to the magnitude of the 
variable for the data point relative to the maximum magnitude 
of the variable across all data points. From the star plot, we 
can obtain information such as: Which variables are dominant 
for a given dataset? Which observations show similar 
behavior? For example, on benchmark gcc, Fetch, dl1 and 
LSQ have significant roles in predicting dynamic behavior in 
the performance domain, while ROB, Fetch and dl1_lat 
largely affect reliability domain workload dynamic behavior. 
For the benchmark gcc, the most frequently involved 
microarchitecture parameters in regression tree construction 
are ROB, LSQ, L2 and L2_lat in the performance domain and 
LSQ and Fetch in the reliability domain. 
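The two rankings described above can be sketched as follows. This is an illustration under stated assumptions, not the authors' tool: the tree is represented as nested (parameter, left, right) tuples with None for leaves, the example tree is made up, and rank_splits and spoke_lengths are hypothetical names. Split order is taken as the shallowest depth at which a parameter appears, and split frequency as the number of nodes that split on it.

```python
from collections import Counter, deque


def rank_splits(tree):
    """Rank parameters by earliest split depth and by number of splits."""
    first_depth = {}
    counts = Counter()
    queue = deque([(tree, 0)])
    while queue:  # breadth-first, so the first visit is the shallowest
        node, depth = queue.popleft()
        if node is None:
            continue
        param, left, right = node
        counts[param] += 1
        first_depth.setdefault(param, depth)
        queue.append((left, depth + 1))
        queue.append((right, depth + 1))
    by_order = sorted(first_depth, key=lambda p: (first_depth[p], p))
    by_frequency = sorted(counts, key=lambda p: (-counts[p], p))
    return by_order, by_frequency


def spoke_lengths(values):
    """Star-plot scaling: each value relative to the maximum across data points."""
    peak = max(values)
    return [v / peak if peak else 0.0 for v in values]


# Made-up tree for illustration; parameter names follow Figure 11.
example_tree = (
    "Fetch",
    ("dl1", ("LSQ", None, None), ("LSQ", None, None)),
    ("LSQ", ("ROB", None, None), None),
)
by_order, by_frequency = rank_splits(example_tree)
print(by_order)
print(by_frequency)
```

Ties are broken alphabetically here only to make the output deterministic; the paper does not state a tie-breaking rule.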
Compared with models that only predict aggregated 
workload behavior, our proposed methods can forecast 
workload runtime execution scenarios. This feature is 
essential if the predictive models are employed to trigger 
runtime dynamic management mechanisms for power and 
reliability optimizations. Inadequate workload worst-case 
scenario predictions could make microprocessors fail to meet 
the desired power and reliability targets. Conversely, 
false alarms caused by over-prediction of the worst-case 
scenarios can trigger responses too frequently, resulting in 
significant overhead. In this work, we study the suitability of 
using the proposed schemes for workload execution scenario 
based classification. Specifically, for a given workload 
characteristics threshold, we calculate how many sampling 
points in a trace that represents workload dynamics are above 
or below the threshold. We then apply the same calculation to 
the predicted workload dynamics trace. We use the directional 
symmetry (DS) metric, i.e., the percentage of correctly 
predicted directions with respect to the target variable, 
defined as: 
$$ DS = \frac{1}{N} \sum_{k=1}^{N} \varphi\bigl(x(k) - \hat{x}(k)\bigr), $$
where $\varphi(\cdot) = 1$ if $x$ and $\hat{x}$ are both above or both 
below the threshold and $\varphi(\cdot) = 0$ otherwise. Thus, the DS 
provides a measure of the number of times the sign of the 
target is correctly forecasted. In other words, DS=50% 
implies that the predicted direction was correct for half of 
the predictions. In this work, we set three threshold levels 
(named 1Q, 2Q and 3Q in Figure 12) between the max and min 
values of each trace, where 1Q is the lowest threshold level 
and 3Q is the highest threshold level.  
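A minimal sketch of the DS computation follows, using the three threshold levels of Figure 12. The function names are hypothetical. Two details are assumptions not fixed by the text: the thresholds are derived from the simulated trace, and a sample exactly equal to the threshold counts as not above it. The traces are made-up values.

```python
def quartile_thresholds(trace):
    """Return [1Q, 2Q, 3Q] = MIN + (MAX - MIN) * q / 4 for q = 1, 2, 3."""
    lo, hi = min(trace), max(trace)
    return [lo + (hi - lo) * q / 4 for q in (1, 2, 3)]


def directional_symmetry(actual, predicted, threshold):
    """Fraction of samples where both traces fall on the same side of the threshold."""
    assert len(actual) == len(predicted) and len(actual) > 0
    hits = sum((x > threshold) == (p > threshold) for x, p in zip(actual, predicted))
    return hits / len(actual)


# Illustrative traces only, not data from the paper.
simulated = [0.0, 1.0, 2.0, 3.0, 4.0]
predicted = [0.2, 0.8, 2.6, 2.9, 3.7]
for name, level in zip(("1Q", "2Q", "3Q"), quartile_thresholds(simulated)):
    ds = directional_symmetry(simulated, predicted, level)
    print(name, "DS =", ds, "directional asymmetry =", 1 - ds)
```

The directional asymmetry reported in Figure 13 corresponds to 1 - DS for each threshold level.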
Figure 13 shows the results of threshold-based workload 
dynamic behavior classification. The results are presented as 
directional asymmetry, which can be expressed as (1-DS). As 
can be seen, our wavelet-based RBF neural networks not only 
capture workload dynamics but also accurately classify 
workload execution into different scenarios. This suggests 
that proactive dynamic power and reliability management 
schemes can be built using the proposed models. For 
instance, given a power/reliability threshold, our wavelet 
RBF neural networks can be used to forecast workload 
execution scenarios. If the predicted workload characteristics 
exceed the threshold level, processors can start to respond 
before power/reliability reaches or surpasses the threshold 
level. Figure 14 further illustrates detailed workload 
execution scenario predictions on benchmark bzip2. Both 
simulation and prediction results are shown. The predicted 
results closely track the varied program dynamic behavior in 
different domains. 
[Figure 12 diagram: a trace between MIN and MAX with three threshold levels 1Q, 2Q and 3Q]
1Q = MIN + (MAX-MIN)*(1/4)
2Q = MIN + (MAX-MIN)*(2/4)
3Q = MIN + (MAX-MIN)*(3/4)
Figure 12. Threshold-based workload execution 
scenario classification 
[Figure 13 plots: Directional Asymmetry (%), on a 0 to 10 scale, per benchmark (bzip, crafty, eon, gap, gcc, mcf, parser, perlbmk, swim, twolf, vortex, vpr) at thresholds 1Q, 2Q and 3Q; one panel each for CPI, Power and AVF]
Figure 13. Threshold-based workload execution 
scenario prediction (1-DS) 
[Figure 11 plots: star plots for each benchmark (bzip, crafty, eon, gap, gcc, mcf, parser, perlbmk, swim, twolf, vortex, vpr) with spokes Fetch, ROB, IQ, LSQ, L2, L2_lat, il1, dl1 and dl1_lat; columns CPI, Power and AVF; rows (a) By Split Order and (b) By Split Frequency]
Figure 11. Star plots that show the roles of microarchitecture design parameters in predicting workload dynamics across 
different domains. Figure 11 (a) shows the regression tree split order based results and Figure 11 (b) shows the regression 
tree split number based results. 
5. Workload Dynamics Driven Architecture 
Design Space Exploration: A Case of Soft Error 
Vulnerability Management 
In this section, we present a case study to demonstrate the 
benefit of applying workload dynamics prediction in early 
architecture design space exploration. Specifically, we show 
that workload dynamics prediction models can effectively 
forecast the worst-case operating conditions for soft error 
vulnerability and accurately estimate the efficiency of soft 
error vulnerability management schemes. 
Because of technology scaling, radiation-induced soft 
errors contribute more and more to the failure rate of CMOS 
devices. Therefore, soft error rate is an important reliability 
issue in deep-submicron microprocessor design. Processor 
microarchitecture soft error vulnerability exhibits significant 
runtime variation, and it is neither economical nor practical 
to design fault tolerant schemes that target the worst-case 
operating condition. Dynamic Vulnerability Management 
(DVM) refers to a set of strategies to keep hardware 
runtime soft-error susceptibility under a tolerable threshold. 
DVM allows designers to achieve higher dependability on 
hardware designed for a lower reliability setting. A primary 
goal of DVM is to keep vulnerability within a pre-defined 
reliability target during the entire program execution. If a 
particular execution period exceeds the pre-defined 
vulnerability threshold, a DVM response (see Figure 15) is 
triggered to reduce hardware vulnerability. Depending on the 
type of response chosen, there may be some performance 
degradation. A DVM response can be turned off as soon as 
the vulnerability drops below the threshold. To successfully 
achieve the desired reliability target and effectively mitigate 
the overhead of DVM, architects need techniques to quickly 
infer application worst-case operating conditions across 
design alternatives and accurately estimate the efficiency of 
DVM schemes at early design stages. 
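The trigger and release behavior described above can be sketched on a sampled vulnerability trace. This is an illustration only: dvm_engaged_intervals is a hypothetical helper, the trace values are made up, and treating a sample exactly at the threshold as "not exceeding" is an assumption.

```python
def dvm_engaged_intervals(trace, threshold):
    """Return half-open (start, end) sample ranges during which DVM is engaged.

    DVM engages when a sample exceeds the threshold and disengages at the
    first later sample that is no longer above it.
    """
    intervals = []
    start = None
    for i, value in enumerate(trace):
        if value > threshold and start is None:
            start = i
        elif value <= threshold and start is not None:
            intervals.append((start, i))
            start = None
    if start is not None:  # still engaged at the end of the trace
        intervals.append((start, len(trace)))
    return intervals


# Illustrative vulnerability trace and threshold.
vulnerability = [0.1, 0.4, 0.5, 0.2, 0.35, 0.1]
engaged = dvm_engaged_intervals(vulnerability, 0.3)
engaged_fraction = sum(end - start for start, end in engaged) / len(vulnerability)
print(engaged, engaged_fraction)
```

The engaged fraction gives a rough indication of how often a response, and therefore its performance overhead, would be active for a given threshold.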
[Figure 15 diagram: Vulnerability versus Time, marking the designed-for reliability capacity without DVM, the designed-for reliability capacity with DVM, the DVM trigger level, the points where DVM is engaged and disengaged, and the DVM performance overhead]
Figure 15. Dynamic Vulnerability Management 
We developed a DVM scheme to manage runtime 
instruction queue (IQ) vulnerability to soft errors. Figure 16 
shows the pseudo code of our DVM policy. The DVM 
scheme computes the online IQ AVF to estimate runtime 
microarchitecture vulnerability. The estimated AVF is 
compared against a trigger threshold to determine whether it 
is necessary to enable a response mechanism. To reduce IQ 
soft error vulnerability, we throttle instruction dispatching 
from the ROB to the IQ upon an L2 cache miss. Additionally, 
we sample the IQ AVF at a finer granularity and compare the 
sampled AVF with the trigger threshold. If the IQ AVF 
exceeds the trigger threshold, a parameter wq_ratio, which 
specifies the ratio of the number of waiting instructions to 
that of ready instructions in the IQ, is updated. The purpose 
of setting this parameter is to maintain performance by 
allowing an appropriate fraction of waiting instructions in the 
IQ to exploit ILP. By maintaining a desired ratio between the 
waiting instructions and the ready instructions, vulnerability 
can be reduced at negligible performance cost. The wq_ratio 
update is triggered by the estimated IQ AVF. In our DVM 
design, wq_ratio is adapted through slow increases and rapid 
decreases in order to ensure a quick response to a 
vulnerability emergency. 
[Figure 17 plots: IQ_AVF (0 to 0.8) versus Samples (0 to 140), showing Simulation, Prediction and DVM Target curves with DVM disabled and DVM enabled; panels (a) Scenario 1: DVM successfully achieves its goal w/ microarchitecture configuration A and (b) Scenario 2: DVM fails to achieve its goal w/ microarchitecture configuration B]
Figure 17. Workload dynamic prediction can be used to efficiently explore workload scenario-based architecture 
optimization. Above we see that the predictive models can accurately forecast whether the IQ DVM policy can achieve 
its goal when the underlying microarchitecture configuration changes. 
DVM_IQ
{
    ACE bits counter updating();
    if current context has L2 cache misses
    then stall dispatching instructions for current context;
    every (sample_interval/5) cycles
    {
        if online IQ_AVF > trigger threshold
        then wq_ratio = wq_ratio/2;
        else wq_ratio = wq_ratio+1;
    }
    if (ratio of waiting instruction # to ready instruction # > wq_ratio)
    then stall dispatching instructions;
}

Figure 16. IQ DVM Pseudo Code 
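The policy in Figure 16 can be sketched in Python as shown below. This is a simplified illustration rather than the simulator code: the class name IQDvmPolicy, the initial wq_ratio of 4.0, and the handling of the case with zero ready instructions are assumptions, and the ACE bit counting that produces the online IQ AVF is left outside the sketch.

```python
from dataclasses import dataclass


@dataclass
class IQDvmPolicy:
    trigger_threshold: float
    wq_ratio: float = 4.0  # hypothetical initial value, not given in the paper

    def update(self, online_iq_avf):
        """Call once every sample_interval/5 cycles with the sampled IQ AVF."""
        if online_iq_avf > self.trigger_threshold:
            self.wq_ratio /= 2  # rapid decrease on a vulnerability emergency
        else:
            self.wq_ratio += 1  # slow increase otherwise
        return self.wq_ratio

    def should_stall(self, waiting, ready, l2_miss_pending):
        """Decide whether to stall dispatch for the current context."""
        if l2_miss_pending:
            return True
        if ready == 0:
            # Assumption: with no ready instructions the ratio is treated
            # as unbounded whenever any instruction is waiting.
            return waiting > 0
        return waiting / ready > self.wq_ratio


policy = IQDvmPolicy(trigger_threshold=0.3)
for sampled_avf in (0.5, 0.45, 0.2, 0.1):  # illustrative samples
    print(sampled_avf, policy.update(sampled_avf))
```

Halving on a violation and adding one otherwise gives the slow-increase, rapid-decrease adaptation described in the text.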
We built workload dynamics predictive models that 
incorporate DVM as a new design parameter. Therefore, our 
models can predict workload execution scenarios with and 
without the DVM feature across different microarchitecture 
configurations. Figure 17 shows the results of using the 
predictive models to forecast IQ AVF on benchmark gcc 
across two microarchitecture configurations. We set the 
DVM target to 0.3, which means the DVM policy, when 
enabled, should maintain the IQ AVF below 0.3 during 
workload execution. In both cases, the IQ AVF dynamics 
were predicted with DVM disabled and enabled. As can 
be seen, in scenario 1, the DVM successfully achieves its 
goal. In scenario 2, despite enabling the DVM feature, the IQ 
AVF of certain execution periods is still above the threshold. 
This implies that the developed DVM mechanism is suitable 
for the microarchitecture configuration used in scenario 1. On 
the other hand, architects have to choose another DVM policy 
if the microarchitecture configuration shown in scenario 2 is 
chosen in their design. Figure 17 shows that in all cases, the 
predictive models can accurately forecast the trends in IQ 
AVF dynamics due to architecture optimizations. 
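The check described above, whether a DVM-enabled trace stays below the target and whether prediction and simulation reach the same verdict, can be sketched as follows. The helper names and all trace values are hypothetical and serve only to illustrate the comparison; they are not results from the paper.

```python
def achieves_target(trace, target):
    """True if every sample of the trace stays below the DVM target."""
    return all(value < target for value in trace)


def worst_overshoot(trace, target):
    """Largest amount by which the trace exceeds the target (0.0 if never)."""
    return max(0.0, max(trace) - target)


def verdicts_agree(simulated, predicted, target):
    """True if simulation and prediction agree on whether the target is met."""
    return achieves_target(simulated, target) == achieves_target(predicted, target)


DVM_TARGET = 0.3  # target value used in the text

# Made-up DVM-enabled IQ AVF traces for two hypothetical configurations.
config_a_sim = [0.21, 0.25, 0.28, 0.24]
config_a_pred = [0.22, 0.24, 0.27, 0.25]
config_b_sim = [0.26, 0.38, 0.41, 0.29]
config_b_pred = [0.27, 0.36, 0.43, 0.28]

for label, sim, pred in (("A", config_a_sim, config_a_pred),
                         ("B", config_b_sim, config_b_pred)):
    print(label,
          achieves_target(pred, DVM_TARGET),
          verdicts_agree(sim, pred, DVM_TARGET),
          worst_overshoot(pred, DVM_TARGET))
```

A configuration whose predicted trace fails the check would call for a different DVM policy, as discussed for scenario 2.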
Figure 18 (a) shows the prediction accuracy of IQ AVF 
dynamics when the DVM policy is enabled. The results are 
shown for all 50 microarchitecture configurations in our test 
dataset. Since deploying the DVM policy will also affect 
runtime processor power behavior, we further build models to 
forecast processor power dynamic behavior due to the DVM. 
The results are shown in Figure 18 (b). The data is presented 
as a heat plot, which maps the actual data values into a color 
scale, with a dendrogram added to the top. A dendrogram 
consists of many U-shaped lines connecting objects in a 
hierarchical tree. The height of each U represents the distance 
between the two objects (benchmarks in our study) being 
connected. For a given benchmark, a vertical trace line shows 
the scaled MSE values across all test cases. Figure 18 (a) 
shows that the predictive models yield high prediction 
accuracy across all test cases on benchmarks swim, eon and 
vpr. Prediction accuracy varies more across test cases on 
benchmarks gcc, crafty and vortex. In the power domain, 
prediction accuracy is more uniform across benchmarks and 
microarchitecture configurations. In Figure 19, we show the 
IQ AVF MSE when different DVM thresholds are set. The 
results suggest that our predictive models work well when 
different DVM targets are considered. 
[Figure 19 plot: IQ AVF MSE (%), on a 0 to 0.5 scale, per benchmark (bzip, crafty, eon, gap, gcc, mcf, parser, perlbmk, swim, twolf, vortex, vpr) for DVM Threshold = 0.2, 0.3 and 0.5]
Figure 19. IQ AVF dynamics prediction accuracy 
across different DVM thresholds  
[Figure 18 plots: heat plots with dendrograms over benchmarks (columns) and test cases 1 to 50 (rows), each with a color key and histogram; panel (a) IQ AVF, color scale 0 to 0.5, column order parser, bzip, twolf, gap, perlbmk, swim, eon, vpr, mcf, gcc, crafty, vortex; panel (b) Power, color scale 0 to 25, column order twolf, vpr, swim, bzip, eon, vortex, crafty, gap, parser, perlbmk, mcf, gcc]
(a) IQ AVF  (b) Power 
Figure 18. The heat plot that shows the MSE of IQ AVF and processor power when DVM policy is enabled across all 
50 test cases on each benchmark 
 
6. Related Work 
There have been several attempts to use analytic models to 
understand microprocessor performance and power domain 
behavior. In [24] foldover Plackett-Burman experimental 
designs were used to obtain a significance ordering of 
microarchitectural parameters. Li et al. [25] presented a 
simple linear model for power consumption by operating 
system services. In [7] Joseph et al. developed linear models 
using D-optimal designs to identify significant parameters 
and their interactions. Lee and Brooks [9, 11] proposed 
regression on cubic splines for predicting the performance 
and power of applications executing on microprocessor 
configurations in a large microarchitectural design space. 
Neural networks have been used in [8, 10, 12] to construct 
predictive models that correlate processor performance 
characteristics with the design parameters. The above studies 
all focus on analyzing and predicting aggregated workload 
behavior while our work aims to model complex workload 
dynamics during microarchitecture design space exploration. 
Prior research has considered a range of phase analysis 
techniques. Sherwood and Calder proposed the use of Basic 
Block Vectors as a metric to capture a program’s phase 
behavior [2]. In [3], program working set changes are used to 
detect phase changes. Isci and Martonosi [5] showed that 
hardware performance counters can be exploited for phase 
classification and prediction. [4] tracks procedure calls via a 
call stack to dynamically identify phase changes. These 
studies, however, do not explicitly reveal how workload time 
varying behavior changes with microarchitecture 
configurations. 
Researchers have successfully applied wavelet techniques 
in many fields, including image and video compression, 
financial data analysis, and various fields in computer science 
and engineering [26, 27]. In [28], Joseph and Martonosi used 
wavelets to analyze and predict the change of processor 
voltage over time. In an earlier work [29], they used Fourier 
analysis to characterize the power behavior of programs. 
Recently, using wavelets to assist program phase analysis has 
started to gain popularity. In [30], wavelets were explored to 
analyze the phase behavior of memory bus accesses on 
commercial workloads. In [31], wavelets were used to 
improve accuracy, scalability, and robustness in program 
phase analysis. In [6], the multiresolution analysis capability 
of wavelets was exploited to analyze phase complexity. These 
studies, however, made no attempt to link program wavelet 
domain behavior to microarchitecture design parameters. In 
[32], Shen and Ding used wavelets as a filter to remove the 
gradual changes in a reuse-distance trace to identify locality 
phases in programs. The wavelet-based filtering is used to 
accurately determine the best place for phase markers, but is 
not used as a method for quantifying program dynamics. 
7. Conclusions 
Processor microarchitectures are evolving into 
increasingly complex systems, making accurate predictions of 
workload characteristics a challenging task. In the past, 
various approaches have been proposed to cost-effectively 
forecast program behavior. Existing analytical models, based 
on several simplifying assumptions, only capture aggregated 
workload statistics. As a result, these models lack the 
capability of tracking complex workload behavior in a 
processor design cycle. This motivates us to explore 
techniques that can quickly and accurately analyze program 
dynamics at the architecture design space exploration stage. 
Workload dynamics are complex phenomena since they 
typically contain a mixture of behavior localized in time and 
frequency. Applying wavelet analysis, our method can 
capture workload statistics across time and scale using a 
limited set of parameters. We show that these parameters can 
be cost-effectively predicted using non-linear modeling 
techniques such as neural networks. To our knowledge, the 
model we proposed is the first one that can track complex 
program dynamic behavior across different microarchitecture 
configurations. In this paper, we further examined the use of 
the proposed models to effectively explore workload 
scenario-directed architecture optimizations. We believe our 
workload dynamics forecasting techniques will allow 
architects to quickly evaluate a rich set of architecture 
optimizations that target workload dynamics at the early 
microarchitecture design stage. 
Acknowledgement 
This research is partially supported by NSF grant CSR-
0720476, Microsoft Research Trustworthy Computing Award 
14707 and an IBM Faculty Award. 
References 
[1] D. Brooks and M. Martonosi, Dynamic Thermal Management 
for High-Performance Microprocessors, HPCA, 2001. 
[2] T. Sherwood, E. Perelman, G. Hamerly and B. Calder, 
Automatically Characterizing Large Scale Program Behavior, 
ASPLOS, 2002. 
[3] A. Dhodapkar and J. Smith, Managing Multi-Configurable 
Hardware via Dynamic Working Set Analysis, ISCA, 2002. 
[4] W. Liu and M. Huang, EXPERT: Expedited Simulation 
Exploiting Program Behavior Repetition, ICS, 2004. 
[5] C. Isci and M. Martonosi, Runtime Power Monitoring in 
High-End Processors: Methodology and Empirical Data, MICRO, 
2003. 
[6] C. B. Cho and T. Li, Complexity-based Program Phase 
Analysis and Classification, PACT, 2006. 
[7] P. J. Joseph, K. Vaswani and M. J. Thazhuthaveetil, 
Construction and Use of Linear Regression Models for Processor 
Performance Analysis, HPCA, 2006. 
[8] P. J. Joseph, K. Vaswani and M. J. Thazhuthaveetil, A 
Predictive Performance Model for Superscalar Processors, 
MICRO, 2006. 
[9] B. Lee and D. Brooks, Accurate and Efficient Regression 
Modeling for Microarchitectural Performance and Power 
Prediction, ASPLOS, 2006. 
[10] E. Ipek, S. A. McKee, B. R. Supinski, M. Schulz and R. 
Caruana, Efficiently Exploring Architectural Design Spaces via 
Predictive Modeling, ASPLOS, 2006. 
[11] B. Lee and D. Brooks, Illustrative Design Space Studies 
with Microarchitectural Regression Models, HPCA, 2007. 
[12] R. M. Yoo, H. Lee, K. Chow and H. H. S. Lee, Constructing 
a Non-Linear Model with Neural Networks For Workload 
Characterization, IISWC, 2006. 
[13] I. Daubechies, Ten Lectures on Wavelets, Capital City Press, 
Montpelier, Vermont, 1992. 
[14] S. Haykin, Neural Networks: A Comprehensive Foundation, 
Prentice Hall, ISBN 0-13-273350-1, 1999. 
[15] I. Daubechies, Orthonormal Bases of Compactly Supported 
Wavelets, Communications on Pure and Applied Mathematics, 
vol. 41, pages 906-966, 1988. 
[16] M. Orr, K. Takezawa, A. Murray, S. Ninomiya and T. 
Leonard, Combining Regression Trees and Radial Basis Function 
Networks, International Journal of Neural Systems, 2000. 
[17] Simplescalar, http://www.simplescalar.com/ 
[18] D. Brooks, V. Tiwari, and M. Martonosi, Wattch: A 
Framework for Architectural-Level Power Analysis and 
Optimizations, ISCA, 2000. 
[19] S. S. Mukherjee, C. Weaver, J. Emer, S. K. Reinhardt, and 
T. Austin, A Systematic Methodology to Compute the 
Architectural Vulnerability Factors for a High-Performance 
Microprocessor, MICRO, 2003. 
[20] A. Biswas, R. Cheveresan, J. Emer, S. S. Mukherjee, P. B. 
Racunas and R. Rangan, Computing Architectural Vulnerability 
Factors for Address-Based Structures, ISCA, 2005. 
[21] J. Cheng, M. J. Druzdzel, Latin Hypercube Sampling in 
Bayesian Networks, FLAIRS, 2000. 
[22] B. Vandewoestyne, R. Cools, Good Permutations for 
Deterministic Scrambled Halton Sequences in terms of L2-
discrepancy, Journal of Computational and Applied Mathematics 
Vol 189, Issues 1-2, 2006. 
[23] J. Chambers, W. Cleveland, B. Kleiner and P. Tukey, 
Graphical Methods for Data Analysis, Wadsworth, 1983  
[24] J. J. Yi, D. J. Lilja and D. M. Hawkins, A Statistically 
Rigorous Approach for Improving Simulation Methodology, 
HPCA, 2003. 
[25] T. Li and L. K. John, Run-time Modeling and Estimation of 
Operating System Power Consumption, SIGMETRICS, 2003. 
[26] S. Mallat, Multifrequency Channel Decompositions of 
Images and Wavelet Models, IEEE Transactions on Acoustics, 
Speech, and Signal Processing, vol. 37, pages 2091-2110, 1989. 
[27] A. Feldmann, A. C. Gilbert, W. Willinger, and T. G. Kurtz, 
The Changing Nature of Network Traffic: Scaling Phenomena, 
ACM Computer Communication Review, vol. 28, page 5-29, 
Apr. 1998. 
[28] R. Joseph, Z. G. Hu, and M. Martonosi, Wavelet Analysis 
for Microprocessor Design: Experiences with Wavelet-Based 
dI/dt Characterization, HPCA, 2004. 
[29] R. Joseph, M. Martonosi and Z. G. Hu, Spectral Analysis for 
Characterizing Program Power and Performance, ISPASS, 2004. 
[30] T. Huffmire and T. Sherwood, Wavelet-Based Phase 
Classification, PACT, 2006. 
[31] C. B. Cho and T. Li, Using Wavelet Domain Workload 
Execution Characteristics to Improve Accuracy, Scalability and 
Robustness in Program Phase Analysis, ISPASS, 2007. 
[32] X. Shen, Y. Zhong and C. Ding, Locality Phase Prediction, 
ASPLOS, 2004. 