Exploiting performance counters to predict and improve energy performance of HPC systems by Tsafack Chetsa, Ghislain Landry et al.
This is an author-deposited version published in: http://oatao.univ-toulouse.fr/
Eprints ID: 12657
To link to this article: DOI: 10.1016/j.future.2013.07.010
URL: http://dx.doi.org/10.1016/j.future.2013.07.010
To cite this version: Tsafack Chetsa, Ghislain Landry and Lefevre,
Laurent and Pierson, Jean-Marc and Stolf, Patricia and Da Costa,
Georges. Exploiting performance counters to predict and improve
energy performance of HPC systems. (2014) Future Generation
Computer Systems, vol. 36. pp. 287-298. ISSN 0167-739X
Exploiting performance counters to predict and improve energy
performance of HPC systems
G.L. Tsafack Chetsa a,b,∗, L. Lefèvre a, J.M. Pierson b, P. Stolf b, G. Da Costa b
a INRIA - Avalon, Ecole Normale Supérieure de Lyon, University of Lyon, France
b IRIT, University of Toulouse, France
Highlights
• We present two generic approaches for tracking high performance computing systems’ behaviour.
• We demonstrate that energy performance can be improved without a priori knowledge of application.
• We demonstrate that HPC systems can benefit from more than CPU frequency scaling.
• We introduce an approach to estimating the energy consumption of HPC applications.
Abstract
Hardware monitoring through performance counters is available on almost all modern processors. Al-
though these counters are originally designed for performance tuning, they have also been used for eval-
uating power consumption. We propose two approaches for modelling and understanding the behaviour
of high performance computing (HPC) systems relying on hardware monitoring counters. We evaluate
the effectiveness of our system modelling approach considering both optimizing the energy usage of HPC
systems and predicting HPC applications’ energy consumption as target objectives. Although hardware
monitoring counters are used for modelling the system, other methods – including partial phase recognition
and cross platform energy prediction – are used for energy optimization and prediction. Experimental
results for energy prediction demonstrate that we can accurately predict the peak energy consumption
of an application on a target platform; whereas, results for energy optimization indicate that with no a
priori knowledge of workloads sharing the platform we can save up to 24% of the overall HPC system’s
energy consumption under benchmarks and real-life workloads.
1. Introduction
The increasing need for performance for the many computa-
tional problems in science and engineering has taken high per-
formance computing (HPC) systems to the set of indispensable
tools in modern industries and scientific research. HPC systems
deliver tremendous raw performance for solving real-life problems.
These problems include: turbulence, combustion, genomics, as-
trophysics, geosciences, molecular dynamics, homeland security,
imaging and biomedicine, etc. HPC systems consume a large
amount of electrical power, almost all of which is converted into heat
that must be removed by cooling. For example, Tianhe-1A consumes 4.04 MW of
∗ Corresponding author at: INRIA - Avalon, Ecole Normale Supérieure de Lyon,
University of Lyon, France. Tel.: +33 670403550.
E-mail addresses: ghislandry@gmail.com,
ghislain.landry.tsafack.chetsa@ens-lyon.fr (G.L. Tsafack Chetsa).
electricity [1]; a simple calculation at $0.10/kWh yields a powering
and cooling cost of about $3.5 million per year, which is significant.
Beyond the operating cost, some data-centres are also being
limited by the peak power that electric facilities can provide. Con-
sequently, it is necessary to reduce their power consumption; how-
ever, this must not result in significant performance degradation.
What counts as significant performance degradation is relative;
nevertheless, up to 10% performance degradation is often acceptable.
Note: throughout this paper, (i) an HPC system is considered
as a set of computing and storage nodes, excluding network equipment
such as routers and switches; whereas (ii) the term ‘‘system’’
designates a single node of the HPC system.
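The yearly cost quoted above can be checked with a few lines of arithmetic. The sketch below assumes a constant 4.04 MW draw over a 365-day year; the function name and that constant-draw assumption are illustrative and not taken from [1].

```python
HOURS_PER_YEAR = 24 * 365  # 8760 h, ignoring leap years (assumption)


def annual_energy_cost(power_mw: float, price_per_kwh: float) -> float:
    """Yearly electricity cost in dollars for a constant power draw."""
    power_kw = power_mw * 1000.0
    energy_kwh = power_kw * HOURS_PER_YEAR
    return energy_kwh * price_per_kwh


# Tianhe-1A figures from the text: 4.04 MW at $0.10/kWh.
cost = annual_energy_cost(4.04, 0.10)
print(f"${cost / 1e6:.2f} million per year")
```

Under these assumptions the result is roughly $3.54 million, in line with the figure of about $3.5 million given in the text.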
The importance of power efficiency in HPC systems has at-
tracted enormous attention from both the industrial and the
research communities. This is evidenced by the multitude of tech-
niques aiming to understand and reduce the energy consumption
of HPC systems. These techniques generally break into hardware
and software approaches. Hardware approaches, led by manufacturers,
focus on designing power-aware hardware, while their software
counterparts focus on designing protocols and/or services
capable of adapting HPC subsystems (including the processor, memory,
communication, and storage subsystems) to meet applications’
requirements.
Embedded hardware event counters of modern micropro-
cessors, or simply hardware performance/monitoring counters,
monitor the occurrence of hardware events in a microproces-
sor with almost no performance penalty. These counters in-
clude: the number of cycles, instructions, cache references and
hits/misses, main memory writebacks and references, and branch
misprediction counts. Although originally designed for performance
tuning/adaptation, hardware performance counters have
been used in the past for estimating the power usage (or the energy
consumption) of both HPC and desktop applications.
Performance tuning mechanisms rely on hardware performance
counters along with application-specific metrics (Message
Passing Interface (MPI) calls in an MPI program, for example) to
adapt the system accordingly. This can be accomplished by inserting
specific code segments in the source program (often referred to
as code instrumentation) or by tracking application-specific routine
calls at runtime.
Adapting the processor’s frequency to meet workloads’ de-
mands is the commonly used hardware adaptation approach for
reducing energy consumption. This is eased by the use of Dynamic
Voltage and Frequency Scaling (DVFS) [2,3] technology available
on modern processors. A fundamental requirement for such adap-
tation is defining how DVFS control should be performed. In other
words, the application designer must decide when and for how
long the processor should be kept in a given performance state
(P-state). In the performance optimization jargon, a specific part of
the program which is defined to run at a given performance state
is referred to as a region or phase. Although these approaches for
adapting HPC systems are efficient, they may fail for two main
reasons.
1. HPC systems are often shared (the whole infrastructure is not
dedicated to a single workload) by multiple workloads each
having its own characteristics, in which case optimizing the
energy performance considering some applications is likely to
impact the performance of others.
2. Although HPC codes are actively maintained, their increasing
complexity makes code instrumentation impractical, and
instrumentation sometimes requires extensive knowledge. A platform
provider would have to either dedicate engineers to ‘‘code
instrumentation’’ or ask programmers to write programs with these
constraints in mind, which is both unacceptable and unlikely
to happen.
An effective way to overcome these limitations is to optimize
energy performance of the HPC system from the infrastructure
stand point. This implies understanding the behaviour of the HPC
system rather than that of individual applications sharing the
platform.
In this paper, we introduce two complementary general pur-
pose approaches for modelling and understanding the runtime be-
haviours of HPC systems. These approaches rely upon hardware
performance counters and break into on-line and off-line depend-
ing on their use. Information gathered from the off-line approach
can be of benefit to its on-line counterpart, which makes them
complementary. The off-line approach which we refer to as ‘‘DNA-
like description’’ of the system attempts to depict the HPC system
as a graph in which each state describes its behaviour (that of the
HPC system) over a time interval [4].
The on-line approach, which we refer to as ‘‘Execution vectors
based system behaviour tracking’’, detects and characterizes the
behaviour of the HPC system at runtime [5]. To accomplish
this, phase changes are detected in the execution pattern of
individual nodes in the HPC system; detected phases are characterized
afterwards. Given that HPC workloads generally fall into the compute
intensive, memory intensive, and communication intensive categories,
we define three types of behaviour: compute intensive,
memory intensive, and communication intensive. Communication
intensive behaviour can further be divided into network transmit
and receive.
To demonstrate the effectiveness of HPC system’s behaviour
tracking approaches just mentioned, we explore several use cases.
These use cases show how the energy performance of an application
can be predicted using the DNA-like description on the one
hand, and how the energy performance of an HPC system can be improved
through behavioural characterization on the other hand.
Our work differs from the state of the art in many ways: we ad-
dress the power/energy consumption issue of HPC systems consid-
ering all HPC subsystems (processor, memory, disk and network).
In addition, our system modelling approaches do not rely on any
application specific metric, i.e., our approaches do not require any
a priori knowledge of applications running on the system.
The major contributions of this work are the following:
1. We present two different (on-line and off-line) approaches en-
abling a fine-grained control of HPC systems and a better un-
derstanding of application’s energy performance. Their strength
resides in the fact that they do not need any a priori knowledge
of applications/workloads sharing the HPC infrastructure.
2. We present an approach for optimizing/improving energy per-
formance of HPC systems considering HPC subsystems, includ-
ing the processor, memory, disk and network.
3. We introduce the concepts of cross platform energy prediction
and partial phase recognition. Cross platform energy prediction
can help choose the appropriate execution platform for an application,
whereas partial phase recognition is an alternative to
phase prediction that has the advantage of not being application
or architecture specific.
The remainder of this article is organized as follows. Section 2
gives an account of previous work. Our system modelling approaches
are presented in Section 3. Section 4 presents several use
cases of our modelling approaches. Implementation of the two use
cases along with experimental results is discussed in Section 5. Finally,
Section 6 concludes and gives future directions.
2. Related work
A large body of work investigates the use of hardware performance
counters for modelling the power consumption of applications
ranging from desktop to HPC applications. In this section, we
first detail work using hardware performance counters for estimating
the power or energy consumption of individual applications.
We next present several works addressing HPC systems’ energy
performance and their limitations.
2.1. Power/energy estimation using hardware performance counters
In studies such as [6–12], efforts have been devoted to modelling or
estimating the power usage of individual applications or workloads.
These studies monitor the use of system components (in particular
the processor and memory) during the workload execution
via hardware performance counters and correlate them with the
power consumed by the system when running that workload to
derive a power model. Kadayif et al. [7] propose a model for estimating
the energy consumption of the UltraSPARC CPU [13]. The
authors estimate the UltraSPARC CPU memory energy consumption
considering the following performance counters: Data cache read
hits, Data cache read references, Data cache write hits, Data cache
write references, Instructions cache hits, Instructions cache references,
and Extended cache misses with writebacks. They claim their
energy model to be within 2.4% of circuit level simulation.
Similarly, Contreras et al. [8] present a first order linear
power estimation model that uses hardware performance counters
to estimate run-time CPU and memory power consumption of the
Intel PXA255 [14]. According to the authors, the proposed model
exhibits an average estimation error of 4%. During their analyses,
the authors considered the events listed in the first column of Table 1 for
the CPU and those listed in the second column of that table for
the memory.
Table 1
Performance events selected to estimate CPU and memory power
consumption for Intel PXA255 processor.
CPU performance events       Memory performance events
Instructions executed        Instruction fetch misses
Data dependencies            Data dependencies
Instruction cache misses
TLB misses
The energy consumption of the AMD Phenom high-performance
processor is estimated in [9]. In their work, the authors categorize
AMD Phenom performance counters into four buckets (FP Units,
Memory, Stalls, and Instruction Retired) and consider the performance
events which best express their power consumption. These
performance events include: L2_cache_miss:all, Retire_uops,
Retire_mmx_and_fp_instruction:all, and Dispatch_stalls.
More recently, Da Costa et al. [6] have presented a methodology
for measuring the energy consumption of a single process
application running on a standard PC. They defined a set of per
process and system-wide variables and demonstrated their accuracy
in measuring the energy consumption of a given process using
multivariate regression.
To summarize, the above studies show that performance
counters can accurately estimate the power usage of an application;
however, it is worth mentioning that the accuracy of a
power/energy model depends on the workload. In other words, a
power model designed for estimating the power consumption of a
compute-bound workload may not fit well with a memory-bound
workload. This is particularly evident for communication intensive workloads.
2.2. Energy reduction approaches for HPC systems
In the past few years, HPC systems have witnessed the emergence
of energy consumption reduction techniques from the hardware
level to the software level. At the hardware level, the majority
of Information Technology (IT) equipment vendors work
from the bottom up, by using more efficient components in their
equipment and/or by providing their equipment with technologies
that can be leveraged to reduce the energy consumption of HPC
subsystems (such as processor, network, memory, and I/O) during
their operation. For example, the majority of modern processors
are provided with Dynamic Resource Sleeping (DRS), which
makes components hibernate to save energy and then wakes them
on demand. Although major progress has been made, improvements
in hardware solutions to the energy reduction problem have
been slow, due to the high cost of designing equipment with
energy-saving technologies and the increasing demand for raw
performance. Our work takes advantage of hardware technologies to
reduce HPC systems’ energy consumption.
Unlike hardware approaches, software solutions for reducing
HPC systems’ energy usage have received extensive attention over
time. Rountree et al. [15] use node imbalance to reduce the over-
all energy consumption of a parallel application in an HPC system.
They track successive MPI communication calls to divide the ap-
plication into tasks composed of a communication portion and a
computation portion. A slack occurs when a processor is waiting
for data to arrive during the execution of a task. This leaves the pos-
sibility to slow the processor with almost no impact on the overall
execution time of the application. Rountree et al. developed Adagio,
which tracks task execution slacks and computes the appropriate
frequency at which each task should run. Although the first instance of a
task is always run at the highest frequency, further instances of the
same task are executed at the frequency that was computed when
it was first seen. The authors of [16] propose a tool called Jitter, which
detects slack periods resulting from inter-node imbalance
and uses DVFS to adjust the CPU frequency accordingly.
Our approach differs from that implemented in Adagio in that
our fine-grained data collection offers the possibility to differenti-
ate not only compute-intensive and communication-intensive ex-
ecution portions (these portions are referred to as phases/regions)
but also memory-intensive phases. Memory-intensive phases
can be run on a slower core without significant performance
penalty [17].
Isci et al. [18] and Choi et al. [19] use on-line techniques to
detect applications execution phases, characterize them and set
the appropriate CPU frequency accordingly. They rely on hardware
monitoring counters to compute runtime statistics such as cache
hit/miss ratio, memory access counts, retired instructions counts,
etc. which are then used for phase detection and characterization.
Policies developed in [18,19] tend to be designed for single task
environments. We overcome that limitation by considering each
node of the cluster as a black box, which means that we do not
focus on any applications, but instead on the platform. The flexi-
bility provided by this assumption enables us to track not appli-
cations/workloads execution phases, but node’s execution phases.
Our work also differs from previous works in that we use partial
phase recognition instead of phase prediction; partial phase
recognition is not application specific and does not require multiple
executions of the same application. On-line recognition of
communication phases in MPI applications was investigated by Lim
et al. [20]. Once a communication phase is recognized, the authors
apply CPU DVFS to save energy. They intercept and record the sequence
of MPI calls during program execution and consider a segment of
program code to be reducible if there is a high concentration of MPI
calls or if an MPI call is long enough. The CPU is then set to run at
the appropriate frequency when the reducible region is recognized again.
Our work differs from those above in two major ways. First, our
phase detection approach does not rely on a specific HPC subsystem
or MPI communication calls. Second, unlike previous research
efforts, our model goes beyond the processor, since it also takes
advantage of power saving capabilities available on all HPC subsystems.
For example, slowing down network interfaces using Adaptive
Link Rate, putting disks in low-power mode, and switching off
memory banks are power saving schemes which are not directly
linked to the processor.
3. Understanding HPC systems’ behaviour: on-line and off-line
models
In this section, we present two effective approaches for tracking
HPC systems’ behaviours. The first approach which we refer to as
‘‘DNA-like representation’’ of the HPC system attempts to describe
an HPC system as a state graph in which each state represents its
behaviour over a time interval; whereas the second approach per-
forms on-line detection and characterization of execution phases
of an HPC system at runtime. As mentioned earlier, we assume that
an HPC system is a set of computing and storage nodes and will use
the term ‘‘system’’ to designate a single node of the HPC system.
Network equipment such as routers and switches is not taken
into account because of its nearly constant power consumption.
In the rest of this paper, unless otherwise expressly stated, we use
the term ‘‘sensors’’ to designate performance counters along with
network bytes sent/received counts and disk read/write counts.
Sensors related to hardware performance counters provide insight
into the processor and memory activities. Likewise, disk and net-
work related sensors provide insight into disk and network activi-
ties respectively.
3.1. DNA-like system modelling
The DNA-like system modelling models each node of an HPC
system as a graph whose states represent the execution behaviour
of the system over fixed length time intervals. We made the as-
sumption that initial and final states of the graph are states or con-
figuration in which the system is idle. A transition from a state S1
to a state S2 of the graph is weighted by the conditional probability
that the system goes from S1 to S2. We refer to successive states
through which a system goes throughout its life cycle as its ‘‘DNA-
like’’ structure. Since not all states of the graph have the same be-
haviour, the DNA-like structure of a system can be thought of as
a succession of behaviours through which the system went over
time. From this, we define the terms ‘‘letter’’ and ‘‘system descrip-
tion alphabet’’ as follows: a letter or phase is defined as a behaviour
in the DNA-like structure of a system (a state of the graph mod-
elling that system); whereas, the system description alphabet is
the set of possible behaviours.
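To make the graph view concrete, the following sketch builds transition weights from an observed sequence of letters. It assumes the conditional probabilities are estimated as relative transition frequencies, which the text does not specify; the function name and the sample sequence are hypothetical (the integer letters are borrowed from the example in Section 3.1.2).

```python
from collections import Counter, defaultdict
from typing import Dict, Hashable, Sequence


def transition_probabilities(
    letters: Sequence[Hashable],
) -> Dict[Hashable, Dict[Hashable, float]]:
    """Estimate P(next state | current state) from a sequence of states.

    Assumption: probabilities are relative frequencies of the observed
    transitions between consecutive fixed-length intervals.
    """
    counts: Dict[Hashable, Counter] = defaultdict(Counter)
    for current, following in zip(letters, letters[1:]):
        counts[current][following] += 1
    return {
        state: {nxt: c / sum(out.values()) for nxt, c in out.items()}
        for state, out in counts.items()
    }


# Hypothetical per-interval sequence of states for one node.
sequence = ["Idle", 17694, 17694, 17714, "Idle", 17694, "Idle"]
graph = transition_probabilities(sequence)
for state, edges in graph.items():
    print(state, "->", edges)
```

Each key of the returned dictionary is a state of the graph, and each inner dictionary holds the weights of its outgoing transitions, which sum to one.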
With the description just given, the runtime behaviour of a
system can roughly be represented by a sequence of the form
Li . . . Xj . . . Lk, where the Li are elements of the system description
alphabet. Some states may not appear in the system
description alphabet; those are represented by the Xj notation.
A letter is modelled as a column vector of sensors. Details
regarding their construction are provided later in this paper. The
choice of sensors for representing letters is constrained by the fact
that a letter must provide information about the computational
behaviour of the workload (is it memory intensive, processor
intensive or communication intensive?) and its energy/power
consumption.
3.1.1. Letter modelling and representation
The literature corroborates our observation that the sensors
relevant to a power consumption estimation model depend upon
the workloads/applications being executed (see Section 2 for details).
Considering that a finite set of sensors is used to estimate the
power consumption of a given category of workloads, this suggests
that changes in the set of sensors relevant to power consumption
estimation over a time interval T reflect changes in the system’s
behaviour over that very time interval.
Relying on the above assumption, we propose Algorithm 1,
which partitions the runtime of a system into different behaviours
according to changes in the set of sensors relevant to power estimation.
For the selection of relevant sensors we use the very simple
power model $\mathrm{Power} \sim \sum_{i=1}^{n} \alpha_i C_i$,
in which $\alpha_i$ and $C_i$ are model
coefficients and sensors respectively. We conduct a multivariable
linear regression to obtain the coefficients $\alpha_i$ and retain the sensors $C_j$
that are statistically significant at the 5% level for power consumption
estimation given the previous power model. For the sake of simplicity,
we limit these sensors to four (i.e., 4 sensors for representing
a letter).
Once letters are defined, we use the following formalism for
their encoding. Each sensor is assigned a four-bit aggregation,
or half byte. Our quadruplet (each letter is a vector of four sensors)
is therefore of the form (X1, X2, X3, X4), where each Xi is a
half byte. Deleting the commas between the Xi gives a sixteen-bit
aggregation which, converted into decimal, is an unsigned integer.
The unsigned integer obtained from the above transformation
is then the final representation of a letter.
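The encoding just described can be sketched as follows, assuming each sensor is identified by an integer between 0 and 15 and that X1 occupies the most significant half byte. The function names are hypothetical, and the text does not state which physical sensor corresponds to which half-byte value.

```python
from typing import Sequence, Tuple


def encode_letter(sensor_ids: Sequence[int]) -> int:
    """Pack four half-byte sensor identifiers into a 16-bit unsigned integer."""
    if len(sensor_ids) != 4:
        raise ValueError("a letter is made of exactly four sensors")
    letter = 0
    for sensor_id in sensor_ids:
        if not 0 <= sensor_id <= 0xF:
            raise ValueError("each sensor identifier must fit in four bits")
        # Shift the bits gathered so far and append the next half byte.
        letter = (letter << 4) | sensor_id
    return letter


def decode_letter(letter: int) -> Tuple[int, ...]:
    """Recover the four half-byte identifiers (X1, X2, X3, X4) of a letter."""
    return tuple((letter >> shift) & 0xF for shift in (12, 8, 4, 0))


print(encode_letter((4, 5, 1, 14)))  # 17694
print(decode_letter(17714))          # (4, 5, 3, 2)
```

Under this reading, the letter 17694 that appears in the example of Section 3.1.2 is the hexadecimal value 0x451E, i.e., the quadruplet (4, 5, 1, 14).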
Data: A: a set of units, where a unit is composed of values of
    sensors collected at a given time; units are sampled on
    a per second basis. Note that they are arranged in their
    order of occurrence in time.
Result: P = {t_i}, where the t_i are points in time at which changes
    in the behaviour of the system were detected.
Initialization: consider a set S made up of k successive units
    along the execution time-line; let us denote by S_upper the
    time at which the last unit of S was sampled. k is chosen such
    that k > p + 1, where p is the number of sensors.
    Add the time at which the first unit of S was sampled to P
    Compute the set R_0 of sensors relevant to power
    consumption estimation using the dataset composed of units
    in S
while units available do
    Add k units to S and update S_upper to S_upper + k
    Compute the set R_t of relevant sensors from S
    if R_{t-1} ≠ R_t then
        Find the point in time j ∈ [S_upper − k, S_upper] such that
        the set of relevant sensors R computed from the set
        whose last unit was sampled at time j is the same as
        R_{t-1}
        Go to Initialization
Algorithm 1: Algorithm to detect application phases.
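A possible reading of Algorithm 1 is sketched below. The sensor selection step is left to a caller-supplied function (`relevant`, a hypothetical name) so that the regression-based selection is not reimplemented here. The sketch assumes that the search for j picks the latest point whose window still yields the previous set of relevant sensors, and that the next window restarts at j; neither detail is spelled out in the listing.

```python
from typing import Callable, FrozenSet, List, Sequence, Tuple

Unit = Tuple[float, ...]  # (timestamp, sensor_0, sensor_1, ...)
RelevantFn = Callable[[Sequence[Unit]], FrozenSet[int]]


def detect_behaviour_changes(
    units: Sequence[Unit], k: int, relevant: RelevantFn
) -> List[float]:
    """Return the timestamps at which a new behaviour starts."""
    change_points: List[float] = []
    start = 0
    while len(units) - start >= k:
        # Initialization: S holds k successive units starting at `start`.
        change_points.append(units[start][0])
        upper = start + k
        previous = relevant(units[start:upper])
        restarted = False
        while upper + k <= len(units):
            upper += k  # add k units to S
            current = relevant(units[start:upper])
            if current != previous:
                # Assumption: take the latest j in [upper - k, upper) whose
                # window still gives the previous set of relevant sensors.
                j = upper - k  # this window is known to give `previous`
                for candidate in range(upper - 1, upper - k, -1):
                    if relevant(units[start:candidate]) == previous:
                        j = candidate
                        break
                start = j  # go to Initialization with a window starting at j
                restarted = True
                break
        if not restarted:
            break
    return change_points


def active_sensors(window: Sequence[Unit]) -> FrozenSet[int]:
    """Toy stand-in for the regression step: sensors with any non-zero value."""
    n_sensors = len(window[0]) - 1
    return frozenset(
        i for i in range(n_sensors) if any(u[i + 1] > 0 for u in window)
    )


# Synthetic trace: sensor 0 active for 8 s, then sensor 1 active for 8 s.
trace = [(float(t), 1.0, 0.0) for t in range(8)]
trace += [(float(t), 0.0, 1.0) for t in range(8, 16)]
print(detect_behaviour_changes(trace, k=4, relevant=active_sensors))
```

The toy `active_sensors` function only illustrates the control flow; in the approach described above, the set of relevant sensors would come from the significance test on the regression coefficients.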
3.1.2. Example
This example investigates how close to reality our approach
for partitioning a system’s runtime into different computational
behaviours (or simply phases) is. To accomplish this, we successively
run two applications, IS and EP, from the NAS Parallel Benchmark
(NPB) suite [21]. These applications are opposite from a computational
standpoint in the sense that IS is communication intensive
whereas EP is mainly compute intensive. Data collected during their
execution are used as input to the algorithm. Fig. 1(a) and (b) (where
the curve gives the variation of the power consumption of the system
over time and rectangles delimit detected phases) indicate
that using a simple power model it is possible to detect the phases
the system went through. The pair of integers appearing in each
rectangle on the figure gives, for each phase, the corresponding letter
coded as an unsigned integer and the amount of time the system
spent in that particular phase. For this example, the DNA-like
structure of each node is straightforward; for the first node, it is:
Idle(17694 19)(17714 43)Idle.
3.2. Execution vectors based system behaviour tracking
It can be seen from Fig. 1(c) and (d), where the y-axis represents
the access rate of performance counters and the x-axis the
execution time-line, that the access pattern of hardware performance
counters, or sensors in general, strongly reflects changes in
the behaviour of the application/system. We speak of sensors’ access
rate because they are normalized with respect to the number
of cycles. From the observation just made, the concept of execution
vector, which is similar to the power vector (PV) [22], seems adequate
for phase detection. An execution vector (EV) is defined as a
column vector whose entries are the system’s metrics, including hardware
performance counters, network bytes sent/received and disk
read/write counts. To remain consistent with previous sections, we
shall refer to these system metrics as sensors.
The sampling interval, i.e., the time interval after
which sensors’ values are read, depends on the desired granularity. While a
longer sampling interval may hide information regarding the system’s
behaviour, a shorter sampling interval may incur a non-negligible
overhead. In this paper, we use a sampling interval of one second and
further normalize sensors with respect to the number of cycles to
get their access rate.
Fig. 1. Dividing NAS benchmarks IS and EP, run successively, into two different computational behaviours using their power usage and sensors.
(a) First node and (b) second node: power (Watt) against time (seconds); the detected phases are labelled (17694 19) for IS and (17714 43) for EP on the first node, and (17692 21) for IS and (12988 41) for EP on the second node.
(c) Performance counters access pattern (first node) and (d) performance counters access pattern (second node): access rates of branch misses, cache references and cache misses against time (s).
We define the resemblance or similarity metric between two
EVs as the Manhattan distance between them, i.e., the sum of the
absolute differences between their corresponding entries. The Manhattan
distance suits the case since it weighs differences more heavily.
A phase change is detected when the Manhattan distance
between consecutive EVs exceeds a preset threshold. The thresh-
old is fixed in the sense that it is always the same percentage –
we refer to that percentage as the detection threshold – of the
maximum distance between consecutive EVs (e.g., if the detection
threshold is X%, then the threshold is X% of the maximum distance
between consecutive EVs). However, the maximum distance between
consecutive EVs is zeroed once a phase change is detected.
So, technically, the threshold varies throughout the system’s life cycle.
The maximum existing distance between consecutive EVs is
continuously updated until a phase change is detected, at which point it is
zeroed. The idea behind zeroing the maximum existing distance
when a phase change occurs is to allow detecting phase changes
when changing from a phase where distances between consecu-
tive EVs are big to a phase where they are not and vice versa. For
example, given a 10% similarity threshold, two consecutive EVs be-
long to the same group if the Manhattan distance between them is
less than 10% of the maximum existing distance between all con-
secutive execution vectors. Fig. 2(b) shows the distance between
consecutive execution vectors and the variation of the threshold
along the execution time-line when the system was running the
Advanced Research Weather Research and Forecasting (WRF-ARW)
model [23]. A graphical view of our phase detection mechanism
is provided in Fig. 2(a), where dashed vertical lines indicate the
start and end times of WRF-ARW. The left extremity of each horizontal
solid line indicates the point (in the execution time-line) at which
a phase change is detected, and its length indicates the duration
of the corresponding phase. The x-axis represents the execution
time-line and the y-axis represents the IDs associated with detected
phases. Note that the IDs of phases are non-zero integers assigned in
order of appearance. It can be seen from Fig. 2(c) that detected
phases are corroborated by the access pattern of sensors.
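The detection rule described above can be sketched as follows. The text does not fix the order in which the running maximum is compared, zeroed and updated; this sketch compares first, zeroes the maximum on a detected change, and then re-seeds it with the current distance. That ordering, like the function names and the sample EVs, is an assumption made for illustration.

```python
from typing import List, Sequence


def manhattan(u: Sequence[float], v: Sequence[float]) -> float:
    """Sum of absolute differences between corresponding entries."""
    return sum(abs(a - b) for a, b in zip(u, v))


def detect_phase_changes(
    evs: Sequence[Sequence[float]], detection_threshold: float = 0.10
) -> List[int]:
    """Return the indices of the EVs at which a phase change is detected."""
    changes: List[int] = []
    max_distance = 0.0
    for i in range(1, len(evs)):
        distance = manhattan(evs[i - 1], evs[i])
        # The threshold is a fixed percentage of the running maximum.
        if distance > detection_threshold * max_distance:
            changes.append(i)
            max_distance = 0.0  # zeroed once a phase change is detected
        # Assumption: the current distance then updates the running maximum.
        max_distance = max(max_distance, distance)
    return changes


# Hypothetical EVs with two sensors (access rates), sampled once per second.
samples = [
    (0.10, 0.02), (0.10, 0.02), (0.10, 0.02),
    (0.50, 0.30), (0.50, 0.31), (0.51, 0.30),
    (0.10, 0.02),
]
print(detect_phase_changes(samples))  # [3, 6]
```

With this ordering, small fluctuations inside a phase stay below 10% of the jump that opened the phase, while the return to the initial behaviour at index 6 is detected as a new change.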
The key motivation behind phase tracking is the use of charac-
teristics of known phases for optimizing similar phases. Depend-
ing on the needs, optimizations aim to reduce the execution time
of workloads, and/or their energy consumption. An effective phase
characterization is therefore needed. To accomplish this, once a
phase is detected, we apply principal component analysis (PCA)
to the dataset composed of the EVs pertaining to that phase. We next
select five sensors among those contributing the least to the first
principal axis of the PCA for phase characterization. This choice is motivated
by the assumption that information regarding what the system
did not do during a phase is captured by the sensors contributing
the least to the first principal axis of the PCA. These 5 sensors serve
as the characteristic of the phase. Since data collected during a
phase can be too large for efficient representation and comparison
in hardware, we summarize a phase by the EV at the centroid of
the group made up of the EVs sampled during that phase. We refer to
the EV summarizing a phase as the reference vector of that phase.
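As an illustration of the summarization step, the sketch below computes a reference vector as the component-wise mean of the EVs of a phase. It assumes that ‘‘the EV at the centroid’’ denotes the centroid itself rather than the sampled EV closest to it; the function name and the sample values are hypothetical.

```python
from typing import Sequence, Tuple


def reference_vector(evs: Sequence[Sequence[float]]) -> Tuple[float, ...]:
    """Centroid (component-wise mean) of the EVs sampled during a phase."""
    if not evs:
        raise ValueError("a phase must contain at least one EV")
    count = len(evs)
    # zip(*evs) iterates over sensors, each yielding its values over time.
    return tuple(sum(values) / count for values in zip(*evs))


# Hypothetical phase made of three EVs with two sensors each.
phase_evs = [(1.0, 2.0), (3.0, 6.0), (2.0, 4.0)]
print(reference_vector(phase_evs))  # (2.0, 4.0)
```

Storing one vector per phase keeps the comparison against an ongoing phase independent of how long the known phase lasted.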
3.2.1. Partial phase recognition
Phase identification is an interesting property of phase detec-
tion mechanisms, since it allows reuse of system reconfiguration
information for reoccurring phases. However, a phase cannot be
identifiedwith a known phase before its completion. The literature
suggests using phase prediction, which predicts the next phase of
an application before it gets started; however, it might not work
very well when you do not have any a priori information about ap-
plications sharing your platform. Therefore, instead of identifying
Fig. 2. Phase change detection for WRF, using the similarity between consecutive execution vectors as the similarity metric. (a) Graphical view of system phase distributions (phase ID versus time (s)). (b) Distance between EVs and variation of the detection threshold versus time (s). (c) Cache reference and miss rates along with branch miss rate (counter access rate versus time (s); legend: cache misses, cache ref, inst retired).
a whole phase with a known phase, we propose to identify only a part of
that phase with the known phase and extrapolate the result to the
remaining part. We refer to this process as partial phase recognition;
further details are provided in the next paragraph.
Partial phase recognition consists of identifying an ongoing
phase Pi (a phase that has started but is not yet finished) with
a known phase Pj by considering only the already executed part of
Pi. The already executed part of Pi, expressed as a percentage of
the length (duration) of Pj, is referred to as the recognition threshold
RT. Thus, with a recognition threshold of RT%, and assuming that
the reference vector of Pj is EVPj and that its length is mj, an ongoing
phase Pi is identified with Pj if the Manhattan distance between
EVPj and each EV pertaining to the already executed part of
Pi (corresponding in length to RT% of mj) is within the similarity
threshold ST. The pseudo-algorithm below summarizes partial
phase recognition.
• Let Pj be a completed phase, EVPj its reference vector, and mj its
duration.
• Pi is partially recognized as Pj if
– for every EV v in the already executed part of Pi (that is, every EV sampled
between the start time of Pi and the start time of Pi + RT% of mj), the
Manhattan distance between v and EVPj is within the similarity threshold ST.
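The pseudo-algorithm above can be sketched in Python as follows. This is a minimal sketch under assumptions that are ours for the purpose of illustration: the names manhattan and partially_recognized are hypothetical, the duration mj is expressed as a number of EV samples, RT is passed as a percentage, and ST is passed as an upper bound on the Manhattan distance in the same units as the EVs.

```python
from math import ceil


def manhattan(u, v):
    """Manhattan (L1) distance between two execution vectors."""
    return sum(abs(a - b) for a, b in zip(u, v))


def partially_recognized(ongoing_evs, ref_vector, ref_length, rt_percent, st):
    """Return True if the ongoing phase Pi is identified with the known phase Pj.

    ongoing_evs -- EVs sampled since the start of Pi, in time order
    ref_vector  -- reference vector EVPj of the completed phase Pj
    ref_length  -- length mj of Pj, in samples (assumption)
    rt_percent  -- recognition threshold RT, in percent
    st          -- similarity threshold ST, as a distance bound (assumption)
    """
    needed = max(1, ceil(rt_percent * ref_length / 100))
    if len(ongoing_evs) < needed:
        return False  # not enough of Pi has been executed yet
    return all(manhattan(ev, ref_vector) <= st for ev in ongoing_evs[:needed])
```

With RT set to 10% and a known phase of 100 samples, for instance, the decision is taken once the first 10 EVs of the ongoing phase are available.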
4. Model use cases: system adaptation and energy consumption
prediction
Understanding the different behaviours that a high performance
computing (HPC) system goes through during its life cycle can
lead to a multitude of optimization opportunities. In this section,
we investigate how one can leverage those behaviours to reduce
the energy consumption of the infrastructure. With platform as
a service becoming more and more attractive, users often face the
dilemma of choosing between multiple platforms for their applications.
This section also investigates how users can determine the
least energy consuming platform for their applications. In the following,
we first show how our DNA-like modelling approach can
be used for predicting the energy consumption of an application on
a given platform. We next use our phase tracking methodology for
reducing the energy consumption of applications without a priori
knowledge.
4.1. Use case 1: cross platform energy prediction
The main motivation behind this use case is that users often
have more than one candidate platform for running their jobs, in
which case choosing the least energy consuming platform may be
beneficial both for them and the platform provider. For simplicity,
we assume that the application whose energy consumption is
being predicted is the only application running on the platform
under consideration. In other words, the DNA-like structure of the
system at hand is that of the application.
The prediction model implicitly uses two sets of data, one from
a reference platform provided by the DNA-like structure of the
application, and one from a target platform which is the platform
on which we want to estimate the overall energy consumption
of the application. The aforementioned reference platform is the
platform on which the application was first run and where its DNA-like
structure was built. In most cases, the reference platform will
be the platform on which the application was tested (nearly all the
information regarding the reference platform, including the application's
energy consumption, is known).
Knowing the DNA-like structure of an application, one can easily
identify applications with the same computational requirements.
We accomplish this by comparing the DNA-like structure
of the ongoing application to known DNA-like structures. A match
is found when the already executed part of the ongoing application
matches a given percentage of a known DNA-like structure.
Let us denote by Etar the energy consumed by the already executed
part of the application whose energy consumption we want to estimate,
and by Eref the energy consumed by the corresponding
part of the application whose DNA-like structure matches the already
executed part of the application at hand. For example, consider
an application that lasts 60 min on its reference platform, and let
us assume that Etar represents the energy consumed by the same
application on the target platform after a 10 min run; then Eref
represents the energy consumed by the application
on the reference platform during the first 10 min of its execution.
Denoted Erel, the relative energy consumption between the two
platforms is given by the following equation:
$$E_{rel} = \frac{E_{tar}}{E_{ref}}. \tag{1}$$
With the above relative energy, the estimated energy consump-
tion of the application on the target platform is given by Eq. (2):
$$E_{est} = \int_{0}^{X\%} P(t)_{i,tar}\,dt + E_{rel} \cdot \int_{X\%}^{end} P'(t)_{j,ref}\,dt. \tag{2}$$
Here, Eest is the estimated energy consumption on the target
platform and Erel is the relative energy consumption between the target
platform and the reference platform. In the above equation,
$\int_{0}^{X\%} P(t)_{i,tar}\,dt$ represents the energy consumed by the application
before a match is found with a known DNA-like structure. Either
measured or estimated, $P(t)_{i,tar}$ is the instantaneous power usage
of the application on the target platform. Likewise, $P'(t)_{j,ref}$ is
the instantaneous power usage of the application on the reference
platform and can be obtained from its DNA-like structure. In cases
where the power consumption changes radically after the X%
threshold, if the change in power consumption does not imply any
change in the set of sensors used to estimate the power consumption,
then we still assume it is the same application; otherwise we
attempt to find another match.
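To make Eqs. (1) and (2) concrete, the following Python sketch evaluates them on sampled power traces. It is only an illustration under assumptions of ours: the names integrate and estimate_energy are hypothetical, both traces are sampled at the same fixed period dt and aligned in time, the integrals are approximated with the trapezoidal rule, and the X% threshold is taken to be reached after as many samples as the target trace contains.

```python
def integrate(power, dt=1.0):
    """Trapezoidal integral of power samples (W) taken every dt seconds, in J."""
    return sum((a + b) / 2.0 * dt for a, b in zip(power, power[1:]))


def estimate_energy(p_tar_partial, p_ref, dt=1.0):
    """Estimate the energy of a whole run on the target platform.

    p_tar_partial -- power samples on the target platform up to the X% threshold
    p_ref         -- power samples of the complete run on the reference platform
    """
    k = len(p_tar_partial)
    e_tar = integrate(p_tar_partial, dt)    # first integral of Eq. (2)
    e_ref = integrate(p_ref[:k], dt)        # same portion on the reference
    e_rel = e_tar / e_ref                   # Eq. (1)
    e_ref_rest = integrate(p_ref[k - 1:], dt)  # second integral of Eq. (2)
    return e_tar + e_rel * e_ref_rest       # Eq. (2)
```

For a reference run drawing a constant 100 W for 10 s and a target platform drawing 200 W over the first 2 s, the sketch yields Erel = 2 and an estimate of 400 J + 2 × 800 J = 2000 J.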
We define the estimation accuracy as the ratio between the estimated
and measured energy on the target platform, i.e.,
$$Accuracy = \frac{E_{est\_tar}}{E_{tar}} \tag{3}$$
where Eest_tar is the estimated energy consumption on the target
platform.
Although comparing two DNA-like structures boils down to
comparing two strings, the overhead associated with matching the
DNA-like structure of a running application against those of previously
seen applications is proportional to the size of the already known
structures times the size of the DNA-like structure of the running
application. We simplify this with the assumption that our profile
database contains only one profile, that of the application
whose energy consumption we want to predict. We also
assume that the application follows a very simple pattern, which
starts with an initialization phase and finishes with a finalization
phase. Between the initialization and the finalization phases, there
are some iterative computations and optional communications.
Finally, assuming that the instantaneous power usage of the application
is approximately the same during each of its iterations,
meaning that the energy consumed by the application in each iteration
throughout its life cycle is nearly the same, Eq. (2) can be
simplified to Eq. (4), where Einit represents the energy consumed
by the application on the target platform during its initialization
phase; Eref−init is the energy consumed by the application on the
reference platform from the end of the initialization phase to the
end of its whole execution; and Eref−exe is the measured energy consumption
(resulting from complete execution) of the application
on the reference platform.
$$E_{est} = E_{init} + E_{rel} \cdot (E_{ref\text{-}exe} - E_{ref\text{-}init}). \tag{4}$$
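A short worked example may help in reading Eqs. (3) and (4). The Python sketch below is illustrative only: the function names and all numerical values are hypothetical, and Erel is assumed to be computed over the initialization phase, which is the already executed part in this simplified setting. The sketch also reads Eref−init as the energy spent on the reference platform during the initialization phase, so that the parenthesised difference is the post-initialization energy corresponding to the second integral of Eq. (2); this reading is an assumption.

```python
def estimate_energy_simplified(e_init_tar, e_rel, e_ref_exe, e_ref_init):
    """Eq. (4): target-platform estimate from the initialization phase only."""
    return e_init_tar + e_rel * (e_ref_exe - e_ref_init)


def accuracy(e_est, e_measured):
    """Eq. (3): ratio between estimated and measured energy on the target."""
    return e_est / e_measured


# Hypothetical figures, in joules.
e_init_tar = 500.0    # initialization phase on the target platform
e_ref_init = 400.0    # initialization phase on the reference platform (assumed reading)
e_ref_exe = 4000.0    # complete execution on the reference platform
e_rel = e_init_tar / e_ref_init  # Eq. (1) applied to the initialization phase

e_est = estimate_energy_simplified(e_init_tar, e_rel, e_ref_exe, e_ref_init)
```

With these figures, Erel = 1.25 and Eest = 500 + 1.25 × 3600 = 5000 J; an accuracy above 1 under Eq. (3) then means that the model overestimates the measured consumption.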
4.2. Use case 2: optimizing energy performance of HPC systems
The methodology described in Section 3.2 permits online detection
and characterization of the different runtime behaviours of the
system. We couple it with partial phase recognition to
guide on-the-fly system adaptation considering three HPC subsystems:
processor, disk and network interconnect. The power consumption
of these HPC subsystems, along with that of the memory,
accounts for about 55% [24] of the total power consumption of a typical HPC
system. For the processor, we define three computational levels according
to the characteristics of the workload:
• High or CPU-bound: the CPU-bound computational level corre-
sponds to the maximum available CPU frequency, and is used
for CPU-bound workloads.
• Medium or memory-bound: corresponds to an average in be-
tween the maximum and the minimum available frequencies;
it is mainly used for memory-bound workloads.
• Low: the system is in the low computational level when the CPU
frequency is set to the minimum available.
For the disk, we define two states: active and sleep; where active
includes both the disk’s active and standby modes. Finally, for the
network interconnect we define two data transfer speeds:
• The communication-intensive speed: corresponds to the high-
est available transfer rate of the network card.
• Low-communication speed: where the speed of the network
interconnect is set to the lowest speed.
As mentioned earlier in this paper, principal component analysis
(PCA) is applied to the vectors belonging to any newly created
phase to select five sensors, which are used as phase characteristics.
These characteristics are translated into system adaptation
as detailed in Table 2. Let us comment on the first row of that
table. Workloads/applications with frequent cache references and
misses are likely to be memory bound. In our case, having these
sensors (cache_reference and cache_misses) selected from PCA indicates
that the workload is not memory bound, since the selected sensors
are those contributing the least to the first principal axis. If, in addition, that
workload does not issue a high I/O rate (presence of I/O related
sensors in the first column), then we assume that it is CPU-bound;
consequently, the frequency of the processor can be scaled to its
maximum, the disk sent to sleep and the speed of the interconnect
scaled down. For the second row of Table 2, the characteristics do
not include any I/O related sensor; this implies that the system was
running an I/O intensive workload; thus, the processor's speed
can be set to its minimum. Note in passing that changing the disk's
state from sleep to active does not appear in Table 2; this is because
the disk automatically enters the active state when it is accessed.
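The translation of phase characteristics into decisions can be sketched as a small rule table in Python. This is only an illustration of the rules of Table 2 under assumptions of ours: the sensor names, the set IO_SENSORS and the function decide are hypothetical; the order in which overlapping rows are tested is our choice; and the rows "No I/O related sensors" and "Instructions & llc" are merged because they lead to the same decision.

```python
from enum import Enum


class Cpu(Enum):
    HIGH = "maximum frequency"
    MEDIUM = "intermediate frequency"
    LOW = "minimum frequency"


# Hypothetical names for the I/O related sensors (disk and network).
IO_SENSORS = {"disk_read", "disk_write", "net_rx", "net_tx"}


def decide(selected_sensors):
    """Map the sensors selected from PCA to a system adaptation decision."""
    sensors = set(selected_sensors)
    has_io = bool(sensors & IO_SENSORS)
    if has_io and {"cache_references", "cache_misses"} <= sensors:
        # Neither memory bound nor I/O intensive: assumed CPU-bound.
        return {"cpu": Cpu.HIGH, "network": "down", "disk": "sleep"}
    if has_io and sensors & {"instructions", "llc_misses"}:
        return {"cpu": Cpu.MEDIUM, "network": "down", "disk": "sleep"}
    if not has_io:
        # I/O sensors dominate the first principal axis: I/O intensive phase.
        return {"cpu": Cpu.LOW, "network": "up", "disk": None}
    # Only I/O related sensors among those covered by the rules.
    return {"cpu": Cpu.HIGH, "network": "up", "disk": None}
```

A value of None for the disk means that no action is taken, consistent with the remark that the disk returns to the active state by itself when accessed.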
5. Experimentation and validation
5.1. Evaluation for energy consumption prediction
We evaluate our energy estimation model considering two
workloads: the first workload (workload_1) iteratively computes
the inverse of a 10 × 10 matrix and copies a large file from a remote
repository; the second workload is GeneHunter, a real-life program
for linkage analysis [25].
Table 2
Translation of phase characteristics into system adaptation (I/O related sensors include network and disk activities).
Sensors selected from PCA for phase characterization | Decisions
Cache_references & cache_misses & I/O related sensors | CPU frequency set to its maximum; spin down the disk; network speed scaled down
No I/O related sensors | CPU frequency set to its lowest; network speed scaled up
Instructions & last level cache misses (llc) | CPU frequency set to its minimum; network speed scaled up
Instructions or llc & I/O related sensors | CPU frequency set to its average value; network speed scaled down; spin down the disk
I/O related sensors (low computation and communication-intensive) | CPU frequency set to its maximum; network speed scaled up
Fig. 3. Per scenario energy estimation accuracy.
We further consider three scenarios: (i) the first scenario estimates
the energy consumption of workload_1 on a node running
at 2.13 GHz using the same node running at 1.6 GHz as
the reference platform; (ii) scenario 2 still estimates the energy
usage of workload_1, but uses a Dell PowerEdge server and a
Sun Fire V20z as reference and target platforms respectively;
(iii) in the third and last scenario we attempt to estimate the energy
consumption of GeneHunter. For this specific case, our reference
platform is an Intel Xeon E5506 quad-core with 8 cores and
12 GB of RAM while the target platform is an Intel Xeon X3440 with
4 cores and 16 GB of RAM. For all three scenarios, an empirical partial
execution threshold of 20% is used. This means that a match
with an existing DNA-like structure DS is found if the already executed
part of our synthetic application matches 20% of DS,
i.e., assuming DS lasted 60 s, a match will be found if the already
executed part of the synthetic application matches the DNA-like
structure describing the first 12 s of DS.
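The partial execution threshold can be illustrated with a few lines of Python. The sketch below is hypothetical in several respects: the name matches_known_structure is ours, a DNA-like structure is assumed to be a string carrying one symbol per second of execution, and a match is assumed to require exact equality of prefixes.

```python
from math import ceil


def matches_known_structure(executed, known, threshold_percent=20):
    """Return True if the already executed part matches the first
    threshold_percent of a known DNA-like structure.

    executed -- DNA-like string of the running application so far
    known    -- DNA-like string of a completed application (DS)
    """
    needed = ceil(len(known) * threshold_percent / 100)
    if needed == 0 or len(executed) < needed:
        return False  # too early to decide
    return executed[:needed] == known[:needed]
```

For a known structure of 60 symbols (60 s under the assumption above), the comparison bears on the first 12 symbols, in line with the 12 s mentioned in the text.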
We compute for each scenario the expected energy consumption
based on Eq. (4); results are summarized in Fig. 3. We can see
from those results that the accuracy is very good. Notice that the
accuracy is higher because it is computed considering the average
energy consumption. We believe that overestimating the actual
energy consumption (as Fig. 3 indicates) is acceptable since
the peak energy consumption is typically greater than the average.
We can also notice that the accuracy for GeneHunter is extremely
high; this can be attributed to the fact that we were unable to divide
the application into phases reflecting its actual behaviour.
5.2. Execution vectors based system behaviour tracking guided system
adaptation
In this section, we analyse and discuss experimental results for
our energy performance optimization use case.
5.2.1. Evaluation platform
Our evaluation support is a twenty-five-node cluster set up on
the Grid'5000 [26] French large scale experimental platform. Each
node is an Intel Xeon X3440 with 4 cores and 16 GB of RAM. Available
frequency steps for each core are: 2.53, 2.40, 2.27, 2.13, 2.00,
1.87, 1.73, 1.60, 1.47, 1.33 and 1.20 GHz. In our experiments, the low
computational level always sets the CPU frequency to the lowest
available, which is 1.20 GHz, whereas the high and medium computational
levels set the CPU frequency to the highest available (2.53
GHz) and to 2.00 GHz respectively. Each node uses its own hard drive,
which supports active, ready and standby states. InfiniBand-20G
is used for interconnecting nodes. The Linux kernel 2.6.35 is installed
on each node, where perf_event is used to read the hardware
monitoring counters. MPICH is used as the MPI library. For the
experiments, we use three benchmarks (LU, BT and SP) from the NPB
suite and two real-life applications: Molecular Dynamics Simulation
(MDS) [27] and the Advanced Research Weather Research and
Forecasting (WRF-ARW) model [28,23]. WRF-ARW is a fully compressible
conservative-form non-hydrostatic atmospheric model.
It uses an explicit time-splitting integration technique to efficiently
integrate the Euler equations. Classical molecular dynamics
numerically solves Newton's equations of motion for a system
of many interacting particles. We monitored each node's power
usage at one sample per second using a power distribution unit.
5.2.2. Results analyses and discussion
To evaluate our system adaptation policy, we consider three basic
configurations of the monitored cluster: the first configuration,
which we refer to as on-demand, is the configuration in which
Linux's default on-demand governor is enabled on all the nodes of
the cluster; the second configuration, called performance, is the
configuration in which Linux's performance governor is enabled on
each node of the cluster; and finally the third configuration, which
we refer to as managed, is the configuration in which our system
adaptation policy is applied. We also consider two levels of system
adaptation:
• System adaptation level one: It corresponds to the situation in
which only processor related optimization is made.
• System adaptation level two: It embraces level one, and addi-
tionally considers optimizing the interconnect and the disk.
Results we present here are obtained using an empirical simi-
larity threshold ST of 5%. The same goes for the recognition thresh-
old RT which is set to 10%.
(a) System adaptation level one: processor-only optimization:
when an ongoing phase is identified with an existing phase, the
characteristics of the existing phase are used to adapt the processor's
frequency accordingly. The diagrams of Fig. 4 show the average
energy consumption (Fig. 4(a)) and execution time (Fig. 4(b))
of MDS and WRF-ARW under the three system configurations.
These diagrams indicate that our management policy can save up
to 19% of the total energy consumption with less than 4% performance
loss.
Fig. 4(b) shows that the on-demand and performance governors
achieve nearly the same performance. This is because Linux's
on-demand governor does not lower the CPU frequency unless the
system load decreases below a given threshold. Traces of CPU load
under WRF-ARW for one node of our cluster are shown in Fig. 5,
where the y-axis represents the percentage of load. The plot indicates
that the CPU load remains above 85%, in which case the
Fig. 4. Phase tracking and partial recognition guided CPU optimization results for MDS and WRF-ARW under the performance, managed and on-demand configurations. (a) Average energy consumed by each application under different configurations (normalized energy consumption). (b) Average execution time of each application under different configurations (normalized execution time).
Fig. 5. Load traces for one of the nodes running WRF-ARW under the on-demand configuration (load per core for core1 to core4 versus time (s)).
Fig. 6. Phase tracking and partial recognition guided processor, disk and network interconnect optimization results for MDS and WRF-ARW under the performance, managed and on-demand configurations: the chart shows average energy consumed by each application under different configurations. (a) Energy performance (normalized energy consumption). (b) Performance (normalized execution time).
on-demand and performance governors have almost the same behaviour.
With processor-only optimization, our management policy
differs from that of Linux's on-demand governor in that we
do not use the system's load as the system adaptation metric, which
enables us to consume less energy.
(b) System adaptation level two: processor, disk and network optimization:
Fig. 6 presents the energy performance considering the
processor, along with disk and network. These graphs indicate that
considering the processor along with disk and network interconnect
improves energy performance by up to 24% with the same performance
degradation as with processor-only optimization. In other
words, the disk and the network also contribute to improving energy
performance.
5.2.3. Threshold selection and energy performance
In this section, we investigate the influence of the phase detection
parameters ST and RT on application performance and energy
consumption. Firstly, we set the detection threshold and vary the
partial recognition threshold. Secondly, we set the partial recognition
threshold and vary the detection threshold. Fig. 7, where the
y-axis represents either the average energy consumption (Fig. 7(a))
or the average execution time (Fig. 7(b)) and the x-axis the detection
threshold, shows the impact of the detection threshold on
the execution time and energy consumption of WRF-ARW. It can
be seen from that figure that, for WRF-ARW, a detection threshold
of either 5% or 20% could be a good choice. However, 5% could
be preferable since there is a difference in energy consumption
of up to 5000 J for a difference of less than 10 s in execution time.
Fig. 7 also reveals that the detection threshold might have a significant
impact on performance (both in terms of energy consumption
and execution time). For these experiments, the partial recognition
threshold was fixed to 10%.
The influence of the recognition threshold RT on energy per-
formance is summarized by Fig. 8, where the x-axis represents
the recognition threshold and the y-axis either the average energy
consumption (Fig. 8(a)) or the average execution time (Fig. 8(b)).
Fig. 7. Impact of the detection threshold on energy performance for WRF-ARW. (a) Energy performance. (b) Performance (execution time).
Fig. 8. Influence of the partial recognition threshold on energy performance for WRF-ARW. (a) Energy performance. (b) Performance (execution time).
According to Fig. 8, a partial recognition threshold of 15% is suitable
both in terms of energy and execution time.
What we can learn from Figs. 7 and 8 is that these values must
be chosen depending on the target objective. A small recognition
threshold may limit the impact of wrong decisions; however, it may
also have an influence on right decisions. Indeed, making the right
decision earlier allows saving more energy; conversely, making a
wrong decision earlier can result in significant energy waste and/or
performance degradation.
5.3. Performance analyses of multiple applications
Earlier in this paper, we discussed the usefulness of looking at a whole
system; however, our evaluation has so far focused on only one
application at a time. In this section, we evaluate our strategy with
multiple applications running at the same time. We consider as test
applications the WRF-ARW model and BT from the NAS Parallel
Benchmark (NPB) suite. WRF-ARW spans 48 processes and runs on
48 cores (in other words 12 nodes), whereas BT spans 48 processes
on 12 nodes. The two applications do not share any resources except
the network that interconnects them. We measure the power
consumption of each node of our 24-node cluster using a power distribution
unit (PDU). Fig. 9, where the x-axis represents the average percentage
of energy improvement or performance loss, indicates that
our methodology still functions when there are multiple applications
sharing the platform; however, we also notice a performance
degradation of up to 15% for BT. We are currently investigating
the reason for BT's performance degradation. Nevertheless, we believe
this can be attributed to the fact that some phases might have
been considered as memory intensive or communication intensive
while they were actually compute intensive. In addition, applications
(BT for example) which do not implement load imbalance are
likely to experience higher performance degradation than those
which implement load imbalance (WRF-ARW for example) [29,30].
Fig. 9. Energy and performance (execution time) when BT and WRF-ARW are
sharing the cluster.
6. Conclusion and future works
In this paper, we present two generic approaches for tracking
high performance computing systems' behaviour regardless of the
applications being executed. We show through two use cases how
they can be used, on the one hand, for improving the energy performance
of an HPC system at runtime and, on the other hand, for
estimating the energy consumption of an application
given a target platform. Experimental results reveal the effectiveness
of these two methodologies under real-life workloads
and benchmarks. Comparison of our system adaptation policy with
baseline unmanaged execution shows that we can save up to 19%
of energy with less than 4% performance loss considering the processor
only and up to 24% considering the disk, processor and network.
As our methodology for improving energy performance does
not depend on any application specific metric, we expect it to be
extensible to a large number of power-aware HPC systems. Future
work includes combining the two system behaviour tracking approaches to
enable live feedback on whether a system adaptation will be effective
or not. This can help improve the energy performance while
reducing performance loss. We also plan on adding memory optimization
to processor, disk and network optimization and integrating
workload consolidation and migration as core functionalities of
the system management policy.
Acknowledgments
This work is supported by the INRIA large scale initiative
Hemera focused on ‘‘developing large scale parallel and distributed
experiments’’. Experiments presented in this paper were carried
out using the Grid’5000 experimental testbed, being developed
under the INRIA ALADDIN development action with support from
CNRS, RENATER and several Universities as well as other funding
bodies (see https://www.grid5000.fr).
References
[1] NVIDIA, Nvidia tesla gpus power world’s fastest supercomputer, 2010. Press
release.
[2] INTEL, Developer’s manual: Intel 80200 processor based on Intel xscale
microarchitecture, 1998.
[3] AMD, Mobile AMD Duron processor model 7 data sheet, 2001.
[4] G.L.T. Chetsa, L. Lefèvre, J.-M. Pierson, P. Stolf, G.D. Costa, DNA-inspired scheme
for building the energy profile of HPC systems, in: E2DC, in: Lecture Notes in
Computer Science, vol. 7396, Springer, 2012, pp. 141–152.
[5] G.L. Tsafack, L. Lefevre, J.-M. Pierson, P. Stolf, G. Da Costa, Beyond CPU
frequency scaling for a fine-grained energy control of HPC systems, in: SBAC-
PAD 2012: 24th International Symposium on Computer Architecture and High
Performance Computing, IEEE, New York City, USA, 2012, pp. 132–138.
[6] G.D. Costa, H. Hlavacs, Methodology of measurement for energy consumption
of applications, in: GRID, IEEE, 2010, pp. 290–297.
[7] I. Kadayif, T. Chinoda, M. Kandemir, N. Vijaykrishnan, M.J. Irwin, A. Sivasubramaniam,
VEC: virtual energy counters, in: Proceedings of the 2001 ACM SIGPLAN-SIGSOFT
Workshop on Program Analysis for Software Tools and Engineering,
PASTE ’01, ACM, New York, NY, USA, 2001, pp. 28–31.
[8] G. Contreras, Power prediction for Intel xscale processors using performance
monitoring unit events, in: Proceedings of the International Symposium on
Low Power Electronics and Design (ISLPED), ACM Press, 2005, pp. 221–226.
[9] K. Singh, M. Bhadauria, S.A. McKee, Real time power estimation and thread
scheduling via performance counters, SIGARCH Comput. Archit. News 37
(2009) 46–55.
[10] D. Bautista, J. Sahuquillo, H. Hassan, S. Petit, J. Duato, A simple power-
aware scheduling for multicore systems when running real-time applications,
in: IPDPS, IEEE, 2008, pp. 1–7.
[11] R. Joseph, M. Martonosi, Run-time power estimation in high performance
microprocessors, in: International Symposium on Low Power Electronics and
Design, pp. 135–140.
[12] W. Wu, L. Jin, J. Yang, P. Liu, S.X.D. Tan, A systematic method for functional unit
power estimation in microprocessors, in: Design Automation Conference.
[13] SUN, The UltraSPARC processor—technology white paper: the UltraSPARC
architecture, 1995.
[14] INTEL, Intel xscale microarchitecture for the PXA255 processor: User’s manual
Intel corporation, 2003.
[15] B. Rountree, D.K. Lowenthal, B.R. de Supinski, M. Schulz, V.W. Freeh,
T. Bletsch, Adagio: making dvs practical for complex HPC applications, in: Proceedings
of the 23rd International Conference on Supercomputing, ICS ’09,
ACM, New York, NY, USA, 2009, pp. 460–469.
[16] V.W. Freeh, N. Kappiah, D.K. Lowenthal, T.K. Bletsch, Just-in-time dynamic
voltage scaling: exploiting inter-node slack to save energy in MPI programs,
J. Parallel Distrib. Comput. 68 (2008) 1175–1185.
[17] R. Kotla, A. Devgan, S. Ghiasi, T. Keller, F. Rawson, Characterizing the impact of
different memory-intensity levels, in: IEEE 7th Annual Workshop on Workload
Characterization (WWC-7).
[18] C. Isci, G. Contreras, M. Martonosi, Live, runtime phase monitoring and
prediction on real systems with application to dynamic power management,
in: Proceedings of the 39th Annual IEEE/ACM International Symposium on
Microarchitecture, MICRO 39, IEEE Computer Society, Washington, DC, USA,
2006, pp. 359–370.
[19] K. Choi, R. Soma, M. Pedram, Fine-grained dynamic voltage and frequency
scaling for precise energy and performance tradeoff based on the ratio of off-
chip access to on-chip computation times, Trans. Comp.-Aided Des. Integ. Cir.
Sys. 24 (2006) 18–28.
[20] M.Y. Lim, V.W. Freeh, D.K. Lowenthal, Adaptive, transparent frequency and
voltage scaling of communication phases in MPI programs, in: Proceedings of
the 2006 ACM/IEEE Conference on Supercomputing, SC ’06, ACM, New York,
NY, USA, 2006.
[21] D.H. Bailey, E. Barszcz, J.T. Barton, D.S. Browning, R.L. Carter, R.A. Fatoohi,
P.O. Frederickson, T.A. Lasinski, H.D. Simon, V. Venkatakrishnan, S.K. Weer-
atunga, The NAS Parallel Benchmarks, Technical Report, Int. J. Supercomput.
Appl. (1991).
[22] C. Isci, M. Martonosi, Identifying program power phase behavior using power
vectors, in: Workshop on Workload Characterization.
[23] W.C. Skamarock, J.B. Klemp, J. Dudhia, D.O. Gill, D.M. Barker, W. Wang,
J.G. Powers, A description of the advanced research WRF version 2. Available
from Ncar; P.O. Box 3000; Boulder, CO 88 (2001) 7–25.
[24] Y. Liu, H. Zhu, A survey of the research on power management techniques for
high-performance systems, Softw.-Pract. Exp. 40 (2010) 943–964.
[25] G. Conant, S. Plimpton, W. Old, A. Wagner, P. Fain, Parallel GeneHunter:
implementation of a linkage analysis package for distributed-memory
architectures, in: Proceedings of the First IEEE Workshop on High Performance
Computational Biology, International Parallel and Distributed Computing
Symposium, p. electronic.
[26] R. Bolze, F. Cappello, E. Caron, M. Daydé, F. Desprez, E. Jeannot, Y. Jégou,
S. Lanteri, J. Leduc, N. Melab, G. Mornet, R. Namyst, P. Primet, B. Quetier,
O. Richard, E.-G. Talbi, I. Touche, Grid’5000: a large scale and highly recon-
figurable experimental grid testbed, Int. J. High Perform. Comput. Appl. 20
(2006) 481–494.
[27] K. Binder, J. Horbach, W. Kob, W. Paul, F. Varnik, Molecular dynamics
simulations, J. Phys.: Condens. Matter 16 (2004) S429.
[28] WRF-AWR, The weather research and forecasting model, 2012.
[29] G. Chen, K. Malkowski, M.T. Kandemir, P. Raghavan, Reducing power with
performance constraints for parallel sparse applications, in: IPDPS, IEEE
Computer Society, 2005.
[30] H. Kimura, M. Sato, Y. Hotta, T. Boku, D. Takahashi, Empirical study on reducing
energy of parallel programs using slack reclamation by DVFS in a power-scalable
high performance cluster, in: CLUSTER, IEEE, 2006.
Tsafack C. Ghislain Landry received the B.S. degree in
mathematics and computer science from Dschang Uni-
versity, Cameroon, in 2006, and the M.S. degree from the
Francophone Institute for Computer Science, Viet Nam,
in 2010. Prior to receiving the M.S. degree, he received
a Maîtrise in computer science from the University of
Yaoundé I, Cameroon, in 2007. He has been working to-
wards the Ph.D. degree in computer science at Ecole
Normale Superieure of Lyon, LIP lab. and Paul Sabatier
University, IRIT lab. His major research interests include
power-aware computing and communication in high per-
formance computing infrastructures, delay/disruption-tolerant networking and
network services.
Laurent Lefèvre obtained his Ph.D. in Computer Science in
January 1997 at LIP Laboratory (Laboratoire Informatique
du Parallelisme) in ENS-Lyon (Ecole Normale Superieure),
France. From 1997 to 2001, he was assistant professor
in computer science in Lyon 1 University and a member
of the RESAM Laboratory (High Performance Networks
and Multimedia Application Support Lab.). Since 2001, he has been a
research associate in computer science at INRIA (the
French Institute for Research in Computer Science and
Control). He is a member of the INRIA AVALON team
(Algorithms and Software Architectures for Distributed
and HPC systems) from the LIP laboratory in Lyon, France. His research interests
focus on Green and Energy Efficient Computing and Networking. He has organized
several conferences in high performance networking and computing and he is a
member of several program committees. He has co-authored more than 100 papers
published in refereed journals and conference proceedings. He participates in
several national and European projects on energy efficiency. For more information,
visit: http://perso.ens-lyon.fr/laurent.lefevre/.
Jean-Marc Pierson has served as a Full Professor in Computer
Science at the University of Toulouse (France) since 2006.
He received his Ph.D. from the ENS-Lyon, France in 1996.
He was an Associate Professor at the University Littoral
Cote-d’Opale (1997–2001) in Calais, then at INSA-Lyon
(2001–2006). He is a member of the IRIT Laboratory and
Chair of the SEPIA Team on distributed systems. His main
interests are related to large-scale distributed systems.
He serves on several PCs and editorial boards in the
Cloud, Grid, Pervasive, and Energy-aware computing area.
In recent years, his research has focused on energy-aware
distributed systems, in particular monitoring, job placement and scheduling, virtual
machine techniques, green networking, autonomic computing, and mathematical
modeling. He chairs the EU-funded COST IC0804 Action on ‘‘Energy Efficiency
in Large Scale Distributed Systems’’ and participates in several national and
European projects on energy efficiency. For more information, please visit
http://www.irit.fr/~Jean-Marc.Pierson/.
Patricia Stolf has been an assistant professor since 2005. She
teaches at the University of Toulouse (France). She obtained
a Ph.D. in 2004 at the LAAS-CNRS laboratory (Toulouse,
France) on task scheduling on clusters for remote
services with quality of service. She now works in the
SEPIA team of the IRIT laboratory, in the field of
distributed algorithms and autonomic computing in
large-scale distributed systems such as grids
and clouds. She studies resource management, load balancing,
energy-aware autonomic systems, and energy-aware
and thermal-aware task scheduling. She is involved in
several research projects: the COST IC0804 Action ‘‘Energy Efficiency in Large
Scale Distributed Systems’’, the European CoolEmAll project and the national
ANR SOP project.
Georges Da Costa is a permanent Assistant Professor
in Computer Science at the University of Toulouse. He
received his Ph.D. from the LIG HPC research laboratory
(Grenoble, France) in 2005. He is a member of
the IRIT Laboratory. His main interests are related to
large-scale distributed systems, algorithmics, performance
evaluation and energy-aware systems. He is a Work Package
leader of the European project CoolEmAll, which
aims at providing advanced simulation, visualization and
decision support tools along with blueprints of computing
building blocks for modular data center environments.
He is a working group chair of the European COST IC0804 Action on
‘Energy efficiency in large scale distributed systems’. His research currently
focuses on energy-aware distributed systems. He serves on several PCs in the
energy-aware systems, cluster, grid, cloud and peer-to-peer fields. His research
highlights are grid, cluster and cloud computing, hybrid computing (CPU/GPU),
large-scale energy-aware distributed systems, performance evaluation and ambient
systems.
