Assessing the Impact of the CPU Power-Saving Modes
on the Task-Parallel Solution of Sparse Linear Systems
José I. Aliaga · María Barreda · Manuel F. Dolz · Alberto F. Martín · Rafael Mayo · Enrique S. Quintana-Ortí
Abstract We investigate the benefits that an energy-aware implementation of the runtime in charge of the concurrent execution of ILUPACK —a sophisticated preconditioned iterative solver for sparse linear systems— produces on the time-power-energy balance of the application. Furthermore, to connect the experimental results with the theory, we propose several simple yet accurate power models that capture the variations of average power caused by the introduction of the energy-aware strategies, as well as the impact of the P-states on ILUPACK's runtime, on two distinct platforms based on multicore technology from AMD and Intel.
Keywords Sparse linear systems · Iterative solvers · ILUPACK · Energy-aware methods · CPU power-saving modes · High performance computing
José I. Aliaga · María Barreda · Rafael Mayo · Enrique S. Quintana-Ortí
Depto. de Ingeniería y Ciencia de Computadores, Universidad Jaume I, 12.071–Castellón, Spain,
E-mail: {aliaga, mvaya, mayo, quintana}@uji.es

Manuel F. Dolz
Dept. of Informatics, University of Hamburg, 22.527–Hamburg, Germany,
E-mail: manuel.dolz@informatik.uni-hamburg.de

Alberto F. Martín
Centre Internacional de Mètodes Numèrics a l'Enginyeria (CIMNE), Parc Mediterrani de la Tecnologia, Universitat Politècnica de Catalunya, 08.860-Castelldefels, Spain,
UPC, Edifici C1, 08.034-Barcelona, Spain,
E-mail: amartin@cimne.upc.edu
1 Introduction
Power consumption has been identified as a crucial challenge that must be tackled in order to deliver a sustained EXAFLOPS¹ throughput, at an affordable cost, in high-performance computing (HPC) facilities deployed by the end of this decade [14,16,21]. In particular, although we have enjoyed considerable advances in the performance-power ratio of HPC systems during the last few years —around 5× in the MFLOPS/Watt ratio for the systems ranked in the first position of the Green500 list during the last five years [1]— the ambitious goal of building an Exascale system by 2020 that dissipates only 20 MWatts demands even more favorable improvement rates.
A major part of the progress experienced in the energy efficiency of HPC systems in these past years has been due to the hardware, concretely, to the aggregation of low-power processors with an increasing number of cores and the adoption of hardware accelerators for HPC [3]. However, much remains to be done in order to leverage the power-saving mechanisms available in today's hardware, and a holistic and multidisciplinary effort has to be conducted in energy-efficient HPC, comprising applications, system software as well as hardware [4].
ILUPACK [2] is a numerical package for the solution of sparse linear systems via Krylov-based iterative methods. The software implements multilevel ILU (incomplete LU) preconditioners for general, symmetric indefinite and Hermitian positive definite systems, as those arising in numerous scientific and engineering applications, in combination with inverse-based ILUs and Krylov subspace solvers.

¹ A rate of 1 EXAFLOPS = 10^18 floating-point arithmetic operations, or flops, per second.
In [5] and [6] we presented two concurrent versions of ILUPACK, for multicore architectures and distributed-memory (message-passing) platforms respectively, that efficiently exploit the task-parallelism intrinsic to the iterative solution of sparse linear systems (including the calculation of the preconditioner). In response to the increasing energy awareness in HPC, in [7] we analyzed the energy efficiency of the task-parallel calculation of the preconditioner on an AMD-based multicore platform, leveraging the CPU power-saving modes available in this architecture [20]. In this paper we extend our previous work with the following new contributions:
– We introduce energy-aware strategies into ILUPACK's runtime and address the characterization (modeling) of power consumption in the preconditioned solution of symmetric positive definite (s.p.d.) sparse linear systems, thus expanding our previous work [7], which covered the computation of the preconditioner only, to the complete solver.
– We collect precise information on the CPU power-saving modes, identifying the sources of the power bottlenecks and the energy gains, and relating the power consumption and attained savings observed in our detailed experimental results to the power models.
– We target a server based on the AMD Opteron 6128 processor, already analyzed in [7], but we also model and evaluate an alternative platform equipped with two Intel Xeon E5504 processors. These servers are representative of current multicore technology and both abide by the CPU power-saving modes in the ACPI (advanced configuration and power interface) specification [20]. From the points of view of power consumption and performance, though, these platforms present two relevant differences. First, all cores of the Intel processor share the same power plane and the core frequency can only be adjusted at the processor (socket) level, whereas on the AMD server the frequency can be varied per core. Second, as our experiments will show, on the Intel platform the memory bandwidth is independent of the processor operating frequency while, on the AMD server, changes to the frequency also affect the memory throughput.
– Finally, we refine the experimental evaluation in [7], using a more accurate wattmeter with a much higher sampling rate than that utilized in our previous study.
As one of the goals of this paper is to analyze the effect that the exploitation of the CPU power-saving mechanisms exerts on the energy efficiency of a complex scientific application, we consider only the power dissipated by the components integrated in the system motherboard (e.g., CPU and RAM chips), discarding other power sinks due, e.g., to the network interface, disks, inefficiencies of the power supply unit, etc. For applications such as ILUPACK, which mainly exercise the floating-point arithmetic units of the processors and the memory, we can expect that the contribution of the discarded components is a constant that can simply be added to our models.
There is a large number of software efforts to reduce energy consumption in HPC clusters; for brevity, we reference only a few. The work in [30] presents a characterization of energy-saving techniques for clusters with two large groups: static power management (SPM) and dynamic power management (DPM). Within DPM the authors further distinguish between component-based and power-scalable load balancing (LB) techniques. Our approach, based on the exploitation of C-states, can be considered analogous to that in [11,26]; however, we apply it at the processor level rather than at the node level. From that perspective, our techniques can be classified as LB. The authors in [31] leverage DVFS (dynamic voltage-frequency scaling) to reduce the power-energy consumption of precedence-constrained tasks running on a cluster. Their heuristic-based strategy minimizes the slack of non-critical tasks by reducing their operating frequency while maintaining the time-to-solution. A similar approach is applied to the execution of dense linear algebra algorithms on multicore processors in [9]. In [22], the authors leverage hierarchical genetic strategy-based grid scheduling algorithms to exploit span via DVFS and reduce energy consumption in computational grids. Their simulation results show that the proposed scheduling methodologies fairly reduce the energy usage and can be easily adapted to dynamically changing grid states and various scheduling scenarios. Our strategy considers a static VFS-based approach, in which the voltage/frequency pair is fixed at the beginning of the execution and does not vary thereafter. Our energy savings come instead from the exploitation of the C-states which, in general, lead to considerable energy savings by relying on a race-to-halt/race-to-idle strategy. DVFS strategies are part of future work.
The rest of the paper is structured as follows. In Section 2 we briefly review the numerical approach underlying ILUPACK and the task-parallel implementation oriented to multicore architectures from [5]. In Section 3 we introduce the setup for our experiments. The next three sections contain the main contribution of the paper: the power model in Section 4; the power-aware runtime, together with a theoretical and experimental analysis of the impact of the C-states on the time-power-energy balance, in Section 5; and an analogous study, from the point of view of the P-states, in Section 6. Finally, the paper is closed with a discussion of the results in Section 7.
2 Parallel ILUPACK for Multicore Processors
The approach to multilevel preconditioning in ILUPACK relies on so-called inverse-based ILU factorizations. Unlike classical threshold-based ILUs, this approach directly bounds the size of the preconditioned error and results in increased robustness and scalability, especially for applications governed by PDEs, due to its close connection with algebraic multilevel methods [5]. Specifically, for efficient preconditioning, only a small amount of fill-in is allowed during the factorization, resulting in a modest number of floating-point arithmetic operations per nonzero entry of the sparse coefficient matrix.
Parallelism in the computation of ILUPACK preconditioners is exposed by means of nested dissection applied to the adjacency graph representing the nonzero connectivity of the sparse coefficient matrix. Nested dissection is a partitioning heuristic that relies on the recursive separation of graphs. The graph is first split by a vertex separator into a pair of independent subgraphs, and the same process is then recursively applied to each independent subgraph. The resulting hierarchy of independent subgraphs is highly amenable to parallelization. In particular, the inverse-based preconditioning approach is applied in parallel to the blocks corresponding to the independent subgraphs while those corresponding to the separators are updated. When the bulk of the former blocks has been eliminated, the updates computed in parallel within each independent subgraph are merged together, and the algorithm enters the next level in the nested dissection hierarchy. The same process is recursively applied to the separators in the next level, and the algorithm proceeds bottom-up in the hierarchy until the root finally completes the parallel computation of the preconditioner; see Figure 1.
The type of parallelism described above can be expressed by a binary task dependency tree, where nodes represent concurrent tasks and arcs specify dependencies among them. The parallel execution of this tree on multicore processors is orchestrated by a runtime which dynamically maps tasks to threads (cores) in order to improve load balance during the computation of the ILU preconditioner. At execution time, thread migration is prevented using the Linux routine sched_setaffinity. The runtime keeps a shared queue of ready tasks (i.e., tasks with all their dependencies fulfilled) which are executed by the threads in FIFO order. This queue is initialized with the tasks corresponding to the independent subgraphs. Idle threads have to wait for new ready tasks. When a given thread completes the execution of a task, the task's parent is enqueued provided its sibling has already been completed as well.
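To make the scheduling just described concrete, the following C fragment offers a minimal sketch (not ILUPACK's actual code) of the performance-oriented worker loop: each thread is pinned to a core with sched_setaffinity, pops ready tasks from a shared mutex-protected FIFO queue, and enqueues the parent of a finished task once both children are done; the types and helper names (task_t, enqueue, try_dequeue) are illustrative.

    #define _GNU_SOURCE
    #include <pthread.h>
    #include <sched.h>
    #include <stdatomic.h>
    #include <stddef.h>

    /* Illustrative task type: a kernel to run, a link to the parent task and a
       counter of unfinished children (2 for inner tree nodes, 0 for leaves). */
    typedef struct task {
        void (*kernel)(struct task *);
        struct task *parent;
        atomic_int pending_children;
    } task_t;

    #define MAX_TASKS 1024

    /* Shared FIFO of ready tasks, protected by a mutex. */
    static task_t *ready[MAX_TASKS];
    static size_t head = 0, tail = 0;
    static pthread_mutex_t qlock = PTHREAD_MUTEX_INITIALIZER;
    static atomic_int tasks_left;    /* set to the total task count by the (omitted) setup code */

    static void enqueue(task_t *t) {
        pthread_mutex_lock(&qlock);
        ready[tail++ % MAX_TASKS] = t;   /* assumes fewer than MAX_TASKS pending tasks */
        pthread_mutex_unlock(&qlock);
    }

    static task_t *try_dequeue(void) {
        task_t *t = NULL;
        pthread_mutex_lock(&qlock);
        if (head != tail) t = ready[head++ % MAX_TASKS];
        pthread_mutex_unlock(&qlock);
        return t;
    }

    /* Performance-oriented (power-oblivious) worker: poll until work appears. */
    static void *worker(void *arg) {
        int core = (int)(size_t)arg;

        cpu_set_t set;                   /* pin the thread to one core */
        CPU_ZERO(&set);
        CPU_SET(core, &set);
        sched_setaffinity(0, sizeof(set), &set);

        while (atomic_load(&tasks_left) > 0) {
            task_t *t = try_dequeue();
            if (t == NULL)
                continue;                /* busy-wait: spin until a task is ready */

            t->kernel(t);                /* e.g., factorize one subgraph block */
            atomic_fetch_sub(&tasks_left, 1);

            /* The parent becomes ready once both of its children are done. */
            if (t->parent &&
                atomic_fetch_sub(&t->parent->pending_children, 1) == 1)
                enqueue(t->parent);
        }
        return NULL;
    }

As Section 5 will show, the busy-wait in this loop is precisely what keeps "idle" cores out of the power-saving C-states.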
The most expensive operation involved in the preconditioned iterative solution of the linear system is the application of the multilevel preconditioner, which is in turn decomposed into two steps: the multilevel forward (FS) and backward substitutions (BS). The aforementioned task dependency tree also describes the parallelism available within both computations. However, while the FS proceeds bottom-up towards the root of the tree, the BS proceeds in the opposite direction. In order to maximize data locality during the parallel multithreaded execution of both operations, the mapping of threads to tasks resulting from the (dynamic load-balancing) computation of the preconditioner is re-used, so that each thread knows in advance which tasks it is in charge of (i.e., static mapping). The runtime uses a different task queue for each thread and substitution algorithm. For the FS, the queue of each thread is initialized with the leaves it is in charge of, and new (ready) tasks are enqueued on the corresponding queues as soon as their dependencies are fulfilled (i.e., as soon as their children tasks are completed).
Fig. 1 Nested dissection applied to the adjacency graph associated with a sparse matrix and the corresponding task dependency tree.
For the BS, only the root task is initially included in the corresponding queue. As soon as the root task is completed, its children are enqueued on the corresponding queues, and the parallel execution (orchestrated by the runtime) proceeds top-down while taking care of task dependencies until the computation of the leaves is completed. The other operations involved in the preconditioned iterative solution stage (i.e., the sparse matrix-vector product and vector operations) are split and mapped conformally with the FS and BS steps in order to maximize data locality. Moreover, a careful management of shared data, by maintaining consistent or inconsistent copies of the matrix and the vectors, avoids synchronization steps, except for the reductions required to compute inner products. Further details on the mathematical foundations of the parallel algorithms, their implementation, and the runtime operation can be found in [5].
3 Environment Setup
In all our experiments, we employ a scalable symmetric positive definite sparse linear system of dimension n = N^3, resulting from the partial differential equation −∆u = f in a 3D unit cube Ω = [0, 1]^3 with Dirichlet boundary conditions u = g on ∂Ω, discretized using a uniform mesh of size h = 1/(N+1). We set N = 252, which yields the largest linear system that fitted into the main memory of the target machine, with about 16·10^6 unknowns and 111·10^6 nonzero entries in the coefficient matrix. All tests were performed using IEEE double-precision arithmetic. Execution time, power and energy are always reported in seconds (s), Watts (W) and Joules (J), respectively.
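For reference, a coefficient matrix of this kind is obtained with the standard 7-point finite-difference discretization of the 3D Laplacian (consistent with the reported ≈111·10^6 nonzeros for N = 252). A minimal assembly sketch in CSR format, with the Dirichlet data eliminated into the right-hand side and the 1/h^2 scaling omitted, could look as follows; the routine name and interface are illustrative.

    #include <stdlib.h>

    /* Assemble the 7-point Laplacian on an N x N x N grid in CSR format.
       For N = 252 this yields n = N^3 ~ 16e6 rows and ~111e6 nonzeros, so
       32-bit indices still suffice. Boundary contributions to the right-hand
       side and the 1/h^2 scaling are omitted for brevity. */
    void assemble_laplace3d(int N, int **rowptr, int **colind, double **val) {
        long n = (long)N * N * N;
        *rowptr = malloc((n + 1) * sizeof(int));
        *colind = malloc(7 * n * sizeof(int));
        *val    = malloc(7 * n * sizeof(double));

        long nnz = 0;
        for (int k = 0; k < N; k++)
          for (int j = 0; j < N; j++)
            for (int i = 0; i < N; i++) {
                long row = ((long)k * N + j) * N + i;
                (*rowptr)[row] = (int)nnz;
                /* Couplings to the six grid neighbours (value -1 each). */
                if (k > 0)     { (*colind)[nnz] = (int)(row - (long)N * N); (*val)[nnz++] = -1.0; }
                if (j > 0)     { (*colind)[nnz] = (int)(row - N);           (*val)[nnz++] = -1.0; }
                if (i > 0)     { (*colind)[nnz] = (int)(row - 1);           (*val)[nnz++] = -1.0; }
                (*colind)[nnz] = (int)row; (*val)[nnz++] = 6.0;             /* diagonal entry */
                if (i < N - 1) { (*colind)[nnz] = (int)(row + 1);           (*val)[nnz++] = -1.0; }
                if (j < N - 1) { (*colind)[nnz] = (int)(row + N);           (*val)[nnz++] = -1.0; }
                if (k < N - 1) { (*colind)[nnz] = (int)(row + (long)N * N); (*val)[nnz++] = -1.0; }
            }
        (*rowptr)[n] = (int)nnz;
    }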
We employ two target servers in our experiments. The first platform, wt_amd, is equipped with a single AMD Opteron 6128 processor (8 cores), 24 Gbytes of RAM, and runs the Linux Ubuntu operating system (kernel 2.6.32-220.4.1.el6.x86_64). The second server, wt_int, contains two Intel Xeon E5504 processors (4 cores per socket), 32 Gbytes of RAM, and runs Linux Ubuntu (kernel 2.6.32-220.4.1.el6.x86_64) as well. These two types of processors are representative of current multicore technology, and adhere to the ACPI standard [20] for the CPU power-saving modes. Concretely, the AMD processor features 5 performance states (P-states P0 to P4) and three operating or power states (C-states C0, C1 and C1E). The Intel processor has 4 P-states (P0 to P3) and 4 C-states (C0, C1, C3 and C6). Information on the voltage–frequency pairs (V_i, f_i) associated with each P-state (Pi) is collected in Table 1. State C0 is the normal operating mode (i.e., the CPU is operative) while higher numbers correspond to deeper sleep modes, where more circuits and signals of the processor are turned off, saving more power, but requiring a longer time to go back to the C0 mode.
Platform   P-state, Pi   Vi (V)   fi (GHz)   BWi (GB/s)
wt_amd     P0            1.23     2.00       30.29
           P1            1.17     1.50       24.63
           P2            1.12     1.20       20.46
           P3            1.09     1.00       17.48
           P4            1.06     0.80       14.00
wt_int     P0            1.04     2.00       12.72
           P1            1.01     1.87       12.58
           P2            0.98     1.73       12.61
           P3            0.95     1.60       12.55

Table 1 P-states, associated voltage–frequency pairs (Vi in Volts and fi in GHz), and core-to-memory bandwidth (BWi, in GB/sec.) measured with the stream benchmark.
From the practical point of view, the AMD and Intel servers differ in two important aspects:
– The frequency of the AMD cores can be adjusted independently while, on the Intel platform, all cores in the same processor run at the same frequency. In particular, if the cores of a processor from wt_int operate at frequency fi, and we instruct one of these cores to run at fj > fi (using the Linux cpufreq utility; see the sketch after this list), the remaining three cores in the same socket will also transition to operate at fj. On the other hand, if the cores of a processor from wt_int run at frequency fi, and we instruct one core to run at frequency fj < fi, there will be no change.
– On the AMD platform, the bandwidth between the cores and the main memory varies with the processor frequency while, on the Intel platform, this bandwidth is independent of the processor frequency. To illustrate this behavior, column BWi of Table 1 reports the bandwidth to the main memory experienced by a single core running the stream microbenchmark [29] at different frequencies.
We note that the bandwidth-frequency dependence is a design decision specific to each processor type: more recent processors such as, e.g., the Intel Xeon E5-2670 "Sandy Bridge" seem to follow the AMD 6128's strategy and reduce the bandwidth with the processor frequency [18]; on the other hand, some other processors like the AMD 6274 "Interlagos" apparently abandon this approach [27].
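As an illustration of how such a frequency (P-state) request can be issued from user space, the following hedged C sketch writes to the Linux cpufreq sysfs files; it assumes the userspace governor is available and that the caller has the necessary privileges, and error handling is reduced to a minimum.

    #include <stdio.h>

    /* Request a target frequency (in kHz) for one core through the Linux
       cpufreq sysfs interface. On wt_amd the request is honoured per core;
       on wt_int it is effectively applied to the whole socket (see above).
       Returns 0 on success and -1 on failure. */
    int set_core_frequency(int core, long freq_khz) {
        char path[128];
        FILE *f;

        /* Select the userspace governor so that explicit requests are honoured. */
        snprintf(path, sizeof(path),
                 "/sys/devices/system/cpu/cpu%d/cpufreq/scaling_governor", core);
        if ((f = fopen(path, "w")) == NULL) return -1;
        fprintf(f, "userspace\n");
        fclose(f);

        /* Write the requested frequency. */
        snprintf(path, sizeof(path),
                 "/sys/devices/system/cpu/cpu%d/cpufreq/scaling_setspeed", core);
        if ((f = fopen(path, "w")) == NULL) return -1;
        fprintf(f, "%ld\n", freq_khz);
        fclose(f);
        return 0;
    }

For example, set_core_frequency(0, 2000000) would request state P0 (2.0 GHz in Table 1) for core 0.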
In our experiments, power samples were obtained from the 12-Volt lines connecting the power supply unit to the motherboard of the target platform², using an internal wattmeter composed of a National Instruments (NI) analog input module (9205) plugged into a NI chassis (cDAQ-9178) and a board of current transducers (LEM HXS 20-NP). The wattmeter is connected via an Ethernet link to a separate power-tracing server that runs a daemon application to collect power samples from the internal wattmeter. The measurement application is built by calling routines from the pmlib library [8,13]. We set the sampling rate to 1 kSample/sec., which is high enough to obtain reliable measures for the power model and the remaining experiments.

² In a separate experiment [15], it was determined that the aggregate power supplied by the 3-Volt and 5-Volt lines during the execution of ILUPACK on these two platforms remains practically constant and, furthermore, is negligible compared with that measured from the 12-Volt lines.
Our multithreaded implementation of ILUPACK is built on top of the OpenMP interface available with Intel icc (version 12.1.3) on both platforms. Performance (core activity) traces were captured using the Extrae+Paraver (versions 2.2.1+4.3.4) tracing environment [25]. Traces of CPU power modes were recorded using the Linux interface to read and write model-specific registers (MSRs) and dumped into Paraver-compatible files for interactive visualization and analysis.
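The traces of CPU power modes in this paper come from MSRs; as a simpler user-space illustration of the same kind of measurement, the sketch below reads the per-core C-state residency counters exposed by the Linux cpuidle sysfs interface (the set of states listed depends on the kernel and the cpuidle driver in use).

    #include <stdio.h>

    /* Print the name and accumulated residency (in microseconds) of every
       C-state of one core, as exposed by the Linux cpuidle sysfs interface. */
    void print_cstate_residency(int core) {
        for (int state = 0; ; state++) {
            char path[128], name[64];
            unsigned long long time_us;
            FILE *f;

            snprintf(path, sizeof(path),
                     "/sys/devices/system/cpu/cpu%d/cpuidle/state%d/name", core, state);
            if ((f = fopen(path, "r")) == NULL) break;      /* no more states */
            if (fscanf(f, "%63s", name) != 1) { fclose(f); break; }
            fclose(f);

            snprintf(path, sizeof(path),
                     "/sys/devices/system/cpu/cpu%d/cpuidle/state%d/time", core, state);
            if ((f = fopen(path, "r")) == NULL) break;
            if (fscanf(f, "%llu", &time_us) != 1) { fclose(f); break; }
            fclose(f);

            printf("cpu%d %-8s %llu us\n", core, name, time_us);
        }
    }

Sampling these counters before and after a run gives the fraction of time each core spent in every C-state.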
4 The Power Model and the CPU Power-Saving Modes
We open this section by revisiting the following simple model from [10] for the total (aggregate) power dissipated by an application at a given instant of time t:

    P^T = P^Y + P^P = P^Y + P^U + P^C,                                            (1)

where P^P is the power dissipated by the CPU processor(s) and P^Y is the power dissipated by the remaining components (system power corresponding, e.g., to DDR RAM chips, motherboard, etc.). Furthermore, P^U is the power dissipated by the uncore [19] elements of the processor (e.g., last-level cache, memory controllers, core interconnect, etc.); and P^C is the power for the cores (including the in-core cache levels, floating-point units, branch logic, etc.).
Platform   P-state, Pi   P^Y_i    α_i      β_burn,i   β_busy,i   β_gemm,i   P^U_i
wt_amd     P0            84.83    140.94   14.70      13.50      13.82      56.11
           P1            77.85    137.47    7.71       6.10      10.01      59.62
           P2            73.83    131.35    5.48       4.58       7.40      57.52
           P3            71.86    127.87    4.12       3.33       5.46      56.01
           P4            72.46    124.89    3.10       2.28       4.27      52.43
wt_int     P0            33.43     64.44    9.48       7.10      11.12      31.01
           P1            33.43     63.38    8.19       6.16       9.84      29.95
           P2            33.43     64.10    7.33       5.43       9.03      30.67
           P3            33.43     64.72    6.34       4.64       7.81      31.29

Table 2 Parameters for the simple power model for the cpuburn, busy and gemm benchmarks (kernels) on wt_amd and wt_int. Note that P^U_i = α_i − P^Y_i and P^C_{k,i}(c) = P^{C1}_{k,i} · c = β_{k,i} · c, with k, i and c denoting, respectively, the kernel type, P-state and number of cores.
While our power models refer to the power dissipated at a given instant of time t, in most of the experiments that follow we will report instead the average power for the application, as this allows us to compile the information for the complete execution into a single figure. Furthermore, this easily connects the results with the total energy consumption.
Given a platform with all cores in state Pi, for simplicity we will assume that P^Y_i and P^U_i (i.e., the system and uncore powers in state Pi) remain constant during the execution of the application. In practice, starting from an idle (cold) platform, these two factors grow with the system temperature until their sum reaches a plateau [12]. To avoid this effect, we assume that there is a continuous workload to run on the platform and, in order to mimic this situation, all our tests are performed on a "hot" system, with this state reached by initially warming the cores with an execution of the same kernel or application for a given period of time. Also, we will assume that P^U_i is independent of the application that runs on the platform. In practice, this is not the case, but our results will show that the errors introduced by this simplification are small, and can be easily accommodated into the model.
To obtain practical values for the power model, we proceed as follows. For simplicity, let us assume that all c active cores of the platform run the same type of task (kernel) k, in the same state Pi, during the entire execution time. In this scenario, we can consider that the total power at instant t equals the average power. For P^Y_i, we thus simply set all the cores of the platform to each P-state using cpufreq, measure the power with the platform idle for 30 seconds, and average the results: between 71.86 and 84.83 W, depending on the state Pi, for wt_amd; and 33.43 W for all P-states on wt_int; see column P^Y_i in Table 2.
The estimation of P^U_i and P^C_i is more elaborate, as it is difficult to separate these two components, and the second also depends on the application that is being run. To achieve this, let us start by refining (1) to capture the total power for the execution of c copies of task k, with the active cores in state Pi:

    P^T_{k,i}(c) = P^Y_i + P^U_i + P^C_{k,i}(c) ≈ P^Y_i + P^U_i + P^{C1}_{k,i} · c,        (2)

where P^{C1}_{k,i} denotes the power dissipated by a single core in state Pi running task k.
To estimate the missing parameters in (2), P^U_i and P^{C1}_{k,i}, we leverage three compute-intensive kernels: the cpuburn microbenchmark³, a simple busy-wait test consisting of a "while (1) ;" loop, and the general dense matrix-matrix product (gemm) routine implemented as part of Intel MKL operating with double-precision data. Specifically, we executed these tests for 60 seconds and averaged the power draw, for an increasing number of active cores c (all in the same P-state) of the machines (8 cores on both wt_amd and wt_int), while the remaining cores stay in an inactive C-state.
Applying linear regression to the data obtained from this experimental evaluation, we obtained linear models for the total power of the form

    P^T_{k,i}(c) = α_{k,i} + β_{k,i} · c.                                          (3)

³ http://manpages.ubuntu.com/manpages/precise/man1/cpuburn.1.html.
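The fit in (3) is an ordinary least-squares regression of the measured average power against the number of active cores. A self-contained sketch of that computation is shown below; the sample values in main() are placeholders, not measured data.

    #include <stdio.h>

    /* Ordinary least-squares fit of P = alpha + beta * c to the samples
       (c[s], P[s]), s = 0..m-1, as used to obtain the parameters in Table 2. */
    void fit_power_model(int m, const double *c, const double *P,
                         double *alpha, double *beta) {
        double sc = 0, sP = 0, scc = 0, scP = 0;
        for (int s = 0; s < m; s++) {
            sc  += c[s];
            sP  += P[s];
            scc += c[s] * c[s];
            scP += c[s] * P[s];
        }
        *beta  = (m * scP - sc * sP) / (m * scc - sc * sc);
        *alpha = (sP - *beta * sc) / m;
    }

    int main(void) {
        /* Placeholder samples: average power (W) for 1 to 8 active cores. */
        double c[8] = { 1, 2, 3, 4, 5, 6, 7, 8 };
        double P[8] = { 155, 170, 183, 198, 212, 226, 240, 254 };
        double alpha, beta;
        fit_power_model(8, c, P, &alpha, &beta);
        printf("alpha = %.2f W, beta = %.2f W/core\n", alpha, beta);
        return 0;
    }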
Fig. 2 Power dissipated as a function of the number of active cores for kernels cpuburn, busy and gemm (top, middle and bottom, respectively) on wt_amd (left) and wt_int (right).
The values for α_{k,i} and β_{k,i} are given in the corresponding columns of Table 2, and the relation between the models and the experimental data is graphically captured in Figure 2.

These regression models show a very close fit to the experimental data, yielding roughly the same value of α_{k,i} for all three kernel types (the largest variation between the three was 2.11 % and the average difference 0.61 %). Therefore, in the following we use α_i − P^Y_i as an estimation of P^U_i (see Table 2), the uncore power dissipated by a socket in state Pi (which agrees with our assumption that the uncore power is independent of the kernel type); and we set P^C_{k,i}(c) = β_{k,i} · c, so that P^{C1}_{k,i} = β_{k,i}.
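As a quick sanity check of these parameters, instantiating (3) for gemm on wt_amd in state P0 with c = 8 active cores gives

    P^T_{gemm,0}(8) = α_0 + β_{gemm,0} · 8 = 140.94 + 13.82 · 8 ≈ 251.5 W,

of which P^Y_0 = 84.83 W is system power, P^U_0 = α_0 − P^Y_0 = 56.11 W is uncore power, and the remaining 13.82 · 8 = 110.56 W is core power.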
Fig. 3 Traces of core activity, power and C-states (top, middle and bottom, respectively) during the computation of the ILU preconditioner using the performance-oriented, power-oblivious runtime, with all cores of wt_int in state P0.
Fig. 4 Traces of core activity, power and C-states (top, middle and bottom, respectively) during (part of) the iterative solution stage using the performance-oriented, power-oblivious runtime, with all cores of wt_int in state P0.
5 Leveraging the C-States in ILUPACK
We next investigate the exploitation of the C-states made by two implementations of the runtime underlying ILUPACK, and relate their costs to the previous power model. In the experiments in this section, we employ all the cores of the target platforms, and we set the Linux governor to ondemand, operating all the active cores in the same state P0 during the entire execution (i.e., we do not allow voltage-frequency changes).
In Section 2 we explained that the task-parallel calculation of the preconditioner in ILUPACK is organized as a directed task graph, with the structure of a binary tree and bottom-up dependencies, from the nodes (tasks) at each level to those in the level immediately above it. The subsequent iterative process basically requires the solution of (lower and upper) triangular linear systems per iteration, with tasks that are also organized as binary task-trees, with bottom-up (lower triangular system) or top-down (upper triangular system) dependencies. In any case, when the tasks of these trees are dynamically mapped to a multicore platform by the runtime, the execution should result in periods of time during which certain cores are idle, depending on the number of tasks of the tree, their computational complexity, the number of cores of the system, etc. It is basically these idle periods that we could expect the operating system to leverage, by promoting the corresponding cores into a power-saving C-state (sleep mode).
Figure 3 presents the execution trace, power consumption, and C-states observed during the computation of the ILUPACK preconditioner, using the original (power-oblivious) runtime, with all cores of wt_int in state P0. Surprisingly, the results are quite different from what we had expected: idle periods do not show a transition of the corresponding core to a power-saving C-state and the associated reduction of the power rate. Figure 4 reports an analogous behavior for the (preconditioned) iterative solution stage on wt_int (and similar results were also obtained for both stages on wt_amd). A closer inspection of the runtime that leverages the task-parallelism in ILUPACK reveals the reason for these surprising results. Concretely, in the original implementation of ILUPACK's runtime, upon encountering no tasks ready to be executed, "idle" threads simply perform a "busy-wait" (polling) on a condition variable until a new task is available. This strategy thus prevents the operating system from promoting the associated cores into a power-saving C-state because the threads are not actually idle (but doing useless work). This performance-oriented decision is far from uncommon, being adopted in runtimes like OmpSs (SMPSs) [28] or libflame+SuperMatrix [17] as well. Furthermore, the same performance-oriented but power-oblivious behavior appears, for example, when a synchronous GPU kernel is invoked with the default operation mode of CUDA [24] (the CPU remains in an active polling, waiting for the GPU to finish), or with the polling mode of certain MPI implementations (e.g., MVAPICH [23]).
As an alternative to the previous power-hungry strategy, we developed a power-aware version of the runtime underlying ILUPACK, which applies an "idle-wait" (blocking) whenever a thread does not encounter a task ready for execution and, thus, becomes inactive. (Note that setting the necessary conditions for the operating system to promote the cores into a power-saving C-state is as much as we can do, since we cannot explicitly enforce the transition from the application code.) As in the original version of the runtime, upon completing the execution of a task, a thread updates the corresponding dependencies, identifying those tasks, if any, that have become ready for execution. However, in the power-aware runtime, the thread also ensures that the number of active (non-blocked) threads is, at least, equal to the number of ready tasks, releasing blocked threads if needed. The effect of the idle-wait on the power trace and the use of the C-states of wt_int is illustrated in Figure 5, for the calculation of the preconditioner, and Figure 6, for the iterative solution stage. Compared with the performance-oriented (but power-hungry) implementation of the runtime (see Figures 3 and 4), the new runtime effectively allows inactive cores to enter a power-saving C-state, thus yielding the sought-after power reduction.
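To make the contrast with the polling loop sketched in Section 2 explicit, the following hypothetical C fragment (again, illustrative names, not ILUPACK's actual code) blocks idle workers on a condition variable and wakes one blocked thread per newly published ready task, so that the number of active threads matches the number of ready tasks.

    #include <pthread.h>
    #include <stdbool.h>
    #include <stddef.h>

    /* Illustrative ready-task pool shared by all workers. */
    typedef struct { void (*kernel)(void *); void *arg; } task_t;

    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
    static pthread_cond_t  work_available = PTHREAD_COND_INITIALIZER;
    static task_t *ready[1024];
    static size_t  nready = 0;
    static bool    done   = false;   /* set by the master thread when the tree is exhausted */

    /* Called for every task whose dependencies have just been fulfilled. */
    void publish_ready_task(task_t *t) {
        pthread_mutex_lock(&lock);
        ready[nready++] = t;
        pthread_cond_signal(&work_available);   /* release one blocked thread */
        pthread_mutex_unlock(&lock);
    }

    /* Power-aware worker: idle-wait (block) instead of polling. A blocked
       thread leaves its core idle, so the OS can promote it to a C-state. */
    void *worker(void *unused) {
        (void)unused;
        for (;;) {
            pthread_mutex_lock(&lock);
            while (nready == 0 && !done)
                pthread_cond_wait(&work_available, &lock);   /* sleep, do not spin */
            if (done && nready == 0) { pthread_mutex_unlock(&lock); break; }
            task_t *t = ready[--nready];
            pthread_mutex_unlock(&lock);

            t->kernel(t->arg);   /* run the task outside the critical section */
            /* ...update dependencies and call publish_ready_task() as needed... */
        }
        return NULL;
    }

The design point is simply that pthread_cond_wait suspends the thread in the kernel, which provides the "necessary condition" mentioned above for the OS to send the core to sleep.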
The pending question, however, is whether the adoption of the power-aware runtime comes with a performance penalty which may blur the energy benefits, as in most cases the key factor is energy instead of power. Table 3 compares the execution time, average power, and energy consumption of the two runtimes, showing that fortunately this is not the case for the computation of the preconditioner and the iterative solution, on either of the two target platforms when operating in state P0. Consider for example platform wt_amd: for a minimal increase in the total execution time, from 286.28 s to 287.91 s, we observe reductions in the (average) power from 240.17 W to 227.16 W for the preconditioner, and from 269.27 W to 237.80 W for the solver. The outcome is a decrease of the total energy from 75,163.74 J to 67,758.84 J (−10.88 %). The power reductions attained by the power-aware implementation with respect to the power-oblivious case are given in percentage in the columns labeled as "Experimental" of Table 4 (averaged over 10 repeated executions). Combined with the negligible impact of the runtime on the execution time, these power figures also justify similar energy savings.
Fig. 5 Traces of core activity, power and C-states (top, middle and bottom, respectively) during the computation of the ILU preconditioner using the power-aware runtime, with all cores of wt_int in state P0.
Fig. 6 Traces of core activity, power and C-states (top, middle and bottom, respectively) during (part of) the iterative solution stage using the power-aware runtime, with all cores of wt_int in state P0.
Let us now relate the power-energy reductions attained by the reimplementation of the runtime that leverages the CPU C-states to the power model of the previous section. For this purpose, we need to i) account for the periods of "idle" time during the execution of ILUPACK, with both the original and the energy-aware variants, as well as ii) assess the impact of replacing a busy-wait (polling) with an idle-wait (blocking). In order to tackle i), we follow a pragmatic approach, and simply execute the actual codes and measure the actual idle and computation times, e.g., using the tracing framework Extrae+Paraver, while collecting power samples with the internal wattmeter and the pmlib library.
Platform   Runtime     Preconditioner                 Iterative solver               Total
           type        Time     Power    Energy       Time     Power    Energy       Time     Energy
wt_amd     Oblivious   66.11    240.17   15,878.82    220.17   269.27   59,284.92    286.28   75,163.74
           Aware       66.52    227.16   15,112.06    221.39   237.80   52,646.41    287.91   67,758.84
wt_int     Oblivious   54.01    137.58    7,431.67    146.67   162.34   23,809.87    200.68   31,241.69
           Aware       54.47    125.70    6,847.47    148.15   150.17   22,247.93    202.62   29,095.40

Table 3 Execution time, average power and energy of the power-oblivious and power-aware implementations of the runtime (top and bottom rows of each platform, respectively) with all cores operating in state P0.
For ii), we use the data in Table 2 for P^Y_0 and P^U_0, and estimate P^{C1}_{ilu,0} = β_{ilu,0} using a procedure for ILUPACK analogous to that exposed for the three benchmark kernels, combined with linear regression. Finally, we assume that a core promoted to a sleep state does not dissipate any core power.
Let P^T_{pilu,0}(c) and P^T_{bilu,0}(c) denote, respectively, the total power dissipated during the execution of ILUPACK, using the power-oblivious (polling, pilu) and power-aware (blocking, bilu) runtimes, with c cores in state P0. Since now some cores may be inactive during a certain part of the execution, we need to adapt (2), which now becomes

    P^T_{pilu,0}(c) = f_{pilu,0} · (P^Y_0 + P^U_0 + P^{C1}_{ilu,0} · c)
                    + (1 − f_{pilu,0}) · (P^Y_0 + P^U_0 + P^{C1}_{polling,0} · c).        (4)

The first term of the addition captures the cost of the cores performing useful work during the computation of ILUPACK (like (2)), and is multiplied by f_{pilu,0}, which corresponds to the fraction of the total time that this computation occupies. The second part of the addition thus represents the remaining fraction of the total time, (1 − f_{pilu,0}), and captures the power dissipation of the cores performing polling. In our evaluation, we set P^{C1}_{polling,0} = P^{C1}_{busy,0}, as the underlying procedures are similar.

On the other hand, for P^T_{bilu,0}(c), we have

    P^T_{bilu,0}(c) = f_{bilu,0} · (P^Y_0 + P^U_0 + P^{C1}_{ilu,0} · c)
                    + (1 − f_{bilu,0}) · (P^Y_0 + P^U_0),                                 (5)

as we assumed that a core in blocking mode dissipates no power (i.e., P^{C1}_{blocking,0} = 0).
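The theoretical ratios reported next in Table 4 follow directly from (4) and (5). The sketch below spells out that computation; the values of β_{ilu,0} and of the busy fractions f_{pilu,0} and f_{bilu,0} used in main() are placeholders (in the paper they are measured with Extrae+Paraver), while the remaining parameters are taken from Table 2 for wt_int in state P0.

    #include <stdio.h>

    /* Average total power of the polling (eq. 4) and blocking (eq. 5) runtimes
       with c cores in state P0. */
    double p_polling(double PY, double PU, double b_ilu, double b_poll,
                     double f, int c) {
        return f * (PY + PU + b_ilu * c) + (1.0 - f) * (PY + PU + b_poll * c);
    }

    double p_blocking(double PY, double PU, double b_ilu, double f, int c) {
        return f * (PY + PU + b_ilu * c) + (1.0 - f) * (PY + PU);
    }

    int main(void) {
        /* wt_int, state P0 (Table 2): system, uncore and busy-wait core power. */
        double PY = 33.43, PU = 31.01, b_poll = 7.10;
        /* Placeholders: ILUPACK per-core power and busy-time fractions. */
        double b_ilu = 9.0, f_pilu = 0.80, f_bilu = 0.80;
        int c = 8;

        double ratio = p_blocking(PY, PU, b_ilu, f_bilu, c) /
                       p_polling(PY, PU, b_ilu, b_poll, f_pilu, c);
        printf("theoretical power ratio: %.2f %%\n", 100.0 * ratio);
        return 0;
    }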
Table 4 compares the values of the theoretical ratios P^T_{bilu,0}(c)/P^T_{pilu,0}(c) with the experimental data (averaged over 10 different executions), showing a very close match between the two: below 2 % for wt_int and slightly larger, about 4 %, for wt_amd. These results confirm the benefits of the power-aware runtime, but also the accuracy of the power model. For all other frequencies, as we will see next, the model always predicted the power ratio with an error below 2 %.
6 Impact of the P-states on ILUPACK
In this section we evaluate the effect of the different P-states available for each processor on the performance-power-energy trade-off of ILUPACK. For that purpose, we set the Linux governor mode to userspace, and operate all the cores of the platforms in the same P-state. In the following, we always employ the power-aware version of the runtime. Therefore, we assume that, when idle, a core will remain in one of the deep power-saving C-states (C1 or higher), consuming a negligible amount of power.
The general consensus is that, for a memory-bound computation, some benefit may result from operating the system cores at low frequencies. The reason is that, although there exists a linear dependence between the core performance and the frequency, the effect on the execution time of a memory-bound algorithm should be minor because the key for this type of computation is not core performance but memory bandwidth. On the other hand, for current multicore technology, a reduction of frequency is associated with a decrease of voltage (see Table 1) and, because of the relation of the static power to V^2 and of the dynamic power to V^2 · f, in principle we can expect a significant reduction of the power draw. However, the balance between these two factors, time and power, on the energy efficiency is delicate, and other elements also play a role. Whether these variations of time and power yield a loss or a gain for ILUPACK from the point of view of energy efficiency is thus the question to investigate in this section.
Platform   Preconditioner                Iterative solver
           Theoretical   Experimental    Theoretical   Experimental
wt_amd     89.58         93.88           91.47         87.36
wt_int     90.11         90.64           93.65         92.05

Table 4 Expected and observed (theoretical and experimental, respectively) power ratios, in %, between the power-aware implementation of the runtime and the power-oblivious one, with all cores running in state P0.
Platform   P-state,   Preconditioner                 Iterative solver
           Pi         Time     Power    Energy       Time     Power    Energy
wt_amd     P0          66.52   227.17   15,112.06    221.39   237.80   52,656.41
           P1          81.56   197.77   16,131.31    252.54   207.98   52,525.50
           P2          97.16   172.68   16,778.25    288.31   187.14   53,954.38
           P3         113.25   160.16   18,138.44    326.11   176.10   57,426.43
           P4         137.62   151.52   20,852.36    384.34   167.36   64,321.79
wt_int     P0          54.47   125.70    6,847.47    148.15   150.17   22,247.94
           P1          57.31   119.23    6,833.73    147.16   145.37   21,392.83
           P2          60.65   114.16    6,924.03    151.35   140.75   21,302.70
           P3          65.29   108.95    7,114.07    164.85   132.35   21,819.16

Table 5 Execution time, power and energy of the power-aware implementation of the runtime, with all cores in state Pi.
Platform   Pi/P0    ∆fi      ∆BWi     Preconditioner                 Iterative solver
                                      ∆Time    ∆Power   ∆Energy      ∆Time    ∆Power   ∆Energy
wt_amd     P1/P0    −25.00   −18.68    22.60    −12.94     6.74       14.07    −12.54    −0.22
           P2/P0    −40.00   −32.45    46.06    −23.99    11.02       30.23    −21.30     2.48
           P3/P0    −50.00   −42.29    70.24    −29.50    20.02       47.30    −25.94     9.07
           P4/P0    −60.00   −53.78   106.88    −33.30    37.98       73.60    −29.62    22.17
wt_int     P1/P0     −6.50    −1.10     5.21     −5.15    −0.20       −0.67     −3.20    −3.84
           P2/P0    −13.50    −0.86    11.35     −9.18     1.12        2.16     −6.27    −4.25
           P3/P0    −20.00    −1.33    19.86    −13.33     3.89       11.27    −11.86    −1.93

Table 6 Variations of frequency, bandwidth, execution time, power and energy ratios (%), of the power-aware implementation of the runtime, between state Pi and state P0.
Table 5 reports the impact of the P-states on the time, (average) power consumption and energy efficiency of the two stages of ILUPACK, calculation of the preconditioner and iterative solution, on both platforms. To help with the analysis of these results, Table 6 offers the variation ratios of bandwidth and of the results that are experienced when moving from state P0 to state Pi, calculated as 100 · (M_i − M_0)/M_0, where M_0 and M_i denote, respectively, the values of the magnitudes (parameters or results) in states P0 and Pi.
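For example, the variation of the preconditioner time on wt_amd when moving from P0 to P1 is obtained from the corresponding entries of Table 5 as ∆Time = 100 · (81.56 − 66.52)/66.52 ≈ 22.6 %, which matches the value reported in the first row of Table 6.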
The first aspect to notice is that the presumed independence between execution time and core frequency does not hold on wt_amd. This should come as no surprise, as our experiment in Table 1 already revealed that there is a strong connection between the core frequency and the memory bandwidth on this platform (see also column ∆BWi in Table 6). The combined decreases of frequency and memory bandwidth when moving from P0 to a higher P-state (between −25 % and −60 % for the former and from −18.68 % to −53.78 % for the latter) explain the increases of execution time for both the preconditioning stage (22.60–106.88 %) and the iterative solver (14.07–73.60 %) on this platform.
The behavior of wt_int is quite different, which is partially explained because now the reduction of frequency does not bring a decrease of memory bandwidth. Still, for the preconditioner, the reduction of frequency when moving from P0 to a higher P-state (−6.50 % for P1, −13.50 % for P2 and −20.00 % for P3) basically matches the increase of execution time for this stage (5.21, 11.35 and 19.86 %, respectively). We can take this as an indicator that the computation of the preconditioner (or, at least, parts of it) is not as memory-bound a computation as one could, in principle, presume. The results are different for the iterative solver. In this case, there is no significant difference in the execution time when running the stage in states P0 or P1, and the time increase when moving from P0 to P2/P3 is 2.16/11.27 %, which is still lower than what could be explained by the reduction of frequency alone.
From the performance point of view, the major conclusion of this analysis is that the best solution is to always run ILUPACK with all the cores operating at the highest frequency (i.e., in state P0), though in some cases —in particular, the iterative solver executed in states P0, P1 and P2— the differences are small on wt_int.
Performance is crucial and, under some circumstances, energy efficiency is also vital. From that point of view, a reduction of power is beneficial only if it does not yield an increase of execution time that blurs the positive effects on energy consumption. For the particular case of ILUPACK, the results in Table 5 show that, on wt_amd, the most energy-efficient solution is to execute the preconditioner with all cores in state P0 but the iterative solver in state P1. For wt_int, however, using states P1, P2 or P3 for the iterative solver results in small but significant energy savings, from −1.93 to −4.25 %.
Let us connect again the power variations attained with the different P-states and the models for the total power. For this purpose, we relate P^T_{bilu,i}(c) and P^T_{bilu,0}(c), using

    P^T_{bilu,i}(c) = f_{bilu,i} · (P^Y_i + P^U_i + P^{C1}_{ilu,i} · c)
                    + (1 − f_{bilu,i}) · (P^Y_i + P^U_i),                                 (6)

and the experimental data. Table 7 reports the accuracy of our model in capturing the experimental behavior due to the variations of the P-state on ILUPACK, with an error of at most 3.08 % for wt_amd and even smaller for wt_int.
7 Concluding Remarks
We can list two main contributions of this work: i) the implementation of an energy-aware runtime for the complete preconditioner+iterative solve process in ILUPACK; and ii) the elaboration and experimental characterization of simple models for the (average) power that explain/justify the variations observed for the new energy-aware runtime and the effect of the different P-states, for this particular application and two different multicore architectures.
The introduction of the energy-aware runtime results from the experimental observation that, in an energy-oblivious execution of the original runtime for ILUPACK, idle threads with no useful task to execute simply poll until new work is available. As a result, these threads dissipate a significant amount of power in current processors, for no practical performance benefit in the particular case of ILUPACK. Our energy-aware implementation replaces this behavior with a more power-friendly one that blocks idle threads until new work is available. This requires a careful reorganization of the underlying runtime, to avoid deadlocks and ensure a rapid response that does not impair performance. In our experiments, we observed savings in the energy usage between 7 and 13 %, at practically no cost from the performance point of view, which are clearly connected to the impact of the C-states through our power model.
In summary, the approach adopted in this paper is based on the exploitation of idle periods during the concurrent execution of a task-parallel version of ILUPACK for multithreaded architectures. By replacing the busy-waits of the runtime in charge of the execution with idle-waits, we favor the introduction of race-to-idle, which in turn allows the operating system to promote the hardware into a more energy-efficient C-state. We believe that the same technique can be applied to other task-parallel scientific codes and, as part of ongoing work, we are currently embedding this approach into a general runtime like OmpSs, modified to embrace idle-wait, so that we can evaluate the performance-power-energy trade-offs for other task-parallel applications.
Busy-wait and idle-wait are analogous to well-known operating system concepts like spinlocks and mutexes, respectively. The reason that idle-wait is beneficial for ILUPACK is that the number of changes between busy and idle periods is moderate and the duration of these periods is "long enough".
Platform   Pi/P0    Preconditioner                Iterative solver
                    Theoretical   Experimental    Theoretical   Experimental
wt_amd     P1/P0    88.05         85.48           84.17         87.64
           P2/P0    78.96         76.38           76.56         79.65
           P3/P0    73.50         70.78           71.60         74.85
           P4/P0    69.73         66.65           68.77         70.80
wt_int     P1/P0    95.62         95.54           96.47         96.47
           P2/P0    90.84         90.80           91.25         91.84
           P3/P0    87.21         86.94           85.96         87.77

Table 7 Expected and observed (theoretical and experimental, respectively) power ratios (%), of the power-aware implementation of the runtime, between state Pi and state P0.
This depends on a number of factors, including not only the number and duration of the periods but also, e.g., the costs in time and energy of blocking/releasing the threads, the costs in time and energy of the transitions between different C-states, etc.
In theory, there exists a linear relation between performance and frequency, which could be expected to be even sublinear (at least on the Intel processor) for a presumably memory-bound computation like ILUPACK, and a quadratic/cubic relation between energy and voltage-frequency. However, the analysis of the time-power-energy trade-off when the cores operate in a certain P-state (voltage-frequency pair), with the energy-aware version of the runtime, reveals the high impact of the idle power and, to a lesser degree, of the uncore power, which clearly favor a shorter execution time over lower power dissipation rates. This is also contrasted with and accurately captured by our power model.
Acknowledgments
The researchers from the Universidad Jaume I were supported by project CICYT TIN2011-23283 of the Ministerio de Ciencia e Innovación and FEDER, and by the FPU program of the Ministerio de Educación, Cultura y Deporte. A. F. Martín was partially funded by the UPC postdoctoral grants under the programme "BKC5-Atracció i Fidelització de talent al BKC".
References
1. The Green500 list. Available at http://www.green500.org.
2. ILUPACK. Available at http://www.icm.tu-bs.de/~bolle/ilupack/.
3. The Top500 list, 2010. Available at http://www.top500.org.
4. Susanne Albers. Energy-efficient algorithms. Commun. ACM, 53:86–96, May 2010.
5. José I. Aliaga, Matthias Bollhöfer, Alberto F. Martín, and Enrique S. Quintana-Ortí. Exploiting thread-level parallelism in the iterative solution of sparse linear systems. Parallel Computing, 37(3):183–202, 2011.
6. José I. Aliaga, Matthias Bollhöfer, Alberto F. Martín, and Enrique S. Quintana-Ortí. Parallelization of multilevel ILU preconditioners on distributed-memory multiprocessors. In State of the Art in Scientific and Parallel Computing – PARA 2010, number 7133 in Lecture Notes in Computer Science, pages 162–172. 2012.
7. José I. Aliaga, Manuel F. Dolz, Alberto F. Martín, Rafael Mayo, and Enrique S. Quintana-Ortí. Leveraging task-parallelism in energy-efficient ILU preconditioners. In 2nd International Conference on ICT as Key Technology against Global Warming – ICT-GLOW, number 7453 in Lecture Notes in Computer Science, pages 55–63. 2012.
8. P. Alonso, R. M. Badia, J. Labarta, M. Barreda, M. F. Dolz, R. Mayo, E. S. Quintana-Ortí, and Ruymán Reyes. Tools for power-energy modelling and analysis of parallel scientific applications. In 41st Int. Conf. on Parallel Processing – ICPP, pages 420–429, 2012.
9. P. Alonso, M. F. Dolz, R. Mayo, and E. S. Quintana-Ortí. Energy-efficient execution of dense linear algebra algorithms on multi-core processors. Cluster Computing, 16(3):497–509, 2013.
10. P. Alonso, M. F. Dolz, R. Mayo, and E. S. Quintana-Ortí. Modeling power and energy of the task-parallel Cholesky factorization on multicore processors. Computer Science – Research and Development, 2012. Available online.
11. M. F. Dolz, J. C. Fernández, R. Mayo, and E. S. Quintana-Ortí. EnergySaving Cluster Roll: power saving system for clusters. In Architecture of Computing Systems – ARCS 2010, number 5974 in Lecture Notes in Computer Science, pages 162–173, 2010.
12. AnandTech Forums. Power-consumption scaling with clockspeed and Vcc for the i7-2600K. http://forums.anandtech.com/showthread.php?t=2195927, 2011.
13. S. Barrachina, M. Barreda, S. Catalán, M. F. Dolz, G. Fabregat, R. Mayo, and E. S. Quintana-Ortí. An integrated framework for power-performance analysis of parallel scientific workloads. In 3rd Int. Conf. on Smart Grids, Green Communications and IT Energy-aware Technologies, 2013.
14. M. Duranton et al. The HiPEAC vision for advanced computing in horizon 2020, 2013.
15. M. El Mehdi Diouri, M. F. Dolz, O. Glück, L. Lefèvre, P. Alonso, S. Catalán, R. Mayo, and E. S. Quintana-Ortí. Solving some mysteries in power monitoring of servers: take care of your wattmeters! In Proc. Energy Efficiency in Large Scale Distributed Systems conference – EE-LSDS 2013, 2013. To appear.
16. Wu-chun Feng, Xizhou Feng, and Rong Ge. Green supercomputing comes of age. IT Professional, 10(1):17–23, Jan.–Feb. 2008.
17. FLAME project home page. http://www.cs.utexas.edu/users/flame/.
18. S. Gunther, A. Deval, and T. Burton. Energy-efficient computing: power-management system on the Intel Nehalem family of processors. Intel Technology Journal, 15(1), 2011.
19. D. L. Hill, T. Huff, S. Kulick, and R. Safranek. The Uncore: a modular approach to feeding the high-performance cores. Intel Technology Journal, 14(3), 2010.
20. HP Corp., Intel Corp., Microsoft Corp., Phoenix Tech. Ltd., and Toshiba Corp. Advanced configuration and power interface specification, revision 5.0, 2011.
21. J. Dongarra et al. The international ExaScale software project roadmap. Int. J. of High Performance Computing & Applications, 25(1):3–60, 2011.
22. J. Kolodziej, S. U. Khan, L. Wang, A. Byrski, N. Min-Allah, and S. A. Madani. Hierarchical genetic-based grid scheduling with energy optimization. Cluster Computing, 16(3):591–609, 2013.
23. MVAPICH: MPI over InfiniBand, 10GigE/iWARP and RoCE. http://mvapich.cse.ohio-state.edu/.
24. NVIDIA Corporation. CUDA toolkit 4.0 readiness for CUDA applications, 4.0 edition, March 2011.
25. Paraver: the flexible analysis tool. http://www.cepba.upc.es/paraver.
26. E. Pinheiro, R. Bianchini, E. V. Carrera, and T. Heath. Dynamic cluster reconfiguration for power and performance. In Proc. of Workshop on Compilers and Operating Systems for Low Power, pages 75–93, 2003.
27. R. Schöne, D. Hackenberg, and D. Molka. Memory performance at reduced CPU clock speeds: an analysis of current x86_64 processors. In 2012 USENIX Conference on Power-Aware Computing and Systems, 2012.
28. SMP superscalar project home page. http://www.bsc.es/plantillaG.php?cat_id=385.
29. The STREAM benchmark: computer memory bandwidth. http://www.streambench.org/.
30. G. L. Valentini, W. Lassonde, S. U. Khan, N. Min-Allah, S. A. Madani, J. Li, L. Zhang, L. Wang, N. Ghani, J. Kolodziej, H. Li, A. Y. Zomaya, C. Xu, P. Balaji, A. Vishnu, F. Pinel, J. E. Pecero, D. Kliazovich, and P. Bouvry. An overview of energy efficiency techniques in cluster computing systems. Cluster Computing, 16(1):3–15, 2013.
31. L. Wang, S. U. Khan, D. Chen, J. Kolodziej, R. Ranjan, C. Xu, and A. Y. Zomaya. Energy-aware parallel task scheduling in a cluster. Future Generation Comp. Syst., 29(7):1661–1670, 2013.