Leveraging Task-Parallelism in Energy-Efficient ILU preconditioners by Aliaga, José Ignacio et al.
Leveraging Task-Parallelism
in Energy-Eﬃcient ILU Preconditioners
José I. Aliaga¹, Manuel F. Dolz¹, Alberto F. Martín²,
Rafael Mayo¹, and Enrique S. Quintana-Ortí¹
¹ Dpto. de Ingeniería y Ciencia de Computadores, Universitat Jaume I,
12.071–Castellón, Spain
{aliaga,dolzm,mayo,quintana}@icc.uji.es
² Centre Internacional de Mètodes Numèrics en Enginyeria (CIMNE),
08860–Castelldefels, Spain
amartin@cimne.upc.edu
Abstract. We analyze the energy-performance balance of a task-parallel
computation of an ILU-based preconditioner for the solution of sparse
linear systems on multi-core processors. In particular, we elaborate a
theoretical model for the power dissipation, and employ it to explore
the eﬀect of the processor power states on the time-power-energy in-
teraction for this calculation. Armed with the insights gained from this
study, we then introduce two energy-saving mechanisms which, incorpo-
rated into the runtime in charge of the parallel execution of the algo-
rithm, improve energy eﬃciency by 6.9%, with a negligible impact on
performance.
1 Introduction
The solution of sparse systems of linear equations is a ubiquitous problem in
scientific and engineering applications which has been tackled in many projects
during the past decades [10]. One ongoing effort has resulted in ILUPACK
(Incomplete LU decomposition PACKage), a software package that combines ILU
factorizations with iterative Krylov subspace methods. Compared with sparse
direct solvers, this class of methods has proven quite competitive for a wide
range of applications (especially those arising from 3D PDEs) because of its
moderate computational and memory requirements [10].
Due to the scale of the linear systems appearing in many applications, and the
computational cost of the numerical methods, most solvers target parallel
architectures. Following a trend adopted for dense linear algebra operations, we
have recently demonstrated the performance benefits of exploiting task-parallelism
within ILUPACK for the solution of sparse linear systems on multi-core processors [2].
Unfortunately, existing libraries in the domain of linear algebra are mostly
energy-oblivious, in spite of the growing pressure for energy-efficient systems [4,8,9]
and the significant assets that energy-aware software can yield [1].
In this paper we address the energy-efficient computation of ILU preconditioners
for the solution of large-scale sparse linear systems on multi-core processors. For
this particular purpose, we leverage our task-parallel calculation of an
ILUPACK-based preconditioner [2] to reduce the power dissipated by inactive cores
via the processor power states (C-states) [7]. In order to do so, we analyze the
impact of carefully shifting unused cores to a certain performance state (P-state),
combined with the elimination of busy-waits for idle threads. Our experiments on an
AMD-based platform demonstrate that the performance overhead introduced by these
techniques is negligible, while the energy savings reach 6.9%.
A. Auweter et al. (Eds.): ICT-GLOW 2012, LNCS 7453, pp. 55–63, 2012.
© Springer-Verlag Berlin Heidelberg 2012
The rest of the paper is structured as follows. After a brief introduction to the
environment setup in the next section, in Section 3 we review how to decompose a
sparse linear system into a collection of tasks which can be dynamically issued for
parallel execution to the cores of a multiprocessor. Next, in Section 4, we introduce
a simple energy model, and offer some insights on the trade-off between power and
performance for this class of algorithms on the target multi-core architecture. In
Section 5 we describe how to adapt the task-parallel preconditioner computation
to enhance energy eﬃciency, and report the beneﬁts of this approach. Finally, we
oﬀer some concluding remarks and a list of future work in Section 6.
2 Environment Setup
All our experiments were performed using IEEE double-precision arithmetic on
wt_amd, a platform equipped with 2 AMD Opteron 6128 processors (total of 16
cores) and 48 GB of RAM. An internal DC powermeter, connected to the 12 V
lines between the power supply unit and the mainboard, samples the nodal power
dissipated by the system mainboard with a frequency of 25 Hz. Therefore, in the
following experiments we will focus only on the power dissipated by the elements
contained in the mainboard and neglect power sinks due to other components
such as disk, graphics card, network card, etc.
The P-states available in wt_amd and the associated voltage/frequency pairs
(columns labelled as $VCC_i$/$f_i$) are listed in the first three columns of Table 1.
A core of wt_amd can be promoted into one of these P-states via, e.g., the cpufreq
utility.
In our experiments we employ a standard benchmark problem for the solution
of PDEs: the Laplacian equation $-\Delta u = f$ in a 3D unit cube $\Omega = [0,1]^3$
with Dirichlet boundary conditions $u = g$ on $\partial\Omega$, discretized using a
uniform mesh of size $h = \frac{1}{N+1}$. The resulting linear system $Au = b$
presents an $n \times n$ sparse symmetric positive definite coefficient matrix with
seven nonzero elements per row, and $n = N^3$ unknowns. We set $N = 252$, which
results in a linear system with roughly 16 million unknowns and 111 million nonzero
entries in $A$.
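As a quick sanity check on the problem size, the following sketch counts the unknowns and nonzero entries of the 7-point finite-difference Laplacian on an N × N × N grid of interior nodes. The function name and the closed-form count 7N³ − 6N² (each of the six cube faces removes N² neighbour couplings) are our own illustration and are not taken from ILUPACK.

```python
def laplacian_3d_size(N):
    """Return (n, nnz) for the 7-point Laplacian on an N x N x N grid of
    interior nodes with Dirichlet boundary conditions."""
    n = N ** 3
    # Every row has a diagonal entry plus up to six neighbours; nodes next to
    # one of the six faces lose the neighbour that would lie on the boundary.
    nnz = 7 * N ** 3 - 6 * N ** 2
    return n, nnz


n, nnz = laplacian_3d_size(252)
print(f"n = {n:,} unknowns, nnz = {nnz:,} nonzero entries")
```

For N = 252 this gives about 16.0 million unknowns and 111.6 million nonzeros, in line with the figures quoted above.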
3 Task-Parallel Computation of ILU Preconditioners
The approach to multilevel preconditioning in ILUPACK relies on the so-called
inverse-based ILU factorizations. Unlike other classical threshold-based ILUs, this
approach directly bounds the size of the preconditioned error and results in
increased robustness and scalability, especially for applications governed by PDEs,
due to its close connection with algebraic multilevel methods [2]. The kernels in
charge of the computation of inverse-based ILUs are typically memory-bounded.
Specifically, for efficient preconditioning, only a small amount of fill-in is allowed
during the factorization, resulting in a modest number of floating-point arithmetic
operations per non-zero entry of the sparse coefficient matrix.
Table 1. P-states and associated performance parameters: voltage/frequency pairs
($VCC_i$ in Volts / $f_i$ in GHz); model of total power using $c$ cores,
$P^{T}_i(c) = \alpha_i + \beta_i \cdot c$, with $\alpha_i$ in Watts and $\beta_i$ in
Watts/core; variations of static, dynamic and total dissipated power, $\Delta P^{S}_i$,
$\Delta P^{D}_i$ and $\Delta P^{T}_i(16)$ (all in %), respectively; and processor-RAM
bandwidth ($BW_i$, in GB/sec.) and its variation $\Delta BW_i$ (in %).
P-state  VCC_i  f_i   α_i     β_i   ΔP^S_i  ΔP^D_i  ΔP^T_i(16)  BW_i   ΔBW_i
P0       1.23   2.00  168.59  9.12  –       –       –           30.29  –
P1       1.17   1.50  161.10  5.77  -9.52   -32.14  -17.58      24.63  -18.67
P2       1.12   1.20  155.90  4.23  -17.09  -50.25  -28.34      20.46  -32.44
P3       1.09   1.00  152.94  3.15  -21.47  -60.73  -33.26      17.48  -42.30
P4       1.06   0.80  150.61  2.44  -25.73  -70.30  -39.85      14.00  -53.77
Parallelism in the computation of ILUPACK preconditioners is exposed by
means of nested dissection applied to the adjacency graph representing the non-
zero connectivity of the sparse coeﬃcient matrix. Nested dissection is a parti-
tioning heuristic which relies on the recursive separation of graphs. The graph
is ﬁrst split by a vertex separator into a pair of independent subgraphs and the
same process is next recursively applied to each independent subgraph. The re-
sulting hierarchy of independent subgraphs is highly amenable to parallelization.
In particular, the inverse-based preconditioning approach is applied in parallel to
the blocks corresponding to the independent subgraphs while those correspond-
ing to the separators are updated. When the bulk of the former blocks has been
eliminated, the updates computed in parallel within each independent subgraph
are merged together, and the algorithm enters the next level in the nested dis-
section hierarchy. The same process is recursively applied to the separators in
the next level and the algorithm proceeds bottom-up in the hierarchy until the
root ﬁnally completes the parallel computation of the preconditioner.
The type of parallelism described above can be expressed by a binary task
dependency tree, where nodes represent concurrent tasks and arcs specify depen-
dencies among them. The parallel execution of this tree on multi-core processors
is orchestrated by a runtime which dynamically maps tasks to threads (cores)
in order to improve load balance during the computation of the ILU
preconditioner. At execution time, thread migration between cores is prevented
using the Linux routine sched_setaffinity. This runtime keeps a shared
queue of ready tasks (i.e., tasks with their dependencies fulfilled) which are
executed by the threads in FIFO order. This queue is initialized with the tasks
corresponding to the independent subgraphs. Idle threads spin in a busy-wait
polling for new ready tasks. When a given thread completes the execution of
a task, its parent task is enqueued provided the sibling of the former task has
already been completed. Further details on the mathematical foundations of the
parallel algorithms and the runtime operation can be found in [2].
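The scheduling rule described above (leaves first, a parent becomes ready once both children are done, ready tasks served in FIFO order) can be sketched as follows. This is a single-threaded illustration under our own assumptions: the `Task` class, the `run` function and the task names are hypothetical and do not reproduce the actual ILUPACK runtime, where several threads pull tasks from the shared queue concurrently.

```python
from collections import deque


class Task:
    """Node of the binary task dependency tree (hypothetical structure)."""

    def __init__(self, name, left=None, right=None):
        self.name = name
        self.parent = None
        self.children = [c for c in (left, right) if c is not None]
        self.pending = len(self.children)  # children not yet completed
        for child in self.children:
            child.parent = self


def leaves(task):
    """Collect the leaf tasks (independent subgraphs), left to right."""
    if not task.children:
        return [task]
    return [leaf for child in task.children for leaf in leaves(child)]


def run(root, execute=lambda task: None):
    """Process the tree bottom-up with a FIFO queue of ready tasks."""
    ready = deque(leaves(root))
    order = []
    while ready:
        task = ready.popleft()
        execute(task)  # e.g. factorize the block attached to this node
        order.append(task.name)
        parent = task.parent
        if parent is not None:
            parent.pending -= 1
            if parent.pending == 0:  # sibling already done: parent is ready
                ready.append(parent)
    return order


# Two levels of nested dissection: four subgraphs, two separators, one root.
tree = Task("root",
            Task("sep0", Task("sub0"), Task("sub1")),
            Task("sep1", Task("sub2"), Task("sub3")))
print(run(tree))
```

Note how the number of ready tasks shrinks from four to two to one as the traversal climbs the tree; this loss of concurrency is what later leaves threads idle.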
4 Time-Power-Energy and the P-States
Many past studies have analyzed the eﬀect of DVFS on the performance-power
trade-oﬀ; see, e.g., [5]. In order to perform a similar study for the speciﬁc domain
of sparse linear algebra operations on current multi-core processors, we employ
the following simple power model, borrowed from [3]:
$$P^{T} = P^{C} + P^{Y} = P^{S} + P^{D} + P^{Y}, \qquad (1)$$
where $P^{T(otal)}$ is the total power consumption, decomposed into the power
dissipated by the CPU, $P^{C(PU)}$, and that of the remaining components not part
of the CPU logic (system power corresponding, e.g., to RAM), $P^{(S)Y(stem)}$. We
further decompose the CPU power into its static¹ (leakage) and dynamic parts,
$P^{S(tatic)}$ and $P^{D(ynamic)}$, respectively.
We start by obtaining rough estimates of the parameters of the model in (1)
for a system with $c$ active cores in state $P_i$,
$P^{T}_i(c) = P^{S}_i + P^{D}_i(c) + P^{Y}$. In
the top plot of Figure 1 we report the power consumption when activity (in
the form of cores performing a "while(1);" loop) is added to the system. (To
ensure stabilized values, each test was run for 700 secs before the power was
measured with our internal DC powermeter.) The power dissipated when the
platform is idle, also reported in the figure, is 80.15 Watts and can be taken
as an estimate for $P^{Y}$. (When idle, the power dissipated by the platform at
other frequencies did not vary significantly.) On the other hand, applying linear
regression to adjust, e.g., the total power in state P0 as a function of the number
of active cores yields the linear model
$P^{T}_0(c) = \alpha_0 + \beta_0 \cdot c = 168.59 + 9.12 \cdot c$
Watts. (The values for $\alpha_i$ and $\beta_i$ for all the P-states can be consulted
in Table 1.) Thus, $\alpha_0 = 168.59$ Watts accounts for the power needed to maintain
the different components in the mainboard in a power-active mode, and we can
approximate $P^{S}_0 \approx \alpha_0 - P^{Y} = 88.44$ Watts and
$P^{D}_0(c) \approx \beta_0 c = 9.12 \cdot c$ Watts.
Consider now the relation between power and the processor voltage/frequency.
In particular, $P^{S}_i$ depends on $VCC_i^2$ while $P^{D}_i(c)$ is a function of
the product $VCC_i^2 \cdot f_i \cdot c$ [3]. Therefore, moving all $c$ cores of the
system from state P0 to a different P-state $P_i$, we can expect a reduction of
$P^{S}_i$ and $P^{D}_i$ as reported, respectively, in the columns labelled as
$\Delta P^{S}_i$ ($= \Delta VCC_i^2$) and $\Delta P^{D}_i$
($= \Delta(VCC_i^2 \cdot f_i)$) of Table 1. (We define the application of the
variation operator $\Delta$ to a magnitude $x_i$ as $\Delta x_i = (x_i - x_0)/x_0$,
where $x_i$ and $x_0$ denote the values of the magnitude obtained in a platform in
state $P_i$ and P0, respectively.) For example, according to this model, promoting
all cores of wt_amd from state P0 to state P1 should result in a total power
consumption
$$P^{T}_1(16) = P^{S}_0 (1 - 0.0952) + P^{D}_0(16) (1 - 0.3214) + P^{Y} = 259.19 \text{ Watts};$$
i.e., a reduction of 17.58% with respect to $P^{T}_0(16)$, with 9.52% due to the
reduction of the static power and 32.14% for the reduction of dynamic power.
The savings in total power when P0 is abandoned for a less expensive P-state
are given in column $\Delta P^{T}_i(16)$ of Table 1. These values agree to an error
below 2.5% with the reductions observed from the linear regression models that fit
the results in the top plot of Figure 1 (see the corresponding values for $\alpha_i$
and $\beta_i$ in Table 1). We take this as a sign that our model for
system/static/dynamic components provides a sound approximation.
¹ Static power is intimately linked with uncore power [6].
[Figure 1, top panel: "Power dissipated as function of number of active cores"; x-axis: # active cores (0 to 16); y-axis: Power (watts); curves: Idle-wait at 2.00 GHz and Busy-wait at 2.00, 1.50, 1.20, 1.00 and 0.80 GHz.]
[Figure 1, bottom panel: "Impact of P-states on performance of the ILU preconditioner"; x-axis: P-states (P0 to P4); y-axis: % variation; series: ΔT_i, ΔP^T_i, ΔE_i, ΔBW_i.]
Fig. 1. Power dissipated as a function of number of active cores (top); and impact
of the P-states on different performance parameters of the computation of the ILU
preconditioner: time, avg. power, energy and bandwidth (bottom).
Table 2. P-states and performance parameters for the computation of the ILU
preconditioner: execution time $T_i$ (in sec.); average power $\bar{P}^{T}_i$ (in
Watts); energy consumption $E_i$ (in Joules); and the corresponding variations (in %).
P-state  T_i    P̄^T_i   E_i        ΔT_i    ΔP̄^T_i  ΔE_i
P0       34.06  282.87  9,634.78   –       –       –
P1       43.57  235.64  10,267.72  21.88   -16.69  6.53
P2       54.48  210.86  11,478.79  59.91   -25.45  19.20
P3       61.58  197.01  12,132.79  80.73   -30.35  25.87
P4       76.50  186.86  14,295.18  124.47  -33.94  48.28
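The power model of this section can be checked numerically with a short script. The sketch below uses the voltage/frequency pairs and regression coefficients from Table 1 and the scaling assumptions stated in the text (static power proportional to VCC², dynamic power proportional to VCC²·f). All function and variable names are illustrative, and the relative gap computed at the end is only one possible way of comparing the model with the regression.

```python
# Values copied from Table 1 and Section 4; all names are illustrative.
P_Y = 80.15                     # idle platform power, estimate of P^Y (Watts)
ALPHA0, BETA0 = 168.59, 9.12    # regression coefficients in state P0
ALPHA1, BETA1 = 161.10, 5.77    # regression coefficients in state P1
VCC = {"P0": 1.23, "P1": 1.17, "P2": 1.12, "P3": 1.09, "P4": 1.06}
FREQ = {"P0": 2.00, "P1": 1.50, "P2": 1.20, "P3": 1.00, "P4": 0.80}


def static_scale(state):
    """Ratio P^S_i / P^S_0, assuming static power scales with VCC^2."""
    return (VCC[state] / VCC["P0"]) ** 2


def dynamic_scale(state):
    """Ratio P^D_i / P^D_0, assuming dynamic power scales with VCC^2 * f."""
    return static_scale(state) * FREQ[state] / FREQ["P0"]


def model_total_power(state, cores):
    """Total power (Watts) predicted by the model for `cores` active cores."""
    p_static0 = ALPHA0 - P_Y      # P^S_0, about 88.44 W
    p_dynamic0 = BETA0 * cores    # P^D_0(c)
    return (p_static0 * static_scale(state)
            + p_dynamic0 * dynamic_scale(state)
            + P_Y)


p0 = model_total_power("P0", 16)
p1 = model_total_power("P1", 16)
regression_p1 = ALPHA1 + BETA1 * 16
gap = abs(p1 - regression_p1) / regression_p1

print(f"Delta P^S_1 = {100 * (static_scale('P1') - 1):.2f}%")
print(f"Delta P^D_1 = {100 * (dynamic_scale('P1') - 1):.2f}%")
print(f"model P^T_1(16) = {p1:.2f} W, variation {100 * (p1 / p0 - 1):.2f}%")
print(f"regression P^T_1(16) = {regression_p1:.2f} W, relative gap {100 * gap:.2f}%")
```

Because the script works with unrounded ratios, its output matches the figures in the text only up to the last printed digit.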
Nevertheless, except for systems with thermal and/or power constraints, the
crucial ﬁgure is energy, not power. Whether the previous reductions in power
lead to energy savings will depend on how the transition to a more power-saving
state aﬀects the execution time. Table 2 reports the execution time, average
power, energy and the corresponding variations of our task-parallel construction
of an ILU preconditioner with the processor cores in diﬀerent P-states, set with
the cpufreq userspace Linux governor. The bottom plot in Figure 1 depicts
the variations graphically.
In principle, for a moderately memory-bounded operation² such as the
computation of the preconditioner, one could expect that the reduction of the
processor frequency $f_i$ would have a minor impact on the execution time.
Surprisingly, this is not the case. This can be explained by the combined decreases
of the computational power (floating-point arithmetic rates vary linearly with $f_i$)
and the memory bandwidth (see columns $BW_i$ and $\Delta BW_i$ of Table 1, which
report values obtained using the STREAM benchmark) that occur on this particular
platform when the frequency is diminished. As a result, the reduction in the average
power does not counterbalance the increase of the execution time, which globally
renders a higher energy consumption when moving away from state P0 to more
power-friendly states. In this line, note that $P^{Y}$ does not depend on $f_i$,
while $P^{S}_i$ depends only linearly on $VCC_i$/$f_i$; as a result, the small
improvements to $P^{S}_i$ and $P^{D}_i$ due to the reduction of $VCC_i$/$f_i$ are
insufficient to compensate for the loss of performance.
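The trade-off discussed above can be reproduced from the measurements in Table 2 by computing the energy as execution time times average power. The following sketch does so; the dictionary and function names are our own, and since the table entries are rounded, the products agree with the reported energies only approximately.

```python
# (execution time in s, average power in W) per P-state, copied from Table 2.
RUNS = {
    "P0": (34.06, 282.87),
    "P1": (43.57, 235.64),
    "P2": (54.48, 210.86),
    "P3": (61.58, 197.01),
    "P4": (76.50, 186.86),
}


def energy(state):
    """Energy in Joules as execution time times average power."""
    time_s, power_w = RUNS[state]
    return time_s * power_w


def variation(x_i, x_0):
    """Relative variation (x_i - x_0) / x_0, expressed in %."""
    return 100.0 * (x_i - x_0) / x_0


for state in RUNS:
    d_power = variation(RUNS[state][1], RUNS["P0"][1])
    d_energy = variation(energy(state), energy("P0"))
    print(f"{state}: E = {energy(state):9.2f} J, "
          f"power {d_power:+7.2f}%, energy {d_energy:+7.2f}%")
```

In every state other than P0 the drop in average power is outweighed by the longer run, so the energy variation is positive.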
5 Saving Energy of Idle Threads for ILUPACK
The analysis in the previous section illustrated that the time-power-energy
balance is rather delicate. In response to this, in this section we propose a strategy
that aims at reallocating a core into a power-friendly state only when the
associated thread is idle. In order to do so, we leverage the existence of inactive
periods during the computation of the preconditioner. In particular, given the
binary tree structure of the task dependencies arising in this operation, we can
expect that the degree of concurrency decreases as the computation proceeds,
yielding the sought-after energy-saving opportunities.
² In a separate experiment, we observed that the computation of the preconditioner
exhibits bursts with performance peaks around $300 \cdot 10^6$ flops/sec. while, in
this platform, the matrix-matrix product attains a sustained rate of
$1.1 \cdot 10^9$ flops/sec.
The top plot in Figure 2 illustrates the core activity during the computation
of the ILU preconditioner, indeed reporting the existence of inactive periods.
In our “conservative” energy-saving strategy, when a thread ﬁnds no task to
execute, it promotes the associated core into state P4. On acquiring a task, the
thread changes the state of the core back to P0, so that work is carried out at
the highest throughput rate.
Nevertheless, the combined use of P-states/idle threads alone has a minor
impact on the performance (execution time) of the preconditioner computation
as, in our initial task-parallel implementation of this operation, an “idle” thread
spins in a busy-wait, polling till a new task is ready for execution, and thus
wasting power. Therefore, we also implemented a more aggressive saving policy,
one where upon becoming jobless, an idle thread promotes the corresponding
core to P4 and then explicitly blocks using POSIX semaphores. In this approach,
when a thread adds t new tasks to the ready queue (because their dependencies
are satisﬁed), it also releases up to min(t, tb) blocked threads, with tb denoting
the number of blocked threads at that instant. Upon becoming active, a thread
immediately raises the state of the corresponding core back to P0. Note that the
use of explicit blocking can potentially introduce a non-negligible overhead, as
the time needed to bring a suspended core back is considerably longer than with a
busy-wait.
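The blocking policy can be sketched with a counting semaphore. The code below is a minimal illustration under our own assumptions, not the runtime used in the experiments: the class `ReadyQueue`, its methods and the `set_pstate` hook are hypothetical names, and the hook merely stands in for whatever mechanism (e.g. the cpufreq interface) changes the P-state of a core.

```python
import threading
from collections import deque


class ReadyQueue:
    """Ready-task queue whose idle workers block instead of spinning."""

    def __init__(self, set_pstate=lambda core, state: None):
        self._tasks = deque()
        self._lock = threading.Lock()
        self._wakeup = threading.Semaphore(0)
        self._blocked = 0              # tb: threads currently blocked
        self._set_pstate = set_pstate  # hypothetical P-state hook

    def put(self, new_tasks):
        """Enqueue t new tasks and release up to min(t, tb) blocked threads."""
        with self._lock:
            self._tasks.extend(new_tasks)
            to_release = min(len(new_tasks), self._blocked)
            self._blocked -= to_release
        for _ in range(to_release):
            self._wakeup.release()
        return to_release

    def get(self, core):
        """Return the next task, blocking in state P4 while none is ready."""
        while True:
            with self._lock:
                if self._tasks:
                    return self._tasks.popleft()
                self._blocked += 1
            self._set_pstate(core, "P4")  # jobless: lower the core first ...
            self._wakeup.acquire()        # ... then block on the semaphore
            self._set_pstate(core, "P0")  # active again: back to full speed


queue = ReadyQueue()
print(queue.put(["task0", "task1"]))  # number of threads released
print(queue.get(core=0))
```

A thread registers itself as blocked while still holding the lock, so a concurrent `put` either sees the thread in the count and posts the semaphore for it, or runs first and leaves a task the thread will find; in both cases no wake-up is lost. Termination (for instance via a sentinel task) is deliberately left out of the sketch.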
The two plots in Figure 2 illustrate the power dissipated during the compu-
tation of the ILU preconditioner when all threads operate in state P0 (top), as
well as when idle threads are blocked and the corresponding cores are promoted
to state P4 (bottom). The experiments report energy savings of 6.92% for the
strategy that leverages inactive periods with respect to the execution with the
original runtime. If we only consider the CPU energy consumption corresponding
to the application (i.e., we subtract the constant factor $P^{Y}$), the savings rise
to 9.92%. Since an improvement factor of 6.92% may seem small, let us relate it
to the potential savings dictated by the length of the inactive periods and the
distribution of the energy consumption among its system, static and dynamic
parts. In particular, we determined that the length of the inactive periods in the
execution with the original runtime accounts for 23.70% of the time. Consider
next the total energy consumption,
$E^{T} = E^{S} + E^{D} + E^{Y} = (P^{S} + P^{D} + P^{Y}) \cdot T$,
where $T$ is the execution time. From the experimental data, we have that
$$E^{D} = E^{T} - (P^{S} + P^{Y}) \cdot T = 9287.46 - (88.44 + 80.15) \cdot 33.42 = 3652.42 \text{ Joules}.$$
Therefore the dynamic component of the energy represents 39.32% of the total
energy and, by blocking idle threads we can expect, at most, a reduction of the
total energy by $39.32 \cdot 0.2370 = 9.32\%$. Hence, the savings attained by the
blocking mechanism, 6.92%, are close to this theoretical upper bound.
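The bound derived above is plain arithmetic and can be replayed as follows. The constants are the figures quoted in Sections 4 and 5; the variable names are ours, and the recomputed values coincide with those in the text only up to rounding.

```python
# Figures copied from Sections 4 and 5; variable names are illustrative.
E_TOTAL = 9287.46       # total energy with the original runtime (Joules)
T_EXEC = 33.42          # execution time (s)
P_STATIC = 88.44        # P^S (Watts)
P_SYSTEM = 80.15        # P^Y (Watts)
IDLE_FRACTION = 0.2370  # share of the time spent in inactive periods
MEASURED = 0.0692       # savings obtained with the blocking mechanism

# Dynamic energy is what remains after removing static and system energy.
e_dynamic = E_TOTAL - (P_STATIC + P_SYSTEM) * T_EXEC
dynamic_share = e_dynamic / E_TOTAL
# At best, all dynamic energy spent during inactive periods is saved.
upper_bound = dynamic_share * IDLE_FRACTION

print(f"E^D = {e_dynamic:.2f} J ({100 * dynamic_share:.2f}% of the total)")
print(f"upper bound on savings = {100 * upper_bound:.2f}%")
print(f"fraction of the bound attained = {MEASURED / upper_bound:.2f}")
```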
[Figure 2: traces of core activity and power; legend: Task exec., Busy-wait, Blocking.]
Fig. 2. Trace of core activity and power during the computation of the ILU
preconditioner using the original runtime (top) and the energy-enhanced version
(bottom).
The impact on the execution time, on the other hand, is insignificant:
variations are below ±1%, likely due to the particular mapping of tasks to cores,
demonstrating the negligible overhead of the energy-saving mechanism.
6 Conclusions and Future Work
A general conclusion from this study is that, for a mildly memory-bounded
operation, the reduction of power attained by lowering the voltage/frequency
does not necessarily result in energy savings due to the increase of execution
time. The computation of an ILU preconditioner for the application considered
in this paper is one such example where a reduction of voltage/frequency renders
an increase in energy consumption. This is partly due to the large fraction of
power dissipation that corresponds to the system and static components, which
benefit little or not at all from a reduction of the frequency.
Therefore, any eﬀort at reducing the energy consumption of these compu-
tations must carefully leverage the processor performance (or P-) states so as
to avoid increasing the execution time. Fortunately, in the case of our parallel
code for the computation of the ILU preconditioner, the operation is already
divided into well-deﬁned tasks, which allows us to avoid busy-waits and exploit
the presence of inactive periods by promoting cores running vacant threads into
a power-friendly state.
As future work, we recognize it is important to confirm the results obtained
for the Laplacian benchmark on the AMD platform, using other applications
leading to large-scale sparse linear systems as well as Intel-based platforms.
(While we have partially conducted experiments towards this goal, we could
not include them due to lack of space.) Furthermore, we also plan to integrate
the energy-saving mechanisms into the iterative CG method, thus yielding a
complete energy-aware iterative solver (calculation of preconditioner+iterative
solver) for large-scale sparse systems. Finally, despite its simplicity, we found the
energy model rather useful and, therefore, we plan to enhance it by incorporat-
ing, e.g., the eﬀect of temperature.
Acknowledgments. This research was supported by the CICYT project
TIN2011-23283 and FEDER.
References
1. Albers, S.: Energy-efficient algorithms. Commun. ACM 53, 86–96 (2010)
2. Aliaga, J.I., Bollhöfer, M., Martín, A.F., Quintana-Ortí, E.S.: Exploiting
thread-level parallelism in the iterative solution of sparse linear systems.
Parallel Computing 37(3), 183–202 (2011)
3. AnandTech Forums. Power-consumption scaling with clockspeed and Vcc for the
i7-2600K (2011), http://forums.anandtech.com/showthread.php?t=2195927
4. Feng, W.-C., Feng, X., Ge, R.: Green supercomputing comes of age.
IT Professional 10(1), 17–23 (2008)
5. Freeh, V.W., Lowenthal, D.K., Pan, F., Kappiah, N., Springer, R., Rountree, B.L.,
Femal, M.E.: Analyzing the energy-time trade-off in high-performance computing
applications. IEEE Trans. Parallel Distrib. Syst. 18, 835–848 (2007)
6. Gupta, V., Brett, P., Koufaty, D., Reddy, D., Hahn, S., Schwan, K., Srinivasa, G.:
The forgotten 'uncore': On the energy-efficiency of heterogeneous cores. In: Proc.
2012 USENIX Annual Technical Conference (to appear, 2012)
7. HP Corp., Intel Corp., Microsoft Corp., Phoenix Tech. Ltd., and Toshiba Corp.
Advanced configuration and power interface specification, revision 5.0 (2011)
8. Dongarra, J., et al.: The international ExaScale software project roadmap. Int. J.
of High Performance Computing & Applications 25(1), 3–60
9. Duranton, M., et al.: The HiPEAC vision (2010),
http://www.hipeac.net/roadmap
10. Saad, Y.: Iterative Methods for Sparse Linear Systems. SIAM Publications (2003)
