Voltage Overscaling Algorithms for Energy-Efficient Workflow Computations With Timing Errors by Cavelan, Aurélien et al.
Voltage Overscaling Algorithms for Energy-Efficient
Workflow Computations With Timing Errors
Aurélien Cavelan, Yves Robert, Hongyang Sun, Frédéric Vivien
To cite this version:
Aurélien Cavelan, Yves Robert, Hongyang Sun, Frédéric Vivien. Voltage Overscaling
Algorithms for Energy-Efficient Workflow Computations With Timing Errors. FTXS ’15: 5th
Workshop on Fault Tolerance for HPC at eXtreme Scale, Jun 2015, Portland, United States.
ACM, FTXS ’15 Proceedings of the 5th Workshop on Fault Tolerance for HPC at eXtreme
Scale, pp.8, 2015, <10.1145/2751504.2751508>. <hal-01199250>
HAL Id: hal-01199250
https://hal.inria.fr/hal-01199250
Submitted on 25 Sep 2015
Aurélien Cavelan2,1, Yves Robert1,2,3, Hongyang Sun1,2 and Frédéric Vivien2,1
1. ENS Lyon, France
2. INRIA, France
3. University of Tennessee Knoxville, USA
aurelien.cavelan|yves.robert|hongyang.sun|frederic.vivien@inria.fr
ABSTRACT
We propose a software-based approach using dynamic volt-
age overscaling to reduce the energy consumption of HPC
applications. This technique aggressively lowers the supply
voltage below nominal voltage, which introduces timing er-
rors, and we use Algorithm-Based Fault-Tolerance (ABFT)
to provide fault tolerance for matrix operations. We intro-
duce a formal model, and we design optimal polynomial-time
solutions, to execute a linear chain of tasks. Evaluation re-
sults obtained for matrix multiplication demonstrate that our
approach indeed leads to significant energy savings, compared
to the standard algorithm that always operates at nominal
voltage.
Categories and Subject Descriptors
C.4 [Performance of Systems]: Fault tolerance
Keywords
Timing errors; ABFT; voltage overscaling; energy efficiency
1. INTRODUCTION
Reducing energy consumption has become a key challenge,
for both economic and environmental reasons. In the scope
of High-Performance Computing (HPC), green computing en-
compasses the design of energy-efficient algorithms, circuits,
and systems. The dynamic power consumption of microprocessors
is typically of the form $\alpha f V^2$, where α denotes the
effective capacitance, f the frequency and V the operating
voltage [1, 9]. One approach to reduce the energy consump-
tion is thus to lower the frequency and/or the voltage at which
cores operate. This approach is called Dynamic Voltage and
Frequency Scaling (DVFS). Lowering the supply voltage may
seem the best option because it has a quadratic impact on
the dynamic power, while frequency has only a linear impact.
Voltage and frequency, however, cannot be set independently
and at any value. Indeed, the lower the voltage, the higher
the circuit latency, that is, the longer the delay for logic gates
to produce their outputs. Therefore, for any frequency value,
there is a minimal threshold or nominal voltage Vth, at which
the core can safely be used. In practice, given a choice for
the frequency, one can always set the voltage at this threshold
value to save energy, because using a higher voltage would
lead to paying more energy without any benefit.
FTXS’15, June 15 2015, Portland, OR, USA
For a given frequency, if the core is used at a voltage be-
low the nominal voltage Vth, timing errors could occur, that
is, the results of some logic gates could be used before their
output signals reach their final values. The output of the cir-
cuitry could then be incorrect and the result of the overall
computation, at the core level, could be incorrect. Here, we
intentionally use the conditional “could” repeatedly. First,
the whole circuitry can still produce a correct result even
if some logic gates suffer from timing errors. Second, not
all computation paths in a core have the same latency; the
threshold voltage is computed for the worst case and not
all computations correspond to the worst case. Third, there
are process variations in the production of cores and, once
again, the threshold voltage is defined so that the worst core
works safely at that setting. For all these reasons, there is a
significant probability that a computation performed below
nominal voltage completes successfully, at least if the volt-
age is not “too low”. Moreover, circuit manufacturers keep a
safety margin. Cores are specified to be run under a supply
voltage Vdd, which is considerably larger than the threshold
(Vdd > Vth). How can we take advantage of these potential margins
to reduce voltage and, thus, energy consumption?
A first approach is called near-threshold computing (NTC)
[3, 10], where the supply voltage is chosen very close to (but
larger than) the threshold voltage (Vdd ≈ Vth). Work in that
scope mainly concerns the design of NTC circuits that oper-
ate safely and (almost) as quickly as non-NTC circuits, while
providing great energy savings. A more aggressive approach
is to use cores with a supply voltage below the threshold
voltage (Vdd < Vth), which is called voltage overscaling [6,
7]. Most existing work targeting voltage overscaling is hard-
ware oriented and requires special hardware mechanisms to
detect timing errors [6, 8, 7, 4]. Our work is among the very
few existing purely software-based approaches that do not
require any special hardware [11]. Because cores are oper-
ated below threshold voltage, they may be victims of timing
errors, which may induce silent data corruptions (SDC): the
output of the Arithmetic and Logic Unit (ALU) may be incorrect.
Therefore, using cores in such a context requires
mechanisms to detect these errors and to correct them.
Furthermore, the energy cost of these detection and correc-
tion mechanisms should not offset the energy saving due to
the low operational voltage.
One key characteristic of voltage overscaling makes it fun-
damentally different from most work on resilience for High-
Performance Computing (HPC) applications. Indeed, a ubiq-
uitous assumption in HPC is that failures are random. In
other words, like lightning, failures do not strike twice “at
the same place”. A consequence of this lightning assump-
tion is used in most, if not all, fault-tolerant solutions for
HPC. Assume that your computation has been the victim of
a fault. Then, whatever your preferred fault-tolerant solu-
tion, you are going to re-execute your application one way
or the other, in the same computational context. (This re-
execution may be a temporal later-on re-execution, through
a checkpoint-rollback mechanism, a simultaneous spatial re-
execution through replication, etc.) Because of the light-
ning assumption, we know that with high probability the
re-execution will not be the victim of the same failure. If we
are unlucky, the re-execution will fail again, but because of
another failure: the lightning will strike somewhere else.
On the contrary, faults in our context are timing errors: a
signal is used before the processor circuitry has finished com-
puting it. Therefore, these timing errors are deterministic:
if the very same computation is performed in the very same
context (temperature, voltage, value of operands, content of
registers, history of instructions, etc.), the very same faulty
result will be produced. In other words, the lightning always
strikes twice! Consequently, none of the many existing solu-
tions for dealing with failures in HPC can be used to cope
with failures in a voltage overscaling context. Because timing
errors are reproducible (see footnote 1), we have no choice but to re-execute
faulty computations in a different context.
In this paper, we investigate whether one can aggressively
use voltage overscaling, in a purely software-based approach,
to reduce the energy cost of executing a chain of tasks. We
illustrate this approach using matrix multiplication. The
bottom-line question is the following: is it possible to ob-
tain the (correct) result of a matrix multiplication for a lower
energy budget than that of the best DVFS solution? As in
common blocking approaches such as [2], the matrix multipli-
cation is decomposed into a series of multiplications of sub-
matrices. In order to detect potential errors in the product of
these submatrices, we use Algorithm-Based Fault Tolerance
(ABFT) [5]. The idea is to execute an elementary matrix
multiplication at a very low voltage and to check its correct-
ness through ABFT. If the result is incorrect, we would then
recompute it in another setting, that is, at a higher voltage,
or in the worst case at nominal voltage. We assume the en-
ergy cost of executing an elementary matrix multiplication
at each available voltage is known, as well as the probability
of encountering timing errors at each voltage, through prior
profiling. The algorithmic problem is then to decide, knowing
these costs and probabilities, at which voltage to start exe-
cuting the elementary matrix multiplications and at which
voltage to re-execute them in case of failure. Should we go
directly for the nominal voltage or should we risk once again
an execution at a voltage below threshold?
We stress a major difference between our work and other al-
gorithmic work focusing on finding “good” tradeoffs between
performance (e.g., execution time, throughput) and energy
consumption. In this work, we solely target energy mini-
mization. This is because we use a fixed frequency (hence
guaranteeing performance), and use aggressive voltage over-
scaling to save energy. The main contributions of this paper
are:
• A formal model for the problem, which includes the
mathematical consequences of the “failure strikes twice”
property when computing conditional probabilities.
• An optimal polynomial-time strategy to execute either
a single task or a chain of tasks.
• A set of numerical evaluations demonstrating that our
approach does lead to significant energy savings.
Footnote 1: Although timing errors are deterministic, they cannot be forecast
in practice. Indeed, this would require considering all potential parameters
in the execution: voltage, temperature, operands, content
of registers, etc.
The rest of this paper is organized as follows. In Section 2,
we introduce a formal model for timing errors. We present
the optimal algorithms in Section 3 and their evaluation in
Section 4. We provide some final remarks in Section 5.
2. MODEL
In this section, we formally state our assumptions on timing
errors. Then, we introduce the main notations. Finally, we
investigate the impact of the assumptions on the conditional
probabilities of success/failure. Because of the “lightning
strikes twice” property, conditional probabilities are completely
different from what is usually enforced for the resilience of
HPC applications.
2.1 Timing errors
Silent errors caused by electro-magnetic radiation or cosmic
rays strike the system randomly. On the contrary, timing er-
rors are deterministic in nature. Suppose that a timing error
has occurred under a given execution scenario (voltage, fre-
quency, operand, etc.). Then the same error will occur with
very high probability for another execution under the same
scenario. Fundamentally, timing errors occur because one ad-
justs the system operating voltage below the threshold volt-
age Vth, for a given frequency. Lowering the voltage increases
the delay of the circuit, thereby potentially impacting the
correctness of the computation. Different operations within
the ALU may have different critical-path lengths. Similarly,
for a given operation, different sets of operands may lead
to different critical-path lengths (take a simple addition and
think of a carry rippling to different gates depending upon
the operands). In a nutshell, operations and operands are
not equal with respect to timing errors.
In this paper, we focus on a fixed frequency environment
(this frequency may have been chosen to achieve a given per-
formance). Timing errors depend upon the voltage selected
for execution, and we model this with the following two as-
sumptions:
Assumption 1. Given a computation and an input I, there
exists a threshold voltage Vth(I): using any voltage V be-
low the threshold (V < Vth(I)) will always lead to an in-
correct result, while using any voltage above that threshold
(V ≥ Vth(I)) will always lead to a successful execution. Note
that different inputs for the same computation may have dif-
ferent threshold voltages.
Assumption 2. When a computation is executed under a
given voltage $V$, there is a probability $p_V$ that the computation
will fail, i.e., produce at least one error, on a random input.
This failure probability is computed as $p_V = |\mathcal{I}_f(V)| / |\mathcal{I}|$,
where $\mathcal{I}$ denotes the set of all possible inputs and $\mathcal{I}_f(V) \subseteq \mathcal{I}$
denotes the set of inputs for which the computation will fail
at voltage $V$. Equivalently, $\mathcal{I}_f(V)$ is the set of inputs whose threshold
voltage is strictly larger than $V$, according to Assumption 1.
For any two voltages $V_1$ and $V_2$ with $V_1 \ge V_2$, we have
$\mathcal{I}_f(V_1) \subseteq \mathcal{I}_f(V_2)$ (because of Assumption 1), hence $p_{V_1} \le p_{V_2}$.
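As an illustration of Assumptions 1 and 2, the following minimal Python sketch computes $p_V$ by counting the inputs whose threshold voltage exceeds $V$. The threshold values (in millivolts) and the function name `failure_probability` are hypothetical and chosen only for the example; they are not measurements from the paper.

```python
from fractions import Fraction


def failure_probability(thresholds_mv, voltage_mv):
    """Return p_V = |I_f(V)| / |I| for a finite set of inputs.

    An input fails at voltage V exactly when its threshold voltage is
    strictly larger than V (Assumption 1).
    """
    failing = [t for t in thresholds_mv if t > voltage_mv]
    return Fraction(len(failing), len(thresholds_mv))


# Hypothetical per-input threshold voltages V_th(I), in millivolts.
thresholds_mv = [1200, 1250, 1250, 1300, 1400, 1500]
# Candidate operating voltages, in increasing order.
voltages_mv = [1100, 1250, 1350, 1500]

probs = [failure_probability(thresholds_mv, v) for v in voltages_mv]
for v, p in zip(voltages_mv, probs):
    print(v, p)
```

Because the failing sets are nested, the computed probabilities can only decrease (or stay equal) as the voltage increases, which matches the inequality stated above.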
Since timing errors are essentially silent errors, they do
not manifest themselves until the corrupted data has led to
an unusual application behavior, which may be detected long
after the error has occurred, wasting the entire computation
done so far. Hence, an error-detection mechanism is neces-
sary to ensure timely detection of timing errors, for instance,
after the execution of each task. In this paper, we apply
Algorithm-Based Fault Tolerance (ABFT), which uses check-
sums to detect errors. ABFT has been shown to work well
on matrix operations with low overhead. However, we stress
that the algorithms presented in Section 3 are fully general
and agnostic of the error-detection technique (checksum, er-
ror correcting code, coherence tests, etc.).
To the best of our knowledge, the only paper targeting a
pure algorithmic approach for near-threshold computing or
voltage overscaling is [11]. The work in [11] also considers
matrix multiplication using ABFT. However, it makes the
classical assumption that failures do not strike twice, which
does not apply to timing errors. Their approach only works
when ABFT can detect and correct all the errors that strike during
the computation of an elementary matrix product. In
practice, ABFT is limited to single error correction, which
makes their approach viable only for small matrix blocks and
infrequent errors. Timing errors may strike often when using
very low voltages to decrease energy consumption.
2.2 Notations
We consider computational workflows that can be modeled
as a chain of $n$ tasks, $T_1, T_2, \ldots, T_n$, where $T_{i+1}$ depends on
$T_i$ for $1 \le i \le n-1$. All tasks share the same computational
weight, including the work to verify the correctness of the
result at the end. Hence, all tasks have the same execution
time and energy consumption under a fixed voltage and fre-
quency setting. This framework applies to matrix multiplica-
tion, which we use to instantiate our model in Section 4. The
system frequency is fixed (for application performance). To
reduce energy consumption, we apply dynamic voltage over-
scaling (DVOS), which enables a tradeoff between energy cost
and failure probability. The platform can choose an operating
voltage among a set $\mathcal{V} = \{V_1, V_2, \ldots, V_k\}$ of $k$ discrete
values, where $V_1 < V_2 < \cdots < V_k$. Each voltage $V_\ell$ has an
energy cost per task $c_\ell$ that increases with the voltage, i.e.,
$c_1 < c_2 < \cdots < c_k$. Based on Assumption 2, each voltage
$V_\ell$ also has a failure probability $p_\ell$ that decreases with the
voltage, i.e., $p_1 > p_2 > \cdots > p_k$. We assume that the highest
voltage $V_k$ equals the nominal voltage $V_{th}$ with failure probability
$p_k = 0$, thus guaranteeing error-free execution for all
possible inputs. For convenience, we also use a null voltage
$V_0$ with failure probability $p_0 = 1$ and null energy cost $c_0 = 0$.
Switching the operating voltage also incurs an energy cost.
Let $o_{\ell,h}$ denote the energy consumed to switch the system
operating voltage from $V_\ell$ to $V_h$. We have $o_{\ell,h} = 0$ if $\ell = h$
and $o_{\ell,h} > 0$ if $\ell \ne h$. Moreover, we assume that the voltage
switching cost follows the triangle inequality, i.e., $o_{\ell,h} \le o_{\ell,p} + o_{p,h}$
for any $1 \le \ell, h, p \le k$, which is true in practice. It
basically means that, to switch from $V_\ell$ to $V_h$, no energy will
be gained by first switching to an intermediate voltage $V_p$
and then switching to the target voltage $V_h$.
The objective is to determine a sequence of voltages to
execute each task in the chain, so as to minimize the expected
total energy consumption.
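The assumptions of this section can be summarized as a small consistency check. The sketch below is only illustrative: the function name `is_valid_model`, the 0-based indexing (index `k-1` stands for the nominal voltage $V_k$), and the numeric values are assumptions made for the example.

```python
def is_valid_model(c, p, o):
    """Check the model assumptions on costs c, failure probabilities p,
    and switching costs o (o[l][h] = cost of switching from V_l to V_h)."""
    k = len(c)
    idx = range(k)
    costs_increase = all(c[i] < c[i + 1] for i in range(k - 1))
    probs_decrease = all(p[i] > p[i + 1] for i in range(k - 1))
    nominal_is_safe = p[k - 1] == 0
    zero_diagonal = all(o[l][l] == 0 for l in idx)
    positive_elsewhere = all(o[l][h] > 0 for l in idx for h in idx if l != h)
    triangle = all(o[l][h] <= o[l][q] + o[q][h]
                   for l in idx for h in idx for q in idx)
    return (costs_increase and probs_decrease and nominal_is_safe
            and zero_diagonal and positive_elsewhere and triangle)


# Hypothetical platform with k = 3 voltages (arbitrary energy units).
c = [10, 20, 40]
p = [0.5, 0.1, 0.0]
o = [[abs(l - h) for h in range(3)] for l in range(3)]
print(is_valid_model(c, p, o))
```

A switching-cost matrix in which a detour through an intermediate voltage is cheaper than the direct switch would be rejected by the triangle-inequality test.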
2.3 Conditional probabilities
We now consider the implications of Assumptions 1 and 2
on the success and failure probabilities of executing a task
following a sequence of voltages. For the ease of writing,
we assume that the execution of each task has already failed
under the null voltage V0 (at energy cost c0 = 0).
Lemma 1. Consider a sequence $\langle V_1, V_2, \ldots, V_m \rangle$ of $m$ voltages,
where $V_1 < V_2 < \cdots < V_m$, under which a given task is
executed.
(i) For any voltage $V_\ell$, where $1 \le \ell \le m$, given that the
execution of the task on a certain input has already failed
under voltages $V_0, V_1, \ldots, V_{\ell-1}$, the probability that the task
execution will fail under voltage $V_\ell$ on the same input is
$$\mathbb{P}(V_\ell\text{-fail} \mid V_0 V_1 \cdots V_{\ell-1}\text{-fail}) = \frac{p_\ell}{p_{\ell-1}}$$
(ii) For any voltage $V_\ell$, where $1 \le \ell \le m$, let $\mathbb{P}(V_\ell\text{-fail})$ denote
the probability that the task execution will fail under all
voltages $V_0, V_1, \ldots, V_\ell$, and let $\mathbb{P}(V_\ell\text{-succ})$ denote the probability
that the execution will fail under voltages $V_0, V_1, \ldots, V_{\ell-1}$
but will succeed under $V_\ell$. We have
$$\mathbb{P}(V_\ell\text{-fail}) = p_\ell$$
$$\mathbb{P}(V_\ell\text{-succ}) = p_{\ell-1} - p_\ell$$
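Proof. We prove property (i) using the fundamental assumptions
on the error model, namely Assumptions 1 and 2. The
task under study is the execution of some computation on
some input $I$. Since this task execution has failed under
voltages $V_0, V_1, \ldots, V_{\ell-1}$, we know that input $I$ satisfies
$I \in \bigcap_{h=0}^{\ell-1} \mathcal{I}_f(V_h) = \mathcal{I}_f(V_{\ell-1})$, where $\mathcal{I}_f(V_h)$ denotes the set of
inputs on which the computation will fail under voltage $V_h$.
Then, the task execution will fail under voltage $V_\ell$ if input
$I$ also falls in $\mathcal{I}_f(V_\ell) \subseteq \mathcal{I}_f(V_{\ell-1})$. Given that the input is
randomly chosen (we have no a priori knowledge on it), the
probability is
$$\mathbb{P}(V_\ell\text{-fail} \mid V_0 V_1 \cdots V_{\ell-1}\text{-fail}) = \frac{|\mathcal{I}_f(V_\ell)|}{|\mathcal{I}_f(V_{\ell-1})|} = \frac{|\mathcal{I}_f(V_\ell)| / |\mathcal{I}|}{|\mathcal{I}_f(V_{\ell-1})| / |\mathcal{I}|} = \frac{p_\ell}{p_{\ell-1}}.$$
To prove (ii), we note that, in both cases, the task has failed
under all voltages before $V_\ell$. Using the result of (i), we get
$$\mathbb{P}(V_\ell\text{-fail}) = \prod_{h=1}^{\ell} \mathbb{P}(V_h\text{-fail} \mid V_0 \cdots V_{h-1}\text{-fail}) = \prod_{h=1}^{\ell} \frac{p_h}{p_{h-1}} = p_\ell,$$
and
$$\mathbb{P}(V_\ell\text{-succ}) = \left( \prod_{h=1}^{\ell-1} \mathbb{P}(V_h\text{-fail} \mid V_0 \cdots V_{h-1}\text{-fail}) \right) \times \Big( 1 - \mathbb{P}(V_\ell\text{-fail} \mid V_0 \cdots V_{\ell-1}\text{-fail}) \Big) = \left( \prod_{h=1}^{\ell-1} \frac{p_h}{p_{h-1}} \right) \left( 1 - \frac{p_\ell}{p_{\ell-1}} \right) = p_{\ell-1} - p_\ell.$$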
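Lemma 1(ii) can be checked by exact enumeration on a small finite input set. The sketch below assumes hypothetical threshold voltages in millivolts; the function name `first_success_distribution` is introduced only for this illustration. It retries the inputs that failed at all previous voltages and records the fraction of all inputs that first succeed at each voltage.

```python
from fractions import Fraction


def first_success_distribution(thresholds_mv, voltages_mv):
    """For increasing voltages, return the fraction of all inputs that fail
    under every earlier voltage and succeed at the current one."""
    total = len(thresholds_mv)
    remaining = list(thresholds_mv)  # inputs that have failed so far
    dist = []
    for v in voltages_mv:
        succeeded = [t for t in remaining if t <= v]
        remaining = [t for t in remaining if t > v]
        dist.append(Fraction(len(succeeded), total))
    return dist


# Hypothetical per-input thresholds and voltage sequence (millivolts).
thresholds_mv = [1200, 1250, 1250, 1300, 1400, 1500]
voltages_mv = [1100, 1250, 1350, 1500]

# Failure probabilities p_1..p_m, preceded by p_0 = 1 for the null voltage.
p = [Fraction(1)] + [
    Fraction(sum(t > v for t in thresholds_mv), len(thresholds_mv))
    for v in voltages_mv
]
dist = first_success_distribution(thresholds_mv, voltages_mv)
# Lemma 1(ii): P(V_l-succ) = p_{l-1} - p_l for every position l.
predicted = [p[l - 1] - p[l] for l in range(1, len(p))]
print(dist == predicted)
```

Since the last voltage in the example succeeds on every input, the computed fractions sum to one.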
3. OPTIMAL SOLUTION
In this section, we present a dynamic programming algo-
rithm to minimize the expected energy consumption for ex-
ecuting a linear chain of tasks. We start with a single task
(Section 3.1) before moving to a chain (Section 3.2).
3.1 For a single task
We first focus on a single task. The following lemma gives
the expected energy consumption for any given voltage se-
quence starting at the current system voltage (preset voltage
Vp before the first execution) and ending at the nominal volt-
age Vk (which guarantees successful completion).
Lemma 2. Suppose a sequence $L = \langle V_1, V_2, \ldots, V_m \rangle$ of $m$
voltages is scheduled to execute a task, where $V_1 < V_2 < \cdots < V_m$,
$V_1 = V_p$, and $V_m = V_k$. The expected energy
consumption is
$$E(L) = c_1 + \sum_{\ell=2}^{m} p_{\ell-1} \left( o_{\ell-1,\ell} + c_\ell \right) \tag{1}$$
Proof. The task may be completed before all voltages in
the sequence are used, so let $\mathbb{P}(V_\ell\text{-exec})$ be the probability
that voltage $V_\ell$ is actually used to execute the task, which
happens when voltage $V_{\ell-1}$ has failed. Since the first voltage
$V_1$ is always used, we have $\mathbb{P}(V_1\text{-exec}) = 1$, and the corresponding
energy consumption is $\mathbb{P}(V_1\text{-exec})\, c_1 = c_1$. For
$2 \le \ell \le m$, based on Lemma 1(ii), we have
$\mathbb{P}(V_\ell\text{-exec}) = \mathbb{P}(V_{\ell-1}\text{-fail}) = p_{\ell-1}$, so the expected energy consumption
of executing the task under $V_\ell$ is
$\mathbb{P}(V_\ell\text{-exec})(o_{\ell-1,\ell} + c_\ell) = p_{\ell-1}(o_{\ell-1,\ell} + c_\ell)$.
Summing up the expected energy of all
voltages in the sequence leads to the result.
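Equation (1) translates directly into a few lines of code. The following Python sketch is illustrative only: the function name `expected_energy`, the 0-based voltage indices, and the numeric platform values are assumptions, not data from the paper.

```python
def expected_energy(seq, c, p, o):
    """Expected energy of Equation (1) for an increasing sequence of
    voltage indices `seq` that starts at the preset voltage and ends at
    the nominal voltage.

    c[l]    : energy cost of one task at voltage l
    p[l]    : failure probability of one task at voltage l
    o[l][h] : energy cost of switching from voltage l to voltage h
    """
    total = c[seq[0]]  # the first voltage is always used
    for prev, cur in zip(seq, seq[1:]):
        # voltage `cur` is used only if `prev` (and all before it) failed
        total += p[prev] * (o[prev][cur] + c[cur])
    return total


# Hypothetical platform with three voltages; index 2 is nominal.
c = [1.0, 2.0, 4.0]
p = [0.5, 0.1, 0.0]
o = [[0.0 if l == h else 0.2 for h in range(3)] for l in range(3)]

for seq in ([0, 1, 2], [0, 2], [1, 2], [2]):
    print(seq, expected_energy(seq, c, p, o))
```

With these made-up numbers, trying the intermediate voltage before the nominal one is cheaper in expectation than jumping to the nominal voltage directly.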
Lemma 2 shows that the expected energy consumed by a
voltage in any sequence is (only) related to the failure proba-
bility of the voltage immediately preceding it. We make use
of this property to design a dynamic programming algorithm.
Theorem 1. To minimize the expected energy consumption
for a single task, the optimal sequence of voltages to execute
the task with any preset voltage $V_p \in \mathcal{V}$ can be obtained
by dynamic programming with complexity $O(k^2)$.
Proof. Let $L_s^*$ denote the optimal sequence of voltages
among all possible sequences that start with voltage $V_s \in \mathcal{V}$,
and when the system preset voltage is also at $V_s$. Let $E(L_s^*)$
denote the corresponding expected energy consumption by
carrying out this sequence. According to Lemma 2, adding
a new voltage before any sequence of voltages will only affect
the expected energy of the first voltage in the original
sequence. By using this property, we can formulate the following
dynamic program to compute
$$\begin{aligned}
E(L_s^*) &= c_s + \min_{s < \ell \le k} \left\{ E(L_\ell^*) - c_\ell + p_s (o_{s,\ell} + c_\ell) \right\} \\
&= c_s + \min_{s < \ell \le k} \left\{ E(L_\ell^*) + p_s o_{s,\ell} + (p_s - 1) c_\ell \right\}
\end{aligned} \tag{2}$$
and the optimal sequence starting with voltage $V_s$ is constructed
as $L_s^* = \langle V_s, L_{\ell'}^* \rangle$, where
$$\ell' = \arg\min_{s < \ell \le k} \left\{ E(L_\ell^*) + p_s o_{s,\ell} + (p_s - 1) c_\ell \right\}$$
The dynamic program is initialized with $E(L_k^*) = c_k$ and
$L_k^* = \langle V_k \rangle$, and it is computed based on Equation (2) for
$s = k-1, k-2, \ldots, 1$. The complexity is clearly $O(k^2)$.
Note that the above formulation ensures every sequence $L_s^*$,
for $s = 1, 2, \ldots, k$, ends with the nominal voltage $V_k$, so the
task is guaranteed to be completed.
Given a preset voltage $V_p \in \mathcal{V}$, the optimal expected energy
is then
$$E^*(V_p) = \min_{1 \le s \le k} \left\{ o_{p,s} + E(L_s^*) \right\}$$
and the optimal voltage sequence to execute the task with
preset voltage $V_p$ is
$$L^*(V_p) = L_{s'}^*$$
where $s' = \arg\min_{1 \le s \le k} \left\{ o_{p,s} + E(L_s^*) \right\}$.
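The dynamic program of Theorem 1 can be sketched in Python as follows. This is a minimal illustration under stated assumptions: voltages are indexed from 0 (so index `k-1` is the nominal voltage $V_k$), the function name `optimal_single_task` is hypothetical, and the platform values are made up.

```python
def optimal_single_task(c, p, o, preset):
    """Return (optimal expected energy, voltage index sequence) for one
    task, following Equation (2) and the final minimization over the
    starting voltage."""
    k = len(c)
    E = [0.0] * k       # E[s] = E(L*_s)
    nxt = [None] * k    # next voltage after s in L*_s (None at nominal)
    E[k - 1] = c[k - 1]
    for s in range(k - 2, -1, -1):
        best, best_l = min(
            (E[l] + p[s] * o[s][l] + (p[s] - 1.0) * c[l], l)
            for l in range(s + 1, k)
        )
        E[s] = c[s] + best
        nxt[s] = best_l
    # Choose the starting voltage, paying the switch from the preset one.
    start = min(range(k), key=lambda s: o[preset][s] + E[s])
    seq, s = [], start
    while s is not None:
        seq.append(s)
        s = nxt[s]
    return o[preset][start] + E[start], seq


# Hypothetical platform with three voltages; index 2 is nominal.
c = [1.0, 2.0, 4.0]
p = [0.5, 0.1, 0.0]
o = [[0.0 if l == h else 0.2 for h in range(3)] for l in range(3)]

print(optimal_single_task(c, p, o, preset=0))
print(optimal_single_task(c, p, o, preset=2))
```

The two nested loops over voltage indices give the $O(k^2)$ cost stated in the theorem; every returned sequence ends at the nominal index.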
3.2 For a chain of tasks
We now present a dynamic programming algorithm to execute
a linear chain $T_1 \prec T_2 \prec \cdots \prec T_n$ of $n$ tasks. We
point out that, due to the voltage switching cost, the optimal
sequence of voltages to execute each task depends on the ter-
minating voltage of the preceding task as well as the expected
energy consumption to execute the subsequent tasks. Hence,
the sequence could be very different for different task counts.
We first define some notations. Let $L_s^*(T_i)$ denote a sequence
of voltages that starts with $V_s$ for executing task $T_i$,
and which leads to the optimal expected energy $E(\overrightarrow{T_i}, L_s^*(T_i))$
for executing the subchain $T_i \prec \cdots \prec T_n$. The optimal expected
energy to execute $T_i \prec \cdots \prec T_n$ with any preset
voltage $V_p \in \mathcal{V}$ is therefore
$E^*(V_p, \overrightarrow{T_i}) = \min_{1 \le s \le k} \{ o_{p,s} + E(\overrightarrow{T_i}, L_s^*(T_i)) \}$.
The following lemma gives the expected energy consumption
to execute the subchain $T_i \prec \cdots \prec T_n$, given a voltage
sequence for executing its first task $T_i$, the system preset voltage
$V_p$ (before the execution of $T_i$), and the optimal expected
energy to execute the subsequent tasks.
Lemma 3. Suppose a sequence $L(T_i) = \langle V_1, V_2, \ldots, V_m \rangle$
of $m$ voltages is scheduled to execute task $T_i$, where
$V_1 < V_2 < \cdots < V_m$, $V_1 = V_p$, and $V_m = V_k$. The expected energy
consumption to execute the subchain $T_i \prec \cdots \prec T_n$ by carrying
out sequence $L(T_i)$ for task $T_i$ and the optimal sequence
for each subsequent task is
$$E(\overrightarrow{T_i}, L(T_i)) = c_1 + (1 - p_1) E^*(V_1, \overrightarrow{T_{i+1}}) + \sum_{\ell=2}^{m} \Big( p_{\ell-1} (o_{\ell-1,\ell} + c_\ell) + (p_{\ell-1} - p_\ell) E^*(V_\ell, \overrightarrow{T_{i+1}}) \Big) \tag{3}$$
Proof. As in Lemma 2, the expected energy consumed by
carrying out sequence $L(T_i)$ for task $T_i$ can be similarly computed
to follow Equation (1), i.e., $c_1 + \sum_{\ell=2}^{m} p_{\ell-1} (o_{\ell-1,\ell} + c_\ell)$.
For a chain of tasks, the optimal sequence of voltages to
execute each task depends on the terminating (successful)
voltage of its preceding task. Suppose task $T_i$ is successfully
executed by voltage $V_\ell$, then the optimal expected energy
to execute the rest of the chain is $E^*(V_\ell, \overrightarrow{T_{i+1}})$. Based on
Lemma 1(ii), for any $1 \le \ell \le m$, the probability that task $T_i$
is successfully executed by $V_\ell$ is given by $\mathbb{P}(V_\ell\text{-succ}) = p_{\ell-1} - p_\ell$.
Hence, the expected energy consumption by executing the
remaining tasks is $\sum_{\ell=1}^{m} (p_{\ell-1} - p_\ell) E^*(V_\ell, \overrightarrow{T_{i+1}})$.
Summing up the expected energy for task $T_i$ and for the
rest of chain gives the result.
Theorem 2. To minimize the expected energy consumption
for a linear chain of tasks, the optimal sequence of voltages
to execute each task, given the terminating voltage of its
preceding task (or given the preset voltage $V_p \in \mathcal{V}$ for the first
task), can be obtained by dynamic programming with complexity
$O(nk^2)$.
Proof. Observe from Lemma 3 that the expected energy
incurred by any voltage in a sequence to execute a task is
related to the failure probability of the voltage itself as well
as that of the immediately preceding voltage in the sequence.
Hence, to determine the optimal voltage sequence to execute
each task, we can establish the following dynamic program:
$$\begin{aligned}
E(\overrightarrow{T_i}, L_s^*(T_i)) &= c_s + (1 - p_s) E^*(V_s, \overrightarrow{T_{i+1}}) + \min_{s < \ell \le k} \Big\{ E(\overrightarrow{T_i}, L_\ell^*(T_i)) - c_\ell - (1 - p_\ell) E^*(V_\ell, \overrightarrow{T_{i+1}}) \\
&\qquad\qquad + p_s (o_{s,\ell} + c_\ell) + (p_s - p_\ell) E^*(V_\ell, \overrightarrow{T_{i+1}}) \Big\} \\
&= c_s + (1 - p_s) E^*(V_s, \overrightarrow{T_{i+1}}) + \min_{s < \ell \le k} \Big\{ E(\overrightarrow{T_i}, L_\ell^*(T_i)) + p_s o_{s,\ell} + (p_s - 1) \big( c_\ell + E^*(V_\ell, \overrightarrow{T_{i+1}}) \big) \Big\}
\end{aligned}$$
for $1 \le i \le n-1$ and
$$\begin{aligned}
E(\overrightarrow{T_n}, L_s^*(T_n)) &= c_s + \min_{s < \ell \le k} \Big\{ E(\overrightarrow{T_n}, L_\ell^*(T_n)) - c_\ell + p_s (o_{s,\ell} + c_\ell) \Big\} \\
&= c_s + \min_{s < \ell \le k} \Big\{ E(\overrightarrow{T_n}, L_\ell^*(T_n)) + p_s o_{s,\ell} + (p_s - 1) c_\ell \Big\}
\end{aligned}$$
for $i = n$. The optimal voltage sequence $L_s^*(T_i)$ for each task
$T_i$ is constructed as $L_s^*(T_i) = \langle V_s, L_{\ell'}^*(T_i) \rangle$, where $\ell'$ yields the
minimum value of $E(\overrightarrow{T_i}, L_s^*(T_i))$ in the two equations above.
The dynamic program is initialized with $E(\overrightarrow{T_n}, L_k^*(T_n)) = c_k$
and $L_k^*(T_n) = \langle V_k \rangle$. For $1 \le i \le n-1$, it is initialized
with $E(\overrightarrow{T_i}, L_k^*(T_i)) = c_k + E^*(V_k, \overrightarrow{T_{i+1}})$ and $L_k^*(T_i) = \langle V_k \rangle$.
For each task $T_i$, starting from $i = n$, we first compute
$E(\overrightarrow{T_i}, L_s^*(T_i))$ and construct $L_s^*(T_i)$ for all $s = k-1, k-2, \ldots, 1$.
Then, for all $1 \le h \le k$, we need to compute the optimal
expected energy to execute the subchain $T_i \prec \cdots \prec T_n$
when task $T_{i-1}$ terminates at voltage $V_h$:
$$E^*(V_h, \overrightarrow{T_i}) = \min_{1 \le s \le k} \left\{ o_{h,s} + E(\overrightarrow{T_i}, L_s^*(T_i)) \right\}$$
and the optimal voltage sequence to execute task $T_i$ when
task $T_{i-1}$ terminates at voltage $V_h$:
$$L^*(V_h, T_i) = L_{s'}^*(T_i)$$
where $s' = \arg\min_{1 \le s \le k} \{ o_{h,s} + E(\overrightarrow{T_i}, L_s^*(T_i)) \}$. After that, we
can move on to task $T_{i-1}$. The optimal expected energy to execute
the entire chain with preset voltage $V_p$ is then given by
$E^*(V_p, \overrightarrow{T_1})$, and we start executing the first task with voltage
sequence $L^*(V_p, T_1)$. Again, the above formulation ensures
the optimal sequence for each task ends with the nominal
voltage $V_k$, so all tasks are guaranteed to be completed.
For the complexity, the computation of both $E(\overrightarrow{T_i}, L_s^*(T_i))$
and $E^*(V_h, \overrightarrow{T_i})$ for each task $T_i$ takes $O(k^2)$ time, and it is
clearly linear in the number of tasks.
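A compact way to see the recurrence at work is to compute only the optimal expected energies, processing tasks from $T_n$ back to $T_1$. The Python sketch below is an illustration under assumptions: indices are 0-based, the function name `optimal_chain_energy` and the platform values are hypothetical, and the reconstruction of the voltage sequences is omitted for brevity. Setting the energy of the (empty) subchain after $T_n$ to zero makes the general recurrence coincide with the case $i = n$.

```python
def optimal_chain_energy(n, c, p, o, preset):
    """Optimal expected energy E*(V_preset, T_1) for a chain of n
    identical tasks, computed backwards over the tasks."""
    k = len(c)
    # e_next[h] = E*(V_h, T_{i+1}); zero after the last task.
    e_next = [0.0] * k
    for _ in range(n):  # tasks T_n, T_{n-1}, ..., T_1
        E = [0.0] * k   # E[s] = E(T_i, L*_s(T_i))
        E[k - 1] = c[k - 1] + e_next[k - 1]
        for s in range(k - 2, -1, -1):
            E[s] = c[s] + (1.0 - p[s]) * e_next[s] + min(
                E[l] + p[s] * o[s][l] + (p[s] - 1.0) * (c[l] + e_next[l])
                for l in range(s + 1, k)
            )
        # Energy of the subchain when the previous task ended at V_h.
        e_next = [min(o[h][s] + E[s] for s in range(k)) for h in range(k)]
    return e_next[preset]


# Hypothetical platform with three voltages; index 2 is nominal.
c = [1.0, 2.0, 4.0]
p = [0.5, 0.1, 0.0]
o = [[0.0 if l == h else 0.2 for h in range(3)] for l in range(3)]

print(optimal_chain_energy(1, c, p, o, preset=0))
print(optimal_chain_energy(2, c, p, o, preset=0))
print(optimal_chain_energy(2, c, p, o, preset=2))
```

Each task costs $O(k^2)$ operations in this sketch, so the total is $O(nk^2)$, in line with the theorem.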
4. EVALUATION
In this section, we numerically evaluate the performance of
the proposed dynamic programming solutions. We instanti-
ate the application workflow with matrix multiplication and
perform error detection (and correction) with ABFT.
4.1 Application workflow
Consider the blocked version of the inner-product algorithm
for computing the matrix product $C = A \times B$, where $A$
and $B$ are square matrices of size $m \times m$. The block size is $b$,
and the matrices are partitioned into $\lceil m/b \rceil^2$ blocks (or submatrices).
Assuming that all elements of matrix $C$ are initialized
to zero, the following shows the sequential implementation of
the algorithm:
for i = 1 to ⌈m/b⌉ do
    for j = 1 to ⌈m/b⌉ do
        for k = 1 to ⌈m/b⌉ do
            C_{i,j} ← C_{i,j} + A_{i,k} × B_{k,j}
which forms a chain of $n = \lceil m/b \rceil^3$ tasks with each task incurring
$O(b^3)$ multiply-add operations. The block size $b$ is chosen
so as to enforce maximal cache re-use during the computation
of one task. Setting $b$ too small, however, incurs a larger
overhead in loading and storing the data, thereby reducing
the efficiency of the computation.
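The loop nest above can be sketched in plain Python to make the task structure explicit. This is an illustrative, unoptimized version on lists of lists; the function name `blocked_matmul` is an assumption, and block boundaries are clipped so that $m$ need not be a multiple of $b$.

```python
def blocked_matmul(A, B, b):
    """Blocked inner-product matrix multiplication C = A x B.

    Returns the product and the number of block tasks executed, which
    equals ceil(m / b) ** 3 for m x m inputs."""
    m = len(A)
    nb = -(-m // b)  # ceil(m / b) using integer arithmetic
    C = [[0] * m for _ in range(m)]
    tasks = 0
    for i in range(nb):
        for j in range(nb):
            for k in range(nb):
                # One task: C_ij <- C_ij + A_ik x B_kj
                for r in range(i * b, min((i + 1) * b, m)):
                    for s in range(j * b, min((j + 1) * b, m)):
                        acc = 0
                        for t in range(k * b, min((k + 1) * b, m)):
                            acc += A[r][t] * B[t][s]
                        C[r][s] += acc
                tasks += 1
    return C, tasks


# Small usage example with 5 x 5 integer matrices and block size 2.
m = 5
A = [[r + c for c in range(m)] for r in range(m)]
B = [[r * c + 1 for c in range(m)] for r in range(m)]
C, tasks = blocked_matmul(A, B, 2)
print(tasks)
```

In the setting of this paper, each iteration of the innermost block loop is one task of the chain, and the verification step would run at the end of each such task.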
4.2 Algorithm-based fault tolerance
Algorithm-based fault tolerance (ABFT) is a technique de-
veloped by Huang and Abraham [5] to detect, locate and
correct errors in matrix operations with low computational
overhead. The idea is to add redundancy to the matrices in
the form of checksums, which have been shown to be con-
sistently maintained during the computation of many matrix
operations. The following demonstrates the encoding scheme
for the matrix multiplication C = A×B.
First, define the column checksum matrix of matrix $A$ as
$A^c := \begin{pmatrix} A \\ e^T A \end{pmatrix}$, where $e = [1\ 1\ \cdots\ 1]^T$ is an all-one column
vector. Define the row checksum matrix of matrix $B$ as
$B^r := \begin{pmatrix} B & Be \end{pmatrix}$. Finally, define the full checksum matrix of
matrix $C$ as $C^f := \begin{pmatrix} C & Ce \\ e^T C & e^T C e \end{pmatrix}$. Instead of multiplying
the original matrices $A$ and $B$, we multiply the checksum matrices
$A^c$ and $B^r$, which produces the full checksum matrix
$C^f$ as follows:
$$A^c \times B^r = \begin{pmatrix} A \\ e^T A \end{pmatrix} \times \begin{pmatrix} B & Be \end{pmatrix} = \begin{pmatrix} AB & ABe \\ e^T AB & e^T ABe \end{pmatrix} = \begin{pmatrix} C & Ce \\ e^T C & e^T Ce \end{pmatrix} = C^f$$
If an error occurs during the above computation,
the checksum property in matrix $C^f$ will no longer
be satisfied, which can be easily detected by recomputing the
checksums of $C^f$ and comparing them to the results in the
matrix product. Moreover, if only one error has occurred,
then exactly one row and one column will violate the checksum
property. In this case, we can locate the error (at the
intersection of the inconsistent row and inconsistent column)
and then correct it (by reversing the checksum computation).
The same encoding scheme also works for matrix addition,
thus we can apply it to detect (and possibly correct
some) errors after each iteration of the blocked algorithm
shown in Section 4.1.
The overhead of performing ABFT on matrix blocks of
size $b$, including the computation of the checksums themselves,
the extra computation during the multiplication, and
the error detection and correction, takes $O(b^2)$ operations,
which is much lower than the $O(b^3)$ operations in the matrix
multiplication for reasonable block sizes.
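The checksum scheme described in this section can be sketched in a few lines of Python. This is a simplified illustration, not the implementation evaluated in the paper: it uses exact integer arithmetic (floating-point data would need a tolerance in the comparisons), it only corrects a single error located in the data part of the product, and all function names are hypothetical.

```python
def column_checksum(A):
    """Append the row e^T A of column sums to A."""
    return A + [[sum(col) for col in zip(*A)]]


def row_checksum(B):
    """Append the column Be of row sums to B."""
    return [row + [sum(row)] for row in B]


def matmul(X, Y):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)]
            for row in X]


def check_and_correct(Cf):
    """Verify the checksums of a full checksum matrix in place.

    Returns "ok", "corrected" (single error fixed), or "uncorrectable"."""
    n = len(Cf) - 1
    bad_rows = [i for i in range(n) if sum(Cf[i][:n]) != Cf[i][n]]
    bad_cols = [j for j in range(n)
                if sum(Cf[i][j] for i in range(n)) != Cf[n][j]]
    if not bad_rows and not bad_cols:
        return "ok"
    if len(bad_rows) == 1 and len(bad_cols) == 1:
        i, j = bad_rows[0], bad_cols[0]
        # Reverse the row checksum to recover the faulty entry.
        Cf[i][j] = Cf[i][n] - sum(Cf[i][q] for q in range(n) if q != j)
        return "corrected"
    return "uncorrectable"


A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
Cf = matmul(column_checksum(A), row_checksum(B))
print(check_and_correct(Cf))   # checksums are consistent
Cf[0][1] = 99                  # inject a single error
print(check_and_correct(Cf), Cf[0][1])
```

In the context of this paper, an "uncorrectable" outcome corresponds to a failed task execution, which would then be retried at a higher voltage.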
4.3 Evaluation setup
This section describes the various parameters used to in-
stantiate the model. In the evaluation, we assume that timing
errors occur only in the ALU, while the memory is protected
and is thus error-free.
Matrix parameters.
We set the dimension of the matrices to be m = 16384, and
vary the block size b in the evaluation. The maximum block
size is set to be 256 for cache efficiency. The number of tasks
is thus $n = \lceil m/b \rceil^3 = \lceil 16384/b \rceil^3$. For fault tolerance, the
matrices are protected by the ABFT scheme described above.
Hence, the computation of each task requires $w = b(b+1)^2 + \sigma$
operations, where σ denotes the overhead of initiating the
matrix multiplication, which essentially prevents the use of
very small blocks. In the evaluation, the overhead is assumed
to be equivalent to multiplying two matrices of size 8×8, i.e.,
$\sigma = 8^3 = 512$. The time to compute each task is therefore
$t = \tau \cdot w/\eta$, where $\tau = 1/f$ denotes the time to do one
cycle at frequency f and η denotes the percentage of the
peak processor performance that can be efficiently utilized.
Since optimized matrix multiplication codes are known to be
efficient, we set η = 0.8 in the evaluation.
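As a quick check of these parameters, the following sketch computes the number of tasks n, the operation count w, and the task time t for a given block size. The function name is hypothetical; the default frequency of 90 MHz is the one used in the platform setting of this evaluation.

```python
def task_parameters(b, m=16384, sigma=8 ** 3, f=90e6, eta=0.8):
    """Return (n, w, t) for block size b.

    n: number of tasks, ceil(m / b) ** 3
    w: operations per task with ABFT, b * (b + 1) ** 2 + sigma
    t: time per task in seconds, tau * w / eta with tau = 1 / f
    """
    blocks_per_dim = -(-m // b)      # integer ceiling of m / b
    n = blocks_per_dim ** 3
    w = b * (b + 1) ** 2 + sigma
    tau = 1.0 / f
    t = tau * w / eta
    return n, w, t


n, w, t = task_parameters(256)   # n = 64**3 tasks, w = 16909056 operations
```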
Platform setting.
We adopt the set of voltages and the associated failure
probabilities due to timing errors measured in [4] for a
field-programmable gate array multiplier block at f = 90MHz and
27°C. Figure 1 shows the set $\mathcal{V}$ of available voltages, as well
as the error probability $p_\ell^{(1)}$ of each voltage $V_\ell \in \mathcal{V}$ when
performing a single operation on a random input. We take
the zero margin voltage 1.54V that produces no error for all
inputs as the nominal voltage $V_k$. As some errors can be
corrected by hardware recovery mechanisms with little extra
overhead, such as the technique reported in [4], they will
not show up at the application level. Hence, we scale the
associated error probability of each voltage by a factor of γ,
which will be varied as a parameter in the evaluation. For any
voltage $V_\ell \in \mathcal{V}$, the probability of having at least one error
in the computation of a task with w operations can therefore
be computed as $p_\ell = 1 - (1 - p_\ell^{(1)}/\gamma)^w$.

Figure 1: Set of voltages of an FPGA multiplier block and
the associated error probabilities measured on random inputs
at 90MHz and 27°C [4]. Horizontal axis: voltage $V_\ell$, from 1.14
to 1.62 in steps of 0.04. Vertical axis: error probability $p_\ell^{(1)}$,
with ticks at 0 and at each decade from 1e-09 to 1.
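The task-level failure probability of the platform setting, $p_\ell = 1 - (1 - p_\ell^{(1)}/\gamma)^w$, can be evaluated with the minimal sketch below. Because the per-operation probabilities are tiny and w is large, the sketch uses `log1p` and `expm1` instead of the direct power; this numerical choice and the function name are assumptions of the example, not part of the evaluation described here.

```python
import math


def task_failure_probability(p1, gamma, w):
    """Probability of at least one error among w operations.

    p1:    per-operation error probability at the chosen voltage
    gamma: scaling factor accounting for hardware error handling
    w:     number of operations in the task
    """
    q = p1 / gamma                    # scaled per-operation probability
    if q <= 0.0:
        return 0.0
    if q >= 1.0:
        return 1.0
    # 1 - (1 - q)**w, computed stably for very small q
    return -math.expm1(w * math.log1p(-q))
```

The per-operation probabilities themselves must be read from the measurements in [4]; no values are assumed here.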
Energy cost.
The dynamic power consumption of microprocessors is typically
modeled as $P(V, f) = \alpha f V^2$ [1, 9], where α denotes the
effective capacitance, f the frequency, and V the operating
voltage. By scaling the unit of power, we can assume without
loss of generality that αf = 1 under a fixed frequency. Hence,
for a given voltage $V_\ell$ and a block size b, the energy consumed
to execute one task can be computed as $c_\ell = V_\ell^2 t$, where t is
the time to execute the task. Note that we ignore the energy
cost due to the $O(b^2)$ load and store operations needed to
execute the task, which is incurred regardless of the execution
algorithm.
The energy consumption to switch the operating voltage is
assumed to be a linear function of the difference between the
starting voltage and ending voltage. We model the switching
energy as

$$o_{\ell,h} = \begin{cases} 0 & \text{if } \ell = h \\ \beta \cdot \dfrac{|V_\ell - V_h|}{V_k - V_1} & \text{otherwise} \end{cases} \qquad (4)$$

where β denotes the cost to switch between the nominal voltage
$V_k$ and the lowest possible voltage $V_1$. In the evaluation,
we will vary β to capture the relative cost of voltage switching
compared to computing.
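The two energy terms of this model, the per-task energy $V_\ell^2 t$ and the switching energy of Equation (4), translate directly into code. The sketch below is illustrative only; the function and parameter names are hypothetical, and power is expressed in the scaled unit where αf = 1.

```python
def task_energy(voltage, t):
    """Energy of one task executed at `voltage` for time t (alpha * f = 1)."""
    return voltage ** 2 * t


def switching_energy(v_from, v_to, beta, v_lowest, v_nominal):
    """Energy to switch from v_from to v_to, following Equation (4).

    beta is the cost of a full swing between the nominal voltage
    and the lowest available voltage.
    """
    if v_from == v_to:
        return 0.0
    return beta * abs(v_from - v_to) / (v_nominal - v_lowest)


# Full swing from the nominal voltage 1.54 to the lowest voltage 1.14
# of Figure 1 costs exactly beta.
full_swing = switching_energy(1.54, 1.14, beta=2.0, v_lowest=1.14, v_nominal=1.54)
```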
4.4 Evaluated algorithms
We evaluate the following algorithms and compare their
performance. For all algorithms, the preset voltage is set to
be the nominal voltage to begin the computation.
• N-Voltage: This is the baseline algorithm that applies
near-threshold computing and always uses the nominal
voltage $V_{th} = V_k$ to execute all the tasks. Since many
systems operate at a much higher supply voltage $V_{dd}$
than the nominal voltage, their energy consumption is
at least as high as that of the N-Voltage algorithm.
• DP1-detect & DP1-correct : These two algorithms use
the dynamic program for a single task described in Sec-
tion 3.1. Specifically, they apply the optimal sequence
of voltages computed for one task to all the tasks in the
chain. The expected energy consumption can be com-
puted iteratively based on Equation (3). DP1-detect
uses ABFT for error detection, and in case of error
it re-executes the task with a higher voltage. DP1-
correct also does error correction if exactly one error is
detected, so that re-execution is not necessary in that
case. If more than one error occurs, then DP1-correct
also re-executes the task.
• DPn-detect & DPn-correct: These two algorithms work
similarly to the previous ones, using ABFT for error
detection and correction, but they make direct use of the
dynamic program for a chain of tasks described in
Section 3.2. Hence, they are able to better take switching
costs into account than DP1-detect & DP1-correct.
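To make the re-execution mechanism of the DP-based algorithms concrete, the sketch below evaluates the expected energy of one task for a fixed sequence of attempts. It is not the dynamic program of Section 3 and does not reproduce Equation (3); it is a simplified illustration under three stated assumptions: switching costs are ignored (β = 0), every failed attempt pays its full task energy before the error is detected, and the last attempt of the sequence never fails (as at the nominal voltage). The function name and the numbers in the usage line are hypothetical.

```python
def expected_task_energy(attempts):
    """Expected energy of one task under a fixed retry sequence.

    attempts: list of (energy, failure_probability) pairs, in the order
    in which the voltages are tried. The last attempt must never fail.
    """
    if not attempts or attempts[-1][1] != 0.0:
        raise ValueError("the last attempt must have failure probability 0")
    expected = 0.0
    reach = 1.0   # probability that this attempt is actually needed
    for energy, p_fail in attempts:
        expected += reach * energy
        reach *= p_fail
    return expected


# Hypothetical numbers: a low-voltage attempt that fails 10% of the time,
# followed by a re-execution that always succeeds.
example = expected_task_energy([(1.69, 0.1), (2.3716, 0.0)])
```

In this hypothetical case the expected energy is 1.69 + 0.1 × 2.3716, which is lower than the 2.3716 paid when always running the safe attempt.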
Due to the additional checksums in ABFT, the number of
operations needed to execute each task in the DP-based
algorithms is $b(b+1)^2$ instead of $b^3$. Because the DPn-correct and
DP1-correct algorithms are able to correct up to one error,
the failure probability of a voltage $V_\ell$ becomes the probability
of having at least two errors in the execution, which for a
task with w operations can be computed as

$$p_\ell = 1 - \left(1 - \frac{p_\ell^{(1)}}{\gamma}\right)^{w} - \binom{w}{1}\left(1 - \frac{p_\ell^{(1)}}{\gamma}\right)^{w-1}\frac{p_\ell^{(1)}}{\gamma}.$$
Since computational errors in matrix multiplications do not
propagate, the above probability is a pessimistic estimation:
indeed, two or more errors can still be corrected if they happen
to occur on the same element in matrix C. The extra overhead
to correct an error is simply one additional operation.
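The failure probability of the correcting algorithms, i.e., the probability of at least two errors among w operations, can be sketched as follows. The function name is hypothetical, and the direct subtraction used here can lose precision when the result is extremely small, so the sketch is meant for illustration rather than for reproducing the evaluation.

```python
def failure_probability_with_correction(p1, gamma, w):
    """Probability of at least two errors among w operations.

    p1:    per-operation error probability at the chosen voltage
    gamma: scaling factor accounting for hardware error handling
    w:     number of operations in the task
    """
    q = p1 / gamma
    no_error = (1.0 - q) ** w
    one_error = w * q * (1.0 - q) ** (w - 1)
    # Clamp at zero to absorb rounding noise in the subtraction.
    return max(0.0, 1.0 - no_error - one_error)
```

For any voltage, this value is never larger than the detection-only probability $1 - (1 - p_\ell^{(1)}/\gamma)^w$, which is why the correcting algorithms can afford lower voltages.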
4.5 Evaluation results
We now present the evaluation results. The first set of
experiments is devoted to the evaluation of the algorithms
when the voltage switching cost β is set to zero, and the
probability scaling factor γ is fixed at 10.
Impact of block size b.
Figures 2(a) and 2(b) present the impact of block size b
on the expected energy consumption. Figure 2(a) shows that
using a small block size dramatically increases the energy
consumption. This is partly due to the extra computation of
ABFT, which reaches 13% $\left(= \frac{(16+1)^2 - 16^2}{16^2}\right)$ of the total
computation when b = 16, and partly due to the extra overhead
needed to handle the increasing number of blocks. For larger
block sizes, the overhead becomes negligible and the energy
consumption of the N-Voltage algorithm, which does not need
ABFT, remains almost constant.
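The relative ABFT overhead quoted above for b = 16 follows from the expression $((b+1)^2 - b^2)/b^2$, which shrinks quickly as the block size grows. A minimal sketch, with a hypothetical function name:

```python
def abft_overhead(b):
    """Relative extra computation of ABFT for block size b."""
    return ((b + 1) ** 2 - b ** 2) / b ** 2


small = abft_overhead(16)    # 33/256, about 13%
large = abft_overhead(256)   # 513/65536, below 1%
```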
Surprisingly, the energy consumption of the DP algorithms
does not seem to be much affected either. In fact, decreasing
the block size decreases the number of computations needed
to execute one task, and thus the probability of failure. The
algorithms can then choose lower voltages, which saves energy,
but the overhead of ABFT associated with smaller blocks
increases. This gain and this loss cancel out, so the energy
consumption of these algorithms remains quite stable as long as
the block size b is not too small.
Figure 2(b) shows the normalized energy consumption of
the algorithms with respect to the baseline algorithm N-Voltage.
The DPn-correct algorithm, which can tolerate one error, is
less likely to fail than the DPn-detect algorithm under the
same voltage. This ability enables it to either choose a lower
voltage while maintaining the same error probability, or use
the same voltage while undergoing fewer failures. Both cases
lead to savings in energy, between 5% and 10% in the experiment
depending on the block size. Note that the DP1 and DPn
algorithms yield the same energy consumption when the voltage
switching cost is zero. In fact, the optimal sequence of
voltages for one task turns out to be the same for all the tasks,
which is the result of not having to pay the cost needed to
reset the voltage after the completion of each task.

Figure 2: Impact of b and γ on the expected energy consumption for zero voltage switching cost. Panels (a) and (b) plot the
expected energy and the normalized expected energy against the block size b for DPn-detect, DPn-correct, DP1-detect,
DP1-correct, and N-Voltage. Panel (c) plots the normalized expected energy against the block size b for γ = 1, 10, 100, 1000,
and 10000, together with N-Voltage. Only the results for the DPn-correct algorithm are shown in (c).

Figure 3: Failure probabilities for one task under different
block sizes and voltages. Horizontal axis: voltage $V_\ell$, from 1.3
to 1.54. Vertical axis: failure probability $p_\ell$, from 0 to 1.
Curves: b = 16, 32, 64, 128, 256, and 1024.
To better understand the performance of the algorithms,
we plot in Figure 3 the failure probability of a single task
under different voltages and block sizes. It shows that, for
a given block size, there is at least one voltage below the
nominal voltage with a failure probability that is low enough,
so that the nominal voltage itself is almost never needed: the
execution of a task will almost always succeed at a lower
voltage. Therefore, the DP algorithms consistently yield lower
energy consumption than the N-Voltage algorithm, as shown
in Figure 2(b). However, this holds only for small block sizes
(e.g., up to b = 256). For larger block sizes, such as b = 1024, only
the nominal voltage can guarantee a failure-free execution.
Impact of probability scaling factor γ.
Figure 2(c) presents the effect of γ on the expected energy
consumption of the DPn-correct algorithm. Recall that γ is
used to scale the error probabilities of the available voltages
given in Figure 1 in order to account for the error-handling
ability of the hardware. When γ equals 1, the probabilities
are actually equal to the ones measured in [4]. Although
it represents a pessimistic configuration, our algorithm still
yields about 5% improvement over the N-Voltage algorithm.
Higher values of γ offer more optimistic settings owing to
better hardware error-handling technologies. This allows our
algorithm to use lower voltages and thus to save more energy.
In particular, DPn-correct is able to achieve nearly 15% en-
ergy saving compared to the baseline algorithm if there is
a thousand-fold reduction in the failure probability of any
single operation.
Impact of voltage switching cost β.
The second set of experiments focuses on the evaluation
of the algorithms with non-negligible voltage switching costs.
Figures 4(a) and 4(b) present the impact of voltage switching
costs on the expected energy consumption of the algorithms
under different block sizes. These figures are similar
to Figures 2(a) and 2(b), except that the voltage switching
cost between nominal and minimal voltages has been set
to be equivalent to the energy consumed to multiply two
32 × 32 matrices at the nominal voltage without overhead,
i.e., $\beta = V_k^2 \cdot \tau \cdot 32^3/\eta$. This is large enough to have an
impact on the behavior of the algorithms.
Remember that the execution starts at the nominal volt-
age, so in order to lower the voltage for the first time, it
is mandatory to pay some voltage switching cost. For the
DP1 algorithms, designed for a single task, it might not be
beneficial to lower the voltage. This is especially true for
small tasks, where the ratio between voltage switching cost
and computation is relatively high. As a result, they tend to
stick to the nominal (or a high) voltage, leading to more en-
ergy consumption. On the other hand, the DPn algorithms
consider the execution of the entire chain for deriving the
optimal solution. In this case, the high voltage switching
cost is amortized over all tasks. Hence, these algorithms will
continue to explore the lower voltages, which, according to
Figure 3, can still enjoy a good success probability for small
tasks. Note that the N-Voltage algorithm never switches volt-
ages and thus is not affected by the voltage switching cost.
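The remark that the switching cost weighs more heavily on small tasks can be quantified with a short sketch. It assumes that the task is executed at the nominal voltage, so that the common factor $V_k^2 \cdot \tau/\eta$ cancels between the switching cost for a given x (x = 32 in Figures 4(a) and 4(b)) and the task energy, leaving the ratio $x^3/(b(b+1)^2 + \sigma)$. The function name is hypothetical.

```python
def switching_to_task_ratio(b, x=32, sigma=8 ** 3):
    """Ratio of the full-swing switching cost to the energy of one task.

    Both are taken at the nominal voltage, so the ratio reduces to the
    operation counts: x**3 for the switching cost and
    b * (b + 1)**2 + sigma for a task of block size b.
    """
    return x ** 3 / (b * (b + 1) ** 2 + sigma)


small_block = switching_to_task_ratio(16)    # larger than 1
large_block = switching_to_task_ratio(256)   # well below 1%
```

For b = 16 a full voltage swing costs several times the energy of one task, whereas for b = 256 it is a small fraction of it, which is consistent with the DP1 algorithms staying at high voltages for small blocks.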
Lastly, Figure 4(c) presents the impact of β when the block
size is fixed to be b = 128. In particular, it shows the threshold
values of β for which the DP1 algorithms will stop exploring
lower voltages and stick to the nominal voltage instead.
By using the nominal voltage, they consume more energy
and become worse than the DPn algorithms, or even the
N-Voltage algorithm, because of the ABFT overhead. The result
shows that for small voltage switching costs, both DP1 and
DPn yield the same energy, but as soon as the switching cost
reaches a threshold, only the more advanced DPn algorithms
are able to provide energy savings.

Figure 4: Impact of b and β on the expected energy consumption. In (a) and (b), the voltage switching cost is equivalent to
the energy consumed to multiply two 32 × 32 matrices at nominal voltage without overhead, i.e., $\beta = V_k^2 \cdot \tau \cdot 32^3/\eta$.
Panels (a) and (b) plot the expected energy and the normalized expected energy against the block size b. Panel (c) plots the
normalized expected energy against the switching cost $\beta = V_k^2 \cdot \tau \cdot x^3/\eta$, for x from 0 to 120. All panels show DPn-detect,
DPn-correct, DP1-detect, DP1-correct, and N-Voltage.
5. CONCLUSION
In this paper, we have proposed a software-based approach
for reducing the energy consumption of HPC applications.
The approach exploits dynamic voltage overscaling, which
aggressively lowers the supply voltage below the nominal volt-
age. Because this technique introduces timing errors, we have
used ABFT to provide fault tolerance for matrix operations.
Based on a formal model of timing errors, we have derived
an optimal polynomial-time solution to execute a linear chain
of tasks. The evaluation results obtained for matrix multi-
plication demonstrate that our approach indeed leads to sig-
nificant energy savings compared to the standard algorithm
that always operates at (or above) the nominal voltage. The
approach seems quite promising, and we plan to extend it to
deal with other scientific application workflows.
Acknowledgment
This research was funded in part by the European project
SCoRPiO, by the LABEX MILYON (ANR-10-LABX-0070)
of Université de Lyon, within the program “Investissements
d’Avenir” (ANR-11-IDEX-0007) operated by the French Na-
tional Research Agency (ANR), by the PIA ELCI project,
and by the ANR Rescue project. Yves Robert is with Insti-
tut Universitaire de France.
6. REFERENCES
[1] D. M. Brooks, P. Bose, S. E. Schuster, H. Jacobson,
P. N. Kudva, A. Buyuktosunoglu, J.-D. Wellman,
V. Zyuban, M. Gupta, and P. W. Cook. Power-aware
microarchitecture: Design and modeling challenges for
next-generation microprocessors. IEEE Micro,
20(6):26–44, 2000.
[2] J. Choi, J. Dongarra, S. Ostrouchov, A. Petitet,
D. Walker, and R. C. Whaley. The design and
implementation of the ScaLAPACK LU, QR, and
Cholesky factorization routines. Scientific
Programming, 5:173–184, 1996.
[3] R. Dreslinski, M. Wieckowski, D. Blaauw, D. Sylvester,
and T. Mudge. Near-threshold computing: Reclaiming
Moore’s law through energy efficient integrated circuits.
Proceedings of the IEEE, 98(2):253–266, 2010.
[4] D. Ernst, S. Das, S. Lee, D. Blaauw, T. Austin,
T. Mudge, N. S. Kim, and K. Flautner. Razor:
circuit-level correction of timing errors for low-power
operation. IEEE Micro, 24(6):10–20, 2004.
[5] K.-H. Huang and J. A. Abraham. Algorithm-based
fault tolerance for matrix operations. IEEE Trans.
Comput., 33(6):518–528, 1984.
[6] G. Karakonstantis and K. Roy. Voltage over-scaling: A
cross-layer design perspective for energy efficient
systems. In European Conference on Circuit Theory
and Design (ECCTD), pages 548–551, 2011.
[7] P. Krause and I. Polian. Adaptive voltage over-scaling
for resilient applications. In Proceedings of DATE,
pages 1–6, 2011.
[8] S. Ramasubramanian, S. Venkataramani,
A. Parandhaman, and A. Raghunathan.
Relax-and-retime: A methodology for energy-efficient
recovery based design. In Proceedings of DAC, 2013.
[9] N. B. Rizvandi, A. Y. Zomaya, Y. C. Lee, A. J.
Boloori, and J. Taheri. Multiple frequency selection in
DVFS-enabled processors to minimize energy
consumption. In A. Y. Zomaya and Y. C. Lee, editors,
Energy-Efficient Distributed Computing Systems. John
Wiley & Sons, Inc., 2012.
[10] M. Seok, G. Chen, S. Hanson, M. Wieckowski,
D. Blaauw, and D. Sylvester. CAS-FEST 2010:
Mitigating variability in near-threshold computing.
IEEE Journal on Emerging and Selected Topics in
Circuits and Systems, 1(1):42–49, 2011.
[11] T. M. Smith, E. S. Quintana-Orti, M. Smelyanskiy, and
R. A. van de Geijn. Embedding fault-tolerance,
exploiting approximate computing and retaining high
performance in the matrix multiplication. In Workshop
On Approximate Computing (WAPCO), 2015.
