Recomputation Enabled Efficient Checkpointing by Akturk, Ismail & Karpuzcu, Ulya R.
Ismail Akturk
University of Missouri, Columbia
akturki@missouri.edu
Ulya R. Karpuzcu
University of Minnesota, Twin Cities
ukarpuzc@umn.edu
ABSTRACT
Systematic checkpointing of the machine state makes restart
of execution from a safe state possible upon detection of an er-
ror. The time and energy overhead of checkpointing, however,
grows with the frequency of checkpointing. Amortizing this
overhead becomes especially challenging, considering the
growth of expected error rates, as checkpointing frequency
tends to increase with increasing error rates. Based on the
observation that due to imbalanced technology scaling, re-
computing a data value can be more energy efficient than
retrieving (i.e., loading) a stored copy, this paper explores
how recomputation of data values (which otherwise would be
read from a checkpoint from memory or secondary storage)
can reduce the machine state to be checkpointed, and thereby
reduce the checkpointing overhead. Specifically, the resulting
amnesic checkpointing framework AmnesiCHK can reduce
the storage overhead by up to 23.91%, the time overhead by
11.92% on average, and the energy overhead by 12.53% on
average, even in a relatively small scale system.
1. INTRODUCTION
Scalable checkpointing is key to enabling emerging
high-performance computing applications. Ready to expand their
problem sizes as more hardware resources (e.g., more cores
under weak scaling) become available, these applications
challenge processing capabilities. More hardware resources
translate into more components subject to errors, which,
along with a higher expected component error rate as an
artifact of technology scaling, results in a higher probability
of (system-wide) errors. Therefore, proper error detection
and recovery become a must for successful completion of
any execution.
Systematic (often, periodic) checkpointing of the machine
state enables backward error recovery (BER) upon detec-
tion of an error, by rolling back to and restarting execution
from a safe (i.e., error-free and consistent) machine state. En-
ergy and time overhead of checkpointing the machine state,
however, grow with the frequency of checkpointing. The
expected increase in error rates makes amortization of this
overhead especially challenging, as a higher probability of
error directly implies more frequent checkpointing.
The overhead of BER spans the overhead of checkpointing
and the overhead of recovery (which entails roll-back +
restart). The time or energy overhead of checkpointing, $o_{chk}$,
applies every time the system generates a checkpoint; the
time or energy overhead of recovery, $o_{rec}$, applies every time the
execution restarts from the most recent checkpointed (safe)
state after detection of an error. Depending on the interaction
among parallel tasks of execution during checkpointing and
recovery, BER schemes typically form two major classes:
coordinated and uncoordinated [1, 2]. Coordinated schemes
enforce tight lock-step coordination (i.e., synchronization)
among all parallel tasks every time the system generates a
checkpoint or triggers recovery, and hence, generally incur a
higher overhead. Uncoordinated schemes address this over-
head by omitting coordination or confining it only to tasks
interacting with each other during computation, which as
a downside complicates the establishment of a consistent
error-free global state.
The checkpointing overhead, $o_{chk}$, is proportional to the
time or energy spent on storing the checkpointed state (to
memory or secondary storage), $o_{wr,chk}$, and the number of
checkpoints, $\#chk$ (which represents a proxy for the
checkpointing frequency). Putting it all together,
$$o_{chk} = \#chk \times o_{wr,chk} \qquad (1)$$
applies. The recovery overhead, $o_{rec}$, on the other hand,
includes the time or energy (spent on useful work and) lost
since the most recent safe checkpoint, $o_{waste}$, and the time
or energy spent on restoring the state captured by the most
recent safe checkpoint, $o_{roll-back}$. Under an error probability
of $p_{err}$, which dictates the number of recoveries, the recovery
overhead becomes:
$$o_{rec} = p_{err} \times (o_{waste} + o_{roll-back}) \qquad (2)$$
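As a quick numerical illustration of Equations 1 and 2, the following Python sketch evaluates both overheads. The function names and all input values are hypothetical placeholders chosen for illustration; they are not measurements from this study.

```python
def checkpoint_overhead(num_chk, o_wr_chk):
    """Equation 1: o_chk = #chk * o_wr,chk."""
    return num_chk * o_wr_chk


def recovery_overhead(p_err, o_waste, o_roll_back):
    """Equation 2: o_rec = p_err * (o_waste + o_roll-back)."""
    return p_err * (o_waste + o_roll_back)


# Hypothetical inputs in arbitrary time units, for illustration only.
o_chk = checkpoint_overhead(num_chk=100, o_wr_chk=2.0)
o_rec = recovery_overhead(p_err=0.5, o_waste=30.0, o_roll_back=10.0)
print(o_chk, o_rec)  # 200.0 20.0
```

With these placeholder values, the checkpointing term dominates, which is the regime where shrinking the per-checkpoint cost pays off most.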
Imbalances in technology scaling render the energy con-
sumption (and latency) of data storage and communication
significantly higher than the energy consumption (and la-
tency) of actual data generation, i.e., computation [3, 4]. As
a result, whenever a data value is needed (i.e., has to be
loaded from memory), re-generating (i.e., recomputing) the
respective value can easily become more energy-efficient than
retrieving the stored copy from memory [5]. During recovery,
recomputation of a data value, which otherwise would be
read from a checkpoint, can therefore be less energy hungry
and time consuming than retrieving the respective check-
point from main memory or secondary storage. This can
further eliminate the need for checkpointing such recom-
putable data values, which would never be retrieved from
memory or secondary storage, but recomputed. The result is
an amnesic BER framework, AmnesiCHK, which can opportunistically
omit checkpointing of (recomputable) data values,
and thereby can reduce the machine state to be checkpointed,
by relying on the ability to recompute the respective data
values when needed during recovery.
arXiv:1710.04685v2 [cs.DC] 21 Mar 2018
Under recomputation, time or energy spent on storing
the checkpointed state, $o_{wr,chk}$, can decrease since a
(recomputable) subset of the updated memory values would be
omitted from checkpointing. This in turn can decrease $o_{chk}$, even
if $\#chk$ remains the same. However, the recovery overhead
$o_{rec}$ now has to incorporate the overhead of recomputation (of
the values which were omitted from checkpointing), $o_{rcmp}$.
Still, we expect the time or energy spent on restoring the state
of the most recent safe checkpoint, $o_{roll-back}$, to decrease,
since the size of checkpoints would simply reduce under
recomputation. Putting it all together, the recovery overhead
under recomputation becomes:
$$o_{rec,rcmp} = p_{err} \times (o_{waste,rcmp} + o_{roll-back,rcmp} + o_{rcmp}) \qquad (3)$$
Therefore, for AmnesiCHK to hold recovery overhead at bay,
$o_{rec,rcmp} \le o_{rec}$ should be the case, which implies:
$$o_{roll-back,rcmp} + o_{rcmp} \le o_{roll-back} \qquad (4)$$
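The step from Equations 2 and 3 to Equation 4 can be spelled out as follows. The derivation rests on two assumptions that are not stated explicitly in the text: the error probability is positive, and the work lost since the most recent safe checkpoint is unaffected by recomputation, i.e., the two waste terms are equal.

```latex
\begin{align*}
o_{rec,rcmp} &\le o_{rec} \\
p_{err}\,(o_{waste,rcmp} + o_{roll-back,rcmp} + o_{rcmp})
  &\le p_{err}\,(o_{waste} + o_{roll-back}) \\
o_{waste,rcmp} + o_{roll-back,rcmp} + o_{rcmp}
  &\le o_{waste} + o_{roll-back}
  && \text{(assuming } p_{err} > 0\text{)} \\
o_{roll-back,rcmp} + o_{rcmp}
  &\le o_{roll-back}
  && \text{(assuming } o_{waste,rcmp} = o_{waste}\text{)}
\end{align*}
```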
Recomputation in this case is fundamentally different from
classic replay: recomputation refers to the recalculation of
a data value to cut any energy-hungry memory access asso-
ciated with the respective value. This can be regarded as
restricted replay of a small backward slice of instructions just
to generate that respective data value.
In this paper, we explore how AmnesiCHK can help reduce
the overhead of checkpointing without compromising the
overhead of recovery in terms of time, energy, and storage.
AmnesiCHK is:
• hybrid (hardware/software): AmnesiCHK relies on a com-
piler pass to generate (and embed into the binary) instruc-
tions required to recompute the respective data values,
which can be excluded from checkpointing. Under re-
covery, AmnesiCHK’s runtime scheduler in turn triggers
recomputation of these values.
• transparent: Both amnesic binary generation and the triggering
of recomputation upon recovery are transparent to the
application developer and user.
• low overhead: AmnesiCHK trades the data storage and
retrieval overhead of checkpointing for the overhead of
recomputing the respective data values. AmnesiCHK can
significantly reduce the overhead of checkpointing, while
holding recomputation-incurred overheads (particularly
during recovery) at bay.
• scalable: Traditional checkpointing and recovery become
more challenging at larger scale. AmnesiCHK can effectively
reduce the checkpoint size and hence is, by construction,
more scalable.
In the following, we will detail a proof-of-concept AmnesiCHK
implementation. Specifically, Section 2 provides the back-
ground; Section 3 discusses AmnesiCHK basics; Sections 4
and 5 provide the evaluation; Section 6 covers the related
work; and Section 7 concludes the paper.
2. BACKGROUND
2.1 Backward Error Recovery (BER)
Checkpointing: Checkpointing establishes a
safe (i.e., error-free and consistent) machine state to roll
back to and recover from upon detection of an error, thereby
ensuring forward progress in execution in the presence of
errors. Without loss of generality, we consider shared mem-
ory many-cores featuring directory-based cache coherence.
We start our analysis with global coordinated checkpointing
and recovery [6, 7, 8, 9], but provide a sensitivity study for
local coordinated schemes [10, 11], as well. Under global
checkpointing, all cores periodically cooperate to checkpoint
the respective machine state. Specifically, at the beginning
of each checkpointing period, all cores stop computation to
participate in checkpoint generation.
As a running example (and a relatively lower-overhead
baseline for comparison, not to favor AmnesiCHK), we will
use a log-based incremental in-memory checkpointing vari-
ant similar to [12, 8, 9], where upon each memory update,
a record for the old value goes into a log stored in memory.
This log corresponds to the checkpoint. The log constitutes
a record of values updated only within the time window be-
tween two consecutive checkpointing events, as opposed to
the entire machine state. Establishing a checkpoint involves
writing all dirty cache lines back to memory and recording
(the rest of) each core’s architectural state. For dirty lines,
the memory controller only updates the log with the corre-
sponding old value, if the update represents the very first
modification since the last checkpoint. Thus, similar to [8],
a modified cache line gets logged only once between a pair
of consecutive checkpoints. The directory controller keeps
an additional bit per memory line to keep track of whether
the line has already been logged for the current checkpoint
interval. The controller sets this bit upon logging the line, and
clears it upon establishing a new checkpoint. In the following,
we will refer to this bit as log.
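The logging rule above can be sketched in a few lines of Python. This is a minimal software model under stated assumptions, not the hardware mechanism of [8]: the class and method names are hypothetical, memory is modeled as a dictionary, and `None` stands for "address not written before".

```python
class LoggedMemory:
    """Memory with an undo log: one old value logged per line per interval."""

    def __init__(self):
        self.mem = {}         # address -> current value
        self.log = []         # (address, old value) records of this interval
        self.log_bit = set()  # addresses whose log bit is currently set

    def write_back(self, addr, value):
        # Log the old value only on the first update since the last checkpoint.
        if addr not in self.log_bit:
            self.log.append((addr, self.mem.get(addr)))
            self.log_bit.add(addr)
        self.mem[addr] = value

    def establish_checkpoint(self):
        # A new interval starts: hand back the finished log, clear all log bits.
        finished, self.log = self.log, []
        self.log_bit.clear()
        return finished

    def roll_back(self):
        # Undo the current interval by restoring the logged old values.
        for addr, old in reversed(self.log):
            if old is None:
                self.mem.pop(addr, None)
            else:
                self.mem[addr] = old
        self.log.clear()
        self.log_bit.clear()


m = LoggedMemory()
m.write_back(0x10, 1)
m.establish_checkpoint()
m.write_back(0x10, 2)
m.write_back(0x10, 3)   # second update in the interval: not logged again
logged = list(m.log)    # [(16, 1)]
m.roll_back()
print(logged, m.mem)    # [(16, 1)] {16: 1}
```

The second write to the same address leaves the log untouched, which is why the log size tracks the number of unique addresses updated in an interval rather than the number of stores.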
In-memory checkpointing, by construction, incurs a lower
time and energy overhead when compared to (more tradi-
tional) checkpointing to secondary storage. In-memory check-
pointing may correspond to a stand-alone checkpointing
scheme or represent the first level in a hierarchical check-
pointing framework. Our observations generally apply under
both options.
Figure 1: Recovery from an error.
Error Detection and Recovery:
In the following, we assume a fail-stop error model, where
data memory and checkpoint logs do not suffer from any
errors, similar to [12]. Various protection mechanisms such
as ECC [13] or memory raiding [14] can achieve this. To
detect errors, the system can rely on modular redundancy [15]
or error detection codes (e.g., CRC). Error detection is not
instantaneous; therefore, a lag between the occurrence of an
error and its detection generally applies, which is referred to
as error detection latency. As a consequence, corrupted state
may get checkpointed, even if the error detection latency is
no longer than the checkpoint period. Figure 1 illustrates
an example, where an error occurs right before Ckpt2 gets
taken, and is detected only after Ckpt2 is established, thereby
corrupting the respective checkpointed state. In this particular
case, the time elapsed between establishment of Ckpt2 and the
detection of the error is less than the error detection latency,
hence, there is no guarantee for Ckpt2 to be error-free. To
recover from the error, the system should roll-back to the
second most recent checkpoint at hand, i.e., Ckpt1, instead
of the most recent Ckpt2. If the error detection latency is no
longer than the checkpoint period, which applies throughout
this study, keeping the two most recent checkpoints suffices.
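The choice of roll-back target in Figure 1 can be expressed as a small Python sketch. The function name, the time values, and the treatment of the boundary case (a checkpoint exactly one detection latency old counts as safe) are assumptions made for illustration.

```python
def select_safe_checkpoint(ckpt_times, detect_time, detection_latency):
    """Return the time of the most recent checkpoint guaranteed error-free.

    A checkpoint is trusted only if at least `detection_latency` elapsed
    between its establishment and the detection of the error.
    """
    safe = [t for t in ckpt_times if detect_time - t >= detection_latency]
    if not safe:
        raise ValueError("no checkpoint is guaranteed to be error-free")
    return max(safe)


# Hypothetical timeline: Ckpt1 at t=100, Ckpt2 at t=200, latency of 50.
print(select_safe_checkpoint([100, 200], detect_time=230, detection_latency=50))  # 100
print(select_safe_checkpoint([100, 200], detect_time=260, detection_latency=50))  # 200
```

In the first call only 30 time units separate Ckpt2 from detection, so Ckpt2 cannot be trusted and the roll-back target is Ckpt1, as in Figure 1.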
2.2 Data Recomputation for Energy Efficiency
Imbalances in technology scaling render the energy consump-
tion (and latency) of data storage and communication signifi-
cantly higher than the energy consumption (and latency) of
actual data generation, i.e., computation [3, 4]. As a result,
whenever a data value is needed (i.e., has to be loaded from
memory), re-generating (i.e., recomputing) the respective
value can easily become more energy-efficient than retrieving
the stored copy from memory [5]. The basic idea behind
data recomputation is to eliminate memory accesses (be it a
read, or a write) by relying on the ability to recalculate the
respective data values, when needed. To this end, the system
has to record the sequence of instructions which can produce
the respective data values. As a representative example, the
recently proposed Amnesiac machine [5] details compiler
and (micro)architecture support for opportunistic substitu-
tion of memory reads with a sequence of arithmetic/logic
instructions to recompute the data values which would oth-
erwise be retrieved from the memory hierarchy. Following
Amnesiac’s terminology, we will refer to these sequences of
instructions as RSlices, each forming a backward slice of arith-
metic/logic instructions. To perform recomputation along an
RSlice, its input operands should be available at the expected
time of recomputation. Not all RSlice input operands suit
themselves to (re)generation by recomputation, particularly,
if input operands correspond to read-only values residing in
memory (e.g., program inputs), or register values which are
overwritten at the time of recomputation. Amnesiac refers
to such input operands as non-recomputable inputs, and to
make sure that they are available at the anticipated time of re-
computation, stores them in designated buffers. To facilitate
recomputation, we assume similar hardware-software sup-
port as Amnesiac, with Section 3 detailing the fundamental
differences.
3. AmnesiCHK BASICS
In this section, we cover the basics and execution semantics of
a practical AmnesiCHK implementation under checkpointing,
and recovery upon the onset of an error.
Impact on Checkpointing: At the end of each checkpoint-
ing interval, AmnesiCHK identifies and omits the recom-
putable subset of data values (which otherwise would be
included in the checkpoint being taken) from checkpointing.
Thereby, AmnesiCHK can reduce the checkpoint size, which
in turn reduces the $o_{wr,chk}$ component of the checkpointing
overhead per Equation 1, i.e., the time or energy spent on
storing the checkpointed state to memory. At the extreme, all
values which otherwise would be included in a checkpoint
may be recomputable. If this is the case, AmnesiCHK would
also be able to eliminate a subset of checkpoints entirely,
and thereby reduce the $\#chk$ component of the checkpointing
overhead per Equation 1, i.e., the number of checkpoints.
Impact on Recovery: Upon the onset of an error, the am-
nesic recovery handler triggers the recomputation of any data
value which was omitted from the checkpoint being restored.
Such recomputation incurs the overhead captured by $o_{rcmp}$
in Equation 3, but, at the same time, can cut back on the
time or energy spent on restoring the checkpointed state from
memory (i.e., $o_{roll-back}$ in Equation 2).
Overview: AmnesiCHK trades the checkpoint storage and
retrieval overhead from memory for the overhead of recom-
puting the respective data values. Accordingly, any practical
AmnesiCHK implementation has to address:
• how to identify recomputable data values in a checkpoint
interval;
• how to omit recomputable data values from a checkpoint;
and
• how to trigger recomputation of the respective data values
during recovery.
3.1 Amnesic Checkpointing
We will first cover how to identify recomputable data values
which can be omitted from checkpointing.
Compiler Support: AmnesiCHK relies on a compiler pass
to identify recomputable data values, which can be omitted
from checkpointing. Under incremental in-memory check-
pointing (Section 2.1), only a subset of the store instruc-
tions would trigger checkpointing (specifically, only the first
updates to the same memory address). The compiler pass
therefore tracks store instructions, and using data dependency
graphs, extracts backward slices, i.e., sequences of arith-
metic/logic instructions which produce the respective data
values to be stored. Following the terminology from [5],
we refer to each such backward slice as an RSlice. Fig. 2
shows an example, where the arrows point in the direction
of dataflow, and each node corresponds to an instruction. In-
structions i3, i4, i5 are producers of the (input operands of)
instruction i2; instructions i1 and i2, of the value v to be stored
by the store instruction st(v). Depending on the specifics of
the instruction set architecture (ISA), such backward slices
can take different forms.
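Backward slice extraction over a data dependency graph can be sketched as a depth-first traversal from the store. The Python below is an illustrative model only: the dictionary encoding and the function name are hypothetical, and the graph mirrors the dependencies described for Fig. 2.

```python
def backward_slice(producers, root):
    """Collect all instructions the value consumed by `root` depends on.

    `producers` maps an instruction to the instructions producing its inputs.
    The result lists producers before their consumers, so executing it in
    order regenerates the value.
    """
    order, seen = [], set()

    def visit(node):
        if node in seen:
            return
        seen.add(node)
        for p in producers.get(node, []):
            visit(p)
        order.append(node)

    for p in producers.get(root, []):
        visit(p)
    return order


# Dependencies of Fig. 2: i3, i4, i5 feed i2; i1 and i2 feed st(v).
fig2 = {"st(v)": ["i1", "i2"], "i2": ["i3", "i4", "i5"]}
rslice = backward_slice(fig2, "st(v)")
print(rslice)  # ['i1', 'i3', 'i4', 'i5', 'i2']
```

The store itself is not part of the slice; only the arithmetic/logic instructions that produce v are kept.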
In selecting which RSlices to embed into the binary, the
compiler has a choice. One option is to use probabilistic
analysis to estimate the anticipated cost of recomputation along
each RSlice when compared to reading, i.e., loading the
respective data value from a checkpoint in memory, and to
include the RSlice only if it is more cost-effective (where cost
can be delay, energy or a combination of both, without loss
of generality). In this study, we instead take a greedier
approach of minimal complexity, and consider all RSlices
which have fewer instructions than a preset threshold
(which typically remains less than 10; Section 5
quantifies the impact). The insight is that the overhead of
recomputation along an RSlice increases with its number of
instructions. Therefore, capping the instruction count can
effectively hold recomputation overhead under control (as we
will further demonstrate in Section 5.5.1).
Figure 2: Backward recomputation slice (RSlice).
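The two selection policies can be contrasted in a short Python sketch. Everything here is a simplification for illustration: the function names are hypothetical, the cost-based variant assumes a uniform per-instruction cost instead of a probabilistic analysis, and the slices are made-up placeholders.

```python
def select_by_length(rslices, threshold):
    """Greedy option: keep RSlices with fewer instructions than `threshold`."""
    return {st: insts for st, insts in rslices.items() if len(insts) < threshold}


def select_by_cost(rslices, cost_per_inst, load_cost):
    """Cost-based option: keep an RSlice only if recomputing is cheaper
    than loading the value from the checkpoint (uniform instruction cost)."""
    return {st: insts for st, insts in rslices.items()
            if len(insts) * cost_per_inst < load_cost}


# Hypothetical slices keyed by store; instruction names are placeholders.
slices = {"st_a": ["i1", "i2", "i3"], "st_b": [f"i{k}" for k in range(12)]}
print(sorted(select_by_length(slices, threshold=10)))       # ['st_a']
print(sorted(select_by_cost(slices, 1.0, load_cost=20.0)))  # ['st_a', 'st_b']
```

The length cap needs no cost model at all, which is the appeal of the greedy approach; the cost-based variant can admit longer slices when loads are expensive enough.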
The next question is how to embed RSlices into the binary,
to facilitate invocation upon recovery. The only critical piece
of information is associating the start address of each RSlice
(i.e., the address of the first instruction in the backward slice)
with the memory address of the respective data value (which
will be regenerated by recomputation along the RSlice). Such
memory addresses correspond to the destination memory
addresses of the stores, and the compiler uses each such store
as a proxy in identifying target values for recomputation.
One way to communicate this information to the runtime
is introducing a special instruction to associate these two
effective addresses (and enforcing atomic execution of it with
the corresponding store). We will refer to this instruction as
ASSOC-ADDR.
While the compiler analysis to bake recomputing instruc-
tions into the binary looks similar to the compiler pass in [5],
there is a fundamental difference: The goal in [5] is swapping
each energy-hungry load with an RSlice to recompute the
respective data value (which otherwise would be loaded from
the memory hierarchy). In this case, the swapped load in-
structions are never performed. In exploiting recomputation
for checkpointing, on the other hand, AmnesiCHK leaves
load instructions intact, and only tracks store instructions to
identify data values which can be omitted from checkpointing.
In this case, the corresponding store instructions are always
performed; what is omitted is the inclusion of the respective
(recomputable) data value into the corresponding checkpoint.
Amnesic Checkpoint Handler: Each time an ASSOC-ADDR
instruction is encountered, the amnesic checkpoint handler records
the corresponding <memory address,RSlice address>
association into a dedicated buffer called Address Map (AddrMap).
Next, the handler asks the memory controller to
exclude the corresponding (recomputable) value from the
next checkpoint (which is achieved by setting the dedicated
log bit, as explained in Section 2.1). Eventually, the size of
the next checkpoint reduces as more (recomputable) values
are excluded from checkpointing via ASSOC-ADDR instruc-
tions. Such <memory address,RSlice address> pairs
have to remain in AddrMap as long as the established check-
point for the corresponding interval remains in memory, such
that upon detection of an error, recomputation along RSlices
can restore the values omitted from checkpointing, in coor-
dination with the established checkpoint for roll-back. As
covered in Section 2.1, under the assumption that the error
detection latency does not exceed the checkpointing period,
retaining the two most recent checkpoints suffices. Therefore,
AddrMap only needs to retain the mappings recorded by
ASSOC-ADDR for the two most recent checkpoints.
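A software model of the handler's bookkeeping might look as follows. This is a sketch under assumptions, not the proposed hardware: the class names, method names, and the per-interval organization of AddrMap are hypothetical, and the memory controller is reduced to a stub that only tracks log bits.

```python
from collections import deque


class MemoryController:
    """Stub: a set log bit means the line is not logged in this interval."""

    def __init__(self):
        self.log_bit = set()

    def exclude_from_checkpoint(self, addr):
        self.log_bit.add(addr)

    def new_interval(self):
        self.log_bit.clear()


class AmnesicCheckpointHandler:
    def __init__(self, controller, kept_checkpoints=2):
        self.controller = controller
        # One {memory address: RSlice address} map per retained checkpoint.
        self.addr_map = deque([{}], maxlen=kept_checkpoints)

    def assoc_addr(self, mem_addr, rslice_addr):
        # Models the effect of one ASSOC-ADDR instruction.
        self.addr_map[-1][mem_addr] = rslice_addr
        self.controller.exclude_from_checkpoint(mem_addr)

    def establish_checkpoint(self):
        # Appending drops the oldest map once two maps are retained.
        self.addr_map.append({})
        self.controller.new_interval()


h = AmnesicCheckpointHandler(MemoryController())
h.assoc_addr(0x10, 0x400)
h.establish_checkpoint()
h.assoc_addr(0x20, 0x480)
h.establish_checkpoint()
h.assoc_addr(0x30, 0x500)
print(list(h.addr_map))  # [{32: 1152}, {48: 1280}]
```

The bounded deque captures the retention rule: once a third interval starts, the mappings of the oldest one are no longer needed for any roll-back and are discarded.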
3.2 Amnesic Recovery
Upon detection of an error, the amnesic recovery handler orchestrates
roll-back to the most recent safe global recovery line,
by triggering recomputation along RSlices for each value ex-
cluded from checkpointing, in coordination with the restora-
tion of the most recent safe checkpoint. There is no need
for separate bookkeeping for the values missing from the
most recent safe checkpoint, since AddrMap contains all the
necessary information to fire recomputation of these values
along the respective RSlices. After recomputing the missing
values and storing them back to their destination addresses,
the amnesic recovery handler restores the remaining state in the
checkpoint, and resumes execution from this point onward.
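The recovery order described above (recompute the omitted values first, then restore the rest from the checkpoint) can be sketched in Python. The function name and data layout are hypothetical, and each RSlice is modeled as a zero-argument callable standing in for the execution of the embedded instruction sequence.

```python
def amnesic_recover(mem, log, addr_map, rslices):
    """Restore `mem` to the checkpointed state.

    mem:      dict, address -> value (current, possibly corrupted)
    log:      list of (address, old value) records kept in the checkpoint
    addr_map: dict, memory address -> RSlice address
    rslices:  dict, RSlice address -> callable recomputing the omitted value
    """
    # 1) Recompute every value that was omitted from the checkpoint and
    #    store it back to its destination address.
    for addr, rslice_addr in addr_map.items():
        mem[addr] = rslices[rslice_addr]()
    # 2) Restore the remaining state from the logged old values.
    for addr, old in reversed(log):
        mem[addr] = old
    return mem


# Hypothetical state: 0x10 was omitted (recomputable), 0x20 was logged.
mem = {0x10: 99, 0x20: 77}
restored = amnesic_recover(
    mem,
    log=[(0x20, 5)],
    addr_map={0x10: 0x400},
    rslices={0x400: lambda: 3 + 4},
)
print(restored)  # {16: 7, 32: 5}
```

Note that step 1 still writes each recomputed value to memory, which is the cost the following paragraph refers to.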
In this study, we confine recomputation to memory val-
ues only. Therefore, upon recomputation of a missing value
from the checkpoint, we have to access memory to store the
respective value. Register values are checkpointed, as well,
as part of the architectural state, but are not considered for
recomputation. This is likely to render the proof-of-concept
AmnesiCHK implementation conservative, as a register value
would not incur an expensive memory write upon recompu-
tation. In the end, during recovery, AmnesiCHK can only
cut the overhead of retrieving (i.e., loading) the checkpointed
state from memory (due to the omission of recomputable
values from the checkpoint), which can be easily masked by
the overhead of writing such omitted (memory) values back
to memory upon recomputation.
3.3 Microarchitecture Support
To facilitate amnesic checkpointing, the memory controller
takes a similar form to [8], and maintains the log bit to
determine if the old value of a given write-back should be
logged (i.e., checkpointed). For each write-back request,
the memory controller has to decide (i) whether the request
would result in the first update to the respective memory
line since the last checkpoint was taken, and (ii) whether the
current data value v of the respective memory line (i.e., the
value before the write-back takes place) can be recomputed.
While the memory controller can manage the log bit itself
for (i), it should coordinate with the amnesic checkpoint handler
for (ii). As explained in Section 3.1, upon encountering a
recomputable value, the amnesic checkpoint handler sends
a request to the memory controller to let it know that the
respective value v can be recomputed, and therefore, should
be omitted from checkpointing. The memory controller sets
the log bit accordingly, when it receives such requests from
the amnesic checkpoint handler.
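Because both conditions end up encoded in the same log bit, the write-back path needs only one check. The Python below is a minimal sketch of that decision; the function name and the dictionary/set representation are assumptions for illustration.

```python
def on_write_back(addr, new_value, mem, log, log_bit):
    """Handle one write-back; log the old value only if the log bit is clear.

    The bit is set either by an earlier write-back in this interval (i) or
    by a request from the amnesic checkpoint handler (ii).
    """
    if addr not in log_bit:
        log.append((addr, mem.get(addr)))
        log_bit.add(addr)
    mem[addr] = new_value


mem = {0x10: 10, 0x20: 20}
log, log_bit = [], {0x20}                    # 0x20 was reported as recomputable
on_write_back(0x10, 11, mem, log, log_bit)   # first update: old value logged
on_write_back(0x20, 21, mem, log, log_bit)   # old value 20 is not logged
on_write_back(0x10, 12, mem, log, log_bit)   # not the first update
print(log)  # [(16, 10)]
```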
The number of (stores corresponding to the) values that
can be excluded from checkpointing depends on the size of
AddrMap, specifically, on how many RSlices AddrMap can
keep track of. Fortunately, we do not need an excessively
large AddrMap to this end: Recall that we only need to checkpoint
the old values upon the very first write-backs (to unique
addresses) when a new checkpoint is established. Therefore,
the number of RSlices is not a function of how many times an
address is updated, but how many unique memory addresses
are updated within a given checkpoint interval. Naturally,
the latter is bounded by the period of checkpointing. As the
period gets longer, the probability of having a higher number
of unique memory addresses updated increases. At the same
time, as the period gets longer, the amount of useful work
lost upon detection of an error increases. To limit this loss,
the checkpointing period cannot grow too long.
The checkpointing period hence puts an upper
bound on how many unique RSlices we should keep track
of at runtime. Finally, to prevent corruption of architectural
state during recomputation, AmnesiCHK relies on a
renaming scheme similar to that of [5].
3.4 Putting It All Together
AmnesiCHK can reduce the number of values to be logged
for checkpointing, and thereby reduce both the performance
and energy overhead of checkpointing. AmnesiCHK can also
reduce the size of each checkpoint, and thereby the storage
overhead, by cutting the number of values to be checkpointed
in each interval. A reduction in checkpoint size can easily
translate into energy savings, as well as performance gain,
due to the lower number of expensive memory read (dur-
ing recovery) and write operations (during checkpointing),
respectively.
Recovery upon detection of an error involves recomputa-
tion of missing values from the checkpoint and restoring the
rest of the state using the established checkpoint. Recom-
putation along each RSlice incurs a performance and energy
overhead; however, it is not prohibitive since the number of
instructions in RSlices is bounded. During recovery, AmnesiCHK
introduces the extra overhead of recomputation, but
at the same time, it reduces the number of values to be read
from the checkpoint in memory for restoration. The benefit of
the latter may or may not be comparable to the overhead of re-
computation. However, considering the anticipated frequency
of checkpointing and recovery, one can argue that recovery is
a much less frequent event compared to checkpointing, thus
AmnesiCHK’s gain under checkpointing is more likely to
outweigh its potential loss under recovery.
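The amortization argument can be made concrete by adding Equation 1 to Equation 2 (baseline) or Equation 3 (amnesic). The Python sketch below uses purely hypothetical numbers, picked so that recovery alone becomes more expensive under recomputation while the total overhead still drops; the function name is likewise an assumption.

```python
def total_overhead(num_chk, o_wr_chk, p_err, o_waste, o_roll_back, o_rcmp=0.0):
    """Checkpointing overhead (Eq. 1) plus recovery overhead (Eq. 2 or 3)."""
    return num_chk * o_wr_chk + p_err * (o_waste + o_roll_back + o_rcmp)


# Hypothetical values in arbitrary units; one recovery in both cases.
baseline = total_overhead(num_chk=100, o_wr_chk=2.0, p_err=1.0,
                          o_waste=30.0, o_roll_back=10.0)
amnesic = total_overhead(num_chk=100, o_wr_chk=1.5, p_err=1.0,
                         o_waste=30.0, o_roll_back=8.0, o_rcmp=6.0)
print(baseline, amnesic)  # 240.0 194.0
```

Here the recovery term grows from 40 to 44 units, yet the saving accumulated over 100 checkpoints (50 units) more than compensates for it.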
4. EVALUATION SETUP
To evaluate the impact of amnesic checkpointing and recovery
on execution time and energy, we experimented with
eight benchmarks from the NAS [16] suite (with the exception
of ep due to simulation complications). We ran these
benchmarks with 8-32 threads on a simulated 8-32 core system.
We implemented recomputation, checkpointing, and
recovery under AmnesiCHK in Snipersim [17]. We extracted
energy estimates from McPAT [18] integrated with Snipersim.
Table 1 summarizes the configuration for the simulated
architecture.
We implemented AmnesiCHK’s compiler pass to embed
RSlices into the binary as a Pin [19] tool. Recall that Snipersim
relies on a Pin-based front-end, which facilitated seamless
integration. We used a predetermined threshold for
RSlice length: RSlices exceeding the threshold are excluded
from the binary to prohibit excessive recomputation overhead
along RSlices. In Section 5.5.1, we will discuss the
impact of the threshold value on checkpointing overhead.
Technology node        22nm
Operating frequency    1.09 GHz
Core                   4-issue, in-order, 8 outstanding ld/st
L1-I (LRU)             32KB, 4-way, 3.66ns
L1-D (LRU, WB)         32KB, 8-way, 3.66ns
L2 (LRU, WB)           512KB, 8-way, 24.77ns
Main Memory            120ns, 7.6 GB/s/controller, 1 mem. contr. per 4 cores
Network Bandwidth      128 GB/s
Table 1: Simulated architecture.
We considered the following configurations:
• NoCkpt: Error-free execution without any checkpointing
or recovery support. This baseline does not incur any
checkpointing or recovery overhead.
• CkptNE: Periodic coordinated global checkpointing under
error-free execution, which incurs no recovery overhead.
Only checkpointing overhead becomes visible.
• CkptE: Periodic coordinated global checkpointing in the
presence of errors, such that recovery overhead becomes
visible on top of checkpointing overhead.
• AmnNE: AmnesiCHK incorporated into coordinated global
checkpointing, under error-free execution, which incurs
no recovery overhead. Only checkpointing overhead becomes
visible. AmnesiCHK can reduce checkpoint size by
omitting data values from checkpointing.
• AmnE: AmnesiCHK incorporated into coordinated global
checkpointing, in the presence of errors, such that recovery
overhead becomes visible on top of checkpointing overhead.
AmnesiCHK can reduce checkpoint size by omitting
data values, which can be recomputed upon recovery, from
checkpointing.
• CkptNE,Loc: Coordinated local checkpointing under error-free
execution, which incurs no recovery overhead. Only
checkpointing overhead becomes visible.
• CkptE,Loc: Coordinated local checkpointing in the presence
of errors, such that recovery overhead becomes visible
on top of checkpointing overhead.
• AmnNE,Loc: AmnesiCHK incorporated into coordinated
local checkpointing, under error-free execution, which incurs
no recovery overhead. Only checkpointing overhead
becomes visible. AmnesiCHK can reduce checkpoint size by
omitting data values from checkpointing.
• AmnE,Loc: AmnesiCHK incorporated into coordinated local
checkpointing, in the presence of errors, such that recovery
overhead becomes visible on top of checkpointing
overhead. AmnesiCHK can reduce checkpoint size by
omitting data values, which can be recomputed upon recovery,
from checkpointing.
We adjust the checkpointing frequency to the expected
error rates and the execution times of the applications. With-
out loss of generality, we distribute the checkpoint intervals
uniformly over the execution time. As a result, applications
with longer execution times checkpoint more.
5. EVALUATION
5.1 Checkpointing Overhead
We start the evaluation with a characterization of the check-
pointing overhead under AmnesiCHK. For a crisp compari-
son, we use the configurations from Section 4 under error-free
execution, which only incur the overhead of checkpointing.
Specifically, we use NoCkpt as a baseline for comparison,
where no checkpointing takes place. Fig. 3 shows the execu-
tion time overhead of checkpointing and recovery. The first
and third columns in each group show the execution time
overhead of checkpointing for the evaluated benchmarks under
CkptNE and AmnNE, respectively. As expected, CkptNE
and AmnNE perform consistently worse than NoCkpt due to
the checkpointing overhead. However, via recomputation,
AmnNE is very effective in reducing CkptNE’s time overhead
due to checkpointing, by up to 28.81% (for is), and by
11.92% on average. The smallest reduction is 2.12% for cg,
where CkptNE’s time overhead is already relatively low. This
is because cg’s checkpoint size per checkpointing interval
is relatively small, and the time spent in checkpointing
accounts for only ≈ 9% of the total execution time.
Fig. 4 shows the corresponding energy overhead of checkpointing
and recovery, normalized to NoCkpt. The first and
third columns in each group show the energy overhead of
checkpointing for the evaluated benchmarks under CkptNE
and AmnNE, respectively. The general trend is similar to
the time overhead. AmnNE reduces the energy overhead of
CkptNE by up to 26.93% (for is), and by 12.53% on average.
Among the benchmarks, is is very amenable to recomputation:
as the majority of the updated memory values can
be recomputed (in case of recovery), AmnNE can exclude
these from checkpoints, which leads to a higher reduction in
checkpointing overhead w.r.t. CkptNE. The smallest energy
reduction is 1.75% (for cg), in line with Fig. 3.
5.2 Recovery Overhead
In Section 5.1, we characterized purely the overhead of check-
pointing by assuming error-free execution where periodic
checkpointing still takes place. In this section, the goal is
quantifying the overhead of recovery, in the presence of errors.
Recovery requires the establishment of a globally consistent
state among all cores. For CkptE , this translates into each core
rolling back to restore the machine state corresponding to the
most recently established checkpoint. This also applies to
AmnE , but AmnE needs to recompute the data values omitted
from checkpointing, on top. Such data values have the corre-
sponding RSlices baked into the binary. Therefore, although
AmnE can reduce the checkpointing overhead, it incurs an
extra overhead due to recomputation during recovery. In Fig. 3,
the second and fourth columns in each group show the execution
time overhead of CkptE and AmnE, respectively (w.r.t.
NoCkpt). Note that under CkptE and AmnE, an error occurs
during execution. As expected, we observe a higher time overhead
under CkptE and AmnE than under CkptNE and AmnNE, respectively:
CkptE and AmnE both incur the recovery overhead
on top of the checkpointing overhead, as shown in Fig. 3.
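The recovery path described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the AmnesiCHK implementation: the `Checkpoint` class, the `take_checkpoint` and `recover` functions, and the modeling of an RSlice as a Python callable are all hypothetical, and each RSlice is assumed to read only values that were stored explicitly.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict

Memory = Dict[str, int]
# Hypothetical stand-in for an RSlice: a function that regenerates one value
# from values that are available right after the restore step.
RSlice = Callable[[Memory], int]


@dataclass
class Checkpoint:
    stored: Memory = field(default_factory=dict)              # values saved explicitly
    omitted: Dict[str, RSlice] = field(default_factory=dict)  # values left out, with their RSlices


def take_checkpoint(memory: Memory, rslices: Dict[str, RSlice]) -> Checkpoint:
    """Store only the values that have no RSlice; omit the recomputable ones."""
    ckpt = Checkpoint()
    for addr, value in memory.items():
        if addr in rslices:
            ckpt.omitted[addr] = rslices[addr]
        else:
            ckpt.stored[addr] = value
    return ckpt


def recover(ckpt: Checkpoint) -> Memory:
    """Roll back: restore the stored values, then recompute the omitted ones."""
    memory = dict(ckpt.stored)
    for addr, rslice in ckpt.omitted.items():
        memory[addr] = rslice(memory)  # extra recovery work, absent in the baseline
    return memory


# Usage: c = a + b is recomputable, so only a and b are checkpointed.
state = {"a": 3, "b": 4, "c": 7}
ckpt = take_checkpoint(state, {"c": lambda m: m["a"] + m["b"]})
restored = recover(ckpt)
```

The sketch makes the trade-off visible: the checkpoint holds two values instead of three, while recovery pays for one additional recomputation.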
Still, AmnE is very effective in reducing the time overhead
of CkptE: although AmnE needs to recompute the values omitted
from checkpointing, and thus incurs additional recovery
overhead, the reduction in checkpointing overhead and the
reduction in restore overhead (both due to the reduced
checkpoint size) outweigh the overhead of recomputation.
As a result,
AmnE reduces the time overhead of CkptE by up to 26.68%
(for is), and 12.39%, on average. The smallest reduction is
1.9% for cg, in line with our previous observations.
The second and fourth columns of each group in Fig. 4
show the percentage of the energy overhead of CkptE and
AmnE (w.r.t NoCkpt). The energy overhead follows the very
same trend as the time overhead. AmnE reduces the energy
overhead of CkptE by up to 30% (for dc), and 13.47%, on
average. The smallest energy reduction is 1.86% (for cg).
Putting it all together, Fig. 5 shows the percentage reduc-
tion of energy-delay product (EDP) of AmnNE and AmnE w.r.t.
CkptNE and CkptE respectively, as a proxy for energy effi-
ciency. EDP provides a notion of balance between the time
overhead and energy consumption. We observe that AmnNE
reduces EDP by up to 47.98% (for is), and 22.47%, on av-
erage, when compared to CkptNE . Similarly, AmnE reduces
EDP by up to 48.07% (for dc), and 23.41%, on average, when
compared to CkptE . Although is benefits more from AmnE in
terms of performance, dc has a higher energy reduction due
to AmnE , which in turn leads to a higher EDP reduction.
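As a worked check of how the time and energy reductions combine, the sketch below assumes that the EDP of the overhead is simply the product of the time overhead and the energy overhead, so the remaining fractions multiply. The paper does not spell out this formula here; the function name `edp_reduction` is hypothetical. Under that assumption, the reductions reported for is in Section 5.1 reproduce the reported EDP reduction.

```python
def edp_reduction(time_reduction: float, energy_reduction: float) -> float:
    """Fractional EDP reduction from fractional time and energy reductions.

    Assumes EDP = energy * delay, so the remaining fractions multiply:
    remaining EDP = (1 - time_reduction) * (1 - energy_reduction).
    """
    return 1.0 - (1.0 - time_reduction) * (1.0 - energy_reduction)


# Reductions reported for `is` under AmnNE w.r.t. CkptNE: 28.81% time, 26.93% energy.
is_edp = edp_reduction(0.2881, 0.2693)
print(f"{100 * is_edp:.2f}%")  # prints 47.98%
```

Note that the average EDP reduction cannot be obtained the same way from the average time and energy reductions, since the average of per-benchmark products generally differs from the product of averages.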
Overall, we observe that AmnesiCHK can effectively reduce
the overhead of both checkpointing and recovery.
The effectiveness highly depends on the overhead of recompu-
tation along RSlices and on how many values can be omitted
from checkpointing. We will revisit the impact of RSlice
length on checkpoint size reduction in Section 5.5.1.
5.3 Storage Complexity
The main benefit of AmnesiCHK stems from the reduction of
checkpoint size, which has two critical implications: reducing
the data size to be (i) moved to (and retrieved from); (ii) stored
in the designated memory area for checkpointing. In addition
to (i), (ii) can also reduce the energy consumption, e.g., due
to less leakage or refresh in case of DRAM. At the same
time, a reduction in checkpoint sizes can lead to a reduction
in the memory footprint of checkpointing, reducing storage
complexity.
The Overall columns in Fig. 6 show the % reduction in the
overall checkpoint size (i.e., the total amount of data to be
checkpointed) under AmnNE w.r.t. CkptNE. Among all benchmarks,
is benefits the most from recomputation, where the
overall checkpoint size reduces by 75.74% under AmnNE .
On the other hand, cg is less responsive, and the checkpoint
size reduces by only 6.99%. The average checkpoint size
reduction over all benchmarks is 38.31%.
Figure 3: Time overhead of checkpointing and recovery.
[Bar chart. x-axis: benchmarks bt, cg, dc, ft, is, lu, mg, sp, each with columns CkptNE, CkptE, AmnNE, AmnE; y-axis: Execution Time Overhead (%); legend: Recovery, Checkpoint.]
Figure 4: Energy overhead of checkpointing and recovery.
[Bar chart. x-axis: benchmarks bt, cg, dc, ft, is, lu, mg, sp, each with columns CkptNE, CkptE, AmnNE, AmnE; y-axis: Energy Overhead (%); legend: Recovery, Checkpoint.]
Figure 5: EDP reduction under AmnNE and AmnE w.r.t.
CkptNE and CkptE, respectively.
[Bar chart. x-axis: benchmarks bt, cg, dc, ft, is, lu, mg, sp; y-axis: EDP Reduction (%); legend: AmnNE, AmnE.]
Figure 6: % checkpoint size reduction under AmnNE.
[Bar chart. x-axis: benchmarks bt, cg, dc, ft, is, lu, mg, sp; y-axis: Size Reduction (%); legend: Overall, Max.]
Recall that, per Section 2.1, if the error detection latency is
no longer than the checkpoint period, which applies throughout
this study, keeping the two most recent checkpoints suffices
to recover the global state in case of an error.
Therefore, the size of the largest checkpoint
under AmnesiCHK represents a more accurate proxy for the
anticipated memory footprint reduction than the total size of
all checkpoints (as the Overall columns in Fig. 6 capture). The
Max columns in Fig. 6 hence show the % reduction in the size
of the largest checkpoint under AmnNE w.r.t. CkptNE. If
there is no value that can be recomputed within the largest
checkpoint, AmnesiCHK cannot reduce the footprint size
(although it may still reduce the total size of all checkpoints
in an application). Fig. 6 reveals such a case: is has a very
limited Max reduction (2.04%) under AmnNE, but the highest
Overall reduction. For the rest of the benchmarks, dc shows
the largest reduction in Max, of 58.3%; and ft, the smallest,
of 0.05%. For ft, AmnesiCHK practically cannot reduce the
size of the largest checkpoint (as the Max column reveals), but
the total checkpoint size can still reduce by 23.27% (as the
Overall column reveals).
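The difference between the Overall and Max metrics can be made concrete with a small sketch. The function `size_reductions` and the per-interval sizes below are hypothetical, chosen only to mimic the situation where the total shrinks considerably while the largest checkpoint barely changes.

```python
from typing import List, Tuple


def size_reductions(intervals: List[Tuple[int, int]]) -> Tuple[float, float]:
    """Return (overall, max) % reduction in checkpoint size.

    Each tuple is (baseline_size, amnesic_size) for one checkpoint interval,
    in arbitrary but consistent units.
    """
    base_total = sum(b for b, _ in intervals)
    amn_total = sum(a for _, a in intervals)
    base_max = max(b for b, _ in intervals)
    amn_max = max(a for _, a in intervals)
    overall = 100.0 * (base_total - amn_total) / base_total
    largest = 100.0 * (base_max - amn_max) / base_max
    return overall, largest


# Hypothetical sizes: small checkpoints shrink a lot, the largest barely does.
sizes = [(100, 20), (100, 25), (400, 392), (100, 15)]
overall, largest = size_reductions(sizes)
print(f"Overall: {overall:.2f}%  Max: {largest:.2f}%")
```

With these numbers the total amount of checkpointed data drops by roughly a third, yet the memory footprint, which is governed by the largest checkpoint, drops by only 2%.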
As explained in Section 4, CkptNE and AmnNE exclude re-
covery due to error-free execution, hence cleanly capture the
overhead, and particularly size implications of checkpointing.
That said, the corresponding reductions under AmnE would
be exactly the same as under AmnNE , since the presence of
errors does not change the set of values that can be omitted
from checkpointing.
5.4 Coordinated Local Checkpointing
In our discussion so far we covered coordinated global check-
pointing. As explained in Section 2.1, a viable alternative is
coordinated local checkpointing [20, 9], which does not force
all cores to participate in checkpointing: only cores that have
been communicating in a given checkpoint interval checkpoint
and roll back (in case of an error) together. Coordinated
local checkpointing is generally more scalable, as the overhead
of checkpointing and recovery grows with the number of
communicating cores (as opposed to all cores under
coordinated global checkpointing). Identifying communicating
cores in a checkpointing interval, however, necessitates
a mechanism to track inter-core data dependencies, which
usually translates into continuous and dynamic monitoring
and recording of inter-core interactions that may challenge
scalability. We next investigate recomputation-enabled
coordinated local checkpointing. In the following, we use the
global coordinated checkpointing counterpart of each
configuration as a baseline for normalization.
Figure 7: Normalized execution time of CkptNE,Loc, CkptE,Loc, AmnNE,Loc and AmnE,Loc.
[Bar chart. x-axis: benchmarks bt, cg, dc, ft, is, lu, mg, sp, each with columns CkptNE,Loc, CkptE,Loc, AmnNE,Loc, AmnE,Loc; y-axis: Normalized Execution Time.]
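One way to picture the identification of communicating cores is as a grouping problem over the inter-core interactions recorded in a checkpoint interval. The sketch below uses a union-find structure; it assumes that communication is treated transitively (if core 0 talks to core 1 and core 1 to core 2, all three checkpoint together), which is an assumption of this example rather than a statement about the hardware mechanism. The function name `checkpoint_groups` is hypothetical.

```python
from typing import Dict, Iterable, List, Set, Tuple


def checkpoint_groups(num_cores: int,
                      interactions: Iterable[Tuple[int, int]]) -> List[Set[int]]:
    """Group cores that must checkpoint (and roll back) together."""
    parent = list(range(num_cores))

    def find(x: int) -> int:
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps trees shallow
            x = parent[x]
        return x

    # Merge the groups of every pair of cores that interacted.
    for producer, consumer in interactions:
        root_p, root_c = find(producer), find(consumer)
        if root_p != root_c:
            parent[root_c] = root_p

    groups: Dict[int, Set[int]] = {}
    for core in range(num_cores):
        groups.setdefault(find(core), set()).add(core)
    return sorted(groups.values(), key=min)


# Hypothetical interval on 8 cores: cores 0, 1, 2 communicate; so do 4 and 5.
groups = checkpoint_groups(8, [(0, 1), (1, 2), (4, 5)])
```

When practically all cores communicate in every interval, the result collapses to a single group, and local checkpointing offers no advantage over global checkpointing.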
Fig. 7 shows the normalized execution time under coordi-
nated local checkpointing, specifically, CkptNE,Loc, CkptE,Loc,
AmnNE,Loc and AmnE,Loc w.r.t. their global checkpointing
counterparts (i.e., CkptNE, CkptE, AmnNE and AmnE, respectively).
We observe that coordinated local checkpointing
results in a lower time overhead for CkptNE,Loc, as indicated
by a normalized execution time below 1 for the majority of the
benchmarks. The lower overhead is due to the lower number of
cores checkpointing together. However, this is not the case
for bt, cg and sp, where practically all cores communicate
with one another in each checkpointing interval. For the rest
of the benchmarks,
the time overhead of CkptNE,Loc reduces by up to ≈42% for
ft, 17% for dc, 36% for is, 32% for mg, and 10% for lu w.r.t.
CkptNE .
AmnesiCHK incorporated into coordinated local check-
pointing remains as effective as in global checkpointing. For
all the benchmarks, the checkpointing (time) overhead under
AmnNE,Loc remains below (or at most the same as) the over-
head under the global checkpointing correspondent AmnNE .
The reductions under AmnNE,Loc are not as pronounced as
under CkptNE,Loc, mainly because the potential for recompu-
tation does not change considerably under local schemes w.r.t
global.
Specifically, bt, cg, lu, and sp do not observe any sizable
reduction (at most ≈1%) of the time overhead under AmnNE,Loc
w.r.t. the global checkpointing counterpart AmnNE . For the
rest of the benchmarks, the time overhead of AmnNE,Loc re-
duces by up to ≈8% for dc, 33% for ft, 15% for is, and 26%
for mg w.r.t. the global checkpointing counterpart AmnNE .
We observe similar trends for CkptE,Loc and AmnE,Loc. One
difference is that the gap in the time overhead w.r.t. the
global checkpointing counterparts shrinks. We do not observe
any sizable reduction in the time overhead of bt, cg, lu
and sp under CkptE,Loc. For the rest of the benchmarks, the
time overhead of CkptE,Loc reduces by up to ≈14%
for ft, 6% for dc, 31% for is, and 2% for mg w.r.t. the global
checkpointing counterpart CkptE. On the other hand, the time
overhead of AmnE,Loc reduces by up to ≈8% for dc, 10% for ft,
9% for is, and 26% for mg w.r.t. the global checkpointing
counterpart AmnE.
The reduction of execution time overhead under coordinated
local checkpointing is accompanied by an EDP reduction.
EDP reduces under CkptNE,Loc by up to 35.68% for
dc, 67.15% for ft, 58.26% for is, 19.99% for lu, and 57.92%
for mg w.r.t. the global checkpointing counterpart CkptNE .
On the other hand, EDP reduces under AmnNE,Loc by up to
15.85% for dc, 55.68% for ft, 26.24% for is, and 49.75% for
mg w.r.t. AmnNE . Similarly, EDP reduces under CkptE,Loc by
up to 18.33% for dc, 33.24% for ft, 51.46% for is, and 11.29%
for mg w.r.t. the global checkpointing counterpart CkptE . On
the other hand, EDP reduces under AmnE,Loc by up to 15.80%
for dc, 23.81% for ft, 17.99% for is, and 47.32% for mg w.r.t.
AmnE .
Based on this outcome, we can conclude that recomputation-
enabled checkpointing and recovery incorporated into coordi-
nated local checkpointing is at least as effective as its global
checkpointing counterpart.
5.5 Sensitivity Analysis
5.5.1 Impact of RSlice Length on Checkpoint Size
RSlice length (in terms of instructions) dictates the overhead
of recomputation. Longer RSlices incur a higher recomputa-
tion overhead. The overhead of recomputation is invisible un-
der error-free execution, as recomputation may be necessary
only during recovery upon detection of an error. Throughout
the evaluation, we used a threshold of 10 instructions (except
for is, where the threshold is 5) to identify the RSlices to be
embedded into the binary.
Figure 8: Impact of RSlice length on checkpoint size over time for bt.
[Plot. x-axis: execution time; y-axis: Checkpoint Size Reduction (%); one series per threshold: 10, 20, 30, 40, 50.]
Table 2: Total checkpoint size reduction as a function of
RSlice length.
Benchmark   Checkpoint Size Reduction (%), by threshold
            10      20      30      40      50
bt          36.54   45.14   85.36   88.36   89.91
cg           6.99   67.06   89.71   89.82   89.82
ft          23.27   70.65   88.45   99.53   99.70
is²         97.39   97.42   99.54   99.54   99.54
lu          42.69   46.65   64.43   74.69   81.11
mg          11.58   19.65   87.96   90.34   90.22
sp          37.43   47.93   71.83   93.83   96.08
A higher threshold usually translates into being able to
include more RSlices into the binary, and therefore, a higher
likelihood for any value to find a corresponding RSlice in the
binary (and thereby to get omitted from checkpointing). As a
result, the checkpoint sizes tend to reduce.
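The effect of the threshold can be illustrated with a small sketch. It assumes each updated value is described by its size and by the length of its RSlice (or `None` when no RSlice can be formed); a value is omitted from the checkpoint when its RSlice length does not exceed the threshold. The function `checkpoint_size_reduction` and all numbers are hypothetical and are not taken from the evaluated benchmarks.

```python
from typing import List, Optional, Tuple

# (size_in_bytes, rslice_length) per updated value; rslice_length is None when
# no RSlice could be formed for the value. All numbers are hypothetical.
Value = Tuple[int, Optional[int]]


def checkpoint_size_reduction(values: List[Value], threshold: int) -> float:
    """% of checkpoint bytes omitted when RSlices up to `threshold` are embedded."""
    total = sum(size for size, _ in values)
    omitted = sum(size for size, length in values
                  if length is not None and length <= threshold)
    return 100.0 * omitted / total


values = [(8, 4), (8, 12), (8, 35), (8, None), (8, 9), (8, 48), (8, 22), (8, 60)]
reductions = {t: checkpoint_size_reduction(values, t) for t in (10, 20, 30, 40, 50)}
for threshold, reduction in reductions.items():
    print(threshold, reduction)
```

In this model the reduction can never decrease as the threshold grows, since every value covered at a lower threshold remains covered at a higher one.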
Table 2 shows the impact of RSlice length on the overall
checkpoint size under AmnNE. As an example, for bt, we observe
that the total checkpoint size reduces by up to 89.91%
when the threshold for RSlice length is allowed to grow up
to 50 instructions, and by 36.54% when the threshold for RSlice
length remains less than or equal to 10. The threshold is a critical
design parameter which dictates the overhead of recomputa-
tion (during recovery in case of an error), and the storage com-
plexity of the microarchitectural support for AmnesiCHK (as
larger buffers are necessary to keep track of larger RSlices).
At the same time, data values that have the corresponding
RSlices baked into the binary (and hence are recomputable)
are not necessarily uniformly distributed over the checkpoint
intervals. Therefore, for each checkpoint interval, the im-
pact of recomputation may vary (if recomputation is possible
at all). Fig. 8 shows this effect for bt, by capturing how
% reduction in checkpoint size changes over the execution
time, considering different threshold values. We observe that
AmnNE reduces checkpoint size more in certain checkpoint
intervals when compared to others. Such temporal variation
points to more optimization opportunities for AmnesiCHK:
for example, instead of checkpointing periodically, adjusting
the time to checkpoint to exploit more recomputation oppor-
tunities. We leave the exploration of this to future work.
² For is, the reduction is 75.74% for a threshold of 5. This value is not shown in Table 2, to keep the table simple.
5.5.2 Impact of Error Rate
The expected (system-wide) error rate (perr) dictates the roll-
back and recovery overhead, as captured by Equations 2 and 3.
Our discussion so far characterized the recovery overhead
under CkptE and AmnE assuming a single error within the
course of execution. In this section we expand this analysis
to execution under more frequent onset of errors.
With increasing error rates, the expected number of errors
within the course of execution increases, which in turn in-
creases the recovery overhead due to more frequent recoveries
within the course of execution. Fig. 9 shows the % execution
time overhead of CkptE and AmnE w.r.t. NoCkpt , considering
different numbers of (up to 5) errors within the course of exe-
cution. We assume that the errors in each case are uniformly
distributed over the execution time. Not surprisingly, the
execution time overhead increases with increasing number
of errors. Some benchmarks experience very high time over-
head as the error rate increases. This is mainly because the
execution time under NoCkpt is relatively small such that the
overhead of rollback and recovery becomes proportionally
higher. Among the benchmarks, ft suffers the most as its per
recovery overhead is relatively high.
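The observation that a short NoCkpt execution time inflates the relative overhead can be illustrated with a deliberately simple first-order model. This sketch is not a restatement of the equations referenced above; the function `time_overhead_pct` and its parameters are hypothetical, and the cost per checkpoint and per recovery are assumed to be constant.

```python
def time_overhead_pct(t_base: float, n_ckpt: int, t_ckpt: float,
                      n_err: int, t_recover: float) -> float:
    """% execution time overhead w.r.t. a run without checkpointing.

    t_base    : execution time without checkpointing (NoCkpt)
    t_ckpt    : time to establish one checkpoint
    t_recover : time per recovery (restore, any recomputation, and redone work)
    """
    extra = n_ckpt * t_ckpt + n_err * t_recover
    return 100.0 * extra / t_base


# Same absolute checkpoint and recovery costs, different baseline run times.
short_run = [time_overhead_pct(100.0, 25, 1.0, n, 5.0) for n in range(1, 6)]
long_run = [time_overhead_pct(1000.0, 25, 1.0, n, 5.0) for n in range(1, 6)]
print(short_run)
print(long_run)
```

Both runs pay the same absolute cost per error, but the shorter run sees a ten times larger percentage overhead, and the overhead grows linearly with the number of errors in this model.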
While the execution time overhead patterns are very similar
for CkptE and AmnE, the overheads are lower under AmnE, since
the overall recovery overhead (including restoring the
checkpointed values and recomputing missing values on top) is
considerably lower under AmnE. Specifically, the time overhead
reduces by up to 26.68% (for is) for a single error, 25.35% (for
dc) for two errors, 26.87% (for dc) for three errors, 21.58%
(for dc) for four errors, and 19.92% (for is) for five errors,
respectively, in AmnE w.r.t. CkptE . On average, execution
time overhead reduction ranges from ≈9% up to 12% for
different error rates under AmnE .
EDP also increases with increasing error rates. The general
trend is similar to the time overhead, but more pronounced.
Under AmnE, EDP reduces by up to 48.07% (for dc) for a
single error, 47.77% (for dc) for two errors, 50.04% (for dc)
for three errors, 42.99% (for dc) for four errors, and 34.99% (for
is) for five errors. On average, EDP reduction ranges from
≈18% up to 24% for different error rates under AmnE .
5.5.3 Impact of Checkpointing Frequency
As captured by Equation 1, the time or energy overhead of
checkpointing is a function of the frequency of checkpointing,
as well as the amount of machine state being updated during
each checkpointing interval. In Section 5.5.2, we evaluated
the impact of the error rate on recovery overhead under a
fixed checkpointing frequency. In this section, we evaluate
the impact of the checkpointing frequency on checkpointing
overhead under a fixed error rate. To do so, we vary the
checkpointing frequency for each benchmark to yield 25, 50,
75 and 100 checkpoints within the course of execution. These
checkpoints are uniformly distributed over the execution time.
Figure 9: Time overhead of CkptE and AmnE considering different error rates.
[Bar chart. x-axis: benchmarks bt, cg, dc, ft, is, lu, mg, sp, each with 1 to 5 errors; y-axis: Execution Time Overhead (%); legend: CkptE, AmnE.]
Figure 10: Time overhead of CkptNE and AmnNE considering different numbers of checkpoints.
[Bar chart. x-axis: benchmarks bt, cg, dc, ft, is, lu, mg, sp, each with 25, 50, 75, and 100 checkpoints; y-axis: Execution Time Overhead (%); legend: CkptNE, AmnNE.]
Fig. 10 shows the execution time overhead of CkptNE
and AmnNE (w.r.t. NoCkpt), considering different numbers
of checkpoints. Naturally, the time overhead of checkpointing
increases with the number of checkpoints. Among all the
benchmarks, ft experiences the largest time overhead.
The general trend for AmnNE is very similar to CkptNE;
however, AmnNE considerably reduces the time overhead of
checkpointing. An interesting point in Fig. 10 is the lower
overhead of 75-checkpointed runs when compared to
50-checkpointed runs. Although this seems unintuitive at first,
there is a catch: when we change the checkpointing frequency, the start
time of each checkpoint interval becomes different (since we
uniformly distribute the checkpoints over the execution time).
The ability of recomputation to reduce the checkpoint size
(and thereby the checkpoint overhead) depends on whether
the corresponding RSlices in a given checkpoint interval ex-
ist (i.e., were baked into the binary). If the checkpoints fall
into the intervals of execution with a small number of recom-
putable values, AmnesiCHK cannot reduce the checkpointing
overhead significantly. Such a corner case is is, where the
50-checkpointed run has very limited RSlice coverage w.r.t.
the 75-checkpointed. As the data size that can be recom-
puted (i.e., excluded from checkpointing) is smaller, the time
overhead is higher for the 50-checkpointed run. The time
overhead reduces by up to 28.81% (for is) for 25; 25.3% (for
dc) for 50; 50.86% (for is) for 75; and 43.52% (for is) for
100 checkpoints in AmnNE w.r.t. CkptNE . On average, the
time overhead reduction ranges from ≈10% up to 14% for
different checkpoint counts in AmnNE .
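Why the checkpoint count changes RSlice coverage can be seen in a toy model. The sketch assumes a time-ordered trace of memory updates, each flagged as recomputable or not, and assumes that at a checkpoint only the last update to each address within the interval matters: if that last value has an RSlice, it is omitted. The trace format and the function `checkpointed_values` are hypothetical simplifications.

```python
from typing import Dict, List, Tuple

# (time, address, recomputable): one memory update; `recomputable` says whether
# the written value has an RSlice. The trace must be sorted by time.
Update = Tuple[float, int, bool]


def checkpointed_values(trace: List[Update], end_time: float,
                        n_ckpt: int) -> Tuple[int, int]:
    """Return (baseline, amnesic) counts of values stored over n_ckpt uniform intervals."""
    period = end_time / n_ckpt
    baseline = amnesic = 0
    for k in range(n_ckpt):
        lo, hi = k * period, (k + 1) * period
        last: Dict[int, bool] = {}  # address -> flag of its last update in this interval
        for t, addr, recomputable in trace:
            if lo <= t < hi:
                last[addr] = recomputable
        baseline += len(last)                                    # incremental baseline
        amnesic += sum(1 for flag in last.values() if not flag)  # omit recomputable values
    return baseline, amnesic


# One address, updated four times; only the update at t = 3 lacks an RSlice.
trace = [(1.0, 0, True), (3.0, 0, False), (5.0, 0, True), (7.0, 0, True)]
two = checkpointed_values(trace, 8.0, 2)   # intervals [0, 4) and [4, 8)
four = checkpointed_values(trace, 8.0, 4)  # intervals of length 2
```

With two checkpoints, one of two values is omitted (50%); with four checkpoints, three of four are omitted (75%). Moving the interval boundaries changes which update is the last one in each interval, and with it the fraction of recomputable values.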
A similar trend holds for EDP. AmnNE reduces the EDP
(w.r.t. CkptNE ) by up to 47.98% (for is) for 25; 47.74% (for
dc) for 50; 74.19% (for is) for 75; and 63.45% (for is) for 100
checkpoints, respectively. On average, EDP reduction ranges
from ≈20% up to 26% for different checkpoint counts under
AmnNE .
5.5.4 Scalability
The number of threads involved in execution affects the overhead
of checkpointing, due to both an increase in the cost
of coordination (among threads) and a potential increase in
the machine state to be checkpointed. As a consequence,
the memory bandwidth requirement tends to increase, as
well. We next look into the scalability of AmnesiCHK with
increasing thread count. We experiment with 8-, 16-, and 32-
threaded executions where each thread is pinned to a separate
core.
We observe that the checkpointing overhead always exceeds
9% for any thread count. On average, the checkpointing
overhead is ≈45%, 55%, and 60% for 8-, 16-, and 32-threaded
executions, respectively, under CkptNE. We also
observe that AmnNE can reduce the checkpointing overhead
by up to 28.81% (for is), 17.78% (for is), and 19.12% (for
mg) when running with 8, 16, and 32 threads, respectively.
Average reduction is ≈12% for 8-threaded, and ≈11% for
16- and 32-threaded executions.
The corresponding EDP reduction under AmnNE reaches
up to 47.98% (for is), 31.81% (for dc), and 33.8% (for mg)
when running with 8, 16, and 32 threads, respectively. The average
EDP reduction under AmnNE is ≈22%, 21% and
20% for 8-, 16-, and 32-threaded executions, respectively. The corresponding
reductions under AmnE closely follow the trends under AmnNE.
6. RELATED WORK
Checkpointing and recovery solutions have been extensively
studied over the decades. The proposed solutions can be categorized
into software-based or hardware-based checkpointing; and
application or system level checkpointing. Software-based
proposals use periodic barriers to perform system-level [21],
application-level [22], or hybrid checkpoints [23].
Hardware proposals [12, 8, 9] reduce the checkpoint and
restart penalties, but can increase hardware complexity. For
example, in Rebound [12] when a core is checkpointing,
the L2 controller writes dirty lines back to main memory
while keeping clean copies in L2, and the memory controller
logs the old values of the updated memory addresses. In
addition, between checkpoint times, when a dirty cache line
is written back to memory, the memory controller has to log
the old value, as well. This is done only for the first write-back;
subsequent write-backs to the same memory address can be
excluded from being logged. SafetyNet [9], on the other hand,
explicitly checkpoints the register file, and incrementally
checkpoints the memory state by logging the old values.
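The idea of logging old values only on the first write-back in an interval can be illustrated in software. The sketch below is an analogy only: it does not model the cache hierarchy or the memory controller of Rebound or SafetyNet, and the class `LoggedMemory` and its methods are hypothetical.

```python
from typing import Dict


class LoggedMemory:
    """Memory with an undo log that is filled on the first write-back per address."""

    def __init__(self, initial: Dict[int, int]) -> None:
        self.data = dict(initial)
        self.undo_log: Dict[int, int] = {}  # addr -> value at the last checkpoint

    def write_back(self, addr: int, value: int) -> None:
        # Log the old value only on the first write-back in this interval;
        # later write-backs to the same address need no further logging.
        if addr not in self.undo_log:
            self.undo_log[addr] = self.data[addr]
        self.data[addr] = value

    def checkpoint(self) -> None:
        self.undo_log.clear()  # the current state becomes the new safe state

    def rollback(self) -> None:
        self.data.update(self.undo_log)  # restore the values logged since the checkpoint
        self.undo_log.clear()


mem = LoggedMemory({0: 10, 1: 20})
mem.write_back(0, 11)
mem.write_back(0, 12)        # second write-back to address 0: not logged again
log_entries = len(mem.undo_log)
mem.rollback()
```

After the rollback, address 0 holds its value from the last checkpoint again, although only a single log entry was ever recorded for it.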
Compiler-assisted checkpointing [24] improves the per-
formance of automated checkpointing by presenting a com-
piler analysis for incremental checkpointing, aiming to re-
duce checkpoint size. In incremental checkpointing, memory
updates are monitored and are omitted from checkpointing
if a particular memory location has not been modified be-
tween two adjacent checkpoints. This mechanism reduces
the amount of data to be checkpointed, and is widely used
in many checkpointing schemes. We also employ incremental
checkpointing in our analysis. Instead of using
runtime mechanisms (such as exploiting the cache coherence
protocol to identify updated memory locations), the authors of
[24] rely on compiler analysis to track the memory updates that
can be excluded from checkpoints. To facilitate the compiler
analysis, the source code has to be manually annotated to indicate
the starting point of each checkpoint. This limits the
applicability of the approach in practice, since it may not
always be feasible to obtain and/or annotate the source code.
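The incremental checkpointing mechanism mentioned above can be sketched as simple dirty tracking: only locations written since the previous checkpoint are saved. This is a generic software illustration, not the compiler analysis of [24] nor the mechanism used in our evaluation; the class `IncrementalCheckpointer` is hypothetical.

```python
from typing import Dict, Set


class IncrementalCheckpointer:
    """Checkpoint only the locations written since the previous checkpoint."""

    def __init__(self) -> None:
        self.memory: Dict[int, int] = {}
        self.dirty: Set[int] = set()

    def write(self, addr: int, value: int) -> None:
        self.memory[addr] = value
        self.dirty.add(addr)  # remember that this location changed

    def checkpoint(self) -> Dict[int, int]:
        """Return the incremental checkpoint and start a new interval."""
        delta = {addr: self.memory[addr] for addr in self.dirty}
        self.dirty.clear()
        return delta


ckpt = IncrementalCheckpointer()
for addr in range(3):
    ckpt.write(addr, addr * 10)
first = ckpt.checkpoint()    # all three locations are new
ckpt.write(1, 99)
second = ckpt.checkpoint()   # only the modified location is saved
```

Recomputation is complementary to this mechanism: it further removes from each incremental checkpoint those modified values that can be regenerated along an RSlice.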
A relevant work [25] introduces the notion
of idempotent execution that does not need explicit checkpoints
to recover from errors. Instead, in case of an error,
re-executing the idempotent region suffices for recovery. Such
idempotent regions are constructed by the compiler. As the
name suggests, idempotent regions regenerate the same out-
put regardless of how many times they are executed with the
given program state. In comparison to AmnesiCHK, idempo-
tent execution has limited flexibility. Generally, idempotent
regions are large, and therefore incur high overhead during
recovery, while we employ fine-grained data recomputation
(along a short separate RSlice for each value), and each RSlice
contains only the necessary instructions to generate a single
value. Identifying idempotent regions is also a daunting task,
and it may not be easy to find fine-grained idempotent re-
gions for a large class of applications. RSlices provide more
flexibility on values to be checkpointed and be recomputed
in this regard.
A recent work demonstrates the applicability of recomputation
to loop-based code [26] to reduce the checkpointing
overhead. Similar to our approach, the authors try to reduce the
checkpoint size by logging enough state to enable recomputation
in case of an error in execution. When an error occurs,
they determine which parts of the computation were not
completed, and recompute them by re-executing
the corresponding loop iterations. Although very similar
to our approach in spirit, their approach is restricted to
loop-based code, whereas our approach can target arbitrary
data as long as the corresponding RSlice exists.
Similar to [26], the authors of [27] exploit the regularity
of workloads such as matrix-vector multiplication and
iterative linear solvers to reduce the performance overhead of
checkpointing by relying on partial recomputation. Their
fundamental observation is that even when an error occurs during
computation, most of the results are still correct for these
types of workloads. So, instead of simply rolling back and repeating
the entire segment of computation, they employ algorithmic
error localization and partial recomputation to efficiently
correct the erroneous results.
In [28], the authors explore energy concerns for checkpointing
and evaluate a wide range of checkpointing policies to understand
their respective energy, performance and I/O tradeoffs.
They provide detailed insights into the energy overhead, as
well as the performance impact, associated with different
checkpointing policies.
7. CONCLUSION
In the presence of errors, systematic checkpointing of the ma-
chine state makes recovery of execution from a safe state pos-
sible. The performance and energy overhead, however, can
become overwhelming with increasing frequency of check-
pointing and recovery, as dictated by the growth in the fre-
quency of anticipated errors. In this paper, we discuss how
recomputation of data values which otherwise would be read
from a checkpoint (from main memory or secondary storage)
can help reduce these overheads. We observe that recomputa-
tion can reduce the memory footprint by up to 23.91%, which
is accompanied by a reduction in time, energy and EDP over-
head by up to 11.92%, 12.53%, and 23.41%, respectively,
even considering a relatively small-scale system. We expect
the reduction to become much higher and more visible in
larger scale systems, where checkpointing overhead becomes
more prominent.
8. REFERENCES
[1] B. W. Johnson, ed., Design & Analysis of Fault Tolerant Digital
Systems. Boston, MA, USA: Addison-Wesley Longman Publishing
Co., Inc., 1988.
[2] P. A. Lee and T. Anderson, Fault Tolerance: Principles and Practice.
Secaucus, NJ, USA: Springer-Verlag New York, Inc., 2nd ed., 1990.
[3] K. Bergman, S. Borkar, D. Campbell, W. Carlson, W. Dally,
M. Denneau, P. Franzon, W. Harrod, J. Hiller, and S. Karp, “Exascale
computing study: Technology challenges in achieving exascale
systems,” DARPA Information Processing Techniques Office (IPTO)
sponsored study, 2008.
[4] M. Horowitz, “Computing’s Energy Problem (and what we can do
about it),” Keynote at International Conference on Solid State Circuits,
April 2014.
[5] I. Akturk and U. R. Karpuzcu, “AMNESIAC: Amnesic Automatic
Computer - Trading Computation for Communication for Energy
Efficiency,” 2017.
[6] Y. Tamir and C. H. Sequin, “Error recovery in multicomputers using
global checkpoints,” in Proceedings of the International Conference
on Parallel Processing, pp. 32–41, 1984.
[7] C. Morin, A. Gefflaut, M. Banâtre, and A.-M. Kermarrec, “Coma: An
opportunity for building fault-tolerant scalable shared memory
multiprocessors,” in Proceedings of the 23rd Annual International
Symposium on Computer Architecture, ISCA ’96, 1996.
[8] M. Prvulovic, Z. Zhang, and J. Torrellas, “Revive: Cost-effective
architectural support for rollback recovery in shared-memory
multiprocessors,” in Proceedings of the 29th Annual International
Symposium on Computer Architecture, ISCA ’02, 2002.
[9] D. J. Sorin, M. M. K. Martin, M. D. Hill, and D. A. Wood, “Safetynet:
Improving the availability of shared memory multiprocessors with
global checkpoint/recovery,” in Proceedings of the 29th Annual
International Symposium on Computer Architecture, ISCA ’02, 2002.
[10] R. Koo and S. Toueg, “Checkpointing and rollback-recovery for
distributed systems,” IEEE Trans. Softw. Eng., vol. 13, no. 1,
pp. 23–31, 1987.
[11] P. J. Leu and B. Bhargava, “Concurrent robust checkpointing and
recovery in distributed systems,” in Proceedings of International
Conference on Data Engineering, pp. 154–163, 1988.
[12] R. Agarwal, P. Garg, and J. Torrellas, “Rebound: Scalable
checkpointing for coherent shared memory,” in Proceedings of the
38th Annual International Symposium on Computer Architecture,
ISCA ’11, 2011.
[13] S.-L. Gong, M. Rhu, J. Kim, J. Chung, and M. Erez, “Clean-ecc: High
reliability ecc for adaptive granularity memory system,” in
Proceedings of the 48th International Symposium on
Microarchitecture, MICRO-48, 2015.
[14] T. J. Dell, “A white paper on the benefits of chipkill-correct ecc for pc
server main memory,” 1997.
[15] S. Nomura, M. D. Sinclair, C.-H. Ho, V. Govindaraju, M. de Kruijf,
and K. Sankaralingam, “Sampling + dmr: Practical and low-overhead
permanent fault detection,” in Proceedings of the 38th Annual
International Symposium on Computer Architecture, ISCA ’11, 2011.
[16] D. H. Bailey, E. Barszcz, J. T. Barton, D. S. Browning, R. L. Carter,
L. Dagum, R. A. Fatoohi, P. O. Frederickson, T. A. Lasinski, R. S.
Schreiber, H. D. Simon, V. Venkatakrishnan, and S. K. Weeratunga,
“The NAS Parallel Benchmarks: Summary and Preliminary Results,”
in Conference on Supercomputing (SC), 1991.
[17] T. E. Carlson, W. Heirman, and L. Eeckhout, “Sniper: Exploring the
level of abstraction for scalable and accurate parallel multi-core
simulation,” in International Conference for High Performance
Computing, Networking, Storage and Analysis (SC), November 2011.
[18] S. Li, J. H. Ahn, R. D. Strong, J. B. Brockman, D. M. Tullsen, and
N. P. Jouppi, “McPAT: An Integrated Power, Area, and Timing
Modeling Framework for Multicore and Manycore Architectures,” in
International Symposium on Microarchitecture, December 2009.
[19] C.-K. Luk, R. Cohn, R. Muth, H. Patil, A. Klauser, G. Lowney,
S. Wallace, V. J. Reddi, and K. Hazelwood, “Pin: Building
Customized Program Analysis Tools with Dynamic Instrumentation,”
in Conference on Programming Language Design and Implementation,
2005.
[20] M. Banatre, A. Gefflaut, P. Joubert, C. Morin, and P. A. Lee, “An
architecture for tolerating processor failures in shared-memory
multiprocessors,” Transactions on Computers, vol. 45, pp. 1101–1115,
Oct 1996.
[21] R. Gioiosa, J. C. Sancho, S. Jiang, F. Petrini, and K. Davis,
“Transparent, incremental checkpointing at kernel level: A foundation
for fault tolerance for parallel computers,” in Proceedings of the 2005
ACM/IEEE Conference on Supercomputing, SC ’05, 2005.
[22] G. Bronevetsky, D. Marques, K. Pingali, P. Szwed, and M. Schulz,
“Application-level checkpointing for shared memory programs,” in
Proceedings of the 11th International Conference on Architectural
Support for Programming Languages and Operating Systems,
ASPLOS XI, 2004.
[23] D. Marques, G. Bronevetsky, R. Fernandes, K. Pingali, and P. Stodghill,
“Optimizing checkpoint sizes in the c3 system,” in Proceedings of the
19th IEEE International Parallel and Distributed Processing
Symposium (IPDPS’05) - Workshop 10 - Volume 11, IPDPS ’05, 2005.
[24] G. Bronevetsky, D. J. Marques, K. K. Pingali, R. Rugina, and S. A.
McKee, “Compiler-enhanced incremental checkpointing for openmp
applications,” in Proceedings of the 13th ACM SIGPLAN Symposium
on Principles and Practice of Parallel Programming, PPoPP ’08,
2008.
[25] M. de Kruijf and K. Sankaralingam, “Idempotent Processor
Architecture,” in International Symposium on Microarchitecture
(MICRO), 2011.
[26] H. Elnawawy, M. Alshboul, J. Tuck, and Y. Solihin, “Efficient
checkpointing of loop-based codes for non-volatile main memory,” in
Proceedings of International Conference on Parallel Architectures and
Compilation Techniques (PACT), pp. 318–329, Sept 2017.
[27] J. Sloan, R. Kumar, and G. Bronevetsky, “An algorithmic approach to
error localization and partial recomputation for low-overhead fault
tolerance,” in Proceedings of IEEE/IFIP International Conference on
Dependable Systems and Networks (DSN), pp. 1–12, June 2013.
[28] N. El-Sayed and B. Schroeder, “To checkpoint or not to checkpoint:
Understanding energy-performance-i/o tradeoffs in hpc checkpointing,”
in Proceedings of International Conference on Cluster Computing
(CLUSTER), pp. 93–102, Sept 2014.
