ACEDR: Automatic Compiler Error Detection and Recovery for COTS CPU and Caches by Nezzari, Yasser & Bridges, Christopher
IEEE TRANSACTIONS ON RELIABILITY, VOL. 68, NO. 3, SEPTEMBER 2019 859
Y. Nezzari and C. P. Bridges
Abstract—Recently there has been an increasing demand for
more powerful processors for next-generation space missions, such as communication and Earth observation. The challenge is how to improve the reliability of the processor under “single event effects” in orbit. We have previously proposed a new way of implementing any traditional software error detection and correction technique at the instruction level, capable of covering both the CPU and the caches of “commercial off the shelf” processors. In this paper, a novel way of evaluating the software protection is presented, based on a theoretical model and software injection experiments, to predict the reliability of the whole processing architecture. The fault injection evaluates the ability of the protection code to detect and recover from errors, as well as the accuracy of the reliability models, by comparing the reliability from the theoretical predictions with the reliability from the injection experiments. Automatic compiler error detection and recovery improves the reliability of the system by reducing the error rate of “single event upsets.” In some benchmarks, the error rate was reduced to less than 1%. This research has been tested on two machines: an Intel Core i5-3470 and a Raspberry Pi 3. On the first processor, the overhead was less than 15%, and on the second, the overhead was less than 17%. This research can also be ported to multiple high-level languages, with the ability to cover multiple instructions and datatypes.
Index Terms—Cache, compiler, CPU, error detection and recov-
ery, processing architecture, reliability, single event upsets (SEUs).
I. INTRODUCTION
IN RECENT years, dramatic technology scaling has occurred in integrated circuits (ICs). Scaling brings smaller and faster transistors, enabling higher transistor counts that have led to higher-performing processing architectures with lower power consumption, smaller size, and a fraction of the cost. However, these new technologies are sensitive to noise margins and have low threshold voltages, which give rise to transient effects caused by environmental or external factors, such as background or cosmic radiation. Another source of disturbance is internal intermittent effects specific to the IC, such as when operating conditions change (e.g., temperature, component wear out, component overload) [1]–[4]. These are often soft errors and do not cause permanent damage. As transistors become increasingly prone to soft errors, high reliability should not be exclusive to mission critical processing, but should also be extended to processors used in mainstream computing and embedded architectures.
Manuscript received July 5, 2018; revised December 25, 2018 and April 7, 2019; accepted June 15, 2019. Date of publication July 18, 2019; date of current version August 29, 2019. This work was supported in part by the Algerian Space Agency, in part by the U.K. Space Agency, and in part by the Surrey Space Centre at the University of Surrey. Associate Editor: F. Belli. (Corresponding author: Yasser Nezzari.)
The authors are with the Surrey Space Centre at the University of Surrey, GU27XH Guildford, U.K. (e-mail: y.nezzari@surrey.ac.uk; c.p.bridges@surrey.ac.uk).
Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.
Digital Object Identifier 10.1109/TR.2019.2925086
For mission critical domains such as space, radiation hardened (RH) and RH by design (RHBD) processing architectures are developed for their high reliability in a harsh environment. Although they are effective at mitigating single event effects (SEEs), RH and RHBD processors still lag behind their commercial off the shelf (COTS) relatives in performance. The performance gap is estimated at five to ten years [5]. In addition, RH and RHBD processors are costly and consume more power. These limitations have steered new space developers toward COTS devices.
To mitigate errors, hardware redundancy can detect and recover from SEEs [6], [7]; however, it is only suitable for domains that have no budget restrictions. An example of hard redundancy is the use of hard error correction codes (ECC), which can be costly in terms of both performance and power [8], [9]. Hard redundancy can be impractical in embedded systems due to power constraints. Software error detection and recovery techniques are more appealing for tackling soft errors because of their flexible implementation and low cost. Soft error detection and recovery can be applied on COTS processors, allowing designers to gain an order of magnitude in performance compared to hard redundancy.
Current implementations of soft-protection techniques still lag in terms of performance and in their ability to detect and correct errors. The lag in performance is due to inefficient use of the abundant memory and CPU resources of the processing architecture, which leads to a bottleneck. Most of the techniques applied to detect and recover from errors cover only the CPU registers; most of the literature ignores the protection of the memory system, on the assumption that the memory has hard-ECC protection. In this paper, we demonstrate that Read/Write (R/W) operations from/to the memory are as important as the rest of the instructions in the study of reliability.
In this paper, we propose “automatic compiler error detection and recovery” (ACEDR), an original software error detection and recovery technique that automatically applies protection code in the LLVM compiler framework. The applied code is capable of automatically detecting and recovering from soft errors at runtime. This work is based on two LLVM passes: an analysis
0018-9529 © 2019 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission.
See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
and a transformation. The analysis pass is executed on the intermediate representation (IR) of the code to be protected and provides statistics and information about memory instruction dependencies. The transformation pass adds the protection code by inserting redundant instructions and calling a voter function to detect and recover from errors at runtime. The voter works as in triple modular redundancy (TMR). We will show in this paper the importance of protecting both the memory (R/W) instructions and the CPU instructions (arithmetic and logic operators, and branching) for the outcome of the reliability predictions. We start by studying the error rate reductions achieved by partial protection techniques, first protecting only CPU instructions, then only memory instructions, and then combining both, following on from our previous work [10]. Using instruction level redundancy allows for the mitigation of one or more bit errors and can tolerate single event upsets (SEUs) and multiple bit upsets. Our paper contributes to the state of the art with the following points.
1) Low overhead compared to the state of the art: less than 15% on the Intel Core i5-3470 and less than 17% on the Raspberry Pi 3.
2) A reliability prediction model that predicts the reliabilities of all the processing architecture components and quantifies the reliability added by the software protection code.
3) High error detection and recovery, measured with a new injector, where error rates can be reduced to less than 1% in some benchmarks.
4) Multiple data and instruction types (i32, i32, i1, i8, i8, i64, float and double, float pointers and double pointers) [11] have been protected, after [10] was extended, in addition to both the CPU and the memory R/W instruction types.
5) Comparison of the reliability predictions with the reliability obtained from the injection experiment of the protected code.
The rest of the paper is organized as follows. Section II introduces the solutions that have been applied to overcome the SEEs obstacle, focusing on soft-based protection techniques. Section III derives the reliability prediction equations for the different processing components (CPU, caches, and RAM), in the unprotected and the protected modes. Section IV explains the implementation of ACEDR, including how the compiler’s optimizer has been altered to add different protection techniques. Section V explains the fault injection experiment and shows the error rate reductions from the unprotected to the protected code. Section VI compares the reliability predictions with the reliability obtained from the injection experiment. Section VII concludes this paper.
II. FAULT TOLERANCE
A system is categorized as fault tolerant if it always produces a correct outcome, even if its components are corrupted. To ensure correctness, a fault tolerant system must detect and correct errors before they propagate to the external state. Concurrency can be used for error detection. For recovery, re-execution, rollback recovery, roll forward recovery, and checkpointing [12] can be used. Fault tolerance can be achieved either by masking errors or by detection and recovery [12].
A. Fault Tolerance by Redundancy
Redundancy in software or hardware can be used for fault tolerance. Hardware errors can be detected and recovered by adding extra memory or processing units. In software, programs can be replicated entirely or partially. N-modular redundancy (NMR) is a widely used scheme for error detection and/or recovery, where N is the number of identical cloned processing elements operating on the same data.
Dual modular redundancy (DMR) is an NMR scheme with two identical elements, the original and a replica [13]. DMR cannot recover from errors, but its comparator enables it to detect them. TMR, on the other hand, has three redundant elements, enabling it to detect and recover from errors using majority voting logic. In terms of reliability, TMR rates higher than DMR because its extra element allows it not only to detect errors, but also to recover from them. The downside of TMR compared to DMR is performance: DMR only detects errors, so it takes less time than TMR. In general, there are three classes of redundancy:
1) spatial redundancy, the addition of redundant hardware elements so that the same task runs on several components;
2) temporal redundancy, the implementation of NMR schemes on the same hardware, where N redundant similar operations are executed sequentially;
3) information redundancy, the replication of data and CPU registers by computing extra check bits of the original data, using error detection codes or ECC.
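As an illustration of temporal redundancy, the following minimal sketch runs the same operation N times in sequence and votes on the results. It is not taken from the cited schemes: the names `nmr_execute` and `make_faulty_add`, and the way a fault is simulated, are assumptions made for this example only.

```python
from collections import Counter

def nmr_execute(operation, *args, n=3):
    """Run `operation` n times in sequence and vote on the results.

    Returns (value, status), where status is "ok" (all copies agree),
    "recovered" (a strict majority outvoted a faulty copy) or
    "detected" (copies disagree and no strict majority exists).
    """
    results = [operation(*args) for _ in range(n)]
    value, votes = Counter(results).most_common(1)[0]
    if votes == n:
        return value, "ok"
    if votes > n // 2:
        return value, "recovered"
    return None, "detected"

def make_faulty_add(faulty_run):
    """Hypothetical adder whose `faulty_run`-th call returns a corrupted sum."""
    calls = {"count": 0}

    def add(a, b):
        calls["count"] += 1
        result = a + b
        # Simulate a soft error by flipping bit 2 of one result only.
        return result ^ 0x4 if calls["count"] == faulty_run else result

    return add

print(nmr_execute(make_faulty_add(2), 20, 22, n=3))  # TMR: (42, 'recovered')
print(nmr_execute(make_faulty_add(2), 20, 22, n=2))  # DMR: (None, 'detected')
```

With the same injected fault, the two-copy run can only report a mismatch, while the three-copy run still returns the correct sum, which matches the DMR and TMR behavior described above.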
B. Software-Based Fault Tolerance and
Redundancy Approaches
1) Process-Level Replication (PLR): PLR’s implementation makes changes at kernel level. This technique uses the redundancy of the Maestro processing architecture [14] to execute the same application on three different cores. TMR is used to decide the right outcome of the three redundant cores.
This technique was applied to the compression of an image of size 904K. A total overhead of 2 187 742 009 cycles was recorded, including both the compression and the PLR overhead [15]. This technique can only be applied on processing architectures that have shared memory protected with hard-ECC schemes.
2) Thread-Level Replication (TLR): This technique is based on N-version programming [16]. In this scheme, the user function is replicated on three threads, running on different cores in parallel. TMR is used to decide which core is right, using a majority voter. TLR harnesses the abundance of processing resources in the Maestro architecture [14]. The overhead caused by TLR is application dependent [15]. On an empty function, the total overhead was 12 021 270 cycles. TLR’s shared memory requires extra hardware (hard-ECC).
3) Instructions Level Replication: Decoupled compiler based instruction level fault tolerance (DRIFT) uses the compiler’s functionality for error detection. It adds redundant instructions during compilation, in addition to inserting check instructions. The goal of DRIFT is to improve the system’s performance by reducing the overhead that results from error detection, without affecting the error coverage. The overhead is reduced by decoupling the execution of the original code and its replica, and implementing checks [17]. This research was tested using Mediabench II video [18] and SPEC CPU2000 [19]. The average time overhead of this protection technique was 29%. A key issue is that the memory of the processing architecture is assumed to be protected: because this technique covers only the CPU, it can only be applied to special architectures (memory with hard-ECC).
In the composite datatype protection (CDTP) algorithm, variables are encoded to enable their protection. Memory accessing operations are replaced with new encoded ones. Operations for writing to memory are decoded to check for errors [20]. The cc1 compiler was used to implement the CDTP algorithm automatically as an independent phase of the compilation. The protected code has to be transformed to the GIMPLE internal representation [20]. The protection code is added at the initial optimization stage.
The overhead is application dependent and differs from one benchmark to another. The overhead also depends on the protection algorithm used. The Hamming code produces 86% time overhead, the extended Golay algorithm 146%, the full iterated coding scheme 116%, and the selective iterated coding scheme 117%, all of which are very large compared to DRIFT. Encoding and decoding protection schemes are only suitable for protecting the memory system of the processing architecture; the CPU is still vulnerable to SEEs in this case.
SWIFT-R uses instruction redundancy to detect and correct errors. It adds two replicas of an instruction to the original code, and then calls a TMR voter to decide the correct instruction outcome by majority. This technique does not protect all instructions; in particular, it leaves out the memory instructions (store, load, alloca, and getelementptr) and the branch instructions. Soft errors originating from SEEs are random in nature, and all instruction types are susceptible to them [21]. Protecting all instruction types will improve the reliability of the processing architecture. This technique adds up to 198% to the execution time.
Further techniques include Shoestring [22], fault tolerance software checking [23], error detection by duplicated instructions (EDDI) [24], and software implemented fault tolerance (SWIFT) [25]. These schemes are implemented by modifying compilers: LLVM, GCC, and OpenIMPACT [26], respectively. The modifications include adding redundant instructions and inserting compare instructions if needed. The techniques mentioned in this section protect only the CPU, except for EDDI, which is capable of covering both the CPU and memory against SEEs. However, EDDI is architecture dependent, works exclusively on MIPS architectures, and only detects errors, with no recovery scheme.
TABLE I
NOTATION
DAFT [27] and SRMT [28] use the LLVM and Intel production ICC 9.0 compilers, respectively, to automate the generation of redundant threads for an application, and then insert comparisons of their outcomes to detect errors. DAFT has better performance than SRMT. Both techniques protect only the CPU; the memory system is still vulnerable to SEE-induced soft errors.
CASTED [29] is implemented at the backend stage of the
compilation process; it uses the GCC-4.5.0 [30] compiler frame-
work. This technique includes two passes, one for error detection
and another one for recovery. The cache memory is not protected
since CASTED only targets CPU protection.
C. Summary
The trend in software mitigation of radiation-induced SEUs is to detect errors and disregard error recovery. Another assumption is that the memory system has hard-ECC, which is not affordable in some cases. Software protection techniques always add overhead, which depends on the scheme used and the benchmark. Our solution to the SEU problem is to find an optimal trade-off between reliability and performance, without ignoring the protection of the cache memory system of the processing architecture.
Fig. 1. Markov TMR states.
III. RELIABILITY PREDICTIONS
In this section, a novel model for the reliability of the protected and unprotected processing architecture is developed. We first model the reliability of the basic components (the CPU, the caches, and the RAM) and then, using combinational logic, model the reliability of the whole system. This section prepares for a later comparison, in which the reliability predicted for the protected architecture is compared with the reliability obtained from the fault injection experiment on the protected code.
The predictions for the reliability of the unprotected code include the sensitivity of every instruction type to the faults injected by our software to simulate SEEs. The sensitivity of an instruction type is the number of errors divided by the total number of injections.
Redundancy has been an efficient way of protecting memory systems. A nonredundant memory system fails if a fault occurs in one of its words. Assuming a Poisson process and statistical independence among failures, the reliability of the memory system is the product of the reliabilities of all its N words
$$R(t) = e^{-\lambda W N t}. \tag{1}$$
The failure rate for each word in the TMR system is λW; the Markov chain is given in Fig. 1 [31].
State 0 represents the state where all the TMR bits in the
words are correct. State 1 represents the case where a fault has
occurred in any of the three replicated words. State 2 represents
the state where more than one word has an error.
The following set of differential equations represents the
dynamics of the probability system:
$$\frac{dP_0(t)}{dt} = -3\lambda W P_0(t) \tag{2}$$
$$\frac{dP_1(t)}{dt} = 3\lambda W P_0(t) - 2\lambda W P_1(t) \tag{3}$$
$$\frac{dP_2(t)}{dt} = 2\lambda W P_1(t). \tag{4}$$
Using Laplace transforms, and assuming that the system starts without faults at time t = 0, that is, $P_0(0) = 1$ and $P_1(0) = P_2(0) = 0$, we obtain
$$P_0(t) = e^{-3\lambda W t} \tag{5}$$
$$P_1(t) = 3e^{-2\lambda W t} - 3e^{-3\lambda W t} \tag{6}$$
$$P_2(t) = 1 - 3e^{-2\lambda W t} + 2e^{-3\lambda W t}. \tag{7}$$
The reliability using TMR is the probability of being in State 0 or State 1, $r(t) = P_0(t) + P_1(t)$:
$$r(t) = 3e^{-2\lambda W t} - 2e^{-3\lambda W t}. \tag{8}$$
The final expression in (8) represents the reliability of a single word of W bits when protected with TMR.
Fig. 2. Memory modules connection in software perspective.
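The closed-form solution (5)–(7) can be checked numerically against the differential equations (2)–(4). The sketch below is only a sanity check under assumed values: the function names and the use of forward Euler integration are choices made for this example, and `rate` stands for the product λW.

```python
import math

def tmr_closed_form(rate, t):
    """State probabilities from (5)-(7); `rate` stands for lambda * W."""
    p0 = math.exp(-3 * rate * t)
    p1 = 3 * math.exp(-2 * rate * t) - 3 * math.exp(-3 * rate * t)
    p2 = 1 - 3 * math.exp(-2 * rate * t) + 2 * math.exp(-3 * rate * t)
    return p0, p1, p2

def tmr_euler(rate, t_end, steps=100_000):
    """Integrate (2)-(4) with forward Euler from P0(0)=1, P1(0)=P2(0)=0."""
    p0, p1, p2 = 1.0, 0.0, 0.0
    dt = t_end / steps
    for _ in range(steps):
        d0 = -3 * rate * p0
        d1 = 3 * rate * p0 - 2 * rate * p1
        d2 = 2 * rate * p1
        p0, p1, p2 = p0 + d0 * dt, p1 + d1 * dt, p2 + d2 * dt
    return p0, p1, p2

closed = tmr_closed_form(0.1, 5.0)   # hypothetical rate and time
numeric = tmr_euler(0.1, 5.0)
print(max(abs(a - b) for a, b in zip(closed, numeric)))  # small integration error
print(closed[0] + closed[1])         # r(t) from (8)
```

The three closed-form probabilities sum to one at every t, and the first two add up to the TMR reliability in (8).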
A. Reliability Prediction for N Word Memory With TMR
The software TMR is a serial combination of N TMRs: if two errors occur on the same TMR, causing it to fail, the whole memory system fails. The reliability of TMR without repair is
$$R(t) = \left(3e^{-2\lambda W t} - 2e^{-3\lambda W t}\right)^{N}. \tag{9}$$
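To make the difference between (1) and (9) concrete, the following sketch evaluates both for the same memory. The numeric values of λ, W, N, and t are hypothetical and chosen only for illustration; the function names are likewise assumptions.

```python
import math

def unprotected_memory(lam, width, n_words, t):
    """Reliability of N unprotected words of W bits, equation (1)."""
    return math.exp(-lam * width * n_words * t)

def tmr_memory(lam, width, n_words, t):
    """Reliability of N TMR-protected words without repair, equation (9)."""
    x = lam * width * t
    word = 3 * math.exp(-2 * x) - 2 * math.exp(-3 * x)   # equation (8)
    return word ** n_words

# Hypothetical figures: 1024 words of 32 bits observed for one hour.
lam, width, n_words, t = 1e-9, 32, 1024, 3600.0
print(unprotected_memory(lam, width, n_words, t))
print(tmr_memory(lam, width, n_words, t))
```

For a single word, TMR without repair only helps while the unprotected word reliability $e^{-\lambda W t}$ stays above 0.5, since $3R^2 - 2R^3 > R$ exactly when $R > 0.5$; for very long exposure times the triplicated word becomes less reliable than a single copy.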
B. Reliability Prediction of RAM, Caches, and
CPU Without Protection
In this section, the reliability of the whole processing architecture is derived. The reliabilities of the RAM, caches, and CPU are combined, given that the components are serially connected as shown in Fig. 2, and taking into consideration that the error rate λ changes depending on the component’s cross-section area.
The reliabilities for each component are given by the
following expressions:
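$$R_1(t) = \exp\!\left(-W H_1 \lambda_c t \sum_{i=0}^{l} S_i N_i\right) \tag{10}$$
$$R_2(t) = \exp\!\left(-W X_2 \lambda_c t \sum_{i=0}^{l} S_i N_i\right) \tag{11}$$
$$R_3(t) = \exp\!\left(-W X_3 \lambda_c t \sum_{i=0}^{l} S_i N_i\right) \tag{12}$$
$$R_r(t) = \exp\!\left(-W X_r \lambda_r t \sum_{i=0}^{l} S_i N_i\right) \tag{13}$$
$$R_{\mathrm{CPU}}(t) = \exp\!\left(-W \, \frac{\sum_{i=0}^{l} S_i N_i}{S_{P1} S_{P2}} \, \lambda_p t\right). \tag{14}$$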
In addition to the error rates λ, we introduce a new variable: the instruction sensitivity $S_i$, which changes from one instruction type to another for each benchmark. The sensitivity of an instruction type is the total number of errors caused by injection divided by the total number of injections. For small benchmarks, the number of injections equals the number of instructions of that type; for large benchmarks it may differ, because a sample large enough to reach a specific confidence level is injected instead.
To determine the sensitivity of an instruction type in a given benchmark, all the instructions of this type must be injected (or a significant sample of them, if the benchmark has a large number of instructions).
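The sensitivity definition above amounts to a per-type ratio over an injection log. The sketch below assumes a hypothetical log format, a sequence of (instruction type, error observed) pairs; neither the format nor the function name comes from the injector used in this paper.

```python
from collections import defaultdict

def sensitivities(injection_log):
    """Return {instruction_type: errors / injections}.

    injection_log is an iterable of (instruction_type, caused_error) pairs,
    one entry per injected fault (hypothetical format).
    """
    injected = defaultdict(int)
    errors = defaultdict(int)
    for instruction_type, caused_error in injection_log:
        injected[instruction_type] += 1
        errors[instruction_type] += bool(caused_error)
    return {itype: errors[itype] / injected[itype] for itype in injected}

log = [("load", True), ("load", False), ("add", False), ("add", False)]
print(sensitivities(log))  # {'load': 0.5, 'add': 0.0}
```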
The reliability of the caches and the RAM is affected by the access rates: the first level of cache has the highest rate, and the rate drops with every subsequent cache level, down to the RAM, which has the lowest access rate.
In the CPU, the reliability is affected by two factors: the speedup due to multiple cores, $S_{P1}$, and the speedup due to the number of pipeline stages, $S_{P2}$. We assume that speedup improves reliability, since it shortens the execution time and so reduces the probability of catching an SEE.
The reliability R(t) expression of the whole system is given
by the following equation:
$$R(t) = R_r(t)\, R_3(t)\, R_2(t)\, R_1(t)\, R_{\mathrm{CPU}}(t). \tag{15}$$
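A direct evaluation of (10)–(15) can be sketched as follows. The `Architecture` container, its field names, and every numeric value are assumptions made for this example; the comments on $H_1$, $X_2$, $X_3$, and $X_r$ describe them only as per-level factors, since their exact definitions are given in the notation table.

```python
import math
from dataclasses import dataclass

@dataclass
class Architecture:
    width: int      # W, bits per word
    h1: float       # H1, factor for the first cache level
    x2: float       # X2, factor for the second cache level
    x3: float       # X3, factor for the third cache level
    xr: float       # Xr, factor for the RAM
    lam_c: float    # lambda_c, cache error rate
    lam_r: float    # lambda_r, RAM error rate
    lam_p: float    # lambda_p, CPU error rate
    sp1: float      # S_P1, multicore speedup
    sp2: float      # S_P2, pipeline speedup

def unprotected_reliability(arch, profile, t):
    """Evaluate (10)-(15); profile is a list of (S_i, N_i) pairs."""
    exposure = sum(s * n for s, n in profile)   # sum of S_i * N_i
    w = arch.width
    r1 = math.exp(-w * arch.h1 * arch.lam_c * t * exposure)   # (10)
    r2 = math.exp(-w * arch.x2 * arch.lam_c * t * exposure)   # (11)
    r3 = math.exp(-w * arch.x3 * arch.lam_c * t * exposure)   # (12)
    rr = math.exp(-w * arch.xr * arch.lam_r * t * exposure)   # (13)
    r_cpu = math.exp(-w * exposure / (arch.sp1 * arch.sp2) * arch.lam_p * t)  # (14)
    return rr * r3 * r2 * r1 * r_cpu                          # (15)

# Hypothetical architecture and instruction profile.
arch = Architecture(32, 0.9, 0.08, 0.015, 0.005, 1e-12, 2e-12, 5e-13, 4, 5)
profile = [(0.3, 1000), (0.6, 500)]
print(unprotected_reliability(arch, profile, 3600.0))
```

Because every factor is an exponential, the product in (15) collapses into a single exponent, which is a convenient way to check an implementation.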
C. TMR Protected
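The reliability predictions of the cache levels, RAM, and CPU when the ACEDR protection is applied are given by the following expressions:
$$R_1(t) = \prod_{i=0}^{l} \left(3e^{-2 W S_i \lambda_c t} - 2e^{-3 W S_i \lambda_c t}\right)^{H_1 N_i} \tag{16}$$
$$R_2(t) = \prod_{i=0}^{l} \left(3e^{-2 W S_i \lambda_c t} - 2e^{-3 W S_i \lambda_c t}\right)^{X_2 N_i} \tag{17}$$
$$R_3(t) = \prod_{i=0}^{l} \left(3e^{-2 W S_i \lambda_c t} - 2e^{-3 W S_i \lambda_c t}\right)^{X_3 N_i} \tag{18}$$
$$R_r(t) = \prod_{i=0}^{l} \left(3e^{-2 W S_i \lambda_r t} - 2e^{-3 W S_i \lambda_r t}\right)^{X_r N_i} \tag{19}$$
$$R_{\mathrm{CPU}}(t) = \prod_{i=0}^{l} \left(3e^{-2 W S_i \lambda_p t} - 2e^{-3 W S_i \lambda_p t}\right)^{\frac{N_i}{S_{P1} S_{P2}}}. \tag{20}$$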
The reliability of the whole protected system combined is
$$R(t) = R_r(t)\, R_3(t)\, R_2(t)\, R_1(t)\, R_{\mathrm{CPU}}(t). \tag{21}$$
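The protected model (16)–(21) can be evaluated in the same spirit. This is a sketch under stated assumptions: the parameter names and numeric values are hypothetical, and the CPU exponent is taken as $N_i/(S_{P1}S_{P2})$, which is the form that also appears in (33).

```python
import math

def tmr_term(x):
    """Reliability of one TMR group, 3e^(-2x) - 2e^(-3x), as in (8)."""
    return 3 * math.exp(-2 * x) - 2 * math.exp(-3 * x)

def protected_reliability(profile, t, *, width, h1, x2, x3, xr,
                          lam_c, lam_r, lam_p, sp1, sp2):
    """Evaluate (16)-(21); profile is a list of (S_i, N_i) pairs."""
    total = 1.0
    for s, n in profile:
        cache = tmr_term(width * s * lam_c * t)
        ram = tmr_term(width * s * lam_r * t)
        cpu = tmr_term(width * s * lam_p * t)
        total *= cache ** (h1 * n)           # (16)
        total *= cache ** (x2 * n)           # (17)
        total *= cache ** (x3 * n)           # (18)
        total *= ram ** (xr * n)             # (19)
        total *= cpu ** (n / (sp1 * sp2))    # (20), exponent assumed as in (33)
    return total                             # (21)

# Hypothetical parameters, chosen large enough to make the decay visible.
params = dict(width=32, h1=0.9, x2=0.08, x3=0.015, xr=0.005,
              lam_c=1e-6, lam_r=2e-6, lam_p=5e-7, sp1=4, sp2=5)
profile = [(0.3, 1000), (0.6, 500)]
print(protected_reliability(profile, 10.0, **params))
print(protected_reliability(profile, 1000.0, **params))
```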
D. Reliability Equations Obtained From the
Injection Experiments
To compare against the theoretical predictions of the reliability of the protected code in Section III-C, we modeled a new reliability expression for the whole processing architecture chain after it is protected with ACEDR, based on the fault injection experiment. The difference between the reliability of the protected and the unprotected code is the sensitivity of the instructions to errors, which drops after the protection is added. This leads to the following equations for the reliability of every processing component (the different levels of cache, the RAM, and the CPU):
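$$R_{1\mathrm{inj}}(t) = \exp\!\left(-W H_1 \lambda_c t \sum_{i=0}^{l} S_{ip} N_{ip}\right) \tag{22}$$
$$R_{2\mathrm{inj}}(t) = \exp\!\left(-W X_2 \lambda_c t \sum_{i=0}^{l} S_{ip} N_{ip}\right) \tag{23}$$
$$R_{3\mathrm{inj}}(t) = \exp\!\left(-W X_3 \lambda_c t \sum_{i=0}^{l} S_{ip} N_{ip}\right) \tag{24}$$
$$R_{r\mathrm{inj}}(t) = \exp\!\left(-W X_r \lambda_r t \sum_{i=0}^{l} S_{ip} N_{ip}\right) \tag{25}$$
$$R_{\mathrm{CPUinj}}(t) = \exp\!\left(-W \, \frac{\sum_{i=0}^{l} S_{ip} N_{ip}}{S_{P1} S_{P2}} \, \lambda_p t\right). \tag{26}$$
The reliability of the whole system is given by
$$R_{\mathrm{inj}}(t) = R_{1\mathrm{inj}}(t) \cdot R_{2\mathrm{inj}}(t) \cdot R_{3\mathrm{inj}}(t) \cdot R_{r\mathrm{inj}}(t) \cdot R_{\mathrm{CPUinj}}(t). \tag{27}$$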
E. Application of the Reliability Predictions (Relevance of the
Prediction Model to the Rest of the Paper)
The main objective of this section is to establish the precision of the prediction model by comparing (27) with (21). It combines the results of the theoretical model with the results of the experimental injection in Section V; the injection experiment is necessary to obtain the instruction sensitivities used in the prediction models.
When the injection experiments on all instruction types were performed in our previous work [32], we observed that two generic categories of instruction types cause significant changes in the error rates: the CPU instructions and the memory instructions. In this section, we apply the reliability equations derived above to these two generic categories.
This means that we have two sensitivities: $S_{\mathrm{cpu}}$, the sensitivity of the CPU instructions, and $S_m$, the sensitivity of the memory instructions.
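The reliability prediction with no protection is obtained from (15):
$$R(t) = \exp\!\left[-tW\left(\lambda_r X_r + \frac{\lambda_p}{S_{P1} S_{P2}} + \lambda_c\left(H_1 + X_2 + X_3\right)\right)\left(S_{\mathrm{cpu}} N_{\mathrm{cpu}} + S_m N_m\right)\right]. \tag{28}$$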
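The reliability prediction after adding protection is obtained using (16)–(20):
$$R_1(t) = \left(3e^{-2 W S_{\mathrm{cpu}} \lambda_c t} - 2e^{-3 W S_{\mathrm{cpu}} \lambda_c t}\right)^{H_1 N_{\mathrm{cpu}}} \left(3e^{-2 W S_m \lambda_c t} - 2e^{-3 W S_m \lambda_c t}\right)^{H_1 N_m} \tag{29}$$
$$R_2(t) = \left(3e^{-2 W S_{\mathrm{cpu}} \lambda_c t} - 2e^{-3 W S_{\mathrm{cpu}} \lambda_c t}\right)^{X_2 N_{\mathrm{cpu}}} \left(3e^{-2 W S_m \lambda_c t} - 2e^{-3 W S_m \lambda_c t}\right)^{X_2 N_m} \tag{30}$$
$$R_3(t) = \left(3e^{-2 W S_{\mathrm{cpu}} \lambda_c t} - 2e^{-3 W S_{\mathrm{cpu}} \lambda_c t}\right)^{X_3 N_{\mathrm{cpu}}} \left(3e^{-2 W S_m \lambda_c t} - 2e^{-3 W S_m \lambda_c t}\right)^{X_3 N_m} \tag{31}$$
$$R_r(t) = \left(3e^{-2 W S_{\mathrm{cpu}} \lambda_r t} - 2e^{-3 W S_{\mathrm{cpu}} \lambda_r t}\right)^{X_r N_{\mathrm{cpu}}} \left(3e^{-2 W S_m \lambda_r t} - 2e^{-3 W S_m \lambda_r t}\right)^{X_r N_m} \tag{32}$$
$$R_{\mathrm{CPU}}(t) = \left(3e^{-2 W S_{\mathrm{cpu}} \lambda_p t} - 2e^{-3 W S_{\mathrm{cpu}} \lambda_p t}\right)^{\frac{N_{\mathrm{cpu}}}{S_{P1} S_{P2}}} \left(3e^{-2 W S_m \lambda_p t} - 2e^{-3 W S_m \lambda_p t}\right)^{\frac{N_m}{S_{P1} S_{P2}}}. \tag{33}$$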
The reliability of the whole system is given by (21).
For the CPU and memory instruction categories, the reliability of the protected processor from the injection experiment is obtained using (27). The reliability of the whole architecture is
$$R_{\mathrm{inj}}(t) = \exp\!\left[-tW\left(\lambda_r X_r + \frac{\lambda_p}{S_{P1} S_{P2}} + \lambda_c\left(H_1 + X_2 + X_3\right)\right)\left(S_{\mathrm{cpu}p} N_{\mathrm{cpu}p} + S_{mp} N_{mp}\right)\right]. \tag{34}$$
F. Summary
In this section, we have presented the reliability models of the protected and unprotected code, using the software protection techniques that have been developed. By using combinational logic and including multiple parameters in the model, we aim to provide the first in-depth reliability prediction that is valid for COTS processing architectures.
First, the reliability of the unprotected processor was given in (15), followed by the reliability of the protected system in (21) and, finally, the reliability obtained from the injection experiment on the protected code in (34).
To check the precision of our prediction model, the reliabilities obtained from (21) and (34) will be compared; this shows how close the theoretical and experimental reliabilities are to each other. In this model, we consider (34) as the ground truth. Another comparison, between (15) and (34), will show how much reliability was added by the software protection techniques.
At this point, the prediction models have been obtained. To perform the comparisons, we need the sensitivities of the different instruction types, where sensitivity means the error rate of an instruction type. To obtain these sensitivities, the injection experiments must be performed on both the protected and the unprotected code.
IV. AUTOMATIC COMPILER ERROR DETECTION
AND RECOVERY
Based on our previous work [10], [32], we decided to use TMR for error detection and recovery, because it detects and recovers from errors without adding a large overhead compared to implementations of the Hamming and BCH codes. Nevertheless, our method provides a way to implement any error detection and recovery technique at the instruction level.
The LLVM compiler framework is the backbone of this paper. The implementation is done at the optimizer stage of the compilation process by creating LLVM passes that automatically add protection code to any benchmark that can be compiled with LLVM. The code to be protected must be in its IR form, which can be obtained using the “clang” compiler front end. One major advantage of our software protection technique is that the user does not have to write a single line of protection code: compiling the unprotected code with our passes is enough to protect it.
As discussed, the protection consists of an analysis pass and a transformation pass. The analysis pass provides statistical information about the code; it also goes through all the instructions and determines their types. The transformation pass adds the protection code; it uses the information provided by the analysis to decide which protection to add [10].
To choose the best protection code in terms of performance, we conducted a study of the overhead generated by applying different protection schemes [10]. The Hamming code was implemented first in the study of ACEDR. It is able to detect two errors and correct a single one. The BCH protection scheme has also been studied, with its ability to detect and correct multiple error bits.
Both the Hamming and BCH codes showed a large overhead, so we decided to exclude them from our automatic compiler implementation. Another constraint is that all ECC codes, including Hamming and BCH, are more suitable for protecting memory systems, where writes go through the ECC encoder and reads go through the ECC decoder in order to detect and recover from errors. For CPU instructions, redundancy schemes such as the NMR techniques are the most suitable way to detect and recover from errors.
The TMR implementation using automatic compiler passes showed the best performance, with low overhead [10]. This technique can detect and recover from any single-bit error. TMR can also detect and recover from multiple bit errors, provided that all the errors fall in the same copy of a triplicated word, or that the errors fall in copies belonging to different triplicated words. TMR’s limitation is the case where two or more errors hit two different copies of the same triplicated word. In this case, TMR can still detect that an error has occurred, but it cannot recover, since it does not know which copy is correct.
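The recoverable and unrecoverable cases described above can be reproduced with a small word-level voter. This is an illustrative sketch, not the voter generated by the ACEDR passes: the helper names `flip_bits` and `vote`, and the choice of comparing whole words, are assumptions made for the example.

```python
def flip_bits(word, *bit_positions):
    """Return `word` with the given bit positions inverted (simulated upsets)."""
    for bit in bit_positions:
        word ^= 1 << bit
    return word

def vote(a, b, c):
    """Word-level majority vote over three copies.

    Returns (value, True) when at least two copies agree, and
    (None, False) when all three differ: detected, but not recoverable.
    """
    if a == b or a == c:
        return a, True
    if b == c:
        return b, True
    return None, False

x = 0xDEADBEEF
print(vote(flip_bits(x, 3), x, x))                 # single-bit error in one copy
print(vote(x, flip_bits(x, 0, 7, 31), x))          # several bits, same copy
print(vote(flip_bits(x, 1), flip_bits(x, 5), x))   # two different copies hit
```

The first two calls return the original word with the flag set; the third returns `(None, False)`, because each of the three copies now holds a different value.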
A. ACEDR Principles
In this section, we explain the concept of protection using
LLVM passes that will add protection code. We divide instruc-
tions to two main categories Memory (R/W operations) and
CPU (Arithmetic and logic operations) in order to show that
the protection of both types is important. SEEs occur randomly,
meaning that all instruction types are vulnerable to the radia-
tion effects. In the literature, memory instructions are typically
NEZZARI AND BRIDGES: ACEDR: AUTOMATIC COMPILER ERROR DETECTION AND RECOVERY FOR COTS CPU AND CACHES 865
Fig. 3. Software TMR algorithm.
protected with external hardware, such as hard-ECC circuits;
that is why most software protection techniques do not provide
schemes for cache memory protection. This paper does not make
any presumptions on the chosen architecture’s memory system
and considers it to be unprotected. As our research studies SEEs
on both the cache memory and CPU instructions, both require
injection. Unlike the literature, where only the CPU instruc-
tions have been injected, we inject all instructions to understand
their consequences. Adding the protection at the optimizer stage of
the LLVM compilation process allows this approach to target
multiple processing architectures and multiple high-level
programming languages. After the code is transformed to its
IR, the passes run on it and add the protection code automatically.
New instructions can be added either by copying existing ones
with clone() [33] or by building them with the
IRBuilder [34]. The protection functions
can either be embedded directly inside the pass
or be linked with the code to be protected using the linker.
The second method shortens the development time of the
protection code. The protection
function is a TMR; it works by majority vote, as shown in
Fig. 3.
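As an illustration of the majority vote, the following Python sketch mimics a word-level voter. It is not the voter used by the passes (which operate on LLVM IR); the names vote and UnrecoverableError are hypothetical, and the behavior when all three copies disagree (raising an exception) is an assumption that reflects the detect-but-not-recover case described above.

```python
class UnrecoverableError(Exception):
    """All three copies disagree: the error is detected but cannot be corrected."""


def vote(a, b, c):
    """Return the value held by at least two of the three copies."""
    if a == b or a == c:
        return a
    if b == c:
        return b
    raise UnrecoverableError(f"no majority among {a!r}, {b!r}, {c!r}")


# One corrupted copy is outvoted by the two intact ones.
print(vote(10, 10, 14))  # 10
```

The comparison is done on whole words, so any number of flipped bits inside a single copy is masked, while corruption of two different copies is only detected.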
B. Adding ACEDR Instructions in IR
Our previous work [10] has been extended to include multiple
datatypes (i32, i32, i1, i8, i8, i64, float and double, float and
double pointers). This extension gives our approach
high coverage compared to the state of the art. ACEDR-TMR
adds two redundant instructions to the original one, and then
calls a voter function to decide the correct outcome
among the three instructions, the replicated and the original.
In this paper, we found that memory instructions (alloca, load,
store, and getelementptr) and CPU instructions (arithmetic
and logic operators, etc.) both require protection, because we do
not assume that the memory system is protected with any type
of hard-ECC protection technique, which allows us to extend our
implementation to more processors.
Fig. 4. Protecting the “alloca” memory instruction.
Fig. 5. Protecting the “store” memory instruction.
1) Memory Instructions Protection: Memory instructions
occupy a large portion of the code. Unless there is hardware
dedicated to error detection and recovery, such as
hard-ECC, the instructions responsible for R/W from/to
memory must be protected in order to make the processing
architecture more resilient.
There is a dependency between memory instruction types:
read and write operations use the locations that have
been created with alloca instructions. The same applies to
getelementptr (GEP) instructions, where a newly created GEP instruction
points to a previously created allocation. Our software protection
scheme detects this dependency relation and adds the appropriate
protection code accordingly. First, the newly created
allocations are detected using the analysis pass, which provides
the number and the types of the alloca instructions. After
the alloca type is recognized, the transformation pass adds
the protection code. In this case, the transformation pass
creates new redundant allocations, and each time a read or a write
to the original allocations is detected, the same operation is
replicated on the redundant allocations. The alloca instruction
replication is shown in Fig. 4, where %i is
the original alloca instruction and %pwtc21 and %pwtcx32 are
the redundant ones we created.
The analysis pass detects writes to the original allocations
(store instructions), each of which stores a value at an address
specified by the allocation. The transformation pass knows
the types of the original store instructions and their number,
and adds the protection code accordingly. In this case, new
stores are created that store the same value as the original one
to the previously created redundant allocations; see Fig. 5. The
instruction store i32 10, i32 %i is the original store, while store
i32 10, i32 %pwtc21 and store i32 10, i32
%pwtcx32 are the newly created stores.
Fig. 6. Protecting the “load” memory instruction.
Fig. 7. Protecting the “add” CPU instruction.
Fig. 8. Injection of CPU instruction.
Every time a load (read) instruction from
a memory location is detected (the original read is %2 = load i32
%i, align 4), redundant reads are added (the redundant
reads are %0 = load i32 %pwtc21 and %1 = load
i32 %pwtcx32), and the outcomes are compared using a
voter, %func = call i32 @vote(i32 %2, i32 %1,
i32 %0), so that only the majority-voted value is used
as the final read; see Fig. 6.
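The replication of alloca, store, and load can be pictured with the following Python sketch. It is a toy model under stated assumptions, not the IR transformation itself: the class ProtectedSlot and its methods are hypothetical names, the three list entries stand for %i, %pwtc21, and %pwtcx32, and the bit-flip is applied by hand.

```python
from collections import Counter


class ProtectedSlot:
    """Toy model of one original allocation plus two redundant copies."""

    def __init__(self):
        # Index 0 plays the role of the original alloca, 1 and 2 the redundant ones.
        self.copies = [None, None, None]

    def store(self, value):
        # Counterpart of the replicated store instructions: write every copy.
        for i in range(3):
            self.copies[i] = value

    def load(self):
        # Counterpart of the replicated loads followed by the voter call.
        value, count = Counter(self.copies).most_common(1)[0]
        if count < 2:
            raise RuntimeError("all three copies disagree")
        return value


slot = ProtectedSlot()
slot.store(10)
slot.copies[0] ^= 1 << 3  # emulate a bit-flip in the original copy (10 becomes 2)
print(slot.load())        # 10
```

In this sketch the voted read does not write the corrected value back to the corrupted copy; whether the protection code does so is not stated here.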
2) ALU Instructions Protection: The analysis pass detects
this type of instruction and classifies each one depending on its
data type. The transformation pass knows the type of each instruction,
replicates it, and calls a voter function to decide the right
outcome at runtime; see Fig. 7.
3) CPU and Memory Protection: Finally, the protection of
both instruction types is combined, in order to assess the
highest resilience for the processing architecture. This can be
achieved with a single transformation pass, with two separate ones,
or with one pass dedicated to each instruction type.
This combination ensures that memory instructions are protected,
meaning that they can be used by the CPU safely. The
ALU instructions are protected as well, by redundancy and
a voting function checker.
V. ERROR INJECTION
We have developed a new fault injection tool based on the LLVM
compiler in order to validate and measure our software protection
method’s ability to detect and recover from errors. The process of
injection is applied to the unprotected code as well as
the protected code. By quantifying the number of injections and
produced errors, the error coverage can be deduced. For large
benchmarks, we inject enough faults to reach a 0.95 confidence level.
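How the 0.95 confidence level is computed is not detailed here. As one possible reading, the sketch below estimates the error rate from the injection counts and attaches a 95% interval using the normal approximation to the binomial distribution. The function name error_rate_interval, the choice of approximation, and the sample counts are all assumptions made for illustration.

```python
import math


def error_rate_interval(errors, injections, z=1.96):
    """Return (rate, lower, upper) using the normal approximation; z=1.96 gives 95%."""
    p = errors / injections
    half_width = z * math.sqrt(p * (1.0 - p) / injections)
    return p, max(0.0, p - half_width), min(1.0, p + half_width)


# Hypothetical campaign: 250 erroneous runs out of 1000 injections.
rate, low, high = error_rate_interval(250, 1000)
print(f"error rate {rate:.3f}, 95% interval [{low:.3f}, {high:.3f}]")
```

The interval narrows with the square root of the number of injections, which is why larger benchmarks need more injections to reach the same confidence.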
Fig. 9. Injecting, tracing, and comparison of the protected and unprotected
code.
Fig. 10. Injection of store memory instruction.
The error injector’s principle of work is shown in Fig. 9. The
left side shows the injection and the tracing of the unprotected code.
Tracing is the process of logging the outcome of all instructions
into files. The logged files of the injection experiment are
compared with the golden files (files containing the outcome of
every instruction without injection). The code must be in its IR
to be traced. Using a Python script, tracing instructions are added
to the code to log the outcome of every instruction.
The right side of Fig. 9 shows the injection and tracing of the
protected code. This code is injected after the
protection code has been added to it using our LLVM
protection passes. The protected code is then injected
randomly by calling a function that flips a random bit of an
instruction to emulate a single event error and produce one of the
error types (SDC, Control, Hang, or Crash). The resulting error
type depends on the nature of
the instruction and how the error propagates through the code. Finally, a
comparison is made between the golden files and the outcomes of
injecting the unprotected and protected codes. Comparing the
protected injected code log files with the golden ones shows the
number and types of errors that occurred, allowing us to determine
the coverage added by our software protection code. The information
provided by the analysis pass is used by the fault
injector to decide which injection is required for
every instruction type. The code is injected in its IR, where
every type has its injecting function, including (i32, i32, i1, i8,
i8, i64, float and double, float and double pointers).
Limitations of our fault injector are as follows.
1) We cannot inject void types.
2) Branches are void by nature and, in order to inject them,
the decision instruction (e.g., “cmp”) must be injected.
3) Return instructions are void. In order to inject them, the
“load” instruction before them must be injected.
A. Injection Experiments of Different Instruction Types
1) Injecting the CPU Instructions: Fig. 8
shows how the injection of the CPU instructions is
performed.
Once the instruction has been injected using the @flip function,
which randomly flips one of its bits, the following
method is called inside the LLVM injection pass:
replaceAllUsesWith(callfl). This method
replaces the uses of the original instruction with the
newly injected one.
2) Injecting Memory Instructions: The process of injecting
the load instructions is similar to the injection of the CPU
instructions; the only difference arises when injecting the store
instruction, which is depicted in Fig. 10.
Instead of calling the method responsible for replacing the
uses of the store instruction, it is sufficient to add a new
store instruction that stores the injected value at the original
memory address, where the correct value was supposed to be
stored.
The flip function takes a value or a pointer to the value,
randomly flips one of its bits (one of 32 bits when an int32 is flipped),
and then returns the flipped value, or a pointer to the flipped value
(if a pointer was passed to it).
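The behavior of the flip function can be sketched in Python as follows. This is an illustration only: the real @flip is called from the injected IR, and the names flip_int and flip_float, the unsigned treatment of integers, and the use of the 32-bit IEEE 754 pattern for floats are assumptions.

```python
import random
import struct


def flip_int(value, width=32, rng=random):
    """Flip one randomly chosen bit of a `width`-bit unsigned integer."""
    bit = rng.randrange(width)
    return (value ^ (1 << bit)) & ((1 << width) - 1)


def flip_float(value, rng=random):
    """Flip one random bit of a float through its 32-bit IEEE 754 pattern."""
    (bits,) = struct.unpack("<I", struct.pack("<f", value))
    (flipped,) = struct.unpack("<f", struct.pack("<I", flip_int(bits, 32, rng)))
    return flipped


rng = random.Random(1)  # seeded only to make a run repeatable
print(flip_int(10, rng=rng), flip_float(1.5, rng=rng))
```

Exactly one bit differs between the input and the output of flip_int, which corresponds to a single event upset in one word; a flipped float may become very large, infinite, or NaN depending on the bit that is hit.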
In order to detect the different types of error that have occurred,
a Python script compares the two log files, the
golden outputs and the injected logs. SDC errors are detected if
the output files have the same length but different contents. Hangs
and Control errors are detected when the injected log files are
longer than the golden files. Crash errors are identified when the
size of the injected output file is 0; see Algorithm 1.
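A minimal Python sketch of this classification is given below. It works on lists of log lines instead of files, follows the order of the tests in Algorithm 1, and is not the authors' script: the function name classify, the label "no_error" for runs whose log matches the golden log, and the threshold value passed as max_lines are assumptions.

```python
from collections import Counter


def classify(golden_lines, injected_lines, max_lines):
    """Classify one injected run by comparing its log with the golden log."""
    if len(injected_lines) == 0:
        return "crash"      # nothing was logged
    if len(injected_lines) > max_lines:
        return "hang"       # log grew past the allowed size
    if len(injected_lines) != len(golden_lines):
        return "control"    # a different number of instructions executed
    if injected_lines != golden_lines:
        return "sdc"        # same length, different values
    return "no_error"


golden = ["1", "2", "3"]
runs = [[], ["1", "2", "7"], ["1", "2"], ["1", "2", "3"]]
print(Counter(classify(golden, run, max_lines=100) for run in runs))
```

Tallying the labels over all injected runs gives the per-type error counts from which the error rates in the following experiments can be derived.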
B. Injecting Unprotected Code
In order to evaluate our method’s error detection and recovery
ability, we have implemented it on nine well-known
benchmarks: Fibo, Qsort, SolveCubic, Rad2Deg, Deg2Rad, UQsort,
fast Fourier transform (FFT), matrix multiplication
(MM), and Susan from MediaBench [35]. SolveCubic, Rad2Deg,
Deg2Rad, and UQsort were combined into a single benchmark that we
call the Math benchmark.
The unprotected code is highly vulnerable to error injection,
since it has no protection scheme. All benchmarks, except for
Susan, have shown error rates of more than 50%. Hangs only occur
in the Susan benchmark. All of the benchmarks suffered SDC,
control errors, and crashes, except for the Fibo benchmark, where
only SDC and control errors occurred; see Fig. 11.
Fig. 11. No protection applied.
Algorithm 1: Error Types Classification.
Initialize(Crash_Error_Count = 0)
Initialize(Control_Error_Count = 0)
Initialize(SDC_Error_Counter = 0)
Initialize(Hang_Error_Counter = 0)
For (Injected_File in Injected_Files)
  If (Injected_File_Size == 0)
    ++Crash_Error_Count;
  Else If (Injected_File_Size > Max_File_Size)
    ++Hang_Error_Counter;
  End
  For (Golden_File in Golden_Files)
    If (Golden_File_Size != Injected_File_Size)
      ++Control_Error_Count;
    Else If (Golden_File_Index == Injected_File_Index)
      If (Golden_File_Lines != Injected_File_Lines)
        ++SDC_Error_Counter;
      End
    End
  End
End
C. Protecting CPU Instruction Types
Unlike most of the literature that injects only CPU instruc-
tions, we decided to take a more holistic and realistic approach
by injecting all instruction types to assess whether our protec-
tion technique could reduce the original error rates obtained from
Fig. 12. Protect CPU instructions (binary operations, arithmetic and logic
instructions).
the first experiment. In our second injection experiment, only the
CPU instructions (and not cache memory) have been protected.
Fig. 12 shows that the error rate has been reduced slightly
in most of the benchmarks. In the Fibo benchmark, protecting
only CPU instructions has not reduced the error rate, which stayed
similar to that of the unprotected experiment. The error rates
dropped to 9.78%, 26.85%, 33.36%, 28.41%, and 33.35% in the Susan,
Qsort, Math, FFT, and MM benchmarks, respectively. This
shows that protecting only the CPU instructions is not enough
to guarantee good error coverage. The remaining high error rate
is due to the nature of the benchmarks, where the CPU instructions
occupy a small part of the code.
D. Protecting Memory Instruction Types
In this third experiment, only memory instructions have been
protected, including the read and write operations from/to cache
memory. In the literature, this type of instruction has been ignored
because of the assumption that hard-ECC is used. We do not
assume any special hardware architecture, which allows us
to extend our approach to multiple processing architectures. In this
experiment, we injected all the instruction types and checked the
error coverage. Protecting just the memory instructions reduced the
error rate further than in the second experiment,
where only the CPU instructions were protected. The error
rates dropped to 0%, 4.77%, 9.02%, 15.18%, 12.40%,
and 9.62% in the Fibo, Qsort, Susan, Math, FFT, and MM benchmarks,
respectively; see Fig. 13. This improvement is
due to the large portion of the benchmarks occupied by memory
instructions, meaning that this type of instruction requires
protection.
Fig. 13. Protect memory instructions.
E. Protecting All Instruction Types
Injecting the protected code and comparing it to the injected
unprotected code quantifies the reliability provided by our
protection technique to the benchmarks. In the fourth experiment,
all instruction types have been protected, that is,
the protection of both the CPU and memory
instructions is combined.
This has dramatically improved the coverage: in some
benchmarks the error rate has been reduced to 0%. In the Math
benchmark (SolveCubic, Rad2Deg, Deg2Rad, and
UQsort) and the FFT and MM benchmarks, the total error rate
dropped to 3.97%, 1.23%, and 0.71%, respectively. For the Susan
benchmark, where random errors have been injected, we noticed
that the error rate dropped to 0.83%; see Fig. 14.
Fig. 14. Protect CPU and memory instructions.
The low error rate achieved using our software protection was
due to the replication of all instruction types (CPU and memory
instructions). Combining the protection of both instruction
types ensures higher coverage. The datatypes i32,
i32, i1, i8, i8, i64, float and double, float and double pointers
have been included in this paper.
We have built our fault injection tool considering single faults
to assess SEEs; however, we expect that our protection code will
detect and recover from multiple bit errors, provided that
they are in the same word. We can also protect multiple
separate variables, since they are TMR’d using our software.
Directly using our protection technique prevented crashes
and control errors.
F. ACEDR Time Overhead
The overhead, or the time delay added when applying our protection
techniques, is compared to the execution time when
no protection is applied. We started by recording the execution
time of the unprotected code; we then protected the code with
ACEDR and measured its execution time. We used the
Linux tool perf to measure the delays and the number of processor
cycles [36]. Table II shows the time overheads
for the different benchmarks protected with ACEDR.
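The overhead computation can be written as a one-line formula; the Python sketch below assumes that the overhead is expressed as the relative increase in execution time over the unprotected run. The function name time_overhead_percent and the two timings are hypothetical and are not taken from Table II.

```python
def time_overhead_percent(t_unprotected, t_protected):
    """Relative slowdown of the protected build, in percent."""
    if t_unprotected <= 0:
        raise ValueError("baseline time must be positive")
    return 100.0 * (t_protected - t_unprotected) / t_unprotected


# Hypothetical execution times in seconds.
print(time_overhead_percent(2.00, 2.25))  # 12.5
```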
TABLE II
ACEDR TIME OVERHEAD FOR THE DIFFERENT PROCESSING PLATFORMS
The study of the time overhead when injecting the code is
platform independent (the same results were obtained on the
Raspberry Pi 3 and the Intel Core i5-3470). The overheads
depend on the error types, as follows.
1) SDC did not generate any significant overhead (less than
1%).
2) Hangs produced an indefinite overhead, since the
program is stuck in an infinite loop.
3) Control or flow errors produced significant overhead,
which could range from 0% up to multiple times the original
time of execution; in some cases this reached an order
of magnitude of the original time. When the overhead is
infinite, a hang has occurred.
4) Crash means that the code did not execute or terminated
incorrectly, producing 0% overhead.
The high performance (low time overhead) is due to the
pipeline, where independent redundant instructions are
executed in parallel. The desktop performance is better than
that of the embedded Raspberry Pi 3 (especially when running the
Susan benchmark), due to the large difference in performance
between the two platforms. Another important factor is the compiler
optimizations, which improve the performance considerably. This
demonstrates the portability of ACEDR and the compilation of
protected code for differing processing architectures.
G. ACEDR and State of the Art
Compared to the state-of-the-art EDDI [24] and SWIFT [25],
both of which are used for error detection only, without the
ability to recover, and in which instructions are duplicated
(memory and CPU instructions for EDDI, part of the memory
and all CPU instructions for SWIFT), our method offers better
performance in terms of the overhead generated.
In addition, our method is capable of both error detection
and recovery, which is very important in real-time applications,
where the recovery part enables the system to keep
running reliably. In the state of the art, the error rate of
injecting the baseline (unprotected) code was predetermined
(20% error for EDDI and 37% for SWIFT), which could
dramatically impact the results of injecting the protected code. Our
paper does not predetermine the baseline error rate. Every
benchmark’s instructions are injected independently, which results
in different baseline error rates (from 25.06% up to 87.09%).
This will yield more realistic results, when the protected code is
injected.
VI. RELIABILITY COMPARISON OF INJECTION EXPERIMENTS
WITH PREDICTIONS
Section V, where the injection experiments were performed,
has shown that the error rate dropped using
purely the software protection techniques developed. Since we
have obtained the sensitivities of the different instruction
types, we can measure the precision of our reliability prediction
model from Section III by comparing the reliability obtained
from (21) and (34). In this scenario,
we take (34) as the ground truth. In order to show that the
protection code did not just reduce the error rate, but also improved
the total reliability of the whole system, a comparison between
(21) and (34) is made.
The instruction types are divided into two major categories
(CPU/Memory); however, they can
be subdivided into the total number of types “l” that exist in the
benchmark.
The key remarks are as follows.
1) Both the predictions and the injection experiment show an
improvement in reliability when protecting using ACEDR.
2) The reliability of the prediction is higher than the relia-
bility of the injection experiment at the start; then both
curves intersect. After the intersection point, the relia-
bility of injection becomes higher than the reliability of
predictions.
A. Mean Time to Failure (MTTF)
In order to determine the accuracy of our prediction techniques
compared to the results of the injection, we investigated
the MTTF as a key performance metric. By definition, MTTF
is the length of time a device or other product is expected to
last in operation. MTTF is obtained by integrating reliability
over time as follows: $\mathrm{MTTF} = \int_{0}^{\infty} R(t)\,dt$. The MTTF of
the injection experiment is obtained by integrating (34), and the
MTTF of the prediction is obtained by integrating (28).
The accuracy of our predictions is determined by the error
between the injected and predicted MTTFs. The residual error
depends on the error rates of the CPU and the first level of
cache, λp and λc, because the hit rate
of the first level of cache was greater than or equal to 98% in all
of the tested benchmarks. We plot the error as a function of the
ratio λp/λc in Fig. 15.
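The MTTF comparison can be illustrated numerically with the Python sketch below. It does not use the reliability models (28) and (34); instead it assumes simple exponential curves R(t) = exp(-λt), for which the exact MTTF is 1/λ, so the numerical result can be checked. The function names, the two rates, the truncation of the integral at a finite horizon, and the choice of normalizing the error by the injected MTTF are all assumptions.

```python
import math


def mttf(reliability, horizon, steps=100_000):
    """Trapezoidal estimate of the integral of R(t) from 0 to `horizon`."""
    dt = horizon / steps
    total = 0.5 * (reliability(0.0) + reliability(horizon))
    for k in range(1, steps):
        total += reliability(k * dt)
    return total * dt


def relative_error(mttf_injected, mttf_predicted):
    """Error of the prediction, taking the injected MTTF as ground truth."""
    return abs(mttf_injected - mttf_predicted) / mttf_injected


lam_injected, lam_predicted = 1.0e-3, 1.1e-3  # hypothetical error rates
horizon = 50 / lam_injected                   # R(t) is negligible beyond this point
m_inj = mttf(lambda t: math.exp(-lam_injected * t), horizon)
m_pred = mttf(lambda t: math.exp(-lam_predicted * t), horizon)
print(round(m_inj, 2), round(m_pred, 2), round(relative_error(m_inj, m_pred), 4))
```

With these assumed rates the two estimates are close to 1/λ, that is, about 1000 and 909, and the relative error is about 9%.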
Fig. 15 includes the error of MTTF for the previously injected
benchmarks. We note that the lowest errors for the benchmarks
are $8.9 \times 10^{-4}$, $3.2 \times 10^{-3}$, $2.3 \times 10^{-3}$, 0.028, $8.06 \times 10^{-3}$,
and $2.88 \times 10^{-5}$, respectively. We define these as the optimal
points. After this point, the error as a function of the λp/λc ratio
becomes linear and constant, at 0.030, 0.042, 0.076, 0.4, 0.7, and
1.6, respectively. Errors corresponding to ratios less than the
minimal points are high and are considered inaccurate. We recommend
that users apply our method for future predictions and models only
when the ratio is greater than this minimum.
Fig. 15. MTTF error with respect to (λp/λc) ratio.
VII. CONCLUSION
Bit-flips originating from SEEs are becoming a prominent
problem in processor architectures. It is crucial for designers
of both mainstream and embedded or critical processing systems
to ensure the reliability of their systems. Redundant
hardware that makes use of hard-ECC and hard-TMR
elevates the design complexity of terrestrial applications, often
eliminating it as an option. Soft error detection and recovery
methods are viable alternatives because of their high coverage
and low overhead, allowing for the best trade space between
reliability and performance and providing engineers with flexible
ways of protecting their processing architectures.
New ACEDR techniques were implemented at the compiler level
in this paper. We implemented and verified a new error injection
tool with experiments on different benchmarks in order to test
the reliability of our software protection techniques. We injected
all instructions of the chosen benchmarks, dividing the
instructions into two main categories: memory instructions, which
perform R/W operations on the cache memory of the processing
architecture, and CPU instructions, which include logic and
arithmetic operations. To quantify our results, we injected both the
protected and the unprotected code and compared the results.
We showed that CPU registers and their data or instructions
can be fully protected against the bit-flips caused by the fault
injection experiment simulating SEEs. For both the first and
second benchmarks, we demonstrated that all instructions can
be fully protected with almost 100% error injection coverage.
When injecting errors in code, we greatly reduced error rates
in the benchmarks: from 73.08% to 0% for Fibo, from 46.16%
to 0% for Qsort, from 87.41% to 4.45% for Math, from 25.03%
to 0.83% for Susan, from 54.12% to 1.23% for FFT, and from
59.72% to 0.71% for MM.
In the third benchmark (Math), which includes SolveCubic,
Rad2Deg, Deg2Rad, and UQsort, the CPU instructions were fully
protected. Even though the third benchmark was not fully covered,
all crashes and control/flow errors were eliminated
using ACEDR. The high error detection and recovery rate was
due to the replication of multiple data and instruction types,
including i32, i32, i1, i8, i8, i64, float and double, float and
double pointers. The ACEDR protection mitigates different error
types (crashes, SDC, control/flow) with low time overhead
on multicore processors across multiple platforms, enabling the use
of our protection technique on many architectures. We found that the
delays measured were not highly significant because of the pipelining
of independent redundant instructions, which allows the use of the
abundant resources of the CPU architectures without causing a
bottleneck.
In this paper, we also presented equations modeling the reliability
of the whole processing architecture, using multiple parameters
related to the hardware architecture and the environment
in which the architecture operates. The injection experiments
confirm the reliability predictions: for both benchmarks,
the curves show that the reliability of the protected code is higher
than that of the unprotected one. The precision of our
prediction model depends on the values of the initial variables
(the error rates of the CPU and the caches); the model can reach
a very high accuracy level if the ratio λp/λc corresponds to the
optimal point. For values above the optimal point, the accuracy
ranges from 92.35% to 98.5% depending on the benchmark,
except for the Susan benchmark, where the accuracy was 77%,
which is due to the randomness of the error injection in this
benchmark. Our model is limited if the values of λp/λc are less
than the optimal point, where the MTTF error can reach high
rates.
This work would be highly valuable not only to satellite and space
applications but also to general computing, such as in aircraft,
automotive, server farms, and medical equipment (or anywhere that
needs safety-critical performance), as hardware gets smaller and more
susceptible.
REFERENCES
[1] R. C. Baumann, “Soft errors in advanced semiconductor devices-part I:
The three radiation sources,” IEEE Trans. Device Mater. Rel., vol. 1, no. 1,
pp. 17–22, Mar. 2001.
[2] T. J. O’Gorman et al., “Field testing for cosmic ray soft errors in semi-
conductor memories,” IBM J. Res. Develop., vol. 40, no. 1, pp. 41–50,
1996.
[3] P. Shivakumar et al., “Modeling the effect of technology trends on the soft
error rate of combinational logic,” in Proc. Int. Conf. Dependable Syst.
Netw., 2002, pp. 389–398.
[4] P. Pukite and J. Pukite, Markov Modeling for Reliability Analysis. Wiley-
IEEE Press, 1998.
[5] R. Ginosar, “Survey of processors for space,” Data Syst. Aerosp. (DASIA).
Eurospace, pp. 1–5, 2012.
[6] R. W. Horst, R. L. Harris, and R. L. Jardine, “Multiple instruction issue in
the NonStop cyclone processor,” in Proc. 17th Annu. Int. Symp. Comput.
Architecture, 1990, pp. 216–226.
[7] T. J. Slegel et al., “IBM’s S/390 G5 microprocessor design,” IEEE Micro,
vol. 19, no. 2, pp. 12–23, Mar./Apr. 1999.
[8] M. Tremblay and Y. Tamir, “Support for fault tolerance in VLSI proces-
sors,” in Proc. IEEE Int. Symp. Circuits Syst., 1989, pp. 388–393.
[9] R. Phelan, “Addressing soft errors in ARM core-based SoC,” ARM, Cam-
bridge, U.K., White Paper, 2003.
[10] Y. Nezzari and C. Bridges, “Compiler extensions towards reliable multi-
core processors,” in Proc. IEEE Aerosp. Conf., 2017, pp. 1–6.
[11] C. Lattner and V. Adve, LLVM Language Reference Manual, 2006.
[12] A. Avizienis et al., “Basic concepts and taxonomy of dependable and
secure computing,” IEEE Trans. Dependable Secure Comput., vol. 1, no. 1,
pp. 11–33, 2004.
[13] N. Aggarwal et al., “Configurable isolation: Building high availability sys-
tems with commodity multi-core processors,” in ACM SIGARCH Comput.
Architecture News, vol. 35, pp. 470–481, 2007.
[14] A. Vajda, “Multi-core and many-core processor architectures,” in Pro-
gramming Many-Core Chips. Springer, 2011, pp. 9–43.
[15] J. P. Walters et al., “Software-based fault tolerance for the maestro many-
core processor,” in Proc. IEEE Aerosp. Conf., 2011, pp. 1–12.
[16] A. Avizienis, “The N-version approach to fault-tolerant software,” IEEE
Trans. Softw. Eng., vol. 1985, no. 12, pp. 1491–1501, Dec. 1985.
[17] K. Mitropoulou, V. Porpodas, and M. Cintra, “DRIFT: Decoupled
compiler-based instruction-level fault-tolerance,” in International Work-
shop on Languages and Compilers for Parallel Computing. Springer, 2013.
[18] J. E. Fritts et al., “MediaBench II video: Expediting the next generation
of video systems research,” Microprocessors Microsystems, vol. 33, no. 4,
pp. 301–318, 2009.
[19] J. L. Henning, “SPEC CPU2000: Measuring CPU performance in the new
millennium,” Computer, vol. 33, no. 7, pp. 28–35, 2000.
[20] A. Piotrowski, “Automatic installation of software-based fault tolerance
algorithms in programs generated by GCC compiler,” Int. J. Microelec-
tronics Comput. Sci., vol. 1, no. 3, pp. 263–268, 2010.
[21] G. A. Reis, J. Chang, and D. I. August, “Automatic instruction-level
software-only recovery,” IEEE Micro, vol. 27, no. 1, pp. 36–47, Jan./Feb.
2007.
[22] S. Feng et al., “Shoestring: Probabilistic soft error reliability on the cheap,”
in ACM SIGARCH Comput. Architecture News, vol. 38, pp. 385–396, 2010.
[23] J. Yu and M. J. Garzaran, “Compiler optimizations for fault tolerance soft-
ware checking,” in Proc. 16th Int. Conf. Parallel Architecture Compilation
Techn., 2007, p. 433.
[24] N. Oh, P. P. Shirvani, and E. J. McCluskey, “Error detection by duplicated
instructions in super-scalar processors,” IEEE Trans. Rel., vol. 51, no. 1,
pp. 63–75, Mar. 2002.
[25] G. A. Reis et al., “SWIFT: Software implemented fault tolerance,” in Proc.
Int. Symp. Code Gener. Optim., 2005, pp. 243–254.
[26] UIUC Open IMPACT Effort, “The OpenIMPACT IA-64 compiler.”
[27] Y. Zhang et al., “DAFT: Decoupled acyclic fault tolerance,” Int. J. Parallel
Program., vol. 40, no. 1, pp. 118–140, 2012.
[28] C. Wang et al., “Compiler-managed software-based redundant multi-
threading for transient fault detection,” in Proc. Int. Symp. Code Gener.
Optim., 2007, pp. 244–258.
[29] K. Mitropoulou, V. Porpodas, and M. Cintra, “CASTED: Core-adaptive
software transient error detection for tightly coupled cores,” in Proc. IEEE
27th Int. Symp. Parallel Distrib. Process., 2013, pp. 513–524.
[30] R. M. Stallman, “Using and porting the GNU compiler collection,” Free
Softw. Found., vol. 51, pp. 02110–1301, 1989.
[31] U. Afzaal and J.-A. Lee, “A self-checking TMR voter for increased re-
liability consensus voting in FPGAs,” IEEE Trans. Nuclear Sci., vol. 65,
no. 5, pp. 1133–1139, May 2018.
[32] Y. Nezzari and C. Bridges, “Modelling processor reliability using LLVM
compiler fault injection,” in Proc. IEEE Aerosp. Conf., 2018, pp. 1–10.
[33] LLVM, LLVM Language Reference Manual. (2018). [Online]. Available:
https://llvm.org/docs/LangRef.html
[34] LLVM, LLVM IR builder. (2018). [Online]. Available:
http://llvm.org/doxygen/classllvm_1_1IRBuilder.html
[35] M. Consortium, MediaBench II benchmark. (2015). [Online]. Available:
http://mathstat.slu.edu/∼fritts/mediabench
[36] A. C. de Melo, “The new Linux ‘perf’ tools,” in Proc. Slides Linux Kongress,
2010, p. 18.
Yasser Nezzari received the State Engineering degree in control engineering
from the Institute of Electrical and Electronic Engineering at the University of
Boumerdes, Boumerdes, Algeria, in 2014. He is currently working toward the
Ph.D. degree at the Surrey Space Centre, University of Surrey, Guildford, U.K.,
where he is involved in researching software compilers, radiation tolerance using
software, and adaptive software.
Christopher P. Bridges received the B.Eng. degree from the University of
Greenwich, London, U.K., in 2005, and the Ph.D. degree from the University
of Surrey, Guildford, U.K., in 2009, both in electronic engineering.
In 2013, he designed, built, and still operates the U.K.’s first CubeSat
(STRaND-1) with Surrey Satellite Technology (SSTL) and now contributes
toward computing hardware and software with SSTL, on ESA’s ESEO mis-
sion and also the NASA-JPL/CalTech AAReST mission. He currently leads the
On-Board Data Handling (OBDH) research group within Surrey Space Centre,
University of Surrey. His research interests include software-defined radios,
real-time embedded systems, agent computing, Java processing, multicore pro-
cessing in FPGAs, and astrodynamic computing methods in many spaceflight
payloads.
