SSFT: selective software fault tolerance by Turhan, Tuncer
SSFT: SELECTIVE SOFTWARE FAULT
TOLERANCE
a thesis
submitted to the department of computer engineering
and the graduate school of engineering and science
of bilkent university
in partial fulfillment of the requirements
for the degree of
master of science
By
Tuncer Turhan
January, 2014
I certify that I have read this thesis and that in my opinion it is fully adequate,
in scope and in quality, as a thesis for the degree of Master of Science.
Assoc. Prof. Dr. Ozcan Ozturk(Advisor)
I certify that I have read this thesis and that in my opinion it is fully adequate,
in scope and in quality, as a thesis for the degree of Master of Science.
Assist. Prof. Dr. Bedir Tekinerdogan
I certify that I have read this thesis and that in my opinion it is fully adequate,
in scope and in quality, as a thesis for the degree of Master of Science.
Assoc. Prof. Dr. Suleyman Tosun
Approved for the Graduate School of Engineering and Science:
Prof. Dr. Levent Onural
Director of the Graduate School
ii
ABSTRACT
SSFT: SELECTIVE SOFTWARE FAULT TOLERANCE
Tuncer Turhan
M.S. in Computer Engineering
Supervisor: Assoc. Prof. Dr. Ozcan Ozturk
January, 2014
As technology advances, the processors are shrunk in size and manufactured
using higher density transistors which makes them cheaper, more power ecient
and more powerful. While this progress is most benecial to end-users, these ad-
vances make processors more vulnerable to outside radiation causing soft errors
which occur mostly in the form of single bit ips on data. For protection against
soft errors, hardware techniques like ECC (Error Correcting Code) and Ram
Parity Memory are proposed to provide error detection and even error correc-
tion capabilities. While hardware techniques provide eective solutions, software
only techniques may oer cheaper and more exible alternatives where additional
hardware is not available or cannot be introduced to existing architectures. Soft-
ware fault detection techniques -while powerful- rely mostly on redundancy which
causes signicant amount of performance overhead and increase in the number
of bits susceptible to soft errors. In most cases, where reliability is a concern,
the availability and performance of the system is even a bigger concern, which
actually requires a multi objective optimization approach. In applications where
a certain margin of error is acceptable and availability is important, the existing
software fault tolerance techniques may not be applied directly because of the
unacceptable performance overheads they introduce to the system. Our tech-
nique Selective Software Fault Tolerance (SSFT) aims at providing availability
and reliability simultaneously, by providing only required amount of protection
while preserving the quality of the program output. SSFT uses software proling
information to understand application's vulnerabilities against transient faults.
Transient faults are more likely to occur in instructions that have higher execu-
tion counts. Additionally, the instructions that cause greater damage in program
output when hit by transient faults, should be considered as application weak-
nesses in terms of reliability. SSFT combines these information to eliminate the
instructions from fault tolerance, that are less likely to be hit by transient errors
or cause errors in program output. This approach reduces power consumption
iii
iv
and redundancy (therefore less data bits susceptible to soft errors), while improv-
ing performance and providing acceptable reliability. This technique can easily be
adapted to existing software fault tolerance techniques in order to achieve a more
suitable form of protection that will satisfy dierent concerns of the application.
Similarly, hybrid and hardware only approaches may also take advantage of the
optimizations provided by our technique.
Keywords: Software Fault Tolerance, Software Fault Injection, Software Proling
for Reliability, Reliability, Multi objective optimization: Reliability and Avail-
ability.
OZET
SEC _IMSEL YAZILIM HATA TOLERANSI
Tuncer Turhan
Bilgisayar Muhendisligi, Yuksek Lisans
Tez Yoneticisi: Doc. Dr. Ozcan Ozturk
Ocak, 2014
Teknolojik gelismelerle birlikte, islemciler boyut olarak daha kucultuluyor ve
uretim surecinde daha sk ve kucuk boyutlu transistorler kullanlarak uretiliyorlar.
Bu uretim sureci islemcileri daha ucuz, daha guc tasarruu ve daha guclu klyor.
Bu surec son kullanc icin son derece faydal olmasna karsn, bu surec islemcileri
ds ortamdan kaynakl radyasyona kars daha zayf klyor ve bunun sonu-
cunda, genellikle veri ustunde tek bir bitin deger degistirmesi formunda olusan,
yumusak hatalar olusuyor. Yumusak hatalara kars uygulamalarn guvenilirligini
arttrabilmek adna, literaturde ECC (Hata Duzeltme Kodu) ozellikli veya Parite
ozellikle hafza gibi donanmsal hata tolerans teknikleri gelistirilmistir. Do-
nanmsal hata tolerans teknikleri etkili cozumler uretmesine karsn, donanmsal
altyapnn bulunmadg ya da var olan sisteme eklenmesi mumkun olmadg du-
rumlarda, yazlmsal hata tolerans teknikleri daha ucuz ve esnek bir alternatif
sunabilir. Yazlmsal Hata Tolerans teknikleri, guclu bir alternatif olmasna
ragmen, genellikle yedekleme mantgna dayal calstklarndan, performans
dususune ve hataya neden olmaya ack olan bit saysn arttrmaktadr. Uygu-
lama guvenilirliginin bir endise oldugu sistemlerde, performans ve erisiliebilirlik
daha buyuk bir sistem endisesi ve gereksinimi durumunda oldugundan, bu coklu
objektii bir yaklasm gerektirmektedir. Belirli olcude bir hatann kabul edilebilir
oldugu ve erisilebilirligin onemli oldugu uygulamalarda, sisteme getirdikleri per-
formans yukunden oturu, literaturdeki yazlmsal hata tolerans tekniklerini
oldugu gibi kullanlamayabilir. Bu noktada bizim teknigimiz secimsel yazlm
hata tolerans (SYHT) erisilebilirlik ve guvenilirligi es zamanl saglamay hede-
er. SYHT bunu, uygulamann sadece ihtiyac duydugu olcude hata tolerans
kullanarak ve uygulamann urettigi verilerde kaliteyi koruyarak saglamaktadr.
SYHT yazlm prol bilgisini kullanarak, uygulamann yumusak hatalara kars
hassasiyetlerini anlamaya calsr. Yumusak hatalar yuksek sayda calsan komut
satrlarnda olusmaya meyillidir. Ayrca, yumusak hatalara maruz kaldgnda,
uygulama tarafndan olusturulan verilerde daha fazla hataya sebep olan komut
v
vi
satrlarnn uygulamann guvenilirlik acsndan hassas komut satrlar oldugu
soylenebilir. SYHT bu bilgileri kullanarak, yumusak hatalara maruz kalma
olaslg dusuk ve uygulama tarafndan olusturulan verilerde daha az hataya
sebep olan komut satrlar icin hata toleransn kaldrr. Bu yaklasm, perfor-
mans arttrrken ve guvenilirligi yeterli seviyede tutarken, enerji tuketimlerini ve
yedeklenen veri saysn (dolaysyla yumusak hatalara maruz kalan bit saysn)
azaltr.Bu teknik kolaylkla literaturde yaygn olan yazlmsal hata tolerans
tekniklerine adapte edilebilir. Bu adaptasyonu yaparken de, uygulamann cesitli
endiselerine uygun olacak sekilde bir hata tolerans kullanr. Benzeri sekilde, hib-
rit ve donanmsal hata tolerans yaklasmlar da bizim yaklasmmzn sagladg
iyilestirmelerden faydalanabilirler.
Anahtar sozcukler : Yazlmsal Hata Tolerans, Yazlmsal Hata Enjeksiyonu,
Yazlm Guvenligi icin Yazlmsal Prol C karma , Yazlm Guvenilirligi, Coklu
Objektii _Iyilestirme: Guvenilirlik ve Erisilebilirlik.
Acknowledgement
I am much obliged to my supervisor Assoc. Prof. Dr. Ozcan Ozturk, for his
understanding and guidance through this experience. His ideas were extremely
helpful to me, in each step of my studies.
I am also grateful to my jury members, Assist. Prof. Dr. Bedir Tekinerdogan
and Assoc. Prof. Dr. Suleyman Tosun for their participation and their invaluable
comments and suggestions.
I am grateful to Computer Engineering Department of Bilkent University for
providing me tuition waiver for my MS studies.
I am thankful to Central Bank of Turkey for the allowance they provided
during this study.
I would also like to thank my friends for their understanding and support
throughout this experience, Osman Deger, Emir Gulumser, Gulden Torer, Seckin
Okkar, Ethem Bars Ozturk, Nesim Yigit and Muhammed Buyuktemiz.
I would also want to thank my entire family for their support, without them
and their prayers, this whole study would be for nothing.
vii
Contents
1 Introduction 1
2 Motivation 9
3 Related Work 14
3.1 Fault Injection . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
3.2 Fault Detection and Recovery . . . . . . . . . . . . . . . . . . . . 16
3.2.1 Hardware Fault Tolerance . . . . . . . . . . . . . . . . . . 16
3.2.2 Software Fault Tolerance . . . . . . . . . . . . . . . . . . . 19
3.2.3 Hybrid Techniques . . . . . . . . . . . . . . . . . . . . . . 23
3.3 Software Proling For Fault Tolerance . . . . . . . . . . . . . . . 27
4 Our Approach 32
4.1 Preliminaries . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
4.2 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
4.3 Selective SFT . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
viii
CONTENTS ix
4.4 Implementation Details . . . . . . . . . . . . . . . . . . . . . . . . 44
4.5 CFG Based Vulnerability Estimation Example . . . . . . . . . . . 53
5 EXPERIMENTAL EVALUATION 59
5.1 Benchmarks and Setup . . . . . . . . . . . . . . . . . . . . . . . . 59
5.2 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
5.3 Sensitivity Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . 68
6 Conclusion 73
List of Figures
2.1 Improvements for CRT. . . . . . . . . . . . . . . . . . . . . . . . . 12
3.1 Check positions for Hamming Code. . . . . . . . . . . . . . . . . . 17
3.2 EDDI instruction duplication and scheduling example. . . . . . . 21
3.3 Weight calculation for error paths. . . . . . . . . . . . . . . . . . . 29
4.1 Example functions to adjust parameters. . . . . . . . . . . . . . . 37
4.2 An example CFG graph. . . . . . . . . . . . . . . . . . . . . . . . 41
4.3 SSFT system architecture. . . . . . . . . . . . . . . . . . . . . . . 44
4.4 GIMPLE code example. . . . . . . . . . . . . . . . . . . . . . . . 48
4.5 Modied GCC compiles the application source code to produce
error injected application executable. . . . . . . . . . . . . . . . . 49
4.6 Test runs produce output and coverage statistics. . . . . . . . . . 50
4.7 QE is estimated by golden run comparisons. . . . . . . . . . . . . 51
4.8 ER estimations are produced using QE values, coverage statistics
(PE values) and error injection details. . . . . . . . . . . . . . . . 52
4.9 CFG for DCT function of Compress benchmark. . . . . . . . . . . 53
x
LIST OF FIGURES xi
4.10 Code fragment for DCT. . . . . . . . . . . . . . . . . . . . . . . . 54
4.11 GIMPLE Code for basic block 10. . . . . . . . . . . . . . . . . . . 54
4.12 GIMPLE Code for basic block 2. . . . . . . . . . . . . . . . . . . 54
4.13 GIMPLE Code for Basic Block 2 after error injection. . . . . . . . 55
4.14 GIMPLE Code for basic block 6. . . . . . . . . . . . . . . . . . . 56
4.15 GIMPLE Code for Basic Block 6 after error injection. . . . . . . . 56
4.16 Table for Error Rates and ER values for statement parameters. . . 57
5.1 PER values for our benchmarks (10 6). . . . . . . . . . . . . . . . 62
5.2 Execution counts for our benchmarks (106). . . . . . . . . . . . . 63
5.3 Improvements with our approach when the application is executed
without any errors, that is PER = 0. . . . . . . . . . . . . . . . . 64
5.4 Normalized execution times compared to SWIFT and EDDI ap-
proaches without applying our technique. . . . . . . . . . . . . . . 65
5.5 Program binary size reductions compared to SWIFT and EDDI
approaches without applying SSFT. . . . . . . . . . . . . . . . . . 66
5.6 Instruction count reductions compared to SWIFT and EDDI ap-
proaches without applying our technique. . . . . . . . . . . . . . . 67
5.7 The percentage of parameters that can be excluded from software
tolerance for dierent PER values. . . . . . . . . . . . . . . . . . 68
5.8 Normalized execution times compared to SWIFT approach with-
out applying our technique for dierent PER values. . . . . . . . 69
5.9 Program binary size reductions compared to SWIFT approach
without applying our technique for dierent PER values. . . . . . 70
LIST OF FIGURES xii
5.10 Instruction count reductions compared to SWIFT approach with-
out applying our technique for dierent PER values. . . . . . . . 71
5.11 The average rate of parameters that can be removed from software
tolerance for dierent PER values. . . . . . . . . . . . . . . . . . 72
List of Tables
5.1 The characteristics of the benchmark codes used in this study. . 59
xiii
Chapter 1
Introduction
Over the last decade, the processors have improved in many aspects through
technological advancements; they have become cheaper, more powerful and less
energy consuming and overall oer better and more ecient computing. In order
to provide more ecient and powerful processors, hardware manufacturers keep
improving their designs and fabrication technologies. There are many improve-
ments being applied; however the following aspects are more important for this
thesis: using higher density transistors and introduction of Chip Multiprocessors
(CMP). The current state of the art technology is 14nm process technology which
is adopted by most of the processor and System on Chip manufacturers including
Intel, AMD, Nvidia and ARM. The number of transistors on integrated circuits
doubles approximately every two years according to Moore's Law and to pro-
vide such advancements companies increase the transistor densities. Although,
these advancements provide users cheaper, faster and more ecient processors,
there are issues that need to be addressed in order to continue with such ad-
vancements. Baumann states that, "As the dimensions and operating voltages
of computer electronics shrink to satisfy consumers' insatiable demand for higher
density, greater functionality, and lower power consumption, sensitivity to radi-
ation increases dramatically." [1]. According to recent studies [2] soft errors are
expected to grow further as the scaling goes beyond 14nm, and with each gen-
eration 8% increase in soft-error rate is expected. This sensitivity to radiation
1
presents itself in the form of Single Event Upset (SEU). While these faults, in
general, are considered as transient errors and do not cause any permanent dam-
age on the hardware, a single bit ip in data may cause signicant failures. In
software systems, application programs, operating systems or drivers are consid-
ered as main causes of faults; however, in some cases, these transient faults may
be the source of the actual failure. In 2000, Sun Microsystems acknowledged that
cosmic rays interfered with cache memories and caused crashes in server systems
at major customer sites, including America Online, and dozens of others [3, 4].
In a more recent event, Hewlett Packard stated that cosmic ray strikes causing
transient faults was the main cause of the frequent crashes in 2048-CPU server
system in Los Alamos National Laboratory [5]. For brevity and keeping focus on
the concerns addressed in this thesis, the details about the formal denition of
the transient faults and actual reasons behind faults will not be further discussed.
In order to prevent failures caused by transient faults, the general convention
is to use bit level protection techniques like ECC (Error Correction Code) or
EDAC (Error Detection and Correction Code) in the memory architecture. In
order to comply with the needs of ECC, a circuitry that is capable of encoding
(Hamming Code, Reed Solomon) the data (encoded data is used for error detec-
tion and recovery), additional data space to store the encoded data bits, error
checking and recovery mechanisms are introduced to existing memory architec-
tures. For error detection and recovery, the Hamming Code encoding requires 8
extra bits to be encoded and stored for each of the 64 bit cells in the memory
architecture. The added circuitry will use these encoded data bits for error de-
tection and recovery purposes.
ECC hardware implementing Hamming Code is able to detect 2 bit errors and
can correct 1 bit errors in memory which is called SECDED (single error correc-
tion, double error detection). This SECDED is the convention in ECC because;
in order to recover from multiple bit errors a higher number of bits needed to be
stored, more hardware will be introduced to system for encoded data calculation
and data recovery. These additional requirements will result in an even more
expensive and slower system which is impractical in most cases. Multiple bit
failures are considered to happen much less frequently compared to SEU, which
makes SEU detection by far the most important concern [4].
2
While ECC oers great level of protection and recovery, it is sometimes omit-
ted for being costly and increasing the access times in memory [6]. Parity RAM
is another hardware solution, which requires less hardware than ECC. However,
faults can only be detected, but not corrected with this protection. In general,
Parity RAM is also considered to be costly and slower than RAM that is not
providing any protection, therefore may also be omitted. The convention is that,
memory units that are lower in the memory hierarchy, such as L1 and L2 caches,
are equipped with parity protection. In case of a failure, the data is restored
from its original location in RAM. Although most of the memory hierarchy is
seemingly under protection, there are parts inside the CPU architecture that are
not being protected (due to limitations of hardware fault tolerance techniques)
and hence are open to transient faults. ECC and parity bit protection techniques
cannot be applied to most parts of the CPU architecture and are often criticized
as they are not scalable to address the reliability concerns of the entire computer
architecture.
To give a more specic example, consider a system having ECC protection
in its RAM memory and parity protection in its cache level memory. Any SEU
on the data located in the RAM; will be detected and corrected before it can
cause any faulty behavior. The L1 and L2 caches will detect any SEU using
the parity bits and restore its data from the RAM which is known to be pro-
tected by ECC. However, when faults occur inside the ALU (Arithmetic Logic
Unit) or in the instruction fetch-decode unit or the registers inside the CPU, the
fault detection and recovery is not possible. ECC protection for these internal
parts of the CPU is known to be costly, power consuming and slow, and thus
is considered not scalable and usually not adopted by CPU manufacturers. For
instance, protection of the data in a CPU register le using ECC is shown to
be extremely costly in terms of both performance and power [7]. For ALU, such
protection will disrupt the pipeline architecture, impair the performance of the
whole processor while increasing the power consumption and the cost. Other
alternatives include using the pipelined structures inside the CPU and executing
the same instruction twice and delay the output until the result is veried by the
second execution. Similarly, VLIW (Very Long Instruction Word) architecture is
able to take advantage of ILP (Instruction Level Parallelism) and can be used in
3
order to execute the same instruction twice and comparing the results. VLIW
is a very common architecture, especially in GPUs (Graphics Processing Unit)
which implement SIMD (Single Instruction Multiple Data) or MIMD (Multiple
Instruction Multiple Data) on a manycore architecture. Since, GPU is a many-
core architecture that can process multiple data in an extremely fast manner; it
has become a new alternative for General Purpose Computing and referred to as
GPGPU (General-Purpose GPU). In both VLIW and GPU, executing the same
instruction multiple times will have an impact on the ILP and impair the system
performance dramatically, while decreasing the availability.
The bit level hardware protection is not commonly adopted in the low-level
hardware hierarchy due to aforementioned concerns. The manufacturers are of-
ten forced to implement high level protections, by which they are able to obtain
promising results with less severe impacts on the performance and cost. These
architecturally high level approaches are called "macro-reliability protection" [7].
Macro-reliability protection often uses duplication of coarse-grained structures
such as CPU cores or hardware contexts inside the processing unit to provide
transient fault tolerance in a more cost-eective and scalable manner [7]. While
this approach overcomes the scaling problems of the prior bit-level techniques,
macro-reliability schemes adopt a rather inexible one size ts all protection over
the whole CPU architecture. This strict protection scheme will not be able to
adapt dierent levels of performance and reliability requirements and most of the
time will end up overprotecting the entire system. This overprotection will be re-
ected to the end-user as higher power consumption levels with signicant losses
in performance and availability while increasing the overall cost. The end-users
may want to use the same underlying hardware for dierent types of applications
having dierent levels of reliability requirements, while in some applications faults
are intolerable, in others imprecise results may be acceptable. Similarly, the user
may want to upgrade the system conguration to increase the overall reliability
or performance of the system, in order to adapt rapidly changing requirements of
the market. These system upgrades will be much more costly because of the one
size ts all protection approach. Moreover, in most cases the underlying hardware
cannot be modied and the user may still want to improve the level of protection
from the transient faults. These examples can be multiplied, but concisely, the
4
reliability and performance concerns need to be handled in a more adaptable way
to fulll the specic performance and reliability requirements of an application.
This is where software fault tolerance (SFT) techniques become more appealing
alternatives with promises of providing a more exible way for users to adjust the
reliability and performance levels according to their needs and in case of a lack
of underlying hardware support.
SFT techniques rely on some form of redundancy similar to Hardware Fault
Tolerance (HFT) techniques. SFT techniques use dierent forms and levels of
redundancies. For instance, a bit-level approach in software can be used in terms
of error-detection and recovery, just like hardware ECC and parity protections.
This will require the application to handle encoding of Hamming Code (or Reed
Solomon) extra bits, store them in memory, check in case any error occurs and re-
cover the data. In practice, the software approach of ECC will perform worse than
hardware ECC, since bit level encoding calculations can be better implemented in
hardware as well as, detection and recovery calculations. Since the calculations
cannot be directly injected in the main thread due to performance overheads,
the software ECC will be performed in another thread in the form of periodical
sweeps to data residing in memory. Rapidly changing data in memory cannot be
addressed with software ECC, since the recovery data needs to be recalculated
in each data change. The parts of the memory where data is more immutable is
rather convenient for this protection scheme like L1 instruction cache where the
running application resides. The data in memory is partially protected and the
protection of the application code is done without any distinction. Additionally,
any error occurring between the sweeps will still aect the execution and even
spread throughout the rest of the program.
Another SFT technique is instruction duplication, where each instruction is
duplicated and the results of the duplicate instructions (shadow instructions) are
compared with the main instruction's results at synchronization points. The syn-
chronization points in the execution cycle are mainly chosen as the store instruc-
tions which store the data from registers to memory. The main idea is to keep the
data in the memory intact, detect errors before data is written back to memory,
and prevent any corruption. The instruction duplication is expected to double
the execution latency of the application, however with the use of ILP (Instruction
5
Level Parallelism) techniques and some additional improvements the performance
overhead can be reduced. For instance, SWIFT (an SFT instruction duplication
technique), is able to improve execution latency to 1.41x compared to the baseline
where no SFT is applied [4]. Similar to performance overheads, other concerns
emerge with instruction duplication. Such as code size and volatile memory re-
quirements. Instruction duplication scheme can also be used in hybrid techniques,
where hardware supports software implementation. Hybrid techniques were pro-
posed to tackle the performance bottlenecks of SFT-only instruction duplication
techniques, with provided additional hardware assistance.
In addition to software redundancy at the instruction level, it is also possible
to implement redundancy at the thread level, where an identical copy of the main
thread runs for reliability. Two dierent types of redundant thread mechanisms
come to mind, running the redundant thread in the same CPU that supports si-
multaneous multithreading capabilities (SMT) and running the redundant thread
in another CPU core which is possible in Chip Multiprocessor (CMP) systems.
The redundant thread introduces a synchronization problem between the threads
since it requires a slack between leading and trailing threads in which the trailing
thread will follow the leading thread. The trailing thread will detect and recover
from faults that may occur. The redundant thread mechanism in CMP is also
referred to as CRT (Chip level redundantly threading). CRT scheme also in-
troduces a communication overhead, due to data needed by multiple processors.
Therefore, it requires require additional hardware queues to be implemented in-
side the CPU.
The details of the SFT techniques described above will be discussed in more
detail in the related work section. The novel idea behind the SFT techniques is
that these techniques provide a more exible, cheaper alternative to HFT meth-
ods. However, despite its advantages, these systems overprotect the application
as a whole without taking advantage of the application and hardware character-
istics. In addition to performance overhead, volatile and non-volatile memory
requirements increase; additional hardware requirements emerge with the use of
SFT techniques, which defeats the whole purpose of using them. Moreover, the
redundancies introduced to system (shadow instructions, hardware queues, etc.)
increase the number of data bits that are vulnerable to SEUs, thereby leading to
6
a higher soft error rate [8].
Based on the drawbacks of current SFT techniques, it is necessary to take a
fresh look at software fault tolerance, where reliability and performance require-
ments of the running applications considered. More specically, SFT schemes
should aim at a balanced protection, that provides required levels of protection
while decreasing the cost and performance overheads. In order to achieve a bal-
anced protection, SFT techniques should be applied selectively. In the rest of
this thesis, this is called selective SFT (SSFT). In order to selectively apply fault
tolerance, the application should be carefully analyzed using software proling
information.
There are two dierent analysis in using software proling information for
reliability, static and dynamic. Static analysis uses the oine information that
is obtained during the compilation process of the application code. Compilers
often uses passes to analyze the application. The application code is rst parsed
and converted in to structure called Control Flow Graph (CFG). CFG consists of
basic blocks that are connected by edges. Basic blocks are straight lines of code,
that does not contain any jump instructions. The jumps between basic blocks
are provided by edges that connect basic blocks. CFG shows the paths that can
be traversed during the execution of the program. After CFG is formed, compiler
processes the application code to eliminate dead code and optimize program ex-
ecution. An example for static analysis can be type inference for fault-tolerance
prediction [9]. This study analyzes the instruction operand types, an informa-
tion that can acquired during compilation. EPIC, another technique using static
analysis, uses error propagation and CFG data to understand the impacts of a
soft error [10]. Although, oine analysis provide invaluable information to under-
stand the application characteristics in terms of reliability, an online analysis may
provide a dierent perspective. SSFT follows an orthogonal path, using run-time
information to understand the eects of soft-errors.
For dynamic analysis of an application, we use statistical data that is pro-
duced during program execution. The statistical data provides us, the paths that
application traverses in CFG and the the execution counts for each basic block.
The statements that are located in basic blocks with high execution counts, are
7
more vulnerable to transient faults. Moreover, the output produced by the appli-
cation can be analyzed in order to understand the weaknesses of the application.
The statements that causes heavy damage in program output, when distrupted
by a SEU, should be protected by some form of fault tolerance in order to preserve
the reliability. Additionally, the hardware and environmental conditions can also
be an eective factor, when considering the rate of transient faults.
SSFT will use aforementioned dynamic analysis, to selectively protect the
code segments that are most likely to damage software reliability. The code seg-
ments that are less likely to damage application execution, can be removed from
fault tolerance without impairing reliability. During this selection, the amount
of redundancies introduced to the system are reduced, while, the probability of
transient fault occurrence and specic output quality requirements of the user are
considered. SSFT will increase the performance, decrease hardware requirements
and therefore cost, while eectively preserving the software reliability. The mo-
tivations behind SSFT, briey discussed above, will be explained in depth with
examples in the next chapter.
8
Chapter 2
Motivation
The motivation behind using SFT techniques is that they are more exible and
cheaper unlike their HFT counterparts. The HFT techniques protect the whole
CPU hardware without considering the running application. Moreover, HFT
techniques are expensive, and therefore not scalable (especially bit-level tech-
niques like ECC or Parity). Furthermore, they increase the access times of
volatile memory, thereby decreasing the availability and increasing the power
usage. While most of these overheads are valid, it will be shown that SFT tech-
niques do not completely overcome these problems.
To overcome the limitations of SFT, we propose Selective SFT, where we
choose specic portions of the application that are most vulnerable to transient
faults. Specically, we prole the application and apply SFT to program seg-
ments that are likely to cause the most damage to program execution. By careful
selection of these program portions, the overall output quality is preserved with
minimal safekeeping.
Both EDDI [11] and SWIFT [4] protection schemes use instruction duplica-
tion for each and every line of the code. They use dierent registers for the
duplicated instructions and faults are detected at synchronization points (store
instructions). Similar to HFT techniques, both of these techniques overprotect
the whole CPU hardware without any distinctions. Application binary and data
is protected in a unied manner with no distinction.
9
One can observe that not all parts of the code have the same importance
level. For instance, a register may contain debugging data that does not aect
the outcome of the program. An error occurring in this register, will not aect
the application output. Similarly, a register le that is masked by an "AND"
instruction, will not cause any errors in the program output if the transient fault
occurs in the masked bits. Moreover, a register le that contains obsolete in-
formation (i.e. a loop variable after the loop is execution is nished), will not
aect the software execution in case of a transient fault occurrence. As a last
example, data in the register le may contain highly precise information which is
not required for that application (Double precision data is not vital for oating
point calculations). Any error occurring, in high precision bits will not impact
the output. In each of these examples, the application will resiliently recover from
the soft errors occurring in these sections without any serious impact of the ap-
plication outcome, therefore totally abandoning duplicated instruction approach
or using a more lightweight protection scheme for these sections will not have
major impact on software reliability. However, no such distinctions are made
in duplicated instruction schemes, all instructions and data are duplicated and
protected.
HFT methods are known to be expensive and not scalable, whereas SFT meth-
ods are expected to be cheaper and not require additional hardware. EDDI, as
one of the initial single threaded SFT techniques, has a geometric mean of 1.62x,
whereas SWIFT has 1.41x execution time compared to the baseline without any
fault tolerance. Therefore, reliability is achieved at the expense of availibility and
performance. In other words, SWIFT requires a better performing CPU (1.41x)
to achieve the same performance levels. In addition to performance, memory
footprint will also be aected. Specically, application binary size will be 2.4x
larger compared to the baseline SWIFT implementation [4]. Moreover, the extra
shadow instructions will increase the pressure on CPU registers, cache, and RAM.
The issues above show that instruction duplication technique also has its own
drawbacks similar to HFT. While these issues cannot be ignored or resolved com-
pletely, their impacts can be safely reduced, especially for applications that are
more resilient to soft errors. Consider an application in which the calculations
10
do not require 100% precision, that is an imprecise or approximate result is ac-
ceptable. These types of computations are commonly used in soft computing
applications, which are naturally more resilient to soft errors since an exact re-
sult is not always essential. For an application where 95% precision is sucient,
a protection scheme that protects some parts of the application and gives 95%
precision will be sucient. According to SEU model, a bit is dened as ACE (Ar-
chitecturally Correct Execution) if a transient fault aecting that bit will cause
the program to execute incorrectly [12]. When the redundancy is reduced in the
system, the number of bits susceptible to soft errors will decrease, hence the num-
ber of ACE bits will decrease.
Another example that can make use of SFT is software ECC, where an ECC
encoding is done over the instruction cache memory [6]. The software ECC thread
is a high priority thread that detects and corrects errors in the running applica-
tion code. The sweep performance aects the overall availability and performance
of the system since other processes are halted during the sweeps. Proling in-
formation about the running application and selectively choosing the instruction
data to protect will result in less sweep time, therefore will put less pressure on
the system performance and eventually increase the availability.
Thread-level redundancy which runs an identical copy of the main thread for
error detection requires a slack between the main and trailing thread. CRT (Chip
level redundantly threading) systems which may be preferable for performance
(2 CPUs are running the threads instead of one) and reliability concerns, since
the trailing thread will be running in a CPU that is physically far away from
the CPU that the main thread is running. This way overheads due to running
threads will be reduces while eliminating the possibility of a fault corrupting both
threads [13]. However CRT has a major impact on inter thread communication
in that the thread communication transforms into inter processor communication
which requires fast communication channels. This communication overhead is
hidden by enabling a longer slack between redundant threads, which eectively
stalls the main thread. When SSFT idea is applied to CRT or CRTR (Chip-
level Redundantly Threaded multiprocessor with Recovery), the trailing thread
will not require the same amount of resources as the original thread and the re-
sources seized by the trailing thread can be safely released when trailing thread
11
Main Thread
Software Code
Error Detection Code
Error Recovery Code
Communication
TrailingThread
Figure 2.1: Improvements for CRT.
is not used. The slack between main and trailing thread can be reduced through
SSFT which will improve the overall performance, reduce power consumption
and resource requirements. Similar to the SFT techniques presented above, the
improvements and the time saved can be put to better use for recovery purposes.
Figure 2.1 shows how SSFT can be applied to CRT, where bold lines represent
a redundant fault detection and recovery thread. As can be seen from the gure,
trailing thread does not need to be an exact replica of main thread, i.e, the lines
that are not bolded ar not executed by trailing thread. The spare time achieved
by not executing these statements, can be used for recovery and to compensate
communication delays.
Selectively applying reliability can also be implemented to accommodate hard-
ware fault tolerance techniques. The compiler previously informed about the
underlying hardware and the running application can make informed decisions
about where in memory each instruction should be placed. The instructions or
data that are more resilient to soft errors can be placed in locations that are un-
protected or only protected with parity, whereas the instructions that are most
likely to cause great damage are placed in ECC covered memory locations. In
12
this manner, the costs can be reduced and HFT techniques can be applied more
eectively.
Beyond the benets brought to fault tolerance techniques, SSFT mainly uti-
lizes the key fact that not all applications require the same amount of reliability.
Some applications are actually more tolerant to imprecisions and approximations
which make them more resilient to transient faults. These applications (referred
to as Soft Computations) may require some amount of reliability when they also
have conicting performance and availability concerns. In stream processing ap-
plications, nancial calculations and even some safety critical systems, where a
margin of error is acceptable. The SSFT idea is feasible to apply and potentially
will bring a balance between performance and reliability.
The arguments presented above constitute the motivational base for proling
the running application and selectively applying the fault tolerance techniques.
The details about fault injection, software proling, fault detection and recovery
will be discussed in the next section.
13
Chapter 3
Related Work
3.1 Fault Injection
The fault injection techniques follow various paths in simulating the transient
faults that are caused by cosmic radiation. MEFISTO [14], VERIFY [15] and
DEPEND [16] tools inject faults into a simulation model. RIFLE [17] and MES-
SALINE [18] tools inject faults at hardware pin-level. FIAT [19], and FER-
RARI [20] are tools that inject faults into physical systems using software im-
plemented fault injection (SWIFI). The SWIFI idea brings a new perspective to
fault injection in that, the faulty conditions can be simulated without hardware
requirements. These tools have the general problem of being specically designed
for a certain hardware and therefore cannot be adapted to dierent congura-
tions. Tools that are more adaptable emerged later on, such as NFTAPE [21],
GOOFI [22], PROPANE [23] and SWIF-IT [24].
NFTAPE tool supports multiple fault models (bit-ips, communication and
IO errors), multiple fault event triggers (path-based, time-based, and event-based
triggers), multiple targets (distributed applications, software implemented fault
tolerance (SIFT) middleware layer, black box applications, communication inter-
face, and operating system) and supports memory dumps when required. [21]
GOOFI tool on the other hand, is an object-oriented JAVA and SQL based
14
tool, in which user can use existing fault injection techniques or extend the tool
by dening their own and run fault injection tests. The tool targets Thor RD mi-
croprocessor which is a SAAB Ericsson Space AB processor and is created solely
for highly dependable space applications [22].
Propagation Analysis Environment (PROPANE) is a software proling and
fault injection tool for applications running on desktop computers. PROPANE
supports the injection of both software faults (by mutation of source code) and
data errors (by manipulating variable and memory contents) [23]. PROPANE's
capabilities for software proling, is more focused on the error propagation char-
acteristics of the running application.
SWIF-IT is more of a kernel level tool developed to run under Linux, which
injects fault at memory locations and inspects the impacts of the injected fault.
The fault injector being implemented within the kernel has the limitation that
memory corruptions are restricted to the kernel's view of the hardware meaning
that corruption of a data structure inside the process table is easier compared
to a data in a specic memory location [24]. Additionally, this tool oers er-
ror recovery schemes for the faulty memory locations. The recovery schemes are
based on simple redundancy techniques like Hamming Code and majority voting,
in which multiple copies of the same data is stored and in case of an error it is
recovered using the value that has the majority.
While these software implemented fault injection (SWIFI) tools provide some
level of fault injection capabilities, we have implemented our own fault injection
and testing tool which provides us the required data for our approach.
15
3.2 Fault Detection and Recovery
When considering transient fault detection and recovery, there are three main
classes of techniques: hardware fault tolerance (HFT), software fault tolerance
(SFT) and hybrid techniques (software implemented hardware supported).
3.2.1 Hardware Fault Tolerance
HFT techniques can be further categorized in to two, bit-level approaches and
macro-reliability approaches. Bit-level approaches mostly rely on redundantly
storing extra data bits for the data in memory in order to detect and recover
from any transient faults. One of the simplest approaches for error detection is
parity. Parity for a data is calculated by simply applying the XOR operation
on the data bits. For instance, even and odd parity for the following 7 bit data
"1101011" is calculated as follows.
1 1 0 1 0 1 1 = 1 Even Parity (3.1)
 (1 1 0 1 0 1 1) = 0 Odd Parity (3.2)
The idea is store the parity information in memory or in data transactions so
that a single bit error in data can be detected. For instance, in an ASCII data
transmission, the resulting parity bit is added as the 8th bit data and the data
will be sent as "11010111" where the last bit is the even parity bit. The receiver
will check the received data by simply applying the parity calculation on data
and comparing the calculated parity bit with the parity bit received. In case
a parity error is detected, data transmission is repeated. Similarly, this is also
used in memory hierarchy. Especially, CPU caches adopt this idea in hardware
so that in case of an error inside the data in the cache, data is invalidated and
requested from memory. Parity bit is able to detect a single bit error inside the
data; however, it does not oer any recovery options.
For recovery purposes, dierent error detection and recovery schemes have
16
Check Number Check Positions Positions Checked
1 1 1,3,5,7,8,11,13,15,17...
2 2 2,3,6,7,10,11,14,15,18...
3 4 4,5,6,7,12,13,14,15,20...
4 8 8,9,10,11,12,13,14,15,24...
. . .
. . .
. . .
Figure 3.1: Check positions for Hamming Code.
been proposed and used in the literature. Hamming Code, Reed-Solomon Code
and other cyclic code schemes are commonly used in the industry for this purpose.
For brevity, only Hamming code (the recovery method adopted in ECC memory)
will be explained here. Hamming code can be used to detect and correct single
bit errors, and with an additional parity bit added, it can also detect double
errors. The idea is to put parity protections on the positions that are powers
of two starting from the rst bit position. For a 64 bit data, the parity will be
placed on 1, 2, 4, 8, 16, 32, and 64 check positions which will make a sum of 7
parity bits. The check positions and the positions checked are shown in Figure
3.1.
For instance, a check position 4 (100 in binary) has the check number 3 mean-
ing that, all the data bit positions having 1 in their 3rd bit needs to be added in
parity calculation, which are 100 (4), 101 (5), 110 (6), 111 (7), 1100 (12) and
17
so on. As an example for detection and recovery, consider a 7 bit data, 1101011,
and calculate the parity for check positions 1, 2 and 4, powers of two up to the
total number of data bits. The bits are numbered from right to left starting from
one and the data in position x is shown as dx.
P1 = d1  d3  d5  d7 = 1 0 0 1 = 0 (3.3)
P2 = d2  d3  d6  d7 = 1 0 1 1 = 1 (3.4)
P3 = d4  d5  d6  d7 = 1 0 1 1 = 1 (3.5)
Consider the case when a single bit ip occurs in bit location 6, then the parities
will look like as the following.
P1 = d1  d3  d5  d7 = 1 0 0 1 = 0 (3.6)
P2 = d2  d3  d6  d7 = 1 0 0 1 = 0 (3.7)
P3 = d4  d5  d6  d7 = 1 0 0 1 = 0 (3.8)
The location of the error is determined by comparing the parity values. P1
is correct, therefore we write a 0; P2 is incorrect therefore we write a 1 to the
left, obtaining 10; P3 is incorrect as well, therefore putting another one to the
left we end up with 110 which indicates the data in the 6th bit position is ipped.
This way, the Hamming Code is able to perform 1 bit error detection and 1 bit
data recovery. In order to have single error correction, double error detection
(SECDED), an even data parity is added to 7 bit data (XOR of all data bits).
Although bit level approaches oer detection and recovery options, they are
often found not to be scalable, and therefore have not been used widely in CPU
architectures. This is due to their impact on performance and power consump-
tion. The alternative is to follow a higher level approach called macro-reliability.
Some of the older systems like HPs NonStop Cyclone System [25] or IBMs S/390
G5 processor [26] or Triple redundant 777 primary ight computer [27], all rely on
redundant CPU cores (Boeing 777 even has multiple ARINC data buses), through
18
which faults are detected, corrected. Even in case of a total hardware failure of
a single hardware, these systems promise to operate uninterrupted. There are
also simpler approaches in duplicating hardware for redundancy such as extra
pipelines in processors, queues for memory loads and stores, extra branch pre-
dictors [13, 28, 29, 30], additional data bits for detection of possibly incorrect
data [31].
Recently, with the emergnce of Chip Multiprocessors (CMP), hardware dupli-
cation and usage has become much more attractive. However, the he additional
hardware is required to overcome the diculties in communication, race condi-
tions and other problems that arise with reliability concerns.
Intel Itanium proessor is a good example of hardware reliability scheme im-
plementation. Specically, Itanium processor oer parity protection in low-level
caches, ECC protection in high level caches. Errors in pipelines are detected
using residues, which are calculated during mathematical operations, or using
parity bits [32]. The instruction level faults are detected and corrected inside the
pipelining architecture. Whenever a soft error occurs, the instruction is simply
re-read correctly from the instruction buer and restarted through the pipeline
as if the error had never occurred (referred to as replay). An error occurring in
the instruction buer is handled in the following manner; all instructions in the
instruction execution pipeline, the instruction buer, and the instruction fetch
pipeline are removed, then the erratic instruction and the instructions after it are
re-read from cache (referred to as refetch). Soft errors in translation lookaside
buers (TLBs) and in general purpose and oating point register les are handled
by rmware which eventually restarts the OS or application to resume running
as if no errors occurred (referred to as resteer). The technique described above is
handled by hardware without any software interventions.
3.2.2 Software Fault Tolerance
Software fault tolerance techniques mostly rely on redundant data and computa-
tions and comparison of the original and redundant data (or calculation). The
recovery of the faulty data is generally addressed as a separate problem. For
19
software fault detection, there are single threaded techniques that duplicate the
instructions like EDDI [11] or SWIFT [4]. The instruction duplication was pre-
sented rst as an ALU instruction duplication scheme. The idea was to increase
reliability, by keeping the redundancy as low as possible, simply by using the
same registers and abuse the VLIW architecture's ability to execute the same
instruction on multiple data and compare the outcomes that are executed by the
same CPU instruction [33].
In order to generalize this approach and protect all instructions inside the
system with better reliability, error detection by duplicated instructions (EDDI)
was proposed. EDDI simply duplicates all the instructions and reorders the
duplicated (or shadow) instructions that occur before the store instructions, so
that the instruction level parallelism (ILP) capabilities of CPU are utilized. The
results of the main instruction (MI) and shadow instruction (SI) are compared
before the store instructions so that the data in memory is always kept intact.
All store and load instructions (memory operations) are duplicated and the only
control over control ow errors relies on the count of instructions that are run by
shadow and main instructions. According to the experimental results, EDDI has
an execution time of 1.61x compared to the baseline execution [4, 11]. Normal
duplication of the instructions or execution of the same instructions twice is
expected to have a 2x execution latency, however exploiting ILP, the performance
is improved by 20% percent. However, this improvement all depends on the ILP
capabilities of the underlying CPU. More specifcially, instead of using a 4-way
issue CPU, a 2-way issue CPU has an execution latency of 1.82x. Experiments
show that EDDI is able to provide a 98% error coverage capability [11]. Figure 3.2
shows how EDDI adds the shadow instructions and schedules the instructions. As
can be seen from this gure, the SI (marked with an apostrophe) are interspersed
with main instructions for better ILP and the comparison is done just before the
data is written back to the memory.
SWIFT tries to tackle the problems that are not addressed or partially ad-
dressed by EDDI and tries to improve the overall performance by rather simple
optimizations. One simple optimization is that the I5' instruction in Figure 3
can simply be omitted since it is not critical and the stored value will require
20
I
1
 : ADD R1, R2, R3 
I
2
 : SUB R4, R1, R7 
I’
1
 : ADD R21, R22, R23 
I
3
 : AND R5, R1, R2 
I’
2
 : SUB R24, R21, R27 
I
4
 : MUL R6, R4, R5 
I’
3
 : AND R25, R21, R22 
I’
4
 : MUL R26, R24, R25 
I
c
 : BNE R6, R7, go_to_error_handler 
I
5
 : ST R6 
I’
5
 : ST R26 
Figure 3.2: EDDI instruction duplication and scheduling example.
additional memory space and an additional rather slow memory instruction to
be executed (referred to as EDDI + ECC). Another improvement in SWIFT is
in terms of the control ow checking mechanisms. EDDI does not have a direct
control ow control but rather implicitly checks the number of MI and SI. This
check may cause invalid branches to be taken, invalid store and load instructions
to be executed that may disrupt the MI execution or feed these instructions with
invalid data [4]. SWIFT overcomes this problem by adding the branch instruc-
tions as synchronization points and designating block signatures to the executing
block, so that the control ow is validated by comparison of the expected block
signature and the executing block signature. The block signatures are stored
and updated for each block in GSR and RTS registers. This improvement is
referred to as EDDI + ECC + CF (Control Flow Checking). One observation
for EDDI + ECC + CF is that, the control ow checking mechanism is required
only when the output is stored in memory since the main purpose is to keep the
data in memory intact. Therefore, the CF mechanism, signature comparison etc
can be safely omitted in blocks that do not have store instructions (referred to
21
as "SCFOpti"). Another simple observation is that, the comparisons for branch
instructions is actually covered by the idea of comparison of block signatures
which ensures correct branching, henceforth the branch synchronization can be
safely omitted (referred to as "BROpti"). SWIFT does not oer an actual per-
formance improvement over EDDI, however it improves EDDI with control ow
checking mechanism and by removing redundancies. Based on the conducted ex-
periments, it is shown that SWIFT has an execution latency of 1.41x compared
to the baseline implementation.
Another software only fault tolerance technique is software EDAC (or ECC)
technique which is a software implementation of ECC (Error Correcting Code).
The implementation aims only to protect the instructions that are placed in the
memory hierarchy before execution. Due to the dynamic nature of the data it
is not preferred and is argued to be not practical to use software ECC [6]. In
this approach, the protection of the software code in memory is done in terms
of sweeps in which the ECC walks through each memory block that contains
software instructions and checks for errors. When a sweep takes place, it takes
the highest execution priority; therefore any other software should be halted
during the sweep interval which eventually is a bottleneck in availability and
performance. The sweeps cannot be cached since; in each sweep the memory will
be read once and checked for errors. Another issue for software ECC is that,
the ECC software also runs in memory and itself is also susceptible to transient
errors in memory. This requires to execute multiple copies and a cross check
between these copies. Although software ECC provides some protection when
hardware ECC does not exist and oers recovery options compared to other SFT
approaches, it does not oer any protection over data and the memory is left
unprotected between the sweeps.
There are other software techniques in the literature having dierent levels
of duplications or controls, such as control ow only detections (using signature
comparisons for blocks, execution parity calculations) [34],high level code dupli-
cation and result comparison [35], analysis for dierent levels of duplication and
tailoring between instruction and procedure call duplication for energy consump-
tion reductions [36], and process level duplication [37].
22
Software fault tolerance techniques that depend on simultaneous multithread-
ing (SMT), mostly require a hardware support for queuing the load values (LVQ),
register values (RVQ), and branch outcomes [13, 28, 29]. Therefore these schemes
should be considered under the umbrella of hybrid techniques as they require ad-
ditional hardware.
In terms of recovery, SWIF-IT oers hamming code and majority voting based
recovery schemes in which multiple copies of the same data is stored and in case
of an error, it is recovered using the value that has the majority. SWIFT-R [38]
technique is an addition to SWIFT that has recovery capabilities. The following
techniques are suggested in SWIFT-R. Triple execution of the same instruction
and decide the result by majority voting in case an error is detected (referred to
as SWIFT-R). Another extension is using AN-CODE to back up a multiple of
the original data (possibly 3) and in case of an error, restoring the data by simply
using the AN-CODE coecient. For instance, a data x is multiplied by 3 and
stored as y. In case of an error, if y is divisible by 3 then y is correct and x should
be recovered as x = y
3
, otherwise y is incorrect and should be restored to y = 3x.
SWIFT-R suggests that, the data containing single bit information inside a 64
bit register should not be aected with the bit ips of the rest of the 63 bits that
are irrelevant. Therefore, these data bits can be ignored and masked (data&0x1)
in order to decrease the vulnerability of the register data (instead of having 64
bit vulnerability, only a single bit is susceptible to transient faults). However,
the applicability of this scheme is limited due to performance overheads. For the
masking idea, it can be argued that the register value having a single bit value is
hard to be ensured and therefore not easy to apply in most cases.
3.2.3 Hybrid Techniques
The hybrid techniques are generally oered as extensions of software only fault
tolerance techniques, to overcome the single points of failures or performance
bottlenecks that cannot be dealt with software. One of the extensions suggested
to improve software techniques is CRAFT, which is an extension over SWIFT
technique.
23
In SWIFT, two problems for memory operations were not addressed and con-
sidered to be the limitations of the approach. One problem in SWIFT is that, the
store instructions are single points of failure [8], since the fault checking is done in
software before the store instruction is issued. Any errors occurring before com-
mit and after the error check cannot be detected and will cause the corrupted
data to be stored in memory, which may later feed and corrupt other instructions.
In order to resolve this store instruction issue, the comparison is removed and
replaced with another store instruction. The second store instruction is fed with
a signature that is not an actual store instruction; rather, it is issued to validate
the original store instruction. In order to provide validation, the stores do not di-
rectly store the data to memory, instead they are queued in a protected hardware
queue and the commit for stores is delayed until the second (or shadow) store
instruction is executed and the data to be written is conrmed to be valid [8].
Similarly, for load memory operation, in order to duplicate the data loaded
from memory and have an exact replica of the loaded data, SWIFT does not
replicate the load instruction (for memory mapped IO, two load instructions may
return dierent values); instead, it inserts a move instruction to copy the loaded
value. However, there are two intervals where transient failures can be corruptive.
The rst interval is, after the load instruction and before the move instruction,
and the second interval for errors is after the memory address verication and
before the actual load instruction. In the rst interval, the original and the copy
will contain a corrupted value which will not be detected. In the second interval,
the data loaded will be issued to a faulty address and will contain invalid data.
In both cases, the faulty data may eventually feed a store instruction at some
point. In order to prevent these faults, the memory load values are queued in a
hardware protected buer called Load Value Queue (LVQ), where only the main
instruction loads the data from memory and the shadow instruction loads the
value queued inside LVQ, which prevents from feeding the shadow instructions
with corrupted data. The data in LVQ can be safely discarded after the shadow
instruction is fed with the buered data. These additional buers improve the
performance while dealing with the vulnerabilities that cannot be addressed by
SWIFT. The performance results compared to the baseline are given as 1.334,
24
1.376, and 1.314 which are improvements with CSB, LVQ, and CSB + LVQ,
respectively [8]. Additionally, the vulnerability factor for silent data corruption
(SDC), which are the undetected errors, is decreased from 90.5% to 89.0% when
both techniques, CSB and LVQ, are applied.
Redundant multithreading techniques can also be considered as hybrid tech-
niques, due to their hardware requirements. With the idea of multithreading as a
way of maximizing on-chip parallelism [39, 40], the idea of using multithreading as
a way to improve reliability also emerged in various approaches [41, 30, 29]. Chip
Multiprocessors (CMP) and CPU level multithreading enabled researchers to in-
vestigate dierent alternatives [29]. Redundant multithreading can be achieved
with two alternative ways, running the leading and trailing threads in the same
single CPU core that supports multithreading capabilities (SMT) or running
the threads in dierent CPU cores (CRT) preferrably in adjacent cores in or-
der to reduce the physical distance and communication delays. AR-SMT [30],
SRT [41, 28, 29] and SRTR [28] were suggested as alternatives for SMT tech-
niques, in which two threads (main and trailing thread) run the same instructions
in parallel threads and the results are compared for detection of transient faults.
In all SMT techniques (and even CRTs), queues or similar queueing mechanisms
are built to enable communication between main and trailing thread. Specically,
LVQ is used to buer the load values of the main thread, register value queue
(RVQ) is used to buer the register values of main thread for comparison, branch
outcome queue (BOQ) is used to store the branch outcomes for the main thread,
and store buer (StB) is used to verify the values issued by store instructions from
the main thread. The queue values are created by main thread and consumed by
trailing thread in order to check for faults.
SRT chooses stores as synchronization points, instead of register updates.
Secondly, a slack fetch organization handles and tries to keep the slack ( trail-
ing thread follows main thread behind certain number of instructions ) between
threads to a pre-dened number and organizes fetches accordingly.
SRTR scheme is actually the extended version of the SRT detection scheme
with recovery abilities. In order to provide the recovery, some SRT modules were
25
reconsidered. In SRT, leading instructions may commit to memory before check-
ing for faults. However, this is not feasible for SRTR since recovery mechanism
depends on trailing thread values, and therefore can not recover when the values
are committed to memory. Secondly, SRTR compares the leading and trailing
instruction values as soon as trailing instruction is complete without being have
to wait for leading instruction commit. To make this possible and reduce commu-
nication requirements, register values are placed in RVQ from which the trailing
thread can make comparisons and detect faults. In order to reduce the band-
width requirements and pressure on RVQ, a dependency check mechanism called
dependence-based checking elision (DBCE) is applied to check the true depen-
dency chains of registers so that only the last instruction in a chain is used and
placed in RVQ.
Although AR-SMT uses the multithreading infrastructure, it neither aims to
exploit the advantages of a CMP structure nor attacks the problems of running the
redundant threads in seperate CPU cores. Slipstream [42] processor idea was the
rst to take the performance advantages of a CMP and use CMPs for reliability
purposes. Later on CRT [29] and CRTR [13] were suggested as alternatives
for fault detection and recovery in CMP systems. In Slipstream, the register
commits are asynchronous in order to overcome the interprocessor communication
delays. The checks are only done at memory commits, and the memory commits
are done only after the checks are completed. The memory instructions are
not copied as in Slipstream (only a single copy of memory is used) only the
register les are duplicated. Also it extends the DBCE idea in SRTR with death
and dependence based checking elision (DDBCE), in which the formerly ignored
masking instructions are also added to dependency chains. The DDBCE chains
a masking instruction only if the source operand of the instruction dies after
the instruction. Register deaths ensure that masked faults do not corrupt later
computation [13].
26
3.3 Software Proling For Fault Tolerance
Software Proling for fault tolerance is the analysis of the application in order
to understand the characteristics of the application in terms of fault tolerance.
The characteristics of the application or the proling information can be put
to use for better protection of the running application. The proling info can
provide the programmer, which parts of the program are more susceptible to
transient faults, what is the approximate possibility of occurrence of some fault,
what is the impact of an error to the program output, in what path does the
errors propagate. These details combined with the underlying hardware, such as
operating system or other system specics will provide invaluable information in
terms of providing the best fault tolerance t to the reliability requirements of
the system with minimal cost. This thesis mainly relies on this fact.
One of the studies in this area uses instruction operand types as predictors
for transient fault impacts [9]. The instruction operands are separated into four
base groups called F, O, C and A. "F" is an abbreviation used for oating point
operand, "O" is for memory oset, "C" is for comparison operand and "A" stands
for ALU instruction operand. By this grouping, each register in the instructions
are grouped according to their usage. For instance, a register operand is of type
O, C, A when the register operand is used in memory oset, comparisons and
ALU instructions. Using this grouping, software characteristics is tried to be
predicted. A case study, Brake by wire [9] is used to evaluate the power of
operand type analysis. The analysis shows that, the brake by software is more
likely to be vulnerable to F and O type operands, meaning any errors in these
operands cause major defects and corruptive faults. Although the type inference
idea provides some insight of software vulnerabilities, it does not give any specic
detail about how to improve the software fault tolerance techniques. Moreover,
although grouping the instructions may seem like a good predictor, all memory
instructions cannot be considered important when some of the memory instruc-
tions cause critical errors. In other words, not all data written to memory or
read from memory carries critical or invaluable information. Or some instruc-
tion operands may contain critical data despite the fact that, most of that type
27
of operands contains insignicant data. As a hypothetic example, consider the
brake by wire case in which "F" and "O" type operands seems to carry more vital
information. While this gives a better idea about operands, it may not always be
true. For instance, an "A" instruction operand may also carry vital information.
Naturally, this particular operand should also be considered for protection when
reliability is a concern. However with type inference technique it may conversely
be grouped into "A" type operands which are considered to be safe to ignore in
terms of reliability.
Another study has a similar approach to ours; EPIC suggests that error propa-
gation and the eect of errors is a requirement that should be taken into account
when developing dependable software. They track error propagation paths as
well as the impact of the errors. For each input-output couple they suggest that
there is a permeability that is dened as the conditional probability of an error
occurring on the output given that the input is faulty [10]. They extend the idea
for multiple input-output systems punishing the modules with large I/O count.
When considering a single input and its impact on single output, the probability
of an error showing in this output (or the weight of an input) is calculated as
the multiplication of the error permeability of the I/O systems in the backtrack
tree. Figure 3.3, is an example that shows I/O paths and weights for outputs in
EPIC [10].
For instance, the calculation of W2 is follows:
W2 = P
B
2:1P
B
1:3P
D
1:1P
E
2:1 (3.9)
The error permeability is only one side of the story and it should be combined
with the eect of the errors that propagated the criticality of the error. The
criticality of the errors is a coecient for the weights calculated which is decided
by the system administrator and is a value between 0 and 1, where 1 denotes the
highest possible criticality value.
In EPIC, inputs of the system are assumed to be independent of each other and
the dependencies are tried to be resolved with the coecients assigned. An input
may have dierent eects on the output according to the variation or values of
28
I2
B 
I1
B 
I1
E 
O1
E 
I1
D 
I2
E 
O1
E 
O1
E 
O1
E 
I2
E 
I1
D I1
E 
P2.1
B P2.3
B 
P2.2
B 
P1.2
B 
P1.1
E P1.1
D 
P1.3
B 
P2.1
E 
P1.1
E P1.1
D
P2.1
E 
W1 
W2 
W3 
W4 
Figure 3.3: Weight calculation for error paths.
other inputs. In addition, error permeability of a single input can be misleading,
an input may have critical eect on the output (high permeability), although the
probability that it is used in calculation of the output may be low (a branch that
is taken only 0.1% of the time). The impact of the risk and the probability of the
risk occurrence should be evaluated independently and the system administrator
should be able to assign coecients for these factors. In addition, the errors that
seem to disappear in error permeability paths may go unnoticed and eventually
have dramatic impacts in the program output. For instance, these errors may
cause invalid branches to be taken, input data corruption, and invalid memory
accesses.
The testing environment chosen for EPIC is in embedded (or circuitry) sys-
tems domain in which, input-output signals and their relations have more clear
cut denitions. However, in practice, a single output may not be present for an
29
input (an input may only be used for data oset, branch decision, etc.) or the
output may feed another output as an input.
EPIC uses the gathered proling information for simple executable assertions
(EA), which checks for an output signal to be in a pre-dened range and injects
these checks in software where an input has high error permeability on the output.
The input-outputs are considered in signal level in EPIC, therefore the data in
between input and output is simply ignored.
Another approach similar to ours, although has a very dierent context and
a very dierent fault model, supports the ideas presented in this thesis. The con-
text of the work is nancial calculations in stream processing environments where
large amount of data is constantly processed and a continuous output is produced.
Partial fault tolerance (PFT) is tested for stream processing applications in order
to understand if PFT is viable. Specically, PFT is partial selection of the data
for protection in order to achieve decent reliability with minimal cost and impact
on performance. Output quality for the nancial calculations is used as a key
metric for deciding which parts of the data to protect. Although PFT idea is
presented for data protection in stream processing applications, we can borrow
and apply the same idea for general reliability concerns. Applying PFT is not
reasonable without a clear understanding of the impact of faults on the quality of
the application output [43]. Similar to PFT, our SSFT idea also requires an out-
put quality assessment when selectively applying the reliability measures in the
running application. Our expectation, similar to PFT, is that SSFT will provide
better resource utilization, less power consumption and better performance.
Our design is dierent from EPIC in following aspects; we use a probabilistic
approach for error occurrence calculation, and estimate an expected probability
for an error to occur. Additionally, we extend the error impact idea, using again
the golden runs idea, and use dierent measures for error impact calculation. We
then combine the probability and error impact values, and with user dened coef-
cients calculate an expected error rate for dierent variables inside the software.
We do not reduce the inputs and outputs to signals; therefore our approach is
more generic and can be applied regardless of the running application or hard-
ware. We then use optimization techniques for best t protection that has the
30
minimum protection cost (performance, hardware requirements), while providing
a decent amount of protection. The next section will explain the details of the
approach taken in this work.
31
Chapter 4
Our Approach
4.1 Preliminaries
In this section, we will briey introduce the denitions required to understand
this thesis. Basic block is a fundamental term, which other notions are build on.
A basic block is a piece of code that does not contain any branches or jumps.
Branches or jumps provide transitions between blocks. Basic blocks have single
entry and exit points. The transitions between an entry point of one basic block,
to exit point of another basic block, which are called edges. In order to have a
deeper understanding of a software execution cycle, the compilers form a data
structure to present the pieces of software which is called Control Flow Graph
(CFG). To obtain this graph, the software is decomposed into basic blocks that
are connected by edges.
GCC, the compiler infrastructure used in this thesis, is a collection of compiler
front ends for languages C, C++, Objective-C, Fortran, Java, Ada, and Go etc.
GCC converts these high level code into GIMPLE statements as a middle-layer
form, which is a three-address representation which is formed in tuples of no
more than 3 operands (with some exceptions like function calls) [44]. In order
to imitate SEU, we inject errors on the parameters of the GIMPLE statements,
which we will refer to as statement parameter throughout the rest of this thesis.
32
The term program execution cycle referred in this thesis, is a process that be-
gins with execution of application binary and ends with generation of the output.
33
4.2 Overview
Selective Software Fault Tolerance, relies on the fact that the characteristics of
the running application and the underlying hardware have major impact on the
reliability of the whole system. Proling information can be put to use for ne
tuning reliability measures and eectively protecting the entire program. Our
main idea of software proling for reliability, is rather simple and intuitive. The
statements that are executed the most in percentage, and the parameters that
are used in these statements are most likely to be vulnerable to transient faults.
Additionally, the statements that cause the most dramatic eect in terms of
program execution cycle and program output quality are likely to require the most
protection when reliability is considered. These ideas when combined constitute
an eective approach for analyzing the program vulnerabilities.
The statements that are executed the most in percentage can be analyzed by
statistical analysis of executing program. The statistical information about the
number of executions of a single line of code can be obtained through compiler
directives. When we know the execution statistics for each satement, we can
simply acquire the percentage of execution (PE) of a statement in the program
execution cycle.
PEi =
Execution count for statement i
nP
i=1
(Execution count for each statement)
(4.1)
PE for a statement (or an instruction for assembly language) actually gives
the likelihood of a statement being executed in a standard program execution
cycle. The more a statement is executed, the more it is likely to be hit by
transient faults. With this information, one can extract the statements that are
most vulnerable and these statements can be considered as main points of failure
when SEUs are of concern. It should be noted that program execution depends
on the input data. Therefore, the execution statistics and PE for each statement
is likely to change for varying inputs. For simplicity, each test input is given the
same probability and the nal execution counts are taken as the average of the
34
execution counts for dierent inputs. However, this estimation can be tailored
using the likelihood of the input data; therefore a ner approximation for PE can
be achieved.
The statements that aect the program execution and output quality the
most can be analyzed through experimentation methods. For experimentation,
the SEU should be imitated for a single statement (even better to analyze a sin-
gle parameter for a line of code). After a bit ip is injected inside the program,
the program should be executed and the behavior of the application should be
observed. This observation will involve the quality of the program output, or
abnormal execution termination after the injection. The output quality will be
measured by comparing the results before and after error injection (which is also
called Golden Run Comparison). The quality measure is dierent for each and
every dierent program, therefore dierent measures need to be applied. Typical
measures we use for our approach and details will be discussed in section 4.4.
From these measures, one can infer the quality eect (QE) of a SEU for a given
parameter inside a statement. The QE will be measured as a percentage and
for simplicity the program hang-ups, memory segmentation errors etc. will be
considered as a 100%QE. For applications that do not need to generate output,
for each input data, this QE value can be modied accordingly. For instance,
for an electronic wheel steering application, an output is generated in every few
microseconds. Although, the application is safety critical, it will work uninter-
rupted, in case of a single failure in output.
When considering QE for a transient fault, there are dierent observations
and aspects that need to be considered. A critical observation here is that,
some statements may not require reliability precautionary measures since they
may contain irrelevant information, dead code or code only needed for debugging
purposes. These statements do not have a QE, therefore may be ignored in
terms of reliability. Similarly, some parts of the program may only be for testing
purposes or contain test data. Those statements, should also be ignored when
QE experimentation takes place. Our technique allows user to selectively inject
errors, so that irrelevant functions are ignored.
35
The output quality is subjective, since dierent applications produce dierent
kinds of outputs which may or may not tolerate soft errors. This subjectivity
requires a human observation for output quality, since quantitative analysis may
not be sucient. For instance, consider a lossless compression algorithm. Any
error in the output is intolerable, since the output is useless when any byte errors
occur. Conversely, a lossy compression algorithm used in JPEG images or media
compressions will be able to tolerate some margin of byte errors, therefore, is
more exible in terms of soft errors. Similarly, applications that contain calcu-
lations where imprecise or partially correct results are acceptable are called soft
computing applications. Especially, these types of applications, where soft com-
puting is used as a computation convention, are by nature, more resilient to soft
errors. Therefore, a margin of error is presumed to be acceptable which makes
these applications perfect for Selective SFT optimizations.
In addition to QE and PE parameters, other parameters should also be con-
sidered when estimating the impact of transient faults in the program execution.
For example, at the hardware level, process technology will have an important
eect. According to recent research [2], dimension scaling will cause 8% increase
in soft-error rate with each generation. The process technology, therefore, should
also be considered as an input when vulnerabilities to transient faults are being
measured. While shrinking process technology, has its own advantages in power
consumption, heat dissipation, it also makes the entire system more vulnerable
to transient faults.
By using the QE and PE parameters, one can simply evaluate the expected
rate of error for each statement parameter inside the program. Expected rate of
error (ER) can be calculated by the following equation.
ER = F (PE) F (QE) (4.2)
Expected rate of error is, the error expectation percentage in the program out-
put in case of a transient fault occurrence inside a statement argument. Statement
parameter or statement argument term shall be used throughout the rest of the
thesis referring to the variables inside a statement to which error injections can
be made (i.e. transient faults can alter the value of the parameter causing errors).
36
The F (PE) and F (QE) are functions that can be used to adjust the error rate
factors' impacts. For simpler cases, where the parameters are to be used as is, F
function can be taken as a unit function. For instance, in some cases, the percent-
age of execution should be adjusted in order to increase the signicance of this
parameter meaning that the error occurrence probability is an important factor
for calculating the expected error rate. Consider that the QE eect is required to
be empowered for very low error rates and weakened for values larger than 50%,
then a model function F(QE) can be constructed as F (QE) =

QE+25
QE+100

 100.
For a dierent scenario where QE is required to be empowered for larger QE
values, than F (QE) may be constructed as F (QE) =

QE30
110 QE

. When a simpler
logic, like diminishing the overall error values is required following expression
can be used F (QE) = m  logn (QE). Figure 4.1 shows the eects of applying
dierent functions on the QE values.
31,8 
37,5 
42,3 
46,4 
50,0 
53,1 
55,9 
58,3 
60,5 
62,5 
4,0 
8,9 
15,0 
22,9 
33,3 
48,0 
70,0 
10,0 
13,0 
14,7 16,0 
16,9 17,7 18,4 
19,0 19,5 19,9 
0
10
20
30
40
50
60
70
80
90
100
10 20 30 40 50 60 70 80 90 100
(QE+25)x100/(QE+100)
(QE*30)/(110-QE)
5xlogn (QE) for N=2
Figure 4.1: Example functions to adjust parameters.
37
In order to obtain the expected rate of error for the entire program, the ER
values for each individual statement argument should be accumulated. Formula
4.3 shows how PER (Program Expected Error Rate) value is calculated.
PER = AER
X
(ER) (4.3)
In formula 4.3, AER is the Architectural Error Rate which is a physical mea-
sure of the rate of failures (or the percentage of transient fault event occurrence).
The general convention in literature is to use MTBF (Mean time between fail-
ures), instead of failure rates since the MTBF value is a positive value and a
large value compared to the failure rates. For instance, consider a machine that
is expected to fail in every 5000 hours, the MTBF is considered as 5000 hours,
which is easier to use than an error rate of 0.0002 failures/hour. Failure rates are
generally considered with a factor of time, rather than number of instructions ex-
ecuted. We can calculate AER from a given failure rate as follows. For instance,
if a CPU has a rate of 1 failures/second and CPU can execute 109 instructions
per second, we achieve an AER value of 10-9.
This AER parameter can be aected by the density of the transistors used in
manufacturing the CPU unit, CPU clock frequency, and environmental radiation.
For instance, a CPU that is manufactured using denser transistor technology
is more susceptible to transient faults then a CPU that is manufactured using
sparser transistor technology, i.e, a 22 nm CPU is more likely to be aected by
radiation than a 90 nm manufactured CPU. In addition AER factor can also be
aected by environmental radiation. Normand states that, in avionics the SEU
rates are correlated to neutron ux, which is also correlated to the altitude [45].
For instance, an application running in a plane ying at 10000 ft is less likely
to be hit by neutron radiation than an application running in a plane ying at
60000 ft.
Since, PER is a measure of the rate of error that a program is expected to
produce against transient faults; we use PER as a reliability measure for the pro-
gram and try to optimize our reliability measures while keeping PER parameter
in an acceptable range. The acceptable PER value dened for the application,
will be referred to as PER throughout the rest of this thesis. PER value can be
38
adjusted for the reliability requirements of dierent applications. For the appli-
cations that do not tolerate any errors, the PER value is zero meaning that the
program is fully protected and is not produce any errors in case of a transient
fault.
For our optimizations, our method is to leave out as much statement parame-
ters unprotected as possible, while preserving PER value of the application. The
statement parameters that are protected requires, some sort of reliability mea-
sures to be taken like instruction duplication, parameter duplication and similar
which result in performance overhead, increased hardware requirements, and more
power consumption. When we reduce the number of statement parameters that
are protected, we end up with a system that has better resource utilization, less
power consumption and better performance. In order to choose the statement
parameters that have less critical impacts on PER, we can sort the statement
parameters in terms of their ER values. For instance, a statement parameter
with an ER value of 0 can safely be omitted without aecting the PER value of
the running application, meaning a transient fault that changes the value of this
statement parameter does not have any impact in the produced software output.
Our approach works as follows; we rst accept each and every statement
parameter to be protected as default by some form of reliability measure (redun-
dancy, memory protection, etc.), therefore the PER value is zero at the beginning.
The statement parameters are sorted in ascending order according to ER values.
Then, we remove protection from each statement parameter one by one. PER
value is re-calculated according to the ER values of the new set of parameters.
When we reach PER value, we can no longer optimize since the PER value will
not be acceptable according to reliability requirements of the application. When
PER value is zero, we can only take out the statement parameters that have zero
ER values. These parameters do not require any protection and can be omitted
when reliability measures are taken.
We call this optimization technique selective SFT (SSFT), meaning selectively
and carefully choosing the parameters that require Software Fault Tolerance and
39
leaving the rest of the parameters unprotected. By the use of aforementioned op-
timization, the underlying hardware and the specic output quality requirements
of each application is taken into account. This way, the amount of redundancy
is minimized even though some statement parameters inside the program are
left unprotected. This optimization provides a reliability level tailored for each
application, which in turn, increases the performance, decreases the hardware
requirements (therefore reduces cost), decreases power consumption, and reduces
the overall redundancies introduced to the system (therefore reduces the vulner-
abilities).
As discussed in chapter 2, our technique can be used in conjunction with
any redundancy technique. It can be used to reduce the number of duplications
in, those techniques that use instruction duplication such as EDDI or SWIFT.
Similarly, it can be used for software and even hardware ECC techniques, to selec-
tively choose the statement parameters to protect. For multithreading techniques
such as CRTR or SMT, our technique can be applied for safely releasing the re-
sources seized by the trailing thread, reducing the slack between main and trailing
threads, therefore improving the overall performance while reducing power con-
sumption and resource requirements.
As expected, our technique will provide better optimization when PER value
is above zero, meaning that a margin of error is acceptable for the application.
The next section will discuss the details of our system architecture.
40
4.3 Selective SFT
Selective SFT (SSFT) technique uses static and dynamic analysis for proling
the application. As explained in section 4.1, compilers form Control Flow Graph
(CFG) to understand application characteristics better. CFG is a static form of
analysis, which contains all possible paths that an application can traverse during
execution. In addition to static information captured in the CFG, statistical
data recorded during program execution will also give a better insight about
the application. Statistical data is recorded during program execution and is
a dynamic analysis instrument. As such, SSFT records the damage in output
(which is also another dynamic analysis instrument) for a statement parameter
in case of a transient fault occurrence. Figure 4.2 shows a partial CFG for an
B1 
B3 
B4 B5 B2 
B6 
P1,2 
P1,3 
P2,6 
P3,4 
P3,5 
P5,6 
E1 
E2 
E3 
E4 
E5 
E6 
L3 
Figure 4.2: An example CFG graph.
application. Let the probability of branch (shown as directed edges) that connects
41
basic block Bi to Bj be Pi;j. For instance, the probability of the branch that
connects B1 to B2 is shown as P1;2 in Figure 4.2. Also, let the set of predecessors
for a basic block be PREi. For instance, the set predecessors for basic block B6
(or PRE6) contains two basic blocks, namely B2 and B5. The cyclic edges in
the CFG shown in Figure 4.2, shows the loops in the application. Let Li be the
loop count for basic block Bi, i.e, for example B3, will have a loop count equal to
L3. B3 will be executed L3 + 1 times, since loop control block is executed once
more before the loop exits. Finally, let each basic block Bi be executed Ei times.
The entry block Bi in Figure 4.2, will be executed once. Equation 4.4 shows how
execution time of basic block Bj is calculated.
Ej =
X
Bi2PREj
(Ei  Pi;j) (Lj + 1) (4.4)
For example, for basic block B3, B1 is the only predecessor in PRE3, therefore
E3 = E1  P1:3  (L3 + 1). The value L3 is dierent than 0 since, B3 has an
incoming cyclic edge meaning that B3 is loop control block. For basic block
B6, PRE6 contains the basic blocks B2 and B5, and L6 is zero since the basic
block does not have an incoming cyclic edge. The execution count of B6 is,
E6 = (E2  P2:6) + (E5  P5:6). The branch probabilities P2:6 and P5:6 is 1 since
these branches are unconditional, hence E6 = E2 + E5. We use these Ei values
for calculating the vulnerability of the statements inside the basic blocks. The
statements that have high percentage of execution in the program execution cycle,
are likely to be aected by transient faults. Equation 4.5 shows how percentage
of execution for a statement in a basic block can be calculated.
PEi =
EiPn
i=1 Ei
(4.5)
Let parameter k of statement j in Bi is exposed to SEU be expressed as Pi;j;k.
Let the output produced by the application, when parameter k is aected by a
transient fault be Oi;j;k. Let the output produced by the application without any
errors be Ogolden. We use a "COMPARE" function to evaluate the quality of
output Oi;j;k, according to the requirements of the application. The result of the
evaluation is stored in parameter QEi;j;k. Equation 4.6 shows how the damage
caused by SEU in Pi;j;k is estimated.
QEi;j;k = COMPARE(Ogolden, Oi;j;k) (4.6)
42
Let the probability of SEU for an application be PSEU and probability of state-
ment i to have SEU be PSEUi . Equation 4.7 shows the probability calculation for
PSEUi .
P SEUi = PEi  P SEU (4.7)
The error expectation for parameter k of statement j in basic block Bi, depends
on the probability of SEU occurrence for statement j and the damage caused
by statement parameter k when SEU occurs. Note that, error expectancy also
contains the probability of SEU for the application. The estimation of error
expectation will then be:
ERi;j;k = P
SEU
i QEi;j;k: (4.8)
The error expectancy of an application (or PER value in our previous expression)
is an accumulation of the error expectancies of the statement parameters. SSFT
orders the parameters according to ERi;j;k values and removes as much parameters
as possible, without violating the error margin of the application. The reliability
of the application is not impaired while improving the performance, reducing the
hardware requirements and power consumption.
43
Application Source 
Code 
Error injection 
module(IMD) 
Parameter detector 
module (PMD) 
Input 
Output 
Parameters 
Error injected 
software executable 
QE estimator module 
(QMD) 
Output 
ER estimator module 
(EMD) 
QE values 
Statistics (PE Values) 
Optimizer module 
(OMD) 
ER values 
Figure 4.3: SSFT system architecture.
4.4 Implementation Details
Our overall system architecture is shown in Figure 4.3. As can be seen from Fig-
ure 4.3, "Parameter Detection Module" (PMD) takes "Application Source Code"
as input, and detects all the integer and oating point parameters inside the
source code to which error injections can be made. "Error Injection Module"
(EMD) takes the statement parameters and the "Application Source Code" as
input. EMD then, injects an error to a certain parameter in a statement and
compiles the application code to produce "Error Injected Application Binary".
The "Error Injected Application Binary" then, is executed to produce output
and statistics data. "QE Estimator Module" (QMD) uses the output produced
by "Error Injected Application Binary" to calculate the QE value for the er-
ror injected statement parameter. EMD takes QE and PE values as input and
estimates the ER values. When all the statement parameters detected by our
PMD are injected with errors and ER values are calculated, "Optimizer Module"
(OMD) can sort the statement parameters according to their ER values, and
44
optimizes until the user dened PER value is reached.
In order to estimate ER (expected rate of error) values for statement param-
eters of the running application, we have to perform error injections in statement
parameters. Existing error injection tools like NFTAPE, GOOFI, PROPANE and
SWIF-IT was not useful for our purpose, since they are either architecture spe-
cic or has limited error injection capabilities. Additionally, some of the studies
similar to ours use "GNU Debugger" tool, which has error injection capabilities
on the running application. The problem with this tool is that it does not oer
any automatic capabilities whereas our approach require thousands of dierent
errors to be injected automatically, in addition to compiling the error injected
code and recording the output. Moreover, our experiments require a tool that
has high level language error injection capabilities, can discriminate and record
dierent types of error injections and is exible. For error injection, we preferred
a high level approach since it will result in fewer test runs and will be sucient for
our purposes. A detailed low-level error injection tool would have injected errors
in dierent register values or memory parameters which eventually will impact
the value of a single parameter inside a high level statement.
In order to optimize the statement parameters, we applied two algorithms.
Algorithm 1 shows the steps of ER estimation for each statement parameter.
Algorithm 2 uses the data obtained from Algorithm 1, to selectively pick the
statement parameters that can be optimized (i.e. can be omitted with a fault
tolerance approach).
45
Algorithm 1 Error Expectancy Estimator(ER-set).
Input: P-set, set of statement parameters
Output: ER-set, set of ER values of statement parameters
procedure Parameter-Scanner(P-set; I-set;F-set)
I-set ;
F-set ;
for all p 2 P-set do
if p:type = integer then
I-set I-set + p
else if p:type = float then
F-set F-set + p
end if
end for
end procedure
procedure Estimate-PE(execution count)
return execution countpnP
i=1
execution counti
end procedure
procedure Estimate-ER(statisticsp)
return ERp  F(QEp) F(PEp)
end procedure
procedure ER-Estimator(I-set;F-set)
for all p 2 I-set [ F-set do
p.value p.value XOR (1 Rand(32))
executablep  Compile(P-set)
outputp  Execute(executablep)
execution count Statistics(executablep)
QEp  Error-Compare(outputp, outputgolden)
PEp  Estimate-PE(execution count)
ERp  Estimate-ER(QEp, PEp)
ER-set ER-set+ERp
end for
end procedure
46
Algorithm 2 Optimizer.
Input: ER-set, set of ER values for statement parameters
Output: R-set set of parameters that can be removed, PER value
Require: Initial PER value is 0.
procedure Optimizer(ER-set; acceptable PER)
SortAscending(ER-Set)
R-set ;
for all p 2 ER-set do
calculated PER  Calculate PER(R-set + p)
if calculated PER  acceptable PER then
R-set R-set + p
else
PER  Calculate PER(R-set)
end if
end for
end procedure
For our error injection purposes, we use the GCC compiler infrastructure.
GCC uses dierent passes in the compilation process which include parsing the
code, converting high level code to GIMPLE statements, compiler-level opti-
mizations, code elimination and many other passes. GIMPLE is a three-address
representation which is formed in tuples of no more than 3 operands (with some
exceptions like function calls) [44]. Our error injection is implemented as a GCC
compilation pass, and is capable of detecting the parameters inside GIMPLE
statements, and injecting errors on these parameters in the form of SEU (Bit
ips on data). A single high level language statement may correspond to mul-
tiple statements in GIMPLE since auxiliary variables are needed when breaking
the code into 3 operand form. Figure 4.4 shows the conversion of a statement
into GIMPLE.
47
i.0 = (unsigned int) i; 
D.2340 = i.0 * 32; 
D.2341 = block + D.2340; 
D.2342 = *D.2341[k]; 
D.2343 = (float) D.2342; 
D.2344 = cos2[k][j]; 
D.2345 = D.2343 * D.2344; 
sum = D.2345 + sum; 
sum += block[i][k] * cos2[k][j]; 
High Level Code 
GIMPLE Code 
Figure 4.4: GIMPLE code example.
Our error injection tool inside GCC has three major capabilities. First, we
are able to record the integer and oating point number variable parameters for
each function implemented in the running application. This will provide the base
data for our error injection scripts, which will simply inject the error to a specic
parameter inside a specic function, compile and run the software. The detection
and recording of the statement parameters is presented as "Parameter Detector
Module" in Figure 4.3.
In order to use these error injection parameters and compile and run the error
injected software, we use scripts and modify the GIMPLE statements. Speci-
cally, we notify the compiler to inject a single bit ip error into a specic operand
of a GIMPLE statement during compilation. After the compilation is done, the
error injected application executable is built and the results of error injection is
recorded for analysis. Figure 4.5 shows the compilation process for our imple-
mentation.
48
Error injector 
Error injection 
parameters 
Modified GCC 
Compile source 
code with 
injected error 
Error injected 
application executable 
Error injection details 
(Line of code, Basic 
block, etc.) 
Figure 4.5: Modied GCC compiles the application source code to produce error
injected application executable.
After the compilation of modied executable, we run the executable and record
the output with a specic error signature (the name of the function and the index
of the integer or oat parameter can be used for such purpose). The output
recording requires some sort of book keeping mechanism so that the outputs
can be easily traced back to error injections. For instance, for a program that
generates a le as an output, the output name may comply the error injection
details so that the output can be associated to the error injection. Similarly, for
a program that generates a result as an output, the program output should be
written to a text le including the error injection details. This recording, in some
cases, requires modication to the application source code which is one of the
obstacles in our implementation. Figure 4.6 shows how we implement the overall
test procedure.
49
Error injected 
application executable 
Coverage statistics 
Original application 
executable 
Golden output 
Output with error 
signature 
The execution counts for 
each statement 
Figure 4.6: Test runs produce output and coverage statistics.
For estimating the error rate, golden run comparisons are performed. Appli-
cation is rst executed without any error injections which is called the "Golden
Run". Then, the error injected copies of the same application is executed. The
output generated by the "Golden Run" is used as a yardstick to measure how
the error injected software output strays from the "Golden" output value. In
our implementation, we compare the results of the golden output and the error
injected outputs, and record the results in a le. For error percentage calcula-
tions, dierent criteria is used for each program. These criteria may actually be
subjective and require user provided coecients in order for the error percentages
to be more meaningful. For instance, a JPEG conversion or ltering application
may be more tolerable to errors in the output; however a lossless compression
application, like ZIP compression, can be intolerable to any sorts of error. Sim-
ilarly, a stray in the PI calculation application may have a more dramatic eect
in the output, while a statistical analysis tool can be more tolerable to errors in
the output.
For simplicity, we have chosen criteria for error percentage calculations as
50
Evaluate QE for a certain 
error criteria
Output with error 
signature QE Estimator
QE values with error 
signature
Golden output
Figure 4.7: QE is estimated by golden run comparisons.
simple as possible. For applications that generate les as outputs, we have used
the error criteria as the amount of data that does not match the Golden output.
For applications which generate ranking information as output, like in the case
of a sorting output, we use the ratio of the number of out of order elements to
the number of all elements in the list. In case of single result output applications,
like mean value, standard deviation, or square root calculation, we measured the
error rates by simply measuring the percentage of stray in nal results. The error
rates are captured and stored in the form of percentages and this information
is used for optimizations and expected error rate calculations along with other
statistical information. Figure 4.7 shows how error rate calculations are handled.
As explained before, our technique adds a statistics parameter during com-
pilation in order to record the execution statistics data. The execution statistic
we collect is called Coverage Statistics, which is generated by "GCOV" tool pro-
vided by GCC infrastructure. When the compiled software is executed, GCOV
interferes and records program execution statistics like the percentage of branch
decisions, the number of executions for each line of code, etc. The recorded data
can then be visualized when necessary. Since our primary interest is in the vul-
nerabilities against transient errors for each statement, we use GCOV to generate
the source code containing the number of executions of statement. These execu-
tion counts provide valuable information when identifying the sensitivity of each
statement. The reliability measures for these code pieces should be stricter for
keeping reliability at acceptable levels. Figure 4.8 shows how this data is gen-
erated. These results can be processed and used for our software proling and
optimization purposes for better tailored software reliability.
51
QE values with error 
signature
Error injection details 
(Line of code, basic 
block, etc.)
ER estimator ER estimations
Coverage statistics
Figure 4.8: ER estimations are produced using QE values, coverage statistics (PE
values) and error injection details.
The collected results (ER estimations) can be used for dierent optimizations,
i.e. same data can be processed dierently according to dierent PER values.
As discussed in chapter 4.2, PER value is the acceptable PER value for an
application. When considering an application that cannot tolerate any errors,
that is PER value is 0, the optimizations can only take place on the statement
parameters that do not cause any forms of errors in the program output in order
to yield a 0 PER value. This allows us to provide dierent levels of optimizations
for dierent reliability requirements.
We modied the GCC compiler and added our error injection pass into the
GCC compiler framework in order to implement error injection in the system.
This pass has been implemented using C programming language, where we in-
struct the GCC compiler to inject the errors. We have chosen benchmarks that
are coded in C programming language with minimum number of external libraries.
We combine the data generated from dierent sources (our own, GCC, GCOV,
etc.) using JAVA programming language.
Due to limited capabilities in GCC, error injections for pointer or array data
are missing from our implementation. Section 4.5 will discuss over an example
how we used the control ow graph and statistical information in order to obtain
the PE parameter.
52
4.5 CFG Based Vulnerability Estimation Exam-
ple
As explained in section 4.1, CFG represents the execution properties of the ap-
plication in basic blocks and edges. For example, Figure 4.9 shows the CFG of
the DCT function from the Compress benchmark.
Figure 4.9: CFG for DCT function of Compress benchmark.
The execution statistics in Figure 4.9 were obtained using GCOV, whereas
CFG itself is obtained using GCC compiler. During the execution of the bench-
mark, DCT function is called 256 times, which is the reason for the entry basic
block, B2, to be executed 256 times. Therefore, each instruction in B2 is also
executed 256 times.
Figure 4.10 shows the code fragment corresponding to the basic blocks B2
through B10 of Figure 4.9. Note that, these basic blocks are on the right branch
of the given CFG.
The execution statistics in the CFG are in accordance with branch probabili-
ties. For instance, basic block B10, which contains the necessary GIMPLE code
53
Figure 4.10: Code fragment for DCT.
for the rst "FOR" loop (for loop with index i), is executed 2304 times. The
GIMPLE code for this loop is shown in Figure 4.11.
Figure 4.11: GIMPLE Code for basic block 10.
In this example, "B" is a symbolic name used for constant value 8. Basic
block B10 is executed 9 times making the total execution count 256 9 = 2304,
which is compatible with the expected statistics. Similarly basic blocks B3 and
B8 are executed 2304 8
9
9 = 18432 times which is the execution count recorded
by GCOV. Additionally, in GCOV, the branch probabilities are recorded as %89
for B3 and %11 for B11, respectively.
Figure 4.12: GIMPLE Code for basic block 2.
Our approach rst compiles the "Compress" benchmark and injects the errors
for each statement parameter of each function in "Compress". GCC generated
54
GIMPLE code for basic block B2 is shown in Figure 4.12. In order to inject an
error to basic block B2, we instruct compiler with the necessary parameters. We
explicitly indicate whether oating point parameters will also have error injected
to them. After the error injection, the basic block takes the form shown in
Figure 4.13
Figure 4.13: GIMPLE Code for Basic Block 2 after error injection.
The injection in Figure 4.13 causes the loop variable "i" to have value 8 in-
stead of 0. When the loop variable "i" takes value 8 instead of 0, the loop is not
executed therefore no calculations are made and the sum variable now contains
an undened value. According to our measurements, this error injection causes
100% error in the output for the "Compress" benchmark (i.e., this parameter has
a QE value of 100%). This indicates that, the value of this variable is extremely
important and the software does not tolerate any errors in this statement param-
eter hence this parameter should be protected using some form of fault tolerance.
However, PE (the percentage of execution) is also an important factor and should
also be considered when ER value for this statement parameter is calculated.
According to our measurements, the total number of statements executed in the
entire "Compress" benchmark is "727360". Therefore, PEB10 =
2304
727360
= 0:3%
meaning that, this particular parameter takes 0.316% percent of the program
execution cycle.
Similarly, ER = 100  0:00316 = 0; 316, assuming that adjusting functions
for PE and QE are both unit functions. The ER parameter is used in calculating
PER (Program Expected Error Rate) and eventually is used in our optimizations.
For optimizations, the parameters that have lower ER values can easily be ignored
when fault tolerance techniques are applied.
As another example, the basic block B6 has an execution count of 147456.
When we instruct the compiler to inject an error in this basic block, basic block
55
takes the form in Figure 4.15 instead of the one in Figure 4.14.
Figure 4.14: GIMPLE Code for basic block 6.
Figure 4.15: GIMPLE Code for Basic Block 6 after error injection.
According to our measurements, this injected error, causes an error rate of
78.68% in the output. Based on this error rate, we can assume that an injection
to this statement parameter causes heavy damage in the output, hence should be
handled with care when using a fault tolerance technique to increase reliability.
The corresponding PE will be equal to 147456
727360
= 20%, meaning that execution of
this inner loop block, corresponds to 20% percent of the execution cycle. When
calculated without any adjustment ER will have 78:68  0:2027 = 15:95 error
expectancy. This indicates that this statement parameter is far more important
than other statement parameters when reliability is considered, since it consti-
tutes a larger portion of the software execution cycle. However, 100% percent
error or a crash in program is undesirable; therefore the error rate needs to be
improved by adjusting the functions accordingly.
PER value is calculated as the sum of all error expectancies from all statement
parameters. Table 4.16 shows a portion of the error rates, execution counts, and
the basic block that statement parameter belongs to, sorted according to the ER
values.
56
Injected Statement Block Error Rate Execution Count Total Execution ER Value
1 14,-7,dct 15 0 147456 727360 0
2 16,-5,dct 16 0 16384 727360 0
3 23,2,dct 16 0 16384 727360 0
4 6,-15,dct 7 0 18432 727360 0
5 7,-14,dct 8 0 18432 727360 0
6 9,-12,dct 10 0 2304 727360 0
7 20,-1,dct 22 81,00775 2304 727360 0,25660176
8 19,-2,dct 21 83,33333 2304 727360 0,263968313
9 10,-11,dct 11 85,27132 2304 727360 0,270107129
10 8,-13,dct 9 86,43411 2304 727360 0,273790406
11 0,-21,dct 2 100 2304 727360 0,316761989
12 24,3,dct 17 83,72093 16384 727360 1,88583881
13 15,-6,dct 16 85,27132 16384 727360 1,920761806
14 21,0,dct 4 88,37209 16384 727360 1,990607571
Figure 4.16: Table for Error Rates and ER values for statement parameters.
From the table 4.16, we selectively apply SFT and obtain a new PER value.
As a starting point, we assume that all of the statement parameters are protected
through some SFT and the PER value is initially equal to 0 for the application.
Then from the set of parameters (sorted similar to table 4.16), we choose the
parameters that are to be left out and not protected by the SFT. For instance, if
we do not protect parameters with an ER value of 0 (lines 2 to 7), the PER value
for the application does not change since these parameters are proven to be not
causing any errors in the application in case of a SEU event. The optimization
of the rest of the parameters is infeasible if the user set the PER value as 0.
However, for soft computing applications and applications that are more tol-
erable to errors, the rest of the parameters can be added to the optimization
list until the specied PER value is reached. For example, the parameter in
basic block 22 (line 7) has an ER equal to 81:0075  2304
727360
= 0:256. When
this parameter is considered alone, overall application PER will be equal to
ER  AER = 0:256  10 7 = 2:56  10 9. If the acceptable value for PER
is too low (for instance, 10 9 could be picked as an acceptable error expectancy
57
for the program) then this parameter cannot be removed from the list of parame-
ters that will exercise SFT techniques since, the PER value exceeds the acceptable
PER value when this parameter is left unprotected. However, if the application
is more tolerant to faults, a larger PER value can be used (for instance,10 8 for
the previous example). This way, other statement parameters can be added to
the list of unprotected parameters. As can be seen in table 4.16, parameter in
line 8 has an ER value equal to 83:33  2304
727360
 10 7 = 2:64  10 9, whereas
parameter in line 9 has an ER value of 85:27 2304
727360
 10 7 = 2:7 10 9. When
we do not protect these two parameters, the PER value will be equal to 5:210 9
and 7:9 10 9, respectively. Therefore, both parameters can be included in the
unprotected parameters.
However, when parameter in line 10 is considered, corresponding ER and PER
values are equal to 86:43 2304
727360
 10 7 = 3:16 10 9 and 11:06 10 9, respec-
tively. This PER does not allow leaving this parameter unprotected. Similar to
this example, we selectively add as many parameters as possible to the list of
unprotected parameters according to PER value.
The above values are for illustration purposes and are not used for our nal
results, since it is only applied to a single function in the "compress" benchmark to
better show our approach. Chapter 5 will continue with experimental evaluation.
58
Chapter 5
EXPERIMENTAL
EVALUATION
5.1 Benchmarks and Setup
We implemented our selective software fault tolerance technique in a modied
version of GCC, where we introduce error injection capabilities. We used ten
benchmarks to test the eectiveness of our scheme. Table gives these benchmarks
and their salient features. Note that, these are all implemented in C since we are
operating on a modied GCC environment.
Number of Code Number of
Benchmark Source Basic Size (Bytes) Execution
Blocks Cycles
bs Generic 23 4580 32
compress UTDSP 70 7355 727360
edge detect UTDSP 49 8365 1466059
fft 1024 UTDSP 24 5151 85042
jpeg UTDSP 940 93460 15629
lpc UTDSP 116 13342 19155
mpeg2enc MediaBench 2646 220086 28149452
qsort Generic 51 4926 471
sqrt Generic 28 4149 63
st Generic 54 4284 30051
Table 5.1: The characteristics of the benchmark codes used in this study.
59
The third column gives the number of basic blocks in the CFGs of each bench-
mark. The fourth column gives the maximummemory space occupied by the basic
blocks of the applications (at runtime). Note that, if no memory space optimiza-
tion is performed, this is the "memory occupancy" (or "memory consumption")
of the application. Our objective is to reduce the memory occupancy over the
course of execution by exploiting the lifetimes of basic blocks. The last column
of Table 5.1 gives the number of execution cycles for each benchmark, when no
reliability optimization is used.
We have mainly used UTDSP [46] as our benchmark suite since it ts well with
our approach. Scientic benchmarks we used from UTDSP are compress, edge de-
tect, FFT, LPC and JPEG. Compress benchmark uses discrete cosine transform
to compress a 128 x 128 pixel image by a factor of 4:1 while preserving its infor-
mation content. Edge detect benchmark, detects the edges in a 256 gray-level 128
x 128 pixel image relying on a 2D-convolution routine to convolve the image with
kernels (sobel operators) that expose horizontal and vertical edge information.
FFT is 1024-point complex fast fourier transformation benchmark (radix-2, in-
place, decimation-in-time). LPC benchmark is a linear predictive coding (LPC)
encoder benchmark. JPEG is a JPEG ltering program, which lters images. In
addition to these UTDSP benchmarks, we have also added some commonly used
applications like binary search, quick sort, square root calculation, and statis-
tics. Statistics application includes sum, mean, variance, and standard deviation
calculations. Additionally, we used MPEG-2 encoding benchmark (from Media-
Bench [47]), which does conversion of uncompressed video frames into MPEG-1
and MPEG-2 video coded bit stream sequences.
60
5.2 Results
Our experiments and optimizations are aected by ER values for individual state-
ment parameters. If a parameter in a statement, does not cause any errors in the
output (ER value is 0), then the instruction can be eliminated since it does not
require any safety precautions against transient errors.
Aside from the ER values, for individual statement parameters, our experi-
ments and improvements are aected by architectural factors. In order to rep-
resent architectural factors, we dened the architectural error rate parameter
(AER). AER parameter is directly related to how the CPU is manufactured
(higher density transistors will end up with higher transient faults) and the envi-
ronmental factors (the outside radiation levels, clock rates, how close the circuits
are designed to one another). The AER factor is expected to grow exponentially
as the nanometric scaling continues [1]. For our base implementation, we take
AER factor value as 10-7. AER factor does not aect our implementation results,
since only PER value is aected by this parameter. The acceptable PER value
can be adjusted accordingly for dierent AER values, hence we see similar results
with various AER values.
61
02
4
6
8
10
12
14
16
18
20
22
24
26
28
1,25 
12,18 
14,72 
26,04 
6,96 
1,16 
7,42 
3,05 
0,49 
1,92 
Figure 5.1: PER values for our benchmarks (10 6).
As explained before, program expected error rate (PER) depends on ER values
and AER parameter. More specically, PER value indicates the error expectation
in the output of an application without applying any protection. For example,
t 1024 benchmark has an error expectation of 2:6 10 5, which is obtained by
using the transient fault and the quality impact of the error when occurred. If
there are no errors, that is the application is protected, PER value will be equal
to 0. Figure 5.1 shows the PER values for our benchmarks.
As can be seen from this gure, applications have a wide spectrum of PER
values ranging from 4:9 10 7 (sqrt) to 2:6 10 5 (t 1024). This wide range of
values are mainly due to the fact that PE values for each statement parameter
have a dierent value since they are calculated using execution counts. Specic
execution counts for our benchmarks are shown in Figure 5.2.
62
05
10
15
20
25
30
0,00003 
0,72736 
1,46606 
0,08504 0,01563 0,01916 
28,14945 
0,00047 0,00006 0,03005 
M
il
li
o
n
s
 
Figure 5.2: Execution counts for our benchmarks (106).
As indicated, PER values are greatly aected by the execution count. How-
ever, execution counts for an application are not directly correlated with error
expectations. For example, when a program has many loops or lots of lines of
code that take much longer to execute, it does not mean that, that program is
more prone to transient faults. Instead, if frequently executed code blocks have
major impact in the output, the application would have greater error expectancy.
As explained in Chapter 4, our goal is to reduce the number of statement
parameters that require reliability precautions. For this purpose we set a PER
value for the application (discussed in section 4.2) in order to pick the statement
parameters that can be eliminated. For instance, a 0 value for PER means that,
the software can not tolerate any errors and only the statement parameters that
has 0 ER value can be safely ignored when reliability precautions are applied.
Figure 5.3 shows the amount of improvements with our approach when the
application is executed without any errors, that is with a PER value of 0.
According to these results, improvements brought by our approach is 40.85%
63
0%
10%
20%
30%
40%
50%
60%
70%
80%
90%
Figure 5.3: Improvements with our approach when the application is executed
without any errors, that is PER = 0.
on the average. This indicates that 40.85% of these application code segments can
be optimized without any compromises in terms of reliability. The improvements
range from 8.5% to 81.75% which shows that our optimizations mostly rely on
the characteristics of the application. In the next set of experiments, we use
our approach with two dierent software fault tolerance techniques, EDDI [11]
and SWIFT [4]. We compare the execution time, program size, and instruction
count improvements, when EDDI and SWIFT are seperately optimized using
our framework. Note that, the fault tolerance technique used is orthogonal to
our approach and can be chosen independently. In Figure 5.4, benchmark
execution times are compared to the baseline SWIFT and EDDI execution times.
Our implementation provides improvements between 2.5% and 23.8%, and has
a mean value of 12%, when applied over the SWIFT technique. The reduction
in the execution times are not as impressive as the previous results. First of all,
SWIFT is already optimized, and it can benet from the advantages of Instruction
Level Parallelism (ILP).
64
0%
10%
20%
30%
40%
50%
60%
70%
80%
90%
100%
SWIFT
EDDI
Figure 5.4: Normalized execution times compared to SWIFT and EDDI ap-
proaches without applying our technique.
On the other hand, our normalized execution times with respect to base EDDI
implementation are between 3.3% and 31.3%, and has a mean value of 15.7%.
Next, we give the improvements brought by our approach in the program size.
The code size reductions we achieve are shown in Figure 5.5. As can be seen from
Figure 5.5, our approach reduces the code size in SWIFT between 5% and 48%,
has a mean value of 23.9%. Similarly, for EDDI, our reductions in code size vary
5.6% to 53% has a mean value of 26.5%.
65
0%
10%
20%
30%
40%
50%
60%
70%
80%
90%
100%
SWIFT
EDDI
Figure 5.5: Program binary size reductions compared to SWIFT and EDDI ap-
proaches without applying SSFT.
Similar to code size, we see considerable reductions in the instruction counts.
Specically, the average instruction count reductions are 22.5% and 25.8% for
SWIFT and EDDI, respectively. As can be seen from Figure 5.6, reductions range
from 4.7% and 45%, and 5.4% and 52% for SWIFT and EDDI, respectively.
While not presented here, we expect to see a much better improvement in
power consumption, since, although ILP techniques improve execution times for
SWIFT and EDDI, the power consumptions will not be improved. When all
other parameters are the same, power consumption will be similar to the number
of executed instructions.
In our experiments discussed so far, we assumed that applications are not
tolerable to errors. That is, these applications have PER = 0. However, there
are many applications in the soft computing domain which potentially accept
results with an error margin. For such applications, our approach is capable
of providing much better results since we are able to identify higher number
66
0%
10%
20%
30%
40%
50%
60%
70%
80%
90%
100%
SWIFT
EDDI
Figure 5.6: Instruction count reductions compared to SWIFT and EDDI ap-
proaches without applying our technique.
of statement parameters without violating the error margin indicated by PER
parameter. In the next section, we present a sensitivity analysis on PER values.
67
5.3 Sensitivity Analysis
For soft applications where a margin of error is acceptable, PER parameter can
be adjusted to increase the amount of parameters that can be removed from the
list of protected parameters. Figure 5.7 shows the percentage of parameters that
can be excluded from protection without violating the PER value.
0%
10%
20%
30%
40%
50%
60%
70%
80%
90%
100%
λPER= 0 
λPER=10-8 
λPER=5X10-8 
λPER=10-7 
λPER=3x10-7 
λPER=7x10-7 
λPER=10-6 
λPER=3x10-6 
Figure 5.7: The percentage of parameters that can be excluded from software
tolerance for dierent PER values.
Compared with PER = 0, the amount of parameters excluded from SFT
will increase from from 40.85% to 64.5% when PER = 10
 7. This improve-
ment highly depends on software characteristics. "st" benchmark improved by
66.7%, "edge detect" benchmark improved by 46.4%, "mpeg2enc" benchmark
improved by 31%, "jpeg" and "compress" benchmarks improved around 20%,
"qsort" benchmark and "sqrt" benchmarks improved around 15%, while "bs"
benchmark did not improve any further. As seen in Figure 5.7, when PER value
is set to a value that is too high, most of the statement parameters are excluded
from fault tolerance, which may not be desirable. The PER value should be
68
adjusted according to the application requirements. For the following gures,
we used SWIFT approach (without applying SSFT) as baseline and show our
improvements over this technique for dierent PER values.
0%
10%
20%
30%
40%
50%
60%
70%
80%
90%
100%
λPER= 0 
λPER=10-8 
λPER=5X10-8 
λPER=10-7 
λPER=3x10-7 
λPER=7x10-7 
λPER=10-6 
λPER=3x10-6 
Figure 5.8: Normalized execution times compared to SWIFT approach without
applying our technique for dierent PER values.
When PER = 10
 7, our technique reduces execution time between 6.8% and
25.9% with a mean value of 18.8% for SWIFT technique. The execution times
improve further, when PER is set to higher values.
69
0%
10%
20%
30%
40%
50%
60%
70%
80%
90%
100%
λPER= 0 
λPER=10-8 
λPER=5X10-8 
λPER=10-7 
λPER=3x10-7 
λPER=7x10-7 
λPER=10-6 
λPER=3x10-6 
Figure 5.9: Program binary size reductions compared to SWIFT approach with-
out applying our technique for dierent PER values.
When PER = 10
 7, our technique reduces program size between 13.7% and
52%, with a mean value of 37.6% for SWIFT.
70
0%
10%
20%
30%
40%
50%
60%
70%
80%
90%
100%
λPER=0 
λPER=10-8 
λPER=5X10-8 
λPER=10-7 
λPER=3x10-7 
λPER=7x10-7 
λPER=10-6 
λPER=3x10-6 
Figure 5.10: Instruction count reductions compared to SWIFT approach without
applying our technique for dierent PER values.
Our technique reduces instruction count between 12.9% and 49.11%, with a
mean value of 35.56% when it is applied on SWIFT and the PER = 10
 7.
In the next gure, Figure 5.11, we give the average percentage of parameters
that can be excluded from fault tolerance (over all benchmarks) over a range of
PER values. Specically, when PER = 10
 7, on the average 23.5% reduction
is possible. Along with other gures, this indicates that, for soft computing
applications, SSFT will further reduce the redundancies; hence will provide better
performance, higher availability and cheaper overall cost (hardware requirements
and power consumptions).
71
0,00%
10,00%
20,00%
30,00%
40,00%
50,00%
60,00%
70,00%
80,00%
90,00%
40,85% 
56,10% 
60,98% 
64,47% 
70,74% 
78,39% 
81,21% 
89,67% 
Figure 5.11: The average rate of parameters that can be removed from software
tolerance for dierent PER values.
72
Chapter 6
Conclusion
In order to improve software reliability and provide protection against transient
faults that does not cause permanent damage in hardware, reliability techniques
have been proposed. Software reliability can be achieved through hardware, soft-
ware, or hybrid techniques, which all rely on some form of redundancy. These re-
dundancies cause performance overhead, higher power consumption and increase
the cost of hardware components required by the system.
It is necessary to keep the redundancies minimal in order to reduce their
eect in the system, while maintaining necessary protection. This requires a
better understanding in terms of reliability and performance requirements of the
running application and the environmental factors that aect of transient fault
occurrence (such as hardware specications and environmental radiation levels).
We propose SSFT (Selective software fault tolerance) that uses software pro-
ling information to understand the vulnerabilities of the running application in
terms of reliability and tries to reduce the redundancies while providing the re-
quired levels of protection. Reduction in redundancies will decrease the overall
system cost and performance overhead.
Our experiments shows that, even for applications that cannot tolerate any
errors and require 0 error expectancy, some optimizations can still be made on the
73
statement parameters. For instance, for benchmarks that are more amenable to
such optimizations (as in LPC benchmark), 81.75% of the statement parameters
can be safely omitted without impairing the reliability of the software. A fault
tolerance technique that does not have any optimizations would have protected
81.75% of the statement parameters that do not even require protection. The
impact of this overprotection can easily be seen from our results. For the same
LPC benchmark when SWIFT fault tolerance technique is applied, our approach
improves the execution time of the software by 23.8%, the program binary size
by 47.7%, and the instruction count by 45.1%. Similarly, when it is applied with
EDDI fault tolerance technique, our approach improves the execution time of the
software by 31.3%, the program binary size by 52.9%, and the instruction count
by 51.8% . The reduced redundancy will provide less power consumption, better
performance; therefore improve the availability and will decrease the number of
data bits susceptible to soft errors. Additionally, the reduced execution time will
allow a cheaper and slower CPU, a smaller non-volatile storage, and a smaller
volatile memory. These improvements will reduce the cost of the hardware re-
quirements due to applying the software fault tolerance. On the average, our
technique provides 11.9% execution time, 23.9% binary size, and 22.6% instruc-
tion count reduction for SWIFT, 15.7% execution time, 26.5% binary size, and
25.9% instruction count reduction for EDDI, respectively.
For soft computing applications that can tolerate some margin of error, i.e,
has less strict reliability concerns our technique provides even better reductions.
Instead of enforcing zero error, if we allow an error rate (PER) of 10 6, our bench-
marks, on the average, improve from 40.85% to 81.2%. This further improves
SWIFT and EDDI execution times by 13% and 18.3% respectively. Similarly,
program size is reduced by 30.9% to 35.4%, and instruction count is reduced by
28.73% to 34.5% for SWIFT and EDDI, respectively. These results show that
there is more room for improvements for soft computing applications.
One key advantage of our technique is that, since the redundancies have severe
eects in terms of performance, power consumption and hardware requirements,
a small amount of reduction in redundancies results with a higher rate of im-
provement. The results above indicate that SSFT provides decent improvements
74
in terms of performance and great improvements for program binary size and
program instruction count while preserving the reliability of the software.
In addition, our technique can be applied in combination with any fault toler-
ance technique. Hardware, software and hybrid fault tolerance techniques can all
benet from reduced redundancies provided by our approach. Similarly, hardware
fault tolerance techniques can be informed by the compiler about the vulnera-
bilities of the software and can make informed decisions about where in memory
each instruction should be placed. For instance, a more critical statement argu-
ment or instruction will be placed in an ECC protected memory, whereas the rest
of the data can be placed in a parity protected memory, which will reduce the
cost of the hardware, reduce the power consumption and improve the response
time. Moreover, software techniques can be improved in terms of performance
overhead, power consumption, and memory requirements.
Our approach is exible in the sense that it can be applied to applications that
do not tolerate any errors without any compromises. The amount of optimization,
however, is aected by the acceptable rate of error, i.e, software that can tolerate
some error rate (soft computing applications), can be improved further, and more
redundancy can be eliminated when reliability measures are taken. Overall, the
application is provided with the most suitable fault tolerance, the overall costs
are reduced, and the protection level provided by the fault tolerance is kept in
an acceptable level that suits the reliability requirements of the application.
One limitation in our technique is that, it relies on proling information.
Since the software behavior may dramatically change for dierent data sets, the
program should be tested with a variety of data sets, in order to provide a more
representative proling information.
Another limitation is that, the impact of injected error also depends on the
parameter and resulting data. In order to better simulate error injections, bit
ips should be randomized and the impact of dierent forms of bit ips should be
normalized. To obtain accurate results it is necessary to perform multiple tests.
As future work, we are planning to extend our approach such that it will
75
enable hybrid techniques like CRTR and SRTR to utilize, and hardware fault
tolerance such as ECC protected memory to use it.
We also would like to implement fault injections and redundancy reductions
on array and pointer parameters. This was not possible due to limitations in
GCC framework. We expect to see further improvements with such additions.
76
Bibliography
[1] R. Baumann, \Soft errors in advanced computer systems," IEEE Des. Test,
vol. 22, pp. 258{266, May 2005.
[2] G. A. Reis, J. Chang, N. Vachharajani, S. S. Mukherjee, R. Rangan,
and D. August, \Design and evaluation of hybrid fault-detection systems,"
in Computer Architecture, 2005. ISCA'05. Proceedings. 32nd International
Symposium on, pp. 148{159, IEEE, 2005.
[3] R. C. Baumann, \Radiation-induced soft errors in advanced semiconduc-
tor technologies," Device and Materials Reliability, IEEE Transactions on,
vol. 5, no. 3, pp. 305{316, 2005.
[4] G. A. Reis, J. Chang, N. Vachharajani, R. Rangan, and D. I. August, \Swift:
Software implemented fault tolerance," in Proceedings of the international
symposium on Code generation and optimization, pp. 243{254, IEEE Com-
puter Society, 2005.
[5] S. E. Michalak, K. W. Harris, N. W. Hengartner, B. E. Takala, and S. A.
Wender, \Predicting the number of fatal soft errors in los alamos national
laboratory's asc q supercomputer," Device and Materials Reliability, IEEE
Transactions on, vol. 5, no. 3, pp. 329{335, 2005.
[6] P. P. Shirvani, N. R. Saxena, and E. J. McCluskey, \Software-implemented
edac protection against seus," Reliability, IEEE Transactions on, vol. 49,
no. 3, pp. 273{284, 2000.
[7] G. A. Reis, Software modulated fault tolerance. Princeton University, 2008.
77
[8] G. A. Reis, J. Chang, N. Vachharajani, R. Rangan, D. I. August, and S. S.
Mukherjee, \Software-controlled fault tolerance," ACM Transactions on Ar-
chitecture and Code Optimization (TACO), vol. 2, no. 4, pp. 366{396, 2005.
[9] G. Munkby and S. Schupp, \Type inference for soft-error fault-tolerance
prediction," in Automated Software Engineering, 2009. ASE'09. 24th
IEEE/ACM International Conference on, pp. 65{75, IEEE, 2009.
[10] M. Hiller, A. Jhumka, and N. Suri, \Epic: Proling the propagation and
eect of data errors in software," Computers, IEEE Transactions on, vol. 53,
no. 5, pp. 512{530, 2004.
[11] N. Oh, P. P. Shirvani, and E. J. McCluskey, \Error detection by duplicated
instructions in super-scalar processors," Reliability, IEEE Transactions on,
vol. 51, no. 1, pp. 63{75, 2002.
[12] S. S. Mukherjee, C. Weaver, J. Emer, S. K. Reinhardt, and T. Austin, \A sys-
tematic methodology to compute the architectural vulnerability factors for a
high-performance microprocessor," in Microarchitecture, 2003. MICRO-36.
Proceedings. 36th Annual IEEE/ACM International Symposium on, pp. 29{
40, IEEE, 2003.
[13] M. Gomaa, C. Scarbrough, T. Vijaykumar, and I. Pomeranz, \Transient-
fault recovery for chip multiprocessors," in Computer Architecture, 2003.
Proceedings. 30th Annual International Symposium on, pp. 98{109, IEEE,
2003.
[14] E. Jenn, J. Arlat, M. Rimen, J. Ohlsson, and J. Karlsson, \Fault injection
into vhdl models: the mesto tool," in Fault-Tolerant Computing, 1994.
FTCS-24. Digest of Papers., Twenty-Fourth International Symposium on,
pp. 66{75, IEEE, 1994.
[15] V. Sieh, O. Tschache, and F. Balbach, \Verify: evaluation of reliability using
vhdl-models with embedded fault descriptions," in Fault-Tolerant Comput-
ing, 1997. FTCS-27. Digest of Papers., Twenty-Seventh Annual Interna-
tional Symposium on, pp. 32{36, IEEE, 1997.
78
[16] K. K. Goswami and R. K. Iyer, \A simulation-based study of a triple modu-
lar redundant system using depend," in Fault-Tolerant Computing Systems,
pp. 300{311, Springer, 1991.
[17] H. Madeira, M. Rela, F. Moreira, and J. G. Silva, \Rie: A general pur-
pose pin-level fault injector," in Dependable ComputingEDCC-1, pp. 197{
216, Springer, 1994.
[18] J. Arlat, M. Aguera, L. Amat, Y. Crouzet, J.-C. Fabre, J.-C. Laprie, E. Mar-
tins, and D. Powell, \Fault injection for dependability validation: A method-
ology and some applications," Software Engineering, IEEE Transactions on,
vol. 16, no. 2, pp. 166{182, 1990.
[19] Z. Segall, D. Vrsalovic, D. Siewiorek, D. Yaskin, J. Kownacki, J. Barton,
R. Dancey, A. Robinson, and T. Lin, \Fiat-fault injection based automated
testing environment," in Fault-Tolerant Computing, 1988. FTCS-18, Di-
gest of Papers., Eighteenth International Symposium on, pp. 102{107, IEEE,
1988.
[20] G. A. Kanawati, N. A. Kanawati, and J. A. Abraham, \Ferrari: A exible
software-based fault and error injection system," Computers, IEEE Trans-
actions on, vol. 44, no. 2, pp. 248{260, 1995.
[21] D. T. Stott, B. Floering, D. Burke, Z. Kalbarczpk, and R. K. Iyer, \Nftape: a
framework for assessing dependability in distributed systems with lightweight
fault injectors," in Computer Performance and Dependability Symposium,
2000. IPDS 2000. Proceedings. IEEE International, pp. 91{100, IEEE, 2000.
[22] J. Aidemark, J. Vinter, P. Folkesson, and J. Karlsson, \Goo: Generic
object-oriented fault injection tool," in Dependable Systems and Networks,
2001. DSN 2001. International Conference on, pp. 83{88, IEEE, 2001.
[23] M. Hiller, A. Jhumka, and N. Suri, \Propane: an environment for examining
the propagation of errors in software," ACM SIGSOFT Software Engineering
Notes, vol. 27, no. 4, pp. 81{85, 2002.
[24] J. Freschl and D. Xue, \Swif-it: A tool for memory fault injection and pro-
tection," 2005.
79
[25] R. W. Horst, R. L. Harris, and R. L. Jardine, \Multiple instruction issue in
the nonstop cyclone system," 1990.
[26] T. J. Slegel, R. M. Averill III, M. A. Check, B. C. Giamei, B. W. Krumm,
C. A. Krygowski, W. H. Li, J. S. Liptay, J. D. MacDougall, T. J. McPherson,
et al., \Ibm's s/390 g5 microprocessor design," Micro, IEEE, vol. 19, no. 2,
pp. 12{23, 1999.
[27] Y. Yeh, \Triple-triple redundant 777 primary ight computer," in Aerospace
Applications Conference, 1996. Proceedings., 1996 IEEE, vol. 1, pp. 293{307,
IEEE, 1996.
[28] T. Vijaykumar, I. Pomeranz, and K. Cheng, \Transient-fault recovery us-
ing simultaneous multithreading," ACM SIGARCH Computer Architecture
News, vol. 30, no. 2, pp. 87{98, 2002.
[29] S. S. Mukherjee, M. Kontz, and S. K. Reinhardt, \Detailed design and evalu-
ation of redundant multi-threading alternatives," in Computer Architecture,
2002. Proceedings. 29th Annual International Symposium on, pp. 99{110,
IEEE, 2002.
[30] E. Rotenberg, \Ar-smt: A microarchitectural approach to fault tolerance
in microprocessors," in Fault-Tolerant Computing, 1999. Digest of Papers.
Twenty-Ninth Annual International Symposium on, pp. 84{91, IEEE, 1999.
[31] C. Weaver, J. Emer, S. S. Mukherjee, and S. K. Reinhardt, \Techniques to
reduce the soft error rate of a high-performance microprocessor," in ACM
SIGARCH Computer Architecture News, vol. 32, p. 264, IEEE Computer
Society, 2004.
[32] Intel Corporation, \Rachet up reliability for mission-critical applications,"
2012.
[33] J. G. Holm and P. Banerjee, \Low cost concurrent error detection in a vliw
architecture using replicated instructions.," in ICPP (1), pp. 192{195, 1992.
80
[34] R. Venkatasubramanian, J. P. Hayes, and B. T. Murray, \Low-cost on-line
fault detection using control ow assertions," inOn-Line Testing Symposium,
2003. IOLTS 2003. 9th IEEE, pp. 137{143, IEEE, 2003.
[35] M. Rebaudengo, M. S. Reorda, M. Violante, and M. Torchiano, \A source-to-
source compiler for generating dependable software," in Source Code Analysis
and Manipulation, 2001. Proceedings. First IEEE International Workshop
on, pp. 33{42, IEEE, 2001.
[36] N. Oh and E. J. McCluskey, \Low energy error detection technique using
procedure call duplication," in Proceedings of the 2001 International Sym-
posium on Dependable Systems and Networks, 2001.
[37] A. Shye, V. Janapa, M. Daniel, and A. Connors, \Transient fault tolerance
via dynamic process-level redundancy," 2006.
[38] J. Chang, G. A. Reis, and D. I. August, \Automatic instruction-level
software-only recovery," in Dependable Systems and Networks, 2006. DSN
2006. International Conference on, pp. 83{92, IEEE, 2006.
[39] D. M. Tullsen, S. J. Eggers, and H. M. Levy, \Simultaneous multithreading:
Maximizing on-chip parallelism," in ACM SIGARCH Computer Architecture
News, vol. 23, pp. 392{403, ACM, 1995.
[40] D. M. Tullsen, S. J. Eggers, J. S. Emer, H. M. Levy, J. L. Lo, and R. L.
Stamm, \Exploiting choice: Instruction fetch and issue on an implementable
simultaneous multithreading processor," in ACM SIGARCH Computer Ar-
chitecture News, vol. 24, pp. 191{202, ACM, 1996.
[41] S. K. Reinhardt and S. S. Mukherjee, \Transient fault detection via simul-
taneous multithreading," in ACM SIGARCH Computer Architecture News,
vol. 28, pp. 25{36, ACM, 2000.
[42] K. Sundaramoorthy, Z. Purser, and E. Rotenberg, \Slipstream processors:
improving both performance and fault tolerance," ACM SIGPLAN Notices,
vol. 35, no. 11, pp. 257{268, 2000.
81
[43] G. Jacques-Silva, B. Gedik, H. Andrade, K.-L. Wu, and R. K. Iyer, \Fault
injection-based assessment of partial fault tolerance in stream processing
applications," in Proceedings of the 5th ACM international conference on
Distributed event-based system, pp. 231{242, ACM, 2011.
[44] Richard M. Stallman and the GCC Developer Community, \Gnu compiler
collection internals," 2010.
[45] E. Normand, \Single-event eects in avionics," Nuclear Science, IEEE
Transactions on, vol. 43, no. 2, pp. 461{474, 1996.
[46] C. Lee and M. Stoodley, \Utdsp benchmark suite," 1998.
[47] C. Lee, M. Potkonjak, and W. H. Mangione-Smith, \Mediabench: a tool
for evaluating and synthesizing multimedia and communicatons systems,"
in Proceedings of the 30th annual ACM/IEEE international symposium on
Microarchitecture, pp. 330{335, IEEE Computer Society, 1997.
82
