Improving processor reliability using software protection techniques. by Nezzari, Yasser
Improving Processor Reliability Using Software 
Protection Techniques 
Yasser Nezzari 
 
Submitted for the Degree of  
Doctor of Philosophy  
from the 
University of Surrey 
 
 
 
 
 
 
 
Surrey Space Centre 
Faculty of Engineering and Physical Sciences 
University of Surrey 
Guildford, Surrey, GU2 7XH, UK 
June 2019 
Supervised by: Dr Christopher P. Bridges 
 
©Yasser Nezzari 2019 
 
Yasser Nezzari                                 Improving Processor Reliability Using Software Protection Techniques  
 
STATEMENT OF ORIGINALITY 
 
This thesis and the work to which it refers are the results of my own efforts. Any ideas, data, images, 
or text resulting from the work of others (whether published or unpublished) are fully identified as such 
within the work and attributed to their originator in the text, bibliography, or in footnotes. This thesis 
has not been submitted in whole or in part for any other academic degree or professional qualification. 
I agree that the University has the right to submit my work to the plagiarism detection service 
TurnitinUK for originality checks. Whether or not drafts have been so assessed, the University reserves 
the right to require an electronic version of the final document (as submitted) for assessment as above. 
 
 
Yasser Nezzari 
May 2019 
 
ACKNOWLEDGMENTS 
Firstly, I should like to thank my parents for their eternal understanding and support, and for always 
being there throughout this PhD. I’m very grateful for the trust they have in me and for all the advice 
and opportunities they have given me throughout my life. I would also like to thank all my brothers and 
sisters, Abdelhak, Saida, Chafia, Zakaria, and Idhir for all their support. 
I am extremely grateful for the support and guidance of my supervisor Dr Chris Bridges, with his 
invaluable experience, insight and knowledge in the area, and whose hard work and dedication are an 
example to us all.  
Special thanks to Karen Collar from Surrey Space Centre, who has been its backbone, always 
listened to our daily problems and concerns, and always been happy and willing to offer support and 
assistance. 
I’m hoping that the friendships I have built in Surrey Space Centre will continue beyond this PhD. 
In particular, I would like to thank Alla, Asma, Lotfi, Zak, Gabi, Hakim and Walid for the amazing time 
we had in Guildford.  
I would also like to thank the security staff of the University of Surrey for always being there to open 
the door for me outside working hours and at weekends. I appreciate your understanding and support, 
which enabled me to make much progress in my work and thesis writing. 
Beyond Surrey Space Centre, special thanks go to my university friend Hamza; I hope your research 
in Japan is going well. I would also like to thank my childhood friend Chouib. I’m very lucky to have 
such amazing friends. 
 
ABSTRACT 
The use of Commercial Off-The-Shelf (COTS) processors is increasingly attractive for the space 
domain, especially with emerging high-demand applications in Earth observation and communications. 
An order of magnitude improvement in on-board processing capability with less size, mass, and power 
is possible; however, COTS parts still lag in terms of reliability in the space environment. Costly 
protection techniques to ensure resilience to Single Event Effects (SEEs) are required. Whilst current 
software reliability techniques are only capable of detecting errors and performing partial recovery, our 
research offers a step change for both error detection and recovery without degradation in fault coverage. 
This research targets modern multicore processors.  
This research presents a novel software technique, Automatic Compiler Error-Detection and 
Recovery (ACEDR), for software error detection and recovery. This technique is capable of covering 
both the CPU and memory of COTS processing architectures, where the corruption of data (RAM and 
CPU registers) accessed by instructions can be corrected. ACEDR does not require additional hardware 
modifications in order to detect and recover from errors. ACEDR is based on the LLVM 
compiler framework: it adds redundant instructions to the original code at compile-time, and 
inserts check instructions (a voter function call) that decide the right outcome out of the three 
redundant instructions at run-time. To achieve high coverage, both CPU instructions, such as arithmetic 
and logic operations, and memory instructions, responsible for Reading/Writing (R/W) from/to memory, 
have been triplicated and protected. This work does not protect jump or conditional jump 
instructions, and bit flips that would transform one instruction into another are not considered. 
The LLVM modifications consist of modifying the optimization phase of the compilation 
process by adding two passes: an analysis pass and a transformation pass. The analysis pass provides 
information about the code, consisting of instruction types and some statistics that can be used to 
analyse the confidence level later. The transformation pass adds the protection code: it takes the 
information provided by the analysis pass and uses it to add the appropriate protection where needed. 
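The triplicate-and-vote idea described above can be illustrated with a minimal Python sketch. It is not the ACEDR LLVM pass itself: the function names `vote` and `protected_add`, the 32-bit word width and the use of a bitwise majority are assumptions made purely for illustration, and the thesis's voter function may decide the outcome differently.

```python
MASK32 = 0xFFFFFFFF  # assumed 32-bit word width (illustrative)


def vote(a, b, c):
    """Bitwise majority of three redundant copies of a value."""
    return ((a & b) | (a & c) | (b & c)) & MASK32


def protected_add(x, y, fault=None):
    """Compute x + y three times and vote on the result.

    `fault` is an optional (copy_index, bit) pair that flips one bit in one
    of the three copies, standing in for a single event upset.
    """
    copies = [(x + y) & MASK32 for _ in range(3)]
    if fault is not None:
        copy_index, bit = fault
        copies[copy_index] ^= 1 << bit  # corrupt exactly one copy
    return vote(*copies)
```

With a single corrupted copy, the two remaining copies outvote it, so `protected_add(2, 3, fault=(1, 7))` still returns the fault-free sum.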
This research also proposes an adaptive protection scheme for multicore processors, motivated by the 
fact that traditional software protection techniques are mostly focused on error detection and ignore 
recovery. Our research offers a step change not only in its ability to detect and recover from errors, 
but also in its ability to reduce the overhead while keeping high error coverage. It allows 
the running software to change its operation mode in real time and apply the appropriate error detection 
and recovery technique depending on the error rate on orbit. Novel error detection and protection 
techniques have been implemented to provide each operation mode with the necessary resilience. These 
include Instructions-TMR (ITMR), Threads-TMR (TTMR) and the combination of both ITMR and 
TTMR. Reliability prediction models are also presented in this research in order to estimate the 
reliability.  
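A mode switch of this kind could be sketched as follows. This is only an illustration under stated assumptions: the `Mode` names mirror the techniques named above, but the unprotected mode, the ordering of the modes, the threshold values and the unit of the observed error rate are hypothetical and are not taken from the thesis.

```python
from enum import Enum


class Mode(Enum):
    NONE = "no protection"        # assumed baseline mode (hypothetical)
    TTMR = "Threads-TMR"
    ITMR = "Instructions-TMR"
    COMBINED = "ITMR + TTMR"


# Hypothetical thresholds on the observed error rate (errors per hour),
# listed from the strongest protection to the weakest.
THRESHOLDS = [
    (10.0, Mode.COMBINED),
    (1.0, Mode.ITMR),
    (0.1, Mode.TTMR),
]


def select_mode(observed_error_rate):
    """Pick the first mode whose threshold the observed rate reaches."""
    for threshold, mode in THRESHOLDS:
        if observed_error_rate >= threshold:
            return mode
    return Mode.NONE
```

The design choice sketched here is to pay for the heavier protection only when the environment demands it; a real system would derive the thresholds from the reliability predictions and the measured overheads.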
The reliability equations model the whole processing architecture using multiple parameters 
related to the hardware architecture and the environment. To determine the reliability added 
by our software protection techniques, all of the benchmarks have been tested using software fault 
injection, for which an LLVM fault injector has been developed. 
The reliability model starts from the basic Markov chain model. The reliability predictions 
depend on many factors specific to the hardware architecture used, including λ, the Single Event 
Upset (SEU) error rate, which changes depending on the cross section of the component. Another factor 
is the access rate, which depends on the hit/miss rate of the component. Other parameters that affect the 
reliability predictions are specific to the CPU, such as the number of cores and pipeline stages. Our model 
also takes into account the sensitivity of the different instruction types found in the 
benchmarks that have been tested under an SEU rate λ. The prediction model estimates the 
worst-case scenario, and does not consider the case where an error has occurred before writing to memory or 
loading to CPU registers. 
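As a point of reference for the Markov starting point mentioned above, the textbook reliability of a triple modular redundant system without repair and with a perfect voter can be computed as in the sketch below. This is the standard result only; the thesis's full model adds access rates, core counts, pipeline stages and instruction sensitivities, which are not reproduced here. The function names are illustrative.

```python
import math


def simplex_reliability(lam, t):
    """Reliability of a single unprotected module: R(t) = exp(-lam * t)."""
    return math.exp(-lam * t)


def tmr_reliability(lam, t):
    """Textbook TMR reliability with a perfect voter and no repair.

    The system survives if at least two of the three modules survive:
    R_TMR = 3 * R**2 - 2 * R**3.
    """
    r = simplex_reliability(lam, t)
    return 3 * r**2 - 2 * r**3
```

For small values of λt the TMR system is more reliable than a single module, but once the module reliability falls below 0.5 the redundant system becomes less reliable than the single module, which is one reason recovery matters alongside detection.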
This research has been validated by means of fault injection, where both the protected and the 
unprotected codes have been injected. The outcomes of the injected codes have been compared to assess 
the ability of ACEDR and the adaptive solution to detect and recover from SEU errors. The fault injector 
goes through the code and randomly flips a bit in the data of one of its instructions. This includes both 
CPU and memory instructions. Including the data of both CPU and memory instructions makes the 
injector more realistic with respect to SEU behaviour in the real world.   
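The core operation of such an injector, flipping one randomly chosen bit of a data word, can be sketched in a few lines. This is an assumed illustration in Python rather than the LLVM fault injector developed in the thesis; the function name and the default 32-bit width are hypothetical.

```python
import random


def flip_random_bit(value, width=32, rng=random):
    """Flip one uniformly chosen bit of `value`.

    Returns the corrupted value together with the index of the flipped bit,
    so that the injection can be logged and reproduced.
    """
    bit = rng.randrange(width)
    return value ^ (1 << bit), bit
```

Passing a seeded `random.Random` instance as `rng` makes an injection campaign repeatable.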
ACEDR improves the reliability of the system by reducing the error rate in the injection experiments 
that simulate SEUs. The error rate is defined as the number of injections that have caused an error 
divided by the total number of injections. ACEDR provides up to 99% improvement for some 
benchmarks. This research has been tested on two machines: an Intel Core i5-3470 running at 3.2 GHz 
and a Raspberry Pi 3. On the first platform the overhead was less than 15%, and on the second 
platform the overhead was less than 17%.  
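The error rate definition above can be made concrete with a toy injection campaign. The workload (an 8-bit truncated addition), the function names and the single-copy fault model are assumptions chosen for illustration; they are not the thesis benchmarks or its injector.

```python
import random


def error_rate(errors, total):
    """Number of injections that caused an error divided by all injections."""
    if total <= 0:
        raise ValueError("total must be positive")
    return errors / total


def majority(a, b, c):
    """Bitwise majority of three redundant copies."""
    return (a & b) | (a & c) | (b & c)


def run_campaign(protected, injections=1000, seed=0):
    """Inject one bit flip per run and count runs whose output is wrong."""
    rng = random.Random(seed)
    errors = 0
    for _ in range(injections):
        x, y = rng.randrange(256), rng.randrange(256)
        golden = (x + y) & 0xFF            # fault-free output (low 8 bits)
        flip = 1 << rng.randrange(32)      # single-bit upset in a 32-bit word
        if protected:
            copies = [x + y, x + y, x + y]
            copies[rng.randrange(3)] ^= flip   # corrupt one of three copies
            result = majority(*copies) & 0xFF
        else:
            result = ((x + y) ^ flip) & 0xFF
        errors += result != golden
    return error_rate(errors, injections)
```

In this toy setting some flips in the unprotected run are masked because they land in bits that the output discards, while the voted run recovers from every single-copy flip.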
Unlike other techniques in the literature that only provide error detection and/or partial recovery, our 
purely software-based protection technique offers a high rate of both detection and recovery, relative 
to the high error rate that has been injected. This is due to the variety of data and CPU register types it 
replicates. This research triplicates the i32, i32*, i1, i8, i8*, i64, float and double types, as well as 
float and double pointers, and replicates both memory and CPU registers. This newly 
developed software-based technique is notably beneficial when designers do not have the luxury of 
modifying the hardware but still need resilience against SEUs in a computer system. In ACEDR 
the overhead is low because it adds redundant but independent instructions that use the 
CPU pipeline to execute in parallel, without creating a bottleneck.  
For the adaptive multicore solution, both predictions and injection experiments confirm that the best 
reliability results were obtained when the combined protection techniques were used, where the error 
rates dropped to between 0% and 0.60%. The second-best results were obtained when Instructions-TMR 
(ITMR) was used, where the error rates dropped to between 0% and 3.97%. Using 
Threads-TMR (TTMR) reduced the error rates to between 3.51% and 14.73%, which we do not 
recommend for mission-critical systems. Error detection and recovery came with an overhead 
of 14.97% to 131.70% for the combined protection techniques, 6.44% to 87.54% when ITMR 
was used, and 10.32% to 41.14% when TTMR was used. 
In the adaptive solution, the overhead added when protecting with the combined technique (TTMR 
and ITMR) was higher than when only TTMR or only ITMR was used. The main sources of 
delay were the creation and joining of new threads, the voting function, and the redundant 
instructions added by ITMR. However, reliability improved dramatically 
when the combined protection techniques were used. 
The reliability predictions of ACEDR and the adaptive solution have been compared against the 
reliability obtained from the fault injection experiment, after finding a correlation between the two 
results. 
This work would be highly valuable both for satellite and space systems and for general computing, 
such as in aircraft, automotive systems, server farms, and medical equipment (or anywhere that needs 
safety-critical performance), as hardware gets smaller and more susceptible to radiation-induced errors. 
 
TABLE OF CONTENTS  
 
TABLE OF CONTENTS ...................................................................................................... ix 
LIST OF FIGURES ............................................................................................................ xiii 
LIST OF TABLES .............................................................................................................. xvi 
LIST OF ACRONYMS ..................................................................................................... xvii 
1. Introduction .................................................................................................................... 20 
1.1 Research Methodology ........................................................................................... 23 
1.2 Research Aims ........................................................................................................ 24 
1.3 Novel Contributions ................................................................................................ 25 
1.4 Thesis Structure ...................................................................................................... 26 
2. Literature Review ........................................................................................................... 28 
2.1 Terrestrial Processing Systems ................................................................................ 29 
2.1.1 The Von Neumann Machine Model ................................................................... 29 
2.1.2 Pipelining ........................................................................................................... 30 
2.1.3 Caches ................................................................................................................ 30 
2.1.4 Multiprocessors .................................................................................................. 32 
2.1.5 Compilers ........................................................................................................... 34 
2.1.6 Structure of a Compiler ...................................................................................... 35 
2.1.7 LLVM Compiler ................................................................................................ 35 
2.2 Space Processing Systems ...................................................................................... 38 
2.2.1 Radiation Effects on Electronic Systems ........................................................... 39 
2.2.2 Cumulative Effects and Single Event Effects .................................................... 41 
2.2.3 Technology and SEE Error Rate Relationships ................................................. 41 
2.2.4 Radiation Hardening by Design ......................................................................... 44 
2.2.5 Comparing Processors for Space and Earth Applications  ................................. 44 
2.3 Error Correction Codes ........................................................................................... 45 
2.4 Software Protection Techniques .............................................................................. 47 
2.4.1 Process-Level Replication (PLR) ...................................................................... 47 
2.4.2 Thread-level Replication (TLR) ........................................................................ 48 
2.4.3 Heartbeats .......................................................................................................... 48 
2.4.4 Fault Tolerance on Multicore Processors Using Deterministic Multithreading . 48 
2.4.5 A User-level Library for Fault Tolerance for Multicore .................................... 49 
2.4.6 Check-Pointing and Rollback Recovery for MultiProcessors ........................... 49 
2.4.7 Checkpoint/Rollback ......................................................................................... 49 
2.4.8 Respec: Multiprocessor Replay via External Determinism ............................... 49 
2.4.9 DRIFT: Decoupled compiler-based Instruction-level Fault-Tolerance .............. 49 
2.4.10 Composite Data Type Protection Algorithm (CDTP) ....................................... 50 
2.4.11 Live Variable Check Algorithm ........................................................................ 50 
2.4.12 Further Techniques ............................................................................................ 51 
2.4.13 Comparison of the Techniques .......................................................................... 52 
2.4.14 Techniques for Memory Access and Coherency ............................................... 54 
2.5 Summary ................................................................................................................. 55 
3. New Software Protection Techniques for COTS ............................................................ 57 
3.1 Key Research Challenges ....................................................................................... 58 
3.2 Generating Hardened Code Using LLVM Compiler .............................................. 59 
3.3 Using FEC & Modular Redundancy Protection Techniques (TMR) ...................... 60 
3.4 Modelling Reliability Equations for COTS Processing Architectures .................... 63 
3.5 New Adaptive Multicore Solution .......................................................................... 63 
3.6 Proposed Fault Injection ......................................................................................... 65 
3.7 Summary ................................................................................................................. 66 
4. Reliability Predictions .................................................................................................... 68 
4.1 Software TMR Reliability Prediction Model .......................................................... 70 
4.2 Reliability Prediction of a Memory of N word Size with TMR.............................. 71 
4.3 Reliability Prediction of RAM, Caches & CPU ..................................................... 71 
4.3.1 RAM, Caches and CPU Reliability Predictions Without Protection ................. 71 
4.3.2 RAM, Caches and CPU Reliability Predictions when Protected with ACEDR 72 
4.3.3 Reliability Equations Obtained From the Injection Experiments ...................... 73 
4.3.4 Reliability of RAM, Caches and CPU when Protected with Threads TMR ...... 75 
4.3.5 Reliability when the Combination of ITMR & TTMR is used ......................... 76 
4.4 Summary ................................................................................................................. 76 
5. Automatic Error Detection and Recovery in LLVM ...................................................... 78 
5.1 Experimental Setup ................................................................................................. 79 
5.1.1 Iterating Through LLVM-IR Code Layers ......................................................... 80 
5.1.2 Experiment 1: Software TMR Using Instructions ............................................. 82 
5.1.3 Experiment 2: In-line Software FEC  ................................................................ 85 
5.1.4 Discussion & Evaluation ................................................................................... 87 
5.2 Automatic Compiler Error-Detection & Recovery (ACEDR) ................................ 89 
5.2.1 Adding ACEDR Instructions in IR .................................................................... 89 
5.3 Error Injection ......................................................................................................... 91 
5.3.1 Injection Experiments of different Instruction types ......................................... 93 
5.3.2 Injecting Unprotected Code ............................................................................... 94 
5.3.3 Injecting Protected Code .................................................................................. 101 
5.3.4 ACEDR Time Overhead .................................................................................. 106 
5.3.5 Comparing ACEDR & State-Of-The-Art ........................................................ 106 
5.4 Reliability Comparison of Injection Experiments to Predictions ......................... 107 
5.4.1 Mean Time to Failure (MTTF) ........................................................................ 108 
5.5 Summary ............................................................................................................... 109 
6. Tunable Multicore Protection ....................................................................................... 112 
6.1 Adaptive Multicore Platform Concept .................................................................. 113 
6.1.1 Implementation of Adaptive Multicore Protection .......................................... 113 
6.1.2 Adaptive Protection Modes ............................................................................. 114 
6.2 Injection Experiments ........................................................................................... 116 
6.2.1 Injecting the Unprotected Code ....................................................................... 116 
6.2.2 Injecting the Protected Code ............................................................................ 119 
6.2.3 Study of Time Overhead .................................................................................. 124 
6.2.4 Reliability of the Injection Experiment VS Reliability Predictions ................. 124 
6.2.5 Discussion ........................................................................................................ 127 
6.3 Summary ............................................................................................................... 129 
7. Conclusion & Future Work ........................................................................................... 132 
Future Considerations .................................................................................................... 136 
Publications .................................................................................................................... 137 
8. References .................................................................................................................... 138 
A. Appendix A-Radiation Hardening by Design ............................................................... 147 
A.1 Hardness by Layout Design .................................................................................. 147 
A.2 Hardness by Circuit Design .................................................................................. 147 
A.3 Radiation Hardened by Design (RHBD) Processors ............................................ 148 
A.4 Radiation Hardened Processing Architectures ...................................................... 149 
A.4.1 Using PCM in Next-generation Embedded Space Applications ...................... 150 
A.4.2 Quad-Core Radiation-Hardened System-on-Chip ........................................... 151 
A.4.3 SCS750 Architecture........................................................................................ 152 
A.4.4 Honeywell's RHPPC Integrated Circuit ........................................................... 153 
A.5 Hardened by Design Layout ................................................................................. 154 
B. Appendix B -Error correcting coding ........................................................................... 155 
B.1 Basic Concepts ...................................................................................................... 155 
B.1.1 Block Codes and Convolutional Codes ........................................................... 155 
B.1.2 Hamming Distance, Hamming Spheres and Error Correcting Capability ....... 156 
B.2 Linear Block Codes ............................................................................................... 157 
B.2.1 Generator and Parity-Check Matrices .............................................................. 157 
B.2.2 The Weight is the Distance .............................................................................. 158 
B.3 Encoding and Decoding of Linear Block Codes ................................................... 158 
B.3.1 Encoding with G and H ................................................................................... 158 
B.3.2 Standard Array Decoding ................................................................................. 159 
B.3.3 Hamming Spheres, Decoding Regions and the Standard Array ...................... 160 
B.4 General Structure of a Hard-Decision Decoder of Linear Codes ......................... 161 
B.5 Hamming, Golay and Reed–Muller codes ............................................................ 162 
B.5.1 Hamming Codes .............................................................................................. 162 
B.5.2 The Binary Golay Code ................................................................................... 163 
B.5.3 Extended (24, 12, 8) Golay Code .................................................................... 165 
B.6 Binary Reed–Muller Codes ................................................................................... 165 
B.6.1 Boolean Polynomials and RM Codes .............................................................. 166 
B.7 Finite Geometries and Majority-Logic Decoding ................................................. 166 
B.8 Binary Cyclic Codes and BCH Codes .................................................................. 168 
B.8.1 Binary Cyclic Codes ........................................................................................ 168 
B.8.2 Encoding and Decoding of Binary Cyclic Codes ............................................ 170 
B.8.3 The Parity-Check Polynomial .......................................................................... 170 
B.9 Decoding of Cyclic Codes .................................................................................... 171 
B.10 TMR ...................................................................................................................... 172 
B.11 Discussion ............................................................................................................. 172 
 
 
LIST OF FIGURES 
Figure 2-1 The Von Neumann Machine Model [2] ........................................................................... 30 
Figure 2-2 Snapshot of Sequential Program Execution [2] ............................................................. 30 
Figure 2-3 The Memory Organization of a System [17] .................................................................. 31 
Figure 2-4 Different Cache Levels [17] ............................................................................................. 32 
Figure 2-5 SISD Architecture [16] ..................................................................................................... 33 
Figure 2-6 MISD Architecture [16].................................................................................................... 33 
Figure 2-7 SIMD Architecture [16].................................................................................................... 33 
Figure 2-8 MIMD Architecture [16] .................................................................................................. 33 
Figure 2-9 SMP Schematic [2] ............................................................................................................ 34 
Figure 2-10 Dance-hall (UMA) System [2] ........................................................................................ 34 
Figure 2-11 Schematic NUMA Organisation with Centralised Interconnection Network [2] ..... 34 
Figure 2-12 Decentralised Interconnection Network [2] ................................................................. 34 
Figure 2-13 Compiler Stages .............................................................................................................. 35 
Figure 2-14 Front End of LLVM ....................................................................................................... 36 
Figure 2-15 Optimization Phase of LLVM ....................................................................................... 37 
Figure 2-16 Backend of LLVM .......................................................................................................... 37 
Figure 2-17 Various Radiation Effects Induced in Electronic Devices [10] ................................... 38 
Figure 2-18 Cross Section and LET .................................................................................................. 42 
Figure 2-19 Performance of Space Processors [91] .......................................................................... 45 
Figure 2-20 A Canonical Digital Communications System [104]. ................................................... 46 
Figure 2-21 Application and Kernel Layers ..................................................................................... 47 
Figure 2-22  Performance of the Protection Techniques ................................................................. 53 
Figure 3-1 Software Layers with LLVM ........................................................................................... 59 
Figure 3-2 Example of the High-Level Languages & Processing Architectures Supported by 
LLVM ........................................................................................................................................... 60 
Figure 3-3 Proposed Flow Chart of FEC Using LLVM ................................................................... 62 
Figure 3-4 Typical Multicore COTS After Applying Software Threads TMR .............................. 64 
Figure 3-5 Typical Multicore COTS After Applying the Combined ITMR & TTMR .................. 65 
Figure 4-1 Serially Connected Elements ........................................................................................... 68 
Figure 4-2 Triple Modular Redundancy with Voting ....................................................................... 70 
Figure 4-3 Markov TMR States ......................................................................................................... 70 
Figure 4-4 Memory Modules Connection in Software Perspective ................................................ 71 
Figure 4-5 Typical Multicore Processing Architecture .................................................................... 75 
Figure 5-1 Overview of New Compiler Pass ..................................................................................... 80 
Figure 5-2 Iterating Through Different LLVM IR Code Layers
Figure 5-3 Detecting a Specific Instruction Type in the Intermediate Representation
Figure 5-4 TMR Algorithm
Figure 5-5 Replicating a Binary Instruction “add”
Figure 5-6 Replicating the Allocation of Memory Instruction
Figure 5-7 Storing Copies of the Original Data in the New Created Allocations
Figure 5-8 The CFG of the Implemented TMR
Figure 5-9 TMR Implementation
Figure 5-10 Implementation of the FEC Using the Linker
Figure 5-11 Implementation of the FEC Without the Linker
Figure 5-12 ECC Implementation
Figure 5-13 Protecting the “alloca” Memory Instruction
Figure 5-14 Protecting the “store” Memory Instruction
Figure 5-15 Protecting the “load” Memory Instruction
Figure 5-16 Protecting the “add” CPU Instruction
Figure 5-17 Flowchart of the Error Injection Process
Figure 5-18 Injection of CPU Instruction
Figure 5-19 Injection of Store Memory Instruction
Figure 5-20 No Protection Applied
Figure 5-21 Injection of the Different Types of Instructions of the Unprotected Code (fib)
Figure 5-22 Injection of the Different Types of Instructions of the Unprotected Code (qsrt)
Figure 5-23 Injection of the Different Types of Instructions of the Unprotected Code MM
Figure 5-24 Injection of all Instructions of Unprotected FFT
Figure 5-25 Injection of all Instructions of Unprotected math (solvecubic, rad2deg, deg2rad, uqsort)
Figure 5-26 Injection of all Instructions of the Protected math (solve cubic, rad2deg, deg2rad, uqsort)
Figure 5-27 Injection of all Instructions of the Protected math (solve cubic, rad2deg, deg2rad, uqsort)
Figure 5-28 Protect CPU Registers (Binary Operations, Arithmetic & Logic instructions)
Figure 5-29 Protect Memory Instructions
Figure 5-30 Protect CPU Registers & Memory Instructions
Figure 5-31 The MTTF Error with Respect to (λp/λc) Ratio
Figure 6-1 Alternating Protection Mode in Real Time
Figure 6-2 Adaptive Protection Flowchart
Figure 6-3 TTMR Mode of Operation
Figure 6-4 Injection of the Different Types of Instructions of the Unprotected Code (fib)
Figure 6-5 Injection of the Different Types of Instructions of the Unprotected Code (qsrt)
Figure 6-6 Injection of all Instructions of Unprotected math (solvecubic, rad2deg, deg2rad, uqsort)
Figure 6-7 Injection of the Different Types of Instructions of the Unprotected Code (Roots)
Figure 6-8 Injection of the Different Types of Instructions of the Protected Code Using TTMR (fib)
Figure 6-9 Injection of the Different Types of Instructions of the Protected Code Using TTMR (qsrt)
Figure 6-10 Injection of the Different Types of Instructions of the Protected Code Using TTMR (Math)
Figure 6-11 Injection of the Different Types of Instructions of the Protected Code Using (Roots)
Figure 6-12 Injection of the Different Types of Instructions of the Protected Code Using TTMR & ITMR (Math)
Figure 6-13 Injection of the Different Types of Instructions of the Protected Code Using TTMR & ITMR (Qsort)
Figure 6-14 Comparing the Predicted and Experimental Reliability of the Fib Benchmark
Figure 6-15 Comparing the Predicted and Experimental Reliability of the Roots
Figure 6-16 Comparing the Predicted and Experimental Reliability of the Math benchmark
Figure 6-17 Comparing the Predicted and Experimental Reliability of the Qsort benchmark
Figure A-1 PCM Memory Management Internal Architecture [37]
Figure A-2 RAD5545 Quad-Core Microprocessor Personality Block Diagram
Figure A-3 Functional Block Diagram for SCS750 [46]
Figure A-4 RHPPC Processor Functional Block Diagram
Figure B-1 A Systematic Block Encoding for Error Correction [104].
Figure B-2 Binary Symmetric Channel (BSC)
Figure B-3 General Structure of a Hard-Decision Decoder of a Linear Block Code [104]
Figure B-4 A Majority-Logic Decoder for a Cyclic RM (1, 3) Code [104].
Figure B-5 A Cyclic Shift Register.
Figure B-6 Circuit for Systematic Encoding: Division by 𝒈(𝒙) [104]
Figure B-7 General Architecture of a Decoder for Cyclic Codes [104].
 
LIST OF TABLES 
Table 2-1 COTS Processing Architectures
Table 2-2 In-Orbit Observation of SEUs [89]
Table 2-3 SEUs Rates for Different Circuit Technologies and Different Orbits [25].
Table 2-4 Comparison of Different Techniques
Table 3-1 List of Research Challenges
Table 4-1 Notation
Table 5-1 TMR Additional Instructions
Table 5-2 TMR Function Instructions
Table 5-3 Profiling Results of Fibonacci Series Benchmark
Table 5-4 ACEDR Time Overhead For the Different Processing Platforms
Table 6-1 Error Rates of Different Software Protection Techniques
Table 6-2 Time Overhead
Table 6-3 Experimental & Predicted Reliability
Table 7-1 List of Publications
Table B-1 Results of the Binary Elements Addition and Multiplication.
Table B-2 Standard Array of a Binary Linear Block Code.
 
LIST OF ACRONYMS 
 
ADC Analogue-to-Digital Converter 
ALU Arithmetic and Logical Unit 
API Application Programming Interface 
AST Abstract Syntax Tree 
AVF Architectural Vulnerability Factor 
BCH Bose–Chaudhuri–Hocquenghem 
BJT Bipolar Junction Transistors 
CC Conditional Code 
CCD Charge-Coupled-Device 
CDTP Composite Data Type Protection Algorithm 
CFG Control Flow Graph 
CMOS Complementary Metal–Oxide–Semiconductor 
CMP Chip Multiprocessors 
CMP Compare 
COTS Commercial off the Shelf  
CPR Check-Pointing and Rollback 
CPU Central Processing Unit 
CRC Cyclic Redundancy Check 
CWRD Codeword 
DBF Double Bit Flip 
DC Direct Current 
DMA Direct Memory Access 
DMR Dual Modular Redundancy 
DRAM Dynamic Random-Access Memory 
DRIFT Decoupled Compiler-Based Instruction-Level Fault-Tolerance 
DSP Digital Signal Processor 
ECC Error Correcting Codes 
EDAC Error Detection and Correction 
EO Earth Observation 
FFT Fast Fourier Transform 
FI Functional Interrupt 
FIT Failure-In-Time 
FPGA Field Programmable Gate Array 
HPC High Performance Computing 
IC Integrated Circuit 
ILP Instruction Level Parallelism 
IR Intermediate Representation 
ISA Instruction Set Architecture 
ITMR Instructions TMR  
LET Linear Energy Transfer 
LLVM Low Level Virtual Machine 
LVCA Live Variable Check Algorithm 
MBU Multiple Bit Upset 
MIMD Multiple Instruction, Multiple Data 
MIPS Millions of Instructions per Second 
MISD Multiple Instruction, Single Data 
MOSFET Metal Oxide Semiconductor Field Effect Transistor  
MPSoC Multiprocessor System-on-Chip 
NASA National Aeronautics and Space Administration 
NMR N-Modular Redundancy 
NUMA Non-Uniform Memory Access 
OBC Onboard Computer 
PC Program Counter 
PLR Process-Level Replication 
RH Radiation Hardened 
RHBD Radiation Hardened by Design 
RISC Reduced Instruction Set Computer 
RS Reed–Solomon 
SBF Single Bit Flip 
SBFT Software Based Fault Tolerance 
SBU Single Bit Upset 
SDC Silent Data Corruption 
SEE Single Event Effects 
SEL Single-Event Latch-up 
SET Single-Event Transient 
SEU Single-Event Upset 
SIMD Single Instruction, Multiple Data 
SISD Single Instruction, Single Data 
SoC System-on-Chip 
SSC Surrey Space Centre 
SSTL Surrey Satellite Technology Limited 
TID Total Ionising Dose 
TLR Thread-Level Replication 
TMR Triple Modular Redundancy 
TTMR Thread TMR 
UMA Uniform Memory Access 
W/R Write-Read 
Yasser Nezzari                                                                                                                       Chapter 1, Introduction 
 
1. INTRODUCTION 
In recent years, Integrated Circuits (ICs) have undergone dramatic technology scaling. Scaling brings 
smaller and faster transistors and higher transistor counts, which has led to higher-performing processing 
architectures with lower power consumption, smaller size, and a fraction of the cost. However, these new 
technologies have reduced noise margins and low threshold voltages, which makes them susceptible to transient 
effects caused by environmental or external factors, such as background or cosmic radiation. Another source 
of disturbance is intermittent internal effects specific to the IC, such as changes in operating conditions (e.g. 
temperature, component wear-out, component overload) [1-4]. The resulting errors are often soft errors that do 
not cause permanent damage. As transistors become increasingly prone to soft errors, high reliability should 
not be exclusive to mission-critical processing, but should also be extended to processors used in 
mainstream computing and embedded architectures.  
For mission-critical domains such as space, Radiation Hardened (RH) and Radiation Hardened By Design 
(RHBD) processing architectures are developed for their high reliability in harsh environments. Although 
they are effective at mitigating Single Event Effects (SEEs), RH and RHBD processors still lag behind their 
commercial-off-the-shelf (COTS) counterparts in performance. The performance gap is estimated at 5 to 10 
years [5]. In addition, RH and RHBD processors are costly and consume more power. These limitations have 
steered new space developers towards COTS components. 
To mitigate errors, hardware redundancy can detect and recover from SEEs [6, 7]; however, it is only 
suitable for domains that have no budget restrictions. An example of hardware redundancy is hardware Error 
Correction Codes (ECC), which can be costly in terms of both performance and power [8, 9]. Hardware 
redundancy can also be impractical in embedded systems because of power constraints. Software error 
detection and recovery techniques are more appealing for tackling soft errors because of their flexible 
implementation and low cost. They can be applied to COTS processors, allowing designers to gain an order of 
magnitude in performance compared with hardware redundancy.  
Current implementations of software protection techniques still fall short in terms of both performance and 
the ability to detect and correct errors. The performance shortfall stems from inefficient use of the abundant 
memory and CPU resources of COTS processors, which leads to a bottleneck. Abundant resources refer here to 
the clock frequency and the size of the RAM and caches. Power consumption is out of scope for this research. 
Most techniques that detect and recover from errors cover only the CPU registers; most of the literature 
ignores the protection of the memory system, on the assumption that the memory has hardware ECC protection. 
In our work, we demonstrate that Read/Write (R/W) operations from/to the memory are as important as the 
rest of the instructions in the study of reliability.  
This research contributes to mitigating space-borne single event upsets (SEUs) on processor architectures 
by extending compiler functionality so that regular coding practices used on Earth can also be used in space. 
The increasing need for higher processing capabilities in space applications continually pushes the space 
industry to consider using COTS components. The challenge in utilising COTS is to provide the requisite 
safety and reliability in an extremely challenging radiation environment. Common practices in industry such 
as hardware triplication, hardware-based error correction, and cross strapping allow for higher performance 
and reduced risk from SEUs in space. While both Radiation Hardened (RH) and Radiation Hardened By Design 
(RHBD) processors are less susceptible to radiation effects, they are expensive and only target a very small 
market (aerospace/defence). By exploiting opportunities in commercial multi-core architectures and treating 
such a system as a set of redundant processors, we can leverage the multi-core architecture in software, 
together with its high clock frequency and throughput, thus providing an order of magnitude increase in 
on-board processing capability with less volume, mass, and power. 
Software-based methods for multi-core processor fault tolerance to single event upsets (SEUs), which cause 
interrupts or ‘bit-flips’, are reviewed, and we propose to utilise additional cores and memory resources 
together with newly developed software protection techniques. This work also assesses the optimal trade space 
between reliability and performance. We base our new developments on the modern compiler framework “LLVM”, 
as it is ported to many architectures. Within it we implement optimisation passes that automatically add 
protection techniques, including N-modular redundancy (NMR) and error detection and correction (EDAC), at 
assembly/instruction level for the languages supported by LLVM. The optimisation ‘passes’ modify the 
intermediate representation of the source code, meaning they can be applied to any high-level language and 
any processor architecture supported by the LLVM framework. In our initial experiments, we separately 
implement triple modular redundancy (TMR) and error detection and correction codes, including Hamming and 
Bose-Chaudhuri-Hocquenghem (BCH), at instruction level. We combine these two methods for critical 
applications: we first apply TMR to the instructions, and then use FEC (Forward Error Correction) as a 
further measure when TMR is not able to correct the errors originating from SEUs.  
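As an illustration of the FEC stage described above, the following Python sketch implements a textbook 
Hamming(7,4) code that corrects a single flipped bit in a 4-bit value. It is only a minimal sketch: the 
function names are hypothetical, and it does not reproduce the instruction-level LLVM implementation used in 
this work. 

```python
def hamming74_encode(nibble: int) -> int:
    """Encode a 4-bit value into a 7-bit Hamming(7,4) codeword.

    Bit positions are numbered 1..7; positions 1, 2 and 4 hold parity bits.
    """
    d = [(nibble >> i) & 1 for i in range(4)]  # d[0] is the least significant bit
    bits = {3: d[0], 5: d[1], 6: d[2], 7: d[3]}  # data bits
    bits[1] = bits[3] ^ bits[5] ^ bits[7]  # parity over positions 1, 3, 5, 7
    bits[2] = bits[3] ^ bits[6] ^ bits[7]  # parity over positions 2, 3, 6, 7
    bits[4] = bits[5] ^ bits[6] ^ bits[7]  # parity over positions 4, 5, 6, 7
    return sum(bits[pos] << (pos - 1) for pos in range(1, 8))


def hamming74_decode(codeword: int) -> int:
    """Correct at most one flipped bit and return the original 4-bit value."""
    bits = {pos: (codeword >> (pos - 1)) & 1 for pos in range(1, 8)}
    s1 = bits[1] ^ bits[3] ^ bits[5] ^ bits[7]
    s2 = bits[2] ^ bits[3] ^ bits[6] ^ bits[7]
    s4 = bits[4] ^ bits[5] ^ bits[6] ^ bits[7]
    syndrome = s1 | (s2 << 1) | (s4 << 2)  # position of the flipped bit, 0 if none
    if syndrome:
        bits[syndrome] ^= 1
    return bits[3] | (bits[5] << 1) | (bits[6] << 2) | (bits[7] << 3)


# Encode a value, flip one bit of the codeword, and recover the value.
codeword = hamming74_encode(0b1011)
recovered = hamming74_decode(codeword ^ (1 << 4))
```

A code of this kind corrects one bit per codeword, which is why it is used here only as a second line of 
defence after TMR voting. 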
After the initial implementations of TMR and different FEC algorithms, we propose the “Automatic 
Compiler Error-Detection and Recovery” (ACEDR), an original software error detection and recovery 
technique that automatically applies protection code using the LLVM compiler framework. The applied code 
is capable of automatically detecting and recovering from soft errors at runtime. This work is based on two 
LLVM passes: an analysis pass and a transformation pass. The analysis pass is executed on the intermediate 
representation (IR) code to be protected and provides statistics and information about memory instruction 
dependencies. The transformation pass adds the protection code by inserting redundant instructions and 
calling a voter function to detect and recover from errors at runtime. The voter works as in triple modular 
redundancy (TMR). We will show in this work the importance of protecting both the memory (R/W) instructions 
and the CPU registers (arithmetic and logic operators, and branching) for the outcome of the reliability 
predictions. We start by studying the error rate reductions obtained with partial protection: protecting only 
the CPU registers, then only the memory, and then both combined [10]. The error rate is defined as the number 
of injections that caused an error divided by the total number of injections. This work has some limitations: 
it only protects the memory and CPU registers accessed by instructions (not the instruction itself), meaning 
that this work 
does not consider bit flips that would transform instructions into other instructions or into 
jump/conditional jump instructions. According to [141], it is possible to protect the protection code itself 
using their self-check routine. 
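The voter and the error-rate definition above can be sketched as follows. This is a minimal Python 
illustration, assuming a bitwise majority vote over three integer copies; the names `vote` and `error_rate` 
are hypothetical, and the actual voter is called from the code added by the transformation pass. 

```python
def vote(a: int, b: int, c: int) -> tuple:
    """Return the bitwise majority of three copies and a flag that is
    True when the copies disagreed (i.e. an error was detected)."""
    voted = (a & b) | (a & c) | (b & c)
    error_detected = not (a == b == c)
    return voted, error_detected


def error_rate(erroneous_injections: int, total_injections: int) -> float:
    """Number of injections that caused an error divided by all injections."""
    if total_injections <= 0:
        raise ValueError("total_injections must be positive")
    return erroneous_injections / total_injections


# Three redundant copies of the same addition; one copy suffers a bit flip.
original = 20 + 22
corrupted = original ^ (1 << 3)  # simulate a single bit flip in one copy
result, detected = vote(original, corrupted, original)
```

A bitwise vote masks any corruption confined to one copy; it fails only when the same bit position is 
corrupted in two copies. 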
Using instruction-level redundancy allows the mitigation of one or more bit errors and can tolerate 
SEUs and multiple bit upsets (MBUs). ACEDR contributes to the state of the art with the following points: 
• Low overhead compared to the state of the art: less than 15% on an Intel Core i5-3470 and less than 
17% on a Raspberry Pi 3, 
• A reliability prediction model that predicts the reliability of all the processing architecture 
components and quantifies the reliability added by the software protection code. The prediction 
model estimates the worst-case scenario and does not consider the case where an error has 
occurred before writing to memory or loading to CPU registers, 
• High error detection and recovery, evaluated with a new fault injector, where error rates can be 
reduced to less than 1% in some benchmarks, 
• Protection of multiple data and CPU register types (i32, i32*, i1, i8, i8*, i64, float & double, float 
pointers & double pointers) [11], extending [10], in addition to both the CPU and the memory 
Read/Write instruction types, 
• Comparison of the reliability predictions with the reliability obtained from the injection experiment 
on the protected code. 
The trend in improving the reliability of COTS processing architectures is to use software protection 
techniques; however, most of these techniques add large overheads, and most of them only detect errors and 
have no recovery scheme. The overhead is caused by adding redundant code to the original code without 
optimising it to exploit the abundant resources of the COTS CPU, which creates a bottleneck. Most error 
detection and recovery techniques cover only the CPU registers and ignore the memory system, which can 
cause errors, since the RAM and the caches are highly vulnerable to SEUs. 
To cope with SEUs, system designers typically design for the worst-case scenario, which can degrade 
system performance dramatically, even though worst-case radiation is mostly encountered when the satellite 
passes near the South Atlantic Anomaly (SAA) or the poles in its orbit. This research proposes to find the 
optimal trade-off between performance and reliability, so that system designers do not have to sacrifice one 
for the other. If the operational mode of the processor can be switched to higher-performance modes when the 
SEU rate is low, and to more reliable modes only when the SEU rate increases, the reliability of the system 
can be improved without permanently reducing its performance.  
In our second proposed solution, the system has three modes of operation. The first is the 
unprotected mode, where the code runs without any protection techniques added to it, making it run 
without extra overhead. In the second mode, the code is protected using ITMR, which can reduce the 
error rate dramatically depending on the benchmark used, at a low overhead (6.44% in some benchmarks). In 
the third mode, the code is protected using the combined techniques (ITMR & TTMR), which gave the best 
results in terms of reliability: the error rate was reduced further than in the second mode, although the 
overhead was higher than with ITMR alone. ITMR relies on compiler passes to add redundant instructions that 
detect and correct errors [12]. TTMR replicates functions on threads instead of replicating instructions, 
and then calls a voter to detect and correct errors. Our approach to TTMR does not cover the case where 
threads require interpretation of pointers or global variables. 
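The thread-level replication described above can be sketched in Python as follows. This is an illustrative 
sketch only: `run_ttmr`, `vote_results` and `fib` are hypothetical names, the replicated function is assumed 
to be free of pointers and global state (matching the limitation stated above), and Python threads are used 
merely to show the structure, not to guarantee execution on separate cores. 

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def vote_results(results):
    """Return the value produced by at least two of the three replicas."""
    value, count = Counter(results).most_common(1)[0]
    if count < 2:
        raise RuntimeError("no majority among the three replicas")
    return value


def run_ttmr(func, *args):
    """Run func(*args) on three threads and vote on the three results."""
    with ThreadPoolExecutor(max_workers=3) as pool:
        futures = [pool.submit(func, *args) for _ in range(3)]
        results = [future.result() for future in futures]
    return vote_results(results)


def fib(n: int) -> int:
    """Iterative Fibonacci, used here as a small side-effect-free workload."""
    a, b = 0, 1
    for _ in range(n):
        a, b = b, a + b
    return a


protected_result = run_ttmr(fib, 10)
# A corrupted replica result is outvoted by the two correct ones.
masked_result = vote_results([55, 54, 55])
```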
A model that predicts the reliability under the different protection modes will be presented. It is based 
on a Markov chain and is extended to model the whole processing architecture, including all the CPU 
components (cores, RAM and caches). The model starts from the basic equations that predict the reliability at 
the instruction level. Depending on each component’s cross-section area, access rate and SEU error rate, it 
then predicts the reliability of every component, and finally the reliability of the whole system is obtained 
using combinational logic. Our prediction model includes multiple architectural and environmental 
parameters. The reliability prediction model will be verified using fault injection on the different 
protection modes. 
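To make the idea of combining component reliabilities concrete, the sketch below uses standard textbook 
expressions: an exponential reliability for a constant failure rate, a product for serially connected 
elements, and the classical TMR expression with a perfect voter. These are assumptions made for illustration; 
the model developed in Chapter 4 includes further parameters (cross section, access rate) that are not 
reproduced here, and all function names are hypothetical. 

```python
import math


def component_reliability(failure_rate: float, t: float) -> float:
    """R(t) = exp(-lambda * t) for a constant failure rate lambda."""
    return math.exp(-failure_rate * t)


def series_reliability(reliabilities) -> float:
    """Serially connected elements: the system works only if all of them work."""
    return math.prod(reliabilities)


def tmr_reliability(r: float) -> float:
    """TMR with a perfect voter: at least two of three modules must work.

    R_TMR = 3 * R^2 - 2 * R^3
    """
    return 3 * r**2 - 2 * r**3


# Example: three serial components, then the same module under TMR.
r_series = series_reliability([0.99, 0.98, 0.97])
r_tmr = tmr_reliability(0.9)
```

Note that TMR only improves on a single module when the module reliability exceeds 0.5; below that value the 
redundant arrangement is less reliable than the unprotected module. 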
The adaptive multicore solution contributes to the state of the art with the following points: 
• Low overhead with the adaptive solution, which switches modes depending on the SEU error rate and 
utilises redundant independent instructions, taking advantage of the abundant CPU resources, 
• A reliability prediction model for all the processing architecture components,  
• High error detection and recovery rate, where the error rate has been reduced to less than 1% in some 
benchmarks. 
1.1 Research Methodology 
In this project, we aim to use COTS multicore processors, which are 5 to 10 years more advanced than the 
RH and RHBD processors currently used in space, to enable space applications to benefit from the technology 
revolution in COTS components. The new COTS system will be protected against SEUs from space 
radiation by software that covers both the CPU registers and the memory. The abundant resources 
of the multicore architecture will be exploited to implement the appropriate protection technique, 
depending on the availability of these resources and the required protection level. The protection techniques 
will also depend on the code to be protected, taking into consideration its memory and CPU 
usage. 
The implementation of the different protection techniques will be automated, meaning that they are 
added to the code automatically.  
Additional computations originating from the protection code will produce an overhead. To overcome this 
problem, redundant resources will be used. Our work will manage these resources in order to minimise the 
overhead. This could be achieved by automatic parallelisation and different code optimisation techniques. A 
trade-off between reliability and overhead will be studied to find the optimal solution. 
At the start, the state of the art will be examined in order to find the gaps in the current software 
protection used in COTS; this will help narrow down the research proposal so that it addresses all of the 
research aims and objectives. Once the proposal is completed, initial experiments will be conducted as a 
proof of concept and to compare the different software protection algorithms in terms of their ability to 
detect and recover from errors, while also taking into account the overhead they add. Once a protection 
technique has been chosen from those tested, a reliability prediction model will be developed to estimate 
the reliability of the whole processing architecture with and without the software protection and to compare 
the results. The reliability prediction model will depend on multiple internal and external parameters. The 
internal parameters are specific to the processing architecture used, such as the number of CPU cores and 
the number of pipelines, while the external parameters depend on the environment in which the processor is 
operating, such as the SEE rate.  
Software error detection and correction will always add an overhead that depends on the processing 
architecture and on the protection algorithm used. An adaptive system reduces this overhead by switching 
protection modes depending on the SEE rate: the system runs in a lower-protection, high-performance mode 
when the error rate is low, and in a high-reliability mode with lower performance when the error rate is 
high.  
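The mode switching described above can be sketched as a simple threshold rule over the three protection 
modes named in the introduction (unprotected, ITMR, and combined ITMR & TTMR). The sketch is hypothetical: 
the thresholds, their units and the function name `select_mode` are assumptions made for illustration, not 
values or interfaces taken from this work. 

```python
from enum import Enum


class Mode(Enum):
    UNPROTECTED = "unprotected"
    ITMR = "ITMR"
    ITMR_TTMR = "ITMR & TTMR"


def select_mode(see_rate: float, low_threshold: float, high_threshold: float) -> Mode:
    """Pick a protection mode from the observed SEE rate.

    The thresholds are hypothetical tuning parameters expressed in the
    same unit as see_rate.
    """
    if low_threshold > high_threshold:
        raise ValueError("low_threshold must not exceed high_threshold")
    if see_rate < low_threshold:
        return Mode.UNPROTECTED  # low error rate: favour performance
    if see_rate < high_threshold:
        return Mode.ITMR  # moderate error rate: instruction-level TMR
    return Mode.ITMR_TTMR  # high error rate: favour reliability


quiet_orbit_mode = select_mode(see_rate=0.1, low_threshold=1.0, high_threshold=10.0)
saa_pass_mode = select_mode(see_rate=25.0, low_threshold=1.0, high_threshold=10.0)
```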
All the software protection techniques will be validated using fault injection, which performs random 
bit-flips on the code and causes different error types similar to those produced by SEUs on the CPU. The 
injection is performed on the instruction’s data, which corrupts the instruction’s result. This will show 
the ability of the software protection techniques to detect and recover from different types of errors.   
The software injection experiment will also provide the parameters used in the reliability prediction model, 
allowing the precision of the prediction equations to be determined. 
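A toy version of such an injection campaign is sketched below. It is a simplified illustration under stated 
assumptions: every injection flips one random bit in the result of a single addition, the protected variant 
keeps three copies and corrupts exactly one of them, and all names are hypothetical. A real campaign on 
compiled benchmarks would also see injections that are masked by the program itself. 

```python
import random


def flip_random_bit(value: int, width: int, rng: random.Random) -> int:
    """Flip one randomly chosen bit among the lowest `width` bits."""
    return value ^ (1 << rng.randrange(width))


def majority(a: int, b: int, c: int) -> int:
    """Bitwise majority vote of three copies."""
    return (a & b) | (a & c) | (b & c)


def injection_campaign(protected: bool, trials: int = 1000,
                       width: int = 32, seed: int = 0) -> float:
    """Return the error rate: erroneous injections / total injections."""
    rng = random.Random(seed)
    errors = 0
    for _ in range(trials):
        x = rng.randrange(1 << (width - 1))
        y = rng.randrange(1 << (width - 1))
        golden = x + y  # fault-free reference result
        if protected:
            copies = [x + y, x + y, x + y]
            victim = rng.randrange(3)  # only one copy is hit
            copies[victim] = flip_random_bit(copies[victim], width, rng)
            observed = majority(*copies)
        else:
            observed = flip_random_bit(x + y, width, rng)
        if observed != golden:
            errors += 1
    return errors / trials


unprotected_rate = injection_campaign(protected=False)
protected_rate = injection_campaign(protected=True)
```

In this toy setting every injection corrupts the unprotected result, while the voter masks every injection 
that is confined to a single copy. 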
1.2 Research Aims  
• To discover how new and existing software protection algorithms can be integrated into a portable 
compiler, using direct manipulation of the source code’s intermediate representation, with the 
following requirements: 
o Utilise only compiler extensions at the optimisation stage of the compilation process in order 
to apply protection (a programmer will only need to compile using the implemented 
optimisation pass and the protection algorithms will be added by the transformation pass). 
o Protect both the physical memory system and the CPU registers of the microprocessor 
architecture automatically. 
• The memory and CPU correction codes and protection algorithms are to be adaptive to error rates, 
meaning the algorithm is able to change its operating mode, and thus its protection ability, depending 
on the rate of SEEs. 
• The new optimisation passes developed must be parallelisable, whereby additional computations and 
overheads can be reduced or traded off using redundant resources of the multicore processor 
architecture. 
• Profile several common C/C++ benchmarks before and after adding different protection techniques. 
This must include the number of additional cycles, added instructions, increased time overheads, and 
any static or dynamic memory and cache overheads.  
• Test the compiler’s error-correcting ability by injecting bit-flip errors as a software simulation of 
SEUs, and compare the number of errors injected for each protection technique with the number detected 
and corrected. 
1.3 Novel Contributions  
This work will contribute to safety-critical systems research across a wide range of applications and 
industries, and is applicable to many high-level languages with the largest coverage of microprocessor 
architectures for a given compiler (those supported by the LLVM framework). The concept of using software 
protection techniques has been demonstrated in EDDI [13], SWIFT [14], Shoestring [120], Fault Tolerance 
Software Checking [121], DAFT [125], SRMT [126] and many other references. 
A number of key outcomes are expected through this research: 
• A new error and reliability prediction model for modern multi-core processor architecture components,  
• The “Automatic Compiler Error-Detection and Recovery” (ACEDR) method for automatic code 
protection with compiler manipulations and extensions. 
o This solution is portable across multiple processing architectures and multiple high-level 
languages to ensure high impact, 
o Low overhead compared to the state of the art: less than 15% on an Intel Core i5-3470 and 
less than 17% on a Raspberry Pi 3, 
o High error detection and recovery rate, where the error rate has been reduced to less than 1% in 
some benchmarks, 
o Protection modes include the protection of multiple data types (i32, i32*, i1, i8, i8*, i64, float 
& double, float pointers & double pointers) [10, 12], in addition to both the CPU registers and 
the memory Read/Write instructions types, 
• A new adaptive solution with low overhead that switches modes depending on SEE rates and 
utilises redundant independent instructions, taking advantage of the CPU’s abundant resources. The 
adaptive solution has reduced the error rate to less than 1%, 
• First direct comparisons of reliability predictions with injection experiments for the ACEDR and 
Adaptive-ACEDR solutions. The current state of the art has limited to no such comparisons. 
1.4 Thesis Structure  
Chapter 2, titled Literature Review, starts by presenting the microprocessor architecture, from the Von 
Neumann machine to the current small, low-power COTS multicore architectures. 
Most of these architectures have four or more cores, a shared RAM, and two or more levels of 
cache that can be private or shared. Pipelining and caches are also presented, as they are included 
in the reliability prediction model of Chapter 4. The radiation effects on electronic systems are introduced by 
defining the different radiation environments and classifying the radiation damage. The terminology section defines 
terms such as the Linear Energy Transfer (LET), the cross section, the critical charge and the 
sensitivity of a circuit. The radiation hardening section describes the approaches used to harden circuits, 
such as hardening by design and hardening by process. The section on the theory of error detection and recovery 
introduces the different algorithms used in software for SEU mitigation, including the 
mathematical background behind them. The compiler section introduces compilers in a generic way, then 
focuses on LLVM, which is the backbone of the first part of this research. The Software Protection 
Techniques section reviews the different techniques for fault tolerance in multicore architectures that have been 
studied; these techniques have been implemented at different layers (User, Kernel and Compiler). At the end 
of this section, the gaps in previous work are identified in order to propose the ideal solution.  
Chapter 3 presents the proposed solution, which aims to address all of this research’s objectives, first showing 
how the radiation-hardened code is generated using the LLVM compiler. The use of FEC and modular redundancy 
is also introduced. The proposal also shows how the reliability prediction model is built, starting 
from the basic Markov model and then building the model of the whole processing architecture. The adaptive 
model is also proposed as the solution offering the best trade-off between reliability and 
performance. The last section of the proposal covers fault injection, where an LLVM fault injector is developed 
to evaluate the ability of the software protection to detect and correct errors. 
Chapter 4 presents the theoretical reliability prediction equations modelling the whole processing 
architecture. The basic TMR reliability equation is shown first, then the prediction equations for the CPU, RAM 
and caches are developed. The prediction model includes multiple parameters, which can be internal, 
specific to the processing architecture used, or external, related to the environment in which the processor is 
operated. The model also accounts for the different instruction types and their sensitivities. 
Chapter 5 presents the implementation of the different error detection and correction algorithms, where TMR, 
Hamming and BCH codes have been implemented using our compiler passes to detect and correct errors. This 
chapter also includes an evaluation section where the overhead of the different algorithms is 
evaluated and compared. The ACEDR section follows from these initial results, in which TMR 
gave the best performance amongst the FEC codes used. ACEDR is used to protect all 
instruction types of the processor. The error detection and recovery ability of the protection code is 
evaluated in the Error Injection section. 
Chapter 6 introduces the Adaptive Multicore Protection, where the platform concept is shown, including 
the adaptive implementation and the different operation modes used in this model. The adaptive solution is 
evaluated in terms of the overhead and the ability of the different protection modes to detect and recover from errors. 
The evaluation of the error detection and recovery was carried out using the fault injector that was developed 
using LLVM. 
Chapter 7 includes the conclusion and future work, where the key findings and contributions of every chapter 
are summarised, and future research possibilities are proposed.   
2. LITERATURE REVIEW   
The next generation of space missions demands more powerful processors capable of handling different 
applications, including communication and Earth observation. The higher processing ability will be used to 
improve multiple aspects of space applications, such as the downlink bottleneck, by allowing more raw data 
to be processed and compressed before transmission to the ground station. It can also improve the autonomous 
capabilities of new space missions, such as constellations of several space systems, automatic manoeuvring 
and guidance, and in situ studies, and keep up with highly demanding real-time applications. 
The constraints for a processor in the space domain are its size, weight and power consumption, which are 
dictated by the spacecraft’s physical dimensions and the capacity of its photovoltaic solar array. Another constraint 
is the processor’s reliability, which is why, for mission-critical domains such as space, Radiation Hardened (RH) 
and Radiation Hardened By Design (RHBD) processing architectures are used, because of their resilience 
against SEEs. However, these types of processors trail in terms of performance; the performance gap 
between them and COTS processors is estimated to be 5-10 years [5]. Furthermore, RH and RHBD processors are more 
expensive and consume more power. All of these constraints are driving the space domain towards the use of 
COTS processors.  
The large performance gap between COTS and RH/RHBD processors is due to technology 
scaling in COTS Integrated Circuits (ICs), which enables faster and smaller transistors in the same IC, leading to 
increased performance with smaller size, lower power dissipation and a fraction of the cost. However, COTS devices have 
low threshold voltages and narrow noise margins, making them vulnerable to transient effects caused 
by environmental or external factors, such as cosmic radiation. Another source of disruption is 
intermittent effects caused by internal factors specific to the IC when the operating conditions change 
(e.g. component wear-out, component overload) [1-3]. Soft errors do not cause permanent damage. As 
transistors are becoming increasingly prone to such errors, high reliability should not be exclusive to mission-critical 
processing, but should also be considered for processors used in mainstream computing and 
embedded architectures. 
In this chapter, COTS processor architectures will be introduced, starting with the basic model, the 
concept of pipelining, memory caches, and the different multicore architectures. The chapter also includes 
background on compilers, with a focus on the modern example “LLVM” and its different stages, as related 
to our main objective of extending a compiler to apply different protection algorithms. LLVM enables 
the user to modify the intermediate representation of the supported high-level languages and add redundant 
instructions with checks in order to detect and correct errors. LLVM can also be used as a tool to evaluate the 
reliability of a system by means of software fault injection. 
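As a minimal sketch of the idea of redundant instructions with checks, the following Python fragment triplicates a computation, flips one bit in one copy to mimic an SEU, and recovers the correct value with a bitwise majority vote. The function names (flip_bit, majority_vote, protected_add) are illustrative assumptions; they are not part of LLVM or of the passes developed in this thesis.

```python
def flip_bit(value: int, bit: int) -> int:
    """Return value with the given bit inverted (a simulated single-event upset)."""
    return value ^ (1 << bit)


def majority_vote(a: int, b: int, c: int) -> int:
    """Bitwise 2-out-of-3 majority: each output bit takes the value held by at least two copies."""
    return (a & b) | (a & c) | (b & c)


def protected_add(x: int, y: int, faulty_copy: int = -1, bit: int = 0) -> int:
    """Compute x + y three times, optionally corrupt one copy, then vote."""
    copies = [x + y, x + y, x + y]                 # redundant instructions
    if faulty_copy >= 0:
        # Hypothetical fault injection: corrupt one of the three copies.
        copies[faulty_copy] = flip_bit(copies[faulty_copy], bit)
    return majority_vote(*copies)                  # check and correct


if __name__ == "__main__":
    print(protected_add(20, 22))                        # no fault injected
    print(protected_add(20, 22, faulty_copy=1, bit=3))  # corrupted copy is outvoted
```

A single flipped bit in one copy is masked by the vote. If two copies were corrupted at the same bit position, the vote would return the wrong value, which is the usual limit of triple redundancy.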
The radiation hardened processing architectures will be introduced in this chapter, showing the different 
fabrication techniques used to harden processing architectures, in addition to some known RH and RHBD 
processing architectures. The radiation effects on electronic systems are detailed in this chapter. This includes 
the terrestrial and space radiation environments, and the different categories of radiation effects, such as 
cumulative dose and SEEs. The error rates for different technology die sizes 
have been quantified for different orbits to better highlight the critical research questions.  
The state of the art in software protection techniques is also included in this chapter, where the protection 
could be implemented in different software layers (Application, Operating System, and Compiler). This covers 
different error detection and recovery techniques relying on redundancy in software and hardware. 
2.1 Terrestrial Processing Systems 
Technological advances have led to a dramatic growth in clock frequency and in the quantity of logic 
(the number of transistors) that a chip can host. These features have been exploited by computer architects in 
order to further boost performance using architectural techniques. 
This section provides the understanding of modern processing architectures that will be used to 
build the mathematical reliability prediction model in Chapter 4. We start by showing 
the basic Von Neumann Machine Model, demonstrating how the different processing architectural elements 
fit together. The processor pipeline and caches are shown, since they play a major role in the reliability 
prediction model. The different multicore architectures are distinguished by how their processing elements are 
connected, which can affect the reliability prediction model significantly. 
2.1.1 The Von Neumann Machine Model  
Despite the advanced engineering of today’s microprocessors, they still follow the classical Von 
Neumann machine model shown in Figure 2-1. The Von Neumann model [15] comprises four blocks: 
• A central processing unit (CPU) consisting of an arithmetic–logical unit (ALU) that carries out 
arithmetic and logical operations, registers used for quick storage of operands, a control unit that 
decodes instructions and sends them to be executed, and a program counter (PC) that specifies the 
address of the next instruction to be executed.  
• A memory that stores instructions, data, and intermediate and final results.  
• An input that passes data and instructions from the outside world (input devices) to the memory.   
• An output that passes the final results and messages to the outside world (output devices). 
The instruction execution cycle is carried out as follows [15]:  
• Fetch from memory the next instruction pointed to by the PC.  
• Decode the instruction using the control unit.  
• Execute the instruction.  
• Update the PC.  
• Write back to memory. 
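The five steps above can be sketched as a toy interpreter in which instructions and data share one memory, as in the Von Neumann model. The four-instruction set (LOAD, ADD, STORE, HALT) and the single accumulator register are assumptions made for illustration only and do not correspond to any real processor.

```python
def run(memory: list, max_steps: int = 100) -> list:
    """Execute a toy program stored in the same memory as its data."""
    pc = 0    # program counter
    acc = 0   # single accumulator register
    for _ in range(max_steps):
        instr = memory[pc]               # 1. fetch the instruction pointed to by the PC
        opcode, operand = instr          # 2. decode (role of the control unit)
        result_addr = None
        if opcode == "LOAD":             # 3. execute
            acc = memory[operand]
        elif opcode == "ADD":
            acc = acc + memory[operand]
        elif opcode == "STORE":
            result_addr = operand
        elif opcode == "HALT":
            break
        else:
            raise ValueError(f"unknown opcode {opcode!r}")
        pc += 1                          # 4. update the PC
        if result_addr is not None:      # 5. write back to memory
            memory[result_addr] = acc
    return memory


if __name__ == "__main__":
    # Addresses 0-3 hold the program, addresses 4-6 hold the data.
    program = [("LOAD", 4), ("ADD", 5), ("STORE", 6), ("HALT", 0), 2, 3, 0]
    print(run(program)[6])
```

Because the program and its operands live in the same list, a bit flip in memory could corrupt either an instruction or a data word, which is one reason both are of interest for protection.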
[Figure 2-1: block diagram of the CPU (control unit, ALU, state, PC and registers), the memory and the I/O (input and output), connected by the memory bus and the I/O bus.]
Figure 2-1 The Von Neumann Machine Model [2] 
Gradually, the CPU architecture has been improved with pipelining, allowing large performance gains. 
Along with microarchitectural advances, the strict sequential execution cycle has been relaxed so that 
several instructions can be processed concurrently across the pipeline stages. 
2.1.2 Pipelining  
Figure 2-2 shows the abstract view of the pipeline. The pipeline registers [2] sit between each stage and 
the next. A pipeline register stores all the information required to complete the execution of the 
instruction after it has passed through the stage to its left [16]. In Figure 2-2, at (T+3), five instructions are 
executing simultaneously.  
[Figure 2-2: five instructions, each passing through the IF, ID, EX, MEM and WB stages in successive cycles from T to T+8, with five instructions in progress once the pipeline is full.]
Figure 2-2 Snapshot of Sequential Program Execution [2] 
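The overlap of instructions in a five-stage pipeline can be sketched as follows. The sketch assumes an ideal pipeline with one cycle per stage and no stalls, and it numbers cycles from 0 at the first fetch; this numbering is an assumption of the sketch and need not coincide with the T labels of Figure 2-2.

```python
STAGES = ["IF", "ID", "EX", "MEM", "WB"]


def stage_of(instr: int, cycle: int):
    """Stage occupied by instruction `instr` (0-based) at `cycle`, or None if it is not in the pipeline."""
    index = cycle - instr                 # instruction i enters IF at cycle i
    return STAGES[index] if 0 <= index < len(STAGES) else None


def in_flight(n_instr: int, cycle: int) -> int:
    """Number of instructions occupying some pipeline stage at `cycle`."""
    return sum(stage_of(i, cycle) is not None for i in range(n_instr))


def total_cycles(n_instr: int) -> int:
    """Cycles needed to complete n_instr instructions in an ideal pipeline."""
    return len(STAGES) + n_instr - 1


if __name__ == "__main__":
    # Print one row per instruction and one column per cycle.
    for i in range(5):
        row = [stage_of(i, c) or "." for c in range(total_cycles(5))]
        print(f"I{i}: " + " ".join(f"{s:>3}" for s in row))
```

Under these assumptions five instructions complete in nine cycles, whereas strictly sequential execution at one cycle per stage would need twenty-five.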
2.1.3 Caches  
As the gap between processor and memory performance widened [17], multilevel caches became a 
necessity. In the 1990s, while processor speeds increased by 60% per year, memory delays decreased by only 
7% per year. Caches are used as high-speed buffers between main memory and the CPU [15].  
2.1.3.1 Memory Organization 
The CPU executes the arithmetic and logic operations of the system. The memory organization of a typical 
processing architecture [17] is shown in Figure 2-3: at the core is the CPU, which operates at the highest frequency, 
followed by the cache memory, then the RAM and finally a storage device. 
 
Figure 2-3 The Memory Organization of a System [17] 
When a program is started, or an instruction is to be executed on data, the instruction and the 
data are copied from a slow storage device (hard drive, CD drive, etc.) to a faster one (RAM). The RAM can 
thus be considered a cache for the slow storage devices. The RAM is faster than the storage 
device but slower than the CPU. To bridge this gap, caches are placed between the CPU and the 
RAM. A cache is built from SRAM (Static RAM), which is faster but costlier than DRAM (Dynamic RAM) because, 
unlike a DRAM cell, which stores each bit with one transistor and one capacitor, an SRAM cell is composed of six 
transistors. In addition, DRAM must be refreshed periodically, whereas SRAM does not need refreshing. 
In multicore processors, the cache memory has a multilevel structure in which the levels differ in speed 
and size. Data is transferred from RAM to the 3rd level of cache (L3), which is faster than the RAM but larger 
and slower than L2. Using the 2nd level of cache (L2) can speed up the process, depending on the memory 
organisation and data associativity model. The L2 cache is located close to the processor. Not all modern CPUs contain 
a 3rd level of cache; some architectures use only the 1st and 2nd levels. The 1st level of cache (L1) is built 
into the CPU itself; it is the fastest of the levels but the smallest in storage capacity. This 
cache level is used to store the most commonly used data and instructions. 
2.1.3.2 Cache Controller and Memory Processes 
As explained previously, caches are used to enhance the system’s performance by closing the performance gap 
between the CPU and the RAM. The question that arises next is how the cache knows which 
data or instructions must be kept closer to the processor. For that specific task, a cache controller associated with 
a given level of cache is implemented [17].  
The cache controller stores the instructions and data that are used most frequently by the computer 
(temporal locality). As an analogy, when someone needs a formula from a book, instead of opening the book every 
time they use it, they can copy it onto a piece of paper, making it always ready to use. 
The cache also fetches the data most likely to be read in the near future (spatial locality, i.e. sequential access). As an example, 
when a folder contains images numbered 1 to 5 and the user opens image 1, image 
2 will be cached. 
When the processor needs to execute an instruction, it first looks for it in the registers; if the instruction 
is not found there, it checks the first level of cache (L1), then the 2nd and finally the 3rd level 
of cache [17]. A cache hit occurs when the requested data is found in the cache; a cache miss occurs when the data 
requested by the processor cannot be found in the caches, which generates a delay or time overhead requiring 
the processor to stall. In the case of a cache miss, the cache controllers between the levels try to retrieve the 
data from the respective lower level, from the L1 cache down to external storage, as shown in Figure 2-4. 
 
Figure 2-4 Different Cache Levels [17] 
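The lookup order described above can be illustrated with a small model of the hierarchy. The Level class, the read function and the latency figures are hypothetical and chosen only to show how a miss at each level adds delay; eviction and associativity are deliberately not modelled.

```python
class Level:
    """One level of the memory hierarchy: a name, an access latency and its contents."""

    def __init__(self, name: str, latency: int, contents=None):
        self.name = name
        self.latency = latency              # illustrative cycle counts, not measured values
        self.contents = dict(contents or {})


def read(address: int, hierarchy: list):
    """Search the levels in order; on a hit, copy the data into every faster level that missed."""
    cycles = 0
    for depth, level in enumerate(hierarchy):
        cycles += level.latency             # every level probed adds its latency
        if address in level.contents:       # hit at this level
            value = level.contents[address]
            for upper in hierarchy[:depth]:
                upper.contents[address] = value   # fill the caches that missed
            return value, level.name, cycles
    raise KeyError(f"address {address:#x} not present in any level")


if __name__ == "__main__":
    hierarchy = [Level("L1", 4), Level("L2", 12), Level("L3", 40),
                 Level("RAM", 200, {0x10: 99})]
    print(read(0x10, hierarchy))   # first access misses in every cache
    print(read(0x10, hierarchy))   # second access hits in L1
```

The second read of the same address is served by L1, which is the behaviour the cache controller relies on for frequently used data.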
2.1.4 Multiprocessors  
Chip multiprocessors (CMPs) emerged to mitigate the power consumption problem that came with 
increasing frequency scaling [16]. In 1966, Flynn categorized computer architectures into four classes 
depending on the singularity or multiplicity of instruction and data streams: 
• Single instruction, single data (SISD): A sequential computer that can perform one operation at a 
time. This architecture has no parallelism in either the instruction or data streams. It fetches a single 
instruction stream (IS) from memory using its control unit (CU). The CU produces the control 
signals that drive the single processing element to operate on the single data stream. 
• Single instruction, multiple data (SIMD): An architecture containing multiple processing elements capable of 
executing the same instruction on multiple data elements simultaneously. Modern CPUs use this 
architecture to improve the performance of multimedia applications. A SIMD architecture can 
provide parallelism but not concurrency. 
• Multiple instructions, single data (MISD): In this architecture, multiple instructions operate on the 
same data. This is an uncommon architecture which is generally used for fault tolerance.  
• Multiple instructions, multiple data (MIMD): In this architecture, multiple processors work in 
parallel, meaning that multiple processes execute multiple instructions on different data elements. 
The different architectures are shown in Figure 2-5, Figure 2-6, Figure 2-7 and Figure 2-8: 
 
Figure 2-5 SISD Architecture [16] 
 
Figure 2-6 MISD Architecture [16] 
 
 
Figure 2-7 SIMD Architecture [16] 
 
Figure 2-8 MIMD Architecture [16] 
 
2.1.4.1 Shared Memory: Uniform vs. Non-Uniform Access (UMA vs. NUMA)  
In shared-memory architectures, the parallel processes share a common address space [2]. Main memory 
can be at the same distance from all processors or can be distributed among the processors. The first case is 
uniform memory access (UMA), and the second is non-uniform memory access (NUMA). In a symmetric 
multiprocessor (SMP), each processor-cache hierarchy could itself be a (chip) multiprocessor (Figure 2-9). 
Another UMA system, called the dance-hall architecture, is shown in Figure 2-10. The interconnection network 
can be a crossbar or an indirect network such as a butterfly network. 
[Figure 2-9: processors, each with its own cache, connected through a shared bus to the main memory.]
 
Figure 2-9 SMP Schematic [2] 
[Figure 2-10: processors, each with its own cache, connected through an interconnection network to the main memory modules.]
 
Figure 2-10 Dance-hall (UMA) System [2] 
 
In NUMA shared-memory architectures, each processing unit contains a processor, a cache, and a fraction 
of the global memory. The latter can either be centralised as shown in Figure 2-11 or decentralised (meshes) 
as in Figure 2-12. 
[Figure 2-11: processors, each with its own cache and local memory, connected through a centralised interconnection network.]
 
Figure 2-11 Schematic NUMA Organisation with 
Centralised Interconnection Network [2] 
[Figure 2-12: processors, each with its own cache and local memory, attached to a mesh of switches that forms the interconnection network.]
 
     Figure 2-12 Decentralised Interconnection Network [2] 
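The difference between UMA and NUMA can be expressed as a simple latency model. The function below is a sketch under stated assumptions: memory is split into equal blocks, one per node, and the local and remote latencies are arbitrary illustrative values rather than measurements of any platform.

```python
def access_latency(cpu: int, address: int, n_nodes: int, words_per_node: int,
                   local: int = 100, remote: int = 300, numa: bool = True) -> int:
    """Return an illustrative access latency for `cpu` reading `address`.

    UMA: every processor sees the same latency for every address.
    NUMA: the latency depends on whether the address lives in the processor's own memory.
    """
    if not numa:
        return local                                  # same distance from all processors
    home = (address // words_per_node) % n_nodes      # node that owns the address
    return local if home == cpu else remote


if __name__ == "__main__":
    print(access_latency(cpu=0, address=5, n_nodes=4, words_per_node=1024))      # local block
    print(access_latency(cpu=0, address=2048, n_nodes=4, words_per_node=1024))   # owned by node 2
    print(access_latency(cpu=0, address=2048, n_nodes=4, words_per_node=1024, numa=False))
```

In a model of this kind, data placement changes how long a value spends in transit, which is why the interconnection layout is treated as a parameter of the architecture.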
 
At this point, we have shown the different COTS processing architecture layouts, starting with the basic 
Von Neumann Machine Model, then extending to more modern processing architectures, which can be 
subdivided depending on how the architecture processes data and instructions. The other way to classify 
COTS processing architectures is by the network connecting their different components, such as the 
CPUs, the caches and the RAM. The main reason for the second classification is the caching system of the 
processor, which is why the caches and their operating principle have been shown. 
Showing the different levels of caches, the idea behind the use of pipelining and the different multicore 
layouts is important for Chapter 3, where the reliability study of the multicore processor depends on 
multiple external and internal parameters.  
 
2.1.5 Compilers 
A compiler is a computer program that converts source code written in a high-level programming language 
(the source language) into another computer language (the target language) [56].  
2.1.6 Structure of a Compiler 
The standard design for a static compiler is the three-stage design, containing the front end, the optimizer 
and the back end (Figure 2-13) [56].  
[Figure 2-13: Source Code → Frontend → Optimizer → Backend → Machine Code]
 
Figure 2-13 Compiler Stages 
For a better understanding of the compiler’s three stages, the example of “LLVM” [57] is considered. 
For our implementation, the LLVM compiler has been chosen over GCC because of its better performance 
(Section 2.4.13), in addition to the following reasons: 
• GCC is released under the GPL license, whereas Clang is released under a BSD-style license. 
• LLVM has better documentation and an easily understandable AST, unlike GCC, which is harder to 
master for new developers. 
• Clang is exposed as an API, which can be reused by analysis tools, refactoring tools and IDEs, as well as for code 
generation. GCC is built in a way that makes its API hard to use, and its design and policy make it hard 
to separate into the stages of the standard compiler design. 
• Clang is faster and uses less memory than GCC. 
• Clang has better support for C++. 
In the following Section 2.1.7, the “LLVM” compiler is explained. 
2.1.7 LLVM Compiler  
Applications have recently grown dramatically in size, which notably alters their runtime behaviour, especially as 
their components incorporate several different languages and support dynamic extensions and 
updates. The execution time of an application can either be spread throughout the application [58] or 
be concentrated in a few hot spots. Analyses and transformations must be applied to the code to make it run optimally. 
Optimizations can be performed at link time, at install time (which is machine-dependent), or at runtime 
(dynamic optimizations). There are also between-run (idle-time) optimizations, or profile-guided optimizations. 
Static analysis can be useful, especially for link-time purposes such as static debugging, static leak detection 
[59], and memory organization transformations [60]. More advanced analysis and transformation passes are 
emerging to ensure the healthy execution of programs, which can be implemented at load time or during software 
installation [61]. Enabling reoptimization of the code will provide architects with the tools required to advance 
processors and obtain better interfaces [62, 63]. LLVM, originally short for Low-Level Virtual Machine, is a compiler framework 
that aims to make program analysis and transformation readily available to programmers. LLVM uses an abstract RISC-like 
instruction set to represent a program, together with type information, explicit 
control flow graphs, and dataflow [64]. The LLVM code representation has three notable features. The 
first is a low-level, language-independent type system that can be used to implement data types and operations from high-level 
languages. The second is a set of instructions for type conversions and low-level address arithmetic that 
maintain type information. The last is a pair of low-level exception-handling instructions, used to implement 
language-specific exception semantics. 
LLVM is no longer treated as an acronym; it is the full name of the project. LLVM began as a research project at 
the University of Illinois. Written in C++, it is capable of optimizing programs at compile time, link time, run 
time, and "idle time" [65].  
The following languages have compilers that use LLVM: Common Lisp, ActionScript, Ada, D, 
Fortran, Swift, Python, R, Ruby, Go, Haskell, Java bytecode, Julia, Objective-C, OpenGL, 
Rust, Scala, C# and Lua [66]. The standard design of LLVM has the following three stages: 
2.1.7.1 Front-End (Figure 2-14) 
The Lexer: The first stage of the front end transforms a sequence of characters into a sequence of 
tokens and classifies them [67]. Lexing has two steps:  
• The scanning: The input sequence is divided into token classes. 
• The evaluation: The conversion of the raw input characters into processed values. 
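The two lexing steps can be sketched with Python’s standard re module. The token classes (NUMBER, IDENT, OP) and the function name lex are illustrative assumptions; this is not the Clang lexer, only a minimal instance of scanning followed by evaluation.

```python
import re

# Token classes recognised by this toy lexer (an assumption, not Clang's token set).
TOKEN_SPEC = [
    ("NUMBER", r"\d+"),
    ("IDENT", r"[A-Za-z_]\w*"),
    ("OP", r"[+\-*/]"),
    ("SKIP", r"\s+"),
]
TOKEN_RE = re.compile("|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC))


def lex(source: str) -> list:
    """Turn a character sequence into a list of (token class, value) pairs."""
    tokens = []
    pos = 0
    while pos < len(source):
        match = TOKEN_RE.match(source, pos)        # scanning: find the token class
        if match is None:
            raise SyntaxError(f"unexpected character {source[pos]!r} at {pos}")
        kind, text = match.lastgroup, match.group()
        pos = match.end()
        if kind == "SKIP":                         # whitespace produces no token
            continue
        value = int(text) if kind == "NUMBER" else text   # evaluation: raw text to value
        tokens.append((kind, value))
    return tokens


if __name__ == "__main__":
    print(lex("x+y"))
    print(lex("a + 42"))
```

For the input "x+y" the sketch yields three tokens: an identifier, an operator and a second identifier.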
The Abstract Syntax Tree (AST): The AST represents the source code of a programming language in tree form 
[67]. It is mainly used for semantic checking, where the compiler verifies the correct usage of the elements 
of the program and the language. The compiler also generates symbol tables based on the AST during 
semantic analysis. A full traversal of the tree allows the correctness of the program to be inspected. Next, the 
AST is taken as the backbone for code generation. 
Parser: The parser examines the code syntactically according to the rules of the language’s grammar. The parsing 
stage decides whether the string of tokens produced from the input code can be derived from the grammar used. A 
parse tree is built in this phase. The parser defines functions that organize the language into a data structure (the AST) [67].  
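[Figure 2-14: a high-level language (C, C++, ...) enters the front end; the lexer turns the input “x+y” into three tokens, the parser assembles the three tokens into an AST, and LLVM IR is generated from the AST.]
 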
Figure 2-14 Front End of LLVM 
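How a parser turns a token string such as the one for “x+y” into an AST can be sketched as follows. The tokens are assumed to be (class, value) pairs such as ("IDENT", "x") and ("OP", "+"); the node classes Name and BinOp and the grammar, operand (OP operand)*, are illustrative assumptions. All operators are treated as left-associative with equal precedence, which a real front end would not do.

```python
from dataclasses import dataclass


@dataclass
class Name:
    """Leaf node: an identifier."""
    ident: str


@dataclass
class BinOp:
    """Inner node: a binary operator applied to two sub-trees."""
    op: str
    left: object
    right: object


def parse(tokens: list):
    """Parse `operand (OP operand)*` into a left-associative tree; raise SyntaxError otherwise."""
    pos = 0

    def operand():
        nonlocal pos
        if pos >= len(tokens) or tokens[pos][0] != "IDENT":
            raise SyntaxError(f"expected identifier at token {pos}")
        node = Name(tokens[pos][1])
        pos += 1
        return node

    tree = operand()
    while pos < len(tokens):
        kind, op = tokens[pos]
        if kind != "OP":
            raise SyntaxError(f"expected operator at token {pos}")
        pos += 1
        tree = BinOp(op, tree, operand())   # the tree built so far becomes the left child
    return tree


if __name__ == "__main__":
    print(parse([("IDENT", "x"), ("OP", "+"), ("IDENT", "y")]))
```

A token string that does not fit the grammar, such as one ending in an operator, is rejected with a SyntaxError, which corresponds to the syntactic check described above.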
2.1.7.2 Middle Representation (Figure 2-15) 
IR Code Generation: LLVM IR has a well-defined format and is generated through the built-in 
APIs of LLVM. LLVM IR is stored on disk either in text files with the .ll extension or in binary files 
with the .bc extension [67].  
LLVM Optimizations: LLVM allows the IR to be optimized to generate more efficient code, prior to 
its conversion into assembly. An LLVM pass aims to optimize LLVM IR. A pass runs on the LLVM IR, 
examines it, identifies the optimization opportunities, and alters the IR to generate optimized 
code [57]. 
[Figure 2-15: IR generation produces LLVM IR, which the IR optimizations stage (where the different optimisation techniques are added) turns into optimised LLVM IR.]
 
Figure 2-15 Optimization Phase of LLVM 
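The way a pass examines and rewrites the IR can be illustrated on a toy representation. The sketch below does not use the LLVM API: the IR is assumed to be a list of (destination, opcode, operands) tuples in SSA form, and fold_constants is a hypothetical constant-folding pass written only to show the examine-then-alter pattern.

```python
def fold_constants(ir: list) -> list:
    """Replace `add` instructions whose operands are both known constants by a `const`."""
    known = {}   # SSA name -> constant value discovered so far
    out = []
    for dest, op, args in ir:
        if op == "const":
            known[dest] = args[0]
            out.append((dest, op, args))
        elif op == "add" and all(a in known for a in args):
            known[dest] = known[args[0]] + known[args[1]]
            out.append((dest, "const", (known[dest],)))   # folded at compile time
        else:
            out.append((dest, op, args))                  # left unchanged
    return out


if __name__ == "__main__":
    toy_ir = [
        ("%a", "const", (2,)),
        ("%b", "const", (3,)),
        ("%c", "add", ("%a", "%b")),   # both operands known: can be folded
        ("%d", "add", ("%c", "%x")),   # %x is unknown: must stay an add
    ]
    for instruction in fold_constants(toy_ir):
        print(instruction)
```

A protection pass follows the same pattern, except that instead of removing work it inserts redundant instructions and checks into the instruction list.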
2.1.7.3 LLVM Backend (Figure 2-16) 
LLVM’s back end takes the intermediate representation after the optimization phase and turns it into 
assembly or executable code. The optimizations added in this phase are architecture dependent [65]. 
[Figure 2-16: in the backend, code generation turns the IR into assembly (labelled “ARM Assembly” in the figure), and likewise turns LLVM IR with protection algorithms into protected machine language. The figure contains the following example.]
LLVM IR:
    define i64 @foo(i64 %a, i64 %b) {
      %1 = add i64 %b, %a
      ret i64 %1
    }
Generated assembly:
    foo:
      push %rbp
      mov %rsp,%rbp
      sub $0x1c,%rsp
      mov %rdi,-0x8(%rbp)
      mov %rsi,-0x10(%rbp)
      mov -0x10(%rbp),%rax
      mov -0x8(%rbp),%rcx
      add %rcx,%rax
      mov %rax,-0x18(%rbp)
      mov -0x18(%rbp),%rax
      leaveq
      retq
 
Figure 2-16 Backend of LLVM 
Discussion 
LLVM is efficient, with lifelong analysis and transformations used to optimize the code while remaining 
transparent to programmers. It represents a program using a low-level, typed SSA instruction set, without imposing 
runtime environment constraints. Its language-independent representation enables all the code for a program, 
including system libraries and parts implemented in different languages, to be compiled and optimized 
together.  
LLVM is capable of applying optimizations at multiple phases of the software lifetime: statically, 
dynamically, and at idle time using the profiling information collected from programmers. 
LLVM currently provides a robust link-time global and interprocedural optimizer, low-overhead tracing 
for dynamic optimization, and Just-In-Time and compile-time code generation. Some aggressive 
transformations that are normally used only on type-safe languages can also be applied. LLVM’s intermediate 
representation is close in size to x86 machine code and about 25% smaller than SPARC code.  
The choice of this powerful tool, which provides the ability to apply multiple optimizations to the code, will enable this 
research to produce fault-tolerant code that runs efficiently and with minimum overhead. This research will 
use this tool to mitigate SEEs using redundancy, where multiple instruction and data types can be 
protected using passes that operate on the high-level language code after it has been transformed 
into its intermediate representation. Note that the use of LLVM passes to implement this work will make it 
portable to multiple processing architectures, and the transformation passes can target 
multiple high-level languages. 
2.2 Space Processing Systems 
Radiation in space originates from solar flares, coronal mass ejections (CMEs), galactic cosmic rays, 
solar winds, solar particle events, the Van Allen radiation belts, etc. Radiation environments contain particles such 
as electrons, neutrons, heavy ions, and photons. The two main categories of radiation effects are cumulative 
effects and Single Event Effects (SEEs) (Figure 2-17). 
 
Figure 2-17 Various Radiation Effects Induced in Electronic Devices [10] 
Soft errors occur when a striking charged particle causes a failure without permanently damaging the 
device. Hard errors, on the other hand, occur when permanent damage is caused. 
Long-term effects that change the circuit parameters are called cumulative effects. These effects are further 
subdivided into Total Ionizing Dose (TID), which results from holes trapped in the gate oxide region, altering the threshold voltages of the 
device and increasing leakage current, and Displacement Damage (DD), which is caused by 
energetic particles displacing atoms in the silicon or insulator, causing electrically active effects. DD is also 
referred to as TNID (Total Non-Ionizing Dose). 
SEEs are categorized into six different classes, including SEUs, which upset storage nodes or alter 
the logic level. Storage elements such as latches, SRAM cells and flip-flops are susceptible to SEUs. Multiple-Bit 
Upsets and Multiple-Cell Upsets (MBUs and MCUs) are SEUs affecting multiple nodes in memory. A Single Event 
Functional Interrupt (SEFI) is a type of SEE that disrupts the circuit’s functionality. A Single Event Transient 
(SET) usually disturbs the nMOS or pMOS transistors in the circuits. Combinational logic is susceptible to 
SETs. Single Event Latch-up (SEL) affects the parasitic PNPN structures of CMOS circuits. Single Event Burnout (SEB) is observed 
in MOSFET devices when the drain-source voltage is higher than the breakdown voltage of the parasitic structures. 
Single Event Gate Rupture (SEGR) damages the gate oxide insulation of the MOSFET. 
2.2.1 Radiation Effects on Electronic Systems 
Section 2.1 presented the basic notions of COTS processing architectures. This section discusses the 
radiation types in a generic way, then introduces the two radiation environments, terrestrial and space, and 
finally classifies and explains the different radiation effects. 
Radiation environments are classified into two categories: terrestrial (on the Earth) and extra-terrestrial 
(space). Both categories affect electrical devices. 
2.2.1.1 Terrestrial Environments 
The main sources of radiation on the Earth are radioactive materials and cosmic rays. The decay of 
radioactive elements produces α-particles, which can affect electrical devices. Residual traces of radioactive 
material may still exist in ICs; for example, [2, 68-76] have shown residual traces of Polonium 210, Uranium 
238 and Thorium 232 in aluminium, gold, processing chemicals, ceramic packages, and lead solder bumps. 
Electrical circuits operated inside a nuclear power plant could be vulnerable to radioactive materials. Nuclear 
weapons can also generate powerful radiation blasts. Galactic cosmic rays are highly energetic particles from 
remote sources in the galaxy. The solar wind is a stream of charged particles emitted from the upper 
atmosphere of the Sun. The Earth’s magnetic field, or magnetosphere, shields it from part of this radiation. 
The radiation that is not stopped by the magnetosphere can produce secondary particles when it hits and 
interacts with the atmosphere, which may lead to further interactions and particles. The atmosphere shields 
the Earth from the primary particles; however, the Earth is still exposed to the secondary particles, which 
can reach and affect electric circuits at sea level. The flux of protons and heavy ions increases with 
altitude: high-altitude aircraft experience radiation rates 100 times higher than at sea level [77]. Secondary 
radiation is the source of the majority of ionizing particles at ground level, and the majority of upsets are 
caused by neutrons [78, 79]. Geographic location, i.e., longitude, latitude and altitude, also affects the 
radiation flux intensity; a comprehensive study can be found in [80]. 
2.2.1.2 Space Radiation Environment  
Spacecraft and satellites do not benefit from the protection provided by the atmosphere. The shielding 
provided by the magnetosphere depends on the spacecraft’s position relative to the Earth. In the space 
environment, radiation originates mainly from the Van Allen radiation belts, solar winds, and galactic cosmic 
rays [81]. 
The Van Allen belts comprise an inner and an outer belt, formed by charged particles trapped by the 
Earth’s magnetic field. The particles are mainly electrons and protons. The density of highly energetic charged 
particles is higher in some regions of the magnetic field, creating two belts around the Earth. The high-energy 
protons are capable of causing upsets, but the energy of the trapped electrons is not sufficient for them to 
penetrate the spacecraft casing [82]. 
Solar winds consist mainly of protons, α-particles, heavy ions and electrons. The Sun has an extremely 
active surface, and this activity is observed as spots on its surface. A Coronal Mass Ejection (CME) [83] is 
the phenomenon in which the Sun occasionally ejects plasma from its corona. Some CMEs escape the Sun’s gravity, 
producing powerful solar winds that can reach the Earth. These long blasts of high-energy charged particles can 
perturb satellite functionality. Solar activity is periodic and is estimated from the sunspot count. The solar 
cycle lasts approximately 11 years, alternating between a maximum (solar max) and a minimum (solar min). The 
proton flux in the radiation belts is lower during solar max and higher during solar min. 
System designers should plan for an average or a worst-case flux, depending on where the system will be 
deployed. Most galactic cosmic rays are highly energetic heavy ions with an isotropic (omnidirectional) flux. 
Outside the magnetosphere, spacecraft are vulnerable to this radiation. Depending on their trajectories, space 
missions can be classified as Earth-orbiting, solar system, or deep space. The first class, Earth-orbiting 
spacecraft, includes satellites, which can have different types of orbit. Satellite orbits are classified into 
Low Earth Orbit (LEO), Medium Earth Orbit (MEO), and Geosynchronous Orbit (GEO) [84]. 
The altitude of LEO orbits is around 1000 km. A LEO orbit lies below the inner radiation belt. The South 
Atlantic Anomaly (SAA) [85] is a result of the non-uniformity of the Earth’s magnetic field, which causes a 
shift of the magnetic dipole relative to the Earth’s rotational axis. The SAA is geographically located over 
South America and has the highest radiation flux, so satellite electronics experience more upsets in this area 
than in any other part of the orbit. Polar cusps are created because of the lack of magnetic field over the 
polar areas. The magnetic field bends as it enters these regions, allowing a high density of charged particles 
in these unshielded areas. A satellite passing over the polar regions is therefore exposed to a high radiation 
flux. 
Satellites in Geosynchronous (GEO) orbits maintain their location with respect to the Earth and therefore 
remain over a fixed point on the Earth. The GEO orbit is at an altitude of 35,790 km. At this altitude, 
satellites cannot benefit from the shielding of the Earth’s magnetosphere, which makes their electronics 
vulnerable to cosmic rays. To maintain reliability, system designers for GEO satellites must therefore design 
for a higher level of radiation mitigation than for LEO and MEO orbits. 
For deep space missions, designers must always be prepared for the worst case [86] because the conditions 
the spacecraft will encounter are unknown for most of the mission. The mission lifetime is an important factor 
when designing a space mission, as is the timing of the mission with respect to the solar activity cycle, in 
order to determine the total amount and the peak of radiation that the electronics must endure. 
There are numerical models of the ionizing radiation environment in near-Earth orbits, and for evaluating 
radiation effects in spacecraft, based on data collected from space experiments. Two such models are 
CREME96 [47] and APEXRAD [87]. 
2.2.2 Technology and SEE Error Rate Relationships 
The previous Section 1.1.1 presented the classification of radiation damage and its two main categories, 
transient and permanent damage to electrical devices. This section introduces the terminology used to 
evaluate radiation effects. 
An SEU is defined as a bit flip in the logic state of an element, induced in the electric circuit by an 
ionizing particle. Two main parameters determine the ability of a particle to induce an SEU: the circuit 
parameters and the particle’s energy. Qcrit is the minimum charge that a particle must deposit in a node to 
cause an SEU. Qcrit depends on node parameters such as voltage and capacitance. The linear energy transfer 
(LET) quantifies the energy transferred to the device per unit length of the material through which the 
ionizing particle passes. The threshold LETth is the minimum LET that deposits Qcrit and therefore causes an 
SEU. 
When the errors are transient, their rate is referred to as the soft error rate (SER). To determine the SEU 
rate, the particle strikes on the circuit must be quantified first. The particle strike (or hit) rate is the 
flux, measured in particles/cm²/s. The probability that a particle causes an SEU depends on its energy, mass 
and angle of strike, and also on the circuit’s sensitivity. The sensitivity of a circuit, Pe = P(error | 
particle hit), is the probability that an error occurs in the circuit for a fixed particle energy. Sensitivity 
is less than one because not all parts of the electric circuit are susceptible to errors from incident 
particles (e.g., empty areas or areas with only wires). 
If an electric circuit has an area A, then its cross-section is σ = Pe × A, which characterizes the device 
response to ionizing radiation. The particle fluence is measured in particles/cm². The error rate is 
λ = flux × σ. A typical cross-section versus LET graph is shown in Figure 2-18. 
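As a minimal numerical sketch of the two relations above, the following Python fragment computes the 
cross-section σ = Pe × A and the error rate λ = flux × σ. The function names and all input values are 
hypothetical and were chosen only to make the arithmetic easy to follow; they are not measurements from this 
work or from the cited studies. 

```python
def cross_section_cm2(p_error: float, area_cm2: float) -> float:
    """Cross-section sigma = Pe * A, in cm^2."""
    if not 0.0 <= p_error <= 1.0:
        raise ValueError("p_error must be a probability between 0 and 1")
    return p_error * area_cm2


def error_rate_per_s(flux: float, sigma_cm2: float) -> float:
    """Error rate lambda = flux * sigma, with flux in particles/cm^2/s."""
    return flux * sigma_cm2


# Hypothetical inputs, chosen only to show the arithmetic.
sigma = cross_section_cm2(p_error=0.5, area_cm2=2.0)    # 0.5 * 2.0 cm^2
rate = error_rate_per_s(flux=0.25, sigma_cm2=sigma)     # errors per second
errors_per_day = rate * 86_400                          # seconds in one day
```

With a measured cross-section and the flux expected for a given orbit, the same two multiplications give the 
expected upset rate for that environment. 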
 
Figure 2-18 Cross Section and LET 
A total dose of radiation accumulates if the device is subjected to radiation over a long period of time. 
Total dose = dE/dm (E = energy, m = mass) is measured in rad (radiation absorbed dose). The total radiation 
dose is material dependent. At elevated dose rates, devices may fail at lower total doses because there is 
less time for the charge to anneal [88]. 
The reliability prediction models (Chapter 4) and the fault injection experiments (Chapters 5 and 6) 
require multiple hardware characteristics, including the SEU error rates for different circuit technology 
sizes and different radiation environments. 
A 10-year study of radiation effects on COTS memory devices in microsatellites is reported in [89]. The 
satellites, operating in LEO, were designed and built at the University of Surrey (UoS). The ionizing particle 
environment was explored using radiation monitoring payloads developed by UoS and the Defence Evaluation and 
Research Agency (DERA). This research evaluated the SEEs caused by galactic cosmic rays, geomagnetically 
trapped particles, and solar particles, using monitoring instruments combined with a programme of ground-based 
testing of memory devices. The results are shown in Table 2-1. 
 
Table 2-1 In-Orbit Observation of SEUs [89] 
 
Error rate predictions from low- and high-energy proton experimental data are presented in [90], covering 
multiple bulk Si and SOI circuits at technology nodes from 20 to 90 nm. The purpose of that research is to 
evaluate how much low-energy protons (LEPs) add to the overall SEU rate in orbit. LEPs add up to 4.3 to the 
total SEU rate. The angle of the incident particle also affects the SEU rate: in SOI circuits grazing angles 
were the worst, whereas in bulk circuits normal angles were the worst. Table 2-2 shows the estimated error 
rates for different technology sizes and different environments (orbits). 
Table 2-2 SEUs Rates for Different Circuit Technologies and Different Orbits [25]. 
 
Discussion 
From Table 2-1 and Table 2-2 it can be observed that the error rate of a processor in space depends on 
multiple parameters, which can be internal or external. The internal parameters are those of the circuit, such 
as the size of the caches and the RAM and the fabrication technology node. The external parameters are related 
to the environment in which the circuit operates, such as its orbit, its proximity to the South Atlantic 
Anomaly and to the poles, and the solar winds. The error rate also increases with the time of exposure. 
It is important to understand the effects of the internal and external parameters on the error rate, as 
this is a crucial step in determining the reliability of the processing architecture in Chapter 4. 
Understanding the effects of external parameters such as the spacecraft’s location, its proximity to the SAA 
and the poles, and its orbit is also important for Chapter 6, where we propose an adaptive system for error 
detection and recovery capable of changing its operation mode depending on the error rate in orbit. 
2.2.3 Radiation Hardening by Design  
Hardening can be achieved using two methods: layout modification or circuit design modification. 
Further explanation of radiation hardening by design can be found in Appendix A. 
2.2.4 Comparing Processors for Space and Earth Applications  
Figure 2-19 shows the processor architectures that have been used for space applications and their 
performance [91], expressed in millions of instructions per second (MIPS). The SpaceMicro Proton200K has the 
highest performance in the chart, with its 1 GHz dual-core [92]. 
COTS processors are much faster, cheaper, and lower in power consumption than RH or RHBD ones, which is 
why the space industry has invested in this technology. Nevertheless, their vulnerability to radiation effects 
is a flaw that needs to be carefully considered. 
 
Figure 2-19 Performance of Space Processors [91] 
(NoFT: No Fault Tolerance, SIFT: Software Implemented Fault Tolerance, TTMR: Time TMR) 
Research conducted by Pignol proposed three levels of availability (depending on the SEEs) [93], motivated 
by the need to optimize the trade-off between reliability and cost. High-safety missions such as deep space 
probes or manned missions demand very high fault detection and recovery regardless of the time overhead, mass, 
power consumption, and cost. Other missions, such as non-critical payloads, need medium fault mitigation that 
lets them recover from SEEs in several ways at the price of being unavailable for short periods of time. 
Cost-efficient scientific missions do not require high availability, which permits them to use COTS products 
without fault mitigation. Nonetheless, technology scaling in COTS products increases their susceptibility to 
SEEs, meaning that some mitigation measures must be applied. 
2.3 Error Correction Codes  
ECCs were introduced with the Hamming code [94] and the pioneering work of Shannon [95], which opened the 
way for new ECCs such as the Golay code [96]. Figure 2-20 illustrates the block diagram of a canonical digital 
communications/storage system; most books on ECC theory and digital communication [97] include this diagram. 
Both the source and the destination include any source coding technique appropriate to the nature of the 
information. 
The encoder takes the information bits as input and outputs redundant check bits together with the original 
information, enabling the detection and correction of most of the errors that originate from modulating a 
signal and transmitting it over a noisy medium [98-100]. 
The receiver decodes the information using the redundant bits in order to detect and correct errors. To 
check for errors, the ECC decoder re-encodes the received information and then compares the newly generated 
check bits with those generated at the sending end. 
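The re-encode-and-compare check can be illustrated with a Hamming(7,4) code, used here only as a small, 
well-known example; the function names are assumptions made for illustration. The decoder recomputes the 
check bits from the received data bits, compares them with the received check bits, and uses the resulting 
syndrome to locate and flip a single erroneous bit. 

```python
def encode(data):
    """Encode 4 data bits as a 7-bit codeword laid out as p1 p2 d1 p3 d2 d3 d4."""
    d1, d2, d3, d4 = data
    p1 = d1 ^ d2 ^ d4   # covers codeword positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4   # covers codeword positions 2, 3, 6, 7
    p3 = d2 ^ d3 ^ d4   # covers codeword positions 4, 5, 6, 7
    return [p1, p2, d1, p3, d2, d3, d4]


def decode(word):
    """Return (data bits, syndrome); syndrome 0 means no error was detected."""
    w = list(word)
    p1, p2, d1, p3, d2, d3, d4 = w
    # Re-encode the received data bits and compare the check bits.
    q1, q2, _, q3, _, _, _ = encode([d1, d2, d3, d4])
    syndrome = (p1 ^ q1) | ((p2 ^ q2) << 1) | ((p3 ^ q3) << 2)
    if syndrome:
        w[syndrome - 1] ^= 1   # the syndrome is the 1-based position of the flipped bit
    return [w[2], w[4], w[5], w[6]], syndrome


data = [1, 0, 1, 1]
received = encode(data)
received[4] ^= 1                        # emulate an upset in codeword position 5
corrected, position = decode(received)  # corrected == data, position == 5
```

This code corrects one flipped bit per codeword; two flipped bits in the same codeword would be miscorrected, 
which is why an additional overall parity bit is usually added when double error detection is required. 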
ECCs can also be concatenated, for example by combining an outer Reed–Solomon (RS) code, through 
intermediate interleaving, with an inner binary convolutional code. This combination has many applications, 
ranging from space communications to digital broadcasting of HD television. The main idea is that the decoder 
of the convolutional code generates bursts of errors, which the deinterleaving scheme breaks into smaller 
pieces that the RS decoder can manage. RS codes are non-binary and able to correct multiple errors. The 
advantage of the combined ECC is that it can be implemented using separate decoders instead of a single complex 
one. 
This section introduces the most common error detection and correction techniques that have been used to 
protect the CPU and the memory system of the processor architecture. FEC is a well-developed field 
[101, 102]. FEC techniques protect digital data on unreliable storage media or communication channels. 
In numerous computer architectures, memories are protected against SEUs by an FEC code. Although FEC codes 
are commonly implemented in hardware, which requires supplementary memory and encoding/decoding circuitry, they 
can also be implemented in software. For example, a software implementation of a Reed-Solomon code capable of 
single-byte error correction is proposed in [103] for protecting RAM discs in satellite memories. 
 
Figure 2-20 A Canonical Digital Communications System [104]. 
In the previous cases, both TMR and FEC require a certain level of redundancy and resource availability. 
The extra memory required by FEC depends on the algorithm used. The most basic one, the Hamming code with 
single error correction and double error detection, is the least demanding and the fastest. Algorithms with 
more correction capability, such as BCH codes, demand more resources and can be slower. TMR is very effective 
in terms of error detection and correction, but it requires extra memory and energy and is relatively slow. 
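For comparison with the coded approach, TMR needs no encoder or decoder: three copies of each word are 
stored and a bitwise majority vote masks any upset confined to one copy. The sketch below illustrates this 
under the assumption that, for every bit position, at most one copy is corrupted; the function name and the 
example values are hypothetical. 

```python
def tmr_vote(a: int, b: int, c: int) -> int:
    """Bitwise majority of three words: each output bit is set if it is set in at least two copies."""
    return (a & b) | (a & c) | (b & c)


stored = 0b1011_0010
copies = [stored, stored ^ 0b0000_1000, stored]   # emulate an upset in the second copy
voted = tmr_vote(*copies)                         # the two intact copies outvote the faulty one
```

The vote still succeeds when several copies are hit, provided the flipped bits are at different positions; it 
fails as soon as two copies are wrong at the same bit position. 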
Further explanations on the theory of error correction coding can be found in Appendix B. 
2.4 Software Protection Techniques 
In this section, techniques for fault tolerance in single and multicore architectures are assessed and 
their implementations are shown at different software layers: the user and kernel layers (Figure 2-21) and the 
compiler. Every implementation layer is briefly introduced in this section. 
 
Figure 2-21 Application and Kernel Layers 
The kernel layer is the central core of an operating system (OS). Typically, the kernel is responsible for 
memory, CPU, and device management. The application layer is designed to carry out a certain function directly 
for the user or for another application program. Application software runs on top of system software. At the 
compiler level, compiler optimization techniques are applied to protect the code at the instruction level by 
adding redundant instructions and then comparing their outcomes for error detection and correction. 
2.4.1 Process-Level Replication (PLR) 
PLR is based on the concept that only the faults that impact software correctness are harmful and need to be 
taken into consideration [105]. The abundance of processors in the Maestro architecture [106] gives PLR an 
advantage in detecting and correcting errors before they spread to the application’s output. 
PLR is implemented at the kernel level. This technique runs the same application on three separate tiles. 
The outcomes of the three tiles are compared as in TMR: if all files are identical, there are no errors. If 
there is a mismatch, meaning that an error has occurred, a majority rule decides the correct output. Opening 
files is made possible by the shared memory, which is equipped with bookkeeping. The results of this technique 
[105] using image compression of size 904K show a total overhead of 2,187,742,009 cycles, including both the 
compression and the PLR overhead. 
The implementation is achieved in three phases [105]. The two main phases, init and interposition, cause 
only slight overhead; however, the last phase, fini, could add a large overhead to the system. In addition, 
this technique does not consider the case where more than one processor has erroneous data, or the case where 
the shared memory is damaged. If more than one core has erroneous data, some error detection and correction 
techniques could be used. This study provides no error coverage information (the ability to detect and correct 
errors has not been reported). 
2.4.2 Thread-level Replication (TLR)  
TLR is a software-based fault tolerance technique [105] originating from N-version programming [107] with 
thread replication. The user function is replicated on three threads running on separate cores (tiles), and 
the replicas execute in parallel. TLR compares the outputs of the three threads using TMR. This technique 
exploits the redundancy of the Maestro architecture, with threads replacing processes. 
Three copies of the original function are created on different tiles using the pthread library. The 
outcomes of the threads are compared using TMR, and errors are determined from the function’s output. TLR has 
two functions: the main one, implemented by the threadReplicate function, and insertErrors, used for 
evaluation. 
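The replicate-and-vote structure of TLR can be sketched as follows. This is not the implementation of 
[105], which uses the pthread library on Maestro tiles; it is a Python analogue in which thread_replicate, 
insert_error, majority and checksum are hypothetical names, and in which the threads are not guaranteed to run 
on separate cores. 

```python
import threading
from collections import Counter


def thread_replicate(func, args=(), replicas=3):
    """Run func(*args) on `replicas` threads and return the list of outputs."""
    outputs = [None] * replicas

    def worker(index):
        outputs[index] = func(*args)

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(replicas)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return outputs


def insert_error(outputs, replica, mask):
    """Emulate a fault by flipping bits in one replica's integer output."""
    outputs[replica] ^= mask


def majority(outputs):
    """Return the value produced by more than half of the replicas."""
    value, votes = Counter(outputs).most_common(1)[0]
    if votes <= len(outputs) // 2:
        raise RuntimeError("no majority: more than one replica disagrees")
    return value


def checksum(data):
    """Example user function to protect (an arbitrary choice for this sketch)."""
    return sum(data) % 65521


outputs = thread_replicate(checksum, args=([1, 2, 3, 4],))
insert_error(outputs, replica=1, mask=0x10)
result = majority(outputs)   # the two unaffected replicas outvote the faulty one
```

As in the description above, the vote is taken on the function’s output only, so a second faulty replica 
leaves no majority and the sketch can then only report the failure. 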
The overhead caused by TLR is application dependent [105]. On an empty function, the total overhead was 
12,021,270 cycles. TLR does not consider the case of more than one erroneous thread, and the shared memory 
remains unprotected. This study provides no error coverage information. 
2.4.3 Heartbeats  
Heartbeats use a network of CPUs in which every CPU can check whether another CPU is active or inactive 
[105]. A CPU examines another CPU’s status by checking a scoreboard. If a CPU does not update the scoreboard 
within a certain time window, the other processors know it has been compromised; if the scoreboard is updated 
within the window, the CPU is healthy. The time a CPU is allowed to update the scoreboard before it is 
assumed to be compromised is called the slack time. Four CPUs are used, with a one-second interval between 
heartbeat updates. The average cost was 4,195 cycles for checking and 15,353 cycles for updating the 
scoreboard, a total of 19,548 cycles per CPU. 
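A minimal sketch of the scoreboard logic is given below. The class and method names are hypothetical, the 
slack time of 1.5 s is an assumed value, and timestamps are passed in explicitly so that the example is 
deterministic; a real system would read a hardware or OS clock and keep the scoreboard in shared memory. 

```python
class Scoreboard:
    """Shared record of the last heartbeat time of each CPU."""

    def __init__(self, cpu_ids, slack_s):
        self.slack_s = slack_s
        self.last_beat = {cpu: 0.0 for cpu in cpu_ids}

    def beat(self, cpu, now_s):
        """Called by a CPU to signal that it is still active."""
        self.last_beat[cpu] = now_s

    def is_alive(self, cpu, now_s):
        """A CPU is considered healthy if it updated within the slack time."""
        return (now_s - self.last_beat[cpu]) <= self.slack_s

    def suspects(self, now_s):
        """CPUs that any checker would consider compromised at time now_s."""
        return [cpu for cpu in self.last_beat if not self.is_alive(cpu, now_s)]


board = Scoreboard(cpu_ids=range(4), slack_s=1.5)
for t in (1.0, 2.0, 3.0):          # one-second heartbeat interval
    for cpu in range(4):
        if cpu == 2 and t > 1.0:
            continue               # CPU 2 stops updating after t = 1.0 s
        board.beat(cpu, t)
suspected = board.suspects(now_s=3.0)   # only CPU 2 has exceeded the slack time
```

The choice of slack time trades detection latency against false alarms: a value close to the heartbeat 
interval detects a silent CPU sooner but tolerates less jitter in the updates. 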
In this technique, the cores update a common scoreboard in shared memory according to their states. However, 
unless this shared memory is protected using FEC techniques, an SEE in memory could lead to erroneous outputs 
or, in the worst case, system failure. This technique does not report its error coverage, and its recovery 
method resets the CPU, meaning that data cannot be recovered when an error occurs. 
2.4.4 Fault Tolerance on Multicore Processors Using Deterministic Multithreading  
Deterministic multithreading is a software-based fault tolerance technique for multithreaded programs 
running on multicore processors [108]. Redundant execution of multithreaded processes is used for software 
error detection and mitigation. The same execution is guaranteed even in the presence of non-determinism caused 
by shared memory accesses. 
The technique is implemented at user level, and no communication between the redundant processes is needed. 
Its objective is to minimize the code that updates the clocks, so that a suspended thread does not wait for 
long. This is achieved using special optimization techniques [109]. The overhead is application dependent; for 
the PARSEC [110] and SPLASH-2 [111] benchmarks, the average overhead is 49% [108]. 
This technique improves performance because the clock-updating code is reduced and the clock updates are 
done in advance. However, when an error occurs, this technique rolls back to a 
valid checkpoint which may add latency.  
2.4.5 A User-level Library for Fault Tolerance for Multicore  
This technique is a multicore processing library that enables error detection and mitigation in a 
multithreaded user-level application with simple alterations to the code. Detection uses the redundant cores 
of the multicore system to perform redundant execution. Recovery is achieved using checkpoint/rollback [112]. 
The average overhead was 25% for 2 threads and 46% for 4 threads [112], using the PARSEC [110] and 
SPLASH-2 [111] benchmarks. This technique only works when the memory is assumed to be protected; in addition, 
the rollback takes a long time, which increases the delay. 
2.4.6 Check-Pointing and Rollback Recovery for MultiProcessors 
This technique allows a processor to recover from its most recent valid state when an error occurs; 
recovery is improved by avoiding rollback to old checkpoints [113]. 
2.4.7 Checkpoint/Rollback  
If an error is detected, checkpointing and rollback can be used for mitigation. The technique is 
implemented at kernel level using the publicly available, architecture-independent Linux-CR v15 [114], as 
described in [113]. The application reconstructs the whole process order by rewriting each process and then 
calling sys_restart() from each clone. 
Checkpoint performance: the overhead increases with both the size of the memory and the number of tiles 
used by the application. 
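The checkpoint/rollback principle, and the re-execution cost it implies, can be sketched in a few lines. 
This is a user-level, in-memory illustration and not Linux-CR: the function names, the detection by 
duplicated execution, the checkpoint interval and the injected fault are all assumptions made for the example. 

```python
import copy


def run_with_rollback(state, steps, checkpoint_every=2, inject_fault_at=None):
    """Run steps in order; on a detected error, restore the last checkpoint.

    Returns (final state, number of step executions including re-executions).
    """
    cp_index, cp_state = 0, copy.deepcopy(state)
    executed = 0
    i = 0
    while i < len(steps):
        if i % checkpoint_every == 0:
            cp_index, cp_state = i, copy.deepcopy(state)   # take a checkpoint
        primary = steps[i](copy.deepcopy(state))
        shadow = steps[i](copy.deepcopy(state))            # redundant execution
        executed += 1
        if i == inject_fault_at:
            primary["acc"] ^= 0x40                         # emulated transient upset
            inject_fault_at = None                         # the fault does not recur
        if primary != shadow:                              # error detected
            state, i = copy.deepcopy(cp_state), cp_index   # roll back and resume there
            continue
        state = primary
        i += 1
    return state, executed


def add(n):
    """Build a step that adds n to the accumulator held in the state."""
    def step(s):
        s["acc"] += n
        return s
    return step


steps = [add(1), add(2), add(3), add(4)]
final, executed = run_with_rollback({"acc": 0}, steps, inject_fault_at=3)
```

In this run the fault at the fourth step forces a return to the checkpoint taken before the third step, so 
two steps are executed twice; a longer checkpoint interval lowers the checkpointing cost but increases the 
work repeated after each rollback. 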
2.4.8 Respec: Multiprocessor Replay via External Determinism 
Respec supports deterministic replay of shared-memory multithreaded programs on commodity multiprocessor 
hardware. This technique adds 18% overhead to the original execution time for recording and replaying 
benchmarks with 2 threads, and 55% overhead for benchmarks with 4 threads. The communication between the 
leader and backup processes increases the overhead [115]. 
2.4.9 DRIFT: Decoupled compiler-based Instruction-level Fault-Tolerance  
DRIFT is a compiler-based error detection technique that replicates the instructions of the program and 
introduces checks. It aims to minimize the error detection overhead and enhance the system’s performance 
without affecting fault coverage. DRIFT attains this by decoupling the execution of the code (original and 
replicated) from the checks [116]. 
The DRIFT algorithm [116] operates in four steps: 
• Replicating the Code: the possibility of instruction replication is checked. If yes, then a copy of the 
original instruction is emitted just before the original one. 
• Isolating the Code: the replicated code is isolated from the original. The isolation guarantees that 
the replicated code does not write on any of the original code’s registers. 
• Emitting checks: the original register is compared against the corresponding renamed one. DRIFT 
gathers all the compare instructions of a basic-block into the vector (CMP VEC) [116] which is 
used in step 4. 
• Decoupling the Checks: the compare instructions in CMP VEC are pushed into vector GROUP 
until either the group limit is reached, or the end of the basic-block is reached. At this point, a jump 
is emitted for each instruction in the group. 
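The four steps above can be mimicked on a toy register machine, as in the sketch below. This is a strongly 
simplified illustration and not the DRIFT implementation, which operates inside the compiler and emits real 
compare and jump instructions; the instruction format, the names run_block and FaultDetected, the group limit 
and the injected fault are all hypothetical. 

```python
import operator

OPS = {"add": operator.add, "sub": operator.sub, "mul": operator.mul}


class FaultDetected(Exception):
    """Raised when an original register differs from its shadow copy."""


def run_block(block, regs, group_limit=2, fault=None):
    """Execute one basic block with replicated instructions and grouped checks.

    block: list of (dest, op, src1, src2) tuples.
    fault: optional (instruction_index, bit_mask) applied to the original result.
    Returns the number of check groups executed.
    """
    shadow = dict(regs)   # renamed registers: the replica never writes to regs
    cmp_vec = []          # pending compares, kept apart from the computation
    groups_run = 0
    for index, (dest, op, src1, src2) in enumerate(block):
        shadow[dest] = OPS[op](shadow[src1], shadow[src2])   # replica first
        regs[dest] = OPS[op](regs[src1], regs[src2])         # then the original
        if fault is not None and fault[0] == index:
            regs[dest] ^= fault[1]                           # emulated upset
        cmp_vec.append(dest)
        # Checks run only when the group is full or the block ends.
        if len(cmp_vec) == group_limit or index == len(block) - 1:
            groups_run += 1
            for reg in cmp_vec:
                if regs[reg] != shadow[reg]:
                    raise FaultDetected(reg)
            cmp_vec.clear()
    return groups_run


BLOCK = [("r2", "add", "r0", "r1"), ("r3", "mul", "r2", "r2"), ("r4", "sub", "r3", "r0")]
registers = {"r0": 3, "r1": 4}
groups = run_block(BLOCK, registers)   # three instructions are checked in two groups
```

Grouping the compares means an error is reported slightly later than with a check after every instruction, 
which is the price paid for separating the checks from the computation. 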
This technique generates an average overhead of 29% on the Mediabench II video [117] and SPEC CPU2000 
[118] benchmarks. Error detection and correction only cover the CPU; the memory remains vulnerable to SEEs. 
2.4.10 Composite Data Type Protection Algorithm (CDTP)  
Each variable is protected by an additional error coding technique. Every read and write operation on the 
protected object is replaced by a set of operations that check its correctness; in case of error, the correct 
value of the variable is obtained from the redundant information stored together with the original data [119]. 
Automatic insertion of the CDTP algorithm was implemented in the cc1 compiler as an independent stage of 
the compilation process. The protection is applied at the beginning of source code optimization, directly 
after the protected program is transformed into the GIMPLE internal representation [119]. The overhead depends 
on the benchmark tested and the algorithm used for protection: the Hamming algorithm generates an average 
overhead of 86%, the extended Golay algorithm 146%, the full iterated coding scheme 116%, and the selective 
iterated coding scheme 117%. 
This technique only protects the memory of the processor architecture; the CPU remains vulnerable to SEEs. 
2.4.11 Live Variable Check Algorithm  
The Live Variable Check algorithm uses time redundancy to discover computational or memory errors 
affecting local variables. A secondary set of data is introduced, and the same operations are performed on both 
groups of variables. The duplicated instructions have no impact on the result of the program but make it 
possible to detect errors in the system at run-time [119]. The outcomes of the duplicated computations on both 
copies of the variables are compared; in the case of a mismatch, the value of the variable at the start of the 
code is retrieved and a group of instructions is re-executed [119]. 
This technique does not protect the CPU; only memory is considered. In addition, the overhead is enormous, 
reaching 13 times the original benchmark. 
2.4.12 Further Techniques 
The Shoestring [120], Fault Tolerance Software Checking [121], Error Detection by Duplicated Instructions 
(EDDI) [13] and Software Implemented Fault Tolerance (SWIFT) [14] techniques are implemented by modifying 
compilers (LLVM, GCC, and OpenIMPACT [122], respectively); the modifications duplicate instructions and 
insert compare instructions where needed. While the other techniques mentioned in this section only protect the 
CPU, EDDI is capable of protecting both the CPU and memory from SEEs, but it only covers the MIPS architecture. 
The DAFT [123] and SRMT [124] techniques use LLVM and the Intel production compiler ICC 9.0, respectively, 
to automatically generate two redundant threads for an application and compare their outcomes for error 
detection. DAFT shows a large improvement over SRMT, and it will be considered in our work. Both techniques 
protect the CPU only; the memory system remains vulnerable to SEEs. 
The CASTED [125] system is implemented by modifying the back-end passes of the GCC-4.5.0 [126] compiler 
framework, using two passes: one for error detection and the other for error correction. Like most of the 
techniques in the literature, it only covers the CPU; memory has no coverage. 
Traditional fault-tolerant scheduling algorithms assume that the fault rate experienced by the system is 
constant and that the fault-tolerance strategy is also constant. The work in [127] uses adaptive protection on 
FPGAs; however, we use a different method for switching between protection modes, and our work aims to 
optimize the switching in order to find the highest reliability and performance. In addition, our work uses 
compiler fault injection to evaluate the different protection modes. 
In [128], fault injection experiments were conducted in order to determine the reliability/vulnerability of embedded processing architectures; the analysis and performance models can be found in [129], which introduces the concept of the Architectural Vulnerability Factor (AVF). The injection experiment includes several instruction types, such as branches, loads and stores, which were shown in [12] to be crucial in the study of reliability. The research conducted in [128] does not provide any information about the mitigation that should be used against SEUs.
The research presented in [130] is based on improving reliability-aware scheduling for Heterogeneous Chip Multiprocessors (HCMPs). A scheduler monitors the reliability of the different core types for the executed applications and, depending on the overall system reliability, decides whether an application runs on the big or the small core. The scheduler adapts dynamically to the workload, using a software metric called System Soft Error Rate (SSER) to measure reliability. This comes with a hardware cost of up to 296 bytes per core. The paper shows promising results, with an overhead of 6.3% and a reliability improvement of 25.4%; however, these results were obtained using simulations on Sniper 6.0 [131].
In [132], the concept of switching cores ON/OFF and applying dynamic voltage and frequency scaling is introduced, using a phase-driven, Q-learning-based dynamic reliability management technique for multi-core processors to maximize processor performance. The learning algorithm is itself vulnerable to SEUs, and it adds delays to the original benchmarks. The authors of [132] verified their proposal using simulation results, which show that, without degrading performance or exceeding the hardware limitations, it effectively reduces the peak temperature and hence reduces physical failure mechanisms such as electromigration and dielectric breakdown and, furthermore, limits the number of intermittent failures.
The research in [133] aims to reduce the temperature of the processing architecture using compiler-based software techniques, since an increase in circuit temperature can accelerate the fault rate of the entire chip. It proposes to deterministically establish the logical-to-physical register mapping in a rotating manner, based on application-specific access profiles, using a compiler-based shuffling strategy.
The research presented in [134] proposes an adaptive fault tolerance approach capable of using dual modular redundancy and rollback techniques. It adapts the level of Instruction-Level Parallelism (ILP), allowing a trade-off between reliability and performance. The results of the fault injection experiment show up to 89.53% improvement; however, this is due to the fact that the failure rate of the unprotected code under injection was already low, at 3.73%. The performance looks promising, with 5.86% overhead.
The research in [135] aimed to develop SIMD-based software error detection and recovery to improve the reliability of ARM processors, using instruction-level redundancy generated automatically by the LLVM compiler framework. The overhead ranged from 4.7% to 70.2%, and the fault-tolerant code is 1.44 to 2.54 times larger (in size) than the original. This research does not provide a study of the code’s resilience; it also assumes that the memory instructions (load/store) are protected in hardware, which restricts it to specific processing architectures.
The paper in [136] evaluated fault tolerance using traditional error detection and recovery methods such as TMR and DMR, together with the influence of using the OS and its functionalities, including the parallel libraries pthread and OpenMP. Their results show that the more parallel the code is, the more susceptible it is to faults, and that traditional protection methods are not very efficient against injected faults. Their results showed poor error detection and recovery, since not all instructions are replicated and, in addition, the functions that are called contain vulnerable code.
2.4.13 Comparison of the Techniques  
Different techniques have been studied and compared in Table 2-3, in order to identify the gaps that need to be addressed in this work and to narrow down the implementation level. Figure 2-22 shows the performance of the protection techniques.
It has been observed that minimal overhead was produced when error detection and correction codes were implemented at the compiler level. Amongst compiler implementations, LLVM has shown good performance with good error coverage; in addition, its development time could be minimal compared to the best alternative, the GCC compiler.
When providing a multicore solution (error detection and correction) it is inevitable to have an overhead, 
caused by the requirement for spawning the newly created threads, the overhead of their communication, and 
finally the overhead of joining them. The solutions that are implemented in a single core do not show this 
penalty since no thread spawning, communicating and joining penalty is added.  
Some techniques, such as Shoestring, SRMT and CASTED, have only been simulated, and even that did not prevent SRMT from having an overhead of 400%, due to the multicore nature of its implementation. DAFT, implemented using LLVM, is an improvement over SRMT: it shows low overhead and provides a multicore solution, but it does not cover memory errors.
The CDTP technique provides a solution to the problem of memory protection, which most of the other techniques lack; however, this comes with a considerable overhead, estimated at 86%-146%, due to the decoding and encoding function calls for error detection and correction in memory read/write operations. CDTP does not cover the CPU.
 
Figure 2-22  Performance of the Protection Techniques
[Bar chart titled "Overhead of Different Techniques"; horizontal axis: Techniques; vertical axis scale from 0 to 4.5. Series: CDTP1 (Memory), CDTP2 (Memory), DRIFT (CPU), SWIFT (CPU), EDDI (CPU + Memory), Shoestring, SRMT (Multi CPUs), DAFT (Multi CPUs), CASTED (CPU), Compiler Optimizations for Fault Tolerance Software Checking (CPU), Deterministic Multithreading, Hypervisor-based Fault-tolerance (Multi CPUs), Respec1 (Multi CPUs), Respec2 (Multi CPUs).]
 
Table 2-3 Comparison of Different Techniques 
Algorithm (technique) | Overhead (%) | Error Coverage | Hardware Cost | Comment
CDTP | 86-146 | Memory | Uses memory and CPU | Instructions are duplicated then check pointing
DRIFT | 29 | CPU | Uses memory and CPU | Instructions are duplicated gathered then check pointing
SWIFT | 41 | CPU | Uses memory and CPU | Instruction replication and check pointing
EDDI | 62 | CPU and Memory | Uses memory and CPU | Instruction replication and check pointing
Shoestring | 15.8-30.4 | CPU | Uses memory and CPU | Instruction replication and check pointing
SRMT | 400 | Multi CPUs | Uses memory and CPU | Thread replication and check pointing
DAFT | 38 | Multi CPUs | Uses memory and CPU | Thread replication and check pointing
CASTED | 58 | CPU | Uses memory and CPU | Instruction Replication and check pointing
Compiler Optimizations for Fault Tolerance Software Checking | 50 | CPU | Uses memory and CPU | Instructions are duplicated then check pointing + rollback
Deterministic Multithreading | 49 | Multi CPUs | Uses memory and CPU | Uses rollback for error correction (will be considered as backup)
Hypervisor-based Fault-tolerance | 100 | Multi CPUs | Uses memory and CPU | Communication between primary and backup CPUs required
Respec | 18-55 | Multi CPUs | Uses memory and CPU | Requires communication between CPUs
 
2.4.14 Techniques for Memory Access and Coherency  
The different techniques for error detection and mitigation require access to shared data in memory, which 
could create data races between threads, in addition to the overhead produced from threads waiting to access 
a shared resource. In this section, techniques for managing the shared memory are shown. 
2.4.14.1 Mutex Locking Algorithms  
Mutual exclusion (mutex) is the requirement that no two concurrent processes are in their critical section simultaneously. A mutex must provide two intrinsic operations to be practical: lock and unlock [137]. The data of a mutex is an integer in memory, initialized to zero, meaning that it is unlocked. A mutex can be locked by checking whether it is zero and then assigning one to it [137]; for this to be safe, the check and the assignment have to be performed as one atomic step.
Refined Synchronization - Condition Variables: A condition variable gives threads the ability to wait without wasting CPU cycles. A signal wakes up one of the threads suspended on the condition variable, whereas a broadcast wakes up all of the threads suspended on it. The locking required when accessing a condition variable is provided by a mutex associated with the condition variable [137].
Waiting on a Condition Variable: When a thread has to wait for a condition that another thread will signal, it waits on the corresponding condition variable. This can be implemented via one of the following functions: pthread_cond_timedwait() or pthread_cond_wait() [137].
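As an illustration of the wait and broadcast operations, the following is a minimal sketch using Python's threading.Condition, which pairs a condition variable with a mutex in a way comparable to pthread_cond_wait() used with a pthread mutex. The class name ResultBox and the function demo are hypothetical names introduced only for this example; they are not taken from [137] or from this work.

```python
import threading


class ResultBox:
    """Holds one value; readers block until a writer has stored it."""

    def __init__(self):
        self._cond = threading.Condition()  # condition variable with its own mutex
        self._ready = False
        self._value = None

    def put(self, value):
        with self._cond:                    # lock the associated mutex
            self._value = value
            self._ready = True
            self._cond.notify_all()         # broadcast: wake every waiting thread

    def get(self, timeout=None):
        with self._cond:
            # wait_for releases the mutex while sleeping and re-checks the
            # predicate each time the thread is woken up
            if not self._cond.wait_for(lambda: self._ready, timeout=timeout):
                raise TimeoutError("value was not produced in time")
            return self._value


def demo():
    """Three readers wait on the same condition; one writer wakes them all."""
    box = ResultBox()
    seen = []
    readers = [
        threading.Thread(target=lambda: seen.append(box.get(timeout=5.0)))
        for _ in range(3)
    ]
    for t in readers:
        t.start()
    box.put(42)
    for t in readers:
        t.join()
    return seen
```

The timeout argument plays the role of the timed variant of the wait: a reader gives up instead of sleeping forever if no writer ever signals the condition.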
2.4.14.2 Semaphores 
Semaphores are used to avoid race conditions; however, they do not guarantee a race-free program. Counting semaphores are semaphores with an arbitrary resource count, whereas semaphores that are limited to the values 0 and 1 (or locked/unlocked, unavailable/available) are known as “binary semaphores” [137].
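The difference between a counting and a binary semaphore can be sketched as follows, assuming Python's threading.Semaphore as a stand-in for the semaphores described in [137]. The function run_with_limit and its parameters are hypothetical; the sketch only records how many workers were inside the guarded region at the same time.

```python
import threading
import time


def run_with_limit(workers=6, slots=2):
    """Run `workers` threads, allowing at most `slots` in the guarded region."""
    sem = threading.Semaphore(slots)  # counting semaphore with `slots` permits
    guard = threading.Lock()          # protects the counters below
    state = {"active": 0, "peak": 0}

    def worker():
        with sem:                     # take a permit, blocking if none is left
            with guard:
                state["active"] += 1
                state["peak"] = max(state["peak"], state["active"])
            time.sleep(0.01)          # stand-in for work on the shared resource
            with guard:
                state["active"] -= 1

    threads = [threading.Thread(target=worker) for _ in range(workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return state["peak"]
```

With slots=1 the semaphore behaves as a binary semaphore, so the observed peak can never exceed one.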
2.4.15 Code profiling  
In order to verify the efficiency of the added protection code, multiple parameters should be quantified, including the time overhead, i.e. the percentage of delay added by the protection code. The delay can be obtained by measuring the time it takes to execute the unprotected and the protected codes. Multiple tools can be used for measuring the delay, such as the Linux tool perf, which measures both the delays and the number of processor cycles [138].
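The overhead percentage can be computed from the two measured execution times. The sketch below assumes that overhead is defined as the relative increase in execution time over the unprotected code; the helper names overhead_percent and best_time are hypothetical, and time.perf_counter is used here only as a simple substitute for a dedicated profiler.

```python
import time


def overhead_percent(t_unprotected, t_protected):
    """Relative slowdown of the protected code, in percent."""
    if t_unprotected <= 0:
        raise ValueError("the unprotected execution time must be positive")
    return 100.0 * (t_protected - t_unprotected) / t_unprotected


def best_time(fn, repeats=5):
    """Smallest wall-clock time over several runs, to reduce measurement noise."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        best = min(best, time.perf_counter() - start)
    return best
```

For example, a benchmark that takes 2.0 s unprotected and 3.0 s protected has an overhead of 50%.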
Another tool to consider is LLVM’s profiling passes, such as -instcount, which counts the number of instructions and their types.
In order to check that the protection code was added, both the assembly and the executable code will be inspected. The assembly can be produced in LLVM using the llc tool. The executables of the benchmarks can be disassembled and checked using the GDB tool.
The Valgrind tool in Linux has also been used to check for memory leaks, for memory debugging, and for profiling.
2.5 Summary 
In Section 2.1, the different multicore architectures on the market and the preceding steps leading to high-performing multicore architectures are shown, where pipelining allows multiple instructions to run at the same time and caching provides the CPUs with data and instructions locally, yielding an order-of-magnitude improvement in performance. The new multicore architectures are also cheap and have low power consumption. Space radiation has been discussed in Section 2.1.5 in order to determine the limitations of software radiation hardening of processor architectures: the aim is to mitigate against SEEs, whereas accumulative effects cannot be mitigated with software solutions. In Section 2.1.7 the LLVM compiler is introduced, a strong and modern framework whose libraries enable us to create a new compiler able to support several languages and processor architectures, making it a candidate for implementing the automatic protection and parallelisation of code. The survey of processors in space in Section 2.2.4 shows that the highest-performance processors used in space are COTS, and that the space industry is trying to overcome the vulnerability of COTS using different protection techniques in software and in hardware. The theoretical background behind different error detection and recovery techniques has been introduced in Section 2.3, by showing the basic concepts of ECC and then the theory behind Binary Cyclic Codes, Hamming codes (able to detect two errors and correct single errors), and Golay, Reed-Muller and Reed-Solomon codes (able to detect and correct multiple errors). In Section 2.4, some of the well-known techniques that have been used to mitigate against SEEs have been introduced, including TMR and FEC. With the compiler’s automatic code generation, it is possible to make system calls (Kernel level) and change the code at the Application level, meaning that implementing protection techniques at the compiler level enables the combination of the different implementation layers if needed. In order to provide a multicore solution, the technique has to use system calls; in the literature the pthread library has mostly been used. Furthermore, a compiler could automatically generate code for different parallelism libraries, including the OpenMP library.
The EDDI technique provides full protection of a multicore architecture (CPUs and memory) from the SEUs caused by space radiation, and its overhead (62%) is acceptable; however, it is architecture dependent and can only be used for MIPS. It could be implemented using LLVM, thus allowing it to target more architectures. Furthermore, EDDI could also be extended to be a multicore solution, but the multithreading cost of spawning, communicating, and joining should be minimized. The previous techniques require additional redundant computations that generate a certain amount of overhead, depending on the technique used; this is due to the additional instructions and the multithreading overheads. The challenges are to find the most reliable techniques that generate minimal overhead, make the protection process automated, protect multiple high-level languages, protect multiple processing architectures, and protect both the CPU and the memory system. Implementation at the instruction level has demonstrated the lowest overhead, so it is a good starting point. Techniques implemented using LLVM, such as Shoestring (with an overhead of 15.8-30.4%) and DAFT (with up to 38% overhead), have shown minimal overhead compared to the state-of-the-art compiler-implemented techniques. Techniques implemented at the Application and Kernel levels require the protection code to be added manually, which is not efficient; fully automated protection could be achieved by using the automatic code generation of the LLVM compiler. The previous techniques did not investigate the possibility of trading off reliability against processing power. An adaptive protection system, depending on the error rate and the available resources, is the aim of this work. Some error detection and recovery techniques use threads and processes in order to make software redundant; in Section 2.4.14 we have shown the different techniques for keeping the resources shared between threads and processes coherent.
Yasser Nezzari                                                            Chapter 3, New Software Protection Techniques for COTS 
3. NEW SOFTWARE PROTECTION TECHNIQUES FOR COTS 
The literature review helped to narrow down a proposed solution that addresses all the objectives set at the start of this research. It was decided to continue the work on the LLVM platform due to its efficiency in terms of performance (ECC techniques implemented using LLVM have shown the lowest overhead amongst the techniques reviewed), and many other advantages mentioned in Section 2.4.
This work proposes implementing LLVM passes to automatically detect and correct errors caused by SEUs. It will support multiple high-level languages and multiple processor architectures (those supported by the LLVM framework). The automatic insertion of error detection and correction code is achieved using the automatic code generation of the LLVM framework at the intermediate representation level, by adding redundant instructions and comparing their outcomes using comparison instructions, after running an analysis pass that determines whether the protection is beneficial and where it should be added. The analysis pass will be able to analyse the code and determine its instruction types; furthermore, it will analyse functions to determine their types and return values, and collect information about the memory used, including new allocations and read/write operations.
This work will protect both memory and CPUs, starting with the memory; later, the pass will be extended to protect the multicore architecture. The extension to multicore will be achieved by automatically calling parallelism libraries, starting with the pthread library, allowing redundancy to be achieved using the abundant CPU resources available in multicore architectures. Techniques to reduce the cost of spawning, communication and joining of threads will be implemented in order to minimise the overhead.
The memory will be protected using N-Modular Redundancy (NMR) or FEC. NMR includes dual modular redundancy (DMR) and triple modular redundancy (TMR), depending on resource availability. The FEC will be implemented using single error correction or multiple error correction codes, depending on the error rate.
In addition to minimizing the overhead of multithreading, this work considers automatic parallelisation and 
improving data locality for better use of the caches.  
The error rate of SEUs can be affected by multiple factors. Internal factors are related to the circuit parameters, such as the size of the technology node, the size of the RAM and the caches, and the network and configuration of the memory system with respect to the CPUs. External factors are related to the environment in which the device is operating, including its orbit and its proximity to the South Atlantic Anomaly and the poles, in addition to solar flares, which cause higher SEE rates. In order to overcome the external factors, we propose an adaptive system capable of switching the operating mode depending on the change in the SEE rate. Having an adaptive system will enable the processor to achieve higher reliability without sacrificing its performance all the time.
The proposed adaptive system incorporates multiple operating modes, allowing the processor to switch between the high-performance mode, used when the SEE rate is low, and the high-reliability modes, in which the reliability of the code is enhanced by adding the different software error detection and recovery methods.
This work will be implemented on COTS processing architectures, as these platforms have been proven to offer high performance, low energy consumption, small size and weight, and efficient software implementation. A typical COTS processor architecture incorporates four CPUs. Each CPU has two levels of cache memory. The third level of cache and the RAM are shared between all of the CPUs.
3.1 Key Research Challenges  
The key challenges of this research are summarized in Table 3-1 and are derived from the critical gaps in 
technology as discussed in Chapter 2. 
Table 3-1 List of Research Challenges  
1. Error resilient CPU using software protection techniques
a. At what level should the software protection be developed (Compiler, Kernel or Application)?
b. How can the overhead generated by the protection code be minimized?
c. How can the protection process be automated?
d. Can this solution include multiple high-level languages and multiple processing architectures?
2. Develop novel reliability prediction models
a. How can a precise prediction model be developed, and how can the theoretical prediction be compared with the experimental work?
b. Will the inclusion of multiple internal and external parameters in the reliability prediction model improve its accuracy?
c. How can the reliability of both the protected and the unprotected processor be modelled?
3. Develop an adaptive protection system
a. Is it possible to develop an adaptive system capable of changing its mode of protection in real time, depending on the error rate in orbit?
b. Will the adaptive mode improve the overall performance of the system?
4. Develop a test bed
a. What is the best alternative to a neutron radiation test?
b. Is it possible to obtain the sensitivities of all instruction types of the tested benchmarks using software fault injection?
c. Is it possible to use fault injection to inject faults both statically and dynamically?
d. How can the fault injection results be used to verify the precision of the prediction model?
 
The system requirements and the key challenges will be answered in the following sections. 
3.2 Generating Hardened Code Using LLVM Compiler 
From the state-of-the-art papers on software error detection and recovery, it was concluded that the highest-performing methods use compilers to generate hardened code; however, most of these techniques provide error detection only, and the recovery part has been ignored. We propose a new scheme capable of both error detection and recovery, using the compiler as its backbone, and we expect its performance to match or exceed the state of the art.
Looking more closely at how compilers work, most of them comprise three main parts: the front end, the optimizer and the back end. Using the front end or the back end to implement error detection and recovery techniques would limit the protection to only a few high-level languages and processing architectures. However, if the work is implemented in the optimizer part of the compiler, it covers all the supported high-level languages and processing architectures. This is illustrated in Figure 3-2.
The proposed modifications to LLVM’s optimizer comprise two passes [139]: an analysis pass and a transformation pass. The analysis pass iterates through all of the code’s layers, where every layer contains one or more sublayers: the module layer is followed by the functions, then the basic blocks, and finally the instructions. This can be seen in Figure 3-1. Iterating through the different software layers gives the analysis pass information about them, including the number of functions, blocks and instructions, in addition to the types of the different functions and instructions.
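The traversal performed by the analysis pass can be sketched on a toy data structure. The example below does not use the LLVM API: the nested dictionary toy_module is a hypothetical stand-in that only mimics the module, function, basic block and instruction layers, and the function analyse is an assumed name for illustration.

```python
from collections import Counter

# Hypothetical toy module: function name -> basic block name -> opcodes.
toy_module = {
    "main": {
        "entry": ["alloca", "store", "load", "add", "br"],
        "exit": ["load", "ret"],
    },
    "helper": {
        "entry": ["mul", "ret"],
    },
}


def analyse(module):
    """Walk module -> functions -> basic blocks -> instructions and count them."""
    stats = {"functions": 0, "blocks": 0, "instructions": 0, "opcodes": Counter()}
    for blocks in module.values():            # function layer
        stats["functions"] += 1
        for instructions in blocks.values():  # basic-block layer
            stats["blocks"] += 1
            for opcode in instructions:       # instruction layer
                stats["instructions"] += 1
                stats["opcodes"][opcode] += 1
    return stats
```

The per-opcode counts are the kind of information a later transformation step could use, for instance to locate the load and store instructions that need memory protection.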
 
Figure 3-1 Software Layers with LLVM 
 
Figure 3-2 Example of the High-Level Languages & Processing Architectures Supported by LLVM 
The analysis pass can also identify the most appropriate location to add the protection code, enabling the added redundant instructions to detect and recover from the errors that occur. These errors are classified into four main categories: Silent Data Corruptions (SDC), Crashes, Hangs and Control flow errors. In order to handle all of these error types, a transformation pass must be implemented. The transformation pass uses the information collected by the analysis pass to add the appropriate protection techniques accordingly.
Another advantage of this method is the automated process of code generation: the newly added passes are able to transform any code in its intermediate representation [140], making it possible to apply the protection techniques to multiple high-level languages and multiple processing architectures.
3.3 Using FEC & Modular Redundancy Protection Techniques (TMR) 
We propose the use of multiple protection techniques in order to detect and recover from SEE-induced errors; these include FEC codes such as the Hamming and BCH codes. Hamming is the most widespread algorithm used in different fields for error detection and recovery because it is highly optimized, enabling it to detect and recover quickly. Hamming is capable of detecting two errors and correcting one error. The BCH code, on the other hand, is capable of detecting and correcting multiple errors, which makes it add extra overhead compared to the optimized Hamming code.
Similar to other FECs, both the Hamming and the BCH codes encode and decode the data. Encoding is done when writing the data to the memory system, where check bits are added to the information bits to form the code-word. Decoding is done when reading from the memory, where the decoder algorithm compares newly computed check bits with the stored ones in order to detect and correct errors.
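The encode-on-write and decode-on-read steps can be illustrated with the textbook Hamming(7,4) code. This is a minimal sketch and not the implementation used in this work: it corrects a single flipped bit per code-word, and detecting double errors would additionally require an overall parity bit, which is omitted here. The function names are hypothetical.

```python
def hamming74_encode(data_bits):
    """Encode 4 data bits into a 7-bit code-word laid out as p1 p2 d1 p3 d2 d3 d4."""
    d1, d2, d3, d4 = data_bits
    p1 = d1 ^ d2 ^ d4  # parity over positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4  # parity over positions 2, 3, 6, 7
    p3 = d2 ^ d3 ^ d4  # parity over positions 4, 5, 6, 7
    return [p1, p2, d1, p3, d2, d3, d4]


def hamming74_decode(code_bits):
    """Return (data_bits, syndrome); a non-zero syndrome is the corrected position."""
    c = list(code_bits)
    # Recompute each parity check over the stored bits it covers.
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]
    syndrome = s1 + 2 * s2 + 4 * s3  # 1-based position of the flipped bit, 0 if none
    if syndrome:
        c[syndrome - 1] ^= 1         # correct the single-bit error
    return [c[2], c[4], c[5], c[6]], syndrome
```

For instance, hamming74_encode([1, 0, 1, 1]) yields [0, 1, 1, 0, 0, 1, 1]; flipping any one of those seven bits before decoding still returns the original four data bits.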
LLVM passes have proven efficient for encoding and decoding the data: the transformation pass iterates through all of the different code layers and detects the memory instructions, which it then transforms. Check bits are added to the data before it is stored, by an encoding function that is inserted automatically next to the store instructions. Every time the data is read from the memory it is decoded, by a decoder function added at the instruction layer. The implementation of FEC is illustrated in Figure 3-3.
Similar to the implementation of FEC, implementing TMR requires two LLVM passes, an analysis pass and a transformation pass; unlike FEC, however, every instruction is triplicated (FEC adds check bits for memory instructions only). TMR uses a majority voter to detect and correct errors. What makes this a strong scheme is its ability to cover multiple instruction and data types, which is expected to reduce the error rate to very low levels with a negligible effect on performance.
The expected delay is low due to the use of multicore processing architectures, which possess abundant processing resources and make it possible to add redundant instructions without adding large overheads.
Once FEC and TMR are implemented at the instruction level, their results will be compared in terms of error detection and correction ability and of the overhead generated by each technique. From now on, instruction-level TMR is referred to as Instructions TMR (ITMR). A traditional implementation of TMR at the hardware level uses three AND gates and two OR gates to correct every single bit. Our implementation will be capable of processing n-bit words at once, instead of using AND and OR instructions for every single bit. There are potential vulnerabilities when using software FEC, specifically the vulnerability of the protection code itself. According to [141], this can be minimized by protecting the protection code, using their flowchart of the self-check routine.
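The word-wide voting idea can be sketched with bitwise operators: the same three AND and two OR operations that a hardware voter applies to one bit are applied here to a whole integer at once. The function name vote_word is hypothetical, and this is an illustration of the principle rather than the code generated by the proposed passes.

```python
def vote_word(a, b, c):
    """Bitwise majority of three copies: each result bit is set if it is set in at least two inputs."""
    return (a & b) | (a & c) | (b & c)
```

If one copy suffers bit flips, the other two still agree on every bit, so the voted word equals the original value.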
 
Figure 3-3 Proposed Flow Chart of FEC Using LLVM 
 
 
 
3.4 Modelling Reliability Equations for COTS Processing Architectures 
At this stage we have proposed methods for software error detection and recovery; however, we need to extend the work a step further by studying the reliability of the whole processing architecture. The reliability prediction model is a first effort to model the reliability of the whole processing architecture, both when it is protected using software protection techniques and when it is unprotected. This model depends on multiple parameters, some of which are hardware/software related and others related to the environment.
Our model starts with the reliability equations for the different components of the processing architecture: the CPU, the different levels of cache and the RAM. The expected reliability will be compared to the reliability obtained from the injection experiment, both when the software protection is used and when no protection is applied.
Our model considers the sensitivity of every instruction type, since different instruction types have different failure rates. The failures are caused by software fault injection, using the fault injection tool that we have developed using the LLVM compiler. The sensitivity of an instruction type is the proportion of injections into that type that caused one of the error types mentioned above.
The reliability model takes into account the relationship between the CPUs, the caches and the main memory, and how they are configured and connected to each other. The main difference between the usual COTS multicore architectures is whether the RAM and the caches are incorporated within the CPUs and, if so, how many levels of cache are incorporated. This model will be based on the Poisson distribution.
The prediction model covers both the unprotected mode and the protected mode. This will show how much reliability the error detection and correction software has added.
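To make the ingredients of such a model concrete, the following is a sketch under stated assumptions and not the reliability equations of this work. It assumes that failures arrive as a homogeneous Poisson process, so that the reliability over a time t is the probability of zero failures, exp(-rate * t); that the components are in series, so their reliabilities multiply; and that the raw upset rate is scaled by a mix-weighted average of the instruction sensitivities. All function names and the weighting scheme are hypothetical.

```python
import math


def sensitivity(failures, injections):
    """Fraction of injected faults that led to an SDC, crash, hang or control-flow error."""
    if injections <= 0:
        raise ValueError("at least one injection is required")
    return failures / injections


def effective_rate(raw_rate, instruction_mix, sensitivities):
    """Scale a raw upset rate by the mix-weighted average sensitivity (assumed weighting)."""
    weight = sum(share * sensitivities[kind] for kind, share in instruction_mix.items())
    return raw_rate * weight


def reliability(rate, mission_time):
    """Probability of zero failures in `mission_time` for a Poisson process."""
    return math.exp(-rate * mission_time)


def series_reliability(rates, mission_time):
    """All components must survive, so the individual reliabilities multiply."""
    return math.prod(reliability(r, mission_time) for r in rates)
```

Comparing the value obtained with the unprotected sensitivities against the value obtained with the sensitivities measured on protected code would give one estimate of the reliability added by the protection.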
3.5 New Adaptive Multicore Solution  
Implementing an adaptive solution will enable the system to have high reliability without abandoning its performance. This proposition is backed by the fact that error rates in orbit change depending on the proximity of the spacecraft to the SAA and the polar regions. An adaptive system will make it possible to operate the processing architecture in multiple modes, where highly protected code is applied when the error rate is high, and high-performing code without protection is applied when the error rate is low. Implementing this will require changes in the operating system, taking advantage of the abundant resources of the COTS and distributing the redundant code over different threads, and thus over different processing cores.
We propose that the adaptive system has three operating modes: the first without protection, the second using threads TMR (TTMR), and the third a combination of the previous ITMR with TTMR.
The first mode, the unprotected mode, has no error detection and recovery code added to it, enabling it to run faster than the other operating modes. This mode will be enabled when the spacecraft is not exposed to the harmful radiation that causes SEEs. The SEE rate is determined from the location of the spacecraft in orbit.
The second mode will be used when the spacecraft is exposed to low radiation levels, causing low SEE rates, meaning that the processing architecture requires error detection and recovery, but not in its most rigorous form. This mode will use the OS of the processing architecture to run the code on separate threads, and at the end of the execution of the threads, voting logic will be used to detect and correct any errors that occurred. This is shown in Figure 3-4, where the code runs on three redundant threads all of the time (green dashed lines), and the voter decides the correct result amongst the three.
 
Figure 3-4 Typical Multicore COTS After Applying Software Threads TMR 
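The control flow of the threads TMR mode can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the names `majority_vote` and `run_ttmr` are hypothetical, results are assumed to be hashable values, and Python threads do not place the replicas on separate cores as the proposed system would.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

def majority_vote(results):
    """Return the value produced by at least two of the three replicas."""
    value, count = Counter(results).most_common(1)[0]
    if count < 2:
        # All three replicas disagree: the error is detected but not recoverable.
        raise RuntimeError("no majority: all three replicas disagree")
    return value

def run_ttmr(func, *args):
    """Run func on three threads and vote on the three results."""
    with ThreadPoolExecutor(max_workers=3) as pool:
        futures = [pool.submit(func, *args) for _ in range(3)]
        results = [f.result() for f in futures]
    return majority_vote(results)
```

A call such as `run_ttmr(sum, [1, 2, 3])` runs the same function three times and returns the agreed result; a single corrupted replica is outvoted by the other two.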
The third mode will have the highest reliability since it combines both instruction and thread replication. This can be achieved by first executing both compiler passes on the code, giving it redundant instructions as well as error detection and recovery instructions. After all instructions are triplicated, the code will run on three threads with a voter at the end to decide the correct result amongst the three. This mode is expected to have high error detection and recovery, but this will come with an overhead cost. The diagram in Figure 3-5 shows the whole architecture using the combined protection mode.
 
 
Figure 3-5 Typical Multicore COTS After Applying the Combined ITMR & TTMR
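The switching between the three operating modes can be illustrated with a small selection function. This is a sketch only: the thresholds are hypothetical parameters that would have to come from the orbital error-rate data, and the names `ProtectionMode` and `select_mode` are not part of the proposed system.

```python
from enum import Enum

class ProtectionMode(Enum):
    UNPROTECTED = "no protection"
    TTMR = "threads TMR"
    ITMR_TTMR = "ITMR combined with TTMR"

def select_mode(see_rate, low_threshold, high_threshold):
    """Pick an operating mode from the current SEE rate estimate.

    low_threshold and high_threshold are hypothetical tuning parameters
    (same unit as see_rate) separating the three modes.
    """
    if see_rate < low_threshold:
        return ProtectionMode.UNPROTECTED   # negligible radiation exposure
    if see_rate < high_threshold:
        return ProtectionMode.TTMR          # low SEE rate
    return ProtectionMode.ITMR_TTMR         # high SEE rate, e.g. near the SAA
```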
3.6 Proposed Fault Injection 
The aim of performing the fault injection experiments is to test the ability of the software error detection and recovery techniques that have been developed. The injection experiments will also validate the reliability prediction models.
The fault injector will be able to perform error injections statically (at compile time, using a static library, i.e. files with the .a extension) and dynamically (at run time, using a dynamic library, i.e. files with the .so extension). The injection will be performed using an LLVM pass, where random errors will be injected into different instructions of the software.
In order to validate the reliability prediction model, all the instructions of the chosen benchmarks will be injected. For benchmarks with a very large number of instructions, injections will be performed on samples with a 95% confidence level.
Both protected and unprotected codes will be injected in order to quantify how much the error rate has been reduced purely by the software protection techniques. Logs of the golden files, i.e. the fault-free results of the benchmarks, will be saved for later comparison with the outcome of the injected code. Injecting into the software will produce multiple types of errors, such as crashes, Silent Data Corruptions (SDCs), control flow errors and hangs. The error rates are expected to be very high (more than 50%) when the unprotected code is injected. The protected code adds error detection and recovery schemes, reducing the error rate compared to the case where no protection has been applied.
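The comparison against the golden output can be illustrated as follows. This is a simplified sketch, not the LLVM-based injector itself: it flips one bit of an integer value and classifies a run by comparing its output with the fault-free result. The function names and the extra "masked" label (a fault that leaves the output unchanged) are assumptions made for the example.

```python
import random

def flip_random_bit(value, width=32, rng=random):
    """Flip one randomly chosen bit of an unsigned integer of the given width."""
    bit = rng.randrange(width)
    return (value ^ (1 << bit)) & ((1 << width) - 1)

def classify(golden, observed, crashed=False, timed_out=False):
    """Classify one injection run against the golden (fault-free) output."""
    if crashed:
        return "crash"
    if timed_out:
        return "hang"
    # Output produced: either identical to the golden run or silently corrupted.
    return "masked" if observed == golden else "SDC"
```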
What makes our injector unique is its ability to inject multiple instruction and data types, including both memory and CPU instructions. In this context, memory instructions are the instructions responsible for creating new memory blocks and for storing to and reading from memory. The CPU instructions are the arithmetic and logic operations.
All the instructions of the benchmarks will be injected; however, in large benchmarks (i.e. benchmarks with a large number of instructions) the number of injections will be sampled, depending on the size of the benchmark. [142] will be used to obtain the sample size (the number of injections for large benchmarks), with a 95% confidence level and a 0.1% margin of error.
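As a rough illustration of how such a sample size could be computed, the sketch below uses the standard finite-population formula for estimating a proportion, with the most conservative proportion p = 0.5. Whether [142] uses exactly this formula is an assumption, and the function and parameter names are illustrative.

```python
import math

def injection_sample_size(population, margin=0.001, z=1.96, p=0.5):
    """Sample size for estimating a proportion in a finite population.

    population: total number of candidate injections (instructions)
    margin:     margin of error (0.001 corresponds to 0.1%)
    z:          normal quantile for the confidence level (1.96 for 95%)
    p:          assumed error proportion; 0.5 is the most conservative choice
    """
    n0 = (z ** 2) * p * (1.0 - p) / margin ** 2     # infinite-population size
    n = n0 / (1.0 + (n0 - 1.0) / population)        # finite-population correction
    return min(population, math.ceil(n))
```

With such a tight margin of error, small benchmarks end up with a sample size equal to the whole population, which matches the statement that all instructions are injected unless the benchmark is large.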
3.7 Summary 
The proposed solution offers a new system capable of automatically applying error detection and recovery, which can protect the CPU in real time. According to the literature, using compilers to generate redundant code can be beneficial owing to its low overhead and high error detection. The proposed solution will be implemented using LLVM and should be capable of covering multiple high-level languages and multiple processing architectures. Most of the state-of-the-art implementations only provide error detection without recovery. Our proposed solution will be capable of both error detection and recovery, in addition to covering both the CPU and the memory system of the processor.
This research also proposes a novel reliability prediction model for both the protected and unprotected processing architecture. The prediction will model every part of the processor separately, and by combining the reliabilities of the components, the reliability of the whole processing architecture will be obtained. The prediction model includes multiple parameters, internal and external, aiming to improve its accuracy.
The efficiency of the error detection and correction code will be tested using software fault injection. This 
will verify the resilience/vulnerability of the protected code. Comparing the error rate before and after the 
protection is added will show how much improvement was added using the newly developed software 
protection code. The results of the injections will also be used in the reliability prediction model to determine 
its accuracy. 
Adding static protection code will always result in overhead. Knowing that the error rate in orbit changes 
depending on the location of the spacecraft, it would be possible to lower the overhead using an adaptive 
system. The adaptive system will be capable of switching its protection mode in real time, depending on the 
error rate. This will allow the CPU to operate in high performance modes when the error rate is low, and switch 
to operating modes that are more reliable when the error rate is high. 
 
 
 
 
 
 
 
 
Yasser Nezzari                                                                                                   Chapter 4, Reliability Predictions 
68 
 
4. RELIABILITY PREDICTIONS  
In this chapter, a novel model for the reliability of the protected and unprotected processing architecture is developed. The model starts from the reliability of the basic components, where the different processor parts such as the CPU, the caches and the RAM are modelled; combining the reliabilities of these components gives the reliability of the whole system. For the protected code, the chapter first gives the reliability predictions when protection is added at the instruction level, where instructions TMR (ITMR) replicates instructions and calls a voter function for error detection and correction. It also provides the reliability predictions when protection is added at the thread level, in other words using threads TMR (TTMR). This chapter prepares for the next one, where the predicted reliability of the protected architecture will be compared to the reliability obtained from the fault injection experiment on the protected code. The prediction model estimates the worst-case scenario, and does not consider the case where an error has occurred before writing to memory or loading to CPU registers.
In the predictions for the reliability of the unprotected code, the sensitivity of every instruction type to the 
fault injection caused by our software to simulate SEEs will be included. The sensitivity of an instruction type 
is the number of errors divided by the total number of injections. 
The definition of the reliability of a system is the probability that it is operating without failures for a 
defined period of time. Assuming exponentially-distributed random faults, with rate λ, the reliability of a 
system is traditionally defined as: 
$$R(t) = e^{-\lambda t} \tag{4-1}$$
Redundancy has been an efficient way of protecting memory systems. A non-redundant memory system fails if a fault occurs in one of its words. Faults are assumed to follow a Poisson process, which allows the analyst to bound the SEE rate at any given confidence level, and failures are assumed to be statistically independent. The reliability of the memory system is then the product of the reliabilities of all its N words:
$$R(t) = e^{-\lambda W N t} \tag{4-2}$$
Systems can be serial or parallel, and conventional formulas can be applied to each. In order to understand the statistical reasoning behind the reliability block formulas, some notation and definitions are given in Table 4-1.
Reliability of a serial system with “n” units
When all the independent units of a system are connected serially, the whole system fails if any of its units fails (Figure 4-1).
 
Figure 4-1 Serially Connected Elements 
 
  
Table 4-1 Notation

| Symbol | Quantity |
|---|---|
| λ | Failure rate, bits/s |
| λc | Failure rate in cache memory, bits/s |
| λp | Failure rate in CPU, bits/s |
| W | No. of bits per word, bits |
| r(t) | Reliability of one word |
| Rr(t), R1(t), R2(t) & R3(t) | Reliabilities in the RAM & different cache levels |
| RCPU(t) | The reliability of the CPU |
| Ni | Total No. of instructions of type i |
| H1, M1 | Hit & miss rate for L1 (local) |
| H2, M2 | Hit & miss rate for L2 (local) |
| H3, M3 | Hit & miss rate for L3 (global) |
| Hm, Mm | Hit & miss rate for RAM (global) |
| X2 = H2 M1 | Rate of accessing L2 |
| X3 = H3 - (H1 + M1 H2) | Rate of accessing L3 |
| Xm = 1 - H1 - X2 - X3 | Rate of accessing RAM |
| f | Clock frequency, Hz |
| IPS0 | Instructions per second (single core) |
| IPCN | Instructions per cycle (N core) |
| IPSN = IPCN f | Instructions per second (N core) |
| SP1 = IPCN f / IPS0 | Speedup caused by the multicore |
| m | No. of pipeline stages |
| SP2 = Ni m / (m + Ni - 1) | Speedup caused by the pipeline |
| S | The error rate from the injection experiment (sensitivity ratio) |
| Ni0 | No. of instructions in the CPU (single core) |
| Nc0 | No. of cycles (single core) |
| N1, N2 & N3 | No. of cache words used at runtime |
| Sip | Sensitivity of instructions type i after protection |
| Nip | No. of protected instructions type i |
| Scpup, Smp | Sensitivities of CPU & memory instructions after protection |
| Ncpup, Nmp | No. of CPU & memory instructions protected |
| l | Total No. of instruction types of the benchmark under examination |
| R1inj(t), R2inj(t), R3inj(t), Rrinj(t) & RCPUinj(t) | The reliability of the different levels of cache, the RAM & the CPU, obtained from the experimental injection part |
 
The total reliability Rt(t) of the system is given by Equation (4-3), which reduces to Equation (4-4) if all n units are identical:
$$R_t(t) = R_1(t)\,R_2(t) \cdots R_n(t) \tag{4-3}$$
$$R_t(t) = R_i(t)^n \tag{4-4}$$
4.1 Software TMR Reliability Prediction Model  
TMR [143], or triple modular redundancy, is the best-known term in computer redundancy and reliability. The basic TMR system comprises three identical units, and the output is decided using a majority voting system. TMR is shown in Figure 4-2: it consists of three identical units, two of which are redundant, whose outputs feed a voter. If there is an independent fault in one of the units, it is masked, and the output remains correct.
One of the most widespread and powerful methods for estimating reliability is Markov modelling. A Markov process model is built on the system states and the transitions between them. At any given time, the system state can be fully described, and the state transitions describe the behaviour of the modules as they fail and are repaired.
The failure rate of each individual word in the TMR system is $\lambda_W$, and the Markov chain is given in Figure 4-3.
 
Figure 4-2 Triple Modular Redundancy with Voting 
 
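[Markov chain diagram: state 0 → state 1 with probability 3λW∆t; state 1 → state 2 with probability 2λW∆t; state 0 remains with probability 1-3λW∆t; state 1 remains with probability 1-2λW∆t]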
 
Figure 4-3 Markov TMR States 
State 0 represents the state where all the TMR bits in the words are correct. State 1 represents the case 
where a fault has occurred in any of the three replicated words. State 2 represents the state where more than 
one word has an error. 
The following set of differential Equations (4-5), (4-6) and (4-7) describes the dynamics of the state probabilities:

$$\frac{dP_0(t)}{dt} = -3\lambda_W P_0(t) \tag{4-5}$$
$$\frac{dP_1(t)}{dt} = 3\lambda_W P_0(t) - 2\lambda_W P_1(t) \tag{4-6}$$
$$\frac{dP_2(t)}{dt} = 2\lambda_W P_1(t) \tag{4-7}$$
 
Using Laplace transforms [143], and assuming that the system starts without faults at time t = 0, meaning P0(0) = 1 and P1(0) = P2(0) = 0, the following expressions are obtained:

$$P_0(t) = e^{-3\lambda_W t} \tag{4-8}$$
$$P_1(t) = 3e^{-2\lambda_W t} - 3e^{-3\lambda_W t} \tag{4-9}$$
$$P_2(t) = 1 - 3e^{-2\lambda_W t} + 2e^{-3\lambda_W t} \tag{4-10}$$
 
The reliability using TMR is one minus the probability of failure, or $1 - P_2(t)$:
$$r(t) = 3e^{-2\lambda_W t} - 2e^{-3\lambda_W t} \tag{4-11}$$

Expression (4-11) represents the reliability of a single word of W bits when protected with instructions TMR.
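The closed-form solutions (4-8) to (4-11) can be checked numerically against the differential Equations (4-5) to (4-7). The sketch below is an illustrative check only; the function names and the forward-Euler integrator are choices made for this example, and `lam_w` stands for the word failure rate.

```python
import math

def tmr_state_probabilities(lam_w, t):
    """Closed-form state probabilities, Equations (4-8) to (4-10)."""
    p0 = math.exp(-3 * lam_w * t)
    p1 = 3 * math.exp(-2 * lam_w * t) - 3 * math.exp(-3 * lam_w * t)
    p2 = 1 - 3 * math.exp(-2 * lam_w * t) + 2 * math.exp(-3 * lam_w * t)
    return p0, p1, p2

def tmr_word_reliability(lam_w, t):
    """Reliability of one TMR-protected word, Equation (4-11)."""
    return 3 * math.exp(-2 * lam_w * t) - 2 * math.exp(-3 * lam_w * t)

def integrate_markov(lam_w, t, steps=100_000):
    """Forward-Euler integration of Equations (4-5) to (4-7) from P0(0) = 1."""
    p0, p1, p2 = 1.0, 0.0, 0.0
    dt = t / steps
    for _ in range(steps):
        d0 = -3 * lam_w * p0
        d1 = 3 * lam_w * p0 - 2 * lam_w * p1
        d2 = 2 * lam_w * p1
        p0, p1, p2 = p0 + d0 * dt, p1 + d1 * dt, p2 + d2 * dt
    return p0, p1, p2
```

The three probabilities sum to one at all times, and the reliability of Equation (4-11) equals P0(t) + P1(t), i.e. the probability of being in a state where at most one replica is faulty.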
4.2 Reliability Prediction of a Memory of N word Size with TMR  
The software TMR is a serial combination of N TMRs: if two errors occur in two different words of the same TMR, causing it to fail, the whole memory system will fail. The expression for the reliability of the TMR-protected memory is:
$$R(t) = \left(3e^{-2\lambda_W t} - 2e^{-3\lambda_W t}\right)^N \tag{4-12}$$
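To give a feel for the gain, the sketch below compares the non-redundant memory of Equation (4-2) with the TMR-protected memory of Equation (4-12). It assumes that the word failure rate equals the per-bit rate multiplied by the word width; the numeric values used with it are hypothetical.

```python
import math

def memory_reliability_unprotected(lam_w, n_words, t):
    """Equation (4-2): N words in series, each with word failure rate lam_w."""
    return math.exp(-lam_w * n_words * t)

def memory_reliability_tmr(lam_w, n_words, t):
    """Equation (4-12): N TMR-protected words in series."""
    word = 3 * math.exp(-2 * lam_w * t) - 2 * math.exp(-3 * lam_w * t)
    return word ** n_words

# Hypothetical values: word failure rate 1e-6 per second, 1000 words, 100 seconds.
unprotected = memory_reliability_unprotected(1e-6, 1000, 100.0)
protected = memory_reliability_tmr(1e-6, 1000, 100.0)
```

For small values of the product of failure rate and time, the TMR word reliability is close to one minus three times the square of that product, so the protected memory degrades far more slowly than the unprotected one.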
 
4.3 Reliability Prediction of RAM, Caches & CPU   
4.3.1 RAM, Caches and CPU Reliability Predictions Without Protection 
In this section, the reliability of the whole processing architecture is derived. The reliabilities of the RAM, the caches and the CPU are combined, knowing that the components are serially connected (Figure 4-4), and taking into consideration that the error rate λ changes depending on the component’s cross-section area.
 
 
Figure 4-4 Memory Modules Connection in Software Perspective 
The reliability of each component is given by the following expressions:
$$R_1(t) = e^{-W H_1 \lambda_c t \sum_{i=0}^{l} S_i N_i} \tag{4-13}$$
$$R_2(t) = e^{-W X_2 \lambda_c t \sum_{i=0}^{l} S_i N_i} \tag{4-14}$$
$$R_3(t) = e^{-W X_3 \lambda_c t \sum_{i=0}^{l} S_i N_i} \tag{4-15}$$
$$R_r(t) = e^{-W X_r \lambda_r t \sum_{i=0}^{l} S_i N_i} \tag{4-16}$$
$$R_{CPU}(t) = e^{-W \frac{\sum_{i=0}^{l} S_i N_i}{SP_1 SP_2} \lambda_p t} \tag{4-17}$$
 
In addition to the error rates λ, we also introduce a new variable: the instruction sensitivity Si, which changes from one instruction type to another for each benchmark. The sensitivity of an instruction type is the total number of errors caused by the injection experiments divided by the total number of injections. The sensitivity is expected to drop once protection code is applied to the benchmarks. The same prediction model can be used when the sensitivities are obtained from physical radiation tests. For small benchmarks, the number of instructions is equal to the number of injections, but this may differ for large benchmarks, where a sample large enough for a specific confidence level must be injected for valid results.
In order to determine the sensitivity of an instruction type in a certain benchmark, all the instructions of this type must be injected (or a statistically significant sample, in case the benchmark has a large number of instructions).
The reliability of the caches and the RAM is affected by the access rates: the first level of cache has the highest access rate, and the rate drops with every subsequent cache level down to the RAM, which has the lowest access rate.
In the CPU, the reliability is affected by multiple factors such as the speedup caused by the multicore, SP1, and the speedup caused by the number of pipeline stages, SP2 (Equation (4-17)). We assume that the speedup improves the reliability, since it shortens the execution time and therefore the time of exposure to SEEs.
The reliability R(t) of the whole system is given by Equation (4-18):
$$R(t) = R_r(t)\,R_3(t)\,R_2(t)\,R_1(t)\,R_{CPU}(t) \tag{4-18}$$
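The unprotected model of Equations (4-13) to (4-18) can be evaluated directly once the parameters are known. The following is a minimal sketch; the function name, the parameter values and the instruction profile are hypothetical and serve only to show how the terms combine.

```python
import math

def unprotected_reliability(t, profile, W, H1, X2, X3, Xr,
                            lam_c, lam_r, lam_p, SP1, SP2):
    """System reliability of Equation (4-18) for unprotected code.

    profile is a list of (S_i, N_i) pairs: sensitivity and instruction
    count for each instruction type i.
    """
    load = sum(s_i * n_i for s_i, n_i in profile)           # sum of S_i * N_i
    r1 = math.exp(-W * H1 * lam_c * t * load)               # (4-13) L1 cache
    r2 = math.exp(-W * X2 * lam_c * t * load)               # (4-14) L2 cache
    r3 = math.exp(-W * X3 * lam_c * t * load)               # (4-15) L3 cache
    rr = math.exp(-W * Xr * lam_r * t * load)               # (4-16) RAM
    r_cpu = math.exp(-W * load * lam_p * t / (SP1 * SP2))   # (4-17) CPU
    return rr * r3 * r2 * r1 * r_cpu                        # (4-18)

# Hypothetical parameter values, for illustration only.
params = dict(W=64, H1=0.90, X2=0.06, X3=0.03, Xr=0.01,
              lam_c=1e-12, lam_r=1e-12, lam_p=1e-12, SP1=3.0, SP2=4.0)
profile = [(0.6, 1200), (0.4, 800)]   # (S_cpu, N_cpu), (S_m, N_m)
r_day = unprotected_reliability(86400.0, profile, **params)
```

Because every component is an exponential, the product collapses into a single exponential whose exponent is the sum of the component exponents.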
4.3.2 RAM, Caches and CPU Reliability Predictions when Protected with ACEDR 
The reliability predictions of the cache levels, RAM and CPU when the ACEDR (ITMR) protection is applied are given by the following expressions:
$$R_1(t) = \prod_{i=0}^{l} \left(3e^{-2 W S_i \lambda_c t} - 2e^{-3 W S_i \lambda_c t}\right)^{H_1 N_i} \tag{4-19}$$
$$R_2(t) = \prod_{i=0}^{l} \left(3e^{-2 W S_i \lambda_c t} - 2e^{-3 W S_i \lambda_c t}\right)^{X_2 N_i} \tag{4-20}$$
$$R_3(t) = \prod_{i=0}^{l} \left(3e^{-2 W S_i \lambda_c t} - 2e^{-3 W S_i \lambda_c t}\right)^{X_3 N_i} \tag{4-21}$$
$$R_r(t) = \prod_{i=0}^{l} \left(3e^{-2 W S_i \lambda_r t} - 2e^{-3 W S_i \lambda_r t}\right)^{X_r N_i} \tag{4-22}$$
$$R_{CPU}(t) = \prod_{i=0}^{l} \left(3e^{-2 W S_i \lambda_p t} - 2e^{-3 W S_i \lambda_p t}\right)^{\frac{N_i}{SP_1 SP_2}} \tag{4-23}$$
 
The reliability of the whole protected system combined is:
$$R_{inst}(t) = R_r(t)\,R_3(t)\,R_2(t)\,R_1(t)\,R_{CPU}(t) \tag{4-24}$$
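The protected model can be evaluated in the same way as the unprotected one. The sketch below reads the CPU exponent in Equation (4-23) as Ni/(SP1 SP2), in line with Equation (4-36); this reading, the function names and all parameter values are assumptions made for illustration.

```python
import math

def tmr(rate, t):
    """TMR reliability of one element with failure rate `rate`, as in (4-11)."""
    return 3 * math.exp(-2 * rate * t) - 2 * math.exp(-3 * rate * t)

def itmr_reliability(t, profile, W, H1, X2, X3, Xr,
                     lam_c, lam_r, lam_p, SP1, SP2):
    """System reliability of Equation (4-24) for ITMR-protected code.

    profile is a list of (S_i, N_i) pairs, one per instruction type.
    """
    r = 1.0
    for s_i, n_i in profile:
        cache = tmr(W * s_i * lam_c, t)
        ram = tmr(W * s_i * lam_r, t)
        cpu = tmr(W * s_i * lam_p, t)
        r *= cache ** ((H1 + X2 + X3) * n_i)   # (4-19) * (4-20) * (4-21)
        r *= ram ** (Xr * n_i)                 # (4-22)
        r *= cpu ** (n_i / (SP1 * SP2))        # (4-23), exponent as in (4-36)
    return r

# Hypothetical parameter values, for illustration only.
params = dict(W=64, H1=0.90, X2=0.06, X3=0.03, Xr=0.01,
              lam_c=1e-12, lam_r=1e-12, lam_p=1e-12, SP1=3.0, SP2=4.0)
profile = [(0.6, 1200), (0.4, 800)]
r_inst_day = itmr_reliability(86400.0, profile, **params)
```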
 
4.3.3 Reliability Equations Obtained From the Injection Experiments 
In order to compare against the theoretical predictions of the reliability of the protected code in Section 4.3.2, we modelled the reliability of the whole processing architecture chain after it is protected with ITMR, using the results of the fault injection experiment. The difference between the reliability of the protected and the unprotected codes lies in the sensitivity of the instructions to the injected errors, which drops after the protection is added. This leads to the following equations for the reliability of every processing component (the different levels of cache, the RAM and the CPU).
  
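$$R_{1inj}(t) = e^{-W H_1 \lambda_c t \sum_{i=0}^{l} S_{ip} N_{ip}} \tag{4-25}$$
$$R_{2inj}(t) = e^{-W X_2 \lambda_c t \sum_{i=0}^{l} S_{ip} N_{ip}} \tag{4-26}$$
$$R_{3inj}(t) = e^{-W X_3 \lambda_c t \sum_{i=0}^{l} S_{ip} N_{ip}} \tag{4-27}$$
$$R_{rinj}(t) = e^{-W X_r \lambda_r t \sum_{i=0}^{l} S_{ip} N_{ip}} \tag{4-28}$$
$$R_{CPUinj}(t) = e^{-W \frac{\sum_{i=0}^{l} S_{ip} N_{ip}}{SP_1 SP_2} \lambda_p t} \tag{4-29}$$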
The reliability of the whole system is given by:
$$R_{inj}(t) = R_{1inj}(t)\,R_{2inj}(t)\,R_{3inj}(t)\,R_{rinj}(t)\,R_{CPUinj}(t) \tag{4-30}$$
4.3.3.1 Application of the Reliability Predictions  
The objective of this section is to identify the precision of our new prediction model by comparing Equation (4-30) to Equation (4-24). We compare the results obtained from the theoretical model with the results obtained from the experimental injection later in Section 5.3, where the injection experiment is necessary to obtain the different instruction sensitivities used in the prediction models.
When the injection experiments on all instruction types were performed in our previous work [12], it was observed that two generic categories of instruction types cause significant changes in the error rates. These categories are the CPU instructions (arithmetic and logic operations) and the memory instructions (creating new memory addresses, and loading from and storing to memory). We will apply the reliability equations shown previously to the CPU and memory instruction categories.
This means that we have two types of sensitivities: Scpu, representing the sensitivity of the CPU instructions, and Sm, representing the sensitivity of the memory instructions.
The resulting equation for the reliability prediction when no protection is applied is obtained using Equation (4-18):

$$R(t) = e^{-tW\left(\lambda_r X_r + \frac{\lambda_p}{SP_1 SP_2} + \lambda_c (H_1 + X_2 + X_3)\right)\left(S_{cpu} N_{cpu} + S_m N_m\right)} \tag{4-31}$$
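The step from Equation (4-18) to Equation (4-31) can be written out explicitly. The short derivation below is a sketch that assumes only the two instruction categories introduced above and writes the RAM access rate as X_r, as in Equation (4-16).

```latex
\begin{align*}
\sum_{i=0}^{l} S_i N_i &= S_{cpu} N_{cpu} + S_m N_m \\
R(t) &= R_r(t)\,R_3(t)\,R_2(t)\,R_1(t)\,R_{CPU}(t) \\
     &= e^{-W X_r \lambda_r t \sum S_i N_i}\;
        e^{-W (H_1 + X_2 + X_3) \lambda_c t \sum S_i N_i}\;
        e^{-W \frac{\lambda_p}{SP_1 SP_2} t \sum S_i N_i} \\
     &= e^{-tW\left(\lambda_r X_r + \lambda_c (H_1 + X_2 + X_3)
        + \frac{\lambda_p}{SP_1 SP_2}\right)\left(S_{cpu} N_{cpu} + S_m N_m\right)}
\end{align*}
```

The product of exponentials becomes a single exponential because the exponents add, and the common factor, the sum of S_i N_i, is taken out of the bracket.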
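The reliability prediction after adding protection is obtained from Section 4.3.2. Using Equations (4-19), (4-20), (4-21), (4-22) and (4-23), we obtain the following expressions:
$$R_1(t) = \left(3e^{-2 W S_{cpu} \lambda_c t} - 2e^{-3 W S_{cpu} \lambda_c t}\right)^{H_1 N_{cpu}} \left(3e^{-2 W S_m \lambda_c t} - 2e^{-3 W S_m \lambda_c t}\right)^{H_1 N_m} \tag{4-32}$$
$$R_2(t) = \left(3e^{-2 W S_{cpu} \lambda_c t} - 2e^{-3 W S_{cpu} \lambda_c t}\right)^{X_2 N_{cpu}} \left(3e^{-2 W S_m \lambda_c t} - 2e^{-3 W S_m \lambda_c t}\right)^{X_2 N_m} \tag{4-33}$$
$$R_3(t) = \left(3e^{-2 W S_{cpu} \lambda_c t} - 2e^{-3 W S_{cpu} \lambda_c t}\right)^{X_3 N_{cpu}} \left(3e^{-2 W S_m \lambda_c t} - 2e^{-3 W S_m \lambda_c t}\right)^{X_3 N_m} \tag{4-34}$$
$$R_r(t) = \left(3e^{-2 W S_{cpu} \lambda_r t} - 2e^{-3 W S_{cpu} \lambda_r t}\right)^{X_r N_{cpu}} \left(3e^{-2 W S_m \lambda_r t} - 2e^{-3 W S_m \lambda_r t}\right)^{X_r N_m} \tag{4-35}$$
$$R_{CPU}(t) = \left(3e^{-2 W S_{cpu} \lambda_p t} - 2e^{-3 W S_{cpu} \lambda_p t}\right)^{\frac{N_{cpu}}{SP_1 SP_2}} \left(3e^{-2 W S_m \lambda_p t} - 2e^{-3 W S_m \lambda_p t}\right)^{\frac{N_m}{SP_1 SP_2}} \tag{4-36}$$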
The reliability of the whole system is given by Equation (4-24). When taking the CPU and memory instruction categories, the reliability of the protected code from the injection experiment is obtained using Equation (4-30). The reliability of the whole architecture is:
$$R_{inj}(t) = e^{-tW\left(\lambda_r X_r + \frac{\lambda_p}{SP_1 SP_2} + \lambda_c (H_1 + X_2 + X_3)\right)\left(S_{cpup} N_{cpup} + S_{mp} N_{mp}\right)} \tag{4-37}$$
In this section we have shown the reliability models of the protected and unprotected code, using the software protection techniques that have been developed. The models combine the reliabilities of the components and include multiple parameters, aiming to give reliability predictions that are valid for COTS processing architectures.
First, the reliability of the unprotected processor was given in Equation (4-18), followed by the reliability of the protected system in Equation (4-24), and finally Equation (4-30) gave the reliability obtained from the injection experiment on the protected code.
In order to check the precision of our prediction model, the reliabilities obtained from Equations (4-24) and (4-30) will be compared; this will show how close the reliability obtained from the prediction model is to the reliability obtained from the injection experiment. In this model, we consider Equation (4-30) as the ground truth. Another comparison, between Equations (4-18) and (4-30), will show the reliability gained by using the software protection techniques.
At this point the prediction models have been obtained. In order to perform the different comparisons, we need to obtain the sensitivities of the different instruction types, and ultimately the error rate of each instruction type. In order to obtain the sensitivity of the instructions, injection experiments must be performed on both the protected and the unprotected codes.
4.3.4 Reliability of RAM, Caches and CPU when Protected with Threads TMR  
When protecting the code using TTMR, every function will run on three redundant threads, and at the end of their execution, their results will be compared for error detection and recovery. The protected code is injected once per run, so that every instruction of the benchmarks is injected, one at a time. The injections will be done on all three threads. Giving different names to the functions that the threads execute allows our fault injector to identify every thread and inject it separately from the others. The threads are running on a multicore, meaning that every core will be injected in every single run.
In this case, we consider that the TMR protects three cores with their caches, in addition to part of the RAM and the last level of cache, as shown in Figure 4-5. This follows from the multithreading nature of this protection technique, where each thread runs on a different CPU. The reliability prediction of the cache levels, RAM and CPU when the TMR multithreading protection is applied is given by Equation (4-40).
 
Figure 4-5 Typical Multicore Processing Architecture 
The expression of the reliability of the CPU, the caches and the RAM combined without threads TMR is given by Equation (4-18). The reliability of the whole system using TTMR is given by Rth(t):
$$R_{th}(t) = 3e^{-2Wt\left(\sum_{i=0}^{l} S_i N_i\right)\left(\frac{\lambda_p}{SP_1 SP_2} + \lambda_c H_1 + \lambda_c X_2 + \lambda_c X_3 + \lambda_r X_r\right)} - 2e^{-3Wt\left(\sum_{i=0}^{l} S_i N_i\right)\left(\frac{\lambda_p}{SP_1 SP_2} + \lambda_c H_1 + \lambda_c X_2 + \lambda_c X_3 + \lambda_r X_r\right)} \tag{4-40}$$
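Equation (4-40) has the same form as the TMR word reliability of Equation (4-11), applied to the whole unprotected system. As a short check, writing R(t) for the unprotected reliability of Equation (4-18) and x(t) for its exponent (a symbol introduced here only for brevity):

```latex
\begin{align*}
R(t) &= e^{-x(t)}, \qquad
x(t) = W t \left(\sum_{i=0}^{l} S_i N_i\right)
       \left(\frac{\lambda_p}{SP_1 SP_2} + \lambda_c H_1 + \lambda_c X_2
       + \lambda_c X_3 + \lambda_r X_r\right) \\
R_{th}(t) &= 3e^{-2x(t)} - 2e^{-3x(t)} = 3R(t)^2 - 2R(t)^3
\end{align*}
```

The right-hand side is the probability that at least two of three independent replicas, each with reliability R(t), are fault-free.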
 
4.3.5 Reliability when the Combination of ITMR & TTMR is used
Combining the two protection techniques, TTMR and ITMR, will improve the reliability of the benchmarks. TTMR creates redundant independent threads and runs functions in them concurrently; at the end of their execution, the threads’ results are compared using a voter to decide the correct outcome. ITMR replicates instructions three times and calls a voter function to check them and decide the right outcome out of the three. The combination of the two protection techniques is done by first applying TTMR and then applying ITMR.
The reliability predictions of the cache levels, RAM and CPU when the ITMR protection is applied are given by Equations (4-19), (4-20), (4-21), (4-22) and (4-23). The reliability of the whole protected system using ITMR is given by Rinst(t) in Equation (4-24).
The reliability when the combined mode of protection is used is:
$$R_{com}(t) = 3R_{inst}(t)^2 - 2R_{inst}(t)^3 \tag{4-41}$$
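Equation (4-41) applies a two-out-of-three majority to three replicas that each have reliability Rinst(t). A minimal sketch of that mapping follows; the function name and the example values are hypothetical.

```python
def majority_reliability(r):
    """Probability that at least two of three independent replicas,
    each with reliability r, are correct: 3r^2(1 - r) + r^3 = 3r^2 - 2r^3."""
    return 3 * r ** 2 - 2 * r ** 3

# Hypothetical example: an ITMR-protected replica with reliability 0.9.
r_com = majority_reliability(0.9)
```

The mapping only helps when the replica reliability is above 0.5: at exactly 0.5 it returns 0.5, and below that value voting makes the result worse than a single replica.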
4.3.6 Reliability Model Limitations 
Amongst the model limitations is that it does not consider how much an object will be used, based on code 
paths. Furthermore, the prediction model is estimating the worst-case scenario, and does not consider the case 
where an error has occurred before writing to memory or loading to CPU registers. 
4.4 Summary 
New reliability prediction models have been presented in this chapter. The new models are an effort to model the reliability of the whole processing architecture without software protection applied. This study also includes modelling the processing architecture when our novel software protection techniques have been applied. The reliability models depend on several parameters, including internal ones (specific to the processing architecture that is being used) and external ones (specific to the environment where the processor is operated). The internal parameters include:
• The width of the word, or the size of a single word in memory, depending on the architecture that has been used (64 bits, 32 bits, 16 bits, etc.); reliability decreases as this parameter increases,
• The total number of instructions; this parameter can be obtained using LLVM’s built-in pass (-instcount [144]). The more instructions the code has, the lower its reliability,
• In addition to the total number of instructions, our model provides a step change in the level of detail, where the number and type of instructions are now included in the reliability modelling,
• Hit and miss rates; these two parameters can be computed for different processing architecture components, such as the caches and the RAM, with the help of Linux tools such as perf [30]. Reliability is improved with a high hit rate and a low miss rate, and vice versa,
• Clock frequency affects reliability positively, i.e. the higher the frequency, the faster the CPU executes its instructions. The same can be said about the instructions per second and the number of pipeline stages.
The external parameters are related to the environment outside the CPU and include the error rates of the processor elements, such as the CPU, the caches and the RAM. Error rates influence reliability negatively. These rates change with the location of the spacecraft in orbit, where the areas near the SAA or near the poles can experience higher error rates than normal. These rates can also be easily influenced by solar activity. The prediction model estimates the worst-case scenario, and does not consider the case where an error has occurred before writing to memory or loading to CPU registers.
In order to validate the reliability model, error injection experiments need to be conducted. This can be 
achieved using software fault injection. For this purpose, new LLVM passes are required to analyse and 
validate our new model. The injection experiment must provide the necessary error rates for the different 
instruction types of the benchmarks under test. The injection is performed at the instruction’s data, which 
results in corrupting the instruction’s results. 
Other parameters that have a significant impact on the reliability are the architecture configuration, especially the CPU and how many levels of cache are incorporated within it. The configuration of the memory and the CPUs has been taken into account in the reliability modelling.
OS libraries such as the multithreading library can add several instructions to the intermediate representation of the code, which can also be verified at the assembly level, making the processing architecture more vulnerable to SEEs and thus affecting reliability negatively. This has been confirmed in [136]; however, the extra instructions in the assembly and intermediate representation were not examined there. This research argues that the vulnerability introduced by using an OS is due to the extra instructions added by the OS libraries when threads are created and joined or terminated. This is considered out of scope for this research.
The next aim, having taken all the key modelled parameters into account, is to validate and understand the accuracy of our new predictions.
Yasser Nezzari                                                      Chapter 5, Automatic Error Detection and Recovery in LLVM 
78 
 
5. AUTOMATIC ERROR DETECTION AND RECOVERY IN LLVM
Multiple FECs protecting both the CPU and the memory system (RAM and caches) of the processor have been implemented using the LLVM compiler, including Hamming, BCH and TMR. In this chapter, we propose “Automatic Compiler Error-Detection and Recovery” (ACEDR), an original software error detection and recovery technique that automatically applies protection code at compile time using the LLVM compiler framework. The implemented error correction code is capable of automatically detecting and correcting errors at runtime, using two compiler passes: an analysis pass and a transformation pass. The analysis pass goes through all the layers of the code to be protected in its intermediate representation and provides information and statistics, such as the number and types of instructions and the dependencies of memory instructions, among other information. The transformation pass adds the protection code where necessary, depending on the information provided by the analysis pass; it adds the redundant and check instructions, using a majority (TMR) algorithm.
This research shows the importance of protecting both the memory and the CPU registers. Unlike most of the literature, which focuses only on protecting the CPU and has no recovery scheme, our work protects both the CPU registers and the memory, and performs both error detection and recovery at runtime. At the start, only partial protection will be added to the code, where first only CPU instructions and then only memory instructions are protected. At the end, the results of combining the protection of both instruction types will be shown. This work is an extension of our initial results [10]. This work has some limitations: it only protects the memory and CPU registers accessed by instructions (not the instructions themselves), meaning that it does not consider bit flips that would transform instructions into other instructions or into jump/conditional jump instructions.
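The effect of the redundant and check instructions can be illustrated outside the compiler. The sketch below is not the ACEDR pass, which operates on LLVM IR; it only mimics the logic in Python for one addition, and the bitwise voter and the function names are assumptions made for the example.

```python
def bitwise_majority(a, b, c):
    """Bit-by-bit majority of three integer copies."""
    return (a & b) | (a & c) | (b & c)

def protected_add(x, y, fault=None):
    """Compute x + y three times and vote on the result.

    fault, if given, is an (index, mask) pair that corrupts one replica
    by XOR-ing its value with mask, simulating an upset.
    Returns the voted value and whether a mismatch was detected.
    """
    copies = [x + y, x + y, x + y]         # triplicated instruction
    if fault is not None:
        index, mask = fault
        copies[index] ^= mask              # simulated bit flips in one replica
    voted = bitwise_majority(*copies)      # recovery: majority value
    detected = any(c != voted for c in copies)
    return voted, detected
```

Because the vote is taken bit by bit, any number of flipped bits confined to a single replica is corrected, which is in line with the claim that the technique can also handle multiple bit upsets.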
The error detection and recovery of ACEDR will be evaluated using fault injection experiments, where all instruction types are injected, resulting in different error types. This work is capable of protecting against SEUs, and can also protect against multiple bit upsets (MBUs). The injection is performed on the instruction’s data, which corrupts the instruction’s results.
This chapter uses the reliability prediction model of Chapter 4, which is capable of estimating the
reliability of the whole processing architecture. The model is based on a Markov chain: the reliabilities of
the different processing elements, such as the CPU, the caches and the RAM, are included, and by
combining them the reliability of the whole processor is modelled. The model predicts the reliability of the
processor when it is unprotected and when it is protected using our software FEC. The model takes into
account multiple parameters. Some are internal and specific to the processor used, such as the number of
cores, the pipeline stages and the clock frequency. Others are external and depend on the environment,
such as the SEU error rate.
Yasser Nezzari                                                      Chapter 5, Automatic Error Detection and Recovery in LLVM 
79 
 
5.1 Experimental Setup 
In these experiments, analysis and optimisation passes have been implemented in order to protect the
memory of the processor architecture. The analysis pass checks the code and finds the memory operations
(allocation, read/write). The optimisation pass adds protection code to the IR of the code; it detects the
type of each memory instruction and invokes the appropriate technique to protect it.
Memory read and write operations could be protected using NMR or FEC; in the first experiment, TMR
and FEC have been used. After the high-level code to be protected is transformed to its intermediate
representation (IR), the analysis and transformation passes are run on it following the algorithm shown in
Figure 5-1:
• Detect the instruction in the IR and check its type.
• If the instruction is a memory write or read (W/R) instruction, either TMR or FEC is used. With FEC,
data is encoded when writing to the protected allocation and decoded when reading from it. The choice
of FEC technique depends on the error rate: for single-bit flips, a single-error-correcting code such as
the Hamming code could be used; for two or more errors, a BCH code is used.
• For non-memory instructions, an appropriate technique can be chosen depending
on the reliability and cost (required CPU and memory instructions).
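The decision list above can be sketched as a small selection function. This is only an illustration under stated assumptions: the parameter names `bit_errors_per_word` and `priority` are hypothetical, the option of protecting memory with TMR instead of FEC is left out for brevity, and the mapping from priority level to technique for non-memory instructions is a guess based on the labels of Figure 5-1, not something the text specifies.

```python
def choose_protection(kind, bit_errors_per_word=1, priority="high"):
    """Pick a protection scheme for one instruction, following the list above.

    kind: "memory" for read/write instructions, anything else for CPU instructions.
    bit_errors_per_word: assumed number of bit flips the code must correct.
    priority: hypothetical priority level for non-memory instructions.
    """
    if kind == "memory":
        # Single-bit flips: a single-error-correcting code such as Hamming.
        # Two or more errors: a multi-error-correcting code such as BCH.
        return "Hamming" if bit_errors_per_word <= 1 else "BCH"
    # Non-memory instructions: Figure 5-1 names checkpoint/rollback and TMR;
    # which priority maps to which technique is an assumption of this sketch.
    return "TMR" if priority == "high" else "checkpoint-rollback"
```

For example, `choose_protection("memory", bit_errors_per_word=2)` selects BCH, while a CPU instruction with the default priority is given TMR.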
When implementing TMR or the single-error-correcting Hamming code, the overhead in this experiment is
expected to be minimal compared to the state-of-the-art techniques implemented using the LLVM compiler.
The BCH code, however, is expected to generate a considerable overhead. This could be reduced by
automatic parallelisation of the BCH encode/decode functions using compiler optimisation techniques such
as loop parallelisation and improved data locality. Another solution is to group the data being checked:
for now, every 32 bits are encoded/decoded separately, whereas BCH is able to encode/decode longer data
and to detect and correct multiple errors.
[Flowchart: Start; detect instructions; if the instruction is a memory instruction, write instructions are
encoded and read instructions are decoded depending on the error rate (memory protection using TMR or
EDAC); otherwise, for CPU instructions, assign a priority to the instruction, check the priority level, and
use compiler checkpoint/rollback or compiler TMR depending on the priority level.]
Figure 5-1 Overview of New Compiler Pass 
5.1.1 Iterating Through LLVM-IR Code Layers  
This section begins the implementation of the key features of our new passes using the LLVM compiler: we
start by protecting the memory system using TMR and different FEC codes, and then combine different
techniques for more reliability.
Before detailing the addition of instructions, the different layers of LLVM IR and their hierarchical
organisation are presented. Figure 5-2 shows the different layers of code in the LLVM IR. Code in LLVM IR
is divided into the following layers [145]:
• A Module is the top-level LLVM class and represents the highest-level structure; every other layer is
contained in it. A module can be either the translation unit of the original program or a collection of
several translation units.
• Modules contain Functions; a Function is a class representing a single procedure containing chunks of
executable code.
• Functions contain BasicBlocks, each representing a single-entry, single-exit section of the code.
A BasicBlock holds a list of instructions, the last of which is a terminator instruction.
• An Instruction is a single code statement. Each instruction has an opcode and a parent (BasicBlock).
An instruction's operands can be accessed through the User class, a base class of Instruction.
Iterating through the different layers is possible inside the pass. Figure 5-2 shows the code for the
different types of iteration:
• An iterator over a Module returns all the Functions of this module.
• An iterator over a Function returns all the BasicBlocks of this function.
• An iterator over a BasicBlock returns all the Instructions of this BasicBlock.
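virtual bool runOnModule(Module &M) {
  // Module layer: visit every Function of the module
  for (Module::iterator F = M.begin(), E = M.end(); F != E; ++F) {
    // Function layer: visit every BasicBlock of the function
    for (Function::iterator bb = F->begin(), be = F->end(); bb != be; ++bb) {
      // BasicBlock layer: visit every Instruction of the block
      for (BasicBlock::iterator i = bb->begin(), ie = bb->end(); i != ie; ++i) {
        …
      }
    }
  }
  …
}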
 
Figure 5-2 Iterating Through Different LLVM IR Code Layers 
Another tool that has been used to detect the type of an instruction is dyn_cast<>, which can be used to
find instructions of a given type inside a BasicBlock or a Function. It checks whether its operand is of a
certain instruction type: if so, it returns a pointer to it; if not, it returns a null pointer [146]. In
Figure 5-3, auto* op = dyn_cast<BinaryOperator>(&I) is used to find binary operations such as addition
and to point to them through the pointer op.
for (auto& I : B) {
    if (auto* op = dyn_cast<BinaryOperator>(&I)) {
        …
    }
}
 
Figure 5-3 Detecting a Specific Instruction Type in the Intermediate Representation 
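To make the layered traversal and the type test concrete without an LLVM build, the following is a toy Python model of the hierarchy. The classes are hypothetical stand-ins for LLVM's Module, Function, BasicBlock and Instruction, not LLVM bindings, and `isinstance` only plays the role that dyn_cast<> plays in the real pass. The first function mimics the kind of statistics an analysis pass might collect; the second mimics the filter of Figure 5-3.

```python
from collections import Counter
from dataclasses import dataclass, field

@dataclass
class Instruction:
    opcode: str

@dataclass
class BinaryOperator(Instruction):
    pass

@dataclass
class BasicBlock:
    instructions: list = field(default_factory=list)

@dataclass
class Function:
    name: str
    blocks: list = field(default_factory=list)

@dataclass
class Module:
    functions: list = field(default_factory=list)

def opcode_statistics(module):
    """Walk Module -> Function -> BasicBlock -> Instruction and count opcodes."""
    counts = Counter()
    for function in module.functions:
        for block in function.blocks:
            for inst in block.instructions:
                counts[inst.opcode] += 1
    return counts

def binary_operators(module):
    """Collect the instructions that are binary operators (dyn_cast analogue)."""
    return [inst
            for function in module.functions
            for block in function.blocks
            for inst in block.instructions
            if isinstance(inst, BinaryOperator)]

module = Module([Function("main", [BasicBlock([
    Instruction("alloca"), Instruction("store"), Instruction("load"),
    BinaryOperator("add"), Instruction("ret")])])])
stats = opcode_statistics(module)
found = binary_operators(module)
```

Here `stats` holds one count per opcode and `found` contains only the `add` instruction.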
 
5.1.2 Experiment 1: Software TMR Using Instructions   
In order to implement instruction-level TMR, two replicas of each original instruction are needed.
Instructions can be replicated either by cloning the original ones or by building new instructions. Here,
the aim is to implement the algorithm shown in Figure 5-4, where the numbers 1, 2 and 3 denote the
instructions that are TMR-ed.
[Flowchart: Start TMR; decision nodes Cmp(1,2) And Cmp(1,3) And Cmp(2,3), Cmp(1,2) And Cmp(1,3),
Cmp(1,2) And Cmp(2,3), and Cmp(1,3) And Cmp(2,3), each with yes/no branches; outcomes Return 1 or 2 or 3,
Return 1, Return 2 and Return 3; End TMR.]
Figure 5-4 TMR Algorithm 
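As an illustration of the voting idea behind Figure 5-4, here is a minimal majority voter in Python. It is a standard two-out-of-three vote rather than a transcription of the figure: the figure combines the comparison results with and instructions, whereas this sketch checks the pairs one after another, and raising an exception when all three copies disagree is a choice made for this sketch.

```python
def tmr_vote(v1, v2, v3):
    """Return the value held by at least two of the three redundant copies.

    Raises ValueError when all three copies differ: the error is detected
    but no majority exists, so it cannot be recovered.
    """
    if v1 == v2 or v1 == v3:
        return v1          # copy 1 agrees with at least one other copy
    if v2 == v3:
        return v2          # copy 1 is the corrupted one
    raise ValueError("all three copies disagree")
```

A corrupted single copy is outvoted regardless of its position, for example `tmr_vote(9, 5, 5)` and `tmr_vote(5, 5, 9)` both give 5.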
Instruction cloning is achieved using clone() [147], which returns a replica of the instruction that is
identical to the original except that it has no parent BasicBlock (it is not inserted into a BasicBlock) and
has no name. In order to successfully clone an instruction, the clone should be assigned a name and a parent
BasicBlock, meaning it should be included in the control flow graph (CFG). Skipping this step is the reason
why new instructions are deleted by the optimiser, even when compiling at the -O0 optimisation level.
New instructions are built using IRBuilder [148], which provides a uniform API [146] for creating
instructions and inserting them into a BasicBlock, either at the end of the BasicBlock or at a specific
location. Figure 5-5 shows an example of the IR before and after cloning an add instruction.
Before our pass:
%add = add nsw i32 %0, %1
After our pass:
%add1 = add nsw i32 %0, %1
%add2 = add nsw i32 %0, %1
%add3 = add nsw i32 %0, %1
 
Figure 5-5 Replicating a Binary Instruction “add” 
The instruction TMR is implemented in the following steps:
1. Create a pass that goes through all the Functions in the module, then through all the BasicBlocks of
each function, and finally through all the instructions of each BasicBlock, until an alloca instruction
is detected. This instruction allocates memory according to the size of the variable to be stored in it
(Figure 5-6). The alloca instruction is replicated three times: the first two allocations hold the
redundant values, and the third alloca, or allocaTMR, holds the protected value after it has been
checked by the TMR function.
[Diagram: the original alloca is replicated into alloca1, alloca2 and alloca3 (TMR).]
Figure 5-6 Replicating the Allocation of Memory Instruction
[Diagram: the original store (store1) and the allocations alloca1, alloca2 and alloca3 (allocaTMR).]
Figure 5-7 Storing Copies of the Original Data in the Newly Created Allocations
 
2. Iterate again through the BasicBlock and detect the store instructions, which store a value in an
allocation. If the store’s address is the original alloca instruction from step 1, two new stores to the
replicated allocations are created (Figure 5-7).
3. Iterate through all the load instructions of all the BasicBlocks; a load instruction loads the content
of a certain location in memory. If the load’s address matches the original alloca instruction of step 1,
it is replaced with allocaTMR, the protected address.
4. The allocaTMR is protected using the TMR function (Figure 5-4 and Figure 5-8). The TMR function
is called every time the three stored values, the original and the two redundant ones, need to be
compared.
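The runtime effect of steps 1 to 4 can be pictured with a small Python model of a single protected allocation. This is a behavioural sketch, not the generated IR: the class name `ProtectedCell` is hypothetical, and voting at load time with a write-back of the majority value is a simplifying assumption (in the pass, the voted value lives in the separate allocaTMR allocation).

```python
from collections import Counter

class ProtectedCell:
    """Toy runtime model of one TMR-protected allocation."""

    def __init__(self):
        # Stands in for the original alloca and its two redundant replicas.
        self.copies = [0, 0, 0]

    def store(self, value):
        # Step 2: every store to the original allocation is mirrored to the replicas.
        self.copies = [value, value, value]

    def load(self):
        # Steps 3 and 4: a load reads the voted value instead of the raw allocation.
        value, votes = Counter(self.copies).most_common(1)[0]
        if votes < 2:
            raise RuntimeError("no majority: error detected but not recoverable")
        self.copies = [value, value, value]   # recovery: repair the corrupted copy
        return value

cell = ProtectedCell()
cell.store(42)
cell.copies[1] ^= 1 << 7      # simulate a single bit flip in one replica
recovered = cell.load()
```

After the simulated upset, `load` still returns the stored value and all three copies are consistent again.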
The control flow graph in Figure 5-8 shows that the TMR is added at the instruction level of the code. It
was obtained using the LLVM “-view-cfg” pass; the instructions added by this function are detailed in
Table 5-1. This graph verifies the TMR algorithm in Figure 5-4.
 
Figure 5-8 The CFG of the Implemented TMR 
The TMR function uses compare instructions to make decisions, and then jumps to the appropriate block
accordingly. The jumps are conditional or unconditional branch instructions; both cmp and branch
instructions are created using the IRBuilder. An LLVM optimisation pass prevented the addition of branch
instructions, so it had to be deactivated: LLVM was rebuilt and installed without it. In the decision
making, the outcomes of the compare instructions are combined using and instructions.
TMR Requirements:
In our implementation, every alloca instruction in the LLVM IR code adds three new alloca instructions,
and every store instruction adds three new stores and a call to the TMR function.
 
Table 5-1 TMR Additional Instructions
Original instruction | Additional instructions
alloca               | 3 alloca
store                | 3 stores + 1 call
 
Table 5-1 shows the instructions added by a TMR function call; these results match Figure 5-8. In the
worst case, all the TMR instructions are executed at run time; however, the fewer errors occur, the fewer
instructions and blocks are executed.
Figure 5-9 shows the overall implementation of the TMR algorithm. It starts by iterating through all the
functions of the module, then through all the BasicBlocks and then through all the instructions. The alloca
instructions are identified and replicated. Each store instruction is checked to see whether it stores to
one of the allocations previously identified; if so, replicas of the store are created for the replicas of
the allocation. Each time a load from a protected allocation is found, it is replaced with a load from the
TMR function.
5.1.3 Experiment 2: In-line Software FEC  
This section presents two methods for automatically implementing any FEC code in order to protect the
memory of a program written in a high-level programming language at runtime. As a proof of concept,
protection of integers and floating-point values has been implemented using selected FEC algorithms: the
Hamming code and BCH codes.
There are two methods to implement the FEC code:
• The FEC code is first compiled to LLVM IR and then linked with the benchmark to be protected, which
must also be in LLVM's intermediate representation. The pass adds extra instructions, creating new
allocations for the FEC check bits and calls to the FEC encoding and decoding functions (Figure 5-10).
• The FEC is implemented directly in the pass, without using the linker, meaning the FEC encoding and
decoding functions are automatically generated by our compiler pass (Figure 5-11).
Table 5-2 TMR Function Instructions 
TMR Function instructions 
4 alloca 
6 stores 
7 loads 
3 cmp 
3 and 
3 conditional branch 
3 unconditional branch 
1 return instruction 
 
[Flowchart: iterate through the Module, Function and BasicBlock layers down to the instructions (Inst 1,
Inst 2, ...); replicate each alloca; iterate and find all store and load instructions; if store, store the
same value in the replicated allocations and then call TMR; if load, replace the address of the load with
the protected one. The TMR block (Start TMR to End TMR) contains the comparisons of Figure 5-4 and returns
the correct loading address.]
Figure 5-9 TMR Implementation  
 
Figure 5-10 Implementation of the FEC Using the Linker 
[Flowchart: benchmark in C; run the pass on the benchmark and produce its intermediate representation
with ECC; benchmark with ECC in intermediate representation; Clang; assembly or executable code of the
benchmark with ECC.]
Figure 5-11 Implementation of the FEC Without the Linker 
The following steps explain how the pass runs (Figure 5-11):
1. The pass runs through the intermediate representation of the code and detects all the memory
allocations.
2. All the memory allocations are defined as the protected area of the code, and new allocations are
added to store the check bits of every allocation in the code.
3. If a store to a protected allocation is detected, the stored value is encoded and the check bits
are stored in the new allocations created in step 2.
4. Each time a new store to the allocations of step 1 is detected, new check bits are created to replace
the check bits of step 3.
5. In the decoding part, the decoder function is called every time a read from a protected allocation is
found. If an error is detected, it is corrected and the damaged data and check bits are replaced.
Figure 5-12 shows the overall implementation of the FEC algorithms using the second method, where the
pass generates all the FEC code. The pass iterates through all the layers of the code, identifies all the
allocations, and creates check-bit allocations for every allocation encountered. All the stores to these
allocations are encoded and their check bits are stored in the newly created check-bit allocations. With
every load from a protected allocation, the data is decoded and corrected when necessary, and the data and
check bits are replaced with the corrected values.
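The encode-on-store and decode-on-load scheme can be illustrated with a textbook single-error-correcting Hamming code whose check bits are kept separately from the 32-bit data word, mirroring the separate check-bit allocations described above. This Python sketch is not the generated IR and makes two assumptions: the function names are hypothetical, and it corrects one flipped bit only (detecting double-bit errors would need an additional overall parity bit).

```python
DATA_BITS = 32

def _data_positions(n=DATA_BITS):
    """1-based codeword positions used by data bits (every non-power of two)."""
    positions, p = [], 1
    while len(positions) < n:
        if p & (p - 1):              # non-zero when p is not a power of two
            positions.append(p)
        p += 1
    return positions

POSITIONS = _data_positions()

def encode(word):
    """Return the check bits for a 32-bit word (stored separately from the data)."""
    check = 0
    for bit, pos in enumerate(POSITIONS):
        if (word >> bit) & 1:
            check ^= pos             # XOR of the positions of all set data bits
    return check

def decode(word, check):
    """Return (word, check) after repairing at most one flipped bit."""
    syndrome = encode(word) ^ check
    if syndrome == 0:
        return word, check                        # no error
    if syndrome & (syndrome - 1) == 0:
        return word, check ^ syndrome             # a check bit was hit; data intact
    if syndrome in POSITIONS:
        word ^= 1 << POSITIONS.index(syndrome)    # flip the damaged data bit back
        return word, check
    raise ValueError("uncorrectable error")

value = 0xDEADBEEF
check_bits = encode(value)                        # done on every protected store
fixed, _ = decode(value ^ (1 << 13), check_bits)  # done on every protected load
```

In this example the decoder undoes the injected flip of bit 13, so `fixed` equals the original value.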
5.1.4 Discussion & Evaluation 
After the implementation, the work has been evaluated in terms of delay and the size of the additional
code. Different tools have been used for this evaluation:
1. perf, to measure delays and the number of cycles [149].
2. valgrind [150], which has many profiling options, including callgrind [151] for CPU profiling and
massif [152] for stack and heap memory profiling.
3. A custom profiling tool, implemented as an LLVM pass that counts the number of instructions in our code.
Table 5-3 shows the CPU profiling results for the Fibonacci series benchmark before and after implementing
the following techniques: TMR, Hamming, TMR + Hamming and BCH code.
 
Figure 5-12 ECC Implementation 
Table 5-3 Profiling Results of Fibonacci Series Benchmark
Code        | Time of execution (s) | Number of cycles | Number of instructions | Overhead (%)
Fibonacci   | 0.001286844           | 1,856,346        | 48                     | 0
+ TMR       | 0.001419664           | 1,898,231        | 157                    | 10.32
+ Ham       | 0.001359356           | 2,052,954        | 316                    | 5.63
+ TMR + HAM | 0.001424670           | 2,218,110        | 500                    | 10.71
+ BCH       | 0.008445104           | 15,638,019       | 1961                   | 556.26
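The Overhead column of Table 5-3 is consistent with the relative increase in execution time over the unprotected Fibonacci run, that is (t_protected - t_baseline) / t_baseline × 100. The short Python check below recomputes it from the tabulated times; the variable and function names are chosen for this sketch only.

```python
baseline = 0.001286844          # unprotected Fibonacci, seconds

times = {
    "+ TMR": 0.001419664,
    "+ Ham": 0.001359356,
    "+ TMR + HAM": 0.001424670,
    "+ BCH": 0.008445104,
}

def overhead_percent(t, base=baseline):
    """Relative execution-time increase over the unprotected run, in percent."""
    return (t - base) / base * 100.0

overheads = {name: round(overhead_percent(t), 2) for name, t in times.items()}
```

The recomputed values agree with the table to two decimal places (10.32, 5.63, 10.71 and 556.26).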
 
The initial results are very encouraging: code for memory protection has been automatically generated,
and the method supports the high-level languages and CPU architectures supported by LLVM. When using TMR,
Hamming code, and TMR combined with Hamming, the memory of the code is protected with minimal overhead.
This was expected, since these techniques do not require very complicated computations when detecting and
correcting errors.
On the other hand, BCH has shown an enormous overhead, due to the intensive computations of the
encoding/decoding functions. This could be improved by using larger BCH blocks in the encoding/decoding
functions, instead of only 32-bit BCH blocks, in order to have fewer encoding/decoding function calls.
Some compiler optimisations could also be used to make the code run faster by automatically
parallelising function calls and reducing data dependencies, especially in loops, for better use of cache
memory.
5.2 Automatic Compiler Error-Detection & Recovery (ACEDR) 
The LLVM compiler is chosen as the baseline of this research: we created passes that automatically add
protection code to the intermediate representation of any code of choice. The users of our software
protection method do not have to write a single line of protection code; all they have to do is compile
their unprotected code with our passes, and the code will be protected. The passes comprise an analysis
pass and a transformation pass. The analysis pass goes through the code line by line in order to determine
all types of instructions, and provides statistical information about them. The transformation pass uses
the information provided by the analysis to call the appropriate protection technique [10]. In our previous
work [10] we studied the overhead of applying different protection techniques. We started by studying the
implementation of the Hamming code for our automatic compiler error detection and recovery, with the
ability to detect two-bit errors and recover from one-bit errors. We also implemented the BCH code, with
the ability to detect and recover from multiple-bit errors. We have eliminated both Hamming and BCH from
our automatic compiler error detection and recovery because of their large time and memory overheads. In
addition, these codes are most suitable for protecting memory instructions, where memory is accessed with
read and write operations. For CPU instructions such as arithmetic and logic operations, it is more
suitable to use redundancy methods such as NMR to detect and/or recover from errors.
Automatic compiler-implemented TMR [10] has high potential because of its low overhead and its ability
to detect and recover from any single-bit error. In some cases it was also capable of detecting and
recovering from more than one bit error, provided that the multiple errors occur in the same word, or in
two words that belong to two different TMR-ed triples. If two or more errors occur in two different words
of the same TMR-ed triple, the errors can be detected but not recovered, since the TMR cannot tell which
word is the correct one.
5.2.1 Adding ACEDR Instructions in IR 
The creation of redundant instructions is achieved using LLVM compiler passes, allowing protection code
to be added automatically to the code to be protected in its IR format. Our previous work [10] has been
extended to include multiple data types (i32, i32*, i1, i8, i8*, i64, float and double, and float and
double pointers). This extension allows our work to have high coverage compared to the state of the art.
ACEDR-TMR adds two redundant instructions to the original one, and then calls a voter function to decide
the correct outcome among the three instructions, the replicas and the original. In this work we found that
both memory instructions (alloca, load, store and getelementptr) and CPU instructions (arithmetic and logic
operators, etc.) require protection because, unlike previous works, our work does not assume that the
memory system is protected with any type of hardware ECC, allowing us to extend our implementation to
more processing architectures.
A traditional implementation of TMR at hardware level uses three AND gates and two OR gates to correct
every single bit. Our implementation is capable of voting on whole n-bit words at once, instead of using
and and or instructions for every single bit. There are potentially some vulnerabilities when using
software FEC, specifically the vulnerability of the protection code itself. According to [141], this can be
minimised by protecting the protection code, following the flowchart of their self-check routine.
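The contrast between per-bit gate logic and a word-level vote can be made concrete in a few lines of Python. The first function is the classic majority expression built from three ANDs and two ORs, applied to all bit positions of a word at once; the second compares whole words, which is the style of voter described in this chapter. Both function names are hypothetical and the word-level voter is a sketch, not the generated @vote function.

```python
def majority_bitwise(a, b, c):
    """Gate-level majority (3 ANDs, 2 ORs) evaluated on every bit position."""
    return (a & b) | (a & c) | (b & c)

def majority_wordwise(a, b, c):
    """Word-level vote: compare whole words instead of individual bits."""
    if a == b or a == c:
        return a
    return b   # correct when b == c; if all three differ there is no majority

good = 0x12345678
two_flips_same_word = good ^ (1 << 3) ^ (1 << 20)

same_word_bitwise = majority_bitwise(good, two_flips_same_word, good)
same_word_wordwise = majority_wordwise(two_flips_same_word, good, good)

# Errors in two different words of the same triple: no two words agree.
split_bitwise = majority_bitwise(good ^ 1, good ^ 2, good)
split_wordwise = majority_wordwise(good ^ 1, good ^ 2, good)
```

Both voters recover from several flips confined to one word. When two different words of the same triple are hit, the word-level vote has no majority to fall back on, whereas the bitwise expression still recovers in this particular case because the flips sit at different bit positions.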
5.2.1.1 Memory Instructions Protection 
Memory and CPU instruction types are separated from each other, enabling the protection of both types.
The memory instructions are treated differently because there is a dependency between writing to and
reading from a memory location. Our software protection technique detects this dependency and adds the
appropriate protection code accordingly. The implementation modifies the intermediate representation of
the code to be protected: instead of a single memory location, two redundant locations are added. In
Figure 5-13, %i is the original alloca instruction and %pwtc21 and %pwtcx32 are the redundant ones we
created.
 
%i = alloca i32, align 4 // Original Code 
 
%pwtc21 = alloca i32 // ACEDR Code 
%pwtcx32 = alloca i32 
%i = alloca i32, align 4 
Figure 5-13 Protecting the “alloca” Memory Instruction 
Redundant writes to the memory locations created previously are then added: store i32 10, i32* %i is the
original store, and store i32 10, i32* %pwtc21 and store i32 10, i32* %pwtcx32 are the newly created
stores (Figure 5-14).
store i32 10, i32* %i, align 4 // Original Code

store i32 10, i32* %pwtc21 // ACEDR Code
store i32 10, i32* %pwtcx32
store i32 10, i32* %i, align 4
Figure 5-14 Protecting the “store” Memory Instruction 
Every time a load (read) instruction from a protected memory location is detected (the original read is
%2 = load i32* %i, align 4), redundant reads are added (%0 = load i32* %pwtc21 and %1 = load i32*
%pwtcx32) and the outcomes are compared using a voter, %func = call i32 @vote(i32 %2, i32 %1, i32 %0), so
that the final read takes its value only from a correct memory location (Figure 5-15).
%2 = load i32* %i, align 4 // Original Code

%0 = load i32* %pwtc21 // ACEDR Code
%1 = load i32* %pwtcx32
%2 = load i32* %i, align 4
%func = call i32 @vote(i32 %2, i32 %1, i32 %0)
Figure 5-15 Protecting the “load” Memory Instruction 
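The three rewrites of Figures 5-13 to 5-15 can be mimicked on textual IR with a short Python script. This is a toy string rewriter, not the LLVM pass: it only recognises the simple instruction forms shown in the figures, the replica names (suffixes `.r1`, `.r2` and `.voted`) are hypothetical and differ from the names generated by the pass, and it does not redirect later uses of the loaded value to the voted result as a real transformation would.

```python
import re

ALLOCA = re.compile(r"^(%[\w.]+) = alloca (\w+)")
STORE = re.compile(r"^store (\w+) (\S+), (\w+\*) (%[\w.]+)")
LOAD = re.compile(r"^(%[\w.]+) = load (\w+)\* (%[\w.]+)")

def protect(lines):
    """Add two redundant copies of every alloca/store/load and a vote per load."""
    out, replicas = [], {}          # replicas: original pointer -> (copy1, copy2)
    for line in (l.strip() for l in lines):
        if m := ALLOCA.match(line):
            ptr, ty = m.groups()
            replicas[ptr] = (f"{ptr}.r1", f"{ptr}.r2")
            out += [f"{ptr}.r1 = alloca {ty}", f"{ptr}.r2 = alloca {ty}", line]
        elif (m := STORE.match(line)) and m.group(4) in replicas:
            ty, val, pty, ptr = m.groups()
            r1, r2 = replicas[ptr]
            out += [f"store {ty} {val}, {pty} {r1}",
                    f"store {ty} {val}, {pty} {r2}", line]
        elif (m := LOAD.match(line)) and m.group(3) in replicas:
            dst, ty, ptr = m.groups()
            r1, r2 = replicas[ptr]
            out += [f"{dst}.r1 = load {ty}* {r1}",
                    f"{dst}.r2 = load {ty}* {r2}", line,
                    f"{dst}.voted = call {ty} @vote({ty} {dst}, "
                    f"{ty} {dst}.r2, {ty} {dst}.r1)"]
        else:
            out.append(line)        # anything else is copied unchanged
    return out

original = ["%i = alloca i32, align 4",
            "store i32 10, i32* %i, align 4",
            "%2 = load i32* %i, align 4"]
protected = protect(original)
```

Applied to the three original instructions of the figures, the script emits ten lines: three allocations, three stores, three loads and one call to the voter.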
5.2.1.2 CPU Instructions Protection 
The memory instructions have already been protected, meaning they can be used safely by the CPU.
However, the CPU registers are not yet immune to SEEs, meaning they require protection. Using
redundancy, each CPU instruction is replicated and the outcome is obtained by calling the voting
function, as shown in Figure 5-16.
 
%add1 = add nsw i32 %11, 1 // Original Code

%add2 = add nsw i32 %11, 1 // ACEDR Code
%add3 = add nsw i32 %11, 1
%add1 = add nsw i32 %11, 1
call i32 @vote(i32 %add1, i32 %add2, i32 %add3)
 
Figure 5-16 Protecting the “add” CPU Instruction 
5.3 Error Injection 
We provide an original LLVM fault injection tool to validate and measure our software protection methods,
either statically at compile time or dynamically at runtime. Our injector can induce multiple error types,
such as silent data corruptions (SDCs), control flow errors, hangs and crashes. Control flow errors occur
when branch instructions are corrupted, leading to a change in the order in which individual statements
are executed. Hangs occur when loops get corrupted, resulting in wrong iteration counts and sometimes
infinite execution.
We use our tool to inject faults into unprotected and protected code, and make quantitative comparisons
of the error and its associated confidence interval.
Figure 5-17 shows how the error injection works. On the left side, the protected code is injected and
traced, meaning the outcomes of its instructions are logged in files so that they can be compared with the
Golden files. The Golden files contain the correct outcome of each instruction of the code after tracing it
(no injection is performed when producing the Golden outputs). The tracing is done on the code in its
intermediate representation, which can be obtained using the LLVM compiler. A Python script adds
instructions that output the outcome of every instruction of the traced code.
On the right side of Figure 5-17, the injection and tracing of the protected code is done. First, the code
is protected using our software protection techniques, by running the protection pass on the code in its
intermediate representation. Second, a fault is injected into one randomly chosen instruction of the code,
causing it to fail with a certain failure rate that depends on the nature of the injected code. Finally,
the protected and injected code is traced in order to show the outcome of each of its instructions, which
are logged into files for comparison with the golden outputs obtained previously. The comparison shows the
number and types of errors obtained. The types of errors include SDCs, control flow errors, crashes and
hangs. The last comparison is between the error rates of the unprotected injected code and the protected
injected code. This step determines how much coverage our protection has added.
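The overall campaign (golden run, random single-fault injection, comparison, and unprotected versus protected error rates) can be sketched end to end on a toy benchmark. Everything here is an assumption made for illustration: the benchmark is a running sum, faults are single bit flips in the result of one addition, only the first of the three redundant results is ever injected, and the voter itself is never corrupted.

```python
import random

def vote(a, b, c):
    # Majority of three; if all differ, b is returned and the error is not recovered.
    return a if a == b or a == c else b

def run(data, inject_at=None, protected=False, rng=None):
    """Sum `data`; flip one bit in the result of addition number `inject_at`."""
    acc = 0
    for step, x in enumerate(data):
        copies = [acc + x for _ in range(3 if protected else 1)]
        if step == inject_at:
            copies[0] ^= 1 << rng.randrange(32)   # single bit flip in one result
        acc = vote(*copies) if protected else copies[0]
    return acc

def error_rate(data, trials, protected, seed=0):
    """Fraction of injected runs whose output differs from the golden output."""
    rng = random.Random(seed)
    golden = run(data)                            # reference run, no injection
    errors = 0
    for _ in range(trials):
        target = rng.randrange(len(data))         # random injection site
        if run(data, target, protected, rng) != golden:
            errors += 1
    return errors / trials

data = list(range(1, 21))
unprotected_rate = error_rate(data, trials=200, protected=False)
protected_rate = error_rate(data, trials=200, protected=True)
```

In this toy setting every unprotected injection corrupts the sum and every protected injection is outvoted. Real benchmarks mask some faults and expose others to crashes or hangs, so measured rates differ from these idealised values.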
In order to inject faults into the intermediate representation of the code, instructions have been divided
by type, and every type is injected differently. This is achieved by calling a different function for
every instruction type, including i32, i32*, i1, i8, i8*, i64, float and double, and float and double
pointers.
The limitations of our fault injector are that:
• It cannot inject the void type,
• Branches are void. In order to inject these, the decision instruction (e.g. “cmp”) must be injected, and
• Return instructions are void. In order to inject these, the preceding “load” instruction must be injected.
 
 
Figure 5-17 Flowchart of the Error Injection Process 
5.3.1 Injection Experiments of Different Instruction Types
5.3.1.1 Injecting the CPU Registers 
The following code snippet shows how the injection of the CPU registers is performed:

%add1 = add nsw i32 %11, 1 // Original Code
%callfl = call i32 @flip(i32 %add1) // Injection Code
Figure 5-18 Injection of CPU Instruction 
Once the instruction result has been injected using the @flip function, which randomly flips one of its
bits, the following method is called inside the LLVM injection pass: replaceAllUsesWith(callfl). It
replaces the uses of the previous register with the newly injected one.
5.3.1.2 Injecting Memory Instructions 
The process of injecting load instructions is similar to the injection of the CPU registers. The only
difference arises when injecting a store instruction, as depicted in the following snippet:
store i32 %add, i32* %x3, align 4 // Original Code
%call5 = call i32 @flip(i32* %x3) // Injection Code
store i32 %call5, i32* %x3
Figure 5-19  Injection of Store Memory Instruction 
Instead of calling the method responsible for replacing the uses of the store instruction, it is
sufficient to add a new store instruction that stores the injected value at the original memory address,
where the correct value was supposed to be stored.
The flip function takes a value or a pointer to a value, randomly flips one of its bits (one of 32 bits in
the case of an int32), and returns the flipped value, or a pointer to the flipped value if a pointer was
passed to it.
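A Python analogue of the flip helper is shown below for the integer case. It is a sketch of the behaviour described above, not the @flip function linked into the injected code: the parameter names are assumptions, and the dictionary `memory` merely stands in for the memory location %x3 of Figure 5-19.

```python
import random

def flip(value, bits=32, rng=random):
    """Return `value` with one randomly chosen bit (out of `bits`) inverted."""
    position = rng.randrange(bits)
    return (value ^ (1 << position)) & ((1 << bits) - 1)

# Store injection: read the stored value, flip one bit, write it back to the
# same location, as the extra store instruction does in the snippet above.
memory = {"x3": 42}
rng = random.Random(7)             # seeded only to make the example repeatable
memory["x3"] = flip(memory["x3"], rng=rng)
```

Whatever bit is chosen, the injected value differs from the original in exactly one bit position.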
In order to detect the different types of errors that occurred, a Python script (Algorithm 1) compares the
two sets of log files: the golden outputs and the injected logs. An SDC is detected if the output files
have the same length but different contents. Hangs and control flow errors are detected when the injected
log files are larger than the golden files. A crash is detected when the size of the injected output file
is 0.
5.3.2 Injecting Unprotected Code 
In this experiment, we expect high error rates, greater than or equal to 50%, since no protection is added
to the benchmarks. In order to evaluate the error detection and recovery ability of our method, we have
applied it to 9 known benchmarks: Fib, Qsort, SolveCubic, Rad2Deg, Deg2Rad, UQsort, FFT (Fast Fourier
Transform), MM (Matrix Multiplication) and Susan from MediaBench [29]. SolveCubic, Rad2Deg, Deg2Rad and
UQsort were combined into a single benchmark that we call the MATH benchmark. We chose this variety of
benchmarks in order to have different instruction types and coverages, and to evaluate the different error
rates resulting from injecting the different instruction types.
All instructions of each benchmark were injected, grouped by type. At the start, we injected all the
instructions of the unprotected code and checked how many injections resulted in errors.
The unprotected code is highly vulnerable to error injection, since it does not have any protection. All
the benchmarks except Susan have shown error rates of more than 50% (Figure 5-20). Hangs occur only in
the Susan benchmark; we did not record any hangs in the other benchmarks. All of the benchmarks suffered
SDCs and control flow errors, and all suffered crashes except the Fibo benchmark, where only SDCs and
control flow errors occurred.
In Sections 5.3.2.1 to 5.3.2.6, we break down the injection of unprotected code into more detailed
results, where the injection results of every instruction type are shown, since the sensitivities of all
instruction types are needed for the reliability prediction equations of Chapter 4.
 
Yasser Nezzari                                                      Chapter 5, Automatic Error Detection and Recovery in LLVM 
95 
 
Algorithm 1. Error Types Classification
Initialize(Crash_Error_Count = 0)
Initialize(Control_Error_Count = 0)
Initialize(SDC_Error_Count = 0)
Initialize(Hang_Error_Count = 0)
For (Injected_File in Injected_Files)
    If (Injected_File_Size == 0)
        ++Crash_Error_Count
    Else If (Injected_File_Size > Max_File_Size)
        ++Hang_Error_Count
    End
    For (Golden_File in Golden_Files)
        If (Golden_File_Size != Injected_File_Size)
            ++Control_Error_Count
        Else If (Golden_File_Index == Injected_File_Index)
            If (Golden_File_Lines != Injected_File_Lines)
                ++SDC_Error_Count
            End
        End
    End
End
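
The classification in Algorithm 1 can be sketched in Python as follows. This is a minimal illustration, not the thesis's script: it assumes each injected log is paired with exactly one golden log, that the four error classes are mutually exclusive and checked in the order crash, hang, control flow, SDC, and that `max_size` is a user-chosen hang threshold. All function names are hypothetical.

```python
from collections import Counter
from pathlib import Path


def classify_log(golden: bytes, injected: bytes, max_size: int) -> str:
    """Classify one injected run against its golden (fault-free) log."""
    if len(injected) == 0:
        return "crash"            # no output was produced
    if len(injected) > max_size:
        return "hang"             # output grew past the assumed hang threshold
    if len(injected) != len(golden):
        return "control_flow"     # different amount of output
    if injected != golden:
        return "sdc"              # same length, different contents
    return "correct"


def classify_files(pairs, max_size: int) -> Counter:
    """Count outcomes over (golden_path, injected_path) pairs."""
    counts = Counter()
    for golden_path, injected_path in pairs:
        golden = Path(golden_path).read_bytes()
        injected = Path(injected_path).read_bytes()
        counts[classify_log(golden, injected, max_size)] += 1
    return counts
```

Making the classes exclusive means each injection contributes to exactly one counter, so the counts sum to the number of injections.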
5.3.2.1 Injecting “load” Instructions 
The “load” instruction is used to read from memory. In the Fibo benchmark a total error rate of 71.43% was
recorded: SDCs were the most frequent at 50%, followed by control flow errors at 21.43%. No crashes were
recorded.
In the Qsort benchmark all error types were recorded, with a total rate of 47.4%. Crashes were the main
cause at 24.07%, followed by SDCs at 17.6% and control flow errors at 5.56%.

Figure 5-20 No Protection Applied
(Bar chart, 0 to 100%, one bar per benchmark: Fibo, Qsort, Math, Suzan, FFT, MM; series: Correct %, Hang %, Crash %, SDC %, Control %)

The MATH benchmark showed the highest error rate when injecting “load” instructions, with a total of
88.62%. SDCs were the main cause at 51.74%, followed by control flow errors at 36.88%; no crashes were
recorded.
In the FFT benchmark, “load” injection gave a total error rate of 67.52%. Control flow errors were the main
cause at 36.88%, followed by SDCs at 30.64%, with no crashes.
In the MM benchmark the total error rate was 59.55%. Crashes were the most frequent at 26.59%, followed
by SDCs at 17.6% and control flow errors at 15.36%.
5.3.2.2 Injecting “store” Instructions 
The “store” instruction is used to write to memory. Injecting the first benchmark (Fibo) gave a total error
rate of 75%: 50% SDCs and 25% control flow errors. No crashes were encountered.
In the Qsort benchmark all error types were recorded, with a total rate of 66.66%. SDCs were the most
frequent at 38.1%, followed by control flow errors and crashes at 14.29% each.
The MATH benchmark showed the highest error rate in the store injection experiment, reaching 90.32%
without any crashes: 53.39% control flow errors and 36.94% SDCs.
Injecting the FFT benchmark gave a total error rate of 61.29%, split into 42.10% control flow errors and
19.19% SDCs.
Injecting the MM benchmark gave a total error rate of 70.64%. SDCs were the most frequent at 41.13%,
followed by control flow errors at 16.24% and crashes at 13.28%.
5.3.2.3 Injecting “GEP” Instructions 
The GEP (“getelementptr”) instruction computes the address of a sub-element of an aggregate data
structure. It performs address calculation only and does not access memory.
In the GEP injection experiments on the Qsort, MATH, FFT and MM benchmarks we observed only SDCs,
with rates of 32.52%, 52.38%, 44.81% and 36.58% respectively.
5.3.2.4 Injecting Binary Operators “BinOp” Instructions 
Binary operators perform most of the computation in a program. They take two operands of the same type,
execute an operation on them, and produce a single value.
The number of BinOp instructions in the Fibo benchmark is statistically insignificant, so we excluded the
BinOps of this benchmark from injection.
In the Qsort benchmark the total error rate was 50%, covering all error types: crashes were the most frequent
at 25%, while SDCs and control flow errors each accounted for 12.5%.
In the MATH benchmark the total error rate reached 92.73%, the highest recorded for BinOp injection,
because the MATH benchmark contains a large number of BinOp instructions. SDCs were the most frequent
at 61.82%, followed by control flow errors at 30.91%. No crashes were recorded in this experiment.
In the FFT benchmark the total error rate was 61.82%. SDCs were the most frequent at 38.18%, followed
by control flow errors at 23.64%. No crashes occurred.
In the MM benchmark the total error rate was 62.5%, covering all error types: control flow errors were the
most frequent at 27.25%, followed by crashes at 25% and SDCs at 10.25%.
5.3.2.5 Injecting “sext” Instructions 
The “sext” instruction sign-extends a value: it takes a value to cast and a type to cast it to. This type of
instruction appears in the Qsort, Math, FFT and MM benchmarks.
The total error rate in the Qsort benchmark was 50%, divided between 39% crashes and 11% SDCs. No
control flow errors were recorded in this experiment.
Injecting the Math benchmark resulted in total failure, that is, a total error rate of 100%: every injected
“sext” instruction in this benchmark produced a faulty outcome. Crashes were the most frequent at 78.57%,
followed by SDCs at 18.57% and control flow errors at 2.86%.
Injecting the FFT benchmark resulted in a total error rate of 53.43%, divided into 27.14% SDCs and 26.29%
control flow errors.
For the last benchmark, MM, the total error rate was 44.15%, with 33.15% crashes and 11% SDCs.
5.3.2.6 Injecting “icmp” Instructions 
The “icmp” instruction returns a Boolean value, or a vector of Boolean values, based on a comparison of its
two integer, integer vector, pointer, or pointer vector operands.
For the 1st and 3rd benchmarks (Fibo and Math) the number of “icmp” instructions was insignificant, so we
excluded them from the injection experiment.
The total error rate in the Qsort benchmark after injecting “icmp” instructions was 44.44%, made up of
SDCs and control flow errors at 22.22% each.
For the MM benchmark, the total error rate was 53.43%, divided into 44.15% crashes and 9.32% SDCs.
The results of injecting the different instruction types are shown in Figure 5-21 to Figure 5-26.
 
 
Figure 5-21 Injection of the Different Types of Instructions of the Unprotected Code (fib)
(Bar chart; instruction types: LOAD, STORE; series: Control/Flow %, SDC %, Crash %, Correct %)

Figure 5-22 Injection of the Different Types of Instructions of the Unprotected Code (qsrt)
(Bar chart; instruction types: LOAD, GEP, STORE, BINOP, SEXT, CMP; series: Control/Flow %, SDC %, Crash %, Correct %)

Figure 5-23 Injection of the Different Types of Instructions of the Unprotected Code MM
(Bar chart; instruction types: LOAD, GEP, STORE, BINOP, SEXT, CMP; series: Control/Flow %, SDC %, Crash %, Correct %)

Figure 5-24 Injection of all Instructions of Unprotected FFT
(Bar chart; instruction types: LOAD, GEP, STORE, BINOP, SEXT; series: Control/Flow %, SDC %, Crash %, Correct %)

Figure 5-25 Injection of all Instructions of Unprotected math (solvecubic, rad2deg, deg2rad, uqsort)
(Bar chart; instruction types: LOAD, GEP, STORE, BINOP, SEXT; series: Control/Flow %, SDC %, Crash %, Correct %)

Figure 5-26 Injection of all Instructions of the Protected math (solve cubic, rad2deg, deg2rad, uqsort) 
5.3.3 Injecting Protected Code 
Injecting the protected code and comparing the results with those of the injected unprotected code of
Section 5.3.2 quantifies the error coverage that our protection technique provides to the benchmarks. Even
when injecting every single instruction, we observed 0% error for the Fibo and Qsort benchmarks with the
newly developed protection code against the simulated SEEs from our fault injector. We achieved this result
because the protection code replicates all instruction types (CPU register and memory instructions) and all
data and instruction type formats: i32, i32*, i1, i8, i8*, i64, float and double, and float and double pointers.
Our passes call a voter function to decide the correct outcome, which replaces the faulty register value at runtime.
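
The role of the voter can be illustrated with a minimal majority vote over three copies of a value. This is only a sketch of the general TMR idea under the single-fault assumption; the function name `vote` and the choice to raise an exception when all three copies disagree are assumptions of this example, not the behaviour of the voter generated by the passes.

```python
def vote(a, b, c):
    """Return the value held by at least two of the three copies.

    With at most one corrupted copy, the two intact copies always agree,
    so the corrupted value is outvoted and can be replaced by the result.
    """
    if a == b or a == c:
        return a
    if b == c:
        return b
    # More than one copy was corrupted: no majority exists.
    raise ValueError("no majority: all three copies differ")


# One copy corrupted by two bit flips in the same word is still outvoted.
original = 42
corrupted = original ^ 0b1010
recovered = vote(original, corrupted, original)
```

Because the vote compares whole values, any number of flipped bits within a single copy is masked as long as the other two copies remain intact.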
In the third benchmark (Math, comprising SolveCubic, Rad2Deg, Deg2Rad and UQsort), injecting the store
memory instruction and all CPU instructions (“BinOp” and “sext”) gave a 100% correct outcome. However,
we recorded some errors when injecting the “load” and “GEP” memory instructions, as shown in Figure 5-26.
In this benchmark we recorded an error rate of 20.59%, all SDCs, when injecting the “GEP” instructions, and
an error rate of 8.04% when injecting the load instruction. The total error rate of this benchmark (counting all
instruction types) dropped from 87.10% to 3.97%.
We built our fault pass assuming single faults in order to quantify SEEs; however, we expect our protection
code to be able to mitigate multiple bit errors provided they are in the same word. We can also protect
multiple separate variables provided they are TMR’ed using our method. Using our protection technique
directly prevented the crashes and control flow errors that occurred when no protection was added.
(Figure 5-26 chart; instruction types: LOAD, GEP, STORE, BINOP, SEXT; series: Control/Flow %, SDC %, Crash %, Correct %)

Figure 5-27 Injection of all Instructions of the Protected math (solve cubic, rad2deg, deg2rad, uqsort) 
Sections 5.3.3.1, 5.3.3.2 and 5.3.3.3 break the injection of the protected code down into more detailed
results, where the processing architecture is protected only partially: first only the CPU, then only the memory
system, and finally both combined. This shows the importance of protecting all instruction types, including
the memory instructions that have been ignored in most previous work.
5.3.3.1 Protecting CPU Register Types 
Unlike most of the literature, which injects only CPU registers, we took a more holistic and realistic
approach by injecting all instruction types to assess whether our protection technique could reduce the error
rates obtained in the first experiment. In this second injection experiment, only the CPU registers (and not
cache memory) were protected. Figure 5-28 shows that the error rate was reduced slightly in most of the
benchmarks. In the Fibo benchmark, protecting only CPU registers did not reduce the error rate, which stayed
similar to that of the unprotected experiment. The error rates dropped to 9.78%, 26.85%, 33.36%, 28.41% and
33.35% in the Susan, Qsort, Math, FFT and MM benchmarks respectively. This shows that protecting only the
CPU registers is not enough to guarantee good error coverage. The error rate remains high because, in these
benchmarks, CPU register instructions make up only a small part of the code.
 
(Figure 5-27 chart; instruction types: LOAD, GEP, STORE, BINOP, SEXT; series: Control/Flow %, SDC %, Crash %, Correct %)

Figure 5-28 Protect CPU Registers (Binary Operations, Arithmetic & Logic instructions)
(Bar chart, 0 to 100%, one bar per benchmark: Fibo, Qsort, Math, Suzan, FFT, MM; series: Correct %, Hang %, Crash %, SDC %, Control %)
5.3.3.2 Protecting Memory Instruction Types 
In this third experiment, only memory instructions were protected, including the read and write operations
from and to cache memory. In the literature this type of instruction has been ignored because hard-ECC is
assumed. We do not assume any special hardware architecture, which allows us to extend our work to multiple
processing architectures. In this experiment we injected all the instruction types and checked the error
coverage. Protecting only the memory instructions reduced the error rate further than the second experiment,
where only the CPU registers were protected. The error rates dropped to 0%, 4.77%, 9.02%, 15.18%, 12.40%
and 9.62% in the Fibo, Qsort, Susan, Math, FFT and MM benchmarks respectively (Figure 5-29). This
improvement is due to the large portion of the benchmarks occupied by memory instructions, which means
that this type of instruction requires protection.
 
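Figure 5-29 Protect Memory Instructions
(Bar chart, 0 to 100%, one bar per benchmark: Fibo, Qsort, Math, Suzan, FFT, MM; series: Correct %, Hang %, Crash %, SDC %, Control %)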
5.3.3.3 Protecting All Instruction Types 
Injecting the protected code and comparing it with the injected unprotected code quantifies the reliability
that our protection technique provides to the benchmarks. In the fourth experiment all instruction types were
protected, that is, the protection of the CPU registers and of the memory instructions was combined. This
dramatically improved the coverage; in some benchmarks the error rate was reduced to 0%. In the Math
(SolveCubic, Rad2Deg, Deg2Rad and UQsort), FFT and MM benchmarks, the total error rate dropped to
3.97%, 1.23% and 0.71% respectively. For the Susan benchmark, where random errors were injected, the error
rate dropped to 0.83% (Figure 5-30).
The low error rate achieved with our software protection is due to the replication of all instruction types
(CPU register and memory instructions). Combining the protection of both instruction types ensures higher
coverage. The data type formats i32, i32*, i1, i8, i8*, i64, float and double, and float and double pointers are
covered in this work.
We built our fault injection tool considering single faults to assess SEEs; however, we expect our
protection code to detect and recover from multiple bit errors, on the condition that they are in the same word.
We can also protect multiple separate variables, provided they are TMR’ed using our software. Using our
protection technique directly prevented crashes and control flow errors.
 
Figure 5-30 Protect CPU Registers & Memory Instructions 
(Bar chart, 0 to 100%, one bar per benchmark: Fibo, Qsort, Math, Suzan, FFT, MM; series: Correct %, Hang %, Crash %, SDC %, Control %)

5.3.4 ACEDR Time Overhead  
The overhead, that is, the time delay added by our protection techniques, is measured relative to the
execution time when no protection is applied. We first recorded the execution time of the unprotected code,
then protected the code with ACEDR and measured its execution time again. We used the Linux tool perf to
measure the delays and the number of processor cycles [138].
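
The thesis does not state the overhead formula explicitly. A minimal sketch, assuming the overhead is the relative increase in execution time of the protected code over the unprotected code, is given below; the function name and the example timings are hypothetical.

```python
def time_overhead_percent(t_unprotected: float, t_protected: float) -> float:
    """Relative increase in execution time, in percent (assumed definition)."""
    if t_unprotected <= 0:
        raise ValueError("baseline execution time must be positive")
    return (t_protected - t_unprotected) / t_unprotected * 100.0


# Hypothetical timings in seconds, for illustration only.
example = time_overhead_percent(2.0, 2.5)
```

The same function applies to cycle counts instead of seconds, since only the ratio matters.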
Table 5-4 shows the time overheads for the different benchmarks protected with ACEDR.
Table 5-4 ACEDR Time Overhead for the Different Processing Platforms
Benchmark    Intel Core i5-3470 Overhead (%)    Raspberry Pi 3 Overhead (%)
Fibo         6.44                               12.24
Qsort        13.89                              16.66
Math         8.55                               15.28
Susan        2.46                               257.03
FFT          10.58                              13.45
MM           9.39                               11.63
 
The time overhead observed when injecting the code is platform independent (the same results were
obtained on the Raspberry Pi 3 and the Intel Core i5-3470). The overhead depends on the error type, as follows:
• SDCs did not generate any significant overhead (less than 1%),
• Hangs produced an indefinite overhead, since the program is stuck in an infinite loop,
• Control flow errors produced significant overhead, ranging from 0% up to multiple times the original
execution time; in some cases it reached an order of magnitude more than the original time. When the
overhead is infinite, a hang has occurred,
• Crashes mean that the code did not execute, or terminated incorrectly, producing 0% overhead.
The high performance (low time overhead) is due to pipelining, where independent redundant instructions
are executed in parallel. The desktop performs better than the embedded Raspberry Pi 3 (especially when
running the Susan benchmark) because of the large difference in performance between the two platforms.
Compiler optimizations are another important factor and greatly improve performance. These results
demonstrate the portability of ACEDR and the compilation of protected code for different processing
architectures.
5.3.5 Comparing ACEDR & State-Of-The-Art  
In this section we compare our work with the state of the art, namely EDDI [13] and SWIFT [14]. Both of
these schemes provide error detection only, without the ability to recover. In both, all instructions are
duplicated (memory and CPU instructions for EDDI; part of the memory instructions and all CPU instructions
for SWIFT).
Our method offers better performance in terms of the overhead generated. Furthermore, our method is
capable of both error detection and recovery, which is very important in real-time applications, where recovery
enables the system to keep running reliably. In the state of the art, the error rate of the injected baseline
(unprotected) code was predetermined (20% for EDDI and 37% for SWIFT), which could dramatically affect
the results of injecting the protected code. Our work does not predetermine the baseline error rate. The
instructions of every benchmark are injected independently, which results in different baseline error rates
(from 25.06% up to 87.09%). This yields more realistic results when the protected code is injected.
5.4 Reliability Comparison of Injection Experiments to Predictions   
The injection experiments of Section 5.3 showed that the error rate is reduced using purely the software
protection techniques developed. Since we have obtained the sensitivities of the different instruction types, we
can measure the precision of our reliability prediction model from Chapter 4 by comparing the reliabilities
obtained from Equations (4-24) and (4-30). In this scenario we take Equation (4-30) as the ground truth. The
same comparison shows that the protection code did not just reduce the error rate but also improved the total
reliability of the whole system.
The instruction types are divided into two major categories (CPU and memory); however, they can be
subdivided further into the total number of types “l” that exist in the benchmark.
The key remarks are:
• Both the reliability of the theoretical prediction model and the reliability obtained from the injection
experiment show an improvement in reliability when protecting with ACEDR, since the error rate
is reduced to less than 1%,
• The reliability of the prediction is higher than the reliability of the injection experiment at the start,
then the two curves intersect. After the intersection point the reliability of the injection becomes higher
than the reliability of the prediction.
Our reliability prediction model is limited when the value of λp/λc is less than the optimal point, where the
MTTF error can reach high rates. The optimal points give the lowest error rates: 8.9 × 10⁻⁴, 3.2 × 10⁻³,
2.3 × 10⁻³, 0.028, 8.06 × 10⁻³ and 2.88 × 10⁻⁵ for the six benchmarks. The prediction model shows good
precision (error below 1%) when λp/λc is higher than the optimal point, which was expected since the error
rate in the CPU registers is higher than the error rate in the L1 cache. These results can be explained by the
fact that the reliability equations are dominated by the L1 cache and the CPU, since the hit rate of the
first-level cache is always more than 99% in all of the benchmarks. Although the reliability plots do not match
perfectly, our model shows low error in terms of the MTTF. Further analysis of these results is needed in the
future.
5.4.1 Mean Time to Failure (MTTF) 
To determine the accuracy of our reliability prediction models against the results of the injection
experiment, we use the Mean Time To Failure (MTTF) as a key performance metric. By definition, MTTF is
the length of time a device or other product is expected to last in its operating state. MTTF is obtained by
integrating the reliability over time: $MTTF = \int_{0}^{\infty} R(t)\,dt$. The MTTF of the injection
experiment is obtained by integrating Equation (4-30), and the MTTF of the prediction by integrating
Equation (4-24). The accuracy of our predictions is determined by the error between the injected and predicted
MTTFs. The residual error depends on the error rates of the CPU and the first-level cache, λp and λc, because
the hit rate of the first-level cache was at least 98% in all of the tested benchmarks. Figure 5-31 plots the error
as a function of the ratio λp/λc.
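
The MTTF integral and the error between two MTTF values can be approximated numerically as sketched below. Equations (4-24) and (4-30) are not reproduced here, so the sketch uses an exponential reliability R(t) = exp(-λt) purely as an illustrative assumption (for which the exact MTTF is 1/λ); the function names, the truncation point `t_max` and the relative-error definition are likewise assumptions of this example.

```python
import math


def mttf(reliability, t_max: float, steps: int = 100_000) -> float:
    """Approximate the integral of R(t) from 0 to infinity by the
    trapezoidal rule on [0, t_max]; t_max must be large enough for
    R(t_max) to be negligible."""
    dt = t_max / steps
    total = 0.5 * (reliability(0.0) + reliability(t_max))
    for i in range(1, steps):
        total += reliability(i * dt)
    return total * dt


def relative_error(predicted: float, measured: float) -> float:
    """Relative MTTF error, taking the measured value as ground truth."""
    return abs(predicted - measured) / measured


lam = 0.01  # hypothetical failure rate
approx = mttf(lambda t: math.exp(-lam * t), t_max=50.0 / lam)
```

For the exponential case, `approx` can be checked against the closed form 1/λ, which gives a quick test of the integration step before applying it to the full reliability equations.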
 
Figure 5-31 The MTTF Error with Respect to (λp/λc) Ratio 
 
Figure 5-31 includes the MTTF error for the 6 benchmarks. The lowest errors for the benchmarks are
8.9 × 10⁻⁴, 3.2 × 10⁻³, 2.3 × 10⁻³, 0.028, 8.06 × 10⁻³ and 2.88 × 10⁻⁵ respectively; we define these as the
optimal points. After these points the error as a function of λp/λc becomes linear and constant, at 0.030, 0.042,
0.076, 0.4, 0.7 and 1.6 respectively. Multiple parameters can affect the error and the optimal points of the
MTTF, such as the number of instructions in the benchmarks and their types; the initial error rate (the error
rate without protection) also changes the MTTF dramatically. Other parameters that affect Figure 5-31 are the
hit/miss ratio of the different cache levels and the operating system calls in the benchmarks. The Susan
benchmark showed the highest MTTF error because of the random nature of the fault injection conducted in
this benchmark, where random samples of instructions were injected.
The precision of our prediction model depends mainly on the error rates of the CPU registers and the caches;
the model can reach a very high level of accuracy if the ratio λp/λc corresponds to the optimal point.
The minimal value of the error is different for each benchmark tested, because of the different instruction
types and their proportions in each benchmark. This means that benchmarks with exactly the same proportions
of instructions are expected to have the same minimal value, up to marginal errors.
Errors corresponding to ratios below the minimal points are high, and predictions there are considered
inaccurate. We therefore recommend using our method for future predictions and models only when the ratio
is greater than these minima.
5.5 Summary 
The flexibility with which software error detection and recovery can be implemented makes these
techniques attractive for overcoming the problem of SEUs. Designers have multiple options when software
FEC is applied to COTS devices, which provides an order of magnitude improvement in performance compared
to hard redundancy.
The findings of this chapter are as follows:
• The newly developed protection code has low overhead compared to the state of the art: less than
15% on the Intel Core i5-3470 and less than 17% on the Raspberry Pi 3, with the exception of the
Susan benchmark on the Raspberry Pi 3 (Table 5-4),
• A reliability prediction model has been developed that predicts the reliability of all the components
of the processing architecture and quantifies the reliability added by the software protection code,
• The protection code achieves high error detection and recovery: error rates can be reduced to less
than 1% in some benchmarks,
• Multiple data types (i32, i32*, i1, i8, i8*, i64, float and double, float pointers and double pointers)
have been protected by extending [11], in addition to both the CPU and the memory read/write
instruction types,
• The reliability predictions have been compared with the reliability obtained from the injection
experiment on the protected code.
This work automatically implements protection code for C and C++ programs. This is enabled by a
compiler pass that generates the protection code, including the additional memory instructions and the function
calls to the TMR and FEC encoder/decoder functions. The FEC can be applied by the linker as an additional
step, or directly by our pass. Different benchmarks have been tested, and our implementation is able to protect
multiple data types. We conclude that any high-level language supported by LLVM could be protected by our
pass, and that the output of the pass matches what was predicted at the LLVM IR, assembly and executable
levels. The executable code of the benchmarks was disassembled and checked using the GDB tool.
The primary results show good performance for TMR and single error correction codes compared to the
state-of-the-art CDTP [119] technique, which protects memory using the GCC compiler with an overhead of
86% to 146%. This is encouraging for implementing CPU protection side by side with the memory protection
already implemented, with an overhead predicted to be within the norms of the state-of-the-art EDDI [13],
which protects both memory and CPU with an overhead of 62%. Furthermore, this work will be extended to
multicore CPU protection. More powerful memory protection algorithms such as BCH should be optimised to
keep the overhead minimal and improve overall performance.
The move to a multicore solution will be achieved through automatic code generation of calls to parallelism
libraries, starting with the pthread library and extending the work towards OpenMP.
Bit-flips originating from SEUs are becoming a prominent problem in processor architectures. It is crucial
for designers of both mainstream and embedded or critical processing systems to ensure the reliability of their
systems. Redundant hardware that makes use of hard-ECC and hard-TMR increases the design complexity of
terrestrial applications, often eliminating it as an option. Soft error detection and recovery methods are viable
alternatives because of their high coverage and low overhead; they allow the best trade-off between reliability
and performance and give engineers flexible ways of protecting their processing architectures.
New automatic compiler error detection and recovery (ACEDR) techniques have been implemented at
compiler level in this research. We implemented a new error injection tool and used it in experiments on
different benchmarks to test the reliability of our software protection techniques. We injected all instructions
of the chosen benchmarks, dividing them into two main categories: memory instructions, which perform
read/write operations from/to the cache memory of the processing architecture, and CPU instructions, which
include logic and arithmetic operations. The injection is performed on the instruction’s data, which corrupts
the instruction’s result. To quantify our results, we injected both the protected and the unprotected code and
compared the outcomes. We have shown that CPU registers and their data or instructions can be fully protected
against the bit-flips caused by the fault injection experiment simulating SEUs. For both the first and second
benchmarks, we demonstrated that all instructions can be fully protected, with almost 100% error coverage.
When injecting errors into the code, we greatly reduced the error rates in the benchmarks: from 73.08% to 0%
for Fibo, from 46.16% to 0% for Qsort, from 87.41% to 4.45% for Math, and from 25.03% to 0.83% for Susan.
The error rate is defined as the number of injections that caused an error divided by the total number of
injections.
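
The error rate definition above, together with the relative reduction it implies, can be written as a short sketch. The relative reduction is a derived quantity introduced here for illustration and is not a metric reported in the thesis; the function names and the injection counts in the example are hypothetical.

```python
def error_rate_percent(n_errors: int, n_injections: int) -> float:
    """Error rate = injections that caused an error / total injections, in percent."""
    if n_injections <= 0:
        raise ValueError("at least one injection is required")
    return n_errors / n_injections * 100.0


def relative_reduction_percent(rate_before: float, rate_after: float) -> float:
    """Share of the unprotected error rate that is removed by protection."""
    if rate_before <= 0:
        raise ValueError("baseline error rate must be positive")
    return (rate_before - rate_after) / rate_before * 100.0


# Hypothetical counts: 1 erroneous outcome out of 4 injections.
rate = error_rate_percent(1, 4)
# Using the Susan figures quoted above (25.03% before, 0.83% after).
susan_reduction = relative_reduction_percent(25.03, 0.83)
```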
In the 3rd benchmark (Math, comprising SolveCubic, Rad2Deg, Deg2Rad and UQsort), the CPU
instructions were fully protected. Even though this benchmark was not fully covered, all crashes and control
flow errors were eliminated using ACEDR. The high error detection and recovery is due to the replication of
multiple data types, including i32, i32*, i1, i8, i8*, i64, float and double, and float and double pointers. The
ACEDR protection mitigates different error types (crashes, SDCs, control flow errors) with low time overhead
on multi-core processors across multiple platforms, which enables the portability of our protection technique
to multiple architectures (those supported by LLVM). The delays measured were not highly significant because
of the pipelining of independent redundant instructions, which makes use of the abundant resources of the CPU
architectures without causing a bottleneck. Abundant resources refers to the clock frequency and the size of
the RAM and caches.
This work has some limitations. It only protects the memory and CPU registers accessed by instructions (not the
instructions themselves), meaning that it does not consider bit flips that would transform an instruction into
a different instruction or into a jump or conditional jump. It also does not consider functions that return
pointers, since voting is based on the binary consistency of function outputs.
In this research, the injection experiments confirm the reliability predictions: for both benchmarks,
the curves show that the reliability of the protected code is higher than that of the unprotected code.
The precision of our prediction model depends on the values of the initial variables (the error rates of the
CPU registers and the caches); the model can reach a very high level of accuracy if the ratio λp/λc
corresponds to the optimal point. The model is limited when λp/λc is below the optimal point,
where the MTTF error can become large.
Despite the high performance of the protection code, which uses the LLVM compiler to protect at instruction
level, further optimizations remain possible and would enable even higher performance. Because the error rate
in space changes with the location of the spacecraft in orbit, an adaptive system has the potential to improve
performance: it could activate high-performance modes when the error rate is low and high-reliability modes
when the error rate is high. The adaptive system is the subject of the next chapter of this thesis.
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
Yasser Nezzari                                                                                      Chapter 6, Tunable Multicore Protection 
112 
 
6. TUNABLE MULTICORE PROTECTION  
A study of the state of the art in COTS reliability enhancement using software protection techniques has
shown that most of these techniques are inefficient, because of the delays they add and their lack of error
recovery. A further downside is their focus on protecting only the CPU registers while ignoring the memory
system, which the previous chapter showed to have a crucial impact on reliability.
System designers usually assume the worst-case scenario and provision for recovery from the worst-case SEU
rates. This is not very efficient, because the worst case only occurs when the satellite is near the SAA or the
polar regions, or when solar winds occur. The assumption is very costly in terms of overhead, since the
software protection techniques can add large delays.
Chapter 5 implemented a novel software protection technique called ACEDR to enhance the reliability of the
processing architecture, and concluded that there is a need for an efficient system that can adapt its error
detection and recovery and give designers both options: performance and reliability. The answer to this
research question is the implementation of an adaptive system. The adaptive system switches the operating mode
of the processor, so that it operates in high-reliability modes when the SEU rate is very high and switches to
less reliable but higher-performing modes when the SEU rate is low.
Four modes are proposed in this research. The unprotected mode is expected to have the highest
performance, since no protection code is applied; it can be switched on when the error rate is at its
lowest level. The second mode is the ACEDR [12] technique from Chapter 5, named ITMR in this
chapter, whose triplicated instructions allow error detection and recovery without significantly
increasing the overhead. The third mode, TTMR, uses multithreading to run redundant threads on different
processing cores and compares their results at the end for error detection and recovery. The last mode is
the combination of ITMR and TTMR, which is expected to provide high reliability at a performance cost. Our
approach for TTMR does not cover the case where threads require interpretation of pointers or global variables.
All of these modes will be tested rigorously, via error injection tests, for their ability to detect and
correct errors. Another decisive metric is the overhead added by each protection mode. Based on these
metrics, only the modes showing high reliability and low overhead will be implemented in the
adaptive system.
A reliability prediction model has also been included, with equations that estimate the reliability of the
whole processing architecture for each of the modes used. The model is based on a Markov chain, which is then
extended to obtain the reliability of the whole processing architecture, including the CPU, RAM and caches.
The model includes both internal and external parameters: internal parameters are specific to the processor
used, while external ones depend on the environment in which the processor is being used.
The last part of this chapter is the injection experiment, where the reliability of the protection modes will
be tested; adding the different protection techniques should reduce the error rate to a degree that depends on
the strength of the error detection and recovery used. The results of the injection experiments will also be
used to verify the precision of the prediction model.
6.1 Adaptive Multicore Platform Concept 
We are motivated to implement adaptive multicore CPU protection by the varying nature of SEUs
in orbit, where the error rate increases near the poles and near the South Atlantic Anomaly (SAA).
The adaptive implementation is intended to provide a reliable system without sacrificing performance. In this
work, the independent variables are the protection mode and the duration for which a particular mode is
active. The dependent variables are reliability and performance.
A trade-off between reliability and performance can be achieved by controlling the protection mode and
its duration before switching to another mode. The switching takes place in a dynamic environment where the
error rate changes with time because of the variation in radiation flux in orbit.
If the operation mode can be switched in real time, more redundancy, and hence more reliability, can be
applied to the system when the error rate is high; when the error rate drops, a higher-performance
mode (one with less powerful protection techniques or none at all) is switched on. The switching between
modes is automated and depends on the error rate.
6.1.1 Implementation of Adaptive Multicore Protection 
The adaptive software works in four different modes. The first is the simple mode, in which no protection
is applied and the software runs at its highest performance in terms of execution time (there is no overhead,
since no protection code is added). The second mode is ITMR, which uses compiler passes to add protection code
automatically in the form of redundant instructions. The third mode is TTMR, which runs functions in three
independent threads and compares the outcome of every function to decide the correct one. The fourth
operation mode combines the protection of ITMR with that of TTMR.
The running mode alternates periodically depending on the error rate (Figure 6-1). During T1 the error
rate is minimal and the unprotected mode is sufficient, so the code runs at its best performance. During T2
the error rate reaches high levels, requiring the activation of the combined TTMR & ITMR mode; this adds
a benchmark-dependent overhead, but the reliability is at its highest level during this mode. During T3 the
error rate decreases, so the ITMR mode is switched on, with lower overhead than the combined protection mode.
During T4 the error rate has dropped again, so the unprotected mode can be switched on for higher performance.
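The alternation between modes can be sketched as a threshold rule applied to a periodic error-rate profile. This is only an illustration: the cosine profile, the 90-minute period, the normalised rate in [0, 1] and both thresholds are assumptions made for the example and are not values taken from this work.

```python
import math
from enum import Enum


class Mode(Enum):
    UNPROTECTED = "unprotected"
    ITMR = "ITMR"
    COMBINED = "TTMR+ITMR"


# Hypothetical thresholds on a normalised error rate in [0, 1].
LOW_THRESHOLD = 0.3
HIGH_THRESHOLD = 0.7


def error_rate_at(t_min, period_min=90.0):
    """Toy periodic error-rate profile in [0, 1]; not a radiation model."""
    return 0.5 * (1.0 - math.cos(2.0 * math.pi * t_min / period_min))


def select_mode(rate):
    """Pick the protection mode for the current normalised error rate."""
    if rate >= HIGH_THRESHOLD:
        return Mode.COMBINED
    if rate >= LOW_THRESHOLD:
        return Mode.ITMR
    return Mode.UNPROTECTED


# Sample one period every 10 minutes and print the selected mode.
for t in range(0, 91, 10):
    rate = error_rate_at(t)
    print(f"t={t:3d} min  rate={rate:.2f}  mode={select_mode(rate).value}")
```

A real controller would take the rate from measured error counts rather than from a fixed profile, and would probably add hysteresis so that the mode does not oscillate around a threshold.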
 
Figure 6-1 Alternating Protection Mode in Real Time 
At the start, a periodic function of time is created to represent the change of the error rate in orbit.
Depending on the value of this error function, the different protection modes are called and
executed. Calling the TTMR protection mode creates three threads and runs them. If the function running
in the threads returns values, the values are compared to determine the correct ones; if the function is
a continuous one (runs indefinitely), the compare function is called periodically in a separate thread in
order to detect and correct errors. The threads are created using the pthread library from C++, in a manner
that avoids data races while the shared variables are updated to detect and recover from errors. The whole
process is shown in the flowchart in Figure 6-2.
6.1.2 Adaptive Protection Modes 
6.1.2.1 Instructions-TMR Mode 
The LLVM compiler is the basis of this protection mode: LLVM analysis and transformation passes have been
created in order to add protection code automatically to the intermediate representation of any code written
in a language supported by LLVM [12]. The users of our software protection method do not have to write a
single line of protection code; all they have to do is compile their unprotected code with our passes, and the
code will be protected. There are two passes, an analysis pass and a transformation pass. The analysis
pass goes through the code line by line in order to determine the types of all instructions and provides
statistical information about them. The transformation pass uses the information provided by the analysis to
call the appropriate protection technique.
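The kind of statistics gathered by the analysis pass can be illustrated on a toy piece of textual LLVM IR. The sketch below is not an LLVM pass: it is plain text scanning in Python, and both the IR fragment and the function name count_opcodes are hypothetical. A real pass would use the LLVM pass infrastructure instead.

```python
from collections import Counter

# Hypothetical IR fragment, written in the typed-pointer syntax (i32*).
TOY_IR = """
define i32 @accumulate(i32* %p, i32 %b) {
entry:
  %a = load i32, i32* %p        ; read from memory
  %sum = add i32 %a, %b
  %wide = sext i32 %sum to i64
  store i32 %sum, i32* %p
  ret i32 %sum
}
"""


def count_opcodes(ir_text):
    """Count instruction opcodes in a textual IR fragment, line by line."""
    counts = Counter()
    for raw in ir_text.splitlines():
        line = raw.split(";", 1)[0].strip()  # drop trailing comments
        if not line or line.endswith(":"):
            continue  # blank line or basic-block label
        if line.startswith(("define", "declare", "{", "}")):
            continue  # function header or closing brace
        if " = " in line:
            line = line.split(" = ", 1)[1]  # skip the result name
        counts[line.split()[0]] += 1
    return counts


stats = count_opcodes(TOY_IR)
for opcode, n in sorted(stats.items()):
    print(f"{opcode:6s} {n}")
```

Per-opcode counts of this kind are what a transformation step would need in order to decide which protection routine to apply to each instruction type.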
6.1.2.2 Threads-TMR Mode 
This technique uses multithreading to run code on a multicore platform, where three redundant
threads run on three different processing cores. At the end of their execution, the outcomes of the
threads are compared for error detection and recovery.
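The thread-level voting can be sketched as follows. This is a simplified illustration in Python rather than the pthread-based implementation described in this chapter: the helper names run_ttmr and majority_vote are hypothetical, results are assumed to be hashable values, and CPython threads do not guarantee execution on three distinct cores.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def majority_vote(results):
    """Return the value produced by at least two of the three replicas."""
    value, votes = Counter(results).most_common(1)[0]
    if votes < 2:
        raise RuntimeError("no majority: all replicas disagree")
    return value


def run_ttmr(func, *args):
    """Run func three times in separate threads and vote on the results."""
    with ThreadPoolExecutor(max_workers=3) as pool:
        futures = [pool.submit(func, *args) for _ in range(3)]
        results = [f.result() for f in futures]
    return majority_vote(results)


def fib(n):
    """Iterative Fibonacci, used here as a stand-in workload."""
    a, b = 0, 1
    for _ in range(n):
        a, b = b, a + b
    return a


print(run_ttmr(fib, 10))
```

With three replicas, a single corrupted result is outvoted by the other two; if all three disagree the error is detected but cannot be corrected by voting alone.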
6.1.2.3 Combined Protection Mode 
TTMR creates three parallel threads. The threads execute functions whose instructions have been triplicated
using the ITMR protection technique explained previously. The functions either execute a finite number
of times and then update or return values, or run indefinitely and keep updating their variables.
The TTMR threads will check if there are any variable updates from previous executions after their
creation; if there are, those values are used in the threads, otherwise the threads use the initial values of
these variables (Figure 6-3).
The function inside the threads is then checked. If it is infinite, the threads are paused every n
seconds and wait for the voter to detect and correct their values. There is also a periodic check for thread
timeouts: if there is no timeout, the thread variables are updated and the execution of the threads
continues; if there is a timeout, the threads are restarted using their last correct values.
If the function inside the threads executes a finite number of times, the threads are joined and, provided
there is no timeout, their values are returned for error detection and correction. If a timeout occurred, the
threads are terminated and restarted using their last updated variables, if they exist (Figure 6-2).
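The timeout-and-restart behaviour for finite functions can be sketched as below. This is an illustration under stated assumptions, not the implementation used in this work: protected_step, the timeout value and the retry limit are hypothetical, and because Python threads cannot be forcibly terminated the sketch abandons a timed-out pool instead of killing its threads.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout


def vote(values):
    """Majority vote over three replica results (hashable values assumed)."""
    value, votes = Counter(values).most_common(1)[0]
    if votes < 2:
        raise RuntimeError("no majority among replicas")
    return value


def protected_step(step, state, timeout_s=1.0, max_retries=2):
    """Run step(state) on three threads and return the voted new state.

    If any replica exceeds timeout_s, all replicas are abandoned and the
    step is restarted from `state`, the last value known to be correct.
    """
    for _ in range(max_retries + 1):
        pool = ThreadPoolExecutor(max_workers=3)
        futures = [pool.submit(step, state) for _ in range(3)]
        try:
            results = [f.result(timeout=timeout_s) for f in futures]
        except FutureTimeout:
            # Timeout: drop this attempt and retry from the last good state.
            pool.shutdown(wait=False, cancel_futures=True)
            continue
        pool.shutdown(wait=True)
        return vote(results)
    raise RuntimeError("replicas timed out on every attempt")


# Usage: advance a counter through five protected steps.
state = 0
for _ in range(5):
    state = protected_step(lambda x: x + 1, state)
print(state)
```

Keeping the last voted state outside the threads is what allows the restart: a hang in one attempt never overwrites the value that the next attempt starts from.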
 
Figure 6-2 Adaptive Protection Flowchart 
 
 
Figure 6-3 TTMR Mode of Operation 
6.2 Injection Experiments 
6.2.1 Injecting the Unprotected Code 
In this section, the error injector is used to test the reliability of the unprotected code. The injection
targets all of the instructions of the chosen benchmarks (Fibo, Qsort, Roots and Math) [153].
One injection is performed per run, in which a random bit flip is inflicted on the selected instruction. In
order to validate the reliability prediction model, all the instructions of the chosen benchmarks are injected.
For benchmarks with a very large instruction population, injections are performed on samples with a
95% confidence level and a 0.1% margin of error, according to [142].
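To give a sense of the sample sizes involved, the sketch below uses the finite-population sample-size formula that is common in statistical fault injection. Whether [142] uses exactly this formula is an assumption, as are the function name and the default p = 0.5 (the most conservative choice); t = 1.96 is the normal quantile for a 95% confidence level.

```python
import math


def sample_size(population, margin=0.001, t=1.96, p=0.5):
    """Number of injections needed for the given margin of error.

    population: total number of possible injection targets (N)
    margin:     margin of error as a fraction (0.001 = 0.1%)
    t:          normal quantile for the confidence level (1.96 for 95%)
    p:          assumed error probability (0.5 maximises the sample size)
    """
    n = population / (1.0 + margin ** 2 * (population - 1)
                      / (t ** 2 * p * (1.0 - p)))
    return min(population, math.ceil(n))


for n_targets in (1_000, 100_000, 10_000_000):
    print(n_targets, sample_size(n_targets))
```

With a margin as tight as 0.1%, small populations end up being injected almost exhaustively, which is consistent with injecting every instruction in the smaller benchmarks and sampling only the very large ones.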
The injection provokes different error types, including SDC, control flow errors, hangs and crashes. The
results can be seen in Figure 6-4, Figure 6-5, Figure 6-6 and Figure 6-7.
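The outcome of each injected run has to be assigned to one of these error types. The sketch below shows one possible classification rule; it is an assumption made for illustration and not the classifier used in these experiments. In particular, control flow errors are detected here by comparing an optional trace of executed basic blocks against a fault-free (golden) trace, which is a hypothetical mechanism.

```python
def classify_run(timed_out, exit_code, output, golden_output,
                 trace=None, golden_trace=None):
    """Label one injected run by comparing it with the fault-free run."""
    if timed_out:
        return "hang"            # run never finished within the time limit
    if exit_code != 0:
        return "crash"           # abnormal termination
    if trace is not None and golden_trace is not None \
            and trace != golden_trace:
        return "control_flow"    # different sequence of basic blocks
    if output != golden_output:
        return "sdc"             # silent data corruption: wrong result
    return "correct"


print(classify_run(False, 0, "55", "55"))
print(classify_run(False, 0, "54", "55"))
print(classify_run(False, 139, "", "55"))
```

The order of the checks matters: a hang or crash produces no trustworthy output, so the output comparison is only meaningful for runs that terminate normally.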
All of the unprotected benchmarks are highly vulnerable to error injection, since they do not
have any protection. Injection causes all types of errors, at rates that can exceed 50%. The overall error
rates for the unprotected benchmarks are 73.07%, 50.25%, 87.09% and 74.91% for Fibo, Qsort, Math and Roots
respectively. These can be broken down by instruction type:
6.2.1.1 Load Instructions 
The Load instruction is used to read from memory; its argument is a memory address. Injecting this instruction
corrupts the data loaded from memory, leading to different error types in all of the benchmarks, since
the error propagates through the code.
SDC errors have been recorded in all of the benchmarks, with rates of 68.50%, 51.74%, 50% and 17% for the
Roots, Math, Fibo and Qsort benchmarks respectively. Control flow errors have been recorded in all of the
benchmarks except Roots, with rates of 36.88%, 21.43% and 5.56% for the Math, Fibo and Qsort benchmarks
respectively. Crashes were only recorded in the Qsort benchmark, with a rate of 24.07%.
6.2.1.2 Store Instructions 
The Store instruction is used to write to memory, using a memory address as an argument. Injecting a store
instruction corrupts the data stored in memory, and this corruption can propagate, leading to different
types of error.
SDC errors were recorded in all of the benchmarks, with rates of 59.09%, 50%, 38.10% and 36.94%
for the Roots, Fibo, Qsort and Math benchmarks respectively. Control flow errors have also been recorded in all
of the benchmarks, with rates of 53.39%, 25%, 14.29% and 9.09% for the Math, Fibo, Qsort and Roots
benchmarks respectively. Crashes have been recorded only in the Qsort benchmark, with a rate of 14.29%.
6.2.1.3 Binary Operation Instructions 
Binary operations are used for computations and take operands of the same type. Injecting binary operations
corrupts the result of the operation, leading to multiple error types.
SDC errors were recorded in Roots, Math and Qsort, with rates of 88.46%, 61.82% and 12.50% respectively.
Control flow errors have been observed in Math, Qsort and Roots, with rates of 30.91%, 12.50% and 7.69%
respectively. Crashes have been observed in the Qsort benchmark, with a rate of 25%.
6.2.1.4  “GEP” Instructions 
The GEP or “getelementptr” instruction is used to get the address of a sub-element of an aggregate data
structure. It performs address calculation only and does not access memory. Injecting GEP corrupts the
memory addresses of arrays and other data containers (maps, lists, etc.).
This instruction type is found only in the Qsort and Math benchmarks. The SDC error rates were 52.38%
and 32.52% in Math and Qsort respectively. No other error types have been recorded.
6.2.1.5 “sext” Instructions 
The “sext” (sign extend) instruction takes a value to cast and a type to cast it to. Injecting this
instruction corrupts the conversion operations and, since the errors propagate through the code, different
error types occur.
This instruction is only found in the Math benchmark, where injecting it caused 78.57% crashes,
18.57% SDC and 2.86% control flow errors.
 
 
Figure 6-4 Injection of the Different Types of Instructions of the Unprotected Code (fib)
[Bar chart. Instruction types: Load, Store. Series: Control Flow %, SDC %, Crash %, Correct %.]

Figure 6-5 Injection of the Different Types of Instructions of the Unprotected Code (qsrt)
[Bar chart. Instruction types: Load, GEP, Store, BinOp. Series: Control Flow %, SDC %, Crash %, Correct %.]
 
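Figure 6-6 Injection of all Instructions of Unprotected math (solvecubic, rad2deg, deg2rad, uqsort)
[Bar chart. Instruction types: Load, GEP, Store, BinOp, Sext. Series: Control Flow %, SDC %, Crash %, Correct %.]

Figure 6-7 Injection of the Different Types of Instructions of the Unprotected Code (Roots)
[Bar chart. Instruction types: Load, Store, BinOp. Series: Control Flow %, SDC %, Crash %, Correct %.]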
6.2.2 Injecting the Protected Code 
6.2.2.1 Injecting the Protected Code with Instruction-TMR 
The results of injecting the code protected with ITMR are shown in Chapter 5, since ITMR is identical to
the ACEDR protection technique.
6.2.2.2 Injecting the Protected Code with Thread-TMR 
In this section, the code is protected using TTMR: every function runs on three redundant
threads and, at the end of their execution, their results are compared for error detection and recovery. The
protected code is injected once per run, and every instruction of the benchmarks is injected, one at a time.
The injections are performed on all three threads; giving different names to the functions that the threads
execute makes it possible to inject each of them separately. The threads run on a multicore processor, which
means that each of the cores is subjected to injection.
The protection provided by TTMR has improved reliability by reducing the error rate. However, the
error rate is still high compared to that of the ITMR technique used previously [12], which resulted in lower
error rates. The results of injecting TTMR are shown in Figure 6-8, Figure 6-9, Figure 6-10 and Figure 6-11.
The results of injecting the different instruction types are as follows:
Load instructions
SDC errors have been recorded in all of the benchmarks, with rates of 22.64%, 7.32%, 2.99% and 2.55%
for the Math, Fibo, Roots and Qsort benchmarks respectively. Control flow errors have been recorded in the
Fibo and Roots benchmarks, with rates of 2.44% and 1.49% respectively. Crashes were only recorded in the Qsort
benchmark, with a rate of 13.88%.
Store instructions
SDC errors occurred in the Math and Qsort benchmarks, with rates of 15.94% and 2.78% respectively. Control
flow errors have also been recorded in two benchmarks, Fibo and Roots, with rates of 6.98% and 2.99%
respectively. Crashes have been recorded only in the Qsort benchmark, with a rate of 16.67%.
Binary Operation instructions
SDC errors were recorded in Math and Fibo, with rates of 10% and 4.55% respectively. Control flow errors have
been observed in Roots, with a rate of 2.7%. Crashes have been observed in the Qsort benchmark, with a rate of
27.27%.
“GEP” instructions
This instruction type is found only in the Qsort and Math benchmarks. The SDC error rates were 3.57%
and 3.23% in Math and Qsort respectively. No other error types have been recorded.
“sext” instructions
This instruction is only found in the Math benchmark, where injecting it caused 3.57% SDC.
 
 
Figure 6-8 Injection of the Different Types of Instructions of the Protected Code Using TTMR (fib)
[Bar chart. Instruction types: Load, Store. Series: Control Flow %, SDC %, Crash %, Correct %.]
 
Figure 6-9 Injection of the Different Types of Instructions of the Protected Code Using TTMR (qsrt)
[Bar chart. Instruction types: Load, GEP, Store, BinOp, Cmp, Sext. Series: Control Flow %, SDC %, Crash %, Correct %.]

Figure 6-10 Injection of the Different Types of Instructions of the Protected Code Using TTMR (Math)
[Bar chart. Instruction types: Load, GEP, Store, BinOp. Series: Control Flow %, SDC %, Crash %, Correct %.]
 
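Figure 6-11 Injection of the Different Types of Instructions of the Protected Code Using TTMR (Roots)
[Bar chart. Instruction types: Load, Store, BinOp. Series: Control Flow %, SDC %, Crash %, Correct %.]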
6.2.2.3 Injecting Protected Code using the combined TTMR and ITMR  
Next, we combine the two protection techniques, TTMR and ITMR, to confirm further reliability
improvements in the benchmarks. TTMR, as explained in Section 6.2.2.2, creates redundant
independent threads and runs functions in them concurrently; at the end of their execution, the thread results
are compared using a voter to decide the correct outcome. ITMR triplicates instructions. The two protection
techniques are combined by applying ITMR first and then applying TTMR.
After the benchmarks were protected with the combined techniques, we performed fault injection experiments
on them in which all the instructions of the benchmarks were injected, in order to quantify the reliability
provided by the combined techniques. The fault injector goes through the code line by line, which enables it
to inject all of the threads, once for every run. Our injector is capable of injecting the following data and
instruction types: i32, i32*, i1, i8, i8*, i64, float, double, and float and double pointers. After the
instruction to be injected is chosen, the injector function randomly selects one of its bits and flips it
(zero becomes one, and one becomes zero). Flipping the instruction’s bit introduces different types of errors,
including SDC, control flow errors, hangs and crashes. The error rate is the proportion of injections that
caused one of these error types. In order to validate the reliability prediction model, all the instructions
of the chosen benchmarks are injected. For benchmarks with a very large instruction population,
injections are performed on samples with a 95% confidence level and a 0.1% margin of error [142].
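The single-bit flip performed by the injector can be illustrated on plain values. This is a minimal sketch and not the injector used in this work, which operates on instructions in LLVM; the function names are hypothetical, integers are treated as fixed-width unsigned words, and floats are handled as 32-bit IEEE 754 values through their bit pattern.

```python
import random
import struct


def flip_bit(value, bit, width=32):
    """Flip one bit of an unsigned integer of the given width."""
    if not 0 <= bit < width:
        raise ValueError("bit index out of range")
    return (value ^ (1 << bit)) & ((1 << width) - 1)


def flip_float_bit(value, bit):
    """Flip one bit in the 32-bit IEEE 754 encoding of a float."""
    (raw,) = struct.unpack("<I", struct.pack("<f", value))
    (flipped,) = struct.unpack("<f", struct.pack("<I", flip_bit(raw, bit)))
    return flipped


def inject_random(value, width=32, rng=random):
    """Flip a randomly chosen bit; return the new value and the bit index."""
    bit = rng.randrange(width)
    return flip_bit(value, bit, width), bit


corrupted, bit = inject_random(1234, rng=random.Random(0))
print(f"bit {bit} flipped: 1234 -> {corrupted}")
print(flip_float_bit(1.0, 31))  # flipping the sign bit of 1.0
```

Flipping the same bit twice restores the original value, which is a convenient sanity check for an injector of this kind.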
Results of injecting the combined protection technique 
In this section, we show the results of fault injection on the code protected using ITMR and TTMR
consecutively. We used the same four benchmarks (Fibo, Qsort, Roots and Math), which have different
instructions and data types, in order to show the efficiency of the new protection technique.
Overall, the error rate has been reduced in all of the benchmarks. For both the Roots and Fibo benchmarks,
the error rate dropped to 0%; for the Qsort and Math benchmarks the error rates were 0.14% and
0.60% respectively (Table 6-1).
The combination of both protection methods has improved reliability by reducing the error rate to less
than 1%. This is due to the high redundancy of the code: not only have the threads been replicated, but
the instructions inside the threads have been replicated as well. Table 6-1 shows the error rates obtained
with the different software protection techniques. The error rates of ITMR were obtained from [12].
Table 6-1 Error Rates of Different Software Protection Techniques 
Error (%)        Fib      BasicMath   Qsort    Roots
No Protection    73.07    87.09       50.25    74.91
ITMR             0        3.97        0        1.85
TTMR             8.33     14.73       14.62    3.51
TTMR + ITMR      0        0.60        0.14     0
 
 
Figure 6-12 Injection of the Different Types of Instructions of the Protected Code Using TTMR & ITMR (Math)
[Bar chart. Instruction types: Load, GEP, Store, BinOp. Series: Control Flow %, SDC %, Crash %, Correct %.]

Figure 6-13 Injection of the Different Types of Instructions of the Protected Code Using TTMR & ITMR (Qsort)
[Bar chart. Instruction types: Load, GEP, Store, BinOp, Cmp, Sext. Series: Control Flow %, SDC %, Crash %, Correct %.]
6.2.3 Study of Time Overhead  
The overhead, or time delay, added when applying our protection techniques is measured relative to the
execution time when no protection is applied. We started by recording the time it takes to execute the
unprotected code; we then protected the code with TTMR, ITMR and their combination, and measured the
execution time again. The Linux tool perf was used to measure the delays and the number of cycles [138]. The
results were obtained using an Intel Core i5-3470 at 3.2 GHz.
Table 6-2 Time Overhead 
Overhead       Fibo      BasicMath   Qsort     Roots
TTMR           10.32%    41.14%      13.35%    13.17%
ITMR           6.44%     8.55%       13.89%    87.54%
TTMR & ITMR    14.97%    18.18%      29.40%    131.70%
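The overhead percentages can be read as the relative increase in execution time over the unprotected run. The sketch below shows that calculation; the exact formula used to produce Table 6-2 is assumed here rather than stated in the text, and the timings in the example are hypothetical.

```python
def time_overhead(protected_s, unprotected_s):
    """Relative increase in execution time, as a percentage."""
    if unprotected_s <= 0:
        raise ValueError("the unprotected time must be positive")
    return 100.0 * (protected_s - unprotected_s) / unprotected_s


# Hypothetical timings in seconds for one benchmark.
baseline = 2.00
for mode, seconds in (("TTMR", 2.20), ("ITMR", 2.13), ("TTMR & ITMR", 2.30)):
    print(f"{mode:12s} {time_overhead(seconds, baseline):6.2f}%")
```

In practice each timing would be an average over several runs, since a single measurement is sensitive to scheduling noise.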
 
The time overhead observed when injecting the code is described next. The overhead depends on the error
type, as follows:
• SDC did not generate any significant overhead (less than 1%),
• Hang produced an indefinite overhead, since the program is stuck in an infinite loop,
• Control flow errors produced significant overhead, which could range from 0% up to multiple
times the original execution time; in some cases it reached an order of magnitude more than the original
time. When the overhead is infinite, a hang has occurred,
• Crash means that the code did not execute, or terminated incorrectly, producing 0% overhead.
The overhead added when protecting with the combined technique (TTMR & ITMR) was higher than
when only TTMR or only ITMR was used, except for BasicMath, where Table 6-2 reports a lower overhead for
the combined mode than for TTMR alone. The main reasons for the delays were the creation of new threads and
the joining of these threads, in addition to the delays added by the voting function and by the redundant
instructions added by ITMR. However, reliability has been dramatically improved using this mode, by reducing
the error rates.
Applying TTMR to protect the code resulted in low overhead. This can be explained by the parallel
nature of this protection technique, where redundant threads run on different cores. Overhead is only
generated when the threads are compared for error detection and recovery.
For ITMR, the overhead was low in most of the benchmarks, except for the Roots benchmark. The compiler
creates redundant independent instructions that take advantage of the processing architecture’s pipeline, thus
reducing the overhead. Compiler optimizations also play a role in reducing performance bottlenecks.
6.2.4 Reliability of the Injection Experiment VS Reliability Predictions  
After the injection and overhead results were obtained, it has been decided to discard the TTMR, since it 
does not provide high reliability compared to ITMR, which is capable of reducing the SEE error rates without
severely affecting performance.
The comparison of the reliability prediction model to the reliability of the injection experiment for ITMR
can be found in Chapter 5. This section compares the reliability predictions for the combined
protection mode to the results of the injection experiments.
We use the results of the injection experiments to determine the precision of the reliability
prediction models of the processing architecture from Section 4.3.5. From Sections 6.2.1 and
6.2.2, we can see that the sensitivity of instructions has been reduced when using our protection methods.
We evaluate the reliability predictions for the code protected using the combined protection techniques
(TTMR & ITMR).
The experimental reliability of the protected code, using the results of the injection experiment, is modelled
using Equations (6-1), (6-2), (6-3), (6-4) and (6-5) for the reliability of the different levels
of cache, the RAM and the CPU:

$R_{1inj}(t) = \exp\left(-W H_1 \lambda t \sum_{i=0}^{l} S_{ip} N_{ip}\right)$    (6-1)

$R_{2inj}(t) = \exp\left(-W X_2 \lambda t \sum_{i=0}^{l} S_{ip} N_{ip}\right)$    (6-2)

$R_{3inj}(t) = \exp\left(-W X_3 \lambda t \sum_{i=0}^{l} S_{ip} N_{ip}\right)$    (6-3)

$R_{rinj}(t) = \exp\left(-W X_r \lambda t \sum_{i=0}^{l} S_{ip} N_{ip}\right)$    (6-4)

$R_{CPUinj}(t) = \exp\left(-W \frac{\sum_{i=0}^{l} S_{ip} N_{ip}}{S_{P1} S_{P2}} \lambda t\right)$    (6-5)

The reliability of the whole system is given by:

$R_{inj}(t) = R_{1inj}(t)\, R_{2inj}(t)\, R_{3inj}(t)\, R_{rinj}(t)\, R_{CPUinj}(t)$    (6-6)
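The structure of Equations (6-1) to (6-6) can be checked numerically with a short script. This is a sketch under stated assumptions: the symbols are read as written in the equations (their definitions are given in Chapter 4 and are not repeated here), the per-component factor x stands for H1, X2, X3 or Xr, S and N are taken to be per-instruction-type sensitivities and counts, and every numeric value below is hypothetical.

```python
import math


def weighted_sensitivity(sensitivities, counts):
    """Sum of S_ip * N_ip over all instruction types."""
    return sum(s * n for s, n in zip(sensitivities, counts))


def component_reliability(t, lam, w, x, sn_sum):
    """Form shared by Equations (6-1) to (6-4): exp(-W x lambda t sum)."""
    return math.exp(-w * x * lam * t * sn_sum)


def cpu_reliability(t, lam, w, sp1, sp2, sn_sum):
    """Equation (6-5): exp(-W (sum / (SP1 SP2)) lambda t)."""
    return math.exp(-w * sn_sum / (sp1 * sp2) * lam * t)


def system_reliability(t, lam, w, factors, sp1, sp2, sensitivities, counts):
    """Equation (6-6): product of the cache, RAM and CPU reliabilities."""
    sn_sum = weighted_sensitivity(sensitivities, counts)
    r = cpu_reliability(t, lam, w, sp1, sp2, sn_sum)
    for x in factors:  # H1, X2, X3, Xr
        r *= component_reliability(t, lam, w, x, sn_sum)
    return r


# Hypothetical parameters, for illustration only.
params = dict(lam=1e-6, w=1.0, factors=(0.9, 0.08, 0.015, 0.005),
              sp1=1.0, sp2=1.0, sensitivities=(0.006, 0.0014),
              counts=(1000, 500))
for seconds in (60, 600, 3600, 21600, 86400):
    print(seconds, round(system_reliability(seconds, **params), 6))
```

Because every factor is an exponential in t, the product in Equation (6-6) is itself a single exponential whose rate is the sum of the component rates, so the system reliability decreases monotonically with exposure time.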
The reliability obtained from Equation (6-6) is denoted the experimental reliability and is
taken as the ground truth against which the theoretical reliability prediction model is compared. The
reliability obtained from Equation (4-41) and the results of injecting the unprotected code from Section
6.2.1 is denoted the predicted reliability. For each benchmark we computed the reliability after several
periods of time; the results are given in Table 6-3.
For all the tested benchmarks, we can observe that the predicted and experimental reliability for the
protected mode is higher than the reliability of the unprotected mode. This is a result of applying the error
detection and recovery code, where error rates have been reduced from (73.07%, 50.25%, 87.09% and 74.91%)
to (0%, 0.14%, 0.60% and 0%) in the Fibo, Qsort, Math and Roots benchmarks respectively (Figures 6-14 to
6-17).
 
Table 6-3 Experimental & Predicted Reliability

Fibo Benchmark
Time               Predicted R(t)   Experimental R(t)   Error (%)
After 60 Seconds   0.99999          0.99998             1.0e-3
After 10 Minutes   0.99999          0.99984             1.5e-2
After 1 Hour       0.99999          0.99907             9.2e-2
After 6 Hours      0.99995          0.99995             0
After 1 Day        0.98970          0.97810             1.17

Roots Benchmark
Time               Predicted R(t)   Experimental R(t)   Error (%)
After 60 Seconds   0.99999          0.99999             0
After 10 Minutes   0.99999          0.99998             1.0e-3
After 1 Hour       0.99999          0.99992             4.6e-2
After 6 Hours      0.99976          0.99953             2.3e-2
After 1 Day        0.95325          0.99813             4.7

Math Benchmark
Time               Predicted R(t)   Experimental R(t)   Error (%)
After 60 Seconds   0.99999          0.99973             2.6e-2
After 10 Minutes   0.99999          0.99731             0.26
After 1 Hour       0.99999          0.98402             1.59
After 6 Hours      0.99934          0.90790             9.15
After 1 Day        0.88849          0.67947             23.52

Qsort Benchmark
Time               Predicted R(t)   Experimental R(t)   Error (%)
After 60 Seconds   0.99999          0.99996             3.0e-3
After 10 Minutes   0.99999          0.99961             0.038
After 1 Hour       0.99999          0.99769             0.23
After 6 Hours      0.99986          0.98625             1.36
After 1 Day        0.97136          0.94613             2.59
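
The Error (%) column of Table 6-3 appears to be the absolute difference between the two reliabilities divided by the predicted value. This is an inference from the tabulated numbers rather than a definition stated in the text, so the sketch below should be read as a check under that assumption. It also shows that normalising by the experimental value instead gives a noticeably larger percentage for the same row.

```python
def percent_error(predicted: float, experimental: float, reference: float) -> float:
    """Absolute difference between the two reliabilities, as a percentage of `reference`."""
    return abs(predicted - experimental) / reference * 100.0

# Math benchmark, "After 1 Day" row of Table 6-3.
predicted, experimental = 0.88849, 0.67947

# Assumed convention of the table: normalise by the predicted reliability.
err_vs_predicted = percent_error(predicted, experimental, predicted)

# Alternative convention: normalise by the experimental (ground-truth) reliability.
err_vs_experimental = percent_error(predicted, experimental, experimental)

print(f"relative to predicted:    {err_vs_predicted:.2f}%")
print(f"relative to experimental: {err_vs_experimental:.2f}%")
```

The first value agrees with the tabulated 23.52% to within rounding of the inputs; the second is close to 30.76%.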
 
 
Figure 6-14 Comparing the Predicted and Experimental Reliability of the Fibo benchmark
Figure 6-15 Comparing the Predicted and Experimental Reliability of the Roots benchmark
Figure 6-16 Comparing the Predicted and Experimental Reliability of the Math benchmark
Figure 6-17 Comparing the Predicted and Experimental Reliability of the Qsort benchmark
 
 
The protection results from combining two techniques: ITMR, which ensures that all instructions are
triplicated, and TTMR, which triplicates threads on different CPU cores. The reliability plots show that the
precision of the reliability prediction for the protected mode decreases as the time the device is exposed to
SEUs increases. The predictions are accurate, with an error between the predicted and experimental
reliabilities of the protected mode of less than 2.66% for exposure times up to 1 day for all of the benchmarks
except the Math benchmark, where the error reached 10.07% after 6 hours and 30.76% after one day. In some
benchmarks, the prediction remained accurate for up to one week of SEU exposure.
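
The instruction-level triplication behind ITMR can be illustrated with a small Python sketch. It is not the LLVM pass developed in this work: the function names are hypothetical, the fault is simulated by an explicit bit mask, and the bitwise majority voter is one common way to vote among three copies, assumed here for illustration only.

```python
def majority_vote(a: int, b: int, c: int) -> int:
    """Bitwise majority of three redundant copies: each output bit takes the value held by at least two copies."""
    return (a & b) | (a & c) | (b & c)

def triplicated_add(x: int, y: int, fault_mask: int = 0) -> int:
    """Compute x + y three times and vote; `fault_mask` simulates an SEU hitting one copy."""
    r1 = x + y
    r2 = x + y
    r3 = x + y
    r2 ^= fault_mask  # hypothetical bit flip(s) in the second copy only
    return majority_vote(r1, r2, r3)

# A single corrupted copy is outvoted by the two healthy ones.
print(triplicated_add(20, 22, fault_mask=1 << 5))
```

As long as only one of the three copies is corrupted, the voter returns the correct value, which is the property that makes recovery (and not only detection) possible.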
6.2.5 Discussion 
To better understand the results, the benchmarks were profiled in order to determine the percentage of each
instruction type. Table 6-4 shows the ratios of the instruction types of the tested benchmarks in three modes:
unprotected, ACEDR-protected with ITMR, and the combined ITMR and TTMR protection.
 
Table 6-4 Profiling instruction types before and after protection

Fibo
Instruction type   No Protection   ITMR     Combined mode
load               81.81%          49.38%   44.44%
store              18.18%          50.61%   55.55%

Qsort
Instruction type   No Protection   ITMR     Combined mode
store              24.71%          12.16%   15.76%
load               55.05%          55.55%   67.25%
gep                8.98%           13.75%   5.42%
binop              11.23%          9.52%    11.55%

Roots
Instruction type   No Protection   ITMR     Combined mode
load               40.98%          38.80%   36.47%
store              37.70%          40.12%   40.25%
binop              20.63%          24.07%   23.27%

Math
Instruction type   No Protection   ITMR     Combined mode
load               39.06%          27.89%   39.05%
store              34.37%          36.31%   19.48%
binop              20.31%          21.05%   39.05%
gep                6.25%           14.73%   2.41%
 
 
The reliability is inversely proportional to the number of instructions, which is why the blue curves differ
from one benchmark to another. For example, the Fibo benchmark has the lowest number of instructions and
shows the highest reliability. The other reason for the different reliability curves is the number of memory
instructions in the benchmark, which are highly vulnerable to injection errors. Injecting into memory
instructions corrupts not only the memory but also parts of the CPU, since the data loaded from or stored to
memory is corrupted; the errors then propagate and cause further error events. Branch instructions can be
corrupted indirectly, by corrupting the compare (cmp) instruction, the preceding load instruction or the
branch target address, which makes them very susceptible to injection errors. Memory instructions represent a
high percentage (73.43%) of the Math benchmark, which also contains a large number of instructions (1160)
compared to the rest, making it more vulnerable than the other tested benchmarks.
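
The 73.43% figure quoted for the Math benchmark can be reproduced from the unprotected column of Table 6-4 by adding the load and store shares. The sketch below assumes that only load and store count as memory instructions in that figure (an inference from the numbers matching); the dictionary layout and function name are illustrative.

```python
# Unprotected instruction profile of the Math benchmark (Table 6-4), in percent.
math_profile = {"load": 39.06, "store": 34.37, "binop": 20.31, "gep": 6.25}

# Assumption: the quoted memory share counts load and store only.
MEMORY_TYPES = ("load", "store")

def memory_share(profile: dict) -> float:
    """Percentage of instructions that read from or write to memory."""
    return sum(profile.get(kind, 0.0) for kind in MEMORY_TYPES)

print(f"Math memory instructions: {memory_share(math_profile):.2f}%")
```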
These results demonstrate the new capability of using the prediction model without performing injection
experiments on a new benchmark. However, a user of this model must have a data set of generic instruction-type
sensitivities, obtained from quantitative software injection (or from a neutron radiation test) of other
benchmarks. Our work provides starting values for such investigations, as found in Sections 6.2.1 and 6.2.2.
The reliability prediction model is a first attempt in research to model the reliability of a whole processing
architecture, so its accuracy is affected by multiple parameters. Some of these parameters are
hardware/software related, and others are related to the environment. The different levels of cache, the CPU
and the RAM all have different error rates (depending on the orbit and the circuit technology size) and
different hit/miss rates (obtained using the perf Linux profiling tool [138]). Another parameter to take into
account is the CPU and the number of cache levels it incorporates, which can dramatically affect the
reliability prediction model. The error rates obtained from the injection experiments are affected by the
benchmark's instruction types, the size of the code and the Operating System. According to [136], enabling the
Operating System makes the system more prone to SEU bit-flips. The current prediction model takes into account
the processing architecture configuration and how the processor is connected to its different parts, such as
the RAM and the caches. The prediction model also takes the OS into consideration when running TTMR as three
parallel threads on three CPUs, which changes the configuration of the CPUs and their connection to the caches
and the RAM.
The accuracy of the prediction model can be improved further by including other OS parameters specific to the
running benchmark (if it calls OS libraries), such as input/output system calls, exec, fork, etc. As with
TTMR, any call to the OS changes the configuration of the platform and how it runs from the perspective of the
prediction model, and this can affect the reliability prediction.
For our case study, which is predicting the reliability of processing architectures in orbit, one day of
accurate predictions is enough for LEO orbits [84], as in that period the satellite completes multiple orbits.
Our research provides a new, fast and open way not only to estimate but also to practically test, using
software fault injection, the reliability of COTS processing architectures operating in the space environment.
However, in order to fully understand the architecture and validate our model, a physical radiation test should
be conducted. In the future, we hope to compare the reliability obtained from the software injection experiment
and from the reliability prediction model with a proton radiation test.
6.3 Summary 
Protecting COTS against SEUs has become a necessity, especially with the scaling of processing architectures,
which makes SEUs a prominent problem not only in the space domain: SEUs have also been a source of
disturbances for ICs at sea level. It is crucial for designers in mainstream, embedded or mission-critical
fields to ensure the reliability of their systems. Systems with redundant hardware such as TMR and ECC have
been used to mitigate SEUs; however, this approach always increases the complexity of the system and adds more
overhead and higher costs. COTS are the alternative if their reliability is improved using software error
detection and recovery techniques. This will enable more processing power at lower cost and energy consumption.
A new adaptive framework has been presented in this work, capable of changing the protection mode of the COTS
processing architecture depending on the SEU error rate. Adaptiveness enables the system to keep high
reliability without sacrificing performance. We have first demonstrated how the adaptive system can be
implemented, enabling the running code to switch its state depending on the SEU rate. Three operational modes
have been selected, depending on their performance and reliability. The first mode is the unprotected mode,
which has the highest performance but adds no error detection and recovery schemes. In the second mode the
system is protected using ITMR; this mode offers higher reliability than the unprotected mode with an
acceptable overhead. In the third mode the system is protected with the combined protection techniques,
including both ITMR and TTMR. This mode reduces the error rate dramatically and provides the highest
reliability; however, it adds a time overhead to the system, which is benchmark dependent. The error rate is
defined as the number of injections that caused an error divided by the total number of injections.
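The mode selection and the error-rate definition above can be sketched in a few lines of Python. The two thresholds are hypothetical tuning parameters introduced only for illustration (this work does not prescribe these names or values), and the selector is a simple threshold rule, not the switching mechanism implemented in the framework.

```python
from enum import Enum

class Mode(Enum):
    UNPROTECTED = "unprotected"
    ITMR = "ITMR"
    COMBINED = "ITMR+TTMR"

def select_mode(seu_rate: float, low_threshold: float, high_threshold: float) -> Mode:
    """Pick the cheapest mode judged sufficient for the current SEU rate.

    Both thresholds are hypothetical and would have to be tuned per mission.
    """
    if seu_rate < low_threshold:
        return Mode.UNPROTECTED
    if seu_rate < high_threshold:
        return Mode.ITMR
    return Mode.COMBINED

def error_rate(erroneous_injections: int, total_injections: int) -> float:
    """Number of injections that caused an error divided by the total number of injections."""
    if total_injections <= 0:
        raise ValueError("total_injections must be positive")
    return erroneous_injections / total_injections

print(select_mode(5e-5, low_threshold=1e-6, high_threshold=1e-4).value)
print(f"{error_rate(6, 1000):.2%}")
```

A threshold rule keeps the decision cheap enough to evaluate at run time; a real deployment would also need hysteresis or a minimum dwell time to avoid rapid toggling near a threshold.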
The reliability prediction equations demonstrate that the unprotected code is the least reliable mode, that
ITMR offers better reliability by reducing the error rate, and that the best reliability is obtained with the
third mode of protection. The reliability equations for the different protection techniques have been
developed, including multiple parameters related to the processing architecture hardware, such as the number
of cores, the pipeline stages and the hit and miss rates of the components. Other parameters are specific to
the environment, such as the SEU error rate. The model also takes into account the different instruction types
and their different sensitivities to a particular SEU error rate. Including all these parameters is necessary
to improve the accuracy of the prediction equations.
The error injection experiments verify the reliability prediction equations: the error rate was at its highest
when no protection was applied. Reliability improved when ITMR was applied to protect the system, with the
error rate dropping drastically compared to the unprotected mode. The third protection mode showed the highest
reliability, since its error rate was the lowest.
The adaptive solution contributes to the state of the art with the following points:
• Low overhead, achieved by switching modes depending on the SEU error rate and by utilizing redundant
independent instructions that take advantage of the CPU's abundant resources,
• A reliability prediction model for all the processing architecture components,
• A high error detection and recovery rate, where the error rate has been reduced to less than 1% in some
benchmarks,
• Protection of multiple data and instruction types (i32, i32*, i1, i8, i8*, i64, float & double, float
pointers & double pointers) [12], in addition to both the CPU and the memory Read/Write instruction types,
• Comparison of the reliability predictions with the reliability obtained from the injection experiment.
Yasser Nezzari                                                                                  Chapter 7, Conclusion & Future Work 
132 
 
7. CONCLUSION & FUTURE WORK 
There is an increasing demand for more powerful processors for next-generation space missions, especially for
the more demanding space applications such as communication and Earth observation. The downlink bottleneck has
been a major issue for space applications and drives the need for higher-performance processors, capable of
handling more raw data by processing and compressing it before transmitting it to the ground station. Powerful
processors can also be used for autonomous missions, such as satellite constellations, automatic manoeuvring
and guidance, and in-situ studies, satisfying their demanding real-time requirements.
Multiple stringent requirements should be kept in mind before using a processor in the space domain, such as
size, power consumption and weight. These parameters are constrained by the capacity and physical dimensions of
the spacecraft's photovoltaic solar array. Another crucial parameter is the processor's reliability, which is
the reason behind the emergence of technologies such as Radiation Hardened (RH) and Radiation Hardened By
Design (RHBD) processing architectures, known for their resilience against SEEs. However, these technologies
still lag behind in performance, estimated to be 5-10 years behind new-generation COTS [5]. Other downsides of
RH and RHBD are their cost and high power demand. These downsides have steered space applications towards
considering the use of COTS. The performance gap between COTS and their RH/RHBD relatives is due to the
technology scaling in ICs that took place in recent years, where smaller and faster transistors can be fitted
in the same IC, allowing for higher performance with lower power consumption, size and cost. On the other hand,
COTS are susceptible to the threat of SEUs in harsh environments such as space because of their low threshold
voltage. Intermittent effects can also disrupt the functionality of COTS processors; these are caused in
particular by internal factors, specific to the IC, when its operating conditions change (e.g. component
overload or component wear-out) [1-3]. Small ICs are increasingly prone to SEUs even at sea level. Although
these effects do not cause permanent damage, mainstream and embedded systems should be protected against them.
In order to understand the reliability of COTS processors, it is necessary to understand modern processor
architectures. The different available architectures and their arrangements have been studied, with a detailed
description of the caches and the pipelining, which are needed to model the reliability in the later chapters.
Modern COTS show great potential, with their high processing power enabling order-of-magnitude performance
improvements. Other features of these processors include their low cost and low energy consumption. The
different radiation effects on electronics have been presented, including the radiation environments, the
different classes of radiation and a quantification of the radiation effects. This allowed this research to
determine how high an error rate can be tolerated, in addition to the limitations of the proposed solutions.
SEUs can be mitigated using the proposed solutions; other effects, such as the cumulative effects, cause
hardware errors and are outside the scope of this research.
 
The different hardware modifications used in manufacturing radiation hardened processors have been studied.
Despite their resilience to radiation effects, RH and RHBD processors are costly and power hungry, have lower
performance than their COTS relatives and have a limited market. Typically RH processors are manufactured at
150nm, whereas COTS are at 10-28nm, which is several technology generations more advanced [5]. RH processors
can mitigate up to 10^-11 errors/bit-day; considering the downscaling occurring in new chips, the SEU rate is
close to 10^-5 errors/bit-day.
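To give a feel for what these per-bit figures mean at device level, the following worked arithmetic scales them to a memory of a given size. The 1 MiB memory is a hypothetical example, and treating both quoted figures as per-bit upset rates is an interpretation made for this sketch rather than a statement from the source.

```python
def expected_upsets_per_day(bits: int, rate_per_bit_day: float) -> float:
    """Expected number of upsets per day for a memory of `bits` bits."""
    return bits * rate_per_bit_day

BITS_1_MIB = 8 * 1024 * 1024  # hypothetical 1 MiB memory: 8,388,608 bits

cots_per_day = expected_upsets_per_day(BITS_1_MIB, 1e-5)   # roughly 84 upsets per day
rh_per_day = expected_upsets_per_day(BITS_1_MIB, 1e-11)    # roughly 8.4e-5 upsets per day

print(f"1e-5 errors/bit-day  -> {cots_per_day:.1f} upsets/day")
print(f"1e-11 errors/bit-day -> {rh_per_day:.2e} upsets/day")
```

The six orders of magnitude between the two rates carry over unchanged to the expected daily upset count, whatever the memory size.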
The theoretical background of ECC has been studied, including the different ECC algorithms used in
communication systems and in the protection of memory systems. The algorithms possess different error
detection and recovery abilities: binary cyclic codes such as the Hamming code are capable of detecting two
errors and correcting a single bit, while Golay, Reed-Muller and Reed-Solomon codes can detect and correct
multiple errors. Following the comparison in Section 2.2 between COTS and RH/RHBD processors, COTS offer the
best performance. However, they still need to be protected against radiation-induced SEUs in harsh environments
such as space in order to improve their reliability. The first proposed solution in this work was to use the
LLVM compiler, a strong and versatile framework whose libraries enable users to create their own compiler or to
extend the current platform with customized options, such as optimizing the generated code to improve its
performance or adding compiler passes that extend the generated code to make it more reliable in the presence
of SEUs. LLVM is capable of supporting multiple high-level languages and multiple processing architectures.
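The single-bit correction ability attributed to the Hamming code above can be illustrated with the textbook Hamming(7,4) construction. This is a generic sketch, not code from this work: the function names are illustrative, and the plain (7,4) code shown here only corrects one flipped bit. Detecting a second error additionally requires an overall parity bit (the extended SECDED variant), which is not shown.

```python
def hamming74_encode(data):
    """Encode 4 data bits [d1, d2, d3, d4] as [p1, p2, d1, p4, d2, d3, d4]."""
    d1, d2, d3, d4 = data
    p1 = d1 ^ d2 ^ d4  # parity over codeword positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4  # parity over codeword positions 2, 3, 6, 7
    p4 = d2 ^ d3 ^ d4  # parity over codeword positions 4, 5, 6, 7
    return [p1, p2, d1, p4, d2, d3, d4]

def hamming74_decode(codeword):
    """Correct at most one flipped bit and return the 4 data bits."""
    c = list(codeword)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s4 = c[3] ^ c[4] ^ c[5] ^ c[6]
    syndrome = s1 + 2 * s2 + 4 * s4  # 1-based position of the flipped bit, 0 if none
    if syndrome:
        c[syndrome - 1] ^= 1
    return [c[2], c[4], c[5], c[6]]

word = hamming74_encode([1, 0, 1, 1])
word[4] ^= 1  # simulate a single bit flip in the stored codeword
print(hamming74_decode(word))
```

Such a code protects stored data only, which is why the software techniques in this work are still needed for values held in CPU registers.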
The state of the art in error detection and correction software was studied in this research, and the
different techniques were compared in order to find the gaps. Most state-of-the-art techniques focus on error
detection and ignore recovery, which can be impractical. Another crucial point that has been ignored in the
literature is that most techniques assume special hardware, where error detection is done using hard-ECC memory
systems, and the proposed techniques only focus on protecting the CPU registers. Furthermore, the protection
techniques add large overheads and can be architecture dependent, limiting their use. Most implementations of
the protection code at the application and kernel levels are added manually, which may take a lot of
development time, especially for large benchmarks. Adding protection code using compiler passes enables
automatic generation of the protection code. The literature always considers the worst-case scenario, which can
add large overheads; an adaptive system capable of switching the protection level of the code can therefore
provide both high reliability and high performance when needed. Other techniques studied are the software tools
used to keep resources shared between threads coherent.
The reliability of the whole processing architecture was assessed by developing a theoretical model based on a
Markov chain. The model covers both the case where the processor is unprotected and the case where it is
protected using the error detection and correction algorithms developed in this research. The model includes
multiple parameters to enhance its accuracy. Depending on their origin, the parameters can be internal (related
to the processing architecture used) or external (related to the environment where the processor is operated).
Internal parameters include the width of the word, depending on the processing architecture used (64 bits,
32 bits, 16 bits, etc.), the number and types of instructions (depending on the benchmark used), the hit/miss
rate of the different levels of cache and the RAM, and the CPU's clock frequency. The prediction model
estimates the worst-case scenario and does not consider the case where an error has occurred before writing to
memory or loading to CPU registers.
The environment where the processor is operated can affect its reliability, since the particle flux causes
different error rates. The error rate in orbit changes depending on the location of the spacecraft: regions
near the SAA and the poles can have higher radiation rates, causing more SEUs. Solar activity can also alter
the SEU rate. The reliability model was validated by means of fault injection, where software fault injection
experiments were conducted on both the protected and the unprotected processing architecture in order to
determine the reliability added by the software protection code and the accuracy of the reliability prediction
model. The injection was achieved using LLVM passes, capable of static and dynamic error injection at any part
of the code of the benchmarks used. The reliability model relies on the sensitivity of the different
instruction types, making the injection experiment a crucial part of this work. The sensitivity of an
instruction is the portion of the injections that caused errors. Different types of error have been observed in
the benchmarks, including SDC, control flow, crash and hang. The protection code has shown its effectiveness by
lowering the error rate of the injected protected code. The injection is performed on the instruction's data,
which corrupts the instruction's result.
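The notion of sensitivity (the portion of injections that cause an error) can be illustrated with a toy bit-flip campaign in Python. This is only a sketch of the idea: the injector of this work operates on LLVM instructions, whereas the toy program, its `out_mask` parameter and the function names below are hypothetical and chosen to show how some faults are masked before they reach the output.

```python
import random

def flip_bit(value: int, bit: int, width: int = 32) -> int:
    """Flip one bit of a `width`-bit value, emulating an SEU in an instruction's data."""
    return (value ^ (1 << bit)) & ((1 << width) - 1)

def toy_program(values, fault=None, out_mask=0xFFFFFFFF):
    """Sum `values` modulo 2**32; `fault` = (index, bit) corrupts one loaded operand."""
    total = 0
    for i, v in enumerate(values):
        if fault is not None and fault[0] == i:
            v = flip_bit(v, fault[1])
        total = (total + v) & 0xFFFFFFFF
    return total & out_mask

def sensitivity(values, trials, rng, out_mask=0xFFFFFFFF):
    """Fraction of single-bit injections whose output differs from the fault-free run."""
    golden = toy_program(values, out_mask=out_mask)
    errors = 0
    for _ in range(trials):
        fault = (rng.randrange(len(values)), rng.randrange(32))
        if toy_program(values, fault, out_mask) != golden:
            errors += 1
    return errors / trials

rng = random.Random(0)
data = [3, 14, 15, 92, 65]
full = sensitivity(data, 1000, rng)                    # every flip changes the full sum
masked = sensitivity(data, 1000, rng, out_mask=0xFF)   # flips in bits 8..31 are masked
print(full, masked)
```

When only the low byte of the result is observed, flips in the upper bits never reach the output, so the measured sensitivity falls well below one; this is the kind of instruction-dependent behaviour the injection experiments quantify.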
The reliability of the CPU is significantly affected by its configuration, especially the cache levels and how
the whole processing architecture is configured. The OS libraries can affect the reliability of the processing
architecture, as they add extra instructions to the benchmarks, raising the probability of error occurrence.
The model can reach a very high accuracy level if the ratio λp/λc corresponds to the optimal point. For values
above the optimal point, the accuracy ranges from 92.35% to 98.5% depending on the benchmark, except for the
Susan benchmark, where the accuracy was 77% due to the randomness of the error injection in this benchmark.
This research has successfully implemented error detection and correction automatically using a compiler. The
compiler contains two passes, an analysis pass and a transformation pass, which add redundant instructions and
checks at instruction level. This enhances the reliability of the benchmarks used by reducing the error rate
produced by SEUs, which cause bit flips in the memory and the CPU registers of processing architectures. The
error rate is defined as the number of injections that caused an error divided by the total number of
injections. The error detection and correction libraries can be included inside the LLVM pass or linked to the
benchmarks requiring protection.
This research offers a software protection method capable of protecting multiple data types, which can be
implemented for multiple high-level languages and multiple processing architectures. Both the intermediate
representation and its assembly show the addition of our software protection code. The GDB tool has been used
to inspect the assembly of the protected and unprotected codes in order to compare them. The initial results of
this research were encouraging enough to continue using compiler passes for software protection, compared to
the state of the art, which has used the GCC compiler to implement FEC [119] and protect the memory system of
the CPU. Compared to EDDI [13], this research shows better performance and, unlike EDDI, the software
protection developed in this research is capable of recovery and is portable to multiple high-level languages
and multiple processing architectures. ECC such as the Hamming and BCH codes is only capable of protecting the
memory system and not the CPU.
After considering the initial results, this research implemented a new automatic compiler error-detection and
recovery technique (ACEDR), applied at the optimization stage of the compilation process via compiler passes.
ACEDR has been tested using fault injection, where the injection is done using an LLVM pass capable of static
and dynamic fault injection. The protected code had a very low error rate when injected compared to the
unprotected code, meaning that ACEDR has improved the reliability of the system. The injection experiment was
performed on multiple benchmarks, where all the instructions of the benchmarks were injected, one at a time.
The instructions were classified into two main categories: memory instructions (responsible for reading/writing
from/to memory and creating new memory locations) and CPU register instructions, such as the arithmetic and
logic operations. This research has shown that ACEDR is capable of fully protecting the CPU registers from the
fault injections. The high error detection and recovery rate was achieved because of the numerous data types
ACEDR covers, such as i32, i32*, i1, i8, i8*, i64, float & double, and float & double pointers. The different
error types, such as crashes, SDCs and control flow errors, have been mitigated using ACEDR with minimal
overhead compared to the state of the art in software protection techniques. Furthermore, ACEDR relies on
LLVM, meaning it can be ported to multiple processing architectures (it has been tested on Intel and ARM
platforms) and implemented for multiple high-level languages. The overhead resulting from ACEDR was low, since
it replicates redundant independent instructions, taking advantage of the pipelining of the processing
architecture. This work has some limitations: it only protects memory and CPU registers accessed by
instructions (not the instructions themselves), meaning that it does not consider bit flips that would
transform an instruction into another instruction or into a jump/conditional jump instruction. According to
[141], it is possible to protect the protection code itself using their self-check routine. This work also does
not consider functions that return pointers, since voting is based on the binary consistency of function
outputs.
Another main outcome of injecting ACEDR is the confirmation of the reliability prediction model. The
reliability was predicted to be higher when ACEDR is applied, and this has been confirmed by the injection
experiment results. The precision of the reliability prediction model depends on the ratio between the error
rate of the processor and that of the cache memory (λp/λc): the precision is limited if the ratio is below the
optimal point. The optimal point depends on the benchmark and its variety of instruction types. ACEDR improves
the reliability of the system by reducing the error rate caused by Single Event Upsets (SEUs). In some
benchmarks, the error rate was reduced to less than 1%. This research has been tested on two machines: an Intel
Core i5-3470 and a Raspberry Pi 3. On the first processor the overhead was less than 15%, and on the second it
was less than 17%.
The previous solutions to the problem of SEUs causing bit-flips in the memory and CPU registers can add large
overhead, especially since the protection is designed for the worst-case error rate. An adaptive solution is
able to reduce the overhead by switching between protection modes: the modes with the highest reliability are
switched on when the error rate is high, at the cost of extra overhead, and modes with lower reliability are
switched on when the error rate is low. The idea of an adaptive system is inspired by the fact that the error
rate in orbit changes depending on the spacecraft's location. The areas near the SAA and the poles can have
very high radiation densities, causing an increase in SEUs. Solar flares are another expected cause of high
radiation flux. This research has demonstrated the ability of the running system to switch its state depending
on the SEU rate, using three operation modes. The operation modes are classified according to their
reliability and performance. In the unprotected mode the code runs without any FEC code added to it, giving it
the highest performance but the lowest reliability. In the ITMR mode the system is protected at instruction
level, similar to ACEDR, and the reliability is improved at the cost of an acceptable overhead. The last mode
combines ITMR and TTMR and has the highest reliability and the largest overhead.
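The thread-level redundancy of TTMR can be sketched with three worker threads and a majority vote over their results. This is an illustration only: in this work the replicas run on separate CPU cores, whereas Python threads are not pinned to cores, and the `FlakyFib` class is a hypothetical stand-in for a replica hit by an SEU.

```python
import threading
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

def run_ttmr(task, *args):
    """Run `task` in three threads and return the value reported by at least two of them."""
    with ThreadPoolExecutor(max_workers=3) as pool:
        futures = [pool.submit(task, *args) for _ in range(3)]
        results = [f.result() for f in futures]
    value, votes = Counter(results).most_common(1)[0]
    if votes < 2:
        raise RuntimeError("no majority among the three replicas")
    return value

def fib(n: int) -> int:
    """Iterative Fibonacci, standing in for a protected benchmark."""
    a, b = 0, 1
    for _ in range(n):
        a, b = b, a + b
    return a

class FlakyFib:
    """Hypothetical faulty task: exactly one call returns a result with one bit flipped."""
    def __init__(self):
        self._lock = threading.Lock()
        self._calls = 0

    def __call__(self, n: int) -> int:
        with self._lock:
            self._calls += 1
            corrupt = self._calls == 1
        result = fib(n)
        return result ^ 1 if corrupt else result

print(run_ttmr(fib, 10))         # all three replicas agree
print(run_ttmr(FlakyFib(), 10))  # one corrupted replica is outvoted
```

Voting on whole results keeps the replicas independent until the end, which is why thread-level redundancy costs more time than instruction-level voting but can also mask faults that corrupt an entire replica.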
The reliability prediction models for the different protection modes of the adaptive solution have been
included in this research, and the predicted results match what was expected for every mode used in this
system. The reliability prediction model includes multiple parameters in order to achieve better accuracy. All
the protection modes used in the adaptive system have been injected using the developed fault injector. The
goal of the injection experiment is to show the reliability improvement of the different protection modes
through the reduction in error rate. The injection experiment is also used to obtain the parameters used in the
reliability prediction equations, which enables the accuracy of the prediction model to be verified. Both the
predictions and the injection experiments confirm that the best reliability results were obtained when using
the combined protection techniques, where the error rates were reduced to between 0% and 0.60%. The second-best
results were obtained when ITMR was used, where the error rate dropped to between 0% and 3.97%. Using TTMR
alone reduced the error rates to between 3.51% and 14.73%, which we do not recommend for mission-critical
systems. The error detection and recovery came with an overhead of 14.97% to 131.70% for the combined
protection techniques, 6.44% to 87.54% when ITMR was used and 10.32% to 41.14% when TTMR was used.
Future Considerations 
Our research provides a cheap way to test the reliability of COTS processing architectures using software
fault injection; however, in order to understand SEUs more deeply, a physical radiation test should be
conducted. In the future, we hope to compare the reliability obtained from the software injection experiment
with a radiation test. Another point that we will investigate in the future is the addition of a feedback loop
to our error detection and recovery. Currently the adaptive part depends on a feedforward path to determine the
error rate; in case the feedforward fails to determine the error rate precisely, we believe that adding a
feedback loop in a real-time scenario will keep the system more reliable by switching the protection modes
depending on the error rate detected by our software protection techniques.
Publications 
The list of publications can be found in Table 7-1. 
 
Table 7-1 List of Publications
Title                                                                          Publication                                  Status
Compiler extensions towards reliable multicore processors                      2017 IEEE/AIAA Aerospace, Big Sky, Montana   Accepted
Modelling processor reliability using LLVM compiler fault injection            2018 IEEE/AIAA Aerospace, Big Sky, Montana   Accepted
ACEDR: Automatic Compiler Error Detection & Recovery for COTS CPU & Caches     IEEE Transactions on Reliability             Minor Corrections
Reliability Experiments in Multicore Processors: Instruction & Threaded EDAC   IEEE Transactions on Software Engineering    Submitted
  
Yasser Nezzari                                                                                                                           References 
138 
 
8. REFERENCES 
1. Baumann, R.C., Soft errors in advanced semiconductor devices-part I: the three radiation 
sources. IEEE Transactions on device and materials reliability, 2001. 1(1): p. 17-22. 
2. O'Gorman, T.J., et al., Field testing for cosmic ray soft errors in semiconductor memories. IBM 
Journal of Research and Development, 1996. 40(1): p. 41-50. 
3. Shivakumar, P., et al. Modeling the effect of technology trends on the soft error rate of 
combinational logic. in Dependable Systems and Networks, 2002. DSN 2002. Proceedings. 
International Conference on. 2002. IEEE. 
4. Pukite, P. and J. Pukite, Markov modeling for reliability analysis. 1998: Wiley-IEEE Press. 
5. Ginosar, R., Survey of processors for space. Data Systems in Aerospace (DASIA). Eurospace, 
2012: p. 1-5. 
6. Horst, R.W., R.L. Harris, and R.L. Jardine. Multiple instruction issue in the NonStop Cyclone 
processor. in ACM SIGARCH Computer Architecture News. 1990. ACM. 
7. Slegel, T.J., et al., IBM's S/390 G5 microprocessor design. IEEE Micro, 1999. 19(2): p. 12-23. 
8. Tremblay, M. and Y. Tamir. Support for fault tolerance in VLSI processors. in Circuits and 
Systems, 1989., IEEE International Symposium on. 1989. IEEE. 
9. Phelan, R., Addressing soft errors in ARM core-based SoC. ARM White Paper, 2003. 
10. Nezzari, Y. and C. Bridges. Compiler extensions towards reliable multicore processors. in 
Aerospace Conference, 2017 IEEE. 2017. IEEE. 
11. Lattner, C. and V. Adve, LLVM language reference manual. 2006. 
12. Nezzari, Y. and C. Bridges. Modelling processor reliability using LLVM compiler fault 
injection. in 2018 IEEE Aerospace Conference. 2018. IEEE. 
13. Oh, N., P.P. Shirvani, and E.J. McCluskey, Error detection by duplicated instructions in super-
scalar processors. IEEE Transactions on Reliability, 2002. 51(1): p. 63-75. 
14. Reis, G.A., et al. SWIFT: Software implemented fault tolerance. in Proceedings of the 
international symposium on Code generation and optimization. 2005. IEEE Computer Society. 
15. Baer, J.-L., Microprocessor architecture: from simple pipelines to chip multiprocessors. 2009: 
Cambridge University Press. 
16. Hennessy, J.L. and D.A. Patterson, Computer architecture: a quantitative approach. 2011: 
Elsevier. 
17. shingaridavesh. How Cache Memory works. 2012; Available from: 
https://www.engineersgarage.com/mygarage/how-cache-memory-works. 
18. Espinozahg. SiSoftware Official Live Ranker. 2017; Available from: 
https://ranker.sisoftware.co.uk/show_device.php?q=c9a598d1bfcbaec2e2a1cebcd9f990a78ab
d88b888ddfb9ca191b7c5f8c8ee87ba88aec6fbcbed95a899bfdabf82b294e7daea&l=en. 
19. Chiappetta, M. AMD Ryzen Review: Ryzen 7 1800X, 1700X, And 1700 - Zen Brings The Fight 
Back To Intel. 2017; Available from: https://hothardware.com/reviews/amd-ryzen-7-1800x-
1700x-1700-benchmarks-and-review. 
20. FTW, I.I., Intel Core i7-6950X @ 3.00GHz. 2016. 
21. Cortex-A73. Cortex-A73. 2019; Available from: https://developer.arm.com/ip-
products/processors/cortex-a/cortex-a73. 
22. Williams, R. Core i7-5960X Extreme Edition Review: Intel’s Overdue Desktop 8-Core Is Here. 
2014; Available from: https://techgage.com/print/core-i7-5960x-extreme-edition-review-
intels-overdue-desktop-8-core-is-here/. 
23. Benchoff, B. BENCHMARKING THE RASPBERRY PI 2. 2019; Available from: 
https://hackaday.com/2015/02/05/benchmarking-the-raspberry-pi-2/. 
24. Shvets, G. Test: Sandra Dhrystone (MIPS). 2018; Available from: http://www.cpu-
world.com/benchmarks/browse/910_80,965_61,993_80,1035_96/?c_test=6&PROCESS=Sho
w%20Selected. 
25. AMD FX-8350 Black Edition vs Intel Core i7-4770K. Available from: 
https://versus.com/en/amd-fx-8350-black-edition-vs-intel-core-i7-4770k. 
26. Group, B. ALU Performance: SiSoftware Sandra 2010 Pro (ALU). 2013; Available from: 
https://archive.is/20130204153332/http://www.tomshardware.com/charts/desktop-cpu-charts-
2010/ALU-Performance-SiSoftware-Sandra-2010-Pro-ALU,2408.html#selection-7597.22-
7599.17. 
27. Intel Core i7-3630QM. Available from: https://www.notebookcheck.net/Intel-Core-i7-
3630QM-Notebook-Processor.80051.0.html. 
28. Arm. Arm Processors for the Widest Range of Devices—from Sensors to Servers. 2019; 
Available from: https://www.arm.com/products/silicon-ip-cpu. 
29. Shimpi, A.L. ARM's Cortex A7: Bringing Cheaper Dual-Core & More Power Efficient High-
End Devices. 2011; Available from: https://www.anandtech.com/show/4991/arms-cortex-a7-
bringing-cheaper-dualcore-more-power-efficient-highend-devices. 
30. Angelini, C. ASRock's E350M1: AMD's Brazos Platform Hits The Desktop First. 2011; 
Available from: https://www.tomshardware.com/reviews/asrock-e350m1-amd-brazos-zacate-
apu,2840-10.html. 
31. NVIDIA. NVIDIA® TEGRA® MOBILE PROCESSORS. 2019; Available from: 
https://www.nvidia.com/object/tegra-3-processor.html. 
32. Hinum, K. Samsung Exynos 5250 Dual. 2013; Available from: 
https://www.notebookcheck.net/Samsung-Exynos-5250-Dual-SoC.86886.0.html. 
33. Hagedoorn, H. Core i5 2500K and Core i7 2600K review - Performance - Dhrystone | 
Whetstone. 2011; Available from: https://www.guru3d.com/articles-pages/core-i5-2500k-and-
core-i7-2600k-review,13.html. 
34. Angelini, C. The Intel Core i7-990X Extreme Edition Processor Review. 2011; Available from: 
https://www.tomshardware.com/reviews/core-i7-990x-extreme-edition-gulftown,2874-6.html. 
35. Bennett, K. Intel Core i7-3960X - Sandy Bridge E Processor Review. 2011; Available from: 
http://www.hardocp.com/article/2011/11/14/intel_core_i73960x_sandy_bridge_e_processor_r
eview/4. 
36. Logan, T. Intel 980x Gulftown. 2010; Available from: 
https://www.overclock3d.net/reviews/cpu_mainboard/intel_980x_gulftown/4. 
37. ARM. ARM Cortex-M0. 2019; Available from: https://developer.arm.com/ip-
products/processors/cortex-m/cortex-m0. 
38. Journal, E. ARM11 vs Cortex A8 vs Cortex A9 - Netbooks processors. 2011; Available from: 
https://web.archive.org/web/20110719103301/http://www.eeejournal.com/2009/12/arm11-vs-
cortex-a8-vs-cortex-a9.html. 
39. Shvets, G. AMD Phenom II X4 940 specifications. 2018; Available from: http://www.cpu-
world.com/CPUs/K10/AMD-Phenom%20II%20X4%20940%20Black%20Edition%20-
%20HDZ940XCJ4DGI%20(HDZ940XCGIBOX).html. 
40. Redaktion. Qualcomm Snapdragon 400 MSM8926 vs vs ARM Cortex A8 1.2 GHz. 2017; 
Available from: https://www.notebookcheck.net/400-MSM8926-vs-S4-Plus-MSM8230-vs-
Cortex-A8-12-GHz_5958_4062_3316.247596.0.html. 
41. Bluetooth. Benchmarks of ECS 945GCT-D with Intel Atom 1.6GHz. 2008; Available from: 
http://www.ocworkbench.com/2008/ecs/ECS_945GCT-D_Atom_board/b1.htm. 
42. Intel. Intel® Core™2 Extreme Processor QX9770. 2019; Available from: 
https://ark.intel.com/content/www/us/en/ark/products/34444/intel-core-2-extreme-processor-
qx9770-12m-cache-3-20-ghz-1600-mhz-fsb.html. 
43. Pro, S.S., ALU Performance: SiSoftware Sandra 2010 Pro (ALU). 2010. 
44. Staff, E. MIPS Makes 600MHz 20Kc Core Available for Design Starts. 2002; Available from: 
https://www.edn.com/electronics-news/4347882/MIPS-Makes-600MHz-20Kc-Core-
Available-for-Design-Starts. 
45. Merritt, B.R. Startup takes PowerPC to 25 W. 2007; Available from: 
https://www.eetimes.com/document.asp?doc_id=1165095#. 
46. Arm. Cortex-R4. 2019; Available from: https://developer.arm.com/ip-
products/processors/cortex-r/cortex-r4. 
47. Tylka, A.J., et al., CREME96: A revision of the cosmic ray effects on micro-electronics code. 
IEEE Transactions on Nuclear Science, 1997. 44(6): p. 2150-2160. 
48. Sandra, S. Synthetic SiSoft Sandra XI CPU. 2007; Available from: 
https://archive.is/20130204130212/http://www.tomshardware.com/charts/cpu-charts-
2007/Synthetic-SiSoft-Sandra-XI-CPU,333.html#selection-967.13-967.26. 
49. Intel. Intel® Core™2 Extreme Processor QX6700. 2019; Available from: 
https://ark.intel.com/content/www/us/en/ark/products/28028/intel-core-2-extreme-processor-
qx6700-8m-cache-2-66-ghz-1066-mhz-fsb.html. 
50. Mini-ITX, VIA EPIA PX10000 Pico-ITX Review. 2007. 
51. Arm. Cortex-A8. 2019; Available from: https://developer.arm.com/ip-
products/processors/cortex-a/cortex-a8. 
52. Wilson, D., AMD Athlon 64 FX-57: The Fastest Single Core. 2005. 
53. Wilson, A.L.S.D. Microsoft's Xbox 360, Sony's PS3 - A Hardware Discussion. 2005; Available 
from: https://www.anandtech.com/show/1719/3. 
54. Microchip. PIC10F200. 2019; Available from: 
https://www.microchip.com/wwwproducts/en/en019863. 
55. Arm, Cortex-M3. 2019. 
56. Mogensen, T.Æ., Basics of Compiler Design. 2009: Torben Ægidius Mogensen. 
57. Lattner, C.A., LLVM: An infrastructure for multi-stage optimization. 2002, University of 
Illinois at Urbana-Champaign. 
58. Cohn, R.S., et al., Optimizing Alpha executables on Windows NT with Spike. Digital Technical 
Journal, 1997. 9: p. 3-20. 
59. Heine, D.L. and M.S. Lam. A practical flow-sensitive and context-sensitive C and C++ memory 
leak detector. in ACM SIGPLAN Notices. 2003. ACM. 
60. Lattner, C. and V. Adve, Automatic pool allocation for disjoint data structures. ACM SIGPLAN 
Notices, 2003. 38(2 supplement): p. 13-24. 
61. Dhurjati, D., et al. Memory safety without runtime checks or garbage collection. in ACM 
SIGPLAN Notices. 2003. ACM. 
62. Chernoff, A., et al., FX! 32: A profile-directed binary translator. IEEE Micro, 1998(2): p. 56-
64. 
63. Ebcioğlu, K. and E.R. Altman, DAISY: Dynamic compilation for 100% architectural 
compatibility. ACM SIGARCH Computer Architecture News, 1997. 25(2): p. 26-37. 
64. Cytron, R., et al., Efficiently computing static single assignment form and the control 
dependence graph. ACM Transactions on Programming Languages and Systems (TOPLAS), 
1991. 13(4): p. 451-490. 
65. Lattner, C. and V. Adve. LLVM: A compilation framework for lifelong program analysis & 
transformation. in Code Generation and Optimization, 2004. CGO 2004. International 
Symposium on. 2004. IEEE. 
66. The LLVM Compiler Infrastructure. 2016; Available from: 
http://llvm.org/ProjectsWithLLVM/. 
67. Kaleidoscope: Implementing a Language with LLVM. 2016; Available from: 
http://llvm.org/docs/tutorial/index.html. 
68. Hasnain, Z. and A. Ditali. Building-in reliability: Soft errors-a case study. in Reliability Physics 
Symposium 1992. 30th Annual Proceedings., International. 1992. IEEE. 
69. Ziegler, J.F., et al., IBM experiments in soft fails in computer electronics (1978–1994). IBM 
Journal of Research and Development, 1996. 40(1): p. 3-18. 
70. Ziegler, J.F., Terrestrial cosmic rays. IBM Journal of Research and Development, 1996. 40(1): p. 
19-39. 
71. Ziegler, J.F., et al., Accelerated testing for cosmic soft-error rate. IBM Journal of Research and 
Development, 1996. 40(1): p. 51-72. 
72. Ziegler, J.F., P.A. Saunders, and T.H. Zabel, Portable Faraday cup for nonvacuum proton beams. 
IBM Journal of Research and Development, 1996. 40(1): p. 73-76. 
73. Srinivasan, G., Modeling the cosmic-ray-induced soft-error rate in integrated circuits: an 
overview. IBM Journal of Research and Development, 1996. 40(1): p. 77-89. 
74. Tang, H.H., Nuclear physics of cosmic ray interaction with semiconductor materials: Particle-
induced soft errors from a physicist's perspective. IBM Journal of Research and Development, 
1996. 40(1): p. 91-108. 
75. Murley, P.C. and G. Srinivasan, Soft-error Monte Carlo modeling program, SEMM. IBM 
Journal of Research and Development, 1996. 40(1): p. 109-118. 
76. Freeman, L.B., Critical charge calculations for a bipolar SRAM array. IBM Journal of Research 
and Development, 1996. 40(1): p. 119-129. 
77. Normand, E., Single-event effects in avionics. IEEE Transactions on Nuclear Science, 1996. 
43(2): p. 461-474. 
78. O'Gorman, T.J., The effect of cosmic rays on the soft error rate of a DRAM at ground level. 
IEEE Transactions on Electron Devices, 1994. 41(4): p. 553-557. 
79. Tosaka, Y., et al. Impact of cosmic ray neutron induced soft errors on advanced submicron 
CMOS circuits. in 1996 Symposium on VLSI Technology. Digest of Technical Papers. 1996. 
IEEE. 
80. Ziegler, J.F., Terrestrial cosmic ray intensities. IBM Journal of Research and Development, 
1998. 42(1): p. 117-140. 
81. Stassinopoulos, E. and J.P. Raymond, The space radiation environment for electronics. 
Proceedings of the IEEE, 1988. 76(11): p. 1423-1442. 
82. Schwank, J.R., Basic mechanisms of radiation effects in the natural space radiation 
environment. 1994, Sandia National Labs., Albuquerque, NM (United States). 
83. Khodachenko, M.L., et al., Coronal mass ejection (CME) activity of low mass M stars as an 
important factor for the habitability of terrestrial exoplanets. I. CME impact on expected 
magnetospheres of Earth-like exoplanets in close-in habitable zones. Astrobiology, 2007. 7(1): 
p. 167-184. 
84. Riebeek, H., Catalog of Earth Satellite Orbits. 2009. 
85. Heynderickx, D., Comparison between methods to compensate for the secular motion of the 
South Atlantic Anomaly. Radiation Measurements, 1996. 26(3): p. 369-373. 
86. Townsend, L.W., J.W. Wilson, and J.E. Nealy, Space radiation shielding strategies and 
requirements for deep space missions. SAE Transactions, 1989: p. 326-335. 
87. Gussenhoven, M., et al., APEXRAD: low altitude orbit dose as a function of inclination, 
magnetic activity and solar cycle. IEEE Transactions on Nuclear Science, 1997. 44(6): p. 2161-
2168. 
88. Fleetwood, D.M., et al., Accounting for time-dependent effects on CMOS total-dose response 
in space environments. Radiation Physics and Chemistry, 1994. 43(1-2): p. 129-138. 
89. Underwood, C.I. The single-event-effect behaviour of commercial-off-the-shelf memory 
devices. A decade in low-Earth orbit. in Radiation and Its Effects on Components and Systems, 
1997. RADECS 97. Fourth European Conference on. 1997. IEEE. 
90. Dodds, N.A., et al., The contribution of low-energy protons to the total on-orbit SEU rate. IEEE 
Transactions on Nuclear Science, 2015. 62(6): p. 2440-2451. 
91. Ginosar, R., Survey of processors for space. Data Systems in Aerospace (DASIA). Eurospace, 
2012. 
92. Image Processing System for ISR Applications: IPC5000. 2011. 
93. Pignol, M.P. DMT and DT2: two fault-tolerant architectures developed by CNES for COTS-
based spacecraft supercomputers. in null. 2006. IEEE. 
94. Hamming, R.W., Error detecting and error correcting codes. Bell System Technical Journal, 
1950. 29(2): p. 147-160. 
95. Shannon, C.E., A mathematical theory of communication. ACM SIGMOBILE Mobile 
Computing and Communications Review, 2001. 5(1): p. 3-55. 
96. Golay, M.J.E., Notes on Digital Coding. Proc. IRE, 1949. 37. 
97. Garello, R., P. Pierleoni, and S. Benedetto, Computing the free distance of turbo codes and 
serially concatenated codes with interleavers: Algorithms and applications. IEEE Journal on 
Selected Areas in Communications, 2001. 19(5): p. 800-812. 
98. Massey, J.L. The how and why of channel coding. in Proc. Int. Zurich Seminar. 1984. 
99. McEliece, R., The theory of information and coding. Vol. 3. 2002: Cambridge University Press. 
100. Moon, T.K., Error Correction Coding. 2005: John Wiley and Sons. 
101. Morelos-Zaragoza, R.H., The Art of Error Correcting Coding. 2002, John Wiley & Sons. 
102. Rao, T.R. and E. Fujiwara, Error control coding for computer systems. Prentice-Hall Inc., 1989. 
103. Hodgart, M., Efficient coding and error monitoring for spacecraft digital memory. International 
Journal of Electronics, 1992. 73(1): p. 1-36. 
104. Morelos-Zaragoza, R.H., The art of error correcting coding. 2006: John Wiley & Sons. 
105. Walters, J.P., et al. Software-based fault tolerance for the Maestro many-core processor. in 
Aerospace Conference, 2011 IEEE. 2011. IEEE. 
106. Tilera, Tile Processor User Architecture Manual. November 2011. 
107. Avizienis, A., The N-Version Approach to Fault-Tolerant Software. IEEE Transactions on 
Software Engineering, 1985. 11(12): p. 1491-1501. 
108. Mushtaq, H., Z. Al-Ars, and K. Bertels. Fault tolerance on multicore processors using 
deterministic multithreading. in 2013 8th IEEE Design and Test Symposium. 2013. IEEE. 
109. Ungsunan, P.D., et al. Improving multi-core system dependability with asymmetrically reliable 
cores. in Complex, Intelligent and Software Intensive Systems, 2009. CISIS'09. International 
Conference on. 2009. IEEE. 
110. Bienia, C., et al. The PARSEC benchmark suite: characterization and architectural implications. 
in Proceedings of the 17th international conference on Parallel architectures and compilation 
techniques. 2008. ACM. 
111. Woo, S.C., et al. The SPLASH-2 programs: Characterization and methodological 
considerations. in ACM SIGARCH Computer Architecture News. 1995. ACM. 
112. Mushtaq, H., Z. Al-Ars, and K. Bertels. A user-level library for fault tolerance on shared 
memory multicore systems. in Design and Diagnostics of Electronic Circuits & Systems 
(DDECS), 2012 IEEE 15th International Symposium on. 2012. IEEE. 
113. Skitsas, M.A., C.A. Nicopoulos, and M.K. Michael, Exploring Check-Pointing and Rollback 
Recovery Under Selective SBST in Chip Multi-Processors. 
114. Linux-CR. 2010; Available from: http://ckpt.wiki.kernel.org. 
115. Lee, D., et al., Respec: efficient online multiprocessor replay via speculation and external 
determinism. ACM SIGARCH Computer Architecture News, 2010. 38(1): p. 77-90. 
116. Mitropoulou, K., V. Porpodas, and M. Cintra. DRIFT: Decoupled compiler-based instruction-
level fault-tolerance. in International Workshop on Languages and Compilers for Parallel 
Computing. 2013. Springer. 
117. Fritts, J.E., F.W. Steiling, and J.A. Tucek. Mediabench II video: expediting the next generation 
of video systems research. in Electronic Imaging 2005. 2005. International Society for Optics 
and Photonics. 
118. Henning, J.L., SPEC CPU2000: Measuring CPU performance in the new millennium. 
Computer, 2000. 33(7): p. 28-35. 
119. Piotrowski, A., Automatic installation of software-based fault tolerance algorithms in programs 
generated by GCC compiler. International Journal of Microelectronics and Computer Science, 
2010. 1(3): p. 263-268. 
120. Feng, S., et al. Shoestring: probabilistic soft error reliability on the cheap. in ACM SIGARCH 
Computer Architecture News. 2010. ACM. 
121. Yu, J. and M.J. Garzaran. Compiler optimizations for fault tolerance software checking. in 
Proceedings of the 16th International Conference on Parallel Architecture and Compilation 
Techniques. 2007. IEEE Computer Society. 
122. Effort, U.O. The OpenIMPACT IA-64 Compiler. Available from: http://gelato.uiuc.edu/. 
123. Zhang, Y., et al., DAFT: decoupled acyclic fault tolerance. International Journal of Parallel 
Programming, 2012. 40(1): p. 118-140. 
124. Wang, C., et al. Compiler-managed software-based redundant multi-threading for transient fault 
detection. in Proceedings of the International Symposium on Code Generation and 
Optimization. 2007. IEEE Computer Society. 
125. Mitropoulou, K., Performance optimizations for compiler-based error detection. 2015. 
126. GCC. GNU Compiler Collection. Available from: http://gcc.gnu.org. 
127. Jacobs, A.M., Reconfigurable fault tolerance for space systems. 2013: University of Florida. 
128. Azarpeyvand, A., et al. Instruction reliability analysis for embedded processors. in Design and 
Diagnostics of Electronic Circuits and Systems (DDECS), 2010 IEEE 13th International 
Symposium on. 2010. IEEE. 
129. Mukherjee, S.S., J. Emer, and S.K. Reinhardt. The soft error problem: An architectural 
perspective. in High-Performance Computer Architecture, 2005. HPCA-11. 11th International 
Symposium on. 2005. IEEE. 
130. Naithani, A., S. Eyerman, and L. Eeckhout, Optimizing Soft Error Reliability Through 
Scheduling on Heterogeneous Multicore Processors. IEEE Transactions on Computers, 2018. 
67(6): p. 830-846. 
131. Carlson, T.E., et al., An evaluation of high-level mechanistic core models. ACM Transactions 
on Architecture and Code Optimization (TACO), 2014. 11(3): p. 28. 
132. Yang, Z., et al. Phase-driven learning-based dynamic reliability management for multi-core 
processors. in Proceedings of the 54th Annual Design Automation Conference 2017. 2017. 
ACM. 
133. Yang, C. and A. Orailoglu. Processor reliability enhancement through compiler-directed 
register file peak temperature reduction. in Dependable Systems & Networks, 2009. DSN'09. 
IEEE/IFIP International Conference on. 2009. IEEE. 
134. Sartor, A.L., S. Wong, and A.C. Beck. Adaptive ilp control to increase fault tolerance for vliw 
processors. in Application-specific Systems, Architectures and Processors (ASAP), 2016 IEEE 
27th International Conference on. 2016. IEEE. 
135. Lin, S.-Z. and P.-S. Chen. A SIMD-based software fault tolerance for ARM processors. in 
Applied System Innovation (ICASI), 2017 International Conference on. 2017. IEEE. 
136. Rodrigues, G.S., et al., Analyzing the Impact of Fault-Tolerance Methods in ARM Processors 
Under Soft Errors Running Linux and Parallelization APIs. IEEE Transactions on Nuclear 
Science, 2017. 64(8): p. 2196-2203. 
137. Butenhof, D.R., Programming with POSIX threads. 1997: Addison-Wesley Professional. 
138. de Melo, A.C. The new Linux ‘perf’ tools. in Slides from Linux Kongress. 2010. 
139. Project, L. Writing an LLVM Pass. 2018; Available from: 
http://llvm.org/docs/WritingAnLLVMPass.html. 
140. LLVM. LLVM Language Reference Manual. 2018; Available from: 
https://llvm.org/docs/LangRef.html. 
141. Shirvani, P.P., N.R. Saxena, and E.J. McCluskey, Software-implemented EDAC protection 
against SEUs. IEEE Transactions on Reliability, 2000. 49(3): p. 273-284. 
142. Leveugle, R., et al. Statistical fault injection: Quantified error and confidence. in Proceedings 
of the Conference on Design, Automation and Test in Europe. 2009. European Design and 
Automation Association. 
143. Al-Kofahi, K.A., Reliability analysis of triple modular redundancy system with spare. 1993. 
144. Project, L. LLVM’s Analysis and Transform Passes. 2018; Available from: 
https://llvm.org/docs/Passes.html. 
145. Sampson, A. LLVM for Grad Students. 2015; Available from: 
http://adriansampson.net/blog/llvm.html. 
146. Project, L. LLVM Programmer’s Manual. 2016; Available from: 
http://llvm.org/docs/ProgrammersManual.html. 
147. LLVM. LLVM Cloning class. 2016; Available from: 
http://llvm.org/docs/doxygen/html/Cloning_8h.html. 
148. LLVM. llvm IRBuilder. 2016; Available from: 
http://llvm.org/docs/doxygen/html/classllvm_1_1IRBuilder.html. 
149. perf: Linux profiling with performance counters. 2015; Available from: 
https://perf.wiki.kernel.org/index.php/Main_Page. 
150. Developers, V. Valgrind. 2015; Available from: http://valgrind.org/. 
151. Developers, V. Callgrind: a call-graph generating cache and branch prediction profiler. 2015; 
Available from: http://valgrind.org/docs/manual/cl-manual.html. 
152. Developers, V. Massif: a heap profiler. 2015; Available from: 
http://valgrind.org/docs/manual/ms-manual.html. 
153. Consortium, M., MediaBench II benchmark. http://mathstat.slu.edu/~fritts/mediabench/, 2015: 
p. 04-15. 
154. Giraldo, A., A. Paccagnella, and A. Minzoni, Aspect ratio calculation in n-channel MOSFETs 
with a gate-enclosed layout. Solid-State Electronics, 2000. 44(6): p. 981-989. 
155. Giraldo, A., Evaluation of deep submicron technologies with radiation tolerant layout for 
electronics in LHC environments. Ph. D. Thesis at the University of Padova, 1998. 
156. Anelli, G., et al., Radiation tolerant VLSI circuits in standard deep submicron CMOS 
technologies for the LHC experiments: practical design aspects. IEEE Transactions on Nuclear 
Science, 1999. 46(6): p. 1690-1696. 
157. Chen, L. and N.G. Durdle, Radiation tolerant design with 0.18-micron CMOS technology. 
2005, Alberta U. 
158. Rockett, L., An SEU-hardened CMOS data latch design. IEEE Transactions on Nuclear 
Science, 1988. 35(6): p. 1682-1687. 
159. Gambles, J.W., K.J. Hass, and S.R. Whitaker. Radiation hardness of ultra low power CMOS 
VLSI. in 11th NASA Symposium on VLSI Design. 2003. 
160. Whitaker, S., J. Canaris, and K. Liu, SEU hardened memory cells for a CCSDS Reed-Solomon 
encoder. IEEE Transactions on Nuclear Science, 1991. 38(6): p. 1471-1477. 
161. Liu, M.N. and S. Whitaker, Low power SEU immune CMOS memory circuits. IEEE 
Transactions on Nuclear Science, 1992. 39(6): p. 1679-1684. 
162. Canaris, J., An SEU immune logic family. 1991. 
163. Bessot, D. and R. Velazco. Design of SEU-hardened CMOS memory cells: the HIT cell. in 
Radiation and its Effects on Components and Systems, 1993., RADECS 93., Second European 
Conference on. 1993. IEEE. 
164. Calin, T., M. Nicolaidis, and R. Velazco, Upset hardened memory design for submicron CMOS 
technology. IEEE Transactions on Nuclear Science, 1996. 43(6): p. 2874-2878. 
165. Haghi, M. and J. Draper. The 90 nm Double-DICE storage element to reduce Single-Event 
upsets. in Circuits and Systems, 2009. MWSCAS'09. 52nd IEEE International Midwest 
Symposium on. 2009. IEEE. 
166. Knudsen, J.E. and L.T. Clark, An area and power efficient radiation hardened by design flip-
flop. IEEE Transactions on Nuclear Science, 2006. 53(6): p. 3392-3399. 
167. Mavis, D.G. and P.H. Eaton. Soft error rate mitigation techniques for modern microcircuits. in 
Reliability Physics Symposium Proceedings, 2002. 40th Annual. 2002. IEEE. 
168. Mavis, D.G. and P.H. Eaton, Temporally redundant latch for preventing single event disruptions 
in sequential integrated circuits. 2000, Google Patents. 
169. Jagannathan, S., et al., Single-event tolerant flip-flop design in 40-nm bulk CMOS technology. 
IEEE Transactions on Nuclear Science, 2011. 58(6): p. 3033-3037. 
170. Zhou, Q. and K. Mohanram, Gate sizing to radiation harden combinational logic. IEEE 
Transactions on Computer-Aided Design of Integrated Circuits and Systems, 2006. 25(1): p. 
155-166. 
171. Kastensmidt, F.L., et al. Transistor sizing and folding techniques for radiation hardening. in 
Radiation and Its Effects on Components and Systems (RADECS), 2009 European Conference 
on. 2009. IEEE. 
172. Mavis, D.G., P.H. Eaton, and M. Sibley. SEE characterization and mitigation in ultra-deep 
submicron technologies. in IC Design and Technology, 2009. ICICDT'09. IEEE International 
Conference on. 2009. IEEE. 
173. Blum, D.R. and J.G. Delgado-Frias. Hardened by design techniques for implementing multiple-
bit upset tolerant static memories. in Circuits and Systems, 2007. ISCAS 2007. IEEE 
International Symposium on. 2007. IEEE. 
174. Black, J.D., P.E. Dodd, and K.M. Warren, Physics of multiple-node charge collection and 
impacts on single-event characterization and soft error rate prediction. IEEE Transactions on 
Nuclear Science, 2013. 60(3): p. 1836-1851. 
175. Lin, S., Y.-B. Kim, and F. Lombardi, A 11-transistor nanoscale CMOS memory cell for 
hardening to soft errors. IEEE Transactions on Very Large Scale Integration Systems, 2011. 
19(5): p. 900. 
176. Lin, S., Y.-B. Kim, and F. Lombardi, Analysis and design of nanoscale CMOS storage elements 
for single-event hardening with multiple-node upset. IEEE Transactions on Device and 
Materials Reliability, 2012. 12(1): p. 68-77. 
177. Pescovsky, A., et al. SEU hardening: Incorporating an extreme low power bitcell design 
(SHIELD). in SOI-3D-Subthreshold Microelectronics Technology Unified Conference (S3S), 
2014 IEEE. 2014. IEEE. 
178. Atias, L., A. Teman, and A. Fish. Single event upset mitigation in low power SRAM design. in 
Electrical & Electronics Engineers in Israel (IEEEI), 2014 IEEE 28th Convention of. 2014. 
IEEE. 
179. Garg, R., et al. A design approach for radiation-hard digital electronics. in Proceedings of the 
43rd annual Design Automation Conference. 2006. ACM. 
180. Gaisler, J., The LEON processor user’s manual. Gaisler Research, 2001. 
181. Gaisler, J. A portable and fault-tolerant microprocessor based on the SPARC v8 architecture. in 
Dependable Systems and Networks, 2002. DSN 2002. Proceedings. International Conference 
on. 2002. IEEE. 
182. Corporation, A., TSC695 SPARC V7 Processor (ERC32) Development Tools. 2005: Atmel. 
183. Corporation, A., TSC695F SPARC 32-bit Space Processor User Manual. 2003: Atmel  
184. Amorosi, L. and D. Pascucci, Embedded multi-core systems for mixed criticality applications 
in dynamic and changeable real-time environments. 2014. 
185. Pouponnot, A.L. Strategic use of SEE mitigation techniques for the development of the ESA 
microprocessors: past, present, and future. in On-Line Testing Symposium, 2005. IOLTS 2005. 
11th IEEE International. 2005. IEEE. 
186. Weigand, R. ESA Microprocessor Development Status and Roadmap. in DASIA 2011-Data 
Systems In Aerospace. 2011. 
187. Gaisler, J., The Leon2 IEEE-1754 (SPARC V8) Processor. Gaisler Research, 2003. 
188. T. Helfers, E.L., P. Rastetter; O. Ried, Astrium GmbH, Dr. O. Emam, F. Guyon Multi-
DSP/Micro-Processor Architecture (MDPA). 2007, ESA/ESTEC, Noordwijk: ESA Workshop 
on Avionics Data, Control and Software Systems (ADCSS). 
189. AB, C.G. LEON3FT-RTAX Fault-tolerant Processor. 2005; Available from: 
https://www.gaisler.com/index.php/products/components/leon3ft-rtax. 
190. Sinander, P., The COLE System-On-Chip. 2007, ESTEC Noordwijk: SAAB SPACE. 
191. Roland Weigand, F.K., Marc Souyr, Jean-Francois Coldefy, SCOC3 (Spacecraft Controller On 
Chip). 2007, ESTEC, Noordwijk. 
192. ETCA, A.A.S., LEON3-Fault Tolerant Design Against Radiation Effects ASIC. 2007. 
193. Ferreira, A.P., et al. Using PCM in next-generation embedded space applications. in Real-Time 
and Embedded Technology and Applications Symposium (RTAS), 2010 16th IEEE. 2010. 
IEEE. 
194. Pirovano, A., et al. Scaling analysis of phase-change memory technology. in Electron Devices 
Meeting, 2003. IEDM'03 Technical Digest. IEEE International. 2003. IEEE. 
195. NXP. QorIQ® PROCESSING PLATFORMS: 64-BIT MULTICORE SOCS. 2018; Available 
from: https://www.nxp.com/products/processors-and-microcontrollers/power-architecture-
processors/qoriq-platforms:QORIQ_HOME?&&tid=vanqoriq. 
196. NXP, Freescale Semiconductor Product Selector Guides. 2009. 
197. NXP, Power Architecture™ Technology Primer. 2007, Denver: Freescale Semiconductor 
Literature Distribution Center  
198. Berger, R., et al. Quad-core radiation-hardened system-on-chip power architecture processor. 
in Aerospace Conference, 2015 IEEE. 2015. IEEE. 
199. Bear, M. A Radiation-Hardened ASIC Library in 45nm SOI for Next-Generation High-
Performance Space Computing. in GOMACTech 2014. 2013. 
200. Hillman, R., et al. Space processor radiation mitigation and validation techniques for an 1,800 
MIPS processor board. in Radiation and Its Effects on Components and Systems, 2003. 
RADECS 2003. Proceedings of the 7th European Conference on. 2003. IEEE. 
201. PowerPC, I., 750FX RISC Microprocessor Datasheet. 2003, IBM Corp., Armonk, NY. 
202. Lintz, J., et al. Single event effects hardening and characterization of Honeywell's RHPPC 
integrated circuit. in Radiation Effects Data Workshop, 2003. IEEE. 2003. IEEE. 
203. Gary, S., et al., PowerPC 603, a microprocessor for portable computers. IEEE Design & Test 
of Computers, 1994(4): p. 14-23. 
204. HARRIS. RH3000. 2018; Available from: https://www.harris.com/. 
205. Sias Jr, F. and J. Tulenko, Update on radiation-hardened microcomputers for robotics and 
teleoperated systems. 1993, Clemson Univ. 
206. Berger, R., et al., RAD750 TM SPACEWIRE-ENABLED FLIGHT COMPUTER FOR 
LUNAR RECONNAISSANCE ORBITER. 2007. 
207. Berger, R.W., et al. The RAD750/sup TM/-a radiation hardened PowerPC/sup TM/processor 
for high performance spaceborne applications. in Aerospace Conference, 2001, IEEE 
Proceedings. 2001. IEEE. 
208. Wicker, S.B., Error control systems for digital communication and storage. Vol. 1. 1995: 
Prentice hall Englewood Cliffs. 
209. Yamamoto, H. and K. Itoh, Viterbi decoding algorithm for convolutional codes with repeat 
request. IEEE Transactions on Information Theory, 1980. 26(5): p. 540-547. 
210. Slepian, D., A class of binary signaling alphabets. Bell System Technical Journal, 1956. 35(1): 
p. 203-234. 
211. Raaphorst, S., Reed-muller codes. Carleton University, May, 2003. 9. 
212. Vanstone, S.A. and P.C. Van Oorschot, An introduction to error correcting codes with 
applications. Vol. 71. 2013: Springer Science & Business Media. 
213. Rains, E.M. and N.J. Sloane, Self-dual codes. arXiv preprint math/0208001, 2002. 
214. Lin, S., et al., Trellises and trellis-based decoding algorithms for linear block codes. Vol. 443. 
2012: Springer Science & Business Media. 
215. Kasami, T., S. Lin, and W. Peterson, Polynomial codes. IEEE Transactions on Information 
Theory, 1968. 14(6): p. 807-814. 
216. Pless, V., FJ MacWilliams and NJA Sloane, The theory of error-correcting codes. I and II. 
Bulletin of the American Mathematical Society, 1978. 84(6): p. 1356-1359. 
217. Costello, D.J. and S. Lin, Error Control Coding: Fundamentals and Applications. 1982, Prentice 
Hall. 
218. MacWilliams, F.J. and N.J.A. Sloane, The theory of error-correcting codes. 1977: Elsevier. 
219. Peterson, W.W., et al., Error-correcting codes. 1972: MIT press. 
220. Meggitt, J.E., Error correcting codes for correcting bursts of errors. IBM Journal of Research 
and Development, 1960. 4(3): p. 329-334. 
221. Kasami, T., A decoding procedure for multiple-error-correcting cyclic codes. IEEE 
Transactions on Information Theory, 1964. 10(2): p. 134-138. 
222. Alagoz, B.B., Hierarchical Triple-Modular Redundancy (H-TMR) Network For Digital 
Systems. arXiv preprint arXiv:0902.0241, 2009. 
 
 
 
 
 
 
 
 
 
 
 
  
A. APPENDIX A-RADIATION HARDENING BY DESIGN  
This section introduces the design-level resilience techniques used to mitigate radiation effects.
Hardening can be achieved in two ways: by layout modifications or by circuit design modifications.
A.1 Hardness by Layout Design 
Layout design can be modified in several ways. The Enclosed Layout Transistor (ELT) [154, 155] is a
technique increasingly used to mitigate radiation effects in space and nuclear environments [156].
Other methods include ringed-source and ringed inter-digitated designs. All three techniques focus
on preventing leakage current between source and drain.
The guard ring is a technique used to prevent latch-up in CMOS, which is caused by the formation of
a parasitic thyristor. The parasitic thyristor creates a low-resistance path between VDD and VSS,
leading to a high current flow that can damage the device. A high-resistance path between the two
voltage supplies reduces the current and solves the problem. Reducing the gain of the parasitic
transistors can also reduce the current. This can be achieved by increasing the distance between
adjacent MOSFETs, at the cost of a lower circuit density. According to [157], a guard ring should
tolerate LET > 90 MeV·cm²/mg.
A.2 Hardness by Circuit Design 
Circuit-level hardening mitigates SEUs with techniques such as adding redundant transistors to the
circuit logic or sharing charge between devices. Adding redundancy to the logic of COTS technologies
is the main approach used to prevent SEUs.
The Rockett memory cell [158] is a radiation-hardened latch. The cell has six P-channel
transistors surrounding the six transistors of a conventional memory element, leading to 16
transistors with an input buffer. The Whitaker memory cell [159, 160] is based on the Rockett memory
cell, but uses 12 transistors instead of 16. The improved Whitaker memory cell by Liu [161] refines
the previous cell using a 14-transistor arrangement. The RS-latch-based memory cell by J. Canaris
[162] is designed with AND-NOR or OR-NAND gates. This technique addresses the downsides of the
earlier memory cell designs, such as the number of transistors, the recovery time, and the static
power usage. The HIT memory cell [163], with its 12-transistor design, aims to improve the static
power usage and the recovery time after an upset. Unlike earlier designs, HIT memory cells
eliminate the direct path from VDD to VSS.
DICE (Dual Interlocked Storage Cell) [164] is built from two cross-coupled inverter latches.
Feedback to each cell comes from a preceding cell. In this design, the nMOS and pMOS transistors
are connected to different inputs. The downside of this technology is that it does not mitigate
concurrent upsets at any two nodes [164]. An upgrade to the traditional DICE was introduced in
[165], consisting of two interleaved DICE cells to enhance resilience to SEUs. Attempts to tackle
simultaneous particle strikes led to the temporal DICE flip-flop design by J. E. Knudsen and
L. T. Clark [166], which uses a majority voter with a feedback mechanism, enabling it to mitigate
both SEUs and SETs [167, 168].
The Quatro latch [169] uses an eight-transistor storage cell based on the basic DICE design, but
with a different interconnection scheme. The new design improves multiple parameters such as power,
area, and maximum Q delay [169].
Gate sizing [170] is based on increasing the capacitance at the output of a gate, which decreases
its susceptibility to particle strikes. More about this technique can be found in [171, 172].
Radiation-hardened memory cells aim to improve SEU tolerance and reduce power consumption. Further
motivations for this technology are CMOS technology scaling and the intensity of angled particle
strikes [173, 174]. One proposed solution is the 11T SRAM cell [175]; according to its authors, the
new design improves on DICE in multiple aspects such as area and power-delay product. The 13T cell
[176] was proposed as an improvement on the 11T SRAM cell to counter Single Event Multiple Effects
(SEMEs). It has a lower delay, but higher power consumption and a larger area, compared to DICE.
The SHIELD design was proposed in [177] to mitigate SEUs by integrating a low-power bit-cell
design. Another mitigating design was proposed in [178] and compared against the DICE design [164].
The work in [41] aimed to mitigate SEMEs better and with lower overhead; its design is similar to
[176].
The device-based SEU clamping circuit raises the gate’s threshold compared to a conventional gate
in the circuit [179].
A.3 Radiation Hardened by Design (RHBD) Processors  
This category includes processors that are manufactured as ASICs and designed on COTS CMOS
processes. Radiation effects are mitigated at multiple levels: layout, circuit, logic and
architecture. This family offers high radiation tolerance at medium cost. RHBD processors cost more
than COTS processors because of the high cost of qualification and the low production volumes;
however, because they use commercial components, they are cheaper than processors hardened by
design layout. One more advantage is that RHBD processors offer high integration (I/O controllers
for the space domain). The downside of RHBD processors is their latency compared to COTS processing
architectures, because of their ASIC design. Most RHBD processors are based on the European LEON
architecture [180, 181]. Atmel has manufactured the V7 ERC32 and the TSC695FL in France [182, 183].
They can achieve 12 MIPS / 6 MFLOPS with very low power consumption (0.3 W for the core, excluding
I/O). Originally this architecture was based on a three-chip set; it is now a single-chip CPU. This
architecture has been used
on multiple satellites and systems. SSTL has developed the OBC695A V7 SBC [184] using Atmel's TSC695
architecture. A step change from the TSC695FL SPARC V7 processor is the Atmel LEON2 AT697 [185].
This processor is a LEON2 SPARC V8 architecture [186, 187], based on an IP core provided by Aeroflex
Gaisler. It incorporates a processor, a memory interface and a PCI interface, and runs at 70-80 MHz.
Another design based on LEON2 is the MDPA (Multi-DSP/µProcessor Architecture), a SoC made by
Astrium with a 70 MHz clock, I/O controllers, SpaceWire and CAN [188]. A DSP module is integrated
within the SoC, with an encoder and decoder (600 Kbps). Aeroflex Gaisler has implemented a SPARC V8
processor (LEON3) and SoC architecture on the ACTEL RTAX radiation-tolerant FPGA [189]. This work
has advanced the domain considerably, since radiation-hardened FPGAs are more accessible and allow
the design to be customized for the application. However, the ACTEL RTAX FPGA has low performance
(less than 25 MHz) and its parts are ITAR controlled. COLE [190] is a SoC based on LEON2FT and
manufactured by Saab Aerospace; it includes multiple 1553 bus, SpaceWire (up to 200 MHz/160 Mbps)
and CAN interfaces, and shows a high level of integration. Another LEON3-based SoC design is SCOC3
[191], developed by Astrium. This processor includes a large set of I/O cores and was manufactured
on the Atmel RH 0.18 µm process (ATC18RHA).
The LEONDARE LEON3 SoC [192] was manufactured by Thales Alenia Space for the space domain, based on
the DARE RHBD library, with a density of 25 Kgates/mm². It was manufactured on the UMC commercial
0.18 µm process through the Europractice shuttle service. The UT699, manufactured by Aeroflex, is an
upgraded LEON3FT SoC that combines a PCI bus with several on-chip serial I/O cores such as SpW, CAN
and Ethernet. It is fabricated on 0.25 µm CMOS, packaged in a 352-pin CQFP, operates at 66 MHz, and
consumes 5 W. The GR712RC, manufactured by Aeroflex Gaisler and Ramon Chips, is a dual-core LEON3FT
SoC fabricated on TowerJazz 0.18 µm CMOS. It incorporates several I/O cores, with a 100 MHz clock
frequency and less than 2 W power consumption.
A.4 Radiation Hardened Processing Architectures  
Dedicated radiation-hardened processes are used to manufacture processors capable of enduring harsh
radiation environments. These architectures are resilient to radiation effects and can sometimes
achieve high performance. This is possible when a custom design technique is used that mirrors the
design of a high-performance COTS architecture, such as the PowerPC, resulting in straightforward
code compatibility and application to the space domain. This method also provides a high level of
integration, including special I/O controllers for the space domain. This section introduces some
radiation-hardened processing architectures, showing the different hardware mitigation techniques
used against SEEs and the complexity of their circuits.
A.4.1 Using PCM in Next-generation Embedded Space Applications  
Radiation-hardened DRAM is a necessity for embedded systems dedicated to space applications because
of the imminent threat of gamma rays, which cause transient errors. However, such rad-hard memories
are costly and have higher power consumption, leading to shorter battery life or increased weight
for circuits designed for space applications.
A new memory technology named phase-change memory (PCM) has emerged recently [193]. It has great
potential because of its low power consumption, non-volatility, scalability, and radiation
tolerance. These characteristics make PCM an ideal candidate to replace DRAM in space applications
that require radiation immunity. On the other hand, current approaches necessitate changes to the
PCM device’s internal circuitry, the OS and/or the CPU cache memory configuration/interface.
The Phase Change Main Memory Architecture (PMMA) is a new architecture to manage the use of PCM. It
was designed to avoid alterations to commodity PCM circuitry, the OS, and the CPU cache memory
interface. This allows a traditional DRAM main memory to be replaced, plug-in, by one built with
PMMA. PMMA combines several schemes to handle PCM’s constraints, including write latency,
asymmetric read/write overhead, and poor endurance. The work in [193] shows that PMMA improves the
energy-delay by 60% over conventional DRAM main memories.
The architecture of the Memory Manager (MM) is depicted in Figure A-1. It incorporates a request
controller, a request buffer, an In-Flight Buffer (IFB), a PCM controller and an Acceleration and
Endurance Buffer (AEB) DRAM controller. The request controller allocates resources for CPU
interface requests and executes the memory transactions on behalf of the CPU. It controls the AEB
and the PCM devices while presenting the same memory interface to the CPU. The request buffer holds
information about pending requests. It saves the current state of a request, including the
request’s CPU/DRAM/PCM addresses, the size of the package and the resources used to manage the
request (e.g., buffers and tag array entries). The request controller uses the In-Flight Buffer as
short-term data storage. The PCM and AEB controllers are equipped with DMA engines to read and
write data from the DRAM and PCM devices.
Figure A-1 PCM Memory Management Internal Architecture [37] 
In recent years, several solutions have been explored to overcome the scaling limits of Flash
memories. Phase Change Memories (PCM) [194] appear to be one of the most promising candidates to
replace Floating Gate (FG) Flash memories.
A.4.2 Quad-Core Radiation-Hardened System-on-Chip  
The RAD55xx™ system-on-chip (SoC) platform IC is based on the QorIQ [195] from Freescale [196],
with additional features for the space domain. The RAD55xx™ can be customized depending on the
need. It has four 32/64-bit Power Architecture [197] processor cores, three levels of cache, dual
interleaved DDR3 DRAM controllers, data path acceleration architecture (DPAA) hardware
accelerators, a NAND Flash controller, and high I/O throughput based on serializer/deserializer
high-speed links. The RAD55xx [198] uses the radiation-hardened-by-design RH45™ technology [199],
optimized for power and performance.
Figure A-2 illustrates the RAD5545™ microprocessor architecture, which contains dual interleaved
DDR3 ports to compensate for the delay caused by cache misses. The CCBR interfaces are powered off,
as they are internal.
Figure A-2 RAD5545 Quad-Core Microprocessor Personality Block Diagram 
A.4.3 SCS750 Architecture 
The SCS750 [200] was designed for high radiation tolerance without neglecting performance. The
block diagram of the board is shown in Figure A-3, including the fault tolerance technologies in
each block. This architecture can deliver more than 1800 MIPS using the PowerPC 750FX [201], a
processor with an 800 MHz clock, 32 Kbytes of instruction and 32 Kbytes of data on-chip L1 cache,
and 512 Kbytes of on-chip L2 cache. The board also provides 256 Mbytes of Reed-Solomon protected
SDRAM and 4 Mbytes of EEPROM with FEC.
 
 
Figure A-3 Functional Block Diagram for SCS750 [46]
A.4.4 Honeywell's RHPPC Integrated Circuit  
The RHPPC processor [202] is fabricated on a 0.35 µm SIMOX CMOS process with four metal levels,
designated Radiation-hardened CMOS V (RICMOS-V). It is a cell-based ASIC architecture with
personalized “drop-in” designs for the caches, tag memory management unit, and register files. The
smallest transistor gate dimensions are 0.35 µm in length by 0.8 µm in width. To determine the ion
energy loss in SEE testing, an active silicon layer depth of 0.2 µm and a surface layer of 8 µm
above it are used. The RHPPC processor is manufactured on an HX311P die model and packaged in a
255-pin ceramic ball grid array (BGA). In [202], the current RHPPC processor version, denoted
“Pass2a”, was assessed. This version saves more power and has slightly improved SEU mitigation
compared to the earlier “Pass2” version. The RHPPC processor is functionally similar to the
commercial PowerPC 603e™ microprocessor [203]. The following design modifications enable the RHPPC
processor to mitigate SEEs:
• The tristate multiplexer was removed to avoid transfer gates wire-ORed to a single node with a
keeper.
• The dynamic adder was replaced with a static adder, eliminating the associated precharge logic.
• A radiation-hard phase-locked loop (PLL) custom cell was implemented, including extra inputs and
two status outputs (IN_LOCK and LOCK_DETECT).
• The clock regeneration cell was customized for radiation tolerance.
Additional radiation hardening was achieved by substituting Honeywell’s SEU-hardened latches,
registers, and memories for the equivalent functions in the commercial device. The result is a 3.6
million transistor architecture with 146,000 HX3000-series standard SEU-hardened cells and 10
custom SEU-hardened drop-in arrays. A block diagram of the RHPPC processor is presented in Figure
A-4.
Figure A-4 RHPPC Processor Functional Block Diagram 
A.5 Hardened by Design Layout  
Other radiation-hardened processors include the Harris RH3000 32-bit Rad-Hard Controller [204],
based on MIPS 3000 technology. The device was a rad-hard chipset (CPU and FPU) manufactured in the
1990s in the USA. It had a 20 MHz clock, delivered 6 MFLOPS, and included multiple peripherals such
as RH memory and I/O controllers.
The RH32 is a rad-hard MIPS 3000 architecture manufactured by Honeywell. IBM fabricated the
RAD6000, based on the IBM System 6000 architecture, in the late 1990s [205]; it was launched on the
Mars Rover. An upgrade to the BAE Systems RAD6000 is the RAD750 [206, 207], a radiation-hardened
processor with the PowerPC 750 design. This architecture was delivered in 2001. The RAD750
incorporates 10.4 million transistors on a 250 nm process, has a die area of 130 mm², and operates
at 133–166 MHz with up to 300 MIPS.
Radiation-hardened processors are more resilient to radiation effects; however, these architectures
are costly, have a limited market, and still lag behind COTS in terms of performance and power
consumption. Typically, RH processors are based on 150 nm CMOS, whereas COTS processors are at
28 nm, which is six process generations ahead [5]. Another downside is that radiation-hardened
processors are limited to mitigating SEU rates of approximately 10⁻¹¹ errors/bit-day, whereas, as
chips scale down and include more memory and flip-flops, the SEU rate approaches 10⁻⁵ errors/bit-day.
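To give a sense of scale for the two rates quoted above, the expected number of upsets per day is the
per-bit rate multiplied by the number of bits. The sketch below assumes, purely for illustration, a
256 Mbyte memory (the SDRAM size quoted for the SCS750 in A.4.3) and independent, identically
distributed upsets; the function name is hypothetical.

```python
def expected_upsets_per_day(rate_per_bit_day: float, n_bits: int) -> float:
    """Expected upsets per day, assuming every bit upsets independently."""
    return rate_per_bit_day * n_bits


# Illustrative assumption: 256 Mbytes of memory = 256 * 2**20 * 8 bits.
N_BITS = 256 * 2**20 * 8

hardened = expected_upsets_per_day(1e-11, N_BITS)
scaled = expected_upsets_per_day(1e-5, N_BITS)

print(f"bits: {N_BITS}")
print(f"1e-11 errors/bit-day -> {hardened:.4f} upsets/day")
print(f"1e-5  errors/bit-day -> {scaled:.1f} upsets/day")
```

Under these assumptions the first rate corresponds to roughly one upset every few weeks, while the
second corresponds to tens of thousands of upsets per day for the same memory size.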
 
 
B. APPENDIX B -ERROR CORRECTING CODING 
B.1 Basic Concepts  
All ECCs work on the principle of adding redundancy to the original information, giving them the
ability to correct errors that occur during transmission or storage. The redundant check bits are
attached to the original information to obtain a coded sequence (codeword). Figure B-1 illustrates
a codeword generated by systematic encoding, meaning that the original information occupies the
leftmost k positions of the codeword. The n − k redundant check bits are obtained by applying an
encoding function to the information symbols, and provide the redundancy used for error detection
and/or correction. The set of all code sequences is known as the error correcting code, denoted C.
 
 
 
 
Figure B-1 A Systematic Block Encoding for Error Correction [104]. 
B.1.1 Block Codes and Convolutional Codes 
ECCs are subdivided into two main categories, depending on the way the redundant check bits are
appended to the message: block and convolutional [208]. In the early days of ECC, convolutional
codes were preferred because of the availability of the soft-decision Viterbi decoding algorithm
[209] and the complexity of decoding block codes with soft decisions. However, newer developments
in soft-decision decoding (SDD) algorithms have improved the performance of linear block codes.
Block codes process information on a block-by-block basis, treating each block of information bits
independently; block coding is therefore a memoryless operation, since the codewords are
independent of each other. In contrast, the output of a convolutional encoder depends on the
current and previous inputs or outputs, on a block-by-block or a bit-by-bit basis. Block codes do
have memory, however, when encoding is viewed as a bit-by-bit process within a codeword. The
properties of block codes are introduced next; many of them are common to both types of codes.
Lately, it has become harder to distinguish between block and convolutional codes, especially after
the progress made in understanding the trellis structure of block codes and the tail-biting
structure of some convolutional codes. Block codes are sometimes referred to as “codes with
time-varying trellis structure”, and convolutional codes as “codes with a regular trellis
structure”.
B.1.2 Hamming Distance, Hamming Spheres and Error Correcting Capability  
For simplicity of exposition, block codes are considered [104]. In an error correcting code C,
error correction capability cannot be achieved by transmitting all $2^n$ possible binary vectors of
length n. Instead, C is a subset of the n-dimensional binary vector space $V_2 = \{0, 1\}^n$, with
elements that are as far apart as possible. Consider two vectors
$x_1 = (x_{1,0}, x_{1,1}, \ldots, x_{1,n-1})$ and $x_2 = (x_{2,0}, x_{2,1}, \ldots, x_{2,n-1})$ in
$V_2$. The Hamming distance, denoted $d_H(x_1, x_2)$, is the number of positions in which the
vectors differ,
$$d_H(x_1, x_2) \triangleq \left|\{\, i : x_{1,i} \neq x_{2,i},\ 0 \le i < n \,\}\right| = \sum_{i=0}^{n-1} x_{1,i} \oplus x_{2,i} \qquad \text{(B-1)}$$
 
where $|A|$ denotes the cardinality (the number of elements) of a set A, and $\oplus$ denotes
addition modulo 2 (exclusive-OR).
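The definition in Equation (B-1) can be sketched in a few lines of Python. This is a minimal
illustration, assuming binary vectors are represented as tuples of 0/1 integers; the function name
is hypothetical.

```python
def hamming_distance(x1, x2):
    """Number of positions in which two equal-length binary vectors differ."""
    if len(x1) != len(x2):
        raise ValueError("vectors must have the same length")
    # XOR is 1 exactly where the bits differ, so summing it counts them.
    return sum(a ^ b for a, b in zip(x1, x2))


x1 = (1, 0, 1, 1, 0)
x2 = (0, 0, 1, 0, 1)
print(hamming_distance(x1, x2))  # → 3
```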
The minimum Hamming distance of a given block code C, denoted $d_{\min}$, is defined as the minimum
Hamming distance among all pairs of distinct codewords in C,
$$d_{\min} = \min_{v_1, v_2 \in C} \{\, d_H(v_1, v_2) \mid v_1 \neq v_2 \,\} \qquad \text{(B-2)}$$
 
The parameters of a block code of length n, encoding messages of k bits with minimum Hamming
distance $d_{\min}$, are denoted by the array $(n, k, d_{\min})$, assuming $|C| = 2^k$.
The Hamming space is another name for the binary vector space $V_2$. Let $v$ be a codeword of the
error correcting code C. A Hamming sphere $S_t(v)$ of radius t, centred at $v$, is the set of
vectors in $V_2$ at a distance less than or equal to t from the centre $v$,
$$S_t(v) = \{\, x \in V_2 \mid d_H(x, v) \le t \,\} \qquad \text{(B-3)}$$
The size of (the number of vectors in) $S_t(v)$ is given by
$$|S_t(v)| = \sum_{i=0}^{t} \binom{n}{i} \qquad \text{(B-4)}$$
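Equation (B-4) is straightforward to evaluate numerically. The following is a small sketch using the
standard-library binomial coefficient; the function name and the example parameters are chosen for
illustration only.

```python
from math import comb


def sphere_size(n: int, t: int) -> int:
    """Number of length-n binary vectors within Hamming distance t of a centre."""
    return sum(comb(n, i) for i in range(t + 1))


print(sphere_size(7, 1))   # → 8
print(sphere_size(23, 3))  # → 2048
```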
The error correcting capability of a code C, denoted t, is defined as the largest radius of Hamming
spheres $S_t(v)$ around all codewords $v \in C$ such that the spheres of all distinct pairs
$v_i, v_j \in C$ are disjoint,
$$t = \max_{v_i, v_j \in C} \{\, \ell \mid S_\ell(v_i) \cap S_\ell(v_j) = \emptyset,\ v_i \neq v_j \,\} \qquad \text{(B-5)}$$
In terms of the minimum distance,
$$t = \lfloor (d_{\min} - 1)/2 \rfloor \qquad \text{(B-6)}$$
where $\lfloor x \rfloor$ is the largest integer less than or equal to x.
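As a worked illustration of Equation (B-6), offered as a sketch rather than as part of the original
derivation, the error correcting capability for a few values of the minimum distance is:

```latex
\begin{align*}
d_{\min} = 3:&\quad t = \lfloor (3-1)/2 \rfloor = 1,\\
d_{\min} = 4:&\quad t = \lfloor (4-1)/2 \rfloor = 1,\\
d_{\min} = 7:&\quad t = \lfloor (7-1)/2 \rfloor = 3.
\end{align*}
```

An even minimum distance therefore corrects no more errors than the odd distance just below it. For
the well-known binary (7, 4, 3) Hamming code, Equation (B-4) gives $|S_1(v)| = 1 + 7 = 8$, and
$2^4 \times 8 = 2^7$, so the radius-1 spheres around the 16 codewords are disjoint and together fill
the whole Hamming space.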
B.2 Linear Block Codes 
A good code is obtained by finding a subset of $V_2$ with elements as far apart as possible, which
can be hard. Another problem that arises is the assignment of codewords to information messages.
Linear codes are vector subspaces of $V_2$, which means that encoding can be accomplished by matrix
multiplications. In terms of circuits, encoders can be built using exclusive-OR gates, AND gates
and D flip-flops.
In this section, the binary vector space operations of sum and multiplication are realised by
exclusive-OR (modulo-2 addition) and AND gates, respectively. Table B-1 shows the results of
addition and multiplication of binary elements.
Table B-1 Results of the Binary Elements Addition and Multiplication. 
a    b    a + b    a · b
0    0      0        0
0    1      1        0
1    0      1        0
1    1      0        1
The results of the operations in Table B-1 match the outputs of an exclusive-OR gate and an AND
gate, respectively.
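The correspondence in Table B-1 can be reproduced directly with Python's bitwise operators. This is
a minimal sketch; the two helper names are hypothetical.

```python
def gf2_add(a: int, b: int) -> int:
    """Addition in the binary field: modulo-2 sum, i.e. exclusive-OR."""
    return a ^ b


def gf2_mul(a: int, b: int) -> int:
    """Multiplication in the binary field, i.e. logical AND."""
    return a & b


# Print the rows of the table in the order (a, b, a + b, a . b).
for a in (0, 1):
    for b in (0, 1):
        print(a, b, gf2_add(a, b), gf2_mul(a, b))
```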
B.2.1 Generator and Parity-Check Matrices 
Let C be a binary linear $(n, k, d_{\min})$ code. Since C is a k-dimensional vector subspace, it has
a basis $(v_0, v_1, \ldots, v_{k-1})$. As a result, any codeword $v \in C$ can be represented as a
linear combination of the basis elements [104]:
$$v = u_0 v_0 + u_1 v_1 + \cdots + u_{k-1} v_{k-1} \qquad \text{(B-7)}$$
where $u_i \in \{0, 1\}$, $0 \le i < k$. Equation (B-7) can also be written in terms of the generator
matrix G and a message vector $u = (u_0, u_1, \ldots, u_{k-1})$, as in Equation (B-8):
$$v = uG \qquad \text{(B-8)}$$
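where
$$G = \begin{pmatrix} v_0 \\ v_1 \\ \vdots \\ v_{k-1} \end{pmatrix} = \begin{pmatrix} v_{0,0} & v_{0,1} & \cdots & v_{0,n-1} \\ v_{1,0} & v_{1,1} & \cdots & v_{1,n-1} \\ \vdots & \vdots & \ddots & \vdots \\ v_{k-1,0} & v_{k-1,1} & \cdots & v_{k-1,n-1} \end{pmatrix} \qquad \text{(B-9)}$$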
Since C is a k-dimensional vector space in $V_2$, there is an (n − k)-dimensional dual space
$C^\perp$, generated by the rows of a matrix H. The matrix H is called the parity-check matrix, and
satisfies $GH^T = 0$, where $H^T$ is the transpose of H. For any codeword $v \in C$,
$$vH^T = 0 \qquad \text{(B-10)}$$
Equation (B-10) is of crucial importance when decoding linear codes.
The linear code $C^\perp$ generated by H is a binary linear $(n, n-k, d_{\min}^\perp)$ code, called
the dual code of C.
B.2.2 The Weight is the Distance 
A key feature of linear codes is that computing the minimum distance of the code reduces to
computing the minimum Hamming weight of its nonzero codewords. This feature is demonstrated in this
section. The Hamming weight $wt_H(x)$ of a vector $x = (x_0, x_1, \ldots, x_{n-1}) \in V_2$ is the
number of nonzero elements in $x$. It can be expressed as the sum
$$wt_H(x) = \sum_{i=0}^{n-1} x_i \qquad \text{(B-11)}$$
From the definition of the Hamming distance, $wt_H(x) = d_H(x, 0)$. For a binary linear code C,
$$d_H(v_1, v_2) = d_H(v_1 + v_2, 0) = wt_H(v_1 + v_2) = wt_H(v_3) \qquad \text{(B-12)}$$
where, by linearity, $v_1 + v_2 = v_3 \in C$. Consequently, the minimum distance of C can be found
by computing the minimum Hamming weight among its $2^k - 1$ nonzero codewords. This is a great
improvement over the earlier method, where the search is done over all pairs of codewords.
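The equivalence between the pairwise search and the weight search can be checked on a small example.
The sketch below assumes a hypothetical binary (6, 3) code chosen only for illustration; it is not a
code taken from this thesis.

```python
from itertools import combinations, product

# Hypothetical example: generator matrix of a small binary (6, 3) code.
G = [
    (1, 0, 0, 0, 1, 1),
    (0, 1, 0, 1, 0, 1),
    (0, 0, 1, 1, 1, 0),
]


def encode(u, G):
    """Codeword v = uG over the binary field."""
    return tuple(sum(ui & gi for ui, gi in zip(u, col)) % 2 for col in zip(*G))


def weight(x):
    """Hamming weight: number of nonzero elements."""
    return sum(x)


def distance(x, y):
    """Hamming distance: number of positions where x and y differ."""
    return sum(a ^ b for a, b in zip(x, y))


codewords = [encode(u, G) for u in product((0, 1), repeat=len(G))]

# Pairwise search: 2^k (2^k - 1) / 2 comparisons.
d_pairs = min(distance(a, b) for a, b in combinations(codewords, 2))
# Weight search: only the 2^k - 1 nonzero codewords.
d_weight = min(weight(v) for v in codewords if any(v))

print(d_pairs, d_weight)  # → 3 3
```

For k = 3 the saving is small (28 comparisons against 7 weights), but the pairwise search grows
roughly with the square of the number of codewords while the weight search grows linearly.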
B.3 Encoding and Decoding of Linear Block Codes 
B.3.1 Encoding with G and H 
Equation (B-8) gives an encoding rule for linear block codes that is easy to implement. If encoding
is to be systematic, the generator matrix G of a linear block $(n, k, d_{\min})$ code can be brought
to a systematic form, $G_{sys}$, using elementary row operations and/or column permutations.
$G_{sys}$ is composed of two submatrices: the k-by-k identity matrix, denoted $I_k$, and a
k-by-(n − k) parity submatrix P, such that
$$G_{sys} = (I_k \mid P) \qquad \text{(B-13)}$$
where
$$P = \begin{pmatrix} p_{0,0} & p_{0,1} & \cdots & p_{0,n-k-1} \\ p_{1,0} & p_{1,1} & \cdots & p_{1,n-k-1} \\ \vdots & \vdots & \ddots & \vdots \\ p_{k-1,0} & p_{k-1,1} & \cdots & p_{k-1,n-k-1} \end{pmatrix} \qquad \text{(B-14)}$$
Using the equation $GH^T = 0_{k,n-k}$, where $0_{k,n-k}$ is the k-by-(n − k) all-zero matrix, it
can be verified that the systematic form $H_{sys}$ of the parity-check matrix is
$$H_{sys} = (P^T \mid I_{n-k}) \qquad \text{(B-15)}$$
 
Let $u = (u_0, u_1, \ldots, u_{k-1})$ be the information message to be encoded and
$v = (v_0, v_1, \ldots, v_{n-1})$ the corresponding codeword. If the parameters of C are such that
k < (n − k), or equivalently the code rate k/n < 1/2, encoding with the generator matrix is the
most economical, where the cost is measured in binary operations. This results in the following:
$$v = u G_{sys} = (u, v_p) \qquad \text{(B-16)}$$
In Equation (B-16), $v_p = uP = (v_k, v_{k+1}, \ldots, v_{n-1})$ is the parity-check (redundant)
part of the codeword.
If k > (n − k), or k/n > 1/2, encoding requires fewer computations using the parity-check matrix H.
In this case the encoding is based on Equation (B-10), $(u, v_p) H^T = 0$, such that the (n − k)
parity-check positions $v_k, v_{k+1}, \ldots, v_{n-1}$ are obtained as follows using Equation
(B-15), where the column index j − k runs over the n − k columns of P:
$$v_j = u_0 p_{0,j-k} + u_1 p_{1,j-k} + \cdots + u_{k-1} p_{k-1,j-k}, \qquad k \le j < n \qquad \text{(B-17)}$$
In other words, the rows of the systematic form of the parity-check matrix of a linear code hold
the coefficients of the parity-check equations.
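The systematic forms and the parity-check equations above can be sketched as follows. The parity
submatrix P is a hypothetical example for a (7, 4) code, and all helper names are illustrative
assumptions rather than part of the thesis implementation.

```python
def identity(m):
    """m-by-m identity matrix as a list of lists."""
    return [[int(i == j) for j in range(m)] for i in range(m)]


def transpose(M):
    return [list(col) for col in zip(*M)]


def mat_mul_gf2(A, B):
    """Matrix product with modulo-2 arithmetic."""
    return [[sum(a & b for a, b in zip(row, col)) % 2 for col in zip(*B)] for row in A]


# Hypothetical parity submatrix P (k = 4 rows, n - k = 3 columns) of a (7, 4) code.
P = [
    [1, 1, 0],
    [0, 1, 1],
    [1, 1, 1],
    [1, 0, 1],
]
k, r = len(P), len(P[0])

# G_sys = (I_k | P) and H_sys = (P^T | I_{n-k}).
G_sys = [i_row + p_row for i_row, p_row in zip(identity(k), P)]
H_sys = [pt_row + i_row for pt_row, i_row in zip(transpose(P), identity(r))]


def encode_systematic(u):
    """Return (u, v_p), with each parity bit computed from a parity-check equation."""
    v_p = [sum(u[i] & P[i][j] for i in range(k)) % 2 for j in range(r)]
    return list(u) + v_p


u = [1, 0, 1, 1]
v = encode_systematic(u)
print(v)  # → [1, 0, 1, 1, 1, 0, 0]
```

The same codeword is obtained from the product of u with G_sys, and multiplying G_sys by the
transpose of H_sys gives the all-zero matrix, as the derivation requires.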
B.3.2 Standard Array Decoding 
A decoding scheme [104] is presented that aims to find the codeword $v$ closest to a noisy received
word $r = v + e$, where $e \in \{0, 1\}^n$ is the error vector produced by a Binary Symmetric
Channel (BSC), shown in Figure B-2.
 
Figure B-2 Binary Symmetric Channel (BSC) 
Assuming that the crossover probability p (the BSC parameter) is less than 1/2, the standard array
[210] for a binary linear $(n, k, d_{\min})$ code C is shown in Table B-2. It is a table of all
possible received vectors $r$, organized so that the codeword $v$ nearest to $r$ is easy to find.
The standard array has $2^{n-k}$ rows and $2^k + 1$ columns. All the vectors in
$V_2 = \{0, 1\}^n$ are contained in the entries of the rightmost $2^k$ columns of the array. The
decoding procedure is described using a new concept, the syndrome. The syndrome of a word
$r \in V_2$ is defined from Equation (B-10) as
$$s = rH^T \qquad \text{(B-18)}$$
Table B-2 Standard Array of a Binary Linear Block Code. 
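$s$                $u_0 = 0$          $u_1$                    ...   $u_{2^k-1}$
0                  $v_0 = 0$          $v_1$                    ...   $v_{2^k-1}$
$s_1$              $e_1$              $e_1 + v_1$              ...   $e_1 + v_{2^k-1}$
$s_2$              $e_2$              $e_2 + v_1$              ...   $e_2 + v_{2^k-1}$
...                ...                ...                            ...
$s_{2^{n-k}-1}$    $e_{2^{n-k}-1}$    $e_{2^{n-k}-1} + v_1$    ...   $e_{2^{n-k}-1} + v_{2^k-1}$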
 
In Equation (B-18), H is the parity-check matrix of C. The syndrome $s$ acts as a set of symptoms
that indicate errors. To see this, assume that a codeword $v \in C$ is transmitted through a BSC, which
adds the error vector $e$, so that $r = v + e$ is received. Using Equation (B-18), the syndrome of
$r$ is
$s = r H^T = (v + e) H^T = e H^T$ (B-19)
Equation (B-19) follows from Equation (B-10), since $v H^T = 0$ for every codeword: the syndrome is a
linear transformation of the error vector alone.
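The standard array does not need to be stored in full: one minimum-weight error pattern (coset leader) per syndrome is enough. The sketch below builds that reduced table and uses it for decoding. It is only an illustration under stated assumptions: the (7, 4) parity-check matrix `H` and the helper names are chosen for the example and do not come from the thesis.

```python
from itertools import combinations

# Rows of an illustrative parity-check matrix H = (P^T | I_3) of a (7, 4) code.
H = [[1, 0, 1, 1, 1, 0, 0],
     [1, 1, 1, 0, 0, 1, 0],
     [0, 1, 1, 1, 0, 0, 1]]
N = len(H[0])


def syndrome(r):
    """Return s = r H^T over GF(2) as a tuple."""
    return tuple(sum(h[i] * r[i] for i in range(N)) % 2 for h in H)


def build_leaders():
    """Map every syndrome to an error pattern of least Hamming weight."""
    leaders = {}
    for weight in range(N + 1):                  # try light patterns first
        for positions in combinations(range(N), weight):
            e = [1 if i in positions else 0 for i in range(N)]
            leaders.setdefault(syndrome(e), e)   # keep the first (lightest) one
        if len(leaders) == 2 ** len(H):          # all 2^(n-k) syndromes covered
            break
    return leaders


LEADERS = build_leaders()


def decode(r):
    """Add (modulo 2) the coset leader selected by the syndrome of r."""
    e = LEADERS[syndrome(r)]
    return [(a + b) % 2 for a, b in zip(r, e)]
```

For this single-error-correcting code the table has eight entries: the all-zero pattern and the seven single-bit errors.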
B.3.3 Hamming Spheres, Decoding Regions and the Standard Array 
The concept of the Hamming sphere and the error correction capability of a linear code C can be
understood with the help of the standard array. The $2^k$ rightmost columns of the standard
array, denoted by $Col_j$ for $0 \le j \le 2^k - 1$, each contain a codeword $v_j \in C$ and a set of
$2^{n-k} - 1$ words at the shortest Hamming distance from $v_j$, that is:
$Col_j = \{ v_j + e_i \mid e_i \in Row_i,\ 0 \le i < 2^{n-k} \}$ (B-20)
The sets $Col_j$ are called the decoding regions, in the Hamming space, around each codeword
$v_j \in C$, for $0 \le j \le 2^k - 1$. Decoding is therefore successful when a codeword $v_j \in C$ is
sent across a BSC and the received word $r$ belongs to the set $Col_j$.
 
Hamming Bound 
There is a relation between the error correction capability t of a code C and the sets $Col_j$. The
relation is based on the Hamming sphere $S_t(v_j)$: the decoding region $Col_j$ of a binary linear
(n, k, dmin) code C contains the Hamming sphere, $S_t(v_j) \subseteq Col_j$. Using Equation (B-4), and
knowing that the size of $Col_j$ is $2^{n-k}$, the Hamming bound is obtained:
$\sum_{i=0}^{t} \binom{n}{i} \le 2^{n-k}$ (B-21)
Among the combinatorial interpretations of the Hamming bound, one concerns the number of syndromes,
$2^{n-k}$: it must be greater than or equal to the number of correctable error patterns,
$\sum_{i=0}^{t} \binom{n}{i}$.
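The bound of Equation (B-21) is easy to evaluate numerically. The short sketch below checks it for a few code parameters; the function names are assumptions introduced for the example.

```python
from math import comb


def correctable_patterns(n, t):
    """Number of error patterns of weight at most t in n positions."""
    return sum(comb(n, i) for i in range(t + 1))


def satisfies_hamming_bound(n, k, t):
    """True if the number of syndromes covers all patterns of weight <= t."""
    return correctable_patterns(n, t) <= 2 ** (n - k)


def meets_bound_with_equality(n, k, t):
    """True if every syndrome is used by exactly one correctable pattern."""
    return correctable_patterns(n, t) == 2 ** (n - k)
```

The (7, 4) code with t = 1 and the (23, 12) code with t = 3 both meet the bound with equality, whereas the parameters (24, 12) with t = 3 satisfy it with room to spare.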
B.4 General Structure of a Hard-Decision Decoder of Linear Codes 
The hard-decision decoder [104] for a linear code is summarized in this section. Its block diagram
is shown in Figure B-3, which illustrates the general structure of the decoding process, including
the computation of the syndrome vector and the buffering of the received vector.
With the assumption of hard decisions, a decoder designed for a BSC is given bits at the output of a
demodulator. Let the transmitted codeword be $v \in C$. The word received at the decoder is corrupted
by noise such that $r = v + e$. The linear decoder operates in two steps:
• First, the syndrome $s = r H^T$ is computed. By linearity, the syndrome is a linear
transformation of the error vector added in the transmission channel:
$s = e H^T$ (B-22)
• Second, the most likely error vector $e$ is estimated from the syndrome $s$ and subtracted from
the received vector (using modulo-2 addition in the binary case).
A decoder is best thought of as a method of solving Equation (B-22). The two steps above are not
necessarily the optimal way of doing so, but they are an acceptable one, and any method of solving
this equation constitutes a decoding scheme. As an illustration, one could try to solve this key
equation by finding a pseudoinverse of $H^T$, denoted $(H^T)^+$, such that $H^T (H^T)^+ = I_n$. This
results in the following solution:
$e = s (H^T)^+$, (B-23)
which has the smallest Hamming weight achievable.
 
 
Figure B-3 General Structure of a Hard-Decision Decoder of a Linear Block Code [104] 
B.5 Hamming, Golay and Reed–Muller codes 
In this section, the main linear binary codes are presented. They are important for understanding
error correcting code (ECC) concepts, and efficient decoding schemes are known for them.
The most popular linear codes are the Hamming [94] and Reed-Solomon [160] codes. Hamming codes
are widespread because they are optimal: for a given block length, they need the least amount of
redundancy to correct any single error. The binary Golay code [96] is also optimal, with a
triple-error correction capability. The two other optimal binary codes are the repetition and
single parity-check (SPC) codes. The Reed-Muller (RM) codes [211] are significant because of their
combinatorial definition, which makes them easy to decode.
B.5.1 Hamming Codes 
As a result of Equation (B-8), any codeword $v$ in a linear (n, k, dmin) block code C satisfies the
following equation:
$v H^T = 0$ (B-24)
A useful interpretation of this equation is that the maximum number of linearly independent columns
of the parity-check matrix H of C is equal to dmin − 1. In the binary case with dmin = 3, Equation
(B-24) means that the sum of any two columns of H is not equal to the all-zero vector.
Assuming that the columns of H are binary vectors of length m, there are up to $2^m - 1$ distinct
nonzero columns. As a result, the length of a binary single-error correcting code satisfies
$n \le 2^m - 1$. This inequality is the Hamming bound for an error correcting code of length n, with
n − k = m and t = 1. A code that satisfies this bound with equality is called a Hamming code.
Hamming codes have the property that all the columns of their parity-check matrix H are different.
If a single error occurs in position j, 1 ≤ j ≤ n, then the syndrome of the received vector is equal
to the column of H in the position where the error occurred.
Assume that a codeword is transmitted through a binary symmetric channel (BSC), which adds an error
vector $e$ whose components are all equal to zero, with the exception of the j-th component, where
$e_j = 1$. The syndrome of the received word is then:
$s = r H^T = e H^T = h_j$ (B-25)
In Equation (B-25), $h_j$ is the j-th column of H, and 1 ≤ j ≤ n.
Encoding and Decoding Procedures 
From Equation (B-25), it follows that if the columns of H are expressed as the binary representation
of integers, then the value of the syndrome directly reveals the position of the error. This idea is
used in the encoding and decoding algorithms shown below. The columns of the parity-check matrix are
taken as the binary representations of the integers i in the range [1, n], in increasing order.
Let the resulting matrix be denoted by H’. The code associated with H’ is equivalent to the original
Hamming code with parity-check matrix H, up to a permutation (exchange) of bit positions. In its
systematic form, the parity-check matrix contains the (n − k) × (n − k) identity matrix $I_{n-k}$,
as shown in Equation (B-15).
When the columns of H’ are the binary representations of the (integer) column numbers, the columns
of the identity matrix are those of H’ located at positions that are powers of 2, that is, $2^l$,
where 0 ≤ l < m.
Encoding 
Encoding follows the parity-check equations of Equation (B-17). When computing the parity-check bit
$p_j$, for 1 ≤ j ≤ m, the column position numbers are examined: the columns whose positions are not
powers of two correspond to message positions, and the corresponding message bits are included in
the computation. The advantage of this scheme is the extreme simplicity of the resulting decoding,
which suits applications in which the decoding part must run at very high speed.
Decoding 
Decoding is simple, because the codewords have been computed with respect to the matrix H’. The
syndrome, taken as an integer, gives the position in which the error occurred. After the syndrome
$s$ is computed, the faulty position is corrected as:
$v_s = v_s \oplus 1$,
where ⊕ denotes the exclusive-or operation.
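The encoding and decoding procedures for H’ can be sketched in a few lines of Python. This is a minimal illustration, assuming bit positions numbered 1 to n with parity bits at the powers of two; the function names are hypothetical.

```python
def hamming_encode(data_bits, m):
    """Encode k = 2^m - 1 - m data bits into a codeword of length n = 2^m - 1."""
    n = 2 ** m - 1
    v = [0] * (n + 1)                 # index 0 is unused; positions are 1..n
    data = iter(data_bits)
    for pos in range(1, n + 1):
        if pos & (pos - 1):           # not a power of two: message position
            v[pos] = next(data)
    for l in range(m):                # parity bit at position 2^l
        p = 1 << l
        v[p] = sum(v[pos] for pos in range(1, n + 1)
                   if pos & p and pos != p) % 2
    return v[1:]


def hamming_syndrome(r):
    """XOR of the positions that hold a 1; read as an integer position."""
    s = 0
    for pos, bit in enumerate(r, start=1):
        if bit:
            s ^= pos
    return s


def hamming_decode(r):
    """Correct a single error by flipping position s (v_s = v_s XOR 1)."""
    r = list(r)
    s = hamming_syndrome(r)
    if s:
        r[s - 1] ^= 1
    return r
```

Because the j-th column of H’ is the binary representation of j, the syndrome of a word with a single error in position j is the integer j itself.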
B.5.2 The Binary Golay Code 
The Golay code [96] can correct three errors (t = 3). Golay observed that:
$\sum_{i=0}^{3} \binom{23}{i} = 2^{11}$ (B-26)
Equation (B-26) indicates the possible existence of a perfect binary (23, 12, 7) code, able to
correct at most three errors in 23 bit positions.
Golay’s paper on this binary code gives a generator matrix with triple-error correction capability.
Because of its relatively small length (23), dimension (12) and number of redundant bits (11), the
code can be encoded and decoded simply with look-up tables (LUTs).
Encoding 
Encoding is based on a look-up table that contains all the $2^{12} = 4096$ codewords, indexed directly
by the data. Let $u$ be the 12-bit binary vector containing the data to be encoded, and let $v$ be the
23-bit vector representing the codeword. The encoder look-up table is built by generating all 4096
12-bit vectors and, for each one, computing the syndrome of a pattern whose 12 most-significant bits
(MSBs) are equal to the information bits and whose 11 least-significant bits (LSBs) are set to zero.
The 11-bit syndrome then becomes the LSB part of the codeword. The look-up table maps $u$ to $v$ and
is a one-to-one mapping, which can be written as:
$v$ = LUT($u$) = ($u$, get_syndrome($u$, 0)) (B-27)
The construction of the encoder LUT takes advantage of the cyclic nature of the Golay code. The
generator polynomial is $g(x) = x^{11} + x^{10} + x^6 + x^5 + x^4 + x^2 + 1$, which is C75 in
hexadecimal. This polynomial is used to compute the syndrome in the procedure “get_syndrome” of
Equation (B-27).
Decoding 
The aim of the decoder is to estimate the most likely error vector $e$, that is, the one with the
least Hamming weight, from the received vector $r$. The LUT-based Golay decoder takes as input the
syndrome $s$ of the received vector $r$ and outputs the error vector $e$. The decoder LUT is built
in three steps:
• All possible error patterns $e$ with Hamming weight less than or equal to three are generated;
• The syndrome corresponding to every error pattern is computed, $s$ = get_syndrome($e$);
• The error vector is stored at position $s$ of the LUT,
LUT($s$) = $e$
With the LUT decoder, up to three errors in a received word $r$ can be detected and corrected as
follows:
$v$ = $r$ ⊕ LUT(get_syndrome($r$))
where $v$ denotes the corrected word.
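The two look-up tables described above can be sketched in Python as follows. The generator polynomial 0xC75 is the one quoted in the text; the integer bit layout (information bits in the 12 MSBs) and the function and table names are assumptions made for this illustration.

```python
from itertools import combinations

G_POLY = 0xC75          # g(x) = x^11 + x^10 + x^6 + x^5 + x^4 + x^2 + 1
N, K = 23, 12


def get_syndrome(word):
    """Remainder of the 23-bit polynomial `word` divided by g(x) over GF(2)."""
    for shift in range(N - 1, N - K - 1, -1):      # bit 22 down to bit 11
        if word & (1 << shift):
            word ^= G_POLY << (shift - (N - K))    # cancel the leading term
    return word


def encode(u):
    """12 information bits in the MSBs, 11-bit syndrome in the LSBs."""
    shifted = u << (N - K)
    return shifted | get_syndrome(shifted)


# Encoder LUT: one codeword per 12-bit message.
ENCODE_LUT = [encode(u) for u in range(1 << K)]

# Decoder LUT: syndrome -> error pattern, for all patterns of weight <= 3.
DECODE_LUT = {0: 0}
for weight in (1, 2, 3):
    for positions in combinations(range(N), weight):
        e = sum(1 << p for p in positions)
        DECODE_LUT[get_syndrome(e)] = e


def decode(r):
    """Corrected word v = r XOR LUT(get_syndrome(r))."""
    return r ^ DECODE_LUT[get_syndrome(r)]
```

The decoder table ends up with exactly $2^{11}$ entries, one per syndrome, which is the counting argument of Equation (B-26) seen from the implementation side.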
B.5.3 Extended (24, 12, 8) Golay Code 
A decoding procedure for the extended (24, 12, 8) Golay code C24 is presented in this section. The
procedure is based on an arithmetic decoding scheme [208, 212], which uses the parity submatrix B of
the parity-check matrix H = (B | I12×12). An extended Golay code C’24 that is equivalent to C24, up
to a permutation of the bit positions, can be obtained by appending an overall parity-check bit to
the end of each codeword of the (23, 12, 7) Golay code.
In hexadecimal format, the 12 rows of B, denoted by $row_i$, 1 ≤ i ≤ 12, are as follows:
0x7ff, 0xee2, 0xdc5, 0xb8b, 0xf16, 0xe2d, 
0xc5b, 0x8b7, 0x96e, 0xadc, 0xdb8, 0xb71. 
The submatrix B of the parity-check matrix of C24 satisfies $B = B^T$, and the code C24 is a
self-dual code [213]. As a consequence, encoding is performed by recurrence with respect to H, as
indicated in Equation (B-17). With $wt_H(x)$ denoting the Hamming weight of a vector $x$, the
arithmetic decoding of the extended (24, 12, 8) Golay code proceeds in the following steps:
1. Compute the syndrome $s = r H^T$.
2. If $wt_H(s) \le 3$, set $e = (s, 0)$ and go to step 8.
3. If $wt_H(s + row_i) \le 2$ for some i, set $e = (s + row_i, x_i)$ and go to step 8, where $x_i$ is
a 12-bit vector whose only nonzero coordinate is the i-th.
4. Compute $sB$.
5. If $wt_H(sB) \le 3$, set $e = (0, sB)$ and go to step 8.
6. If $wt_H(sB + row_i) \le 2$ for some i, set $e = (x_i, sB + row_i)$ and go to step 8, where $x_i$
is defined as in step 3.
7. The received vector $r$ is corrupted by an uncorrectable error pattern: set the decoding failure
flag and end the decoding.
8. Set $\hat{v} = r + e$ and end the decoding.
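The eight steps can be sketched in Python using the rows of B listed above. Several details are assumptions made so that the steps apply exactly as written: each half of a word is a 12-bit integer whose MSB is the first coordinate, codewords are taken as (u, uB), and the syndrome is computed as $s = r_a + r_b B$ for a received word $(r_a, r_b)$, which corresponds to ordering the parity-check matrix as (I | B). The function names are hypothetical.

```python
B_ROWS = [0x7ff, 0xee2, 0xdc5, 0xb8b, 0xf16, 0xe2d,
          0xc5b, 0x8b7, 0x96e, 0xadc, 0xdb8, 0xb71]


def wt(x):
    """Hamming weight of an integer bit vector."""
    return bin(x).count("1")


def times_b(x):
    """Row vector x (MSB = first coordinate) multiplied by B over GF(2)."""
    y = 0
    for i, row in enumerate(B_ROWS):
        if x & (1 << (11 - i)):
            y ^= row
    return y


def unit(i):
    """12-bit vector x_i whose only nonzero coordinate is the i-th (0-based)."""
    return 1 << (11 - i)


def find_error(s):
    """Steps 2 to 6: return the error halves (e_a, e_b), or None on failure."""
    if wt(s) <= 3:                                   # step 2
        return s, 0
    for i, row in enumerate(B_ROWS):                 # step 3
        if wt(s ^ row) <= 2:
            return s ^ row, unit(i)
    sb = times_b(s)                                  # step 4
    if wt(sb) <= 3:                                  # step 5
        return 0, sb
    for i, row in enumerate(B_ROWS):                 # step 6
        if wt(sb ^ row) <= 2:
            return unit(i), sb ^ row
    return None                                      # step 7


def decode(r_a, r_b):
    """Return the corrected halves, or None if decoding fails."""
    e = find_error(r_a ^ times_b(r_b))               # step 1 (assumed ordering)
    if e is None:
        return None
    return r_a ^ e[0], r_b ^ e[1]                    # step 8
```

With this convention, any pattern of up to three errors is corrected, and a pattern of four errors falls through to step 7 and is reported as a decoding failure.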
B.6 Binary Reed–Muller Codes 
The binary RM codes form a family of error correcting codes with a simple decoding scheme based on
majority logic (ML). According to [214], the codes of this family also tend to have relatively
simple trellis structures.
The binary RM codes can be defined compactly using binary polynomials (Boolean functions). With this
definition, the RM codes are closely related to the Bose–Chaudhuri–Hocquenghem (BCH) and
Reed–Solomon codes, all of which are members of the class of polynomial codes [215].
B.6.1 Boolean Polynomials and RM Codes 
Following the work of [216] in 1977, let $f(x_1, x_2, \ldots, x_m)$ denote a Boolean function of m
binary-valued variables $x_1, x_2, \ldots, x_m$. Such a function is commonly specified by its truth
table, which lists the value of f for each of the $2^m$ combinations of values of its arguments.
The usual Boolean operations, such as AND (conjunction), OR (disjunction) and NOT (negation), are
defined on Boolean functions.
The disjunctive normal form (DNF) of a Boolean function can be obtained directly from its truth
table. Any Boolean function can be represented as a modulo-2 sum of the $2^m$ basis functions
$1, x_1, x_2, \ldots, x_m, x_1 x_2, \ldots, x_1 x_2 \cdots x_m$, so that:
$f = a_0 1 + a_1 x_1 + a_2 x_2 + \cdots + a_m x_m + a_{12} x_1 x_2 + \cdots + a_{12 \ldots m} x_1 x_2 \cdots x_m$ (B-28)
In Equation (B-28), the term 1 accounts for the independent term of degree 0. The notation RM(r, m)
refers to the binary $(2^m, k, 2^{m-r})$ code defined as the set of vectors associated with all
Boolean functions of degree up to r in m variables. RM(r, m) is also known as the r-th order RM code
of length $2^m$.
The dimension of RM(r, m) is given by:
$k = \sum_{i=0}^{r} \binom{m}{i}$ (B-29)
Equation (B-29) counts the number of ways in which monomials of degree up to r can be formed from m
variables. A generator matrix for RM(r, m) can be built from Equation (B-28) by taking as rows the
vectors associated with the k Boolean functions that can be expressed as monomials of degree up to
r in m variables.
Dual RM Codes  
The generator matrix of RM(m − r − 1, m) can be used as a parity-check matrix of RM(r, m). In other
words, RM(m − r − 1, m) is the dual code of RM(r, m).
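The construction of a generator matrix from monomials, the dimension of Equation (B-29) and the duality property can be checked on small codes with the sketch below. The ordering of the evaluation points and the function names are assumptions made for the example, and the minimum-distance search is brute force, so it is only practical for short codes.

```python
from itertools import combinations, product
from math import comb


def rm_generator(r, m):
    """Rows: evaluations of every monomial of degree <= r at the 2^m points."""
    points = list(product((0, 1), repeat=m))
    rows = []
    for degree in range(r + 1):
        for subset in combinations(range(m), degree):
            # The empty subset is the constant function 1.
            rows.append([int(all(p[i] for i in subset)) for p in points])
    return rows


def rm_dimension(r, m):
    """k = sum of binomial(m, i) for i = 0..r."""
    return sum(comb(m, i) for i in range(r + 1))


def min_distance(rows):
    """Minimum weight over all nonzero codewords (brute force)."""
    length = len(rows[0])
    best = length
    for coeffs in product((0, 1), repeat=len(rows)):
        if any(coeffs):
            word = [sum(c * row[j] for c, row in zip(coeffs, rows)) % 2
                    for j in range(length)]
            best = min(best, sum(word))
    return best


def orthogonal(rows_a, rows_b):
    """True if every row of rows_a is orthogonal (mod 2) to every row of rows_b."""
    return all(sum(a * b for a, b in zip(ra, rb)) % 2 == 0
               for ra in rows_a for rb in rows_b)
```

For example, RM(1, 3) comes out as an (8, 4, 4) code, and the generator rows of RM(0, 3) are orthogonal to those of RM(2, 3).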
B.7 Finite Geometries and Majority-Logic Decoding 
In this section, the construction of RM codes using finite geometries is introduced. A Euclidean
geometry EG(m, 2) of dimension m over GF(2) consists of $2^m$ points, which are all the binary
vectors of length m. Finite geometries and RM codes are explained in detail in [217].
RM codes are a subset of the generalized finite geometry codes. The relationship between finite
geometries and the codes is as follows. Take EG(m, 2), and let the coordinates of the points of the
geometry be given by the columns of the matrix $(x_1^T\ x_2^T\ \ldots\ x_m^T)$. There is then a
one-to-one correspondence between the points of EG(m, 2) and the components of binary vectors of
length $2^m$, so that any given binary vector of length $2^m$ is associated with a subset of points
of EG(m, 2).
Specifically, any binary vector $w = (w_1, w_2, \ldots, w_{2^m})$ of length $2^m$ can be associated
with a subset of EG(m, 2) by interpreting $w$ as selecting the i-th point whenever $w_i = 1$. The
vector $w$ is then called the incidence vector of that subset. The binary RM codes can therefore be
defined as follows: the codewords of RM(r, m) are the incidence vectors of all subspaces (that is,
linear combinations of points) of dimension m − r in EG(m, 2); see Theorem 8 of [218]. As a direct
consequence, the number of minimum-weight codewords of RM(r, m) is:
$A_{2^{m-r}} = 2^r \prod_{i=0}^{m-r-1} \frac{2^{m-i} - 1}{2^{m-r-i} - 1}$ (B-30)
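As a quick check of Equation (B-30), read here as $2^r$ times the product of $(2^{m-i} - 1)/(2^{m-r-i} - 1)$ for i from 0 to m − r − 1, consider RM(1, 3), which is an (8, 4, 4) code. The worked evaluation below gives 14 codewords of minimum weight 4, in agreement with the known weight distribution of the (8, 4, 4) extended Hamming code.

```latex
A_{2^{2}} = 2^{1} \prod_{i=0}^{1} \frac{2^{3-i} - 1}{2^{2-i} - 1}
          = 2 \cdot \frac{2^{3} - 1}{2^{2} - 1} \cdot \frac{2^{2} - 1}{2^{1} - 1}
          = 2 \cdot \frac{7}{3} \cdot \frac{3}{1}
          = 14
```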
 
Puncturing (deleting) the coordinate corresponding to
$x_1 = x_2 = \cdots = x_m = 0$
from all the codewords of RM(r, m) gives the binary cyclic RM*(r, m) code, whose number of
minimum-weight codewords is:
$A^*_{2^{m-r}-1} = \prod_{i=0}^{m-r-1} \frac{2^{m-i} - 1}{2^{m-r-i} - 1}$ (B-31)
RM codes can be decoded using ML decoding. The main idea is that the parity-check matrix induces
$2^{n-k}$ parity-check equations. The ML decoder is designed by selecting a subset of these
parity-check equations in such a way that a majority vote among them determines the value of a
specific code position.
An RM(r, m) code can usually be decoded with an (r + 1)-step ML decoder, which corrects any
combination of up to $\lfloor (2^{m-r} - 1)/2 \rfloor$ random errors [217, 218].
A further advantage is the simplicity of decoding the cyclic RM*(r, m) codes. In a cyclic code C,
if $(v_1, v_2, \ldots, v_n)$ is a codeword of C, then its cyclic shift $(v_n, v_1, \ldots, v_{n-1})$ is
also a codeword of C. Consequently, once a particular position can be corrected with ML decoding,
all the other positions can be corrected with the same algorithm (or hardware circuit), by
cyclically shifting the received word until all n positions have been processed.
 
 
Figure B-4 A Majority-Logic Decoder for a Cyclic RM (1, 3) Code [104]. 
Figure B-4 shows an example of a majority-logic decoder for a cyclic RM(1, 3) code. 
B.8 Binary Cyclic Codes and BCH Codes 
This section introduces the main concepts needed to understand binary cyclic codes, together with
the mathematical background of their encoding and decoding schemes. It also introduces the family
of BCH codes, a subclass of the cyclic codes whose algebraic structure simplifies the encoding and
decoding processes. The Hamming codes are the members of the family of binary BCH codes with a
minimum distance of 3. Hamming codes are efficient, detecting and recovering from errors with low
overhead, which makes them a suitable choice for many computer networks. Another example of the use
of binary BCH codes is the shortened (48, 36, 5) BCH code adopted in the U.S. cellular time division
multiple access (TDMA) system specification, standard IS-54 [104].
B.8.1 Binary Cyclic Codes 
Cyclic codes are a subclass of error correcting codes (ECC) that can be encoded and decoded
efficiently with simple shift registers and combinatorial logic elements, on the basis of their
representation as polynomials. This section presents the basic notation of cyclic codes.
 
Generator and Parity-Check Polynomials 
Let C be a linear (n, k) block code, and let $u$ and $v$ denote a message and the corresponding
codeword in C, respectively.
Cyclic codes are linear codes with additional properties that make them well suited to hardware
implementation. Every codeword $v$ is associated with a polynomial $v(x)$:
$v = (v_0, v_1, \ldots, v_{n-1}) \to v(x) = v_0 + v_1 x + \cdots + v_{n-1} x^{n-1}$.
The indeterminate x indicates the relative position of an element $v_i$ of $v$ as the term
$v_i x^i$ of $v(x)$, 0 ≤ i < n.
 
Figure B-5 A Cyclic Shift Register. 
A linear block code C is cyclic if and only if every cyclic shift of a codeword is another
codeword:
$v = (v_0, v_1, \ldots, v_{n-1}) \in C \iff v^{(1)} = (v_{n-1}, v_0, \ldots, v_{n-2}) \in C$.
In terms of polynomials, a cyclic shift by one position, denoted $v^{(1)}(x)$, is obtained through
multiplication by x modulo $(x^n - 1)$:
$v(x) \in C \iff v^{(1)}(x) = x\,v(x) \bmod (x^n - 1) \in C$
Figure B-5 illustrates a shift register that can be used for this purpose.
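The equivalence between a cyclic shift and multiplication by x modulo $(x^n - 1)$ can be illustrated with integers used as coefficient vectors (bit i holds the coefficient of $x^i$). The sketch below assumes the (7, 4) cyclic code generated by $x^3 + x + 1$ as an example; the names are hypothetical.

```python
def cyclic_shift(v, n):
    """Return x * v(x) mod (x^n - 1) for a binary polynomial stored in an int."""
    v <<= 1                                   # multiply by x
    if v >> n:                                # a term x^n appeared ...
        v = (v & ((1 << n) - 1)) | 1          # ... and wraps around to x^0
    return v


def gf2_mul(a, b):
    """Carry-less (modulo-2) product of two binary polynomials."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        b >>= 1
    return result


G = 0b1011                                    # x^3 + x + 1, a factor of x^7 - 1
CODE = {gf2_mul(u, G) for u in range(16)}     # the 16 codewords of the (7, 4) code
```

Shifting any element of `CODE` cyclically gives another element of `CODE`, which is the defining property stated above.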
The Generator Polynomial 
Cyclic codes have several useful properties. One of them is that all code polynomials $v(x)$ are
multiples of a unique polynomial $g(x)$, called the generator polynomial of the code. This
polynomial is specified by its roots, which are known as the zeros of the code. The generator
polynomial $g(x)$ divides $(x^n - 1)$, where a(x) is said to divide b(x) if b(x) = q(x)a(x). This
property makes it possible to find a generator polynomial: the polynomial $(x^n - 1)$ is factored
into its irreducible factors $\varphi_j(x)$, j = 1, 2, ..., l,
$(x^n - 1) = \varphi_1(x) \varphi_2(x) \cdots \varphi_l(x)$. (B-32)
Over the field of binary numbers, a − b and a + b (modulo 2) are identical. In the following
sections no distinction is therefore made between these operations, as all the codes are
constructed over the binary field or its extensions. The generator polynomial $g(x)$ is then given
by a product of some of these factors:
$g(x) = \prod_{j \in J \subset \{1, 2, \ldots, l\}} \varphi_j(x)$ (B-33)
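As a worked example of Equations (B-32) and (B-33), take n = 7. Over the binary field, $x^7 - 1$ has three irreducible factors, and each choice of a subset of them gives the generator polynomial of a cyclic code of length 7. The two choices shown are illustrative only.

```latex
x^{7} - 1 = (x + 1)\,(x^{3} + x + 1)\,(x^{3} + x^{2} + 1)

g(x) = x^{3} + x + 1
  \quad \text{(cyclic (7, 4) Hamming code)}

g(x) = (x + 1)(x^{3} + x + 1) = x^{4} + x^{3} + x^{2} + 1
  \quad \text{(cyclic (7, 3) code)}
```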
B.8.2 Encoding and Decoding of Binary Cyclic Codes 
The dimension of an (n, k) binary cyclic code is given by:
$k = n - \deg[g(x)]$,
where deg[ ] denotes the degree of its argument. Because a cyclic code C is linear, a generator
matrix can be formed from any set of k linearly independent (LI) vectors.
In particular, the binary vectors associated with $g(x), x g(x), \ldots, x^{k-1} g(x)$ are LI and can
be used as the rows of a generator matrix of C.
In this case the message bits do not appear explicitly in any position of the codewords, and a
non-systematic encoding rule is obtained.
Let $u(x)$ be the polynomial associated with the message to be encoded. The codewords of a binary
cyclic code can be encoded in two ways, depending on how the message is processed, nonsystematic
or systematic:
1. Nonsystematic encoding
$v(x) = u(x) g(x)$ (B-34)
2. Systematic encoding
$v(x) = x^{n-k} u(x) + [x^{n-k} u(x) \bmod g(x)]$ (B-35)
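Both encoding rules, (B-34) and (B-35), reduce to polynomial multiplication and division over GF(2). The following sketch stores polynomials as integers (bit i is the coefficient of $x^i$) and uses $g(x) = x^3 + x + 1$ as an assumed example; the function names are hypothetical.

```python
def gf2_mul(a, b):
    """Carry-less (modulo-2) product of two binary polynomials."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        b >>= 1
    return result


def gf2_mod(a, g):
    """Remainder of a(x) divided by g(x) over GF(2)."""
    deg_g = g.bit_length() - 1
    while a.bit_length() - 1 >= deg_g:
        a ^= g << (a.bit_length() - 1 - deg_g)   # cancel the leading term
    return a


def encode_nonsystematic(u, g):
    """v(x) = u(x) g(x)."""
    return gf2_mul(u, g)


def encode_systematic(u, g):
    """v(x) = x^(n-k) u(x) + [x^(n-k) u(x) mod g(x)], with n - k = deg g."""
    shifted = u << (g.bit_length() - 1)
    return shifted ^ gf2_mod(shifted, g)
```

With the systematic rule the message can be read directly from the high-order bits of the codeword, whereas the nonsystematic rule mixes the message bits into all positions.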
 
B.8.3 The Parity-Check Polynomial 
The parity-check polynomial, denoted by $h(x)$, is associated with the parity-check matrix. The
parity-check polynomial and the generator polynomial are related by the following expression:
$g(x) h(x) = x^n + 1$ (B-36)
The parity-check polynomial can therefore be computed from the generator polynomial as:
$h(x) = (x^n + 1)/g(x) = h_0 + h_1 x + \cdots + h_k x^k$ (B-37)
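As a worked example of Equations (B-36) and (B-37), assume n = 7 and $g(x) = x^3 + x + 1$ (an illustrative choice). The division gives a parity-check polynomial of degree k = 4, and multiplying back modulo 2 recovers $x^7 + 1$.

```latex
h(x) = \frac{x^{7} + 1}{x^{3} + x + 1} = x^{4} + x^{2} + x + 1

g(x)\,h(x) = (x^{3} + x + 1)(x^{4} + x^{2} + x + 1)
           = x^{7} + 2x^{5} + 2x^{4} + 2x^{3} + 2x^{2} + 2x + 1
           \equiv x^{7} + 1 \pmod{2}
```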
A parity-check matrix for C can now be obtained by using as rows the binary vectors associated
with the first n − k − 1 nonzero cyclic shifts of $h(x)$:
$h^{(j)}(x) = x^j h(x) \bmod (x^n - 1)$, j = 0, 1, ..., n − k − 1
$$H = \begin{pmatrix}
h_0 & h_1 & \cdots & \cdots & h_k & 0 & 0 & \cdots & 0 \\
0 & h_0 & h_1 & \cdots & \cdots & h_k & 0 & \cdots & 0 \\
\vdots & \vdots & \ddots & \ddots & & & \ddots & & \vdots \\
0 & 0 & \cdots & 0 & h_0 & h_1 & \cdots & \cdots & h_k
\end{pmatrix}$$ (B-38)
Duals of Cyclic Codes and Maximum-Length-Sequence Codes
From the theory of linear codes, it can also be deduced that the cyclic code C⊥ generated by the
polynomial $h(x)$ is the dual code of the cyclic code C with generator polynomial $g(x)$.
An important class of cyclic codes is that of the maximum-length-sequence (MLS) codes [219], which
are the duals of the cyclic Hamming codes.
The polynomial $g(x) = (x^n + 1)/p(x)$, with $n = 2^m - 1$, generates an MLS cyclic
$(2^m - 1, m, 2^{m-1})$ code, where p(x) is a primitive polynomial.
 
Figure B-6 Circuit for Systematic Encoding: Division by 𝒈(𝒙) [104] 
B.9 Decoding of Cyclic Codes 
Let $r(x) = v(x) + e(x)$, where $e(x)$ is the error polynomial associated with the error vector
introduced when the codeword is sent through a BSC. The syndrome polynomial is defined as:
$s(x) \triangleq r(x) \bmod g(x) = e(x) \bmod g(x)$ (B-39)
The overall structure of a decoder for cyclic codes is shown in Figure B-7. The error polynomial
$e(x)$ is computed from the syndrome polynomial $s(x)$. The architecture shown in Figure B-7 can be
thought of as a “standard array approach” to decoding cyclic codes, because a cyclic code is, first
of all, a linear code. The idea behind decoding is to obtain the unknown error polynomial $e(x)$
from the known syndrome polynomial $s(x)$. Equation (B-39), which relates the syndrome polynomial
$s(x)$ to the error polynomial $e(x)$, is fundamental to a syndrome decoder for cyclic codes, known
as the Meggitt decoder [220].
A related decoder, which checks whether the error polynomial $e(x)$ is contained within the
syndrome polynomial $s(x)$, is known as the error-trapping decoder [221]. Only a few classes of
codes, such as the cyclic Hamming and Golay codes, have simple decoders of this kind. As the
error-correcting capability $t = \lfloor (d_{min} - 1)/2 \rfloor$ increases, the complexity of an
architecture based solely on the detection of errors with combinatorial logic becomes too large.
Assume that the error is located in position $x^{n-1}$ (the first received bit), that is:
$e(x) = x^{n-1}$.
The syndrome polynomial corresponding to this error is:
$s(x) = x^{n-1} \bmod g(x)$.
Because the code is cyclic, an error in any other position can be detected with the same circuit,
by cyclically shifting the contents of the syndrome and error polynomials. The syndrome decoder
inspects the syndrome for each received position, and the error is detected, and the corresponding
position corrected, whenever the pattern $x^{n-1} \bmod g(x)$ is found.
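The shifting procedure described above can be sketched for single-error correction as follows. This is a simplified illustration, not the circuit of Figure B-7: it assumes the (7, 4) cyclic Hamming code with $g(x) = x^3 + x + 1$, and it recomputes the syndrome after every shift instead of updating a syndrome register. All names are hypothetical.

```python
G = 0b1011      # g(x) = x^3 + x + 1; bit i holds the coefficient of x^i
N = 7           # length of the illustrative cyclic Hamming code


def gf2_mod(a, g):
    """Remainder of a(x) divided by g(x) over GF(2)."""
    deg_g = g.bit_length() - 1
    while a.bit_length() - 1 >= deg_g:
        a ^= g << (a.bit_length() - 1 - deg_g)
    return a


def cyclic_shift(word, n):
    """x * word(x) mod (x^n - 1)."""
    word <<= 1
    if word >> n:
        word = (word & ((1 << n) - 1)) | 1
    return word


def encode(u):
    """Systematic encoding of a 4-bit message."""
    shifted = u << 3
    return shifted ^ gf2_mod(shifted, G)


def shift_and_correct(r):
    """Correct a single error by watching for the pattern x^(n-1) mod g(x)."""
    target = gf2_mod(1 << (N - 1), G)     # syndrome of e(x) = x^(n-1)
    for _ in range(N):
        if gf2_mod(r, G) == target:       # error is now in position n - 1
            r ^= 1 << (N - 1)
        r = cyclic_shift(r, N)            # bring the next position forward
    return r                              # n shifts restore the alignment
```

After n cyclic shifts the word is back in its original alignment, so the function returns the corrected codeword directly.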
 
Figure B-7 General Architecture of a Decoder for Cyclic Codes [104]. 
B.10 TMR  
Triple Modular Redundancy (TMR) is a well-known fault tolerance technique for software and
hardware. TMR uses three copies of the same element (redundant modules), whose outputs are passed
to a voting mechanism. The voter uses a majority rule: it selects the most common output [222]. TMR
thus corrects the effect of a fault before it propagates to the rest of the system. TMR is
discussed further in Chapters 4 and 5.
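A majority voter can be sketched in software in two common forms: a bitwise 2-out-of-3 vote on redundant words, and a vote on whole results. The code below is a generic illustration with hypothetical function names; it is not the voter implemented in Chapters 4 and 5.

```python
def vote_bitwise(a, b, c):
    """Bitwise 2-out-of-3 majority of three redundant integer words."""
    return (a & b) | (a & c) | (b & c)


def vote(a, b, c):
    """Majority of three module outputs; raises if all three disagree."""
    if a == b or a == c:
        return a
    if b == c:
        return b
    raise ValueError("no majority: all three modules disagree")
```

The bitwise form masks faults that hit different bit positions in different copies, while the whole-value form only tolerates a fault in a single copy.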
B.11 Discussion 
In this section the different error correction codes have been introduced, starting with the basic
notions of encoding and decoding, which require the addition of check bits to the original data
before it is stored or transmitted. This section also covered the linear block codes and the
structure of their encoders and decoders, including the hard-decision decoder. The most widespread
code, used in multiple domains, is the Hamming code, which can correct a single error or detect
two errors. The Hamming encoder and decoder have been shown in this section. Other binary codes,
including the Golay, Reed-Muller and BCH codes, which can detect and correct multiple errors, have
also been explained. Other techniques of error detection and correction rely on modular
redundancy, such as TMR, which is explained in more detail in Chapters 4 and 5. ECC is most
suitable for the memory system of a processing architecture, including the RAM and the caches,
whereas modular redundancy with more than two replicas can be used for both the memory and the CPU
of the processing architecture.
 
