Tolerating Soft Errors in Processor Cores Using CLEAR (Cross-Layer
  Exploration for Architecting Resilience) by Cheng, Eric et al.
 
Abstract— We present CLEAR (Cross-Layer Exploration for 
Architecting Resilience), a first of its kind framework which 
overcomes a major challenge in the design of digital systems that 
are resilient to reliability failures: achieve desired resilience 
targets at minimal costs (energy, power, execution time, area) by 
combining resilience techniques across various layers of the 
system stack (circuit, logic, architecture, software, algorithm). 
This is also referred to as cross-layer resilience. In this paper, we 
focus on radiation-induced soft errors in processor cores. We 
address both single-event upsets (SEUs) and single-event multiple 
upsets (SEMUs) in terrestrial environments. Our framework 
automatically and systematically explores the large space of 
comprehensive resilience techniques and their combinations 
across various layers of the system stack (586 cross-layer 
combinations in this paper), derives cost-effective solutions that 
achieve resilience targets at minimal costs, and provides 
guidelines for the design of new resilience techniques. We 
demonstrate the practicality and effectiveness of our framework 
using two diverse designs: a simple, in-order processor core and a 
complex, out-of-order processor core. Our results demonstrate 
that a carefully optimized combination of circuit-level hardening, 
logic-level parity checking, and micro-architectural recovery 
provides a highly cost-effective soft error resilience solution for 
general-purpose processor cores. For example, a 50× 
improvement in silent data corruption rate is achieved at only 
2.1% energy cost for an out-of-order core (6.1% for an in-order 
core) with no speed impact. However, (application-aware) 
selective circuit-level hardening alone, guided by a thorough 
analysis of the effects of soft errors on application benchmarks, 
provides a cost-effective soft error resilience solution as well 
(with ~1% additional energy cost for a 50× improvement in silent 
data corruption rate).   
Index Terms—Cross-layer resilience; soft errors 
I. INTRODUCTION 
This paper addresses the cross-layer resilience challenge for designing robust digital systems: given a set of resilience techniques at various abstraction layers (circuit, logic, architecture, software, algorithm), how does one protect a given design from radiation-induced soft errors using (perhaps) a combination of these techniques, across multiple abstraction layers, such that overall soft error resilience targets are met at minimal costs (energy, power, execution time, area)? Specific soft error resilience targets addressed in this paper are: Silent Data Corruption (SDC), where an error causes the system to output an incorrect result without error indication; and Detected but Uncorrected Error (DUE), where an error is detected (e.g., by a resilience technique or a system crash or hang) but is not recovered automatically without user intervention.

This research is supported in part by DARPA, DTRA, NSF, and SRC. The
views expressed are those of the authors and do not reflect the official policy
or position of the Department of Defense or the U.S. Government.
E. Cheng, S. Mirkhani, and S. Mitra are with Stanford University (e-mail:
eccheng@stanford.edu).
H. Cho is with Hongik University.
L. G. Szafaryn, K. Skadron, and M. R. Stan are with The University of
Virginia.
C.-Y. Cher and P. Bose are with IBM T. J. Watson Research Center.
K. Lilja is with Robust Chip, Inc.
J. A. Abraham is with The University of Texas at Austin.

The need for cross-layer resilience, where multiple error 
resilience techniques from different layers of the system stack 
cooperate to achieve cost-effective error resilience, is 
articulated in several publications (e.g., [1], [2], [3], [4], [5], 
[6], [7]). 
There are numerous publications on error resilience 
techniques, many of which span multiple abstraction layers. 
These publications mostly describe specific implementations. Examples include structural integrity checking [8] and its derivatives (mostly spanning architecture and software layers) or the combined use of circuit hardening, error detection (e.g., using logic parity checking and residue codes) and instruction-level retry [9], [10], [11] (spanning circuit, logic, and architecture layers). Cross-layer resilience implementations in commercial systems are often based on “designer experience” or “historical practice.” There exists no comprehensive framework to systematically address the cross-layer resilience challenge. Creating such a framework is difficult. It must encompass the entire design flow end-to-end, from comprehensive and thorough analysis of various combinations of error resilience techniques all the way to layout-level implementations, such that one can (automatically) determine which resilience technique or combination of techniques (either at the same abstraction layer or across different abstraction layers) should be chosen. Such a framework is essential in order to answer important cross-layer resilience questions such as:
1) Is cross-layer resilience the best approach for achieving a given resilience target at low cost?
2) Are all cross-layer solutions equally cost-effective? If not, which cross-layer solutions are the best?
3) How do cross-layer choices change depending on application-level energy, latency, and area constraints?
4) How can one create a cross-layer solution that is cost-effective across a wide variety of application workloads?
5) Are there general guidelines for new error resilience techniques to be cost-effective?
We present CLEAR (Cross-Layer Exploration for Architecting Resilience), a first of its kind framework, which addresses the cross-layer resilience challenge. In this paper, we focus on the use of CLEAR for radiation-induced soft errors (footnote 1) in terrestrial environments. Although the soft error rate of an SRAM cell or a flip-flop stays roughly constant or even decreases over technology generations, the system-level soft error rate increases with increased integration [12], [13], [14], [15]. Moreover, soft error rates can increase when lower supply voltages are used to improve energy efficiency [16], [17]. We focus on flip-flop soft errors because design techniques to protect them are generally expensive. Coding techniques are routinely used for protecting on-chip memories. Combinational logic circuits are significantly less susceptible to soft errors and do not pose a concern [18], [14]. We address both single-event upsets
 
1 Other error sources (voltage noise and circuit-aging) may be incorporated 
into CLEAR, but are not the focus of this paper. 
Eric Cheng, Student Member, IEEE, Shahrzad Mirkhani, Member, IEEE, Lukasz G. Szafaryn, Chen-Yong Cher, Senior 
Member, IEEE, Hyungmin Cho, Kevin Skadron, Fellow, IEEE, Mircea R. Stan, Fellow, IEEE, Klas Lilja, Member, IEEE, 
Jacob A. Abraham, Fellow, IEEE, Pradip Bose, Fellow, IEEE, and Subhasish Mitra, Fellow, IEEE
(SEUs) and single-event multiple upsets (SEMUs) [19], [17]. 
While CLEAR can address soft errors in various digital 
components of a complex System-on-a-Chip (including 
uncore components [20] and hardware accelerators), a detailed analysis of soft errors in all these components is beyond the 
scope of this paper. Hence, we focus on soft errors in 
processor cores. 
To demonstrate the effectiveness and practicality of 
CLEAR, we explore 586 cross-layer combinations using ten 
representative error detection/correction techniques and four 
hardware error recovery techniques. These techniques span 
various layers of the system stack: circuit, logic, architecture, 
software, and algorithm (Fig. 1). Our extensive cross-layer 
exploration encompasses over 9 million flip-flop soft error injections into two diverse processor core architectures (Table 
I): a simple in-order SPARC Leon3 core (InO-core) and a 
complex super-scalar out-of-order Alpha IVM core (OoO-
core), across 18 benchmarks: SPECINT2000 [21] and 
DARPA PERFECT [22]. Such extensive exploration enables 
us to conclusively answer the cross-layer resilience questions: 
1) For a wide range of error resilience targets, optimized cross-layer combinations can provide low cost solutions for soft errors.
2) Not all cross-layer solutions are cost-effective.
  a) For general-purpose processor cores, a carefully optimized combination of selective circuit-level hardening, logic-level parity checking, and micro-architectural recovery provides a highly effective cross-layer resilience solution. For example, a 50× SDC improvement (defined in Sec. II.A) is achieved at 2.1% and 6.1% energy costs for the OoO- and InO-cores, respectively. The use of selective circuit-level hardening and logic-level parity checking is guided by a thorough analysis of the effects of soft errors on application benchmarks.
  b) When the application space can be restricted to matrix operations, a cross-layer combination of Algorithm Based Fault Tolerance (ABFT) correction, selective circuit-level hardening, logic-level parity checking, and micro-architectural recovery can be highly effective. For example, a 50× SDC improvement is achieved at 1.9% and 3.1% energy costs for the OoO- and InO-cores, respectively. But, this approach may not be practical for general-purpose processor cores targeting general applications.
  c) (Application-aware) selective circuit-level hardening, guided by a thorough analysis of the effects of soft errors on application benchmarks, provides a highly effective soft error resilience approach. For example, a 50× SDC improvement is achieved at 3.1% and 7.3% energy costs for the OoO- and InO-cores, respectively.
3) The above conclusions about cost-effective soft error resilience techniques largely hold across various application characteristics (e.g., latency constraints despite errors in soft real-time applications).
4) Selective circuit-level hardening (and logic-level parity
checking) techniques are guided by the analysis of the 
effects of soft errors on application benchmarks. Hence, 
one must address the challenge of potential mismatch 
between application benchmarks vs. applications in the 
field, especially when targeting high degrees of resilience 
(e.g., 10× or more SDC improvement). We overcome this 
challenge using various flavors of circuit-level hardening 
techniques (details in Sec. IV). 
5) Cost-effective resilience approaches discussed above provide bounds that new soft error resilience techniques 
must achieve to be competitive. It is, however, crucial that 
the benefits and costs of new techniques are evaluated 
thoroughly and correctly before publication. 
This paper extends work published in [23] by:  
1) Providing in-depth analysis of existing soft error detection and correction techniques to identify the basis of hidden costs, pinpoint the specific errors and flip-flop locations covered, and understand design intricacies that impact cost and reliability improvement for each technique.
2) Detailing results that were previously omitted for space constraints to reinforce our conclusion that many existing resilience techniques do not generate cost-effective cross-layer solutions since significant augmentation, using circuit- and logic-level techniques, is required regardless.
3) Considering the limitations and efficiency losses (cost and resilience improvement) for cross-layer solutions when working with fixed hardware (e.g., unmodifiable designs supplied by an upstream vendor, commercial off-the-shelf products, or existing platforms mandated by a customer).
II. CLEAR FRAMEWORK 
Fig. 1 gives an overview of the CLEAR framework. Individual components of the framework are discussed below.
Fig. 1.  CLEAR framework: (a) BEE3 emulation cluster / Stampede supercomputer injects over 9 million errors into two diverse processor architectures running 18 full-length application benchmarks. (b) Accurate physical design evaluation accounts for resilience overheads. (c) Comprehensive resilience library consisting of ten error detection / correction techniques + four hardware error recovery techniques. (d) Example illustrating thorough exploration of 586 cross-layer combinations with varying energy costs vs. percentage of SDC-causing errors protected.
[Figure panels: (a) Reliability analysis / execution time evaluation: BEE3 FPGA emulation cluster and Stampede supercomputer, running RTL and benchmarks. (b) Physical design evaluation: Synopsys Design Compiler, IC Compiler, PrimeTime; 28nm library cells and SRAM compiler. (c) Resilience library. Error detection / correction: Alg.: (1) ABFT correction, (2) ABFT detection; SW: (3) Assertions, (4) CFCSS, (5) EDDI; Arch.: (6) DFC, (7) Monitor core; Logic: (8) Parity checking; Circuit: (9) LEAP-DICE, (10) EDS. Recovery: (1) Instruction Replay (IR), (2) Extended IR (EIR), (3) Flush, (4) Reorder Buffer (RoB). (d) Cross-layer evaluation of 586 total combinations: energy cost (%) vs. % SDC-causing errors protected. Reported metrics: reliability, area, power, energy, clock frequency, execution time.]
A. Reliability Analysis
CLEAR is not merely an error rate projection tool; rather, reliability analysis is one component of the overall framework. We use flip-flop soft error injections for reliability analysis with respect to radiation-induced soft errors. This is because radiation test results confirm that injection of single bit-flips into flip-flops closely models soft error behaviors in actual systems [24], [25]. Furthermore, flip-flop-level error injection is crucial since naïve high-level error injections can be highly inaccurate [26]. For individual flip-flops, both SEUs and SEMUs (i.e., multi-node upsets within a flip-flop) manifest as single-bit errors. Based on measured probabilities for multi-cell upsets in SRAM cell arrays, it can be concluded that the
probability of a SEMU causing multi-bit upsets across 
multiple flip-flops is much less than 1% of the single-bit upset 
probability [14], [15]. Regardless, our SEMU-tolerant circuit
hardening and our layout implementations ensure that baseline
and resilient designs fully consider the impact of SEMUs.
 We injected over 9 million flip-flop soft errors into the RTL 
of the processor designs using three BEE3 FPGA emulation 
systems and also using mixed-mode simulations on the Stampede supercomputer (TACC at The University of Texas 
at Austin) (similar to [26], [27], [28], [29]). This ensures that 
error injection results have a margin of error below 0.1%
at a 95% confidence level per benchmark. Errors are
injected uniformly into all flip-flops and application regions, 
to mimic real world scenarios. 
 The SPECINT2000 [21] and DARPA PERFECT [22] benchmark suites are used for evaluation (footnote 2). The PERFECT suite complements SPEC by adding applications targeting signal and image processing domains. We chose the SPEC workloads since the original publications corresponding to the resilience techniques used them for evaluation. We ran benchmarks in their entirety.
Flip-flop soft errors can result in the following outcomes [26], [30], [25], [29], [31]:
Vanished - normal termination, and output files match error-free runs.
Output Mismatch (OMM) - normal termination, but output files are different from error-free runs.
Unexpected Termination (UT) - program terminates abnormally.
Hang - no termination or output within 2× the nominal execution time.
Error Detection (ED) - an employed resilience technique flags an error, but the error is not recovered using a hardware recovery mechanism.
Using the above outcomes, any error that results in OMM causes SDC (i.e., an SDC-causing error). Any error that results in UT, Hang, or ED causes DUE (i.e., a DUE-causing error). Note that there are no ED outcomes if no error detection technique is employed. The resilience of a protected (new) design compared to an unprotected (original, baseline) design can be defined in terms of SDC improvement (Eq. 1a) or DUE improvement (Eq. 1b). The susceptibility of flip-flops to soft errors is assumed to be uniform across all flip-flops in the design (but this parameter is adjustable in our framework).
Resilience techniques that increase the execution time of an application or add additional hardware also increase the susceptibility of the design to soft errors. To accurately account for this situation, we calculate, based on [32], a correction factor γ (where γ ≥ 1), which is applied to ensure a fair and accurate comparison for all techniques (footnote 3). Take for instance the monitor core technique; in our implementation, it increases the number of flip-flops in a resilient OoO-core by 38%. These extra flip-flops become additional locations for soft errors to occur. This results in a γ correction of 1.38 in order to account for the increased soft error susceptibility of the design. Techniques which increase execution time have a similar impact. For example, CFCSS incurs a 40.6% execution time impact, which corresponds to a γ correction of 1.41. A technique such as DFC, which increases flip-flop count (20%) and execution time (6.2%), would need a γ correction of 1.28
 
2 11 SPEC / 7 PERFECT benchmarks for InO-cores and 8 SPEC / 3 
PERFECT for OoO-cores (missing benchmarks contain floating-point 
instructions not executable by the OoO-core RTL model). 
3 Research literature commonly considers γ=1. We report results using true 
γ values, but our conclusions hold for γ=1 as well (the latter is optimistic). 
(1.2×1.062) since the impact is multiplicative (increased flip-
flop count over an increased duration). The γ correction factor 
allows us to account for these increased susceptibilities for fair 
and accurate comparisons of all resilience techniques 
considered [32]. SDC and DUE improvements with γ=1 can
be back-calculated by multiplying the reported improvements by the
corresponding γ value in Table III; doing so does not change our conclusions.
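$$\text{SDC improvement} = \frac{\text{original OMM count}}{\text{new OMM count}} \times \gamma^{-1} \tag{1a}$$
$$\text{DUE improvement} = \frac{\text{original (UT + Hang) count}}{\text{new (UT + Hang + ED) count}} \times \gamma^{-1} \tag{1b}$$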
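As an illustration of Eq. 1 and of the γ correction, the following minimal Python sketch counts injection outcomes and computes both improvements. The function names, the outcome labels as strings, and the injection counts are assumptions made for this example only; they are not taken from the CLEAR implementation or from measured data.

```python
from collections import Counter

# Outcome classes as defined in Sec. II.A (string labels are an assumption).
SDC_OUTCOMES = {"OMM"}
DUE_OUTCOMES = {"UT", "Hang", "ED"}


def gamma(ff_increase=0.0, exec_time_increase=0.0):
    """Correction factor: more flip-flops exposed over a longer run (multiplicative)."""
    return (1.0 + ff_increase) * (1.0 + exec_time_increase)


def improvement(original_outcomes, new_outcomes, targets, g=1.0):
    """Eq. 1: ratio of error counts in the target outcome classes, scaled by 1/gamma."""
    orig = sum(n for o, n in Counter(original_outcomes).items() if o in targets)
    new = sum(n for o, n in Counter(new_outcomes).items() if o in targets)
    if new == 0:
        return float("inf")
    return (orig / new) / g


# Hypothetical injection campaigns (1,000 injections each), for illustration only.
baseline = ["Vanished"] * 900 + ["OMM"] * 60 + ["UT"] * 30 + ["Hang"] * 10
protected = ["Vanished"] * 900 + ["OMM"] * 3 + ["UT"] * 30 + ["Hang"] * 10 + ["ED"] * 57

g = gamma(ff_increase=0.20, exec_time_increase=0.062)  # flip-flop and time increases
print("gamma:", g)
print("SDC improvement:", improvement(baseline, protected, SDC_OUTCOMES, g))
print("DUE improvement:", improvement(baseline, protected, DUE_OUTCOMES, g))
```

In this made-up campaign, detection without recovery turns most OMM outcomes into ED outcomes, so the SDC improvement rises while the DUE improvement falls below 1×, which is the trend Table III reports for detection-only techniques.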
Reporting SDC and DUE improvements allows our results 
to be agnostic to absolute error rates (we analyze the system-level behavior given a soft error). This allows a designer the 
flexibility to target the relative improvement needed in order to achieve an error rate for his or her specific application and technology. Although we have described the use of error injection-driven reliability analysis, the modular nature of CLEAR allows us to swap in other approaches as appropriate (e.g., our error injection analysis could be substituted with techniques like [33], once they are properly validated). As shown in Table II, across our set of applications, not all flip-flops will have errors that result in SDC or DUE (errors in 19% of flip-flops in the InO-core and 39% of flip-flops in the OoO-core always vanish regardless of the application). The logic design structures (e.g., lowest hierarchical-level RTL component) these flip-flops belong to are listed in [34]. This phenomenon has been documented in the literature [35] and is due to the fact that errors that impact certain structures (e.g., branch predictor, debug registers, etc.) have no effect on program execution or correctness. Additionally, this means that resilience techniques would not normally need to be applied to these flip-flops. However, for completeness, we also report design points which would achieve the maximum improvement possible, where resilience is added to every single flip-flop (including those with errors that always vanish). This maximum improvement point provides an upper bound for cost (given the possibility that for a future application, a flip-flop that currently has errors that always 
vanish may encounter an SDC- or DUE-causing error). 
Error Detection Latency (the time elapsed from when an error occurs in the processor to when a resilience technique detects the error) is also an important aspect to consider. An end-to-end reliable system must not only detect errors, but also recover from these detected errors. Long detection latencies impact the amount of computation that needs to be recovered and can also limit the types of recovery that are capable of recovering the detected error (Sec. II.D).  
B. Execution Time 
Execution time is estimated using FPGA emulation and RTL simulation. Applications are run to completion to accurately capture the execution time of an unprotected design. We also report the error-free execution time impact associated with resilience techniques at the architecture, software, and algorithm levels. For resilience techniques at the circuit and logic levels, our design methodology maintains the same clock speed as the unprotected design.  
C. Physical Design 
We used Synopsys design tools (Design Compiler, IC Compiler, and PrimeTime) [36] with a commercial 28nm technology library (with corresponding SRAM compiler) to perform synthesis, place-and-route, and power analysis. Synthesis and place-and-route (SP&R) was run for all configurations of the design (before and after adding resilience techniques) to ensure all constraints of the original design (e.g., timing and physical design) were met for the resilient designs. Design tools often introduce artifacts (e.g., slight variations in the final design over multiple SP&R runs) that impact the final design characteristics (e.g., area, power). These artifacts can be caused by small variations in the RTL or optimization heuristics, for example. To account for these artifacts, we generated separate resilient designs based on error injection results for each individual application benchmark. SP&R was then performed for each of these designs, and the reported design characteristics were averaged to minimize the artifacts. For example, for each of our 18 application benchmarks, a separate resilient design that achieves a 50× SDC improvement using LEAP-DICE only is created. The costs to achieve this improvement are reported by averaging across the 18 designs. The relative standard deviation (i.e., standard deviation / mean) across all experiments ranges from 0.6% to 3.1%. Finally, we note that all layouts created during physical design are carefully generated in order to mitigate the impact of SEMUs (as explained in Sec. II.D).

TABLE I. PROCESSOR DESIGNS STUDIED
Core | Design | Description | Clk. freq. | Error injections | Instructions per cycle
InO | Leon3 [80] | Simple, in-order (1,250 flip-flops) | 2.0 GHz | 5.9 million | 0.4
OoO | IVM [28] | Complex, super-scalar, out-of-order (13,819 flip-flops) | 600 MHz | 3.5 million | 1.3

TABLE II. DISTRIBUTION OF FLIP-FLOPS WITH ERRORS RESULTING IN SDC AND/OR DUE OVER ALL BENCHMARKS STUDIED
Core | % FFs with SDC-causing errors | % FFs with DUE-causing errors | % FFs with both SDC- and DUE-causing errors
InO | 60.1% | 78.3% | 81.2%
OoO | 35.7% | 52.1% | 61%
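The per-benchmark flow described in this subsection (one selectively hardened design per benchmark, with costs averaged afterwards) can be sketched in Python as follows. This is only a simplified analytical stand-in under stated assumptions, not the SP&R-based flow used in this paper: the greedy selection rule, the function names, and the uniform per-flip-flop energy model are all hypothetical, and γ is ignored. Only the two LEAP-DICE constants come from Table IV.

```python
# Relative LEAP-DICE values from Table IV; everything else here is illustrative.
LEAP_DICE_SER = 2e-4      # soft error rate relative to a baseline flip-flop
LEAP_DICE_ENERGY = 1.8    # energy relative to a baseline flip-flop


def select_flip_flops(sdc_counts, target):
    """Greedily harden the flip-flops with the most SDC-causing errors until
    (original SDC count) / (residual SDC count) reaches the target."""
    total = sum(sdc_counts.values())
    residual = float(total)
    hardened = []
    for ff, count in sorted(sdc_counts.items(), key=lambda kv: kv[1], reverse=True):
        if residual == 0 or total / residual >= target:
            break
        # Hardening removes all but a LEAP_DICE_SER fraction of this flip-flop's errors.
        residual -= count * (1.0 - LEAP_DICE_SER)
        hardened.append(ff)
    return hardened


def flip_flop_energy_overhead(hardened, n_flip_flops):
    """Extra flip-flop energy as a fraction of total baseline flip-flop energy
    (assumes every baseline flip-flop consumes the same energy)."""
    return len(hardened) * (LEAP_DICE_ENERGY - 1.0) / n_flip_flops


def average_overhead(per_benchmark_counts, n_flip_flops, target):
    """Build one hardened design per benchmark, then average the overheads."""
    overheads = [
        flip_flop_energy_overhead(select_flip_flops(counts, target), n_flip_flops)
        for counts in per_benchmark_counts
    ]
    return sum(overheads) / len(overheads)
```

The overhead returned here covers flip-flop energy only; design-level costs such as those in Table III additionally depend on combinational logic, clocking, and layout, which is why the paper relies on full SP&R runs.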
D. Resilience Library 
We carefully chose ten error detection and correction techniques together with four hardware error recovery techniques. These techniques largely cover the space of 
 
4 Circuit and logic techniques have tunable resilience achieved using 
selective insertion guided by error injection using application benchmarks. 
5 Maximum improvement corresponds to protecting every design flip-flop. 
6 Software techniques are only for InO-cores since LLVM removed 
support for the Alpha architecture. 
7 Some software assertions (e.g., [49]) suffer from false positives. Reported 
execution time impact discounts impact of false positives. 
8 Results differ from [49] since we use accurate flip-flop-level error 
injection. [19] demonstrated that architecture register injection used in [49] 
can be highly inaccurate. 
9 Actual detection latency may be shorter in practice. Our emulation 
platform reports time to trap and exit as well (order of few thousand cycles). 
10 EDDI with store-readback [53]. Without this enhancement, EDDI 
provides 3.3× SDC / 0.4× DUE improvement. 
11 Execution time impact for ABFT detection may be high due to 
computationally expensive error detection checks. 
12 EDS costs are for flip-flop only. Error signal routing and delay buffers 
(included in Table XVI) increase cost [42]. 
existing soft error resilience techniques. The characteristics (e.g., costs, resilience improvement, etc.) of each technique when used as a standalone solution (e.g., an error detection / correction technique by itself or, optionally, in conjunction with a recovery technique) are presented in Table III.
Circuit: The hardened flip-flops (LEAP-DICE, Light Hardened LEAP, LEAP-ctrl) in Table IV are designed to tolerate both SEUs and SEMUs at both nominal and near-threshold operating voltages [19], [37]. SEMUs especially impact circuit techniques since a single strike affects multiple nodes within a flip-flop. Thus, these specially designed hardened flip-flops, which tolerate SEMUs through charge cancellation, are required. Hardened flip-flops have been experimentally validated using radiation experiments on test chips fabricated in 90nm, 45nm, 40nm, 32nm, 28nm, 20nm, and 14nm nodes in both bulk and SOI technologies and can be incorporated into standard cell libraries (i.e., standard cell design guidelines are satisfied) [19], [37], [38], [39], [40], [41]. The LEAP-ctrl flip-flop is a special design, which can operate in resilient (high resilience, high power) and economy (low resilience, low power) modes. It is useful when a software or algorithm technique only provides protection while specific applications are running; selectively enabling low-level hardware resilience when those techniques are unavailable may then be beneficial. Although a variety of hardened flip-flops are available in the literature and can be used in our framework (see Additional Techniques), LEAP was chosen as a case study due to the extensive characterization data available.
While Error Detection Sequential (EDS) [42], [43] was originally designed to detect timing errors, it can be used to detect flip-flop soft errors as well. While EDS incurs less overhead at the individual flip-flop level vs. LEAP-DICE, for example, EDS requires delay buffers to ensure minimum hold constraints, aggregation and routing of error detection signals to an output (or recovery module), and a recovery mechanism to correct detected errors. These factors can significantly increase the overall costs for implementing a resilient design utilizing EDS (Table XVI).
Logic: Parity checking provides error detection by checking
TABLE IV. RESILIENT FLIP-FLOPS
Type | Soft Error Rate | Area | Power | Delay | Energy
Baseline | 1 | 1 | 1 | 1 | 1
Light Hardened LEAP (LHL) | 2.5×10^-1 | 1.2 | 1.1 | 1.2 | 1.3
LEAP-DICE | 2×10^-4 | 2 | 1.8 | 1 | 1.8
LEAP-ctrl (economy mode) | 1 | 3.1 | 1.2 | 1 | 1.2
LEAP-ctrl (resilient mode) | 2×10^-4 | 3.1 | 2.2 | 1 | 2.2
EDS (fn. 12) | ~100% detect | 1.5 | 1.4 | 1 | 1.4
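TABLE III. INDIVIDUAL RESILIENCE TECHNIQUES: COSTS AND IMPROVEMENTS AS A STANDALONE SOLUTION
(Footnote markers are written as (fn. N). “=” marks a cell that shares the value of the InO row of the same technique.)
Layer | Technique | Core | Area cost | Power cost | Energy cost | Exec. time impact | Avg. SDC improve | Avg. DUE improve | False positive | Detection latency | γ
Circuit (fn. 4) | LEAP-DICE (no additional recovery needed) | InO | 0-9.3% | 0-22.4% | 0-22.4% | 0% | 1× - 5,000× (fn. 5) | 1× - 5,000× (fn. 5) | 0% | n/a | 1
Circuit (fn. 4) | LEAP-DICE (no additional recovery needed) | OoO | 0-6.5% | 0-9.4% | 0-9.4% | = | = | = | = | = | =
Circuit (fn. 4) | EDS (without recovery - unconstrained) | InO | 0-10.7% | 0-22.9% | 0-22.9% | 0% | 1× - >100,000× (fn. 5) | 0.1× (fn. 5) - 1× | 0% | 1 cycle | 1
Circuit (fn. 4) | EDS (without recovery - unconstrained) | OoO | 0-12.2% | 0-11.5% | 0-11.5% | = | = | = | = | = | =
Circuit (fn. 4) | EDS (with IR recovery) | InO | 0-16.7% | 0-43.9% | 0-43.9% | 0% | 1× - >100,000× (fn. 5) | 1× - >100,000× (fn. 5) | 0% | 1 cycle | 1.4
Circuit (fn. 4) | EDS (with IR recovery) | OoO | 0-12.3% | 0-11.6% | 0-11.6% | = | = | = | = | = | 1.06
Logic (fn. 4) | Parity (without recovery - unconstrained) | InO | 0-10.9% | 0-23.1% | 0-23.1% | 0% | 1× - >100,000× (fn. 5) | 0.1× (fn. 5) - 1× | 0% | 1 cycle | 1
Logic (fn. 4) | Parity (without recovery - unconstrained) | OoO | 0-14.1% | 0-13.6% | 0-13.6% | = | = | = | = | = | =
Logic (fn. 4) | Parity (with IR recovery) | InO | 0-26.9% | 0-44% | 0-44% | 0% | 1× - >100,000× (fn. 5) | 1× - >100,000× (fn. 5) | 0% | 1 cycle | 1.4
Logic (fn. 4) | Parity (with IR recovery) | OoO | 0-14.2% | 0-13.7% | 0-13.7% | = | = | = | = | = | 1.06
Arch. | DFC (without recovery - unconstrained) | InO | 3% | 1% | 7.3% | 6.2% | 1.2× | 0.5× | 0% | 15 cycles | 1.28
Arch. | DFC (without recovery - unconstrained) | OoO | 0.2% | 0.1% | 7.2% | 7.1% | = | = | = | = | 1.09
Arch. | DFC (with EIR recovery) | InO | 37% | 33% | 41.2% | 6.2% | 1.2× | 1.4× | 0% | 15 cycles | 1.48
Arch. | DFC (with EIR recovery) | OoO | 0.4% | 0.2% | 7.3% | 7.1% | = | = | = | = | 1.14
Arch. | Monitor core (with RoB recovery) | OoO | 9% | 16.3% | 16.3% | 0% | 19× | 15× | 0% | 128 cycles | 1.38
Software (fn. 6) | Software assertions for general-purpose processors (without recovery - unconstrained) | InO | 0% | 0% | 15.6% | 15.6% (fn. 7) | 1.5× (fn. 8) | 0.6× | 0.003% | 9.3M cycles (fn. 9) | 1.16
Software (fn. 6) | CFCSS (without recovery - unconstrained) | InO | 0% | 0% | 40.6% | 40.6% | 1.5× | 0.5× | 0% | 6.2M cycles (fn. 9) | 1.41
Software (fn. 6) | EDDI (without recovery - unconstrained) | InO | 0% | 0% | 110% | 110% | 37.8× (fn. 10) | 0.3× | 0% | 287K cycles (fn. 9) | 2.1
Alg. | ABFT correction (no additional recovery needed) | InO, OoO | 0% | 0% | 1.4% | 1.4% | 4.3× | 1.2× | 0% | n/a | 1.01
Alg. | ABFT detection (without recovery - unconstrained) | InO, OoO | 0% | 0% | 24% | 1-56.9% (fn. 11) | 3.5× | 0.5× | 0% | 9.6M cycles (fn. 9) | 1.24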
flip-flop inputs and outputs [44]. Our design heuristics reduce the cost of parity while also ensuring that clock frequency is maintained as in the original design (by varying the number of flip-flops checked together, grouping flip-flops by timing slack, pipelining parity checker logic, etc.). Naïve implementations of parity checking can otherwise degrade design frequency by up to 200 MHz (20%) or increase energy cost by 80% on the InO-core.
We minimize SEMUs through layouts that ensure a minimum spacing (the size of one flip-flop) between flip-flops checked by the same parity checker. This ensures that only one flip-flop, in a group of flip-flops checked by the same parity checker, will encounter an upset due to a single strike in our 28nm technology in terrestrial environments [45]. Although a single strike could impact multiple flip-flops, since these flip-flops are checked by different checkers, the upsets will be detected. Since this absolute minimum spacing will remain constant, the relative spacing required between flip-flops will increase at smaller technology nodes, which may exacerbate the difficulty of implementation. Minimum spacing is enforced by applying design constraints during the layout stage. This constraint is important because even in large designs, flip-flops will still tend to be placed very close to one another. Table V shows the distribution of distances from each flip-flop to its nearest neighbor in a baseline design (this does not correspond to the spacing between flip-flops checked by the same logic parity checker). As shown, the majority of flip-flops are actually placed such that they would be susceptible to a SEMU. After applying parity checking, we see that no flip-flop, within a group checked by the same parity checker, is placed such that it will be vulnerable to a SEMU (Table VI).
Logic parity is implemented using an XOR-tree based predictor and checker, which detects flip-flop soft errors. This implementation differs from logic parity prediction, which also targets errors inside combinational logic [46]. XOR-tree logic parity is sufficient for detecting flip-flop soft errors (with the minimum spacing constraint applied). “Pipelining” in the predictor tree (Fig. 2) may be required to ensure 0% clock period impact. We evaluated the following heuristics for forming parity groups (the specific flip-flops that are checked together) to minimize cost of parity (cost comparisons in Table VII):
1) Parity group size: flip-flops are clustered into groups of a constant power-of-2 size, which amortizes the parity logic cost by allowing the use of full binary trees at the predictor and checker. The last group holds the remaining flip-flops (flip-flop count modulo group size).
2) Vulnerability: flip-flops are sorted by decreasing susceptibility to errors causing SDC or DUE and placed into groups of a constant power-of-2 size. The last group holds the remaining flip-flops (flip-flop count modulo group size).
3) Locality: flip-flops are grouped by their location in the layout, in which flip-flops in the same functional unit are grouped together to help reduce wire routing for the predictor and checker logic. Groups of a constant power-of-2 size are formed, with the last group holding the remaining flip-flops (flip-flop count modulo group size).
4) Timing: flip-flops are sorted based on their available timing path slack and placed into groups of a constant power-of-2 size. The last group holds the remaining flip-flops (flip-flop count modulo group size).
5) Optimized: Fig. 3 describes our heuristic. Our solution is the most optimized and is the configuration we use to report overhead values.
When unpipelined parity can be used, it is better to use larger-sized groups (e.g., 32-bit groups) in order to amortize the additional predictor/checker logic over the number of flip-flops protected. However, when pipelined parity is required, we found 16-bit groups to be a good option. This is because beyond 16 bits, additional pipeline flip-flops begin to dominate costs. These factors have driven our implementation of the previously described heuristics.
Architecture: Our implementation of Data Flow Checking (DFC), which checks static dataflow graphs, includes Control
Flow Checking (CFC), which checks static control-flow graphs. This combination checker resembles that of [47], which is also similar to the checker in [8]. Compiler optimization embeds the static signatures required by the checkers into unused delay slots in the software, thereby reducing execution time overhead by 13%. Table VIII helps explain why DFC is unable to provide high SDC and DUE improvement. Of flip-flops that have errors that result in SDCs and DUEs (Sec. II.A), DFC checkers detect SDCs and DUEs in less than 68% of these flip-flops 
Fig. 2.  “Pipelined” logic parity.
[Diagram labels: Predictor; Checker; Comb. logic; Parity group (4-32 FF size); Maintain clock period. Legend: original components, parity components, pipeline flip-flops.]
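The predictor/checker arrangement of Fig. 2 can be illustrated with a small behavioral model. The following Python sketch is a simplification under stated assumptions: the function names (xor_tree, predict_parity, check_parity, flip) are hypothetical, the flip-flop values of one parity group are modeled as a list of bits, and timing and pipelining are not modeled. It only illustrates why a single upset in a parity group is detected while two upsets in the same group are not, which is the motivation for the minimum spacing constraint.

```python
from functools import reduce
from operator import xor


def xor_tree(bits):
    """Reduce a list of 0/1 values with XOR (behavior of an XOR tree)."""
    return reduce(xor, bits, 0)


def predict_parity(next_state_bits):
    """Predictor: parity of the values about to be latched by the group."""
    return xor_tree(next_state_bits)


def check_parity(stored_bits, stored_parity):
    """Checker: True if an error is detected in the latched group."""
    return xor_tree(stored_bits) != stored_parity


def flip(bits, positions):
    """Model soft errors by inverting the bits at the given positions."""
    return [b ^ 1 if i in positions else b for i, b in enumerate(bits)]


group = [1, 0, 1, 1, 0, 0, 1, 0]   # hypothetical 8-bit parity group
parity = predict_parity(group)

no_upset = check_parity(group, parity)
single_upset = check_parity(flip(group, {3}), parity)
double_upset = check_parity(flip(group, {3, 4}), parity)
print(no_upset, single_upset, double_upset)
```

In this model the single upset is flagged, whereas the double upset within one group leaves the parity unchanged and escapes detection.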
TABLE V. DISTRIBUTION OF SPACING BETWEEN A FLIP-FLOP AND ITS NEAREST NEIGHBOR IN A BASELINE (ORIGINAL, UNPROTECTED) DESIGN
Distance                                                                             InO-core   OoO-core
< 1 flip-flop length away (i.e., flip-flops are adjacent and vulnerable to a SEMU)   65.2%      42.2%
1 - 2 flip-flop lengths away                                                         30%        30.6%
2 - 3 flip-flop lengths away                                                         3.7%       18.4%
3 - 4 flip-flop lengths away                                                         0.6%       3.5%
> 4 flip-flop lengths away                                                           0.5%       5.3%
Fig. 3.  Logic parity heuristic for low cost parity implementation. 32-bit unpipelined parity and 16-bit pipelined parity were experimentally determined to be the lowest cost configurations.
[Flowchart: Start from the set of all flip-flops in the design. Decision: enough timing slack for a 32-bit predictor tree? If yes: implement unpipelined parity and group flip-flops, by functional unit, into 32-bit groups; the last group is (size % 32) (locality heuristic). If no: implement pipelined parity and group flip-flops, by functional unit, into 16-bit groups; the last group is (size % 16) (locality heuristic). Finish.]
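The flow of Fig. 3 can be sketched in a few lines of Python. This is a minimal sketch under assumptions, not the implementation used for the reported results: the function name form_parity_groups, the tuple layout of a flip-flop record, and the tree32_delay parameter are hypothetical, and the timing-slack decision is assumed to be made per flip-flop.

```python
from collections import defaultdict


def form_parity_groups(flip_flops, tree32_delay):
    """Group flip-flops following the flow of Fig. 3.

    flip_flops: iterable of (name, functional_unit, timing_slack) tuples.
    tree32_delay: delay of a 32-bit predictor tree (same unit as the slack).
    Returns a list of (kind, unit, [names]); kind is 'unpipelined' or 'pipelined'.
    """
    buckets = defaultdict(list)
    for name, unit, slack in flip_flops:
        # Enough slack for a 32-bit predictor tree -> unpipelined parity.
        kind = "unpipelined" if slack > tree32_delay else "pipelined"
        buckets[(kind, unit)].append(name)

    groups = []
    for (kind, unit), names in buckets.items():
        size = 32 if kind == "unpipelined" else 16
        # Chunk per functional unit (locality); the last chunk is the remainder.
        for start in range(0, len(names), size):
            groups.append((kind, unit, names[start:start + size]))
    return groups


# Hypothetical design: 40 flip-flops with ample slack, 20 with little slack.
ffs = [(f"alu_{i}", "alu", 1.0) for i in range(40)]
ffs += [(f"lsu_{i}", "lsu", 0.1) for i in range(20)]
groups = form_parity_groups(ffs, tree32_delay=0.5)
for kind, unit, names in groups:
    print(kind, unit, len(names))
```

For the hypothetical input above, the 40 flip-flops with sufficient slack form one 32-bit group and one 8-bit remainder group, and the 20 remaining flip-flops form one 16-bit group and one 4-bit remainder group.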
TABLE VIII. DFC ERROR COVERAGE
                                                                                       InO             OoO
                                                                                       SDC     DUE     SDC     DUE
% flip-flops with a SDC- / DUE-causing error that are detected by DFC                  57%     68%     65%     66%
% of SDC- / DUE-causing errors detected (average per FF that is protected by DFC)      30%     30%     29%     40%
Overall % of SDC- / DUE-causing errors detected (for all flip-flops in the design)     15.9%   27%     19.3%   30%
Resulting improvement (Eq. 1)                                                          1.2×    1.4×    1.2×    1.4×
TABLE VII. COMPARISON OF HEURISTICS FOR “PIPELINED” LOGIC PARITY IMPLEMENTATIONS TO PROTECT ALL FLIP-FLOPS ON THE INO-CORE
Heuristic                               Area cost   Power cost   Energy cost
Vulnerability (4-bit parity group)      15.2%       42%          42%
Vulnerability (8-bit parity group)      13.4%       29.8%        29.8%
Vulnerability (16-bit parity group)     13.3%       27.9%        27.9%
Vulnerability (32-bit parity group)     14.6%       35.3%        35.3%
Locality (16-bit parity group)          13.4%       29.4%        29.4%
Timing (16-bit parity group)            11.5%       26.8%        26.8%
Optimized (16-/32-bit groups)           10.9%       23.1%        23.1%
TABLE VI. DISTRIBUTION OF SPACING BETWEEN A FLIP-FLOP AND ITS NEAREST NEIGHBOR IN THE SAME PARITY GROUP (I.E., MINIMUM DISTANCE BETWEEN FLIP-FLOPS CHECKED BY THE SAME PARITY CHECKER)
Distance                                                                             InO-core          OoO-core
< 1 flip-flop length away (i.e., flip-flops are adjacent and vulnerable to a SEMU)   0%                0%
1 - 2 flip-flop lengths away                                                         7.8%              8.8%
2 - 3 flip-flop lengths away                                                         5.3%              10.6%
3 - 4 flip-flop lengths away                                                         3.4%              18.3%
> 4 flip-flop lengths away                                                           83.3%             62.2%
Average distance                                                                     4.4 flip-flops    12.8 flip-flops
(these 68% of flip-flops are distributed across all pipeline 
stages). For these 68% of flip-flops, on average, DFC detects 
less than 40% of the errors that result in SDCs or DUEs. This 
is because not all errors that result in an SDC or DUE will 
corrupt the dataflow or control flow signatures checked by the 
technique (e.g., register contents are corrupted and written out to a file, but the executed instructions remain unchanged). The 
combination of these factors means DFC is only detecting 
~30% of SDCs or DUEs; thus, the technique provides low 
resilience improvement. These results are consistent with 
previously published data (detection of ~16% of non-vanished 
errors) on effectiveness of DFC checkers in simple cores [47]. 
Monitor cores are checker cores that validate instructions executed by the main core (e.g., [48], [8]). We analyze monitor cores similar to [48]. For InO-cores, the size of the monitor core is of the same order as the main core, and hence, excluded from our study. For OoO-cores, the simpler monitor core can have lower throughput compared to the main core and thus stall the main core. We confirm (via IPC estimation) that our monitor core implementation is sufficient to run the required checking operations without stalling the main core (Table IX). Software: Software assertions for general-purpose 
processors13 check program variables to detect errors. We combine assertions from [49], [50] to check both data and control variables to maximize error coverage. Checks for data variables (e.g., end result) are added via compiler transformations using training inputs to determine the valid range of values for these variables (e.g., likely program invariants). Since such assertion checks are added based on training inputs, it is possible to encounter false positives, where an error is reported in an error-free run. We have determined this false positive rate by training the assertions using representative inputs. However, we perform final analysis by incorporating the input data used during evaluation into the training step in order to give the technique the best possible benefit and to eliminate the occurrence of false positives. Checks for control variables (e.g., loop index, stack pointer, array address) are determined using application profiling and are manually added in the assembly code. In Table X, we break down the contribution to cost, improvement, and false positives resulting from assertions checking data variables [50] vs. those checking control variables [49]. Table XI demonstrates the importance of evaluating resilience techniques using accurate error injection (explained in [26]). Depending on the particular error injection model used, SDC improvement could be over-estimated for one benchmark and under-estimated for another. For instance, using inaccurate architecture register error injection (regU), one would be led to believe that software assertions provide 3× the SDC improvement that they provide in reality (e.g., when evaluated using flip-flop-level error injection). In order to pinpoint the sources of inaccuracy between the actual improvement rates that were determined using accurate flip-flop-level error injection vs. those published in the literature, we conducted error injection campaigns at other 
 
13 Same applications studied in [49]. 
levels of abstraction (architecture register and program 
variable). However, even then, we were unable to exactly 
reproduce previously published improvement rates. Some 
additional differences in our architecture and program variable 
injection methodology compared to the published 
methodology may account for this discrepancy:
1) Our architecture register and program variable evaluations were conducted on a SPARCv8 in-order design rather than a SPARCv9 out-of-order design.
2) Our architecture register and program variable methodology injects errors uniformly into all program instructions while previous publications only inject into integer instructions of floating-point benchmarks.
3) Our architecture register and program variable methodology injects errors uniformly over the full application rather than injecting only into the core of the application during computation.
4) Since our architecture register and program variable methodology injects errors uniformly into all possible error candidates (e.g., all cycles and targets), the calculated improvement covers the entire design. Previous publications calculated improvement over the limited subset of error candidates (out of all possible error candidates) that were injected into and thus, only covers a subset of the design. 
Control Flow Checking by Software Signatures (CFCSS) checks static control flow graphs and is implemented via compiler modification similar to [51]. We can analyze CFCSS in further detail to gain deeper understanding as to why improvement for the technique is relatively low (Table XII). Compared to DFC (a technique with a similar concept), we see that CFCSS offers slightly better SDC improvement. However, since CFCSS only checks control flow signatures, many SDCs will still escape (e.g., the result of an add is corrupted and written to file). Additionally, certain DUEs, such as those which may cause a program crash, will not be detectable by CFCSS, or other software techniques, since execution may abort before a corresponding software check can be triggered. The relatively low resilience improvement of CFCSS has been corroborated in actual systems as well [52]. 
Error Detection by Duplicated Instructions (EDDI) provides instruction redundant execution via compiler modification [53]. We utilize EDDI with store-readback [54] to maximize coverage by ensuring that values are written correctly. From Table XIII, it is clear why store-readback is important for EDDI. In order to achieve high SDC improvements, nearly all SDC causing errors need to be detected. By detecting an additional 12% of SDCs, store-readback increases SDC improvement of EDDI by an order of magnitude. Virtually all escaped SDCs are caught by ensuring that the values being written to the output are indeed correct (by reading back the written value). However, given that some SDC- or DUE-causing errors are still not detected by the technique, the results show that using naïve high-level 
TABLE IX. MONITOR CORE VS. MAIN CORE
Design         Clk. freq.   Average Instructions Per Cycle (IPC)
OoO-core       600 MHz      1.3
Monitor core   2 GHz        0.7
TABLE X. COMPARISON OF ASSERTIONS CHECKING DATA (E.G., END RESULT) VS. CONTROL (E.G., LOOP INDEX) VARIABLES
                         Data variable check   Control variable check   Combined check
Execution time impact    12.1%                 3.5%                     15.6%
SDC improvement          1.5×                  1.1×                     1.5×
DUE improvement          0.7×                  0.9×                     0.6×
False positive rate      0.003%                0%                       0.003%
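The data-variable checks and the false positives quantified in Table X can be illustrated with a range-based assertion. The sketch below is a simplified stand-in for the compiler-inserted checks of [50], written under assumptions: the helper names train_range and assertion_fires and all numeric values are hypothetical, and the likely invariant is reduced to a simple (min, max) interval.

```python
def train_range(training_values):
    """Derive a likely invariant (min, max) from values seen in training runs."""
    return min(training_values), max(training_values)


def assertion_fires(value, valid_range):
    """Return True if the assertion reports an error for this value."""
    low, high = valid_range
    return not (low <= value <= high)


# Hypothetical end-result values observed with the training inputs.
valid = train_range([12, 40, 33, 27])

in_range = assertion_fires(35, valid)               # error-free, inside range
corrupted = assertion_fires(35 ^ (1 << 20), valid)  # high-order bit flip
false_positive = assertion_fires(45, valid)         # error-free, unseen value

# Folding the evaluation input into training removes the false positive.
retrained = train_range([12, 40, 33, 27, 45])
after_retraining = assertion_fires(45, retrained)
print(in_range, corrupted, false_positive, after_retraining)
```

The last two lines mirror the final analysis step described in the text, in which the evaluation inputs are incorporated into training so that error-free runs no longer trigger the check.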
TABLE XI. COMPARISON OF SDC IMPROVEMENT FOR ASSERTIONS WHEN INJECTING ERRORS AT VARIOUS LEVELS
App.13   Flip-flop (ground truth)   Register uniform (regU)   Register write (regW)   Program variable uniform (varU)   Program variable write (varW)
bzip2    1.8×                       1.6×                      1.1×                    1.9×                              1.5×
crafty   0.5×                       0.3×                      0.5×                    0.7×                              1.1×
gzip     2×                         19.3×                     1×                      1.6×                              1.1×
mcf      1.1×                       1.3×                      0.9×                    1×                                1.8×
parser   2.4×                       1.7×                      1×                      2.4×                              2×
avg.     1.6×                       4.8×                      0.9×                    1.5×                              1.5×
TABLE XII. CFCSS ERROR COVERAGE
                                                                                         SDC     DUE
% flip-flops with a SDC- / DUE-causing error that is detected by CFCSS                   55%     66%
% of SDC- / DUE-causing errors that are detected per FF that is protected by CFCSS       61%     14%
Resulting improvement (Eq. 1)                                                            1.5×    0.5×
injections will still yield incorrect conclusions (Table XIV). 
Enhancements to EDDI such as Error detectors [55] and 
reliability-aware transforms [56], are intended to reduce the 
number of EDDI checks (i.e., selective insertion of checks) in order to minimize execution time impact while maintaining 
high overall error coverage. We evaluated the Error detectors technique using flip-flop-level error injection and found that it provides an SDC improvement of 2.6× (a 21% reduction in SDC improvement as compared to EDDI without store-readback). Error detectors also requires software path tracking to recalculate important variables, which introduced a 3.9× execution time impact, greater than that of the original EDDI technique. The overhead corresponding to software path tracking can be reduced by implementing path tracking in hardware (as was done in the original work), but doing so eliminates the benefits of EDDI as a software-only technique.
Algorithm: Algorithm Based Fault Tolerance (ABFT) can detect (ABFT detection) or detect and correct errors (ABFT 
correction) through algorithm modifications [57], [58], [59], [60]. Although ABFT correction algorithms can be used for detection-only (with minimally reduced execution time impact), ABFT detection algorithms cannot be used for correction. There is often a large difference in execution time impact between ABFT algorithms as well depending on the complexity of check calculation required. An ABFT correction technique for matrix inner product, for example, requires simple modular checksums (e.g., generated by adding all elements in a matrix row) – an inexpensive computation. On the other hand, ABFT detection for FFT, for example, requires expensive calculations using Parseval’s theorem [61]. For the particular applications we studied, the algorithms that were protected using ABFT detection often required more computationally-expensive checks than algorithms that were protected using ABFT correction; therefore, the former generally had greater execution time impact (relative to each of their own original baseline execution times). An additional complication arises when an ABFT detection-only algorithm is implemented. Due to the long error detection latencies imposed by ABFT detection (9.6 million cycles, on average), hardware recovery techniques are not feasible and higher level recovery mechanisms will impose significant overheads. Recovery: We consider two recovery scenarios: bounded 
latency, i.e., an error must be recovered within a fixed period of time after its occurrence, and unconstrained, i.e., where no 
latency constraints exist and errors are recovered externally 
once detected (no hardware recovery is required). Bounded 
latency recovery is achieved using one of the following 
hardware recovery techniques (Table XV): flush or reorder 
buffer (RoB) recovery (both of which rely on flushing non-
committed instructions followed by re-execution) [62], [63]; 
instruction replay (IR) or extended instruction replay (EIR) 
recovery (both of which rely on instruction checkpointing to 
rollback and replay instructions) [10]. EIR is an extension of IR with additional buffers required by DFC for recovery. 
Flush and RoB are unable to recover from errors detected after 
the memory write stage of InO-cores or after the reorder 
buffer of OoO-cores, respectively (these errors will have propagated to architecture visible states). Hence, LEAP-DICE is used to protect flip-flops in these pipeline stages when using flush/RoB recovery. IR and EIR can recover detected errors in any pipeline flip-flop. IR recovery is shown in Fig. 4 and flush recovery is shown in Fig. 5. Since recovery hardware serves as a single point of failure, flip-flops in the recovery hardware itself need to be capable of error correction (e.g., protected by hardened flip-flops when considering soft errors).
Additional Techniques: Many additional resilience techniques have been published in the literature, but these techniques are closely related to our evaluated techniques. Therefore, we believe that our results are representative and largely cover the cross-layer design space. At the circuit-level, hardened flip-flops like DICE (Dual Interlocked storage Cell) [64], BCDMR (Bistable Cross-coupled Dual Modular Redundancy) [65], and BISER (Built In Soft Error Resilience) [66] are similar in cost to LEAP-DICE, the most resilient hardened flip-flop studied. The DICE technique suffers from an inability to tolerate SEMUs, unlike LEAP-DICE. BISER is capable of operating in both economy and resilient modes. This enhancement is provided by LEAP-ctrl. Hardened flip-flops like RCC (Reinforcing Charge Collection) [13] offer around 3× soft error rate improvement at around 1.2× area, power, and energy cost. LHL provides slightly more soft error tolerance at roughly the same cost as RCC. Circuit-level detection techniques such as [67], [68], [69] are similar to EDS. Like EDS, these techniques can detect 
soft errors while offering minor differences in actual implementation. Stability checking [70] works on a similar principle of time sampling to detect errors. Logic-level techniques like residue codes [9] can be effective for specific functional units like multipliers, but are costlier to implement than the simple XOR-trees used in logic parity. Additional logic level coding techniques like Berger 
Fig. 4.  Instruction Replay (IR) recovery.
[Diagram labels: pipeline stages fetch, decode, register, execute, memory, exception, write; instruction; recovery control; register file; shadow register file (mirroring, deferred shadow register write); shadowed instructions (replay); error detection. Legend: hardened, recovery hardware, cross-layer protected; normal operation, recovery operation.]
Fig. 5.  Flush recovery.
[Diagram labels: pipeline stages fetch, decode, register, execute, memory, exception, write; instruction; recovery control; shadowed instructions (replay); error detection; error can't escape; error can escape. Legend: hardened, recovery hardware, cross-layer protected; normal operation, recovery operation.]
TABLE XIII. EDDI: IMPORTANCE OF STORE-READBACK
                          SDC improvement   % SDC errors detected   SDC errors escaped   DUE improvement   % DUE errors detected   DUE errors escaped
Without store-readback    3.3×              86.1%                   49                   0.4×              19%                     3090
With store-readback       37.8×             98.7%                   6                    0.3×              19.8%                   3006
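A short calculation helps explain why the additional coverage in Table XIII matters so much. The sketch below is a simplified model and not the paper's Eq. 1: it assumes only that, if a fraction c of SDC-causing errors is detected, a fraction (1 - c) escapes, so escaped errors shrink by a factor of 1 / (1 - c). The function name escape_reduction is hypothetical.

```python
def escape_reduction(coverage):
    """Factor by which escaped errors shrink when `coverage` is detected."""
    return 1.0 / (1.0 - coverage)


without_rb = escape_reduction(0.861)   # Table XIII, without store-readback
with_rb = escape_reduction(0.987)      # Table XIII, with store-readback
gain = with_rb / without_rb            # relative benefit of store-readback

print(round(without_rb, 1), round(with_rb, 1), round(gain, 1))
```

Under this simplified model, moving from 86.1% to 98.7% coverage shrinks the escaped fraction by roughly an order of magnitude, which agrees in magnitude with the step from 3.3× to 37.8× in Table XIII. The absolute improvements reported in the table are lower than the factors computed here, presumably because Eq. 1 accounts for effects that this sketch does not model.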
TABLE XIV. COMPARISON OF SDC IMPROVEMENT AND DETECTION FOR EDDI WHEN INJECTING ERRORS AT VARIOUS LEVELS (NO STORE-READBACK)
Injection location                  SDC improvement   % SDC detected
Flip-flop (ground truth)            3.3×              86.1%
Register uniform (regU)             2.0×              48.8%
Register write (regW)               6.6×              84.8%
Program variable uniform (varU)     12.6×             92.1%
Program variable write (varW)       >100,000×         100%
TABLE XV. HARDWARE ERROR RECOVERY COSTS
Core   Type                               Area    Power   Energy   Recovery latency   Unrecoverable flip-flop errors
InO    Instruction Replay (IR) recovery   16%     21%     21%      47 cycles          None (all pipeline FFs recoverable)
InO    EIR recovery                       34%     32%     32%      47 cycles          None (all pipeline FFs recoverable)
InO    Flush recovery                     0.6%    0.9%    1.8%     7 cycles           FFs after memory write stage
OoO    Instruction Replay (IR) recovery   0.1%    0.1%    0.1%     104 cycles         None (all pipeline FFs recoverable)
OoO    EIR recovery                       0.2%    0.1%    0.1%     104 cycles         None (all pipeline FFs recoverable)
OoO    Reorder Buffer (RoB) recovery      0.01%   0.01%   0.01%    64 cycles          FFs after reorder buffer stage
codes [71] and Bose-Lin codes [72] are costlier to implement 
than logic parity. Like logic parity checking, residue, Berger, 
and Bose-Lin codes only detect errors. 
Techniques like DMR (Dual Modular Redundancy) and 
TMR (Triple Modular Redundancy) at the architecture level 
can be easily ruled out since these techniques will incur more 
than 100% area, power, and energy costs. RMT (Redundant 
Multi-Threading) [73] has been shown to have high (>40%) 
energy costs (which can increase due to recovery since RMT only serves to detect errors). Additionally, RMT is highly architecture dependent, which limits its applicability. Software techniques like Shoestring [74], Error detectors [55], Reliability-driven transforms [56], and SWIFT [75] are similar to EDDI, but offer variations to the technique by reducing the number of checks added. As a result, EDDI can be used as a bound on the maximum error detection possible. An enhancement to SWIFT, known as CRAFT [76], uses hardware acceleration to improve reliability, but doing so eliminates the benefit of EDDI as a software-only technique. Although it is difficult to faithfully compare these “selective” EDDI techniques as published (since the original authors evaluated improvements using high-level error injection at the architecture register level, which is generally inaccurate), the published results for these “selective” EDDI techniques show insufficient benefit (Table XVII). Enhancements which reduce the execution time impact provide very low SDC improvements, while those that provide moderate improvement incur high execution time (and thus, energy) impact (much higher than providing the same improvement using LEAP-DICE, for instance). Fault screening [62] is an additional software level technique. However, this technique also checks to ensure intermediate values computed during execution fall within expected bounds, which is similar to the mechanisms behind Software assertions for general-purpose processors, and thus, is covered by the latter.
Low-level Techniques: Resilience techniques at the circuit and logic layer (i.e., low-level techniques) are tunable as they can be selectively applied to individual flip-flops. As a result, a range of SDC/DUE improvements can be achieved for varying costs (Table XVI). These techniques offer the ability to finely tune the specific flip-flops to protect in order to achieve the degree of resilience improvements required. 
In general, (application-aware) selective hardening using LEAP-DICE provides the most cost-effective standalone solution for both InO- and OoO-cores over all improvements. Logic parity is less efficient due to the need for pipelining parity logic and EDS is less efficient due to the need for delay buffers. High-level Techniques: In general, techniques at the 
architecture, software, and algorithm layers (i.e., high-level 
techniques) are less tunable as there is little control of the 
exact subset of flip-flops a high-level technique will protect. 
From Table IV, we see that no high-level technique provides 
more than 38× improvement (while most offer far less 
improvement). As a result, to achieve a 50× improvement, for 
example, augmentation with low-level techniques at the 
circuit and logic levels is required regardless. 
III. CROSS-LAYER COMBINATIONS 
CLEAR uses a top-down approach to explore the cost-effectiveness of various cross-layer combinations. For example, resilience techniques at the upper layers of the system stack (e.g., ABFT correction) are applied before incrementally moving down the stack to apply techniques from lower layers (e.g., an optimized combination of logic parity checking, circuit-level LEAP-DICE, and micro-architectural recovery). This approach (example shown in Fig. 6) ensures that resilience techniques from various layers of the stack effectively interact with one another. Resilience techniques from the algorithm, software, and architecture layers of the stack generally protect multiple flip-flops (determined using error injection); however, a designer typically has little control over the specific subset protected. Using multiple techniques from these layers can lead to a situation where a given flip-flop may be protected (sometimes unnecessarily) by multiple techniques. At the logic and circuit layers, fine-grained protection is available since these techniques can be applied selectively to individual flip-flops (those not sufficiently protected by higher-level techniques). We explore a total of 586 cross-layer combinations using CLEAR (Table XVIII). Not all combinations of the ten resilience techniques and four recovery techniques are valid (e.g., it is unnecessary to combine ABFT correction and ABFT detection since the techniques are mutually exclusive or to explore combinations of monitor cores to protect an InO-core due to the high cost). Accurate flip-flop level injection and layout evaluation reveals many individual techniques provide minimal (less than 1.5×) SDC/DUE improvement (contrary to conclusions reported in the literature that were derived using inaccurate architecture- or software-level injection), have high costs, or both. 
The consequence of this revelation is that most cross-layer combinations have high cost (detailed results for these costly combinations are omitted for brevity but are shown in Fig. 1).  
 
14 Costs generated per benchmark with average cost over all benchmarks 
reported. Relative standard deviation is 0.6-3.1% 
15 DUE improvement not possible using detection-only techniques given 
unconstrained recovery. 
Fig. 6.  Cross-layer methodology example for combining ABFT correction, LEAP-DICE, logic parity, and micro-architectural recovery.
[Flowchart: Unprotected design. Apply ABFT correction. Perform error injection to determine percentage of errors resulting in SDC/DUE per flip-flop when application running with ABFT correction. Apply LEAP-DICE, parity, and recovery to flip-flops until required SDC/DUE improvement is achieved (Fig. 7). Protected design.]
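TABLE XVI. COSTS VS. SDC AND DUE IMPROVEMENTS FOR TUNABLE RESILIENCE TECHNIQUES14
A (AREA COST %), P (POWER COST %), E (ENERGY COST %)  (P=E FOR THESE COMBINATIONS – NO CLOCK/EXECUTION TIME IMPACT)
Entries are A / E for each improvement target (2, 5, 50, 500, max). Rows give the recovery scenario (bounded latency recovery, unconstrained recovery15) and the improvement type (SDC, DUE). Exec. time impact is 0% for every technique.

InO, Selective hardening using LEAP-DICE
                               2             5             50            500           max
  Bounded latency, SDC         0.8 / 2       1.8 / 4.3     2.9 / 7.3     3.3 / 8.2     9.3 / 22.4
  Bounded latency, DUE         0.7 / 1.5     1.7 / 3.8     3.8 / 9.5     5.1 / 12.5    9.3 / 22.4
  Unconstrained, SDC           0.8 / 2       1.8 / 4.3     2.9 / 7.3     3.3 / 8.2     9.3 / 22.4
  Unconstrained, DUE           0.7 / 1.5     1.7 / 3.8     3.8 / 9.5     5.1 / 12.5    9.3 / 22.4
InO, Logic parity only (+ IR recovery)
  Bounded latency, SDC         17.3 / 23.4   18.6 / 26     20.3 / 29.4   20.7 / 30.5   26.9 / 44.1
  Bounded latency, DUE         16.9 / 22.5   18.3 / 25.4   21.5 / 31.9   22.8 / 35     23.3 / 35.9
  Unconstrained, SDC           1.3 / 2.4     2.6 / 5       4.3 / 8.4     4.7 / 9.5     10.9 / 23.1
  Unconstrained, DUE           -             -             -             -             -
InO, EDS-only (+ IR recovery)
  Bounded latency, SDC         17.1 / 23.1   18.1 / 25.4   19.7 / 28.5   20.5 / 29.6   26.7 / 43.9
  Bounded latency, DUE         16.8 / 22.1   18 / 25.2     20.3 / 31.5   22.5 / 39.2   26.2 / 43.7
  Unconstrained, SDC           1.1 / 2.1     2.1 / 4.4     3.7 / 7.5     4.5 / 8.6     10.7 / 22.9
  Unconstrained, DUE           -             -             -             -             -
OoO, Selective hardening using LEAP-DICE
  Bounded latency, SDC         1.1 / 1.5     1.3 / 1.7     2.2 / 3.1     2.4 / 3.5     6.5 / 9.4
  Bounded latency, DUE         1.3 / 2       1.6 / 2.3     3.1 / 4.2     3.6 / 5.1     6.5 / 9.4
  Unconstrained, SDC           1.1 / 1.5     1.3 / 1.7     2.2 / 3.1     2.4 / 3.5     6.5 / 9.4
  Unconstrained, DUE           1.3 / 2       1.6 / 2.3     3.1 / 4.2     3.6 / 5.1     6.5 / 9.4
OoO, Logic parity only (+ IR recovery)
  Bounded latency, SDC         1.9 / 1.6     2.1 / 2.4     6.1 / 4.1     6.3 / 5.1     14.2 / 13.7
  Bounded latency, DUE         1.7 / 2.4     2.6 / 3       4.5 / 4.4     5 / 5.4       13.8 / 13.6
  Unconstrained, SDC           1.8 / 1.5     2 / 2.3       5.9 / 4       6.2 / 5       14.1 / 13.6
  Unconstrained, DUE           -             -             -             -             -
OoO, EDS-only (+ IR recovery)
  Bounded latency, SDC         1.4 / 1.7     1.8 / 2.1     3.3 / 3.5     4 / 4         12.3 / 11.6
  Bounded latency, DUE         1.3 / 2.1     2 / 2.5       3.6 / 4.4     4 / 5.3       11.8 / 11.4
  Unconstrained, SDC           1.3 / 1.6     1.7 / 2       3.2 / 3.4     3.9 / 3.9     12.2 / 11.5
  Unconstrained, DUE           -             -             -             -             -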
TABLE XVII. COMPARISON OF “SELECTIVE” EDDI TECHNIQUES AS REPORTED IN LITERATURE COMPARED TO EDDI EVALUATED USING FLIP-FLOP-LEVEL ERROR INJECTION
                                             Error-injection   SDC improve   Exec. time impact
EDDI with store-readback (implemented)       Flip-flop         37.8×         2.1×
Reliability-aware transforms (published)     Arch. reg.        1.8×          1.05×
Shoestring (published)                       Arch. reg.        5.1×          1.15×
SWIFT (published)                            Arch. reg.        13.7×         1.41×
A. Combinations for General-Purpose Processors 
Among the 586 cross-layer combinations explored using 
CLEAR, a highly promising approach combines selective 
circuit-level hardening using LEAP-DICE, logic parity, and 
micro-architectural recovery (flush recovery for InO-cores, 
RoB recovery for OoO-cores). Thorough error injection using application benchmarks plays a critical role in selecting the flip-flops protected using these techniques. Fig. 7 and Heuristic 1 detail the methodology for creating this combination. If recovery is not needed (e.g., for unconstrained recovery), the “Harden” procedure in Heuristic 1 can be modified to always return false. For example, to achieve a 50× SDC improvement, the combination of LEAP-DICE, logic parity, and micro-architectural recovery provides a 1.5× and 1.2× energy savings for the OoO- and InO-cores, respectively, compared to selective circuit hardening using LEAP-DICE (Table XX). The relative benefits are consistent across benchmarks and over the range of SDC/DUE improvements. The overheads in Table XX are small because we reported the most energy-efficient resilience solutions. Most of the 586 combinations are far costlier. Let us consider the scenario where recovery hardware is not needed (e.g., unconstrained recovery). In this case, a minimal (<0.2% energy) savings can be achieved when targeting SDC improvement. However, without recovery hardware, DUEs increase since detected errors are now uncorrectable; thus, no DUE improvement is achievable. Finally, one may suppose that the inclusion of EDS into cross-layer optimization may yield further savings since EDS costs ~25% less area, power, energy than LEAP-DICE. However, a significant portion of EDS overhead is not captured solely by cell overhead. In fact, the additional cost of aggregating and routing the EDS error detection signals and the cost of adding delay buffers to satisfy minimum delay constraints posed by EDS dominates cost and prevents cross-layer combinations using EDS from yielding benefits.  Additional cross-layer combinations spanning circuit, logic, architecture, and software layers are presented in Table XX. In general, most cross-layer combinations are not cost-effective. 
For general-purpose processors, a cross-layer combination of LEAP-DICE, logic parity, and micro-architectural recovery provides the lowest cost solution for InO- and OoO-cores for all improvements. Up to this point, we have considered SDC and DUE improvements separately. However, it may be useful to achieve a specific improvement in SDC and DUE simultaneously. When targeting SDC improvement, DUE improvement also improves (and vice-versa); however, it is unlikely that the two improvements will be the same since flip-flops with high SDC vulnerability will not necessarily be the same flip-flops that have high DUE vulnerability. A 
simple method for targeting joint SDC/DUE improvement is to implement resilience until SDC (DUE) improvement is reached and then continue implementing resilience to unprotected flip-flops until DUE (SDC) improvement is also achieved. This ensures that both SDC and DUE improvement meet (or exceed) the targeted minimum required improvement. Table XIX details the costs required to achieve joint SDC/DUE improvement using this methodology when considering a combination of LEAP-DICE, parity, and flush/RoB recovery. 
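The two-phase method for joint SDC/DUE targeting described above can be sketched as follows. This is a minimal sketch under simplifying assumptions: a protected flip-flop is assumed to contribute no residual errors, improvement is taken as the ratio of total to residual error counts (a simplification, not the paper's Eq. 1), and the function names and per-flip-flop error counts are hypothetical.

```python
def improvement(vuln, protected):
    """Ratio of total to residual vulnerability (inf if nothing remains)."""
    total = sum(vuln.values())
    residual = sum(v for f, v in vuln.items() if f not in protected)
    return float("inf") if residual == 0 else total / residual


def protect_until(vuln, target, protected):
    """Greedily protect the most vulnerable flip-flops until target is met."""
    for f in sorted(vuln, key=vuln.get, reverse=True):
        if improvement(vuln, protected) >= target:
            break
        protected.add(f)   # no effect if f is already protected
    return protected


def joint_protect(sdc, due, target):
    """Meet the SDC target first, then continue until DUE also meets it."""
    protected = protect_until(sdc, target, set())
    return protect_until(due, target, protected)


# Hypothetical counts of SDC- and DUE-causing errors per flip-flop.
sdc = {"a": 50, "b": 30, "c": 10, "d": 10}
due = {"a": 10, "b": 10, "c": 60, "d": 20}
chosen = joint_protect(sdc, due, target=2)
print(sorted(chosen), improvement(sdc, chosen), improvement(due, chosen))
```

In this example the SDC target alone is met by protecting one flip-flop, but a second, different flip-flop must be added before the DUE target is also met, which reflects the observation that the flip-flops with high SDC vulnerability are not necessarily those with high DUE vulnerability.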
B. Targeting Specific Applications 
When the application space targets specific algorithms (e.g., matrix operations), a cross-layer combination of LEAP-DICE, parity, ABFT correction, and micro-architectural error recovery (flush/RoB) provides additional energy savings (compared to the general-purpose cross-layer combinations presented in Sec. III.A). Since ABFT correction performs in-place error correction, no separate recovery mechanism is required for ABFT correction. For our study, we could apply ABFT correction to three of our PERFECT benchmarks: 
2d_convolution, debayer_filter, and inner_product (the 
rest were protected using ABFT detection). The results in Table XXII confirm that combinations of ABFT correction, LEAP-DICE, parity, and micro-architectural recovery provide up to 1.1× and 2× energy savings over the previously presented combination of LEAP-DICE, parity, and recovery when targeting SDC improvement for the OoO- and InO-cores, respectively. However, as will be discussed in Sec. III.B.1), the practicality of ABFT is limited when considering general-purpose processors.  When targeting DUE improvement, including ABFT correction provides no energy savings for the OoO-core. This is because ABFT correction (along with most architecture and software techniques like DFC, CFCSS, and Assertions) 
Fig. 7.  Cross-layer resilience methodology for combining LEAP-DICE, parity, and micro-architectural recovery.
[Flowchart: Unprotected design. For each flip-flop f ∈ S (where S is the set of all flip-flops in the design), determine the percentage of errors that cause SDC/DUE in f. Remove the flip-flop f ∈ S that has the highest percentage of errors causing SDC/DUE. Select technique (LEAP-DICE or parity) using Heuristic 1. Mark f to be protected using the selected technique. Decision: does implemented resilience achieve desired SDC/DUE improvement? If no, repeat from the removal step. If yes, apply resilience techniques to design and (optional) include flush (InO) or RoB (OoO) recovery. Protected design.]
Heuristic 1: Choose LEAP-DICE or parity technique
Input: f: flip-flop to be protected
Output: Technique to apply to f (LEAP-DICE, parity)
  if HARDEN(f) then return LEAP-DICE
  if PARITY(f) then return parity
  return LEAP-DICE

  procedure HARDEN(f)
    if an error in f cannot be flushed (i.e., f is in the memory, exception, writeback stages of InO or after the RoB of OoO)
    then return TRUE; else return FALSE
  end procedure

  procedure PARITY(f)
    if f has timing path slack greater than delay imposed by 32-bit XOR-tree (this implements low cost parity checking)
    then return TRUE; else return FALSE
  end procedure
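Heuristic 1 and the selection loop of Fig. 7 can be sketched together in Python. This is an illustrative sketch, not the tool flow used for the reported results. Assumptions are labeled in the code: the FlipFlop record, its field names, the xor32_delay parameter, and the example values are hypothetical, a protected flip-flop is assumed to contribute no residual errors, and improvement is approximated as total over residual error count rather than by Eq. 1.

```python
from dataclasses import dataclass


@dataclass
class FlipFlop:
    name: str
    errors: int       # injected errors in f that caused SDC/DUE (assumed data)
    flushable: bool   # False if an error in f cannot be flushed
    slack: float      # available timing path slack


def choose_technique(f, xor32_delay, recovery=True):
    """Heuristic 1: return 'LEAP-DICE' or 'parity' for flip-flop f."""
    if recovery and not f.flushable:   # HARDEN(f); always false without recovery
        return "LEAP-DICE"
    if f.slack > xor32_delay:          # PARITY(f)
        return "parity"
    return "LEAP-DICE"


def protect(flip_flops, target, xor32_delay, recovery=True):
    """Fig. 7 loop: protect the most vulnerable flip-flops first."""
    total = sum(f.errors for f in flip_flops)
    residual = total
    plan = {}
    for f in sorted(flip_flops, key=lambda ff: ff.errors, reverse=True):
        if residual == 0 or total / residual >= target:
            break
        plan[f.name] = choose_technique(f, xor32_delay, recovery)
        residual -= f.errors           # assumption: protection is perfect
    return plan


design = [
    FlipFlop("pc", 40, True, 0.9),
    FlipFlop("wb_data", 30, False, 0.9),
    FlipFlop("alu_out", 20, True, 0.1),
    FlipFlop("dbg", 10, True, 0.9),
]
plan = protect(design, target=5, xor32_delay=0.5)
print(plan)
```

Passing recovery=False corresponds to modifying the “Harden” procedure to always return false, as described for the unconstrained recovery case; in the example, "wb_data" would then be assigned parity instead of LEAP-DICE.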
TABLE XVIII. CREATING 586 CROSS-LAYER COMBINATIONS
Core | Techniques | No rec. | Flush / RoB rec. | IR / EIR rec. | Total
InO | Combinations of LEAP-DICE, EDS, parity, DFC, Assertions, CFCSS, EDDI | 127 | 3 | 14 | 144
InO | ABFT correction / detection alone | 2 | 0 | 0 | 2
InO | ABFT correction + previous combinations | 127 | 3 | 14 | 144
InO | ABFT detection + previous combinations | 127 | 0 | 0 | 127
InO | InO-core total | - | - | - | 417
OoO | Combinations of LEAP-DICE, EDS, parity, DFC, monitor cores | 31 | 7 | 30 | 68
OoO | ABFT correction / detection alone | 2 | 0 | 0 | 2
OoO | ABFT correction + previous combinations | 31 | 7 | 30 | 68
OoO | ABFT detection + previous combinations | 31 | 0 | 0 | 31
OoO | OoO-core total | - | - | - | 169
Combined Total | | | | | 586
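As a quick consistency check, the totals in Table XVIII can be recomputed from its rows. The snippet below is just an arithmetic sketch; the dictionary layout and the `core_total` helper are hypothetical, and each tuple holds the (no recovery, flush/RoB recovery, IR/EIR recovery) counts of one row.

```python
# Rows of Table XVIII as (no rec., flush/RoB rec., IR/EIR rec.) counts.
TABLE_XVIII = {
    "InO": {
        "base combinations": (127, 3, 14),
        "ABFT correction / detection alone": (2, 0, 0),
        "ABFT correction + base": (127, 3, 14),
        "ABFT detection + base": (127, 0, 0),
    },
    "OoO": {
        "base combinations": (31, 7, 30),
        "ABFT correction / detection alone": (2, 0, 0),
        "ABFT correction + base": (31, 7, 30),
        "ABFT detection + base": (31, 0, 0),
    },
}


def core_total(core):
    """Sum every row of one core over all recovery options."""
    return sum(sum(row) for row in TABLE_XVIII[core].values())


combined_total = core_total("InO") + core_total("OoO")
```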
TABLE XIX. COST TO ACHIEVE JOINT SDC/DUE IMPROVEMENT WITH A COMBINATION OF LEAP-DICE, PARITY, AND FLUSH/ROB RECOVERY
Joint SDC/DUE improvement | InO Area | InO Power | InO Energy | OoO Area | OoO Power | OoO Energy
2× | 0.7% | 2% | 2% | 0.6% | 0.1% | 0.1%
5× | 1.9% | 4.2% | 4.2% | 0.9% | 0.4% | 0.4%
50× | 4.1% | 9% | 9% | 2.8% | 2.2% | 2.2%
500× | 4.6% | 10.8% | 10.8% | 3.1% | 2.8% | 2.8%
max | 8% | 17.9% | 17.9% | 4.9% | 7% | 7%
performs checks at set locations in the program. For example, a DUE resulting from an invalid pointer access can cause an immediate program termination before a check is invoked. As a result, this DUE would not be detected by the technique. Although ABFT correction is useful for general-purpose processors limited to specific applications, the same cannot be said for ABFT detection (Table XXII). Fig. 8 shows that, since ABFT detection cannot perform in-place correction, ABFT detection benchmarks cannot provide DUE improvement (any detected error necessarily increases the number of DUEs). Additionally, given the lower average SDC improvement and generally higher execution time impact for ABFT detection algorithms, combinations with ABFT detection do not yield low-cost solutions. 
1) Additional Considerations for ABFT  
Since most applications are not amenable to ABFT correction, the flip-flops protected by ABFT correction must also be protected by techniques such as LEAP-DICE or parity (or combinations thereof) for processors targeting general-purpose applications. This requires circuit hardening techniques (e.g., [66], [77]) with the ability to selectively operate in an error-resilient mode (high resilience, high energy) when ABFT is unavailable, or in an economy mode (low resilience, low power) when ABFT is available. The LEAP-ctrl flip-flop accomplishes this task. The addition of LEAP-ctrl can incur an additional ~1% energy cost and ~3% area cost (Table XXII). Although 44% (22% for OoO-cores) of flip-flops would need to be implemented using LEAP-ctrl, only 5% (2% for OoO-cores) would be operating in economy mode at any given time (Table XXI). Unfortunately, this requirement of fine-grained operating mode control is difficult to implement in practice. It would require firmware or software control to determine, and pass to a hardware controller, whether an ABFT application is running and which flip-flops to place in resilient mode and which in economy mode (rather than a simple switch setting all such flip-flops into the same operating mode). 
Therefore, cross-layer combinations using ABFT correction may not be practical or useful in general-purpose processors targeting general applications. 
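The two columns of Table XXI (flip-flops with an error corrected by any ABFT algorithm, ∪, and by every ABFT algorithm, ∩) are plain set operations. The following sketch shows how such fractions could be computed; the function name, the input format, and the example algorithm and flip-flop names are hypothetical, and the input must contain at least one algorithm.

```python
def abft_coverage(corrected_by, all_flip_flops):
    """Return the fractions of flip-flops covered by any / by every ABFT algorithm.

    corrected_by maps an ABFT algorithm name to the set of flip-flops in
    which that algorithm corrected at least one injected error.
    """
    covered = [set(ffs) for ffs in corrected_by.values()]
    any_algorithm = set.union(*covered)           # the "∪" column
    every_algorithm = set.intersection(*covered)  # the "∩" column
    n = len(all_flip_flops)
    return len(any_algorithm) / n, len(every_algorithm) / n


# Hypothetical example with two algorithms and ten flip-flops.
example = {"alg_a": {"f1", "f2", "f3"}, "alg_b": {"f2", "f3", "f4"}}
flip_flops = [f"f{i}" for i in range(1, 11)]
any_frac, every_frac = abft_coverage(example, flip_flops)
```

Under this reading, the union identifies the flip-flops that would need LEAP-ctrl, while the intersection identifies those covered regardless of which ABFT application is running.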
C. Fixed Hardware 
Tunable resilience at low cost is enabled by low-level techniques like circuit-level hardening and logic-level parity. Unfortunately, it is not always possible to incorporate resilience at design time (e.g., legacy hardware, hardware from external vendors, commercial off-the-shelf hardware, etc.). Although circuit-, logic-, and architecture-level resilience cannot be incorporated given a fixed hardware constraint, it is still possible to provide soft error resilience at the software- and algorithm-level. No single resilience technique at the software- or algorithm-level can provide more than 50× SDC/DUE improvement (Table III); combining multiple software and algorithm techniques can increase the achievable resilience improvement, but at very high energy cost (Table XXIII). It is important to note that, due to the long error detection latencies of software techniques, these combinations are not relevant when bounded latency recovery is required (as no hardware recovery mechanism is applicable for these software error detection techniques). Given the existing software- and algorithm-level techniques available, even the most resilient cross-layer combination of ABFT correction, CFCSS, and EDDI can only provide an SDC improvement of 75.6× (at 163% energy cost). Therefore, in scenarios demanding extreme resilience (e.g., >100× resilience improvement), combinations of software and algorithm techniques will still be insufficient. This realization reinforces the importance of resilience being incorporated at design time with the aid of circuit- and logic-level techniques. 
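The fixed-hardware argument can be made concrete with the four InO combinations of Table XXIII. The sketch below picks the lowest-energy software/algorithm combination that meets a given SDC improvement target and returns `None` when no combination reaches it; the list layout and the `cheapest_combination` helper are hypothetical, while the numbers are those of Table XXIII (unconstrained recovery).

```python
# (combination, energy cost in %, SDC improvement) from Table XXIII (InO).
TABLE_XXIII = [
    ("CFCSS + EDDI", 144.0, 37.1),
    ("ABFT correction + CFCSS", 34.5, 5.1),
    ("ABFT correction + EDDI", 141.0, 61.1),
    ("ABFT correction + CFCSS + EDDI", 163.0, 75.6),
]


def cheapest_combination(target_sdc):
    """Lowest-energy combination achieving at least target_sdc, or None."""
    feasible = [row for row in TABLE_XXIII if row[2] >= target_sdc]
    if not feasible:
        return None
    return min(feasible, key=lambda row: row[1])
```

For a 50× target this selects ABFT correction + EDDI (at 141% energy cost), whereas a 100× target has no feasible software-only answer.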
TABLE XXI. IMPACT OF ABFT CORRECTION ON FLIP-FLOPS
Core | % FFs with an error corrected by any ABFT algorithm (∪) | % FFs with an error corrected by every ABFT algorithm (∩)
InO | 44% | 5%
OoO | 22% | 2%
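TABLE XX. COSTS VS. SDC AND DUE IMPROVEMENTS FOR VARIOUS COMBINATIONS IN GENERAL-PURPOSE PROCESSORS
A (AREA COST %), P (POWER COST %), E (ENERGY COST %); each cell lists A / P / E

InO: LEAP-DICE + logic parity (+ flush recovery). Exec. time impact: 0%
Recovery, improvement | 2× | 5× | 50× | 500× | max
Bounded latency, SDC | 0.7 / 1.9 / 1.9 | 1.7 / 3.9 / 3.9 | 2.5 / 6.1 / 6.1 | 3 / 6.7 / 6.7 | 8 / 17.9 / 17.9
Bounded latency, DUE | 0.6 / 1.5 / 1.5 | 1.5 / 3.4 / 3.4 | 3.6 / 8.4 / 8.4 | 4.4 / 10.4 / 10.4 | 8 / 17.9 / 17.9
Unconstrained, SDC | 0.7 / 1.9 / 1.9 | 1.6 / 3.8 / 3.8 | 2.4 / 5.9 / 5.9 | 2.8 / 6.5 / 6.5 | 7.6 / 17.2 / 17.2
Unconstrained, DUE | - | - | - | - | -

InO: EDS + LEAP-DICE + logic parity (+ flush recovery). Exec. time impact: 0%
Recovery, improvement | 2× | 5× | 50× | 500× | max
Bounded latency, SDC | 0.9 / 1.9 / 1.9 | 2.3 / 4.3 / 4.3 | 2.7 / 6.6 / 6.6 | 3.3 / 7.2 / 7.2 | 8.4 / 19.3 / 19.3
Bounded latency, DUE | 0.8 / 1.7 / 1.7 | 2.1 / 3.8 / 3.8 | 3.8 / 8.5 / 8.5 | 4.8 / 11 / 11 | 8.4 / 19.3 / 19.3
Unconstrained, SDC | 0.9 / 1.9 / 1.9 | 2.2 / 4.2 / 4.2 | 2.5 / 6.3 / 6.3 | 3.2 / 7.1 / 7.1 | 8.1 / 19 / 19
Unconstrained, DUE | - | - | - | - | -

InO: DFC + LEAP-DICE + logic parity (+ EIR recovery). Exec. time impact: 6.2%
Recovery, improvement | 2× | 5× | 50× | 500× | max
Bounded latency, SDC | 39.3 / 32.4 / 44.2 | 41.1 / 35.5 / 56.7 | 41.5 / 38.7 / 60.2 | 43.1 / 41 / 62.7 | 45 / 50.9 / 60.3
Bounded latency, DUE | 39.3 / 32.5 / 45.8 | 39.9 / 33.9 / 48.9 | 41.9 / 38.4 / 58.3 | 42.5 / 40.7 / 63 | 45 / 50.9 / 60.3
Unconstrained, SDC | 3.3 / 1.4 / 10.6 | 5.1 / 4.8 / 13.9 | 5.6 / 8.1 / 17.4 | 7.1 / 10 / 19.9 | 10.6 / 18.2 / 25.5
Unconstrained, DUE | - | - | - | - | -

InO: Assertions + LEAP-DICE + logic parity (no recovery). Exec. time impact: 15.6%
Recovery, improvement | 2× | 5× | 50× | 500× | max
Bounded latency, SDC | - | - | - | - | -
Bounded latency, DUE | - | - | - | - | -
Unconstrained, SDC | 0.7 / 1.4 / 17.1 | 0.9 / 1.8 / 17.5 | 1 / 2.2 / 18 | 1.1 / 2.2 / 18 | 7.6 / 17.2 / 24.5
Unconstrained, DUE | - | - | - | - | -

InO: CFCSS + LEAP-DICE + logic parity (no recovery). Exec. time impact: 40.6%
Recovery, improvement | 2× | 5× | 50× | 500× | max
Bounded latency, SDC | - | - | - | - | -
Bounded latency, DUE | - | - | - | - | -
Unconstrained, SDC | 0.3 / 0.8 / 41.5 | 1 / 1.8 / 43 | 1.4 / 2.9 / 44.6 | 1.3 / 3.1 / 44.9 | 7.6 / 17.2 / 64.8
Unconstrained, DUE | - | - | - | - | -

InO: EDDI + LEAP-DICE + logic parity (no recovery). Exec. time impact: 110%
Recovery, improvement | 2× | 5× | 50× | 500× | max
Bounded latency, SDC | - | - | - | - | -
Bounded latency, DUE | - | - | - | - | -
Unconstrained, SDC | 0 / 0 / 110 | 0 / 0 / 110 | 0.7 / 0.6 / 111 | 0.9 / 0.8 / 111 | 7.6 / 17.2 / 146
Unconstrained, DUE | - | - | - | - | -

OoO: LEAP-DICE + logic parity (+ RoB recovery). Exec. time impact: 0%
Recovery, improvement | 2× | 5× | 50× | 500× | max
Bounded latency, SDC | 0.06 / 0.1 / 0.1 | 0.1 / 0.2 / 0.2 | 1.4 / 2.1 / 2.1 | 2.2 / 2.4 / 2.4 | 4.9 / 7 / 7
Bounded latency, DUE | 0.5 / 0.1 / 0.1 | 0.7 / 0.1 / 0.1 | 2.6 / 2 / 2 | 3 / 1.8 / 1.8 | 4.9 / 7 / 7
Unconstrained, SDC | 0.06 / 0.1 / 0.1 | 0.1 / 0.2 / 0.2 | 1.4 / 2.1 / 2.1 | 2.2 / 2.4 / 2.4 | 4.9 / 7 / 7
Unconstrained, DUE | - | - | - | - | -

OoO: EDS + LEAP-DICE + logic parity (+ RoB recovery). Exec. time impact: 0%
Recovery, improvement | 2× | 5× | 50× | 500× | max
Bounded latency, SDC | 0.07 / 0.1 / 0.1 | 0.1 / 0.2 / 0.2 | 1.6 / 2.3 / 2.3 | 2.2 / 2.5 / 2.5 | 5.4 / 8.1 / 8.1
Bounded latency, DUE | 0.6 / 0.1 / 0.1 | 0.8 / 0.1 / 0.1 | 2.6 / 2 / 2 | 3 / 1.8 / 1.8 | 5.4 / 8.1 / 8.1
Unconstrained, SDC | 0.07 / 0.1 / 0.1 | 0.1 / 0.2 / 0.2 | 1.6 / 2.3 / 2.3 | 2.2 / 2.5 / 2.5 | 5.4 / 8.1 / 8.1
Unconstrained, DUE | - | - | - | - | -

OoO: DFC + LEAP-DICE + logic parity (+ EIR recovery). Exec. time impact: 7.1%
Recovery, improvement | 2× | 5× | 50× | 500× | max
Bounded latency, SDC | 0.2 / 1.1 / 21.2 | 1 / 1.4 / 21.5 | 1.8 / 2 / 22.2 | 2 / 2.8 / 23 | 5.3 / 7.2 / 14.8
Bounded latency, DUE | 0.2 / 0.2 / 20 | 0.4 / 0.2 / 20.1 | 1.7 / 2.6 / 22.9 | 3.9 / 3.3 / 23.6 | 5.3 / 7.2 / 14.8
Unconstrained, SDC | 0.1 / 1 / 10 | 0.8 / 1.3 / 11.4 | 1.6 / 1.9 / 12.1 | 1.8 / 2.7 / 12.9 | 5.1 / 7.1 / 14.7
Unconstrained, DUE | - | - | - | - | -

OoO: Monitor core + LEAP-DICE + logic parity (+ RoB rec.). Exec. time impact: 0%
Recovery, improvement | 2× | 5× | 50× | 500× | max
Bounded latency, SDC | 9 / 16.3 / 16.3 | 9 / 16.3 / 16.3 | 9.8 / 20 / 20 | 10.5 / 20.2 / 20.2 | 13.9 / 23.3 / 23.3
Bounded latency, DUE | 9 / 16.3 / 16.3 | 9 / 16.3 / 16.3 | 10.1 / 20.1 / 20.1 | 11.2 / 21.5 / 21.5 | 13.9 / 22.3 / 22.3
Unconstrained, SDC | 9 / 16.3 / 16.3 | 9 / 16.3 / 16.3 | 9.8 / 20 / 20 | 10.5 / 20.2 / 20.2 | 13.9 / 23.3 / 23.3
Unconstrained, DUE | - | - | - | - | -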
Fig. 8.  ABFT correction and ABFT detection benchmark comparisons.
[Plot: × DUE improvement (0 to 2) vs. × SDC improvement (0 to 8); series: ABFT Correction, ABFT Detection.]
IV. APPLICATION BENCHMARK DEPENDENCE 
The most cost-effective resilience techniques rely on selective circuit hardening / parity checking guided by error injection using application benchmarks. This raises the question: what happens when the applications in the field do not match the application benchmarks? We refer to this situation as application benchmark dependence. To quantify this dependence, we randomly selected 4 (of 11) SPEC benchmarks as a training set, and used the remaining 7 as a validation set. Resilience is implemented using the training set, and the resulting design’s resilience is determined using the validation set. That is, the training set tells us which flip-flops to protect, and the validation set tells us what the actual improvement would be when this same set of flip-flops is protected. We used 50 training/validation pairs. Since high-level techniques cannot be tuned to achieve a given resilience improvement, we analyze each as a standalone technique to better understand how they perform individually. For standalone resilience techniques, the average inaccuracy between trained and validated resilience is generally very low (Table XXIV and Table XXV), likely because the improvements that the techniques themselves provide are already very low. We also report p-values [78], which provide a measure of how likely the validated and trained improvements would match. Table XXVI and Table XXVII indicate that validated SDC and DUE improvements are generally lower than the trained improvements (i.e., underestimated). Fortunately, when targeting <10× SDC improvement, the underestimation is minimal. This is because the most vulnerable 10% of flip-flops (i.e., the flip-flops that result in the most SDCs or DUEs) are consistent across benchmarks. These include flip-flops that store the program counter, current instruction, ALU input operands, jump and link, execute next instruction, and function return (i.e., crucial program state information that all benchmarks utilize). 
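The training/validation procedure described above can be sketched as follows. This is a simplified illustration rather than the experimental flow itself: `per_benchmark` (a hypothetical input) maps each benchmark to per-flip-flop counts of SDC-causing errors, protected flip-flops are assumed to contribute no errors at all, and all function names are introduced here for illustration only.

```python
import random


def aggregate(per_benchmark, benchmarks):
    """Sum per-flip-flop error counts over a set of benchmarks."""
    totals = {}
    for name in benchmarks:
        for ff, count in per_benchmark[name].items():
            totals[ff] = totals.get(ff, 0) + count
    return totals


def improvement(counts, protected):
    """Improvement = all errors / errors left in unprotected flip-flops."""
    total = sum(counts.values())
    residual = sum(c for ff, c in counts.items() if ff not in protected)
    return float("inf") if residual == 0 else total / residual


def select_protected(train_counts, target):
    """Protect the most vulnerable flip-flops until the trained target is met."""
    protected = set()
    for ff in sorted(train_counts, key=train_counts.get, reverse=True):
        if improvement(train_counts, protected) >= target:
            break
        protected.add(ff)
    return protected


def train_validate(per_benchmark, target, n_train=4, n_pairs=50, seed=0):
    """Return (trained, validated) improvement for random train/validate splits."""
    rng = random.Random(seed)
    names = sorted(per_benchmark)
    results = []
    for _ in range(n_pairs):
        train = rng.sample(names, n_train)
        validate = [b for b in names if b not in train]
        train_counts = aggregate(per_benchmark, train)
        protected = select_protected(train_counts, target)
        results.append((
            improvement(train_counts, protected),
            improvement(aggregate(per_benchmark, validate), protected),
        ))
    return results
```

When every benchmark stresses the same flip-flops, the trained and validated improvements coincide; the gap reported in Table XXVI and Table XXVII arises when the validation benchmarks expose flip-flops that the training set ranked as less vulnerable.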
Since the number of errors resulting in SDC or DUE is not uniformly distributed among flip-flops, protecting these top 10% of flip-flops results in ~10× SDC improvement regardless of the benchmark considered. The vulnerabilities of the remaining 90% of flip-flops are more benchmark-dependent. These include flip-flops that store immediate operands, ALU result, register read/load/write, register addresses, cache state, exception/trap type, and supervisor state (i.e., program data state that is utilized differently by applications due to benchmark caching characteristics, register pressure, various data types, etc.). Concretely, we can analyze benchmark similarity by analyzing the vulnerable flip-flops indicated by each application benchmark. Per benchmark, one can group the most vulnerable 10% of flip-flops into a subset (e.g., subset 1). The next 10% of vulnerable flip-flops (e.g., 10-20%) are grouped into subset 2 (and so on up to subset 10). Therefore, given our 18 benchmarks, we create 18 distinct subset 1’s, 18 distinct subset 2’s, and so on. Each group of 18 subsets (e.g., all subset 1’s) can then be assigned a similarity as given in Eq. 2. The similarity of subset “x” is the number of flip-flops that exist in all subset “x’s” (i.e., the subset intersection) divided by the number of unique flip-flops across all subset “x’s” (i.e., the subset union). From Table XXVIII, it is clear that only the top 10% most vulnerable flip-flops have very high commonality across all benchmarks (the last 2 subsets also have high similarity because these are the flip-flops that have errors that always vanish). The vulnerability ranking of all other flip-flops varies with the specific benchmark being run.
$$\mathrm{Similarity}(\text{subset “x”}) = \frac{\left| \bigcap \, (\text{all flip-flops in every subset “x”}) \right|}{\left| \bigcup \, (\text{all flip-flops in every subset “x”}) \right|} \qquad (2)$$
It is clear that for highly-resilient designs, one must develop methods to combat this sensitivity to benchmarks. Benchmark sensitivity may be minimized by training using additional benchmarks or through better benchmarks (e.g., [79]). An alternative approach is to apply our CLEAR framework using available benchmarks, and then replace all remaining unprotected flip-flops using LHL (Table IV). This enables our resilient designs to meet (or exceed) resilience targets at ~1% additional cost for SDC and DUE improvements >10×. The maximum reported improvement for our lowest-cost cross-layer solution exceeds three orders of magnitude. However, it is still possible for an SDC/DUE to occur since circuit-hardening techniques do not guarantee correction of every possible flip-flop soft error. The extremely high level of resilience provided by our cross-layer solution is not possible using high-level techniques alone (Sec. II.D). Although a logic parity only (with recovery) solution could provide higher degrees of resilience, such a solution incurs a 44.1% energy cost (Table XVI). 
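The subset similarity of Eq. 2 is the size of the intersection divided by the size of the union, taken over the same-ranked subset of every benchmark. A minimal sketch is shown below; `ranked_subsets` and `similarity` are hypothetical helper names, and ties in vulnerability are broken by flip-flop name purely to make the example deterministic.

```python
def ranked_subsets(vulnerability, n_subsets=10):
    """Split flip-flops into equal-size subsets by decreasing vulnerability.

    vulnerability maps flip-flop name -> number of SDC + DUE causing errors
    observed for one benchmark.
    """
    ranked = sorted(vulnerability, key=lambda ff: (-vulnerability[ff], ff))
    n = len(ranked)
    return [
        set(ranked[i * n // n_subsets:(i + 1) * n // n_subsets])
        for i in range(n_subsets)
    ]


def similarity(subsets):
    """Eq. 2: |intersection| / |union| over one subset per benchmark."""
    union = set.union(*subsets)
    if not union:
        return 0.0
    return len(set.intersection(*subsets)) / len(union)


# Hypothetical example: two benchmarks that disagree only on ff1 vs. ff2.
bench_a = {f"ff{i}": 10 - i for i in range(10)}
bench_b = dict(bench_a)
bench_b["ff1"], bench_b["ff2"] = bench_a["ff2"], bench_a["ff1"]
subsets_a = ranked_subsets(bench_a, n_subsets=5)
subsets_b = ranked_subsets(bench_b, n_subsets=5)
top_similarity = similarity([subsets_a[0], subsets_b[0]])
```

With 18 benchmarks, `similarity` would be called once per rank with a list of 18 subsets, which corresponds to the ten rows of Table XXVIII.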
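TABLE XXII. COSTS VS. SDC AND DUE IMPROVEMENTS FOR VARIOUS CROSS-LAYER COMBINATIONS INVOLVING ABFT
A (AREA COST %), P (POWER COST %), E (ENERGY COST %); each cell lists A / P / E

InO: ABFT correction + LEAP-DICE + logic parity (+ flush recovery). Exec. time impact: 1.4%
Recovery, improvement | 2× | 5× | 50× | 500× | max
Bounded latency, SDC | 0 / 0 / 1.4 | 0.4 / 0.7 / 2.2 | 1.0 / 1.7 / 3.1 | 1.2 / 1.8 / 3.2 | 8 / 17.9 / 19.6
Bounded latency, DUE | 0.3 / 1 / 2.4 | 0.4 / 1 / 2.4 | 1.5 / 3.3 / 4.8 | 2.7 / 5.7 / 7.2 | 8 / 17.9 / 19.6
Unconstrained, SDC | 0 / 0 / 1.4 | 0.4 / 0.7 / 2.2 | 0.9 / 1.6 / 3 | 1.1 / 1.8 / 3.2 | 7.6 / 17.2 / 18.8
Unconstrained, DUE | - | - | - | - | -

InO: ABFT detection + LEAP-DICE + logic parity (no recovery). Exec. time impact: 24%
Recovery, improvement | 2× | 5× | 50× | 500× | max
Bounded latency, SDC | - | - | - | - | -
Bounded latency, DUE | - | - | - | - | -
Unconstrained, SDC | 0 / 0 / 1.4 | 1.2 / 2.4 / 27 | 2 / 4.8 / 30 | 2.5 / 5.7 / 31.1 | 7.6 / 17.2 / 45.3
Unconstrained, DUE | - | - | - | - | -

InO: ABFT correction + LEAP-ctrl + LEAP-DICE + logic parity (+ flush recovery). Exec. time impact: 1.4%
Recovery, improvement | 2× | 5× | 50× | 500× | max
Bounded latency, SDC | 1.5 / 0.6 / 1.9 | 2.5 / 1.3 / 2.7 | 3.8 / 2.6 / 4.0 | 4.1 / 2.8 / 4.2 | 8 / 17.9 / 19.6
Bounded latency, DUE | 1 / 1.3 / 2.8 | 1 / 1.3 / 2.8 | 4.1 / 4.6 / 6.1 | 5 / 7 / 8.5 | 8 / 17.9 / 19.6
Unconstrained, SDC | 1.5 / 0.6 / 1.9 | 2.3 / 1.2 / 2.6 | 3.4 / 2.6 / 4 | 4 / 2.7 / 4.1 | 7.6 / 17.2 / 18.8
Unconstrained, DUE | - | - | - | - | -

OoO: ABFT correction + LEAP-DICE + logic parity (+ RoB recovery). Exec. time impact: 1.4%
Recovery, improvement | 2× | 5× | 50× | 500× | max
Bounded latency, SDC | 0 / 0 / 1.4 | 0.01 / 0.01 / 1.5 | 0.3 / 0.5 / 1.9 | 0.5 / 0.8 / 2.2 | 4.9 / 7 / 8.5
Bounded latency, DUE | 0.4 / 0.1 / 1.5 | 0.6 / 0.1 / 1.5 | 2.1 / 3 / 4.2 | 3 / 1.6 / 3 | 4.9 / 7 / 8.5
Unconstrained, SDC | 0 / 0 / 1.4 | 0.01 / 0.01 / 1.5 | 0.3 / 0.5 / 1.9 | 0.5 / 0.8 / 2.2 | 4.8 / 6.9 / 8.4
Unconstrained, DUE | - | - | - | - | -

OoO: ABFT detection + LEAP-DICE + logic parity (no recovery). Exec. time impact: 24%
Recovery, improvement | 2× | 5× | 50× | 500× | max
Bounded latency, SDC | - | - | - | - | -
Bounded latency, DUE | - | - | - | - | -
Unconstrained, SDC | 0 / 0 / 24 | 0.1 / 0.2 / 24.2 | 0.7 / 1.2 / 25.5 | 1.2 / 1.6 / 26 | 4.8 / 6.9 / 32.6
Unconstrained, DUE | - | - | - | - | -

OoO: ABFT correction + LEAP-ctrl + LEAP-DICE + logic parity (+ RoB recovery). Exec. time impact: 1.4%
Recovery, improvement | 2× | 5× | 50× | 500× | max
Bounded latency, SDC | 1.5 / 0.3 / 1.7 | 1.8 / 0.3 / 1.7 | 2.9 / 1.0 / 2.5 | 3.2 / 1.3 / 2.7 | 4.9 / 7 / 8.5
Bounded latency, DUE | 0.6 / 0.1 / 1.5 | 0.9 / 0.1 / 1.5 | 2.8 / 3 / 4.3 | 3.6 / 1.6 / 3.1 | 4.9 / 7 / 8.5
Unconstrained, SDC | 1.5 / 0.3 / 1.7 | 1.8 / 0.3 / 1.7 | 2.9 / 1 / 2.5 | 3.2 / 1.3 / 2.7 | 4.9 / 6.9 / 8.4
Unconstrained, DUE | - | - | - | - | -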
TABLE XXIII. COSTS VS. SDC AND DUE IMPROVEMENTS FOR CROSS-LAYER COMBINATIONS FOR FIXED HARDWARE
Core | Combination | Area cost | Power cost | Energy cost | Execution time impact | Bounded latency recovery: SDC improvement | Bounded latency recovery: DUE improvement | Unconstrained recovery: SDC improvement | Unconstrained recovery: DUE improvement
InO | CFCSS + EDDI | 0% | 0% | 144% | 144% | - | - | 37.1× | 0.3×
InO | ABFT correction + CFCSS | 0% | 0% | 34.5% | 34.5% | - | - | 5.1× | 0.6×
InO | ABFT correction + EDDI | 0% | 0% | 141% | 141% | - | - | 61.1× | 0.2×
InO | ABFT correction + CFCSS + EDDI | 0% | 0% | 163% | 163% | - | - | 75.6× | 0.2×
V. DESIGN OF NEW RESILIENCE TECHNIQUES 
CLEAR has been used to comprehensively analyze the design space of existing resilience techniques (and their combinations). As new resilience techniques are proposed, CLEAR can incorporate and analyze these techniques as well. However, CLEAR can also be used today to guide the design of new resilience techniques.  All resilience techniques lie on a two-dimensional plane of energy cost vs. SDC improvement (Fig. 9). The range of designs formed using combinations of LEAP-DICE, parity, and micro-architectural recovery constitutes the lowest-cost cross-layer combination available using today’s resilience techniques. For new resilience techniques to create competitive cross-layer combinations, their energy and improvement tradeoffs must place them within the region bounded by our LEAP-DICE, parity, and micro-architectural recovery solution. Since certain standalone techniques, like LEAP-DICE, can also provide highly competitive solutions, it is useful to understand the cost vs. improvement tradeoffs for new techniques in relation to this best standalone technique as well (Fig. 10). 
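One way to read this guidance is as a frontier test: a new technique is competitive at a given SDC improvement if its energy cost is below that of the LEAP-DICE + parity + recovery solution at the same improvement. The sketch below uses the bounded latency recovery energy costs for SDC improvement from Table XX. Interpolating linearly in the logarithm of the improvement between tabulated points is an assumption made here, not something the paper specifies, and the "max" column is omitted because it does not correspond to a fixed improvement value; the names `FRONTIER`, `frontier_cost`, and `is_competitive` are hypothetical.

```python
import math

LEVELS = [2, 5, 50, 500]  # tabulated SDC improvement targets (×)

# Energy cost (%) of LEAP-DICE + parity + flush/RoB recovery (Table XX).
FRONTIER = {
    "InO": [1.9, 3.9, 6.1, 6.7],
    "OoO": [0.1, 0.2, 2.1, 2.4],
}


def frontier_cost(core, improvement):
    """Energy cost (%) of the reference solution at a given SDC improvement."""
    if not LEVELS[0] <= improvement <= LEVELS[-1]:
        raise ValueError("improvement outside the tabulated range")
    costs = FRONTIER[core]
    for i in range(len(LEVELS) - 1):
        x0, x1 = LEVELS[i], LEVELS[i + 1]
        y0, y1 = costs[i], costs[i + 1]
        if x0 <= improvement <= x1:
            if improvement == x0:
                return y0
            if improvement == x1:
                return y1
            # Assumed: linear interpolation on a log-scaled improvement axis.
            t = (math.log10(improvement) - math.log10(x0)) / (
                math.log10(x1) - math.log10(x0))
            return y0 + t * (y1 - y0)


def is_competitive(core, improvement, energy_cost):
    """True if a technique costs less energy than the reference solution."""
    return energy_cost < frontier_cost(core, improvement)
```

For example, a hypothetical technique offering 50× SDC improvement on the OoO-core would need an energy cost below 2.1% to pass this test.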
VI. CONCLUSION 
CLEAR is a first of its kind cross-layer resilience framework that enables effective exploration of a wide variety of resilience techniques and their combinations across several layers of the system stack. Extensive cross-layer resilience studies using CLEAR demonstrate: 
1) A carefully optimized combination of selective circuit-level hardening, logic-level parity checking, and micro-architectural recovery provides a highly cost-effective soft error resilience solution for general-purpose processors.
2) (Application-aware) selective circuit-level hardening alone, guided by thorough analysis of the effects of soft errors on application benchmarks, also provides a cost-effective soft error resilience solution (with ~1% additional energy cost for a 50× SDC improvement compared to the above approach). 3) Algorithm Based Fault Tolerance (ABFT) correction combined with selective circuit-level hardening (and logic-
TABLE XXIV. TRAINED VS. VALIDATED SDC IMPROVEMENTS FOR HIGH-LEVEL TECHNIQUES. UNDERESTIMATION LOW BECAUSE IMPROVEMENTS LOW
Core | Technique | Train | Validate | Underestimate | p-value
InO | DFC | 1.3× | 1.2× | -7.7% | 3.8×10^-9
InO | Assertions | 1.5× | 1.4× | -6.7% | 2.4×10^-1
InO | CFCSS | 1.6× | 1.5× | -6.3% | 5.7×10^-1
InO | EDDI | 37.8× | 30.4× | -19.6% | 6.9×10^-1
InO | ABFT correction | 4.3× | 3.9× | -9.3% | 6.7×10^-1
OoO | DFC | 1.3× | 1.2× | -7.7% | 1.9×10^-5
OoO | Monitor core | 19.6× | 17.5× | -5.6% | 8.3×10^-3
OoO | ABFT correction | 4.3× | 3.7× | -14% | 7.2×10^-1
TABLE XXV. TRAINED VS. VALIDATED DUE IMPROVEMENTS FOR HIGH-LEVEL TECHNIQUES. UNDERESTIMATION LOW BECAUSE IMPROVEMENTS LOW
Core | Technique | Train | Validate | Underestimate | p-value
InO | DFC | 1.4× | 1.3× | -7.1% | 3.9×10^-17
InO | Assertions | 0.6× | 0.6× | 0% | 8×10^-2
InO | CFCSS | 0.6× | 0.6× | 0% | 9.2×10^-1
InO | EDDI | 0.4× | 0.4× | 0% | 2.2×10^-1
InO | ABFT correction | 1.2× | 1.2× | 0% | 1.8×10^-1
OoO | DFC | 1.4× | 1.3× | -7.1% | 1.4×10^-10
OoO | Monitor core | 15.2× | 13.9× | -8.6% | 3.5×10^-7
OoO | ABFT correction | 1.1× | 1.1× | 0% | 1.5×10^-1
Fig. 9.  New resilience techniques must have cost and improvement tradeoffs that lie within the shaded regions bounded by LEAP-DICE + parity + micro-architectural recovery.
[Plot, InO and OoO panels: energy cost (%), 0 to 18, vs. improvement (2×, 5×, 50×, 500×, max); series: DUE (InO), SDC (InO), DUE (OoO), SDC (OoO). New resilience techniques must lie in shaded regions. Our solutions incur no clock frequency impact.]
TABLE XXVII. DUE IMPROVEMENT, COST BEFORE AND AFTER APPLYING LHL TO OTHERWISE UNPROTECTED FLIP-FLOPS
Core | DUE improvement: Train | Validate | After LHL | Cost before LHL insertion: Area | Power / Energy | Cost after LHL insertion: Area | Power / Energy
InO | 5× | 4.7× | 18.7× | 1.5% | 3.4% | 3.3% | 5.9%
InO | 10× | 8.7× | 34.6× | 1.9% | 4.2% | 3.5% | 6.5%
InO | 20× | 16.3× | 64.5× | 2.4% | 5.3% | 3.7% | 7%
InO | 30× | 23.5× | 92.7× | 2.8% | 6.6% | 3.7% | 8.1%
InO | 40× | 29.9× | 117.6× | 3.3% | 7.5% | 4.1% | 8.7%
InO | 50× | 35.9× | 140.6× | 3.6% | 8.4% | 4.2% | 9.4%
InO | 500× | 243.5× | 840.3× | 4.4% | 10.4% | 4.8% | 10.9%
InO | Max | 5,524.7× | 5,524.7× | 8% | 17.9% | 8% | 17.9%
OoO | 5× | 4.4× | 28.7× | 0.7% | 0.1% | 1.8% | 1.7%
OoO | 10× | 8.7× | 36.6× | 1.1% | 0.5% | 2.1% | 2%
OoO | 20× | 17.3× | 70.2× | 1.5% | 0.9% | 2.5% | 2%
OoO | 30× | 22.2× | 81.5× | 1.8% | 1.3% | 2.6% | 2.1%
OoO | 40× | 26.1× | 115.1× | 2.1% | 1.6% | 2.8% | 2.4%
OoO | 50× | 29.8× | 121.3× | 2.5% | 2% | 3.1% | 2.6%
OoO | 500× | 153.2× | 625.1× | 2.9% | 1.9% | 3.4% | 2.7%
OoO | Max | 6,802.6× | 6,802.6× | 4.9% | 7% | 4.9% | 7%
Fig. 10.  To be competitive standalone solutions, new resilience techniques must have cost and improvement tradeoffs that lie within the shaded regions bounded by LEAP-DICE.
[Plot, InO and OoO panels: energy cost (%), 0 to 24, vs. improvement (2×, 5×, 50×, 500×, max); series: DUE (InO), SDC (InO), DUE (OoO), SDC (OoO). Shaded region will beat best standalone technique. Our solutions incur no clock frequency impact.]
TABLE XXVI. SDC IMPROVEMENT, COST BEFORE AND AFTER APPLYING LHL TO OTHERWISE UNPROTECTED FLIP-FLOPS
Core | SDC improvement: Train | Validate | After LHL | Cost before LHL insertion: Area | Power / Energy | Cost after LHL insertion: Area | Power / Energy
InO | 5× | 4.8× | 19.3× | 1.6% | 3.6% | 3.1% | 5.7%
InO | 10× | 9.6× | 38.2× | 1.7% | 3.9% | 3.1% | 5.7%
InO | 20× | 19.1× | 75.8× | 1.9% | 4.4% | 3.2% | 6.1%
InO | 30× | 26.8× | 105.6× | 2.2% | 4.8% | 3.2% | 6.3%
InO | 40× | 32.9× | 129.4× | 2.3% | 5.3% | 3.3% | 6.7%
InO | 50× | 38.9× | 152.3× | 2.4% | 5.7% | 3.3% | 6.9%
InO | 500× | 433.1× | 1,326.1× | 2.9% | 6.3% | 3.4% | 7.1%
InO | Max | 5,568.9× | 5,568.9× | 8% | 17.9% | 8% | 17.9%
OoO | 5× | 4.8× | 35.1× | 0.1% | 0.2% | 0.9% | 1.8%
OoO | 10× | 8.8× | 40.7× | 0.4% | 0.6% | 1.1% | 2.1%
OoO | 20× | 18.8× | 65.6× | 0.7% | 1% | 1.3% | 2.3%
OoO | 30× | 21.3× | 82.3× | 0.9% | 1.4% | 1.4% | 2.4%
OoO | 40× | 26.4× | 130.2× | 1.2% | 1.7% | 1.7% | 2.5%
OoO | 50× | 32.1× | 204.3× | 1.4% | 2.1% | 1.9% | 2.7%
OoO | 500× | 301.4× | 1,084.1× | 2.2% | 2.4% | 2.4% | 2.8%
OoO | Max | 6,625.8× | 6,625.8× | 4.9% | 7% | 4.9% | 7%
TABLE XXVIII. SUBSET SIMILARITY ACROSS ALL 18 BENCHMARKS FOR THE INO-CORE (SUBSETS CONSIST OF GROUPS OF 10% OF ALL FLIP-FLOPS)
Subset (ranked by decreasing SDC + DUE vulnerability) | Similarity (Eq. 2)
1: 0-10% | 0.83
2: 10-20% | 0.05
3: 20-30% | 0
4: 30-40% | 0
5: 40-50% | 0
6: 50-60% | 0
7: 60-70% | 0
8: 70-80% | 0
9: 80-90% | 0.71
10: 90-100% | 1
level parity checking and micro-architectural recovery) can 
further reduce soft error resilience costs. However, existing ABFT correction techniques can only be used for a few applications; this limits the applicability of this approach in the context of general-purpose processors.
4) Based on our analysis, we can derive bounds on energy costs vs. degree of resilience (SDC or DUE improvements) that new soft error resilience techniques must achieve to be competitive.
5) It is crucial that the benefits and costs of new resilience techniques are evaluated thoroughly and correctly before publication. Detailed analysis (e.g., flip-flop-level error injection or layout-level cost quantification) identifies hidden weaknesses that are often overlooked.
While this paper focuses on soft errors in processor cores, cross-layer resilience solutions for accelerators and uncore components as well as other error sources (e.g., voltage noise) may have different tradeoffs and may require additional modeling and analysis capabilities.
ACKNOWLEDGMENT 
We thank Prof. S. V. Adve (UIUC), Prof. M. Alioto (NUS), Dr. K. A. Bowman (Qualcomm), Dr. B. Cline (ARM), Dr. S. Das (ARM), Dr. S. K. S. Hari (NVIDIA), Prof. J. C. Hoe (CMU), Prof. A. B. Kahng (UCSD), Dr. J. Kao (TSMC), Dr. D. Lin (Stanford), Prof. I. L. Markov (U. Michigan), Dr. H. Naeimi (Intel), Dr. A. S. Oates (TSMC), Prof. K. Pattabiraman (UBC), and Dr. S. K. Sahoo (UIUC). We thank the Texas Advanced Computing Center at The University of Texas at Austin.
REFERENCES
[1]  S. Borkar, "Designing reliable systems from unreliable components: the challenges of transistor variability and degradation," IEEE Micro, vol. 25, no. 6, pp. 10-16, Nov 2005.  
[2]  F. Cappello, A. Geist, W. Gropp, S. Kale, B. Kramer and M. Snir, "Toward exascale resilience: 2014 update," Supercomputing Frontiers and 
Innovations, vol. 1, no. 1, pp. 5-28, Jun 2014.  
[3]  N. P. Carter, H. Naeimi and D. S. Gardner, "Design techniques for cross-layer resilience," in 2010 Design, Automation Test in Europe Conference 
Exhibition (DATE 2010), 2010.  
[4]  A. DeHon, H. M. Quinn and N. P. Carter, "Vision for cross-layer optimization to address the dual challenges of energy and reliability," in 
2010 Design, Automation Test in Europe Conference Exhibition (DATE 
2010), 2010.  
[5]  M. S. Gupta, J. A. Rivers, L. Wang and P. Bose, "Cross-layer system resilience at affordable power," in 2014 IEEE International Reliability 
Physics Symposium, 2014.  
[6]  J. Henkel, L. Bauer, H. Zhang, S. Rehman and M. Shafique, "Multi-layer dependability: From microarchitecture to application level," in 2014 51st 
ACM/EDAC/IEEE Design Automation Conference (DAC), 2014. 
[7]  M. Pedram, D. Brooks and T. Pinkston, "Report for the NSF workshop on cross-layer power optimization and management," 2012. 
[8]  D. J. Lu, "Watchdog Processors and Structural Integrity Checking," IEEE 
Transactions on Computers, Vols. C-31, no. 7, pp. 681-685, July 1982. 
[9]  H. Ando, Y. Yoshida, A. Inoue, I. Sugiyama, T. Asakawa, K. Morita, T. Muta, T. Motokurumada, S. Okada, H. Yamashita, Y. Satsukawa, A. Konmoto, R. Yamashita and H. Sugiyama, "A 1.3 GHz fifth generation SPARC64 microprocessor," in Solid-State Circuits Conference, 2003. 
Digest of Technical Papers. ISSCC. 2003 IEEE International, 2003. 
[10] P. J. Meaney, S. B. Swaney, P. N. Sanda and L. Spainhower, "IBM z990 soft error detection and recovery," IEEE Transactions on Device and 
Materials Reliability, vol. 5, no. 3, pp. 419-427, Sept 2005. 
[11] B. Sinharoy, R. Kalla, W. J. Starke, H. Q. Le, R. Cargnoni, J. A. V. Norstrand, B. J. Ronchetti, J. Stuecheli, J. Leenstra, G. L. Guthrie, D. Q. Nguyen, B. Blaner, C. F. Marino, E. Retter and P. Williams, "IBM POWER7 multicore server processor," IBM Journal of Research and 
Development, vol. 55, no. 3, pp. 1:1-1:29, May 2011.  
[12] S. Mitra, P. Bose, E. Cheng, C. Y. Cher, H. Cho, R. Joshi, Y. M. Kim, C. R. Lefurgy, Y. Li, K. P. Rodbell, K. Skadron, J. Stathis and L. Szafaryn, "The resilience wall: Cross-layer solution strategies," in Proceedings of 
Technical Program - 2014 International Symposium on VLSI Technology, 
Systems and Application (VLSI-TSA), 2014.  
[13] N. Seifert, V. Ambrose, B. Gill, Q. Shi, R. Allmon, C. Recchia, S. Mukherjee, N. Nassif, J. Krause, J. Pickholtz and A. Balasubramanian, "On the radiation-induced soft error performance of hardened sequential 
elements in advanced bulk CMOS technologies," in Reliability Physics 
Symposium (IRPS), 2010 IEEE International, 2010. 
[14] N. Seifert, B. Gill, S. Jahinuzzaman, J. Basile, V. Ambrose, Q. Shi, R. Allmon and A. Bramnik, "Soft Error Susceptibilities of 22 nm Tri-Gate Devices," IEEE Transactions on Nuclear Science, vol. 59, no. 6, pp. 2666-2673, Dec 2012. 
[15] N. Seifert, S. Jahinuzzaman, J. Velamala, R. Ascazubi, N. Patel, B. Gill, J. Basile and J. Hicks, "Soft Error Rate Improvements in 14-nm Technology," 
IEEE Transactions on Nuclear Science, vol. 62, no. 6, pp. 2570-2577, Dec 2015. 
[16] N. N. Mahatme, N. J. Gaspard, S. Jagannathan, T. D. Loveless, B. L. Bhuva, W. H. Robinson, L. W. Massengill, S. J. Wen and R. Wong, "Impact of Supply Voltage and Frequency on the Soft Error Rate of Logic Circuits," IEEE Transactions on Nuclear Science, vol. 60, no. 6, pp. 4200-4206, Dec 2013. 
[17] R. Pawlowski, J. Crop, M. Cho, J. Tschanz, V. De, T. Fairbanks, H. Quinn, S. Borkar and P. Y. Chiang, "Characterization of radiation-induced SRAM and logic soft errors from 0.33V to 1.0V in 65nm CMOS," in Proceedings 
of the IEEE 2014 Custom Integrated Circuits Conference, 2014. 
[18] B. Gill, N. Seifert and V. Zia, "Comparison of alpha-particle and neutron-induced combinational and sequential logic error rates at the 32nm technology node," in 2009 IEEE International Reliability Physics 
Symposium, 2009. 
[19] H.-H. K. Lee, K. Lilja, M. Bounasser, P. Relangi, I. R. Linscott, U. S. Inan and S. Mitra, "LEAP: Layout Design through Error-Aware Transistor Positioning for soft-error resilient sequential cell design," in Reliability 
Physics Symposium (IRPS), 2010 IEEE International, 2010. 
[20] H. Cho, C. Y. Cher, T. Shepherd and S. Mitra, "Understanding soft errors in uncore components," in 2015 52nd ACM/EDAC/IEEE Design 
Automation Conference (DAC), 2015.  
[21] J. L. Henning, "SPEC CPU2000: measuring CPU performance in the New Millennium," Computer, vol. 33, no. 7, pp. 28-35, Jul 2000. 
[22] K. Barker, T. Benson, D. Campbell, D. Ediger, R. Gioiosa, A. Hoisie, D. Kerbyson, J. Manzano, A. Marquez, L. Song, N. Tallent and A. Tumeo, "PERFECT (Power Efficiency Revolution For Embedded Computing Technologies) Benchmark Suite Manual," 2013. 
[23] E. Cheng, S. Mirkhani, L. G. Szafaryn, C.-Y. Cher, H. Cho, K. Skadron, M. R. Stan, K. Lilja, J. A. Abraham, P. Bose and S. Mitra, "CLEAR: Cross-layer exploration for architecting resilience - combining hardware and software techniques to tolerate soft errors in processor cores," in 
Proceedings of the 53rd Annual Design Automation Conference, New York, NY, USA, 2016. 
[24] C. Bottoni, M. Glorieux, J. M. Daveau, G. Gasiot, F. Abouzeid, S. Clerc, L. Naviner and P. Roche, "Heavy ions test result on a 65nm Sparc-V8 radiation-hard microprocessor," in 2014 IEEE International Reliability 
Physics Symposium, 2014. 
[25] P. N. Sanda, J. W. Kellington, P. Kudva, R. Kalla, R. B. McBeth, J. Ackaret, R. Lockwood, J. Schumann and C. R. Jones, "Soft-error resilience of the IBM POWER6 processor," IBM Journal of Research and 
Development, vol. 52, no. 3, pp. 275-284, May 2008. 
[26] H. Cho, S. Mirkhani, C. Y. Cher, J. A. Abraham and S. Mitra, "Quantitative evaluation of soft error injection techniques for robust system design," in 
Design Automation Conference (DAC), 2013 50th ACM/EDAC/IEEE, 2013. 
[27] J. Davis and C. Thacker, "BEE3: Revitalizing Computer Architecture Research," Microsoft, 2009.
[28] P. Ramachandran, P. Kudva, J. Kellington, J. Schumann and P. Sanda, "Statistical Fault Injection," in 2008 IEEE International Conference on 
Dependable Systems and Networks With FTCS and DCC (DSN), 2008. 
[29] N. J. Wang, J. Quek, T. M. Rafacz and S. J. Patel, "Characterizing the effects of transient faults on a high-performance processor pipeline," in 
Dependable Systems and Networks, 2004 International Conference on, 2004. 
[30] S. E. Michalak, A. J. DuBois, C. B. Storlie, H. M. Quinn, W. N. Rust, D. H. DuBois, D. G. Modl, A. Manuzzato and S. P. Blanchard, "Assessment of the Impact of Cosmic-Ray-Induced Neutrons on Hardware in the Roadrunner Supercomputer," IEEE Transactions on Device and Materials 
Reliability, vol. 12, no. 2, pp. 445-454, June 2012.  
[31] N. J. Wang, A. Mahesri and S. J. Patel, "Examining ACE analysis reliability estimates using fault-injection," in Proceedings of the 34th 
Annual International Symposium on Computer Architecture, New York, NY, USA, 2007. 
[32] H. Schirmeier, C. Borchert and O. Spinczyk, "Avoiding Pitfalls in Fault-Injection Based Comparison of Program Susceptibility to Soft Errors," in 
2015 45th Annual IEEE/IFIP International Conference on Dependable 
Systems and Networks, 2015. 
[33] S. Mirkhani, S. Mitra, C. Y. Cher and J. Abraham, "Efficient soft error vulnerability estimation of complex designs," in 2015 Design, Automation 
Test in Europe Conference Exhibition (DATE), 2015. 
[34] E. Cheng, S. Mirkhani, L. G. Szafaryn, C.-Y. Cher, H. Cho, K. Skadron, M. R. Stan, K. Lilja, J. A. Abraham, P. Bose and S. Mitra, "CLEAR: Cross-layer exploration for architecting resilience - combining hardware and software techniques to tolerate soft errors in processor cores," arXiv:1604.03062 [cs.AR], 2016. 
[35] M. Sullivan, B. Zimmer, S. Hari, T. Tsai and S. W. Keckler, "An analytical model for hardened latch selection and exploration," in Workshop on 
Silicon Errors in Logic - System Effects (SELSE), 2016.  
[36] Synopsys, Synopsys design suite.  
[37] K. Lilja, M. Bounasser, S. J. Wen, R. Wong, J. Holst, N. Gaspard, S. Jagannathan, D. Loveless and B. Bhuva, "Single-Event Performance and Layout Optimization of Flip-Flops in a 28-nm Bulk Technology," IEEE 
Transactions on Nuclear Science, vol. 60, no. 4, pp. 2782-2788, Aug 2013. 
[38] K. Lilja, M. Bounasser, T. R. Assis, K. Rodbell, P. Oldiges, M. Turowski, B. Bhuva, S. Wen and R. Wong, "SER prediction in advanced finFET and SOI finFET technologies; challenges and comparisons to measurements," in Single Event Effects (SEE) Symposium, 2016.  
[39] R. C. Quinn, J. S. Kauppila, T. D. Loveless, J. A. Maharrey, J. D. Rowe, M. L. Alles, B. L. Bhuva, R. A. Reed, M. Bounasser, K. Lilja and L. W. Massengill, "Frequency trends observed in 32nm SOI flip-flops and combinational logic," in IEEE Nuclear and Space Radiation Effects 
Conference, 2015.  
[40] R. C. Quinn, J. S. Kauppila, T. D. Loveless, J. A. Maharrey, J. D. Rowe, M. W. McCurdy, E. X. Zhang, M. L. Alles, B. L. Bhuva, R. A. Reed, W. T. Holman, M. Bounasser, K. Lilja and L. W. Massengill, "Heavy Ion SEU Test Data for 32nm SOI Flip-Flops," in Radiation Effects Data Workshop 
(REDW), 2015 IEEE, 2015.  
[41] M. Turowski, "32nm SOI SRAM and latch SEU cross-sections measured (heavy ion data) and determined with simulations," in Single Event Effects 
(SEE) Symposium, 2015.  
[42] K. A. Bowman, J. W. Tschanz, N. S. Kim, J. C. Lee, C. B. Wilkerson, S. L. L. Lu, T. Karnik and V. K. De, "Energy-Efficient and Metastability-Immune Resilient Circuits for Dynamic Variation Tolerance," IEEE 
Journal of Solid-State Circuits, vol. 44, no. 1, pp. 49-63, Jan 2009. 
[43] K. A. Bowman, J. W. Tschanz, S. L. L. Lu, P. A. Aseron, M. M. Khellah, A. Raychowdhury, B. M. Geuskens, C. Tokunaga, C. B. Wilkerson, T. Karnik and V. K. De, "A 45 nm Resilient Microprocessor Core for Dynamic Variation Tolerance," IEEE Journal of Solid-State Circuits, vol. 46, no. 1, pp. 194-208, Jan 2011.  
[44] L. Spainhower and T. A. Gregg, "IBM S/390 Parallel Enterprise Server G5 fault tolerance: A historical perspective," IBM Journal of Research and 
Development, vol. 43, no. 5.6, pp. 863-873, Sept 1999.  
[45] O. A. Amusan, "Effects of single-event-induced charge sharing in sub-100 nm bulk CMOS technologies," 2009. 
[46] S. Mitra and E. J. McCluskey, "Which concurrent error detection scheme to choose?," in Test Conference, 2000. Proceedings. International, 2000. 
[47] A. Meixner, M. E. Bauer and D. Sorin, "Argus: Low-Cost, Comprehensive Error Detection in Simple Cores," in 40th Annual IEEE/ACM International 
Symposium on Microarchitecture (MICRO 2007), 2007.  
[48] T. M. Austin, "DIVA: a reliable substrate for deep submicron microarchitecture design," in Microarchitecture, 1999. MICRO-32. 
Proceedings. 32nd Annual International Symposium on, 1999. 
[49] S. K. S. Hari, S. V. Adve and H. Naeimi, "Low-cost program-level detectors for reducing silent data corruptions," in IEEE/IFIP International 
Conference on Dependable Systems and Networks (DSN 2012), 2012. 
[50] S. K. Sahoo, M. L. Li, P. Ramachandran, S. V. Adve, V. S. Adve and Y. Zhou, "Using likely program invariants to detect hardware errors," in 2008 
IEEE International Conference on Dependable Systems and Networks With 
FTCS and DCC (DSN), 2008.  
[51] N. Oh, P. P. Shirvani and E. J. McCluskey, "Control-flow checking by software signatures," IEEE Transactions on Reliability, vol. 51, no. 1, pp. 111-122, Mar 2002.  
[52] M. N. Lovellette, K. S. Wood, D. L. Wood, J. H. Beall, P. P. Shirvani, N. Oh and E. J. McCluskey, "Strategies for fault-tolerant, space-based computing: Lessons learned from the ARGOS testbed," in Aerospace 
Conference Proceedings, 2002. IEEE, 2002.  
[53] N. Oh, P. P. Shirvani and E. J. McCluskey, "Error detection by duplicated instructions in super-scalar processors," IEEE Transactions on Reliability, vol. 51, no. 1, pp. 63-75, Mar 2002.  
[54] D. Lin, T. Hong, Y. Li, S. E., S. Kumar, F. Fallah, N. Hakim, D. S. Gardner and S. Mitra, "Effective Post-Silicon Validation of System-on-Chips Using Quick Error Detection," IEEE Transactions on Computer-Aided Design of 
Integrated Circuits and Systems, vol. 33, no. 10, pp. 1573-1590, Oct 2014. 
[55] K. Pattabiraman, Z. T. Kalbarczyk and R. K. Iyer, "Automated derivation of application-aware error detectors using static analysis: the trusted illiac approach," IEEE Transactions on Dependable and Secure Computing, vol. 8, no. 1, pp. 44-57, 2011.  
[56] S. Rehman, F. Kriebel, M. Shafique and J. Henkel, "Reliability-Driven Software Transformations for Unreliable Hardware," IEEE Transactions on 
Computer-Aided Design of Integrated Circuits and Systems, vol. 33, no. 11, pp. 1597-1610, Nov 2014.  
[57] G. Bosilca, R. Delmas, J. Dongarra and J. Langou, "Algorithm-based fault tolerance applied to high performance computing," Journal of Parallel and Distributed Computing, vol. 69, no. 4, pp. 410-416, Apr 2009.
[58] Z. Chen and J. Dongarra, "Numerically stable real number codes based on random matrices," in Computational Science -- ICCS 2005: 5th International Conference, Atlanta, GA, USA, May 22-25, 2005. Proceedings, Part I, V. S. Sunderam, G. D. van Albada, P. M. A. Sloot and J. J. Dongarra, Eds., Berlin, Heidelberg: Springer Berlin Heidelberg, 2005, pp. 115-122.
[59] K.-H. Huang and J. A. Abraham, "Algorithm-Based Fault Tolerance for Matrix Operations," IEEE Transactions on Computers, vol. C-33, no. 6, pp. 518-528, June 1984. 
[60] V. S. S. Nair and J. A. Abraham, "Real-number codes for fault-tolerant matrix operations on processor arrays," IEEE Transactions on Computers, vol. 39, no. 4, pp. 426-435, Apr 1990.  
[61] A. L. N. Reddy and P. Banerjee, "Algorithm-based fault detection for signal processing applications," IEEE Transactions on Computers, vol. 39, no. 10, pp. 1304-1308, Oct 1990.  
[62] P. Racunas, K. Constantinides, S. Manne and S. S. Mukherjee, "Perturbation-based Fault Screening," in 2007 IEEE 13th International 
Symposium on High Performance Computer Architecture, 2007. 
[63] N. J. Wang and S. J. Patel, "ReStore: symptom based soft error detection in microprocessors," in 2005 International Conference on Dependable 
Systems and Networks (DSN'05), 2005.  
[64] T. Calin, M. Nicolaidis and R. Velazco, "Upset hardened memory design for submicron CMOS technology," IEEE Transactions on Nuclear Science, vol. 43, no. 6, pp. 2874-2878, Dec 1996.  
[65] J. Furuta, C. Hamanaka, K. Kobayashi and H. Onodera, "A 65nm Bistable Cross-coupled Dual Modular Redundancy Flip-Flop capable of protecting soft errors on the C-element," in 2010 Symposium on VLSI Circuits, 2010. 
[66] S. Mitra, N. Seifert, M. Zhang, Q. Shi and K. S. Kim, "Robust System Design with Built-In Soft-Error Resilience," Computer, vol. 38, no. 2, pp. 43-52, 2005. 
[67] D. Blaauw, S. Kalaiselvan, K. Lai, W. H. Ma, S. Pant, C. Tokunaga, S. Das and D. Bull, "Razor II: In Situ Error Detection and Correction for PVT and SER Tolerance," in 2008 IEEE International Solid-State Circuits 
Conference - Digest of Technical Papers, 2008.  
[68] M. Fojtik, D. Fick, Y. Kim, N. Pinckney, D. M. Harris, D. Blaauw and D. Sylvester, "Bubble Razor: Eliminating Timing Margins in an ARM Cortex-M3 Processor in 45 nm CMOS Using Architecturally Independent Error Detection and Correction," IEEE Journal of Solid-State Circuits, vol. 48, no. 1, pp. 66-81, Jan 2013. 
[69] M. Nicolaidis, "Time redundancy based soft-error tolerance to rescue nanometer technologies," in VLSI Test Symposium, 1999. Proceedings. 
17th IEEE, 1999. 
[70] P. Franco and E. J. McCluskey, "On-line delay testing of digital circuits," in 
VLSI Test Symposium, 1994. Proceedings., 12th IEEE, 1994. 
[71] J. M. Berger, "A note on error detection codes for asymmetric channels," 
Information and Control, vol. 4, no. 1, pp. 68-73, Mar 1961. 
[72] B. Bose and D. J. Lin, "Systematic Unidirectional Error-Detecting Codes," IEEE Transactions on Computers, vol. C-34, no. 11, pp. 1026-1032, Nov 1985. 
[73] S. S. Mukherjee, M. Kontz and S. K. Reinhardt, "Detailed design and evaluation of redundant multi-threading alternatives," in Computer 
Architecture, 2002. Proceedings. 29th Annual International Symposium on, 2002. 
[74] S. Feng, S. Gupta, A. Ansari and S. Mahlke, "Shoestring: Probabilistic Soft Error Reliability on the Cheap," in Proceedings of the Fifteenth Edition of 
ASPLOS on Architectural Support for Programming Languages and 
Operating Systems, New York, NY, USA, 2010.  
[75] G. A. Reis, J. Chang, N. Vachharajani, R. Rangan and D. I. August, "SWIFT: software implemented fault tolerance," in International 
Symposium on Code Generation and Optimization, 2005. 
[76] G. A. Reis, J. Chang, N. Vachharajani, S. S. Mukherjee, R. Rangan and D. I. August, "Design and evaluation of hybrid fault-detection systems," in 
32nd International Symposium on Computer Architecture (ISCA'05), 2005. 
[77] M. Zhang, S. Mitra, T. M. Mak, N. Seifert, N. J. Wang, Q. Shi, K. S. Kim, N. R. Shanbhag and S. J. Patel, "Sequential Element Design With Built-In Soft Error Resilience," IEEE Transactions on Very Large Scale Integration 
(VLSI) Systems, vol. 14, no. 12, pp. 1368-1378, Dec 2006. 
[78] R. L. Wasserstein and N. A. Lazar, "The ASA's statement on p-values: context, process, and purpose," The American Statistician, vol. 70, no. 2, pp. 129-133, 2016. 
[79] S. Mirkhani, B. Samynathan and J. A. Abraham, "In-depth soft error vulnerability analysis using synthetic benchmarks," in 2015 IEEE 33rd 
VLSI Test Symposium (VTS), 2015.  
[80] D. L. Tao and C. R. P. Hartmann, "A novel concurrent error detection scheme for FFT networks," IEEE Transactions on Parallel and Distributed 
Systems, vol. 4, no. 2, pp. 198-221, Feb 1993.  
[81] A. Gaisler, Leon3 processor.
 
Eric Cheng (S’09) received the B.S. degree in 
electrical and computer engineering from Carnegie 
Mellon University, Pittsburgh, PA, USA, in 2011 
and the M.S. degree in electrical engineering from 
Stanford University, Stanford, CA, USA in 2013, 
where he is currently pursuing the Ph.D. degree. 
His research interests include cross-layer 
resilience, self-repair, and recovery of reliable 
systems. 
 
 
Shahrzad Mirkhani received her B.Sc. degree in 
computer engineering from the Sharif University of 
Technology, Tehran, Iran, in 1998 and the M.Sc. 
degree in electrical and computer engineering from 
the University of Tehran, Tehran, Iran, in 2001. She 
then continued as a researcher in the CAD research 
group at the University of Tehran until 2007. She 
received her Ph.D. degree in computer engineering 
from the University of Texas at Austin in 2014 under 
the supervision of Prof. Jacob A. Abraham. She 
continued her research in reliability as a post-doctoral researcher at Stanford 
University under the supervision of Prof. Subhasish Mitra. She is currently the 
lead product engineer at Bigstream Solution, mainly working on Big Data 
acceleration. 
 
Lukasz G. Szafaryn is a computer architect at Intel 
in the GPU performance group. He was previously a 
research scientist at AMD Research in the GPU 
reliability group. He received his Ph.D. from the 
University of Virginia. His research interests include 
novel architecture-level resilience and performance 
optimization techniques, CPU-GPU integration as 
well as design analysis using modeling/simulation 
tools. 
 
 
Chen-Yong Cher (SM’13) is a Research Staff 
Member in the Computer Architecture Department at 
the IBM Thomas J. Watson Research Center. He 
received B.S., M.S. and Ph.D. degrees in Electrical 
and Computer Engineering from Purdue University, 
in 1998, 2000 and 2004, respectively. He 
subsequently joined IBM at the Thomas J. Watson 
Research Center, where he has worked on 
performance, thermal, process variation, power and 
soft error reliability for microprocessors. He has 
contributed to BlueGene/Q microarchitecture definition and is principal power 
and soft error reliability architect for Power Edge-of-Network (EN) and 
BlueGene/Q compute chips. He is an IEEE senior member. He is an author or 
coauthor of 25 patents and over 40 technical papers. 
 
Hyungmin Cho is an Assistant Professor in the 
Department of Computer Engineering, Hongik 
University, Korea. He received the B.S. degree in 
Computer Science Engineering from Seoul National 
University, Korea in 2005, and the M.S. and Ph.D. 
degrees in Electrical Engineering from Stanford 
University in 2010 and 2015, respectively. He was a 
Research Scientist at Intel Labs, Santa Clara, CA. 
His research interests include reliable computer 
systems and computing models for robust systems. 
 
Kevin Skadron (F’13) is the Harry Douglas Forsyth 
professor and chair of the Department of Computer 
Science at the University of Virginia, where he has 
been on the faculty since 1999. His research focuses 
on heterogeneous architecture, design and 
applications of novel hardware accelerators, and 
design for physical constraints such as power, 
temperature, and reliability.  Skadron and his 
colleagues have developed a number of open-source 
tools to support this research, including the HotSpot 
thermal modeling tools, the Svalinn reliability model, the Rodinia GPU 
benchmark suite, the ANMLZoo automata benchmark suite, and the 
MNCaRT research toolkit for automata processing.  Skadron is a Fellow of 
the IEEE and the ACM, and recipient of the 2011 ACM SIGARCH Maurice 
Wilkes Award. 
 
Mircea R. Stan (F’14) received the “Diploma” 
degree from Politehnica University, Bucharest, 
Romania, and the M.S. and Ph.D. degrees from the 
University of Massachusetts. He received the NSF 
CAREER Award in 1997 and was a co-author on 
best paper awards at SELSE 2017, ISQED 2008, 
GLSVLSI 2006, ISCA 2003, and SHAMAN 2002. 
He is a Senior Editor of the IEEE TRANSACTIONS 
ON NANOTECHNOLOGY. Prof. Stan is a Fellow 
of the IEEE and a member of ACM, Eta Kappa Nu, Phi 
Kappa Phi, and Sigma Xi. 
 
Klas Lilja is the founder and CEO of Robust Chip 
Inc. (RCI), a company specialized in analysis of 
radiation effects in electronics and design of 
radiation hard electronics. He received M.Sc. and Ph.D. 
degrees from Chalmers Technical University, 
Sweden, and the Swiss Federal Institute of 
Technology (ETH), Switzerland, in 1986 and 1992, 
respectively, and has over 25 years of experience in 
the area of design, simulation and radiation hardening 
of electronic devices and circuits. Prior to founding 
RCI, Klas was VP Engineering at ISE and Head of TCAD at Avant! 
Corporation (both companies are now part of Synopsys). As founder of RCI, 
he has led the development of RCI’s unique software tools and breakthrough 
technology for radiation hardened electronics, and developed the company 
into a worldwide leader in this field. Klas holds more than 10 issued and 
pending patents on RCI’s new technology, as well as several other patents in 
the area of semiconductor devices and semiconductor simulation. 
 
Jacob A. Abraham (F’84) is a Professor in the 
department of Electrical and Computer Engineering 
at the University of Texas at Austin.  He is also 
director of the Computer Engineering Research 
Center and holds a Cockrell Family Regents Chair in 
Engineering.  He received the Bachelor's degree in 
Electrical Engineering from the University of Kerala, 
India, in 1970.  His M.S. degree, in Electrical 
Engineering, and Ph.D., in Electrical Engineering 
and Computer Science, were received from Stanford 
University, Stanford, California, in 1971 and 1974, respectively.  From 1975 
to 1988 he was on the faculty of the University of Illinois, Urbana, Illinois. 
Professor Abraham's research interests include VLSI design and test, low-
power circuits, and the design of on-chip circuitry to facilitate measurements 
and diagnosis of digital, analog and RF circuits and biological systems.  He is 
also interested in techniques to mitigate the complexity of design verification, 
and the design and evaluation of resilient and secure systems.  He has over 
500 publications, has received numerous "best paper" awards at international 
conferences, and was included in the ISI (Thomson Reuters) list of highly 
cited researchers.  He has supervised more than 90 Ph.D. dissertations and is 
particularly proud of the accomplishments of his students, many of whom 
occupy senior positions in academia and industry.  He has been elected Fellow 
of the IEEE as well as Fellow of the ACM, and is the recipient of the 2005 
IEEE Emanuel R. Piore Award and the 2017 Jean-Claude Laprie Award in 
Dependable Computing. 
 
Pradip Bose (F’07) is a Distinguished Research 
Staff Member and Manager of the Efficient & 
Resilient Systems Department at IBM T. J. Watson 
Research Center, Yorktown Heights, NY. He has 
been involved in the design and pre-silicon modeling 
of virtually all IBM POWER-series microprocessors, 
since the pioneering POWER1 (RS/6000) machine, 
which started as the Cheetah (and subsequently 
America) superscalar RISC project at IBM Research. 
From 1992-95, he was on assignment at IBM Austin, 
where he was the lead performance engineer in a high-end processor 
development project (POWER3). During 1989-90, Dr. Bose was on a 
sabbatical assignment as a Visiting Associate Professor at Indian Statistical 
Institute, India, where he worked on practical applications of knowledge-
based systems. His current research interests are in high performance 
computers, power- and reliability-aware microprocessor architectures, and pre-
silicon modeling and validation. He is the author or co-author of over ninety 
refereed publications (including several book chapters) and he also serves as 
an Adjunct Professor at Columbia University. He has received twenty-five 
Invention Plateau Awards and several Research Accomplishment and 
Outstanding Innovation Awards from IBM. Dr. Bose served as the Editor-in-
Chief of IEEE Micro from 2003-2006 and as the chair of ACM SIGMICRO 
from 2011-2017. He is an IEEE Fellow and a member of the IBM Academy 
of Technology. 
 
Subhasish Mitra (F’13) is Professor of Electrical 
Engineering and of Computer Science at Stanford 
University, where he directs the Stanford Robust 
Systems Group and co-leads the Computation thrust 
of the Stanford SystemX Alliance. He is also a 
faculty member of the Stanford Neurosciences 
Institute. Before joining Stanford, he was a Principal 
Engineer at Intel Corporation. He received his Ph.D. 
in Electrical Engineering from Stanford. 
Prof. Mitra's research interests range broadly 
across robust computing, nanosystems, VLSI design, CAD, validation and 
test, and neurosciences. He, jointly with his students and collaborators, 
demonstrated the first carbon nanotube computer, which appeared on the 
cover of NATURE. The National Science Foundation presented this work as a 
Research Highlight to the United States Congress, and it also was highlighted 
as "an important, scientific breakthrough" by the BBC, Economist, EE Times, 
IEEE Spectrum, MIT Technology Review, National Public Radio, New York 
Times, Scientific American, Time, Wall Street Journal, Washington Post, and 
numerous others worldwide. His earlier work on X-Compact test compression 
has been key to cost-effective manufacturing and high-quality testing of a vast 
majority of electronic systems. X-Compact and its derivatives have been 
implemented in widely-used commercial Electronic Design Automation tools. 
Prof. Mitra's honors include the ACM SIGDA/IEEE CEDA A. Richard 
Newton Technical Impact Award in Electronic Design Automation (a test of 
time honor), the Semiconductor Research Corporation's Technical Excellence 
Award, the Intel Achievement Award (Intel’s highest corporate honor), and 
the Presidential Early Career Award for Scientists and Engineers from the 
White House (the highest United States honor for early-career outstanding 
scientists and engineers).  He and his students published several award-
winning papers at major venues: IEEE/ACM Design Automation Conference, 
IEEE International Solid-State Circuits Conference, IEEE International Test 
Conference, IEEE Transactions on CAD, IEEE VLSI Test Symposium, and 
the Symposium on VLSI Technology. At Stanford, he has been honored 
several times by graduating seniors “for being important to them during their 
time at Stanford.” 
Prof. Mitra served on the Defense Advanced Research Projects Agency's 
(DARPA) Information Science and Technology Board as an invited member. 
He is a Fellow of the ACM and the IEEE. 
