We present a first of its kind framework which overcomes a major challenge in the design of digital systems that are resilient to reliability failures: achieve desired resilience targets at minimal costs (energy, power, execution time, area) by combining resilience techniques across various layers of the system stack (circuit, logic, architecture, software, algorithm). This is also referred to as cross-layer resilience. In this paper, we focus on radiation-induced soft errors in processor cores. We address both single-event upsets (SEUs) and single-event multiple upsets (SEMUs) in terrestrial environments. Our framework automatically and systematically explores the large space of comprehensive resilience techniques and their combinations across various layers of the system stack (798 cross-layer combinations in this paper), derives cost-effective solutions that achieve resilience targets at minimal costs, and provides guidelines for the design of new resilience techniques. We demonstrate the practicality and effectiveness of our framework using two diverse designs: a simple, in-order processor core and a complex, out-of-order processor core. Our results demonstrate that a carefully optimized combination of circuit-level hardening, logic-level parity checking, and micro-architectural recovery provides a highly cost-effective soft error resilience solution for general-purpose processor cores. For example, a 50× improvement in silent data corruption rate is achieved at only 2.1% energy cost for an out-of-order core (6.1% for an in-order core) with no speed impact. However, selective circuit-level hardening alone, guided by a thorough analysis of the effects of soft errors on application benchmarks, provides a cost-effective soft error resilience solution as well (with ~1% additional energy cost for a 50× improvement in silent data corruption rate).
Introduction
This paper addresses the cross-layer resilience challenge for designing robust digital systems: given a set of resilience techniques at various abstraction layers (circuit, logic, architecture, software, algorithm), how does one protect a given design from radiation-induced soft errors using (perhaps) a combination of these techniques, across multiple abstraction layers, such that overall soft error resilience targets are met at minimal costs (energy, power, execution time, area)? Specific soft error resilience targets addressed in this paper are: Silent Data Corruption (SDC), where an error causes the system to output an incorrect result without error indication; and, Detected but Uncorrected Error (DUE), where an error is detected (e.g., by a resilience technique or a system crash or hang) but is not recovered automatically without user intervention.
The need for cross-layer resilience, where multiple error resilience techniques from different layers of the system stack cooperate to achieve cost-effective error resilience, is articulated in several publications (e.g., [Borkar 05, Cappello 14, Carter 10, DeHon 10, Gupta 14, Henkel 14, Pedram 12]).
There are numerous publications on error resilience techniques, many of which span multiple abstraction layers. These publications mostly describe specific implementations. Examples include structural integrity checking [Lu 82] and its derivatives (mostly spanning architecture and software layers) or the combined use of circuit hardening, error detection (e.g., using logic parity checking and residue codes) and instruction-level retry [Ando 03, Meaney 05, Sinharoy 11] (spanning circuit, logic, and architecture layers). Cross-layer resilience implementations in commercial systems are often based on "designer experience" or "historical practice." There exists no comprehensive framework to systematically address the cross-layer resilience challenge. Creating such a framework is difficult. It must encompass the entire design flow end-to-end, from comprehensive and thorough analysis of various combinations of error resilience techniques all the way to layout-level implementations, such that one can (automatically) determine which resilience technique or combination of techniques (either at the same abstraction layer or across different abstraction layers) should be chosen. However, such a framework is essential in order to answer important cross-layer resilience questions such as:
1. Is cross-layer resilience the best approach for achieving a given resilience target at low cost?
2. Are all cross-layer solutions equally cost-effective? If not, which cross-layer solutions are the best?
3. How do cross-layer choices change depending on application-level energy, latency, and area constraints?
4. How can one create a cross-layer resilience solution that is cost-effective across a wide variety of application workloads?
5. Are there general guidelines for new error resilience techniques to be cost-effective?
We present CLEAR (Cross-Layer Exploration for Architecting Resilience), a first of its kind framework, which addresses the cross-layer resilience challenge. In this paper, we focus on the use of CLEAR for radiation-induced soft errors in terrestrial environments.
Although the soft error rate of an SRAM cell or a flip-flop stays roughly constant or even decreases over technology generations, the system-level soft error rate increases with increased integration [Mitra 14, Seifert 10, 12]. Moreover, soft error rates can increase when lower supply voltages are used to improve energy efficiency [Mahatme 13, Pawlowski 14]. We focus on flip-flop soft errors because design techniques to protect them are generally expensive. Coding techniques are routinely used for protecting on-chip memories. Combinational logic circuits are significantly less susceptible to soft errors and do not pose a concern [Gill 09, Seifert 12]. We address both single-event upsets (SEUs) and single-event multiple upsets (SEMUs) [Lee 10, Pawlowski 14]. While CLEAR can address soft errors in various digital components of a complex System-on-a-Chip (including uncore components [Cho 15] and hardware accelerators), a detailed analysis of soft errors in all these components is beyond the scope of this paper. Hence, we focus on soft errors in processor cores.
To demonstrate the effectiveness and practicality of CLEAR, we explore 798 cross-layer combinations using ten representative error detection/correction techniques and four hardware error recovery techniques. These techniques span various layers of the system stack: circuit, logic, architecture, software, and algorithm (Fig. 1). Our extensive cross-layer exploration encompasses over 9 million flip-flop soft error injections into two diverse processor core architectures (Table 1): a simple in-order SPARC Leon3 core (InO-core) and a complex super-scalar out-of-order Alpha IVM core (OoO-core), across 18 benchmarks: SPECINT2000 [Henning 00] and DARPA PERFECT [DARPA]. Such extensive exploration enables us to conclusively answer the above cross-layer resilience questions:
1. For a wide range of error resilience targets, optimized cross-layer combinations can provide low cost solutions for soft errors.
2. Not all cross-layer solutions are cost-effective.
   a. For general-purpose processor cores, a carefully optimized combination of selective circuit-level hardening, logic-level parity checking, and micro-architectural recovery provides a highly effective cross-layer resilience solution. For example, a 50× SDC improvement (defined in Sec. 2.1) is achieved at 2.1% and 6.1% energy costs for the OoO- and InO-cores, respectively. The use of selective circuit-level hardening and logic-level parity checking is guided by a thorough analysis of the effects of soft errors on application benchmarks.
   b. When the application space can be restricted to matrix operations, a cross-layer combination of Algorithm Based Fault Tolerance (ABFT) correction, selective circuit-level hardening, logic-level parity checking, and micro-architectural recovery can be highly effective. For example, a 50× SDC improvement is achieved at 1.9% and 3.1% energy costs for the OoO- and InO-cores, respectively. But, this approach may not be practical for general-purpose processor cores targeting general applications.
   c. Selective circuit-level hardening, guided by a thorough analysis of the effects of soft errors on application benchmarks, provides a highly effective soft error resilience approach. For example, a 50× SDC improvement is achieved at 3.1% and 7.3% energy costs for the OoO- and InO-cores, respectively.
3. The above conclusions about cost-effective soft error resilience techniques largely hold across various application characteristics (e.g., latency constraints despite errors in soft real-time applications).
4. Selective circuit-level hardening (and logic-level parity checking) techniques are guided by the analysis of the effects of soft errors on application benchmarks. Hence, one must address the challenge of potential mismatch between application benchmarks vs. applications in the field, especially when targeting high degrees of resilience (e.g., 10× or more SDC improvement). We overcome this challenge using various flavors of circuit-level hardening techniques (details in Sec. 4).
5. Cost-effective resilience approaches discussed above provide bounds that new soft error resilience techniques must achieve to be competitive. It is, however, crucial that the benefits and costs of new techniques are evaluated thoroughly and correctly before publication.
Footnote 2: 11 SPEC / 7 PERFECT benchmarks for InO-cores and 8 SPEC / 3 PERFECT for OoO-cores (missing benchmarks contain floating-point instructions not executable by the OoO-core RTL model).
CLEAR Framework
Figure 1 gives an overview of the CLEAR framework. Individual components of the framework are discussed below.
Reliability Analysis
CLEAR is not merely an error rate projection tool; rather, reliability analysis is a component of the overall CLEAR framework.
We use flip-flop soft error injections for reliability analysis with respect to radiation-induced soft errors. This is because radiation test results confirm that injection of single bit-flips into flip-flops closely models soft error behaviors in actual systems [Bottoni 14, Sanda 08]. Furthermore, flip-flop-level error injection is crucial since naïve high-level error injections can be highly inaccurate [Cho 13]. For individual flip-flops, both SEUs and SEMUs manifest as single-bit errors. Our SEMU-tolerant circuit hardening and our layout implementations ensure this assumption holds for both the baseline and resilient designs.
We injected over 9 million flip-flop soft errors into the RTL of the processor designs using three BEE3 FPGA emulation systems and also using mixed-mode simulations on the Stampede supercomputer (TACC at The University of Texas at Austin) (similar to [Cho 13, Davis 09, Ramachandran 08, Wang 04]). This ensures that error injection results have less than a 0.1% margin of error at a 95% confidence level per benchmark. Errors are injected uniformly into all flip-flops and application regions, to mimic real-world scenarios.
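The relationship between the number of injections and the margin of error can be sketched as follows. This is a minimal illustration that assumes the standard normal-approximation interval for a proportion; the paper does not state which estimator was used, and the function names and the default z-value (1.96 for 95% confidence) are assumptions made for this sketch.

```python
import math


def margin_of_error(num_injections: int, p: float = 0.5, z: float = 1.96) -> float:
    """Half-width of a normal-approximation confidence interval for a proportion.

    p is the (estimated) fraction of injections with a given outcome, e.g. OMM.
    p = 0.5 is the worst case. z = 1.96 corresponds to 95% confidence.
    """
    if num_injections <= 0:
        raise ValueError("num_injections must be positive")
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must lie in [0, 1]")
    return z * math.sqrt(p * (1.0 - p) / num_injections)


def injections_needed(target_margin: float, p: float = 0.5, z: float = 1.96) -> int:
    """Smallest number of injections whose margin of error is at most target_margin."""
    if target_margin <= 0.0:
        raise ValueError("target_margin must be positive")
    return math.ceil(z * z * p * (1.0 - p) / (target_margin ** 2))


if __name__ == "__main__":
    # Margin for a hypothetical campaign of 10,000 injections, worst-case p.
    print(margin_of_error(10_000))
    # Injections needed for a 0.1% margin when the outcome rate is about 5%.
    print(injections_needed(0.001, p=0.05))
```

Under this approximation the required sample size depends strongly on the outcome rate p, so rare outcomes need far fewer injections than the worst case p = 0.5 for the same margin.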
The SPECINT2000 [Henning 00] and DARPA PERFECT [DARPA] benchmark suites are used for evaluation (footnote 2). The PERFECT suite complements SPEC by adding applications targeting signal and image processing domains. We chose the SPEC workloads since the original publications corresponding to the resilience techniques used them for evaluation. We ran benchmarks in their entirety.
Flip-flop soft errors can result in the following outcomes [Cho 13, Michalak 12, Sanda 08, Wang 04, 07]:
Vanished: normal termination and output files match error-free runs.
Output Mismatch (OMM): normal termination, but output files are different from error-free runs.
Unexpected Termination (UT): program terminates abnormally.
Hang: no termination or output within 2× the nominal execution time.
Error Detection (ED): an employed resilience technique flags an error, but the error is not recovered using a hardware recovery mechanism.
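The outcome categories above can be expressed as a small classifier. This is a sketch only: the argument names are hypothetical, and the order in which the checks are applied (ED first, then Hang, UT, OMM) is an assumption, since the text does not say how overlapping conditions are prioritized.

```python
from enum import Enum


class Outcome(Enum):
    VANISHED = "Vanished"
    OMM = "Output Mismatch"
    UT = "Unexpected Termination"
    HANG = "Hang"
    ED = "Error Detection"


def classify_run(*, error_flagged: bool, terminated: bool, exited_normally: bool,
                 output_matches: bool, elapsed_s: float, nominal_s: float) -> Outcome:
    """Map one error-injection run to an outcome category.

    error_flagged: a resilience technique flagged an error that was not
                   recovered by a hardware recovery mechanism.
    nominal_s:     execution time of the error-free run.
    """
    if error_flagged:
        return Outcome.ED
    # Hang: no termination within 2x the nominal execution time.
    if not terminated or elapsed_s > 2.0 * nominal_s:
        return Outcome.HANG
    if not exited_normally:
        return Outcome.UT
    if not output_matches:
        return Outcome.OMM
    return Outcome.VANISHED


if __name__ == "__main__":
    print(classify_run(error_flagged=False, terminated=True, exited_normally=True,
                       output_matches=False, elapsed_s=10.0, nominal_s=10.0))
```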
Using the above outcomes, any error that results in OMM causes SDC. Any error that results in UT, Hang, or ED causes DUE. Note that there are no ED outcomes if no error detection technique is employed. The resilience of a protected (new) design compared to an unprotected (original, baseline) design can be defined in terms of SDC improvement (Eq. 1a) or DUE improvement (Eq. 1b). The susceptibility of flip-flops to soft errors is assumed to be uniform across all flip-flops in the design (but this parameter is adjustable in our framework). Techniques that increase execution time or add flip-flops increase the susceptibility of the design to soft errors. To accurately account for this situation, we calculate, based on [Schirmeier 15], a correction factor γ (where γ ≥ 1), which is applied when computing the SDC and DUE improvements.
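The mapping from outcomes to SDC and DUE, and the role of the correction factor, can be illustrated with a short sketch. Eq. 1a and 1b are not reproduced in this text, so the form used here (ratio of baseline to protected error counts, divided by γ) is an assumption, as are the function names. The sketch also assumes that both designs receive the same number of injections, so that counts can be compared directly.

```python
import math
from collections import Counter
from typing import Iterable, Tuple

SDC_OUTCOMES = {"OMM"}
DUE_OUTCOMES = {"UT", "Hang", "ED"}


def count_sdc_due(outcomes: Iterable[str]) -> Tuple[int, int]:
    """Count injections that end in SDC (OMM) and in DUE (UT, Hang, ED)."""
    counts = Counter(outcomes)
    sdc = sum(counts[o] for o in SDC_OUTCOMES)
    due = sum(counts[o] for o in DUE_OUTCOMES)
    return sdc, due


def improvement(baseline_count: int, protected_count: int, gamma: float = 1.0) -> float:
    """Assumed form: (baseline count / protected count) / gamma, with gamma >= 1."""
    if gamma < 1.0:
        raise ValueError("gamma must be >= 1")
    if protected_count == 0:
        return math.inf
    return (baseline_count / protected_count) / gamma


if __name__ == "__main__":
    # Hypothetical campaigns of 1000 injections each.
    baseline = ["OMM"] * 100 + ["UT"] * 40 + ["Hang"] * 10 + ["Vanished"] * 850
    protected = ["OMM"] * 2 + ["ED"] * 5 + ["Vanished"] * 993
    b_sdc, b_due = count_sdc_due(baseline)
    p_sdc, p_due = count_sdc_due(protected)
    print(improvement(b_sdc, p_sdc))               # SDC improvement with gamma = 1
    print(improvement(b_due, p_due, gamma=1.25))   # DUE improvement with gamma = 1.25
```

With γ > 1 the reported improvement shrinks, which reflects the added susceptibility of designs that run longer or contain more flip-flops.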
Reporting SDC and DUE improvements allows our results to be agnostic to absolute error rates. Although we have described the use of error injection-driven reliability analysis, the modular nature of CLEAR allows us to swap in other approaches as appropriate (e.g., our error-injection analysis could be substituted with techniques like [Mirkhani 15b], once they are properly validated).
Execution Time Evaluation
Execution time is estimated using FPGA emulation and RTL simulation. Applications are run to completion to accurately capture the execution time of an unprotected design. We also report the error-free execution time impact associated with resilience techniques at the architecture, software, and algorithm levels. For resilience techniques at the circuit and logic levels, our design methodology maintains the same clock speed as the unprotected design.
Physical Design Evaluation
We used Synopsys design tools (Design Compiler, IC Compiler, and PrimeTime) [Synopsys] with a commercial 28nm technology library (with corresponding SRAM compiler) to perform synthesis, place-and-route, and power analysis. Synthesis and place-and-route (SP&R) was run for all configurations of the design (before and after adding resilience techniques) to ensure all constraints of the original design (e.g., timing and DRC) were met for the resilient designs. To account for tool artifacts (e.g., due to variations in RTL or optimization heuristics), separate resilient designs were generated based on error injection results for each individual application benchmark. SP&R was performed for each of these designs and the results were averaged to minimize these artifacts. Layouts were also carefully generated to mitigate the impact of SEMUs (Sec. 2.4).
Footnote 3: Research literature commonly considers γ=1. We report results using true γ values, but our conclusions hold for γ=1 as well (latter is optimistic).
Footnote 4: Circuit and logic techniques have tunable costs/resilience (e.g., for InO-cores, 5× SDC improvement using LEAP-DICE is achieved at 4.3% energy cost while 50× SDC improvement is achieved at 7.3% energy cost). This is achievable through selective insertion guided by error injection using application benchmarks.
Footnote 5: Software techniques are generated for InO-cores only since the LLVM compiler no longer supports the Alpha architecture.
Footnote 6: Some software assertions for general-purpose processors (e.g., [Sahoo 08]) suffer from false positives (i.e., an error is reported during an error-free run).
Resilience Library
We carefully chose ten error detection and correction techniques together with four hardware error recovery techniques. These techniques largely cover the space of existing soft error resilience techniques (explained in [Cheng 16]). The costs incurred by each resilience technique (and corresponding resilience improvements) when used as a standalone solution (e.g., an error detection / correction technique by itself or, optionally, in conjunction with a recovery technique) are presented in Table 2.
Circuit: The hardened flip-flops (LEAP-DICE, Light Hardened LEAP) in Table 3 are [...]
[...] We analyze monitor cores similar to [Austin 99]. For InO-cores, the monitor core is of the same order of size as the main core and is hence excluded from our study. For OoO-cores, the simpler monitor core can have lower throughput compared to the main core and thus stall the main core. We confirm (via IPC estimation) that our monitor core implementation does not stall the main core.
Software: Software assertions for general-purpose processors check program variables to ensure that their values are valid. We combine assertions from [...]
Recovery: We consider two recovery scenarios (details in [Cheng 16]): bounded latency, i.e., an error must be recovered within a fixed period of time after its occurrence, and unconstrained, i.e., where no latency constraints exist and errors are recovered externally once detected (no hardware recovery is required). Bounded latency recovery is achieved using one of the following hardware recovery techniques (Table 4): flush or Reorder Buffer (RoB) recovery (both of which rely on flushing non-committed instructions followed by re-execution) [Racunas 07, Wang 05]; instruction replay (IR) or extended instruction replay (EIR) recovery (both of which rely on instruction checkpointing to roll back and replay instructions) [Meaney 05]. EIR is an extension of IR with additional buffers required by DFC for recovery. Flush and RoB are unable to recover from errors detected after the memory write stage of InO-cores or after the reorder buffer of OoO-cores, respectively (these errors will have propagated to architecture-visible state). Hence, LEAP-DICE is used to protect flip-flops in these pipeline stages when using flush/RoB recovery.
Footnote 11: Costs are generated per benchmark. We report the average cost over all benchmarks. Relative standard deviation is 0.6-3.1%.
Cross-Layer Combinations
CLEAR uses a top-down approach to explore the cost-effectiveness of various cross-layer combinations. For example, resilience techniques at the upper layers of the system stack (e.g., ABFT correction) are applied before incrementally moving down the stack to apply techniques from lower layers (e.g., an optimized combination of logic parity checking, circuit-level LEAP-DICE, and micro-architectural recovery). This approach (example shown in Fig. 2) ensures that resilience techniques from various layers of the stack effectively interact with one another. Resilience techniques from the algorithm, software, and architecture layers of the stack generally protect multiple flip-flops (determined using error injection); however, a designer typically has little control over the specific subset protected. Using multiple resilience techniques from these layers can lead to situations where a given flip-flop may be protected (sometimes unnecessarily) by multiple techniques. At the logic and circuit layers, fine-grained protection is available since these techniques can be applied selectively to individual flip-flops (those not sufficiently protected by higher-level techniques).
Among the 798 cross-layer combinations explored using CLEAR, a highly promising approach combines selective circuit-level hardening using LEAP-DICE, logic parity, and micro-architectural recovery (flush recovery for InO-cores, RoB recovery for OoO-cores). Thorough error injection using application benchmarks plays a critical role in selecting the flip-flops protected using these techniques. Figure 3 and Heuristic 1 detail the methodology for creating this combination. If recovery is not needed (e.g., for unconstrained recovery), the "Harden" procedure in Heuristic 1 can be modified to always return false.
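The selection of flip-flops for this combination can be illustrated with a simplified sketch. It is not Heuristic 1 itself, which is only partly reproduced in this text. The sketch assumes a greedy pass over flip-flops ranked by their SDC contribution from error injection, assumes that a protected flip-flop no longer contributes SDCs, and ignores the correction factor γ. The names `FlipFlop`, `choose_protection`, and `select_protection`, and the `recoverable` field, are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class FlipFlop:
    name: str
    sdc_errors: int     # injected errors in this flip-flop that ended in OMM
    slack_ps: float     # timing slack on the paths through this flip-flop
    recoverable: bool   # True if flush/RoB recovery can undo errors detected here


def choose_protection(ff: FlipFlop, xor_tree_delay_ps: float) -> str:
    """Pick a technique for one flip-flop (assumed decision order)."""
    # Errors that flush/RoB recovery cannot undo are hardened instead of detected.
    if not ff.recoverable:
        return "LEAP-DICE"
    # Parity is used only if the slack covers the parity XOR-tree delay.
    if ff.slack_ps > xor_tree_delay_ps:
        return "parity"
    return "LEAP-DICE"


def select_protection(flip_flops: List[FlipFlop], target_improvement: float,
                      xor_tree_delay_ps: float) -> Dict[str, str]:
    """Greedily protect the most SDC-prone flip-flops until the target is met."""
    total = sum(ff.sdc_errors for ff in flip_flops)
    remaining = total
    plan: Dict[str, str] = {}
    for ff in sorted(flip_flops, key=lambda f: f.sdc_errors, reverse=True):
        if remaining == 0 or total / remaining >= target_improvement:
            break
        plan[ff.name] = choose_protection(ff, xor_tree_delay_ps)
        remaining -= ff.sdc_errors
    return plan


if __name__ == "__main__":
    ffs = [FlipFlop("a", 90, 500.0, True), FlipFlop("b", 8, 50.0, True),
           FlipFlop("c", 2, 500.0, False), FlipFlop("d", 0, 500.0, True)]
    print(select_protection(ffs, target_improvement=50.0, xor_tree_delay_ps=200.0))
```

For the unconstrained recovery scenario, where no hardware recovery is needed, the `recoverable` check would be dropped, which mirrors modifying the "Harden" procedure to always return false.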
For example, to achieve a 50× SDC improvement, the combination of LEAP-DICE, logic parity, and micro-architectural recovery provides a 1.5× and 1.2× energy savings for the OoO- and InO-cores, respectively, compared to selective circuit hardening using LEAP-DICE (Table 5). The relative benefits are consistent across benchmarks and over the range of SDC/DUE improvements. The overheads in Table 5 are small because we reported the most energy-efficient resilience solutions. Most of the 798 combinations are far costlier. More detailed results (e.g., inclusion of EDS) appear in [Cheng 16].
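The quoted savings appear consistent with the energy costs given in the introduction for a 50× SDC improvement (3.1% vs. 2.1% for the OoO-core, 7.3% vs. 6.1% for the InO-core). The short check below assumes that "energy savings" means the ratio of the two energy cost overheads; that interpretation is an assumption, not something the text states.

```python
# Energy costs (percent) for a 50x SDC improvement, taken from the introduction.
ENERGY_COST_PCT = {
    "OoO": {"hardening_only": 3.1, "combination": 2.1},
    "InO": {"hardening_only": 7.3, "combination": 6.1},
}


def energy_cost_ratio(hardening_only_pct: float, combination_pct: float) -> float:
    """Ratio of energy overheads: selective hardening alone vs. the combination."""
    return hardening_only_pct / combination_pct


if __name__ == "__main__":
    for core, costs in ENERGY_COST_PCT.items():
        ratio = energy_cost_ratio(costs["hardening_only"], costs["combination"])
        print(f"{core}-core: {ratio:.1f}x")
```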
[Table 5: costs (A, P, E) of LEAP-DICE + logic parity (+ flush recovery) and related combinations]
Heuristic 1 (fragment):
    if f has timing path slack greater than delay imposed by 32-bit XOR-tree (this implements low cost parity checking as explained in [Cheng 16])
        then return TRUE, else return FALSE
    end procedure
When the application space targets specific algorithms (e.g., matrix operations), a cross-layer combination of LEAP-DICE, parity, ABFT correction, and micro-architectural error recovery (flush/RoB) provides additional energy savings (Table 5). Since ABFT correction performs in-place error correction, no separate recovery mechanism is required for ABFT correction. For our study, we could apply ABFT correction to three of our PERFECT benchmarks: 2d_convolution, debayer_filter, and inner_product. When targeting DUE improvement, including ABFT correction provides no energy savings for the OoO-core. This is because ABFT correction performs checks at set locations in the program. For example, a DUE resulting from an invalid pointer access can cause an immediate program termination before a check is invoked.
Since most applications are not amenable to ABFT correction, the flip-flops protected by ABFT correction must also be protected by techniques such as LEAP-DICE or parity (or combinations thereof) for processors targeting general-purpose applications. This requires circuit hardening techniques (e.g., [Mitra 05, Zhang 06]) with the ability to selectively operate in an error-resilient mode (high resilience, high energy) when ABFT is unavailable, or in an economy mode (low resilience, low power mode) when ABFT is available. However, the overheads outweigh ABFT correction benefits (details in [Cheng 16]).
Application Benchmark Dependence
The most cost-effective resilience techniques rely on selective circuit hardening / parity checking guided by error injection using application benchmarks. This raises the question: what happens when the applications in the field do not match application benchmarks? We refer to this situation as application benchmark dependence.
To quantify this dependence, we randomly selected 4 (of 11) SPEC benchmarks as a training set, and used the remaining 7 as a validation set. Resilience is implemented using the training set and the resulting design's resilience is determined using the validation set. We used 50 training/validation pairs. Table 6 indicates that validated SDC improvement is generally underestimated. Fortunately, when targeting <10× SDC improvement, the underestimation is <4%. This is because the most vulnerable 10% of flip-flops (i.e., the flip-flops that result in the most SDCs or DUEs) are consistent across benchmarks. Since the number of errors resulting in SDC or DUE is not uniformly distributed among flip-flops, protecting these 10% of flip-flops results in a ~10× SDC improvement regardless of the benchmark. The vulnerabilities of the remaining 90% of flip-flops are more benchmark-dependent. The sensitivity may be minimized by training using additional benchmarks or through better benchmarks (e.g., [Mirkhani 15a]). An alternative approach is to apply our CLEAR framework using available benchmarks, and then replace all remaining unprotected flip-flops using LHL (Table 3, [Cheng 16]). This enables our resilient designs to meet (or exceed) resilience targets at <1.2% additional cost for SDC improvements >10×.
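The training/validation procedure can be sketched as below. The benchmark names are placeholders, and the sampling details (a seeded generator, pairs drawn independently so that a split could in principle repeat) are assumptions; the text only gives the set sizes and the number of pairs.

```python
import random
from typing import List, Sequence, Tuple


def make_splits(benchmarks: Sequence[str], train_size: int = 4, num_pairs: int = 50,
                seed: int = 0) -> List[Tuple[List[str], List[str]]]:
    """Draw random (training, validation) benchmark splits."""
    if not 0 < train_size < len(benchmarks):
        raise ValueError("train_size must leave at least one validation benchmark")
    rng = random.Random(seed)
    splits = []
    for _ in range(num_pairs):
        train = set(rng.sample(list(benchmarks), train_size))
        validation = [b for b in benchmarks if b not in train]
        splits.append((sorted(train), validation))
    return splits


if __name__ == "__main__":
    # Placeholder names standing in for the 11 SPEC benchmarks.
    spec = [f"spec_{i:02d}" for i in range(11)]
    for train, validation in make_splits(spec, num_pairs=3):
        print(train, validation)
```

Each pair would then drive one design: flip-flops are selected for protection using injection results from the training benchmarks only, and the achieved SDC improvement is measured on the validation benchmarks.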
(DUE results in [Cheng 16]).
Conclusions
CLEAR is a first of its kind cross-layer resilience framework that enables effective exploration of a wide variety of resilience techniques and their combinations across several layers of the system stack. Extensive cross-layer resilience studies using CLEAR demonstrate:
1. A carefully optimized combination of selective circuit-level hardening, logic-level parity checking, and micro-architectural recovery provides a highly cost-effective soft error resilience solution for general-purpose processors.
2. Selective circuit-level hardening alone, guided by thorough analysis of the effects of soft errors on application benchmarks, also provides a cost-effective soft error resilience solution (with ~1% additional energy cost for a 50× SDC improvement compared to the above approach).
3. Algorithm Based Fault Tolerance (ABFT) correction combined with selective circuit-level hardening (and logic-level parity checking and micro-architectural recovery) can further improve soft error resilience costs. However, existing ABFT correction techniques can only be used for a few applications; this limits the applicability of this approach in the context of general-purpose processors.
4. Based on our analysis, we can derive bounds on energy costs vs. degree of resilience (SDC or DUE improvements) that new soft error resilience techniques must achieve to be competitive (shown in Fig. 4).
5. It is crucial that the benefits and costs of new resilience techniques are evaluated thoroughly and correctly before publication. Detailed analysis (e.g., flip-flop-level error injection or layout-level cost quantification) identifies hidden weaknesses that are often overlooked.
While this paper focuses on soft errors in processor cores, cross-layer resilience solutions for accelerators and uncore components as well as other error sources (e.g., voltage noise) may have different tradeoffs and may require additional modeling and analysis capabilities. 
