Abstract: The effects of soft errors in processor cores have been widely studied. However, little has been published about soft errors in uncore components, such as the memory subsystem and I/O controllers, of a system-on-a-chip (SoC). In this paper, we study how soft errors in uncore components affect system-level behaviors. We have created a new mixed-mode simulation platform that combines simulators at two different levels of abstraction, and achieves 20 000× speedup over register-transfer-level-only simulation. Using this platform, we present the first study of the system-level impact of soft errors inside various uncore components of a large-scale, multicore SoC using the industrial-grade, open-source OpenSPARC T2 SoC design. Our results show that soft errors in uncore components can significantly impact system-level reliability. We also demonstrate that uncore soft errors can create major challenges for traditional system-level checkpoint recovery techniques. To overcome such recovery challenges, we present a new replay recovery technique for uncore components belonging to the memory subsystem. For the L2 cache controller and the dynamic random-access memory controller components of OpenSPARC T2, our new technique reduces the probability that an application run fails to produce correct results due to soft errors by more than 50× with 1.82% and 2.58% chip-level area and power impact, respectively.
I. INTRODUCTION
RADIATION-INDUCED errors pose a major challenge to building robust systems using complex system-on-chips (SoCs). Although the soft error rate at the static random-access memory (SRAM) cell or latch level stays roughly constant or even decreases over technology generations, the system-level soft error rate increases as more devices are integrated into SoCs [1]-[5]. In this paper, we focus on soft errors in flip-flops (flip-flop soft errors) because design techniques to protect them are generally expensive. By contrast, coding techniques are routinely used for protecting on-chip memories, and combinational logic circuits are significantly less susceptible to soft errors [4]-[6]. For our soft error resilience solution, we address both single-event upsets (SEUs) and single-event multiple upsets (SEMUs) [7], [8].
Uncore components,¹ such as cache controllers, dynamic random-access memory (DRAM) controllers, and I/O controllers, are increasingly important because their overall area footprint and power consumption in SoCs are comparable to those of processor cores [9], [10]. The need for studying soft errors in uncore components has been pointed out in [11] and [12]. While there are many studies on soft errors in processor cores (see [13]-[15]), few have studied soft errors in uncore components. The lack of such studies can be attributed to the difficulties in modeling large-scale SoCs (with multiple processor cores and multiple uncore components) for the following reasons.
1) Uncore studies should model the entire SoC because uncore components interact with processor cores and other uncore components. Modeling only a part of the system may not capture uncore behaviors accurately.
2) Studying system-level effects of soft errors requires real-world applications. This becomes more relevant in the context of cross-layer resilience, where multiple error resilience techniques from various layers of the system stack are combined to achieve cost-effective solutions [1], [16]-[18].
3) For statistically significant results, a large number of error injection samples are required. For example, more than 40 000 samples are required to achieve ±0.1% accuracy with 95% confidence when the observed outcome rate is 1%.²
Such requirements demand high-throughput error simulation or emulation platforms. Register-transfer-level (RTL) simulators that model detailed error behaviors are extremely slow. For example, RTL simulation of an out-of-order, superscalar processor core achieves less than a thousand cycles per second [20]. High-level simulators, on the other hand, achieve much faster simulation times [21]. However, naïvely injecting errors into abstracted high-level layers without adequate low-level details can result in highly inaccurate results (e.g., results in [13] for processor cores).
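The sample-size requirement in item 3) above can be sanity-checked with the textbook normal-approximation formula for a binomial proportion, n = z² p(1 − p) / e². The sketch below is illustrative only: the function name and the choice of the normal approximation are assumptions, not necessarily the method used by the authors. Under this approximation the result is on the order of the 40 000 samples quoted above (roughly 38 000); the exact figure depends on the interval method chosen.

```python
import math
from statistics import NormalDist


def required_samples(rate, margin, confidence=0.95):
    """Samples needed so that the confidence-interval half-width is <= margin.

    Uses the normal approximation to the binomial distribution
    (an assumption made for this sketch).
    """
    # Two-sided critical value, e.g. about 1.96 for 95% confidence.
    z = NormalDist().inv_cdf(0.5 + confidence / 2.0)
    return math.ceil(z * z * rate * (1.0 - rate) / (margin * margin))


# Observed outcome rate of 1%, target accuracy of +/-0.1%, 95% confidence.
n = required_samples(rate=0.01, margin=0.001)
print(n)
```

Because the required sample count scales with 1/e², halving the accuracy target roughly quadruples the number of injection runs, which is why simulation throughput dominates the feasibility of such studies.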
Existing uncore error studies are limited to very small designs (e.g., private L1 cache and bus controller in a design with a single processor core [22] ) or rely on fast high-level simulators without low-level details (e.g., error injections into primary input and output signals in [23] and [24] ). While radiation testing can be used to study overall soft error resilience of a design [25] , [26] , it is only available after the chip is produced. Also, individually quantifying vulnerabilities of various on-chip components can be difficult using radiation testing due to limited observability.
In this paper, we make the following contributions.
1) We present a simulation platform that is capable of simulating large-scale SoCs while modeling detailed flip-flop soft errors. With additional design effort, this platform achieves over 20 000× speedup compared to RTL-only simulation.
2) We present the first study of system-level effects of soft errors in uncore components in a large-scale OpenSPARC T2 SoC with 500 million transistors, eight processor cores, 64 hardware threads, and many uncore components [27]. We report quantified results on the effects of soft errors in L2 cache controllers (L2Cs), DRAM controllers, crossbar interconnects (CCX), and peripheral component interconnect (PCI) Express I/O controllers. We show that soft errors in uncore components can have significant reliability impact comparable to that of processor cores.
3) We show that traditional system-level checkpoint recovery techniques that generally target processor cores are inadequate for uncore components.
4) We present a new soft error recovery technique called quick replay recovery (QRR) for uncore components belonging to the memory subsystem. We demonstrate the effectiveness of QRR for the L2C and the DRAM controller in the OpenSPARC T2 design. QRR results in more than 50× improvement (i.e., reduction) of the probability that an application run fails to produce correct results due to soft errors in uncore components belonging to the memory subsystem; the corresponding chip-level area and power impact for all L2C and DRAM controller instances are 1.82% and 2.58%, respectively.
An earlier version of this paper was published in [28]. In this paper, we present the following additional contributions.
1) We perform a detailed analysis of the accuracy of our simulation platform. We demonstrate that the results obtained from our platform closely match those from RTL-only simulation.
2) We enhance our QRR technique to address both SEUs and SEMUs for uncore components.
3) We quantify the cost-effectiveness of our QRR technique at various error resilience improvement levels.
II. MIXED-MODE SOFT ERROR SIMULATION PLATFORM
To analyze the effects of uncore soft errors in large-scale SoCs, we created a mixed-mode platform that combines two simulation platforms (sometimes referred to as co-simulation in design validation literature [29]). The target uncore component is simulated using an RTL simulator to model soft error behaviors with low-level details, while the rest of the system is simulated using a high-level simulator. Our mixed-mode platform is different from existing co-simulation-based studies on error behaviors for the following reasons.
1) Li et al. [30] and Ejlali et al. [31] used co-simulation to study errors in small combinational logic blocks, such as the ALU or the decoder module with only a few hundred gates. To correctly model how soft errors behave inside an uncore component, we model an entire uncore component (more than 100K gates) using RTL.
2) Goswami [32] and Kalbarczyk et al. [33] profiled high-level effects resulting from low-level errors, and used the statistical information for quick error simulations. Profiled error behaviors may not reflect subsequent error propagation effects due to interactions with the rest of the system (e.g., a flip-flop error in a module may result in multiple erroneous interactions with other components [13]). We model how the error interacts by simulating its behavior at the entire chip level until all the effects from the error have been fully modeled using RTL and high-level simulations.
3) Wang et al. [15] used two simulators at two different levels of abstraction to simulate a processor core, but only one of the simulators is used at a given point in time. This requires simulating the entire system using RTL to model flip-flop error behaviors. In our platform, we selectively use RTL simulation only for the target uncore component to reduce low-level simulation overheads.
Field-programmable gate array (FPGA) emulation platforms can achieve faster speeds compared to RTL simulations while modeling low-level details [34] , [35] . However, to model an entire SoC, the design may need to be mapped onto multiple FPGA chips. This is because the area required for the FPGA implementation of a design can be an order of magnitude greater than an application-specific integrated circuit implementation (for the same technology generation) [36] . As a result, limited inter-FPGA I/O bandwidth can limit the overall emulation speed to only a few MHz [37] .
A. Mixed-Mode Platform Simulation Modes
Our platform operates in two modes. 1) Accelerated Mode [ Fig. 1(a) ]: All components on the chip, including processor cores and uncore components, are simulated using the Simics instruction-set simulator [21] with abstracted models (high-level models). These high-level models are created manually [21] , [38] . Creating a corresponding high-level model for each component requires additional effort for inspecting design specifications and analyzing actual RTL design. accuracy, the two simulators are synchronized during this mode to ensure transfer of packets between simulators at the correct cycle. Although the accelerated mode cannot simulate how a soft error behaves at the flip-flop level, high-level models can correctly simulate subsequent behaviors after an error fully propagates to the high-level uncore state (i.e., no flip-flop or SRAM array inside the uncore component, not included in the high-level uncore state, contains an error). Fig. 2 shows the flowchart of our uncore error injection methodology using our mixed-mode platform. The co-simulation mode is invoked only when soft error injection begins and terminated when the injected error disappears without any remaining error or when the remaining errors can be simulated using the accelerated mode.
B. Soft Error Injection Methodology
Phase 1 (Prepare for Error Injection): For each error injection run, an error injection cycle from high-level simulation (in accelerated mode) and a target flip-flop inside the target uncore component are randomly selected. The mixed-mode platform starts application execution in accelerated mode and simulates the application until the error injection cycle (Fig. 2, steps 1 and 2). This step is shortened by starting the simulation from one of the system state snapshots obtained from a one-time, error-free execution of the application in accelerated mode. If the error injection cycle is $C_i$ and the snapshots are created every $C_f$ cycles, the simulation is started using the snapshot created at cycle $C_s$, where $C_s = \lfloor C_i / C_f \rfloor \times C_f$. For our error injection runs, we created snapshots every 2 million cycles.
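The snapshot selection in Phase 1 can be sketched as follows. This assumes the starting snapshot is the most recent one taken at or before the injection cycle, i.e., the division by the snapshot interval is rounded down. The function names and the example injection cycle are hypothetical; only the 2-million-cycle interval comes from the text.

```python
SNAPSHOT_INTERVAL = 2_000_000  # C_f: snapshots every 2 million cycles (from the text)


def snapshot_cycle(injection_cycle, interval=SNAPSHOT_INTERVAL):
    """Return C_s, the latest snapshot cycle at or before injection cycle C_i."""
    return (injection_cycle // interval) * interval


def cycles_to_simulate(injection_cycle, interval=SNAPSHOT_INTERVAL):
    """Accelerated-mode cycles still needed to reach C_i after loading the snapshot."""
    return injection_cycle - snapshot_cycle(injection_cycle, interval)


# Hypothetical injection cycle, chosen only for illustration.
c_i = 7_350_000
print(snapshot_cycle(c_i))      # 6000000
print(cycles_to_simulate(c_i))  # 1350000
```

If injection cycles are drawn uniformly at random, the remaining distance averages about half the snapshot interval, i.e., roughly one million cycles for a 2-million-cycle interval.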
When RTL simulation starts ( Fig. 1(b) 3 ] . A warm-up period is required before error injection to correctly restore all microarchitectural states (e.g., flip-flops and small SRAM buffers) that have not been simulated by the high-level model (Fig. 2, step 4) . For the tested OpenSPARC T2 uncore components, a warm-up period longer than 1000 cycles is enough to reconstruct microarchitectural states (Section IV-A). For each error injection run, the length of the warm-up period is randomly selected (longer than 1000 cycles) to avoid injecting errors always after the same number of co-simulation cycles. The golden component is an identical copy of the target uncore component that receives the same input, but simulated without error injection. It is only used for simulation purposes to check when to end the co-simulation mode. The co-simulation mode is no longer needed if the comparison finds no mismatch or all mismatches satisfy one of the following conditions.
1) The mismatch can be directly mapped to high-level uncore states. The subsequent effects can be simulated by using the accelerated mode.
2) The mismatch does not cause any functional difference (e.g., corrupted data when the associated valid flag is not set; the value will not be used by the application in that case).
Phase 3 (Determine Application Outcome): The current uncore state in RTL is transferred back to the high-level model. Since the rest of the system is already being simulated in the high-level simulator, only the target uncore state is transferred. The platform continues to run the application to completion (Fig. 2, steps 10-12).
During phase 2, the platform monitors if an injected error has produced erroneous return packets to the processor cores by comparing return packets from the target uncore component to those of the golden uncore component [ Fig. 1(b) 6 ] . If no erroneous return packet has been detected and the transferred state from the target uncore component matches that from the golden uncore component, the error injection run will result in the same outcome as that of the error-free run. For those cases, the simulation can stop early without executing the rest of the application in phase 3 ( Fig. 2 , steps 8 and 9).
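The two decisions described above (when co-simulation can end, and when the run can stop early) can be expressed as simple predicates. This is a minimal sketch under assumed data structures: the `Mismatch` record and both function names are hypothetical and only mirror the conditions stated in the text, not the platform's actual implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Mismatch:
    """One target-vs-golden difference (hypothetical record for this sketch)."""
    maps_to_high_level_state: bool  # condition 1: representable in the high-level model
    functionally_masked: bool       # condition 2: e.g. data whose valid flag is not set


def cosim_can_end(mismatches):
    """Co-simulation ends when there is no mismatch, or every remaining
    mismatch satisfies condition 1 or condition 2."""
    return all(m.maps_to_high_level_state or m.functionally_masked
               for m in mismatches)


def can_stop_early(erroneous_return_packet_seen, transferred_state_matches_golden):
    """The run matches the error-free outcome if no erroneous return packet reached
    the cores and the transferred uncore state equals the golden state."""
    return (not erroneous_return_packet_seen) and transferred_state_matches_golden


print(cosim_can_end([]))                        # True
print(cosim_can_end([Mismatch(False, False)]))  # False
print(can_stop_early(False, True))              # True
```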
C. Mixed-Mode Simulation Performance
The effective simulation throughput of the mixed-mode platform is over $2 \times 10^{6}$ cycles/s, comparable to that of multi-FPGA platforms for large-scale SoCs [34], [35]. Compared to the RTL-only simulation of the OpenSPARC T2 design (up to 100 cycles/s only [39]), we achieve more than 20 000× speedup. By utilizing saved snapshots, steps 1 and 2 take only $10^{6}$ cycles on average. Steps 11 and 12 are executed for less than 1% of total error injection runs.⁴ Table II summarizes the performance of our mixed-mode platform for the OpenSPARC T2 design when simulating an application whose execution time is $L$ cycles. For applications with cycle lengths longer than $2.8 \times 10^{8}$, the throughput is over $2 \times 10^{6}$ cycles/s. Applications with shorter lengths achieve throughput values less than $2 \times 10^{6}$ cycles/s (e.g., the Radix application with $L = 1.2 \times 10^{8}$ in Section III-B achieves $10^{6}$ cycles/s); however, those applications require shorter simulation times.
III. SOFT ERROR INJECTION RESULTS FOR UNCORE COMPONENTS
Using the mixed-mode error injection platform, we performed soft error injection runs for uncore components in the OpenSPARC T2 design (Table III). We use flip-flop soft error injections for reliability analysis with respect to radiation-induced soft errors. This is because radiation test results confirm that injection of single bit-flips⁵ into flip-flops closely models soft error behaviors in actual systems [26], [40]. Furthermore, flip-flop-level error injection is crucial since naïve high-level error injections can be highly inaccurate [13]. In this paper, we study soft errors in the L2C, the DRAM controller (MCU), the CCX, and the PCI Express I/O controller (PCIe).⁶
A. Flip-Flops Targeted for Error Injection
Our soft error injection study excludes flip-flops that are already protected or inactive during normal operation. L2C, MCU, and PCIe have built-in error detection and recovery/error correction, such as error-correcting code (ECC) and cyclic redundancy check (CRC), to address errors inside memory arrays. Flip-flops storing ECC or CRC encoded data are effectively protected. The inactive flip-flops are dedicated to built-in self-test and redundant arrays to repair defective SRAM cells. For this paper, we assume a defect-free chip where these flip-flops are not utilized. Table IV shows the number of flip-flops targeted for error injection in the L2C, MCU, CCX, and PCIe modules.
5 Tolerating SEMUs is crucial for circuit-level soft error resilience techniques (e.g., BISER [41], [42] and LEAP-DICE [43], more details in Section VI-D). When SEMUs affect multiple nodes inside a single flip-flop, the overall effect still manifests as a single-bit error. The chances of SEMUs affecting multiple flip-flops are expected to be much smaller, especially for terrestrial applications [5], [44]. For situations where SEMUs across multiple flip-flops are prominent, our overall methodology can be combined with error injection simulations that inject multiple bit-flips per error injection run.
6 The network interface unit (NIU), the system interface unit (SIU), and the noncacheable unit (NCU) are excluded from this paper since RTL simulation of those components requires additional software modules to model off-chip transactions. Those software modules are available only for the Solaris OS on SPARC machines. Due to the lack of source code and detailed specification in the OpenSPARC T2 distribution, it is challenging to replicate their behavior in other environments to thoroughly test those modules.
7 Because the OpenSPARC T2 distribution does not provide RTL source of the PCIe, we used an industrial implementation of state-of-the-art PCIe generation 3 design to model soft errors in I/O controllers. 
B. Benchmark Applications
We use a wide range of multithreaded benchmark applications: 6 SPLASH-2 benchmarks [45], 9 PARSEC-2.1 benchmarks⁸ [46], and 3 Phoenix MapReduce benchmarks for shared-memory systems [47] (Table V). To fully utilize OpenSPARC T2's 64 hardware threads, we instantiated 64 threads for each benchmark application. For PCIe error injections, we modeled a situation where PCIe I/O is used to transfer the application's input data files. In our benchmark set, 12 applications have input data file characteristics as shown in Table V, and they are used for PCIe error injection runs. For each benchmark, we ran more than 40 000 error injection runs for each target uncore component. We assume that only one soft error happens for each application run.⁹ We classify application-level outcomes into the following five categories, which are also used in related studies [13], [15], [26].
1) Vanished: The application terminates normally, and at the end of the execution, the output files and all software-visible architectural states match those obtained from the error-free run.
2) Application Output Not Affected (ONA): The application terminates normally without any error indication, and, at the end of the execution, the output files from the erroneous run match those obtained from the error-free run. However, one or more remaining bits of the architectural state differ from those obtained from the error-free run.
3) Application Output Mismatch (OMM): The application terminates normally without any error indication. However, at the end of the execution, the output files of the application are different from those obtained from the error-free run. The remaining architectural state bits may or may not match those of the error-free run. This category is often referred to as silent data corruption as well [26], [48].
4) Unexpected Termination (UT): The application terminates abnormally with error indication. These include error reporting interrupts, e.g., divide-by-zero, invalid instruction, or memory access violation, and application-detected errors, e.g., exit() function calls with error codes.
5) Hang: The application does not produce any result or does not terminate within a specified timeout limit set to 2× the nominal execution time.
8 Facesim is not tested because the input file for simulation is not included in the benchmark suite. Raytrace from PARSEC is not tested because it produces no output files, and it is not possible to validate the application results.
9 The interval between flip-flop soft errors is usually much longer compared to the length of the target benchmark applications [11]. Actual failure rate of a given system can be derived by combining the technology-dependent soft error rate and the observed application-level outcome rates per injected soft error.
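The five outcome categories above can be summarized as a small decision procedure. The sketch below is an illustration rather than the authors' tooling: the function name, its boolean inputs, and the order in which the checks are applied are assumptions chosen to be consistent with the category definitions.

```python
from enum import Enum


class Outcome(Enum):
    VANISHED = "Vanished"
    ONA = "Application Output Not Affected"
    OMM = "Application Output Mismatch"
    UT = "Unexpected Termination"
    HANG = "Hang"


TIMEOUT_FACTOR = 2  # timeout limit is 2x the nominal execution time (from the text)


def classify(finished_within_timeout, error_indicated,
             outputs_match, arch_state_matches):
    """Map the observations of one error injection run to an outcome category."""
    if not finished_within_timeout:
        return Outcome.HANG      # no result, or timeout limit exceeded
    if error_indicated:
        return Outcome.UT        # trap, interrupt, or exit() with an error code
    if not outputs_match:
        return Outcome.OMM       # silent data corruption
    if not arch_state_matches:
        return Outcome.ONA       # outputs correct, leftover state differs
    return Outcome.VANISHED


print(classify(True, False, True, True).value)   # Vanished
print(classify(True, False, False, True).value)  # Application Output Mismatch
```

Checking for hangs and error indications first reflects that ONA, OMM, and Vanished are all defined only for runs that terminate normally.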
C. Application-Level Erroneous Outcome Rates
Our soft error simulation results demonstrate that uncore soft errors can have significant impact on the overall chip-level soft error rate. Fig. 3 shows the observed erroneous outcome rates for each of the uncore components across the benchmark applications and their arithmetic means. For example, in Fig. 3(a) , error injections into L2C for Barnes resulted in 0.42% of ONA, 0.02% of OMM, 1.34% of UT, 0.26% of Hang, and 97.96% of vanished outcomes.
As expected, most injected soft errors resulted in the vanished outcome type (over 97% of cases on average). Out of non-vanished outcomes, UT is the most frequent outcome type for L2C and CCX errors (0.69% on average). However, depending on the application, OMM rates are also significant. For example, the OMM rate for L2C is 0.3% for Fluidanimate and 0.42% for Streamcluster. PCIe error injection results show higher OMM rates (0.89% on average) compared to other components. Since PCIe transfers input data files in our simulations, soft errors in the PCIe likely affect data values. On the other hand, soft errors in other uncore components may corrupt control-related program variables, such as pointers or condition variables that may result in UT or Hang outcomes. Overall, the probability of having an erroneous application outcome (non-vanished) for a single flip-flop soft error is 1.4%, 1.7%, 2.2%, and 1.7% for L2C, MCU, CCX, and PCIe, respectively.
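As a worked check of how the per-outcome rates combine, the sketch below sums the non-vanished outcome rates quoted above for L2C error injections with Barnes. It also shows one first-order way the per-injection outcome rate could be combined with a technology-dependent raw soft error rate. The helper names are hypothetical, and the failure-rate formula is a simplifying assumption, not a model given in the paper.

```python
import math

# Outcome percentages quoted in the text for L2C error injections with Barnes.
barnes_l2c = {"ONA": 0.42, "OMM": 0.02, "UT": 1.34, "Hang": 0.26, "Vanished": 97.96}


def erroneous_rate(outcomes):
    """Percentage of injection runs that end in any non-vanished outcome."""
    return sum(rate for name, rate in outcomes.items() if name != "Vanished")


def failures_per_unit_time(raw_upset_rate_per_ff, num_flip_flops, erroneous_fraction):
    """Hypothetical first-order estimate: raw upset rate per flip-flop
    x number of flip-flops x fraction of upsets with an erroneous outcome."""
    return raw_upset_rate_per_ff * num_flip_flops * erroneous_fraction


# The five categories are exhaustive, so the percentages should sum to 100.
assert math.isclose(sum(barnes_l2c.values()), 100.0)
print(round(erroneous_rate(barnes_l2c), 2))  # 2.04
```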
The OMM outcome type is a serious reliability concern because, unlike the UT and the Hang outcome types, the user may not be aware that the application resulted in erroneous outputs (unless there are additional mechanisms to verify the correctness of outputs). Fig. 4 compares the observed OMM rates obtained from our uncore soft error injection runs to the OMM rates of processor core soft errors reported in the literature.¹⁰ The observed OMM rates of uncore soft errors are comparable to those of processor cores, showing that understanding soft error resilience is important for uncore components in the studied OpenSPARC T2 design.
IV. MIXED-MODE PLATFORM ACCURACY
Unlike RTL-only simulations or FPGA-based emulation, where every flip-flop in a system is modeled all the time, our mixed-mode platform models detailed flip-flop behaviors only during the co-simulation mode. Hence, it is important to quantify the accuracy of our approach.
A. Warm-Up Period of Co-Simulation Mode
To show that a 1000-cycle warm-up period is enough to restore the microarchitectural states not included in the high-level uncore model (before an error is injected at the flip-flop), we compared the logic value of each microarchitectural state bit of our mixed-mode simulation setup (during co-simulation mode) versus a simulation setup that runs the RTL co-simulation from the very beginning (i.e., full-co-simulation). In Fig. 5, the y-axis represents the percentage of bits in our mixed-mode setup (during co-simulation mode) that do not match the corresponding bit in the full-co-simulation mode (unless the bit in the full-co-simulation mode is still unknown). The results are averaged over 10 000 runs. After 1000 cycles into the co-simulation mode, the microarchitectural state of our mixed-mode platform closely matches that of the full-co-simulation (difference less than 0.2%).
[Figure: OMM rate of uncore components and processor cores (per instance). Error bars show the minimum and maximum values observed across benchmark applications (LEON: LEON3 SPARC [13], IVM: IVM ALPHA [13], Power: IBM POWER6 [26], and OR: OpenRISC [51]).]
B. Limited Co-Simulation Length
As discussed before, the co-simulation mode terminates early if the outcome of the application run is determined or if only the states modeled by high-level uncore models are erroneous. However, in a few cases, errors may persist in uncore microarchitectural states not modeled by high-level uncore models for extended periods of simulation time. For these cases, limiting co-simulation length is a tradeoff between simulation efficiency and accuracy of the obtained results. For our error injection study, only a small subset of soft errors that are injected into a small number of flip-flops result in such situations¹¹ past 100K cycles of co-simulation. Hence, we limit co-simulation length to 100K cycles. These flip-flops represent 3.7%, 2%, 3.4%, and 3.3% of all flip-flops in L2C, MCU, CCX, and PCIe, respectively (Fig. 6). Out of all error injection runs, at most 1.8% actually result in situations in which errors in uncore microarchitectural states not modeled by high-level uncore models persist past 100K co-simulation cycles (L2C: 1.8%, MCU: 0.4%, CCX: 1.5%, and PCIe: 1.4% of their respective total runs). Extending the co-simulation length beyond 100K cycles slows down simulation and has diminishing returns in further determining application outcomes (e.g., extending the co-simulation cycle limit by 10× to 1M cycles increases the co-simulation time tenfold, but the percentage of error injection runs for L2C with errors persisting beyond the cycle limit is reduced only from 1.8% to 1.4%). Since these errors might vanish if given more co-simulation cycles, we do not report them as erroneous outcomes in Figs. 3 and 4. However, one may conservatively choose to protect these flip-flops for error resilient design with additional costs,¹² as we did for QRR, described in Section VI.
C. Application-Level Outcomes Accuracy
We compare the observed outcome rates from our mixed-mode platform versus those obtained from RTL-only simulations. Due to the slow speed of RTL simulators, the comparison is limited to the FFT application with a smaller data set (1M cycles of execution time), running on four threads without an OS. ONA and OMM types are categorized into one outcome type because no specific output generation function (e.g., file write) is implemented in this setup. Fig. 7 compares the observed application-level erroneous outcome rates from the two setups obtained from 40 000 error injection samples each.¹³ The observed rates from our mixed-mode platform closely match (0.9× to 1.1×) those from the RTL-only simulations.
12 Less than 0.09% and 0.12% chip-level area and power overhead, respectively.
13 To obtain these simulation results, RTL-only simulations require more than 100 000 h of accumulated simulation time, whereas our mixed-mode platform requires less than 800 h of simulation time. We collected our simulation results using more than 200 computing nodes. Note that due to the short application length (1M cycles) used in this comparison, we observe a limited simulation performance improvement from our mixed-mode platform as we discussed in Section II-C.
V. SYSTEM-LEVEL CHECKPOINT RECOVERY CHALLENGES FOR SOFT ERRORS IN UNCORE COMPONENTS
Many error resilience solutions depend on system-level checkpoint recovery techniques to revert the system to an error-free state upon error detection [52]. One major challenge for ensuring correct recovery is the output commit problem that may incur long delays for system outputs. Since rollback recovery may not be able to invalidate committed outputs to the outside world,¹⁴ such as network packets or human interactions, outputs should be committed only when it is guaranteed that the system will not roll back to a state before the outputs were produced [52], [53]. To avoid long output delays introduced due to the output commit problem, two conditions must be satisfied.
1) Errors must be detected quickly.
2) The recovery operation should not revert the system to a very old state during rollback to an error-free state (i.e., the number of cycles rolled back, which is referred to as rollback distance [55] , should be short).
A. Long Error Detection Latency of Uncore Soft Errors
Error detection latency is the time elapsed from the cycle the soft error appears at a flip-flop to the cycle the error is detected by an error detection technique. Long error detection latency may cause the system to roll back to a very old state in order to revert the system to an error-free state. Error detection techniques at the software and processor architecture levels, such as error detection by duplicated instructions (EDDI) [56] and redundant multithreading (RMT) [57], can detect uncore errors only after a processor core sees an erroneous output from the uncore component. Therefore, the shortest error detection latency for such techniques is longer than the error propagation latency to processor cores, i.e., the duration from the cycle when a soft error affects an uncore component until the cycle when the uncore component produces an erroneous output to the processor cores.
For soft errors injected in the uncore components associated with the memory subsystem (L2C, MCU, and CCX) of OpenSPARC T2, we observed very long error propagation latencies (Fig. 8). For example, soft errors in L2C take 36 million cycles to propagate to processor cores on average. For processor cores, in contrast, errors can be detected quickly [58], [59].
Proactively loading and checking memory values from uncore components can reduce error propagation and detection latencies [60] , [61] . These quick error detection (QED) techniques are successfully used for validation purposes. Their bug detection capabilities can be utilized to detect bit-flips caused by soft errors as well. However, the software-only solution presented in [60] incurs significant performance overheads. The hardware-assisted solution presented in [61] reduces the performance overheads, but it still requires the applications to be transformed to insert QED checks prior to the execution. To ensure correct checking, these QED checks need to be executed in order; this requires additional barrier instructions for certain architectures with weak memory ordering (e.g., ARM, IA64, and POWER), since they can rearrange the order of store operations. These requirements do not hinder QED techniques from being effective validation techniques, but they may not be suitable for soft error detection during system operation. While QED techniques can be used for error detection, additional recovery mechanisms are required to provide a solution for soft error resilience.
B. Long Rollback Distance for Uncore Soft Errors
To ensure short rollback distance, the checkpointing mechanism has to create checkpoints frequently (short checkpoint interval). To frequently create checkpoints, the data size of each checkpoint has to be kept small due to the limited checkpoint storage size and bandwidth. Incremental checkpointing techniques reduce the data size of each checkpoint by saving logs of memory locations¹⁵ modified by processor cores between two checkpoints [62], [63].
For soft errors in uncore components, however, such techniques may not be adequate. For example, suppose that processor cores modified memory contents in the address range [X-Y] (and, hence, only those memory contents were included in an incremental checkpoint). However, a soft error in L2C might corrupt the content of memory address Z which is outside the range [X-Y] (due to an address-related error). In such a case, the recovery mechanism must roll back to an older state with an error-free log on address Z.
The required rollback distance to recover from corrupted values in an arbitrary memory location is determined by when a processor core last modified that memory location. Fig. 9 shows the cumulative distribution of required rollback distances resulting from soft errors in L2C and MCU. To cover more than 99% of soft errors resulting in memory corruptions, the required rollback distance can be longer than 400M cycles.
15 Other architectural states, such as register values, have much smaller size compared to the main memory state, and may not require incremental checkpointing.
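The relationship between corrupted addresses and rollback distance can be illustrated with a small model. It assumes an incremental checkpoint log can restore an address only from the interval in which a core last wrote it, so recovery must reach back at least to that write. The function name, the dictionary layout, and every cycle number below are hypothetical values for illustration, not data from the paper.

```python
def rollback_distance(detection_cycle, last_core_write_cycle, corrupted_addresses):
    """Cycles to roll back so that every corrupted address can be restored
    from a log entry created when a core last modified it."""
    return max(detection_cycle - last_core_write_cycle[addr]
               for addr in corrupted_addresses)


# Hypothetical last-write cycles: X and Y were written recently by the cores,
# while Z was last written long ago.
last_core_write_cycle = {"X": 990_000, "Y": 995_000, "Z": 100_000}

# Corruption inside the recently modified range needs only a short rollback.
print(rollback_distance(1_000_000, last_core_write_cycle, ["X", "Y"]))  # 10000

# An address-related uncore error that corrupts Z forces a much longer rollback.
print(rollback_distance(1_000_000, last_core_write_cycle, ["Z"]))       # 900000
```

The distance is set by the oldest last-write among the corrupted addresses, which is why a single stray write to a rarely modified location can dominate the recovery cost.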
VI. UNCORE SOFT ERROR RESILIENCE USING QUICK REPLAY RECOVERY
To overcome uncore soft error recovery challenges associated with system-level checkpoint recovery techniques (Section V), we present a new soft error resilience solution targeting uncore components. Uncore soft error resilience can be achieved by utilizing radiation-hardened flip-flops [41], [43], but the associated costs may not be optimal if radiation-hardened flip-flops are used as the only solution (Table VIII). Logic parity [16], [64] can detect errors with very short error detection latency; combined with an efficient recovery technique, logic parity can provide a low-cost error resilience solution. For processor cores, efficient error recovery techniques exist (e.g., by flushing instructions [57], [65], or by using instruction-level retry [66]). For uncore components, such mechanisms are inadequate for the following reasons.
1) As discussed in Section II-A, uncore components process request packets from processor cores. Those request packets need to be recreated for recovery. An uncore component may not be able to regenerate request packets by itself.
2) Requesting processor cores to resend request packets may not always be possible. For example, OpenSPARC T2 processor cores retain request packets only until L2C sends corresponding return packets. However, L2C may continue to process a request even after sending the return packet to the processor core. If a request results in a store miss, L2C may spend hundreds of cycles to fetch a cache line even after sending the return packet. The uncore operation may be affected by a soft error even after the processor core removes the request packet (upon receipt of the return packet).
3) Reverting processor cores to an older state, along with the erroneous uncore component, may result in cascaded rollbacks since each uncore component can interact with multiple processor cores and/or uncore components. For example, rolling back a processor core might require rolling back the uncore components the processor core interacted with, such as other instances of L2C. This, in turn, might require rolling back other processor cores that interacted with those uncore components.
To address these challenges, we present a new technique called QRR targeting uncore components (Fig. 10). QRR handles soft errors without engaging processor cores during recovery. It is applicable for uncore components that satisfy the following properties.
1) Executing requests multiple times in the same order does not change the outcome. For example, this property is maintained in storage components such as memory where duplicated operations in the same order do not change the outcome (for a detailed discussion regarding this property in the presence of requests accessing the same address, please refer to Section VI-C).
2) The uncore component should be able to resume its operation upon reset of its flip-flop contents. For flip-flop contents that should not be reset, such as flip-flops used for configuration bits (e.g., cache disable bit in L2C), radiation-hardening can be selectively used to protect those flip-flops (fewer than 3% for L2C and MCU) from soft errors.
In this paper, QRR works in conjunction with logic parity-based error detection (other error detection techniques with very short error detection latencies are also possible). It provides the following functionality.
1) Record request packets using a record table in the QRR controller. Packets are stored in the table when a new request packet is sent to the uncore component, and deleted from the table when the associated operation is completed by the uncore component (details in Section VI-A). Flip-flops in the QRR controller are protected using radiation hardening.
2) When logic parity detects an error, the QRR controller performs the recovery operation by resending the request packets in the record table to the uncore component (details in Section VI-B).
We evaluate QRR for the L2C and MCU modules, for which traditional checkpoint recovery techniques are inadequate due to the long error detection latencies (Section V). QRR is applicable to CCX and PCIe modules as well, which use the same uncore packet interface. For these uncore components that have much shorter error propagation latencies, other existing error detection techniques, such as EDDI or RMT, can be utilized. To create optimized soft error resilience solutions for these components, it is necessary to explore possible combinations of error resilience solutions using a systematic platform, such as [16].
Because MCU receives access requests through L2C only (e.g., cache line fill, eviction, or noncached direct DRAM access), recording and replaying L2C requests effectively covers MCU requests as well (see footnote 16). QRR incurs a small performance impact during recovery. For L2C, in the worst case when every replayed packet results in the longest operation (L2 cache load miss), the recovery takes fewer than 5000 cycles.
A. QRR Normal Operation
During normal operation, the QRR controller keeps track of request packets that are being processed in the uncore component using its record table (see footnote 17). QRR for an L2C instance maintains a total ordering of all incomplete requests to that instance based on their arrival order. This is a stricter ordering than the original design, which only needs to maintain the arrival ordering between requests to the same cache line in order to preserve the required SPARC total store ordering (TSO) [27]. Since each L2C and MCU instance exclusively serves disjoint memory address ranges, maintaining ordering at each L2C instance (bank) is sufficient (without affecting requests being processed by other instances).

Footnote 16: Since an MCU instance operates with two L2C instances in OpenSPARC T2, soft error detection in an MCU invokes recovery operation of two QRR controllers in the two L2C instances.
Footnote 17: In OpenSPARC T2, uncore packets coming from processor cores (PCX packets) have a fixed size (130 bits).
When requests are completed without errors, they no longer need to be stored by the QRR controller. A completion of a request is determined by monitoring return packets to the processor cores. For uncore requests that require post processing even after the return packet, additional monitoring may be required. For our QRR implementation targets (L2C and MCU), the only return packet type requiring additional monitoring is a store miss (described as an example in Section VI). In this case, the QRR controller waits until the cache miss handling logic (Miss Buffer) in L2C completes the operation before deleting the corresponding entry.
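The bookkeeping described above can be sketched as a small software model. This is only an illustration under stated assumptions: the class `RecordTable`, its method names, and the `Request` fields are hypothetical, and the actual QRR controller is a hardware structure whose implementation details are not given here.

```python
from collections import OrderedDict
from dataclasses import dataclass


@dataclass(frozen=True)
class Request:
    req_id: int   # hypothetical identifier for a request packet
    kind: str     # e.g. "load" or "store"
    addr: int


class RecordTable:
    """Toy record table: keeps incomplete requests in arrival order."""

    def __init__(self) -> None:
        self._entries: "OrderedDict[int, Request]" = OrderedDict()
        self._awaiting_miss: set = set()

    def on_request(self, req: Request) -> None:
        # Record the packet when it is sent to the uncore component.
        self._entries[req.req_id] = req

    def on_return_packet(self, req_id: int, store_miss: bool = False) -> None:
        # A return packet normally marks completion. For a store miss,
        # keep the entry until the miss handling logic finishes.
        if store_miss:
            self._awaiting_miss.add(req_id)
        else:
            self._entries.pop(req_id, None)

    def on_miss_buffer_done(self, req_id: int) -> None:
        # Delete a store-miss entry once the Miss Buffer has completed.
        if req_id in self._awaiting_miss:
            self._awaiting_miss.discard(req_id)
            self._entries.pop(req_id, None)

    def pending(self) -> list:
        # Arrival order doubles as the replay order.
        return list(self._entries.values())


table = RecordTable()
table.on_request(Request(1, "load", 0x40))
table.on_request(Request(2, "store", 0x80))
table.on_request(Request(3, "load", 0xC0))
table.on_return_packet(1)                    # completed, entry deleted
table.on_return_packet(2, store_miss=True)   # still being processed
pending_ids = [r.req_id for r in table.pending()]
```

After the two return packets, only the store miss and the outstanding load remain recorded, in their arrival order.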
B. QRR Replay Recovery Operation
When logic parity detects an error, QRR first disables write enable signals to data arrays (e.g., L2 cache tag, data, and DRAM) and valid signals of data ports connected to processor cores or other uncore components to prevent the error from corrupting those arrays and propagating to other components.
Propagating the parity error detection signal (individual error signal) to the QRR controller and invoking the recovery operation may take multiple cycles because signals from multiple parity detectors have to be aggregated. If a (detected) flip-flop error propagates to a data array or to another component in fewer cycles than are required to propagate the aggregated error signal to the QRR controller, then the soft error might corrupt the corresponding data array or the connected component before the recovery operation is invoked. This creates a nonzero chance of corrupt outputs being produced by the SoC. In our current implementation, we handled this issue by manually inspecting cases where such situations might arise and routing individual error signals to the corresponding data arrays or other components before the error signals are fully aggregated into the input for the QRR controller. These manually routed signals disable write enable signals and valid signals before the flip-flop error propagates to those arrays or components.
The next step is to assert the reset signal of the uncore component to clear its flip-flop values. Accepting new request packets from processor cores is postponed until recovery is completed. After reset, the QRR controller sends recorded packets to the uncore component in the recorded order until all recorded incomplete request packets are replayed. During the replay, we do not try to distinguish which packets are affected by the detected soft error. If logic parity detects a bit-flip in the target uncore component, we replay all recorded packets after the reset. This may increase the replay overhead since we have to replay unaffected request packets as well, but the maximum replay overhead for replaying all entries is less than 5000 cycles per detection. After the replay completes, the uncore component resumes normal operation by starting to accept new request packets from processor cores.
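The recovery sequence in this section can be summarized with the following sketch. `ToyUncore` and `qrr_recover` are hypothetical names for a software stand-in, not the hardware design; in particular, the point at which write enables are restored before the replay is an assumption made so that the toy model can re-execute writes.

```python
from typing import Dict, List, Tuple

Packet = Tuple[str, int, int]  # (op, address, value); value unused for reads


class ToyUncore:
    """Stand-in for an uncore component with flip-flop state and a data array."""

    def __init__(self) -> None:
        self.array: Dict[int, int] = {}    # data array (survives reset)
        self.in_flight: List[Packet] = []  # flip-flop state (cleared by reset)
        self.write_enable = True
        self.accepting = True

    def reset(self) -> None:
        self.in_flight.clear()

    def execute(self, pkt: Packet) -> None:
        op, addr, value = pkt
        if op == "write" and self.write_enable:
            self.array[addr] = value


def qrr_recover(uncore: ToyUncore, record_table: List[Packet]) -> int:
    """Run the recovery steps in order; return the number of replayed packets."""
    uncore.write_enable = False   # 1) block writes so the error cannot spread
    uncore.accepting = False      # 2) postpone new request packets
    uncore.reset()                # 3) clear flip-flop contents
    uncore.write_enable = True    # assumption: writes are restored for replay
    for pkt in record_table:      # 4) replay every recorded packet, in order
        uncore.execute(pkt)
    uncore.accepting = True       # 5) resume normal operation
    return len(record_table)


uncore = ToyUncore()
uncore.in_flight = [("write", 0x10, 5)]   # request interrupted by the error
replayed = qrr_recover(uncore, [("write", 0x10, 5), ("read", 0x20, 0)])
```

Note that the sketch replays the whole record table, mirroring the choice above not to identify which packets the soft error actually affected.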
C. QRR Correctness
QRR can successfully recover from errors for the following reasons.
1) For L2C and MCU, executing incomplete request packets again (replay) does not change the outcome. As long as multiple concurrent requests do not access the same address, replaying requests in a given order results in the same outcome. For example, executing requests "read from X," "write A to Y," and "read from Z" multiple times has the same resulting effects. If there are dependencies (i.e., multiple requests that access the same address) between concurrent requests, executing requests multiple times may result in a different outcome (e.g., "read from X" and "write A to X" when the original value of X is not A). If there are such dependencies between the request packets, as discussed in Section VI-A, the TSO memory ordering in OpenSPARC T2 is designed not to begin the execution of the following request until the previous one completes (i.e., only one of the requests is executed at a time). Therefore, the replay by QRR does not result in a situation where multiple dependent requests are executed multiple times. For example, suppose that requests R1, R2, and R3 have dependencies on each other, and L2C executes the packets in R1→R2→R3 order. If a soft error is detected when L2C is executing R2, the replay by QRR results in a re-execution of R2 only for the following reasons.
a) The execution of R1 is already completed at that point (removed from the record table) and not included in the replay.
b) The execution of R3 has not been started yet (since R2 is still in progress), and hence R3 is not the target of replay.
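The replay property argued above can be checked on a toy memory model. The `run` helper, the addresses, and the values below are illustrative assumptions; the sketch only contrasts independent requests with dependent ones and does not model the L2C pipeline.

```python
from typing import Dict, List, Tuple

Request = Tuple[str, int, int]  # (op, address, value); value ignored for reads


def run(memory: Dict[int, int], requests: List[Request]) -> Tuple[Dict[int, int], List[int]]:
    """Apply requests in order; return the final memory and the read results."""
    mem = dict(memory)
    reads = []
    for op, addr, value in requests:
        if op == "write":
            mem[addr] = value
        else:
            reads.append(mem.get(addr, 0))
    return mem, reads


initial = {0x1: 10, 0x2: 20, 0x3: 30}

# Independent requests: "read from X", "write A to Y", "read from Z".
independent = [("read", 0x1, 0), ("write", 0x2, 99), ("read", 0x3, 0)]
once_mem, once_reads = run(initial, independent)
twice_mem, twice_reads = run(initial, independent + independent)
same_outcome = once_mem == twice_mem and twice_reads == once_reads * 2

# Dependent requests: "read from X" then "write A to X", with X != A.
dependent = [("read", 0x1, 0), ("write", 0x1, 99)]
_, dep_once = run(initial, dependent)
_, dep_twice = run(initial, dependent + dependent)
read_changes = dep_twice[1] != dep_once[0]  # the repeated read sees A
```

Repeating the independent sequence leaves memory and read values unchanged, whereas repeating the dependent pair changes what the read returns. This is why the argument relies on dependent requests never being in flight (and hence never replayed) together.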
D. QRR SEMU Tolerance
For radiation-hardened flip-flops in QRR, we use LEAP-DICE and light hardened LEAP (LHL), which are specially designed to tolerate SEMUs through charge cancellation [7], [43]. These hardened flip-flops have been experimentally validated using radiation experiments on test chips fabricated in 90, 45, 40, 32, 28, 20, and 14 nm nodes in both bulk and SOI technologies [7], [43], [67]-[70].
For flip-flops protected using logic parity error detection in QRR, we minimize the effect of SEMUs through layouts that ensure a minimum spacing (the size of one flip-flop) between flip-flops checked by the same parity checker. This ensures that only one flip-flop, in a group of flip-flops checked by the same parity checker, will encounter an upset due to a single strike in our 28-nm technology in terrestrial environments [44]. Although a single strike could impact multiple flip-flops, since these flip-flops are checked by different checkers, the upsets will be detected. Since this absolute minimum spacing will remain constant, the relative spacing required between flip-flops will increase at smaller technology nodes, which may exacerbate the difficulty of implementation. Minimum spacing is enforced by applying design constraints during the layout stage. This constraint is important because even in large designs, flip-flops will still tend to be placed very close to one another. Table VI shows the distribution of distances that each flip-flop has to its next nearest neighbor in a baseline design (this does not correspond to the spacing between flip-flops checked by the same logic parity checker). As shown, the majority of flip-flops are actually placed such that they would be susceptible to an SEMU. After adding parity checkers with the design constraints, we see that no flip-flop, within a group checked by the same parity checker, is placed such that it will be vulnerable to an SEMU (Table VII).
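A layout check of the kind implied by this constraint could look like the sketch below. It is only an illustration: the function name `spacing_violations`, the use of Euclidean distance between flip-flop coordinates, and the arbitrary length units are assumptions, and a real flow would enforce the constraint inside the place-and-route tool.

```python
from itertools import combinations
from math import hypot
from typing import Dict, List, Tuple

Point = Tuple[float, float]


def spacing_violations(
    positions: Dict[str, Point],
    parity_group: Dict[str, int],
    min_spacing: float,
) -> List[Tuple[str, str]]:
    """List flip-flop pairs in the same parity group closer than min_spacing."""
    violations = []
    for a, b in combinations(sorted(positions), 2):
        if parity_group[a] != parity_group[b]:
            continue  # different checkers: a double upset is still detected
        (ax, ay), (bx, by) = positions[a], positions[b]
        if hypot(ax - bx, ay - by) < min_spacing:
            violations.append((a, b))
    return violations


# Hypothetical placement in arbitrary length units.
positions = {"ff0": (0.0, 0.0), "ff1": (1.0, 0.0), "ff2": (5.0, 0.0)}
bad_grouping = {"ff0": 0, "ff1": 0, "ff2": 1}   # neighbors share a checker
good_grouping = {"ff0": 0, "ff1": 1, "ff2": 0}  # neighbors use different checkers

bad = spacing_violations(positions, bad_grouping, min_spacing=2.0)
good = spacing_violations(positions, good_grouping, min_spacing=2.0)
```

The same placement passes or fails depending only on how flip-flops are assigned to parity checkers, which is the degree of freedom the constraint exploits.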
VII. QRR RESULTS
We implemented QRR for the L2C and MCU modules of OpenSPARC T2, and evaluated its effectiveness using the mixed-mode platform. From simulations using the same set of applications as in Section III-B, QRR successfully recovered from all errors injected into the flip-flops covered by logic parity for over 400 000 error injection runs for L2C and MCU (see footnote 18).
A. Radiation-Hardened Flip-Flops in QRR
To minimize the cost of parity-based error detection, we selectively use radiation hardening for the following flip-flops.
1) Flip-flops with timing slack shorter than the path delay of the XOR tree used to calculate a parity bit. In such a case, logic parity may not be a cost-effective solution since it is not possible to place the XOR tree without slowing down the clock or using additional flip-flops to split the XOR tree over multiple clock cycles.

For example, if the original design has a non-vanished outcome rate of 1% and the design with QRR reduces the rate to 0.1%, the resulting error resilience improvement is 10×.
The vulnerability (i.e., the likelihood of a soft error causing a non-vanished outcome) of individual flip-flops will vary across a design. Prioritizing protection of more vulnerable flip-flops achieves a target error resilience improvement while minimizing the number of flip-flops that need to be protected and the associated cost. Table VIII shows the targeted error resilience improvement goals and the percentage of flip-flops that need to be protected in order to achieve the target (the most vulnerable flip-flops are protected first; see footnote 22). Recall that radiation-hardened flip-flops reduce the soft error rate of a flip-flop by a certain factor only (e.g., LEAP-DICE reduces soft error rate by 1000× [43]) while QRR (with logic parity) successfully recovers from soft errors detected by logic parity. As a result, to achieve the same error resilience improvement, a solution implemented using only radiation-hardened flip-flops may require protecting slightly

Footnote 19: Since QRR does not introduce execution time overhead (except for the short recovery upon error detection) and the QRR controller flip-flops are protected using radiation hardening, we assume the flip-flop soft error rates are the same for the original design and the design with QRR.
Footnote 20: ONA, OMM, UT, and Hang.
Footnote 21: When calculating the error resilience improvement for QRR, flip-flops protected using logic parity error detection (and QRR recovery) are considered to have zero probability of having a non-vanished outcome (Section VII). Assuming 1000× soft error rate reduction of radiation-hardened flip-flops [43], flip-flops protected using radiation hardening are considered to have 0.001× probability of having a non-vanished outcome compared to that of the corresponding flip-flop in the original (i.e., without radiation-hardened flip-flops) design.
Table VIII presents the complete area and power overheads for the cross-layer solution QRR (e.g., combining circuit-level LEAP-DICE, logic-level parity, and architecture-level QRR) as compared to the solution of selectively applying circuit-level LEAP-DICE only (Selective LEAP-DICE) for several error resilience improvement goals, ranging from 5× to 500×. Area and power overheads are obtained after synthesis and place-and-route (see footnote 23). Compared to Selective LEAP-DICE, QRR incurs 15.56% and 10.39% lower area and power overhead, respectively, for a 50× error resilience improvement.
For error resilience improvement goals less than 10×, QRR has minimal savings (or can be costlier) when compared to the selective LEAP-DICE approach. This is because at these error resilience improvement goals, very few flip-flops need to be protected. Since implementing a QRR controller requires a fixed overhead regardless of the number of flip-flops covered by logic parity error detection, the cross-layer QRR solution is not as effectively amortized at these low improvement targets. However, at all other improvement targets, cross-layer QRR shows benefits over a Selective LEAP-DICE solution (more flip-flops require protection).
There is one final consideration that we need to take into account: the dependence of our results on the specific application(s) running (e.g., application sensitivity). For example, error injection runs may identify flip-flop 1 as resulting in a non-vanished outcome for application A, but resulting in vanished outcomes for application B. In such a case, an error resilience solution that protects flip-flop 1 may not be an optimal solution for application B. It is important to understand the impact of this sensitivity since resilience is implemented based on a set of applications that may be different from what is ultimately used in the field. We evaluate this application sensitivity by randomly selecting six benchmarks as a training set, and using the remaining 12 benchmarks as a validation set. For a given error resilience improvement goal, we select flip-flops that need to be protected based on the error injection results of the training set, and evaluate the achieved error resilience improvement using the validation set.

Footnote 22: We also protect flip-flops that require longer co-simulation cycles discussed in Section IV-B. This incurs less than 0.09% and 0.12% chip-level area and power overhead, respectively.
Footnote 23: The area overhead is obtained using Synopsys IC Compiler and a commercial 28-nm technology library. The power overhead is calculated using Synopsys PrimeTime and application execution traces. Chip-level overhead is estimated based on published data in related OpenSPARC T2 studies [10], [71].
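The training and validation procedure can be illustrated with the following sketch. All numbers are made up for illustration, and the helper names `improvement` and `top_fraction` are hypothetical. As a simplification, protected flip-flops are assumed to contribute no non-vanished outcomes.

```python
from typing import Dict, Set


def improvement(outcomes: Dict[str, int], protected: Set[str]) -> float:
    """Ratio of non-vanished outcomes before vs. after protection."""
    total = sum(outcomes.values())
    residual = sum(n for ff, n in outcomes.items() if ff not in protected)
    return float("inf") if residual == 0 else total / residual


def top_fraction(outcomes: Dict[str, int], fraction: float) -> Set[str]:
    """Pick the given fraction of flip-flops, most vulnerable first."""
    ranked = sorted(outcomes, key=outcomes.get, reverse=True)
    return set(ranked[: round(len(ranked) * fraction)])


# Hypothetical non-vanished outcome counts per flip-flop. ff5 looks harmless
# under the training applications but not under the validation applications.
training = {"ff1": 40, "ff2": 30, "ff3": 20, "ff4": 10, "ff5": 0, "ff6": 0}
validation = {"ff1": 35, "ff2": 25, "ff3": 20, "ff4": 5, "ff5": 15, "ff6": 0}

protected = top_fraction(training, 0.5)         # chosen from training data only
trained = improvement(training, protected)      # what the training set predicts
validated = improvement(validation, protected)  # what other applications see
```

Here the selection appears to give 10× on the training data but only 5× on the validation data, because a flip-flop that never mattered in training turns out to matter for other applications.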
From Table VIII, we see that the actual (validated) error resilience improvement is typically lower than the targeted (trained) error resilience improvement, and the shortfall can be significant for higher error resilience improvement goals (e.g., improvements greater than 10×). For example, when we target 50× error resilience improvement, we select the top 33.58% of most vulnerable flip-flops to protect based on the error injection results using the applications in the training set. However, if we evaluate the error resilience improvement using the applications in the validation set, the achieved improvement is only 15.1×. Possible methods of reducing application sensitivity include incorporating more training data, anticipating worst-case behavior, or designing better benchmarks [72]. An alternative approach is to augment protection using LHL (a lightly-hardened LEAP flip-flop) [43] by protecting the design as normal using cross-layer QRR and subsequently replacing all flip-flops left unprotected using LHL [16]. Although LHL can only provide up to 4× soft error rate reduction, our combined approach using LHL augmentation enables our error resilience solutions to meet the improvement targets up to 50× at very low additional cost (e.g., with LHL, 60.6× error resilience improvement is achieved when targeting 50× error resilience improvement at 0.99% additional power cost at chip-level). Table IX shows the achieved error resilience improvement as well as the area and power costs for both QRR and Selective LEAP-DICE for various error resilience improvement goals ranging from 5× to 500×.
C. QRR Overheads Breakdown
To show the details of the area and power overheads, Table X presents the breakdown of the overheads associated with the QRR implementation for L2C and MCU (see footnote 24). For this breakdown, QRR protects all flip-flops (i.e., flip-flops subject to error injection in Table IV) in L2C and MCU. In this case, the majority of the overheads come from logic parity or flip-flop hardening. The total area and power overheads of QRR are 31.87% and 34.83% at each uncore component level (2.31% and 4.48% at chip-level for all L2C and MCU instances).
VIII. CONCLUSION
Studying the application-level effects of uncore soft errors in large-scale SoCs is important but difficult. Our new mixed-mode simulation platform enables us to accurately and effectively model uncore soft errors while achieving 20 000-fold speedup compared to RTL simulations. This platform enabled us to characterize, for the first time, system-level effects of soft errors in various uncore components of a large and industrial-grade multicore SoC.
Our results show that uncore soft errors can have significant impact on the overall reliability of the studied OpenSPARC T2 multicore SoC. Hence, resilience techniques to overcome uncore soft errors are required. However, uncore soft errors pose several challenges for traditional system-level checkpointing techniques that are generally effective for processor cores. Our QRR approach overcomes these challenges for uncore components in the memory subsystem of OpenSPARC T2. We demonstrate the effectiveness of QRR for L2C and MCU in the OpenSPARC T2 design. QRR achieves 50× error resilience improvement for L2C and MCU with the chip-level area and power impact of 1.82% and 2.58%, respectively.
