Technology-scaling trends lead to smaller, faster transistors and lower supply voltages, but they also increase susceptibility to transient faults and degrade reliability, even in commodity microprocessors. To take advantage of the high transistor counts afforded by technology scaling, the microprocessor industry is adopting chip multiprocessors (CMPs). The IBM Power 4, for example, is a two-processor CMP. CMPs are building blocks in server-class machines, for which reliability is a key concern. This article focuses on hardware-assisted transient-fault recovery for CMPs.
Addressing CMP reliability issues, Mukherjee, Kontz, and Reinhardt describe the chip-level redundantly threaded multiprocessor scheme (CRT) for transient-fault detection [1]. Several other schemes provide transient-fault tolerance by replicating an application into two communicating threads, leading and trailing, and comparing their values. These schemes include active-stream/redundant-stream simultaneous multithreading (AR-SMT), slipstreaming, simultaneous and redundant threading (SRT), and simultaneous and redundant threading with recovery (SRTR) [2] [3] [4] [5]. The "Related work" sidebar describes these and other schemes in further detail.
Building on CRT, we propose a transient-fault recovery scheme for CMPs called the chip-level redundantly threaded multiprocessor with recovery (CRTR), which also uses leading and trailing threads. CRTR recovers from single transient faults, except when a fault corrupts a register file that is not protected by error-correcting code (ECC); in that case, CRTR still guarantees detection.
In CRTR and CRT, the leading and trailing threads execute on different processors to achieve load balancing and to reduce the probability that a fault will corrupt both threads. Consequently, to check for faults, the CMP requires interprocessor communication for comparing the leading and trailing values. Because layout constraints dictate that processors in a CMP cannot be physically close, the interprocessor communication paths' latency and bandwidth are critical performance factors.
Transient-fault detection in CMPs
CRT borrows its detection scheme from SMT-based SRT processors and applies the scheme to CMPs. CRT replicates an application into two communicating threads, one executing ahead of the other, and compares the results of two redundant executions to detect transient faults. Figure 1 shows a four-CPU CRT/CRTR multiprocessor, which is running leading thread i on CPU i and trailing thread i on CPU i + 1 (modulo 4). The interprocessor fault-tolerance communication paths shown in Figure 1 are used to send fault-tolerance information between processors.
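The thread placement described for Figure 1 can be written down in a few lines. The following is a minimal sketch, not part of the original design; the function name `crt_placement` and its `num_cpus` parameter are hypothetical and serve only to illustrate the modulo mapping.

```python
def crt_placement(num_cpus: int = 4) -> dict[int, tuple[int, int]]:
    """Map redundant pair i to (leading CPU, trailing CPU).

    Leading thread i runs on CPU i; trailing thread i runs on
    CPU (i + 1) modulo num_cpus, as in the four-CPU configuration.
    """
    return {i: (i, (i + 1) % num_cpus) for i in range(num_cpus)}


if __name__ == "__main__":
    for pair, (lead_cpu, trail_cpu) in crt_placement(4).items():
        print(f"pair {pair}: leading on CPU {lead_cpu}, trailing on CPU {trail_cpu}")
```

With this mapping, every CPU hosts exactly one leading thread and one trailing thread from a different pair, which is what spreads the load across processors.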
Because detection depends on replication, the extent of application replication is important. CRT replicates register values (in each processor's register file) but not memory values. CRT's leading thread commits stores only after checking, guaranteeing correct memory. CRT compares only the stores and uncached loads, not the register values, of the two threads. Because an incorrect value caused by a fault propagates through computations and is eventually consumed by a store, checking only stores suffices for detection; other instructions commit without checking.
The leading thread places its committed store values and addresses in a store buffer (StB). CRT compares the trailing thread's store values and addresses against the StB entries to determine whether a fault has occurred. Only one copy of the checked store reaches the cache hierarchy. Because data in the cache hierarchy is not replicated, the cache hierarchy needs another form of protection, such as ECC.
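The store check can be illustrated with a small software model. This is a sketch under stated assumptions: the class and method names (`StoreBuffer`, `leading_commit`, `trailing_check`) are hypothetical, stores are assumed to be compared in program order, and a plain dictionary stands in for the ECC-protected cache hierarchy.

```python
from collections import deque
from dataclasses import dataclass


@dataclass(frozen=True)
class Store:
    addr: int
    value: int


class StoreBuffer:
    """Toy model of the leading thread's store buffer (StB)."""

    def __init__(self, capacity: int = 20):
        self.capacity = capacity
        self.pending = deque()  # leading stores awaiting the check
        self.memory = {}        # stand-in for the cache hierarchy

    def leading_commit(self, store: Store) -> bool:
        """Queue a leading store; False means the StB is full (stall)."""
        if len(self.pending) >= self.capacity:
            return False
        self.pending.append(store)
        return True

    def trailing_check(self, store: Store) -> bool:
        """Compare a trailing store with the oldest leading store.

        On a match, one copy of the store is released to memory.
        On a mismatch, nothing is written and a fault is reported.
        """
        if not self.pending:
            raise RuntimeError("trailing store has no leading counterpart")
        if self.pending[0] != store:
            return False
        checked = self.pending.popleft()
        self.memory[checked.addr] = checked.value
        return True


if __name__ == "__main__":
    stb = StoreBuffer()
    stb.leading_commit(Store(0x100, 7))
    print(stb.trailing_check(Store(0x100, 7)), stb.memory)
```

Because a mismatching store never leaves the buffer, memory holds only checked values, which is the property the detection scheme relies on.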
Replicating cached loads is problematic because an external agent (such as another processor during multiprocessor synchronization) can modify memory locations between the time the leading thread loads a value and the time the trailing thread tries to load the same value. The two threads might diverge if the loads return different data. CRT allows only the leading thread to access the cache and uses the load value queue (LVQ) to hold the leading load values and addresses. The trailing thread loads from the LVQ instead of repeating the load from the cache, after comparing load addresses to ensure that no fault has occurred.
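A short model shows why forwarding load values through the LVQ keeps the two threads consistent even when memory changes between the two loads. This is an illustrative sketch only; `LoadValueQueue`, `FaultDetected`, and the dictionary used as memory are assumptions, and loads are assumed to be matched in program order.

```python
from collections import deque


class FaultDetected(Exception):
    """Raised when leading and trailing load addresses disagree."""


class LoadValueQueue:
    """Toy model of the load value queue (LVQ)."""

    def __init__(self):
        self.entries = deque()  # (address, value) pairs from the leading thread

    def leading_load(self, addr: int, memory: dict) -> int:
        """Only the leading thread reads memory; it records what it saw."""
        value = memory[addr]
        self.entries.append((addr, value))
        return value

    def trailing_load(self, addr: int) -> int:
        """The trailing thread checks its address, then reuses the value."""
        lead_addr, value = self.entries.popleft()
        if lead_addr != addr:
            raise FaultDetected(f"load address {addr:#x} != {lead_addr:#x}")
        return value


if __name__ == "__main__":
    memory = {0x200: 5}
    lvq = LoadValueQueue()
    leading_value = lvq.leading_load(0x200, memory)
    memory[0x200] = 6  # an external agent modifies the location
    trailing_value = lvq.trailing_load(0x200)
    print(leading_value == trailing_value)
```

Even though the location changes after the leading load, the trailing thread receives the value the leading thread observed, so the threads do not diverge.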
A key optimization in SMT-based SRT is that the leading thread runs ahead of the trailing thread by an amount called the slack (for example, 256 instructions). In addition, the leading thread provides its branch outcomes to the trailing thread via the branch outcome queue (BOQ). The slack and the communication of branch outcomes hide the leading thread's memory latencies and avoid branch mispredictions in the trailing thread. As a result of the slack, by the time the trailing thread needs a load value or branch outcome, the leading thread has already produced it. These optimizations also apply to CMP-based CRT.
SRT assumes that the processor performs uncached accesses nonspeculatively. SRT synchronizes uncached accesses from the leading and trailing threads, compares the addresses, and replicates the load data. SRT assumes that code does not modify itself, although self-modifying code in regular SMTs requires thread synchronization and cache coherence, which can be extended to keep the leading and trailing threads consistent. For input replication of external interrupts, SRT suggests forcing the threads to the same execution point and then delivering the interrupt synchronously to both threads. These assumptions are also valid for CMP-based CRT.
Transient-fault recovery in CMPs
CRTR enhances CRT to include transient-fault recovery. Like CRT, CRTR assumes SMT processors in the CMP and uses the configuration illustrated in Figure 1.
Because of interprocessor communication latency, detection and recovery schemes need a long slack to hide the communication delay between leading and trailing threads. However, long slack causes a problem for recovery schemes. Detection schemes such as SRT and CRT commit register values before checking for faults, but they guarantee fault detection and avoid memory corruption by checking stores before committing. Because threads don't wait for each other to commit, a long slack doesn't stall commits. Recovery schemes, though, must be more careful because committing unchecked values can lead to a loss of recoverability.
To allow for long slack while retaining recoverability, CRTR uses an asymmetric commit strategy, which AR-SMT also uses. CRTR enables long slack by allowing the leading thread to commit register updates before checking, so that long slacks don't delay leading-thread commits. However, CRTR allows the trailing thread to commit register updates only after checking, so that the scheme can use the trailing thread's register state for recovery. In contrast, CRT allows both threads to commit register updates before checking, eliminating the possibility of using the trailing thread for recovery. As in CRT, CRTR commits stores only after checking. Because stores are relatively infrequent, the slack can increase without stalling leading-thread commits.
Unlike CRTR, the SMT-based recovery scheme SRTR commits register values in either thread only after checking the values. SRTR uses a moderate slack (for example, 32 instructions) and reduces stalls by checking within the leading instructions' complete-to-commit times. Because both leading and trailing threads run on the same SMT processor in SRTR, the complete-to-commit times are sufficient to hide interthread communication. In CMPs, however, the complete-to-commit times are insufficient to hide interthread communication.
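The commit policies of the three schemes compared above can be summarized as data. The sketch below restates only what the text says; the names `CommitPolicy`, `POLICIES`, and the two helper functions are hypothetical labels introduced for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CommitPolicy:
    leading_regs_wait_for_check: bool
    trailing_regs_wait_for_check: bool
    stores_wait_for_check: bool


# Policies as described in the text for each scheme.
POLICIES = {
    "CRT": CommitPolicy(False, False, True),
    "CRTR": CommitPolicy(False, True, True),
    "SRTR": CommitPolicy(True, True, True),
}


def keeps_checked_register_state(policy: CommitPolicy) -> bool:
    """Recovery needs a register state built only from checked values."""
    return policy.trailing_regs_wait_for_check


def slack_delays_leading_register_commits(policy: CommitPolicy) -> bool:
    """If leading register commits wait for the check, long slack stalls them."""
    return policy.leading_regs_wait_for_check


if __name__ == "__main__":
    for name, policy in POLICIES.items():
        print(
            f"{name}: recoverable={keeps_checked_register_state(policy)}, "
            f"long slack stalls leading={slack_delays_leading_register_commits(policy)}"
        )
```

Of the three, only the asymmetric CRTR policy both keeps a checked register state and lets leading register commits proceed without waiting for the check.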

MICRO TOP PICKS
IEEE MICRO
CRTR uses the long slack to hide interprocessor communication latency between the leading and trailing threads. In addition to communicating branch outcomes, load addresses, load values, store addresses, and store values, CRTR also communicates register values. It employs sender-initiated (that is, leading-thread-initiated) communication and queues up the values at the processor running the trailing thread. Thus, if the slack is appropriately long, the leading thread's values reach the trailing thread before it needs them, despite incurring a communication delay.
Implementation issues
CRTR poses a few implementation issues. First, because the leading and trailing threads execute on different processors, accurately estimating the current slack is difficult. To address this issue, CRTR approximates the slack value, basing the approximation on the number of values queued up for sending on the leading side and the number of values waiting for consumption on the trailing side.
Second, sending values at commit is problematic. The processor writes register values back to the register file at instruction completion, and the instruction doesn't have the value at commit. Retrieving the value from the register file would add significantly to the register file's bandwidth pressure. SRTR solves this problem by using a separate structure called the register value queue (RVQ) to buffer the values from completion until commit.
Third, CRTR must match the values from the threads for checking. Because most values are in program order, this matching is easy. Although trailing loads can execute out of program order, they are dispatched in program order. CRTR uses the dispatch order of trailing loads to match them to leading-load values.
Fourth, in CRTR, as in CRT, the only communication from the trailing thread back to the leading thread is the store-checking result so that leading stores can commit. Because of interprocessor communication delay and slack, leading stores wait for the processor to complete trailing stores and perform checking. In modern processors, loads search the StB to honor memory dependencies, so the StB cannot be large. Consequently, there is some pressure on the leading thread's StB.
We present additional CRTR implementation details in another publication [6].
Recovery using the trailing-thread state
In CRTR, the trailing processor preserves the faulting instruction's program counter (PC) value so that execution can restart from that PC value. The exception handler saves the trailing register state and PC to the CMP shared memory and launches a restoring thread on the leading processor to load the saved register state and PC value from memory. To ensure that faults don't corrupt the saving or restoring processes themselves, the restoring thread redundantly saves the register and PC state loaded in the leading processor to a different set of memory locations. The handler then compares those locations with the trailing processor state. If the comparison fails, CRTR performs the saving and restoring again.
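The save, restore, and compare loop described above can be sketched in software. This is a toy model under stated assumptions: `recover`, `faithful_copy`, and the `max_attempts` bound are hypothetical (the text does not specify a retry limit), a dictionary stands in for the CMP shared memory, and the `copy` parameter exists only so that a fault during copying can be simulated.

```python
def faithful_copy(state: dict) -> dict:
    """Fault-free copy of a {"pc": int, "regs": tuple} state."""
    return {"pc": state["pc"], "regs": tuple(state["regs"])}


def recover(trailing_state: dict, shared_mem: dict,
            copy=faithful_copy, max_attempts: int = 3):
    """Rebuild the leading processor's state from the trailing thread.

    Returns (restored leading state, number of attempts used).
    """
    for attempt in range(1, max_attempts + 1):
        # Exception handler saves the trailing registers and PC.
        shared_mem["saved"] = copy(trailing_state)
        # Restoring thread on the leading processor loads the saved state.
        leading_state = copy(shared_mem["saved"])
        # Restoring thread redundantly saves what it loaded elsewhere.
        shared_mem["readback"] = copy(leading_state)
        # Handler compares the redundant copy with the trailing state.
        if shared_mem["readback"] == trailing_state:
            return leading_state, attempt
    raise RuntimeError("save/restore did not produce a matching state")


if __name__ == "__main__":
    trailing = {"pc": 0x400, "regs": (1, 2, 3)}
    state, attempts = recover(trailing, {})
    print(state == trailing, attempts)
```

Comparing the redundant copy against the trailing state means a fault that strikes the saving or restoring step itself causes a retry rather than a silently corrupted restart.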
The cost of exception- and register-copying is low enough to allow acceptable recovery times, for example, from less than 10 ms to 40 ms of network round-trip delay, which is imperceptible to networked clients.
There are faults from which CRTR cannot recover. After the processor writes back a register value and the instruction producing the value has committed, if a fault corrupts the register, the leading and trailing instructions' use of different physical registers lets us detect the fault on the register value's next use. However, CRTR cannot recover from this fault. To avoid this loss of recovery, one solution is to provide ECC on the register file, as suggested for SRTR [5].
Tackling interprocessor bandwidth
The asymmetric commit in CRTR hides interprocessor latency. To tackle interprocessor bandwidth requirements, we pipeline the interprocessor paths and hide the pipeline latency using the asymmetric commit. We split the interprocessor wires shown in Figure 1 into several segments with latches between them. This pipelining boosts bandwidth supply, and we reduce bandwidth demand with two techniques. First, unlike SRTR, which checks speculative values, CRTR, like CRT, communicates and checks only committed values. Second, we extend SRTR's dependence-based checking elision (DBCE). Whereas SRTR uses DBCE to reduce the RVQ bandwidth, we extend DBCE to reduce CRTR's interprocessor register communication bandwidth. Importantly, DBCE reduces only register bandwidth and does not affect communication caused by loads, stores, and branches.
By reasoning that faults propagate through dependencies, DBCE exploits (true) register dependence chains, checking only the last instruction in a chain. Earlier instructions in the chains in both threads completely elide communication and checking, reducing bandwidth pressure. DBCE uses a hardware queue called the dependence chain queue (DCQ) to hold instructions and determine dependencies by matching appropriate register operands. DBCE redundantly builds chains in both threads and checks its own functionality. For example, in Figure 2a, instructions i3 and i4 form a chain, and i4 is checked. If the last instruction check succeeds, it signals the previous instructions in the chain that they can commit; if the check fails, all the instructions in the chain are marked as having failed and the earliest instruction in the chain triggers a rollback. Then a transient-fault exception is raised.
DBCE must consider masking instructions, which produce the same output value for different input values (for example, r4 := r6 & 0x2 in Figure 2a). We conservatively consider all such instructions masking. A masking instruction can mask a fault on its inputs by producing the correct output even if an input is faulty. Such masking violates DBCE's key assumption that dependencies propagate faults. The last instruction in a chain that includes a masking instruction cannot detect the masked fault, and an irrecoverable error ensues if the faulty value is committed and consumed later. DBCE does not allow masking instructions to form chains, with the exception that a masking instruction can start a chain. DBCE allows a masking instruction at the beginning of a chain because DBCE checks the instruction's source operands in previous chains, without allowing any fault masking.
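The masking behavior of r4 := r6 & 0x2 can be seen with concrete values. The short example below is illustrative; the register contents and the positions of the bit flips are arbitrary choices, not values from the article.

```python
def and_with_0x2(r6: int) -> int:
    """Model of the masking instruction r4 := r6 & 0x2."""
    return r6 & 0x2


correct_r6 = 0b1010
fault_in_bit3 = correct_r6 ^ 0b1000  # single-bit flip outside the mask
fault_in_bit1 = correct_r6 ^ 0b0010  # single-bit flip inside the mask

# The flip in bit 3 is masked: the output equals the fault-free output,
# so checking only the end of a chain would miss the corrupted r6.
masked = and_with_0x2(fault_in_bit3) == and_with_0x2(correct_r6)

# The flip in bit 1 propagates to the output and would be caught.
propagated = and_with_0x2(fault_in_bit1) != and_with_0x2(correct_r6)

if __name__ == "__main__":
    print(masked, propagated)
```

The masked case is the dangerous one: the faulty r6 remains in the register file and can corrupt any later instruction that reads it.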
Conservatively restricting masking instructions to the start of chains exposes bandwidth pressure. Many integer and almost all floating-point instructions (because of their finite precision) are masking, so this restriction limits DBCE chain lengths and reduces DBCE effectiveness. To address this problem, we extend DBCE to exploit the death of register values in a scheme we call death- and dependence-based checking elision (DDBCE).
The problem with a masking instruction occurs if one of its source operands is faulty, and a later instruction, other than the masking instruction, also consumes the faulty value. By tracking register death, we identify masking instructions that are the last (in program order) consumers of their source operands-that is, the source operands die after consumption by the masking instruction. Operand death ensures that a masked fault cannot corrupt later computation, allowing masking instructions to join chains without losing the ability to recover. In Figure 2b, masking instruction i3 is the last use of r6 (before r6 dies in i5). We can chain i3 to i1 because any fault in r6 is not visible beyond i3. The resulting chain includes i1, i3, and i4. Because many register values are consumed by only one or two instructions, DDBCE boosts DBCE's bandwidth reduction.
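The difference between the DBCE and DDBCE chaining rules can be sketched with a simplified software model. Everything here is an assumption made for illustration: chains are kept linear, death is recognized only when a later instruction in the window overwrites the register, and the five-instruction sequence is hypothetical (only the masking instruction r4 := r6 & 0x2 and the redefinition of r6 in i5 come from the text). The real DCQ hardware is not modeled.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Instr:
    name: str
    dest: str
    srcs: tuple
    masking: bool = False


def dies_after(window, k, reg):
    """True if reg is overwritten after instruction k before any later read."""
    for later in window[k + 1:]:
        if reg in later.srcs:
            return False
        if later.dest == reg:
            return True
    return False  # no redefinition seen: conservatively treat as live


def build_chains(window, use_death=False):
    """Group instructions into dependence chains; only each tail is checked."""
    chains = []
    last_writer = {}  # register -> (chain, producing instruction)
    for k, ins in enumerate(window):
        # DBCE: masking instructions may only start a chain.
        # DDBCE: they may also join if all their source operands die.
        may_join = (not ins.masking) or (
            use_death and all(dies_after(window, k, r) for r in ins.srcs)
        )
        target = None
        if may_join:
            for reg in ins.srcs:
                if reg in last_writer:
                    chain, producer = last_writer[reg]
                    if chain[-1] is producer:  # extend only at the tail
                        target = chain
                        break
        if target is None:
            target = []
            chains.append(target)
        target.append(ins)
        last_writer[ins.dest] = (target, ins)
    return chains


WINDOW = [
    Instr("i1", "r6", ("r1", "r2")),
    Instr("i2", "r9", ("r3", "r5")),
    Instr("i3", "r4", ("r6",), masking=True),  # r4 := r6 & 0x2
    Instr("i4", "r8", ("r4", "r7")),
    Instr("i5", "r6", ("r1", "r3")),           # r6 dies here
]

if __name__ == "__main__":
    for label, flag in (("DBCE", False), ("DDBCE", True)):
        chains = build_chains(WINDOW, use_death=flag)
        print(label, [[i.name for i in c] for c in chains])
```

In this toy window, the DBCE rule leaves four chain tails to communicate and check, while the DDBCE rule leaves three because i3 can join the chain started by i1.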
Experimental results
We modified the SimpleScalar out-of-order simulator to model CMPs. Each core in the CMP is an out-of-order SMT processor that can issue up to four instructions per cycle. Each core has its own private L1 and L2 caches. Because the StB in real systems typically contains 20 to 30 entries, we assumed a 20-entry StB to model the pressure on the StB discussed earlier. We modeled the system bus as a split-transaction, pipelined bus connecting the private L2 caches to memory. We provide more details in another publication [6].
We obtained results for a two-CPU CMP. We paired two SPEC2000 benchmarks to generate the CMP workload. To isolate CRTR's performance cost, we present results in the absence of faults.
Figure 3 compares CRTR with CRT, showing performance normalized to the base CMP without any fault tolerance. We denote slack as x/y, where x is the number of values queued for sending at the leading thread and y is the number of values waiting for consumption at the trailing thread. CRT's best-performing slack is 16/32. We vary CRTR's slack as 16/8, 32/16, and 64/32. In this experiment, we assumed a 20-cycle (one-way) latency and infinite bandwidth for interprocessor communication.
Figure 3 shows that CRTR with a slack of 32/16 performs at a level close to that of CRT. There are three reasons for this similarity:
• CRT and CRTR incur similar overhead for StB fill-ups and extra instructions.
• Both CRT and CRTR communicate branch and memory values. This communication exposes the 20-cycle interprocessor latency in both CRT and CRTR. CRTR's long slack, enabled by asymmetric commits, absorbs this latency to an extent similar to that absorbed by CRT's long slack.
• Although CRTR additionally communicates register values, it incurs negligible performance degradation compared with CRT because we assume infinite interprocessor bandwidth in this experiment. Because loads and stores are frequent (30 percent to 50 percent of all instructions), the communication latency of register values is hidden under that of load values and store check confirmations.
Finally, we studied the bandwidth requirements of CRT, CRTR, and CRTR using DBCE and DDBCE. We computed each technique's bandwidth requirements by averaging the individual programs' bandwidth requirements. Our simulation results show that CRT communicates about 5.2 bytes per cycle, and CRTR without DBCE almost doubles the bandwidth at 9.8 bytes per cycle. DBCE with a 30-entry DCQ cuts CRTR's bandwidth requirement to 7.8 bytes per cycle, a reduction of about 20 percent. DDBCE, also with a 30-entry DCQ, further reduces the requirement to 7.1 bytes per cycle, a reduction of 9 percent from DBCE. More results on the effect of interprocessor latency and bandwidth appear in another publication [6].
Using SPEC2000 benchmarks and execution-driven simulation, we found that a long slack enables CRTR to perform comparably with CRT even at an interprocessor latency of 30 cycles. This tolerance helps ensure CRTR's effectiveness in future technologies in which global wire delays will pose a serious problem for microprocessor and CMP performance. However, CRTR's true cost is that it needs more bandwidth than CRT. DDBCE and DBCE reduce this bandwidth demand. Because interprocessor bandwidth is a key resource in present and future CMPs, the traffic reductions achieved by DBCE and DDBCE are important. As technology scaling continues, transient-fault tolerance techniques like CRT and CRTR will be essential.
Mohamed A. Gomaa is pursuing a PhD in computer engineering at Purdue University. His research interests include fault tolerance and computer system reliability, particularly architectural methods to incorporate transient-fault detection and recovery techniques in modern processors. Gomaa has a BS and an MS, both in computer engineering, from Cairo University, Egypt. He is a student member of the IEEE and the ACM.
Chad Scarbrough is a graduate student in computer engineering at Purdue University. His research interests include fault injection in a fault-tolerant SMT processor and transient-fault detection and recovery in chip multiprocessors. Scarbrough has a BS in electrical and computer engineering from The Ohio State University, Columbus. He is a student member of the ACM.
T. N. Vijaykumar
