Fault tolerance techniques are used to allow computer systems to continue correct operation despite component failure. Hardware-supported concurrent error detection and limited fault tolerance in system components, as implemented by coding or replication, are often required. Detection latency can be reduced by increasing the visibility of internal module state using compressed "signatures" of internal values. Thus, encoders, decoders, comparators, and data compression circuitry are of critical importance in fault-tolerant VLSI systems. In this paper we describe alternative implementations of such circuits and various ways in which they can be connected in VLSI modules. We also describe possible performance enhancements through the use of a technique, called micro rollback, which allows error detection to be performed in parallel with inter-module communication. As a concrete example, we present area and performance measurements of alternative microarchitectures and circuits that can be used to add detection and correction to a VLSI RISC processor we are implementing.
I. Introduction
Using fault tolerance techniques, the reliability of computer systems can be increased beyond the reliability of the underlying hardware components. Errors generated by faulty components are detected and recovery procedures correct the errors. High-speed error-detection circuitry is needed to detect errors as soon as they occur and prevent the spread of erroneous data throughout the system. In many applications the performance penalty of system-wide recovery cannot be tolerated, so it is desirable for modules to include mechanisms for rapid correction of most internal errors (i.e., local recovery).
Possible errors in VLSI chips include: corruption of the contents of storage elements, incorrect results produced by computation modules (e.g., an ALU), and corruption of data and control signals (e.g., buses). These errors are the result of transient or permanent faults due to design and fabrication flaws (e.g., marginal timing, incorrect dosage of ion implants), environmental factors (e.g., noise, radiation), and wear-out mechanisms (e.g., electromigration) [3].
Many of the techniques used to detect and correct errors caused by hardware faults rely on a few basic components: encoders, decoders, comparators, and data compression circuitry. In a VLSI processor, coding can provide error detection and correction of data in the register file and other storage (e.g., PSW, caches, TLB). For example, single-bit parity detects odd errors in registers, while Error Correcting Codes (ECC), such as Hamming code, provide error correction capabilities [6]. Check bits must be computed every time storage is modified, and verified whenever storage is accessed. In many modern processors a modification or access of the register file can occur every cycle, thus requiring low latency and high throughput for the circuits generating and verifying check bits. To achieve higher coverage and to detect errors in other modules (not just storage), duplication and comparison can be used [8] at either the module or chip level. To minimize detection latency, the values of internal nodes of modules should be compared each cycle. This can be accomplished without adding numerous extra pins, by "compressing" the values of the nodes into signatures which are then compared [5]. Since the comparison is done every cycle, compression and comparison must also be performed with low latency and high throughput.
† This work is supported by Hughes Aircraft Company and the State of California MICRO program.
The modules needed to implement the error detection and correction techniques described above, namely encoders, decoders, comparators, and data compression circuits, are often based on Exclusive OR (XOR) gates. Alternative implementations of multi-input XOR gates are presented in Section 2. The different implementations are evaluated with respect to performance, area, and noise margins. The evaluation is performed in the context of the microarchitecture of a VLSI RISC processor where such modules might be used for error detection and correction.
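As a behavioural illustration of the multi-input XOR at the heart of these circuits, the following minimal Python sketch computes the parity of a 32-bit word in two ways: as a serial chain (one cell per bit) and as a balanced tree of two-input XORs. The function names are hypothetical and the code models only the logic function, not the delay, area, or noise margins of any circuit discussed in this paper.

```python
def bits_of(word, width=32):
    """Return the bits of `word` as a list, least significant bit first."""
    return [(word >> i) & 1 for i in range(width)]


def parity_chain(bits):
    """Serial chain: each cell XORs its data bit into the running parity."""
    p = 0
    for d in bits:
        p ^= d
    return p


def parity_tree(bits):
    """Balanced tree of 2-input XORs; depth grows as log2 of the input count."""
    level = list(bits)
    while len(level) > 1:
        if len(level) % 2:
            level.append(0)  # pad odd-sized levels; XOR with 0 changes nothing
        level = [level[i] ^ level[i + 1] for i in range(0, len(level), 2)]
    return level[0]


def parity(word, width=32):
    """Single check bit for a `width`-bit word (even parity)."""
    return parity_chain(bits_of(word, width))
```

Storing `parity(word)` alongside the word and recomputing it on every access exposes any odd number of flipped bits, while an even number of flips leaves the check bit unchanged and goes unnoticed.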
In Section 3 we describe and evaluate circuits for implementing Error Correcting Codes (ECC) based on Hamming Code, in which code generation and error correction require multiple parity circuits. Through proper choice of high-speed parity circuits and the specific code to be used (M-code [6]), fast correction and check bit generation are achieved.
In Section 4 we discuss the comparators and data compression circuitry needed for implementing duplication and comparison. When the two modules whose outputs are being compared are on different chips, compressing the data and sending it off chip for comparison may introduce significant delays in system operation. This potential performance penalty can be greatly reduced using a technique, called micro rollback [9], that allows detection to be performed in parallel with normal system operation. We show that micro rollback can be used to support local recovery without the need for ECC circuitry.

N-Chain. The simplest implementation of a switching cell uses four N-transistors (Figure 6). Four such cells connected serially produce a result (the parity of four bits) with a delay of 1Jns. The delay of this circuit grows quadratically with the number of cells [7]. Hence, to compute the parity of a 32-bit word, buffers are inserted in the chain after every four cells to obtain a total delay of l h. The basic switching cell has a stride of 27λ. The pitch of the switching cell is 32λ in order to leave 28λ for the buffer for each four cells. Since an inverter is needed to provide Di and its complement for each cell, the stride of the circuit is 48λ (18λ for the inverters, 3λ for interconnections). Our simulations indicate that if the chain is implemented in an N-well process, the logic 1 levels at the internal nodes of the chain are degraded from Vdd to 3.6 volts because a logic 1 is passed through N-transistors. The following three methods improve the noise margins of the internal nodes at the expense of a small increase in area (stride).

Bootstrapping requires a "bootstrap" capacitance that is several times larger than the capacitance of the nodes that are to be charged to the high voltage. For this circuit, the precharge signal is applied to the gates of 64 transistors, so the boot capacitor must be more than 100 times larger than the minimum size transistor.
The noise margins are restored to normal levels and the delay for computing a 32-bit parity is 11ns (with buffers in the chain). The stride of the circuit increases by 29% for larger basic cells and for routing a precharge signal to all the cells (the size of the bootstrap circuitry is not included). On the other hand, the noise margins are normal and the bootstrap circuit is eliminated. The delay to compute the parity through 32 switching cells and 8 buffers remains 11ns.
Dual-chain. The switching cell can be implemented using full transmission gates (Figure 9). The noise margins are then maintained at proper levels, but the stride of the circuit is 35% larger than the P-precharged chain. The speed is degraded due to the added capacitance at each node.
Figure 9: Dual-Chain XOR Cell

Using Sense Amplifiers. Davis [1] proposed an implementation of a multiple-input XOR gate using a 2-level tree of 8-input XOR gates (Figure 10). Sense amplifiers speed up the calculation of intermediate results (Figure 11), providing fast computation at the expense of area and poor noise margins. The sense amplifier cell contains significantly more logic than the switching cell, and has to be made "narrow" to match the pitch of the data bus, resulting in a substantial increase of the stride. Including the routing necessary for pitch matching, two levels of logic, and an inverter to provide Di and its complement, we obtain a stride of 140λ. The sense amplifier design can be used serially in order to avoid the two-level structure, but the delay becomes proportional to the size of the chain, making the delay to go through a 32-cell chain 8ns.
The characteristics of the different XOR gate implementations are shown in Table 1. The circuits are designed to match the pitch of the bus of our processor. The table includes the stride of each basic cell and of the complete circuits, the circuit delay, and the noise margins.
III. Error Correction Circuitry
The XOR circuits described in the previous section can be used for error detection with a single parity bit. A simple method for correcting errors locally, without resorting to system-wide recovery, is to use error correction codes (ECC) [6]. In this section we discuss error correcting codes based on Hamming Code, which are commonly used to detect and correct errors in storage elements [6]. The check bits of these codes are generated and verified using multi-input XOR gates. Each check bit is generated by XORing a different subset of the data bits. When storage is accessed, the same subsets and their corresponding check bits are XORed to produce a syndrome which is used to correct some errors and flag others (e.g., multiple bit errors) as uncorrectable [a] (Figure 12).
Since seven rows of XOR gates are needed when using M-code, minimizing the stride of each multi-input XOR gate becomes more critical than when there is only one row for single-bit parity. Since the inputs to each XOR gate are not necessarily adjacent, a static implementation (Figure 3) is not appropriate because it requires excessive routing. If area is the main concern, N-chain XOR gates, without precharging, can be used. For a wide pitch (58λ), the stride of each N-chain can be reduced to 19λ and the check bits can be computed in 3.8ns.
XOR gates based on P-precharged chains can be used if normal noise margins are needed. To maximize speed, XOR gates based on N-chains with sense amplifiers can achieve a delay of 2.5ns (two-level tree, two sets of seven bits in the first level).
When a data word is accessed, a seven-bit syndrome is generated by XORing all the bits in each row, including the check bits. To differentiate between single-bit and double-bit errors, an XOR of the syndrome is generated [6]. If the result is one, there is a single-bit error and correction can be performed. If the result is zero, and the syndrome bits are not all zero, a double-bit error is signaled.
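The single-bit versus double-bit test described above works for any code whose parity-check columns all have odd weight: one flipped bit yields an odd-weight syndrome, while two flipped bits yield a nonzero even-weight syndrome. The Python sketch below illustrates this on a small 8-bit data word with 5 check bits. The column assignment, the word size, and all names are assumptions made for brevity; this is not the M-code matrix used in our design, which has seven check bits for a 32-bit word.

```python
from itertools import combinations

R = 5  # number of check bits (assumed small code, for illustration only)
K = 8  # number of data bits

# One R-bit column mask per code bit. Data columns are distinct weight-3
# masks and check columns are weight-1 masks, so every column has odd weight.
DATA_COLS = [sum(1 << i for i in c) for c in combinations(range(R), 3)][:K]
CHECK_COLS = [1 << i for i in range(R)]
COLS = DATA_COLS + CHECK_COLS  # bit j of a codeword uses COLS[j]


def syndrome(codeword):
    """XOR together the columns of all bits that are set in `codeword`."""
    s = 0
    for j, col in enumerate(COLS):
        if (codeword >> j) & 1:
            s ^= col
    return s


def encode(data):
    """Append check bits so that the syndrome of the result is zero."""
    return data | (syndrome(data) << K)


def decode(codeword):
    """Return (status, data); data is None when no correction is possible."""
    s = syndrome(codeword)
    mask = (1 << K) - 1
    if s == 0:
        return "ok", codeword & mask
    if bin(s).count("1") % 2 == 1:      # XOR of the syndrome bits is one
        if s in COLS:                   # syndrome names the faulty bit
            codeword ^= 1 << COLS.index(s)
            return "corrected", codeword & mask
        return "uncorrectable", None    # odd weight but no matching column
    return "double", None               # nonzero syndrome with even weight
```

The `COLS.index(s)` lookup plays the role of the syndrome decoder, and the final XOR with a one-hot mask corresponds to the controlled inverter that flips the faulty bit.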
For single-bit error correction, the erroneous bit is identified by the syndrome. Using a decoder and a controlled inverter (XOR gate), the faulty bit can be flipped (see Figure 12).

Table 3

The datapath in many microprocessors is based on two parallel buses whose data bits are interleaved. Two simultaneous reads from the register file are usually supported. In order to simultaneously detect errors on both buses, the stride of the ECC circuitry has to be significantly increased.
Specifically, the detection subcircuit (seven multi-input XORs and an XOR for the syndrome) must be doubled. If, once an error is detected, multiple phases are available for correction, one syndrome decoder can be shared by both buses and correction can be done sequentially. The correction circuitry must also be doubled. In our design, with a bus pitch of 39λ and XOR gates implemented as P-precharged chains, the stride of the detection circuitry is 328λ, the decoder is 147λ, and the correction is 61λ. For two buses, sharing the decoder can decrease the total stride from 1070λ to 933λ.
For data in the register file, the ECC circuitry can be connected between the register file and the ALU. If it is connected "serially", the processor cycle time must be stretched to accommodate the added delay. Much of this performance penalty can be avoided by connecting the ECC circuitry in parallel with the datapath. The data from the register file is sent to the ALU without waiting for the result of ECC. The ECC circuitry must be fast so that if an error is detected, it is possible to abort the operation before there is permanent damage to the processor state.
IV. Duplication and Comparison
Using duplication and comparison it is possible to achieve high-coverage error detection for all types of modules [8, 2]. Two identical modules process the same information in parallel and some of their output pins are compared every cycle (Figure 13). In this section we describe the circuits needed for duplication and comparison. These include comparators as well as data compression circuitry necessary to reduce the pin requirements when the two modules whose outputs are compared are on different chips. We show how error correction can be performed in a system based on duplication and comparison by transferring state from the fault-free module to the faulty module.
A. Compression
It is often impossible or undesirable to duplicate large VLSI modules, such as processors, on the same chip. In the context of this discussion, a simple and effective data compression technique is to use several parity bits computed across the data "word" to be compressed. For example, for a 32-bit word, we can compute a 4-bit "signature" by constructing four interleaved parity chains, each consisting of eight bits from the word. Each chain includes every fourth bit in the word. The implementation of this interleaved parity scheme uses the circuits already described in Section 2. Given two 32-bit words, one correct and one erroneous, a large percentage of errors can be detected by comparing the 4-bit signatures of the two words: all single-bit errors, any odd number of bit errors, and many multiple-bit adjacent errors. For random multi-bit errors, 93.75% of the errors will be detected.
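The interleaved parity signature can be sketched behaviourally as follows. The function names are hypothetical, and the code models only the logic, not the precharged-chain circuit. Because the signature is linear over XOR, an error pattern escapes detection exactly when its own signature is zero; for a uniformly random error pattern each of the four chain parities is even with probability 1/2, so the miss rate is 1/16, which is consistent with the 93.75% figure quoted above.

```python
def signature(word, width=32, chains=4):
    """Chain k XORs together bits k, k+chains, k+2*chains, ... of `word`."""
    sig = 0
    for i in range(width):
        if (word >> i) & 1:
            sig ^= 1 << (i % chains)
    return sig


def detects(good, bad, width=32, chains=4):
    """True when comparing the two signatures exposes the difference."""
    return signature(good, width, chains) != signature(bad, width, chains)


def missed_patterns(width, chains):
    """Count error patterns (including 'no error') whose signature is zero."""
    return sum(1 for e in range(1 << width) if signature(e, width, chains) == 0)
```

Running `missed_patterns(16, 4)` enumerates a scaled-down 16-bit word exhaustively and shows the same 1/16 ratio; the 32-bit case is too large to enumerate but follows from the same argument.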
A 4-bit signature of a 32-bit word can be generated in 3ns using four 8-input P-precharged chains (Figure 14). The four chains can be compressed into one chain where every fourth cell is connected through metal lines. For a pitch of 39λ we obtained a stride of 133λ, which includes an inverter, a precharge line, and internal routing.
B. Comparison
A comparator can be implemented using the design shown in Figure 15. We designed the layout of a 32-input comparator to match the pitch of the datapath and its stride is 44λ. The output of the comparison is computed in 5.8ns.
C. Micro Rollback and Error Correction
The necessity of off-chip transmission of data for comparison increases the error detection latency beyond the point where the processor can be interrupted and its last instruction restarted if an error is detected. Several clock phases are necessary for compressing the data, sending the result off-chip, latching the data in the comparator, comparing the inputs, sending the outcome back to the processor, and latching in the result. We have previously introduced a technique, called micro rollback, which allows VLSI modules to roll back their state to its value several clock cycles earlier [9]. With this technique, it is possible to begin processing information several cycles before its validity is verified, since if it turns out to be erroneous, its effects can be undone. In a system that supports micro rollback, duplication and comparison across chip boundaries can be supported with minimal performance penalty.
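The effect of micro rollback on a single storage element can be illustrated with a small behavioural model. The sketch below is only an assumption about the externally visible behaviour (a register that can return to the value it held a bounded number of cycles earlier); the class and method names are hypothetical, and the actual hardware mechanism is the one described in [9], not this code.

```python
from collections import deque


class RollbackRegister:
    """A register that remembers its value for the last `depth` cycles."""

    def __init__(self, depth, initial=0):
        self.value = initial
        # Values held at the start of each of the most recent cycles.
        self.history = deque(maxlen=depth)

    def clock(self, new_value=None):
        """Advance one cycle, optionally writing a new value."""
        self.history.append(self.value)
        if new_value is not None:
            self.value = new_value

    def rollback(self, cycles):
        """Restore the value the register held `cycles` cycles ago."""
        if not 1 <= cycles <= len(self.history):
            raise ValueError("cannot roll back that far")
        for _ in range(cycles):
            self.value = self.history.pop()
```

With a depth at least as large as the detection latency in cycles, a module can keep computing on unverified data and undo the last few writes if the comparator later reports a mismatch.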
In a system based on duplication and comparison with micro rollback, local recovery from most errors can be supported without conventional ECC circuitry. When the outputs of the two modules differ, the modules are rolled back several cycles. If the error was caused by a transient fault on, for example, the buses or ALU, it is unlikely to recur when the last few cycles are "repeated." If the error was caused by a transient fault in storage (e.g., register file), recovery requires copying the valid state from the fault-free module to the faulty module. In order to identify which module is faulty, each module must be capable of local detection of errors in storage. This requires parity encoders and decoders on each module. However, as discussed in Sections 2 and 3, the area for single-bit parity generation and verification is less than 10% of the area for ECC circuitry. Furthermore, this technique allows recovery from many errors that cannot be handled by conventional SEC-DED codes.
V. Summary
Most fault-tolerant systems require that key components, such as VLSI processors, include significant local error detection and correction capabilities. The circuits that provide these capabilities are typically encoders, decoders, comparators, and data compressors. These circuits must provide low-latency, high-throughput operation in order to be able to perform checks every cycle and prevent erroneous information from propagating throughout the system. Multi-input XOR gates are critical building blocks for many of these circuits.
We have described several implementations of XOR gates: a tree of static XOR gates, a compact N-chain, two precharged chains and a dual-chain that provide normal noise margins, and a fast implementation based on sense amplifiers. The discussion included major tradeoffs (speed, area, noise margins, pitch matching) in the implementation of circuits for generating parity, for computing ECC check bits, for correcting errors, and for compressing and comparing data.
We discussed a few of the interrelationships between the microarchitecture of the processor and the appropriate choice for detection and correction circuitry. We described how duplication and comparison across chip boundaries can be used for high-coverage error detection and correction with minimal performance penalty. The key features of this scheme are "compression" and comparison of the values on internal nodes as well as on buses, and the use of a technique, called micro rollback, which allows error detection and correction to be performed in parallel with inter-module communication.
