An ever-increasing demand for affordable on-chip fault-tolerance, the inherent unreliability attendant upon very large scale integration (VLSI), and the overwhelming complexity of fault-tolerance have elevated the automatic design of fault-tolerant VLSI systems into a research problem of immediate practical relevance. In this paper, we outline (i) a flexible methodology for compiling an algorithmic description into an equivalent fault-tolerant VLSI IC subject to an application-specific policy for fault-tolerance and (ii) a framework that embodies this methodology. The framework subsumes algorithms for synthesizing self-recovering, fault-secure, and reliable VLSI ICs from high-level algorithmic descriptions.
Introduction
The rapidly emerging trend towards very large scale integrated circuit implementation of crucial tasks in life-critical, mission-critical, and safety-critical applications (such as automobile/process control systems and medical instrumentation) is stimulating the need for on-chip fault-tolerance.
Fault-tolerance refers to a collection of policies including diagnosis, detection, recovery, and masking. The specific policy (or policies) of interest to an application depend(s) upon a multitude of factors: the target environment, the economic constraints, and the specific requirements of the application itself, to name a few. For instance, in automotive electronic subsystems, on-board diagnosis for timely detection of malfunctioning components is of utmost importance. Equipping all service sites with sophisticated diagnostic equipment is a highly expensive proposition. Moreover, such off-line diagnostics may not always be able to trap the actual fault; things that may look proper in a test environment may not be normal under the dynamic conditions actually experienced [1]. On the other hand, emerging satellite-based mobile communications place more emphasis on reliability (tolerance to wear-out induced by phenomena intrinsic to the IC) because it is essential that electronic subsystems aboard such unmanned satellites remain operational throughout the lifetime of the mission. Many other industrial environments place higher emphasis on graceful recovery from faults external to the chip.
Until recently, fault-tolerance techniques such as concurrent error detection for on-board diagnostic systems, as well as checkpointing in self-recovering systems, have been mainly embedded into software [3, 20, 22, 12, 1]. For improving system reliability, most designs resorted to expensive, system-level replication [18]. Whereas the time overhead associated with fault-tolerance in software is unacceptable in time-critical applications, the space overhead of straightforward replication-based fault-tolerance is overly expensive for consumer-oriented applications. Recent advances in VLSI technology are making it feasible to design area-efficient, hardware-oriented approaches to fault-tolerance; general purpose processors with checkpointing capability [19], application-specific ICs for comprehensive concurrent error detection [23], and ICs with built-in reliability [7] are beginning to emerge. However, all of these hardware-oriented approaches entail significant effort on the part of the designer; the onus of either partitioning a system and inserting checkpoints, or meticulously handcrafting the fault-detection mechanism, or improving reliability by injecting selected doses of redundancy, rests entirely on the designer.
Related Research
The earliest implementations of on-chip concurrent error detection and fault-tolerant general purpose processors have been mostly paper designs [4, 6]. Actual implementations have been
In contrast, in this paper we outline computer aided design techniques for rapidly designing area-efficient fault-tolerant VLSI systems. Related research in this area includes automatic incorporation of built-in self repair (BISR) [5] and checkpoint insertion [17]. Towards this end we first outline the computer aided design methodology for fault-tolerant VLSICs in section 2. This is followed by a detailed description of self-recovering and reliable VLSIC synthesis in sections 3 and 4. In section 5, the fault-tolerance overhead of these fault-tolerant designs is evaluated vis-a-vis a non-redundant implementation.
CAD of Fault-Tolerant VLSI Systems
In a top-down VLSI design methodology an abstract design specification is successively refined into a concrete implementation. Microarchitectural (or high level) synthesis and physical design are the two important steps in such a VLSI design methodology.
Microarchitectural synthesis transforms an algorithmic description of a system into a microarchitecture (using register transfer level modules such as adders, multipliers, multiplexers, registers and buses) subject to designer-specified constraints such as time and resource. Initially, the algorithmic description (which is written in a hardware description language) is translated into an intermediate representation called the data flow graph (DFG). Then each operation in the DFG is scheduled into a clock cycle. Subsequently, the operators are bound to the available functional units, variables are allocated to registers, and data transfers are assigned to buses. This phase is called data path synthesis. Next, by modeling each clock cycle as a state and transitions between clock cycles as directed arcs between states, a state diagram is synthesized. This state diagram is then automatically translated into a controller that generates the control signals that ensure the desired system operation.
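The scheduling and binding steps can be illustrated with a minimal sketch. It assumes a simple resource-constrained list scheduler in which every operation takes one clock cycle; this is not the scheduler used in this work, and the names `list_schedule`, `ops`, `deps` and `units` are hypothetical.

```python
from collections import defaultdict

def list_schedule(ops, deps, units):
    """Schedule and bind DFG operations under a resource constraint.

    ops:   {operation name: operation type}, e.g. {"m1": "mul"}
    deps:  {operation name: [names of operations it consumes results from]}
    units: {operation type: number of functional units available}
    Returns {operation name: (clock cycle, index of the bound unit)}.
    Assumption: every operation takes exactly one clock cycle.
    """
    schedule = {}
    remaining = dict(ops)
    cycle = 0
    while remaining:
        cycle += 1
        used = defaultdict(int)  # functional units already bound in this cycle
        # An operation is ready once all its predecessors ran in an earlier cycle.
        ready = [n for n in remaining
                 if all(p in schedule and schedule[p][0] < cycle
                        for p in deps.get(n, []))]
        progressed = False
        for n in sorted(ready):
            op_type = remaining[n]
            if used[op_type] < units.get(op_type, 0):
                schedule[n] = (cycle, used[op_type])  # bind to the next free unit
                used[op_type] += 1
                del remaining[n]
                progressed = True
        if not progressed:
            raise ValueError("no operation can be scheduled: check deps and units")
    return schedule

# Hypothetical DFG for z = ((a * b) + (c * d)) + e
ops = {"m1": "mul", "m2": "mul", "a1": "add", "a2": "add"}
deps = {"a1": ["m1", "m2"], "a2": ["a1"]}
print(list_schedule(ops, deps, {"mul": 1, "add": 1}))
print(list_schedule(ops, deps, {"mul": 2, "add": 1}))
```

With a single multiplier the two multiplications are serialized, so the schedule needs one more clock cycle than with two multipliers; this is the time/resource trade-off the designer constraints control.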
In the following sections, we describe the automatic synthesis of fault-tolerant VLSI ICs from algorithmic level descriptions. Firstly, we have incorporated the fault-tolerance mechanisms (checkpointing and rollback based on-chip recovery [8], and modular redundancy based reliability enhancement [15]) into the microarchitecture synthesis phase. We follow this up by transforming the resulting fault-tolerant microarchitectures into physical layouts so as to demonstrate the feasibility of on-chip fault-tolerance. These physical designs, in addition to validating the design methodology and the CAD algorithms, also helped evaluate the overhead of fault tolerance at the layout level. We digress for a moment and answer the lingering question: why ingrain the two fault tolerance constraints into microarchitectural synthesis?
The microarchitectural level is indeed the right abstraction at which to optimize the significant area overhead associated with these mechanisms for the following reasons: (i) The control information essential for implementing checkpointing and rollback (clock cycle boundaries at which to perform checkpointing and intermediate results that have to be voted upon) can be efficiently garnered at the algorithmic level. (ii) The reliability enhancement mechanism is based on redundancy injection into the register transfer level modules. (iii) Finally, the choice of a good algorithmic level design representation together with efficient manipulation of such a representation can further optimize the area overhead attendant upon these fault-tolerance mechanisms.
Our Assumptions
Our assumptions regarding the input formats and the models and metrics for performance and hardware at the microarchitectural and the physical design level are as follows: point arithmetic) was selected as the vehicle for our experiments. In the following two sections we describe the synthesis of self-recovering and reliable fault-tolerant VLSI systems starting from an algorithmic description. The systematic incorporation of the unique constraints attendant upon self-recovery and reliability into the microarchitecture synthesis is emphasized.
Synthesis of Self-Recovering VLSICs
A microarchitecture that can detect and recover from a (transient) fault is self-recovering.
The model used for synthesizing a self-recovering microarchitecture is clarified using the control data flow graph (CDFG) shown in figure 1. In the figure, nodes are numbered while the edges are annotated using the alphabet. In order to detect an erroneous computation due to a faulty functional unit, the computation is duplicated (shown as the solid and the dotted CDFGs).
The two copies of the computation are then executed on disjoint hardware and the result is checked using a voting circuit; both copies of the computation take the same input and a fault is detected if their outputs disagree. There are three main sources of area overhead associated with checkpointing. Going back to the example, one can see that there is an area overhead due to duplication. Implementation of the dotted CDFG necessitates two additional functional units and two additional registers.
As described previously, there is also a checkpoint register overhead. Finally, there is an area overhead due to the voting circuitry that is used to detect an erroneous computation. Assuming that the voting is carried out at the time of checkpointing, two voters are required to detect an error in the computed values c and d. These voters are the single points of failure in an otherwise fault-tolerant design and hence are meticulously handcrafted for fault-tolerance.
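The interplay of duplication, voting and rollback can be sketched in software. This is a behavioural illustration only, under the assumption that the computation is split into segments ending at checkpoints and that each hardware copy is modelled as a callable; `run_self_recovering`, `healthy` and `TransientFault` are hypothetical names, not part of the synthesis system.

```python
def run_self_recovering(segments, initial_state, unit_a, unit_b, max_retries=3):
    """Execute segments on two disjoint 'hardware copies' and compare at checkpoints.

    segments: list of functions state -> state (work between two checkpoints)
    unit_a, unit_b: callables (segment, state) -> state modelling the two copies
    Returns (final state, number of rollbacks performed).
    """
    checkpoint = initial_state  # contents of the checkpoint registers
    rollbacks = 0
    for segment in segments:
        for _ in range(max_retries + 1):
            out_a = unit_a(segment, checkpoint)
            out_b = unit_b(segment, checkpoint)
            if out_a == out_b:       # voter: the two copies agree
                checkpoint = out_a   # commit the new checkpoint
                break
            rollbacks += 1           # disagreement: retry from the last checkpoint
        else:
            raise RuntimeError("fault persisted beyond the retry budget")
    return checkpoint, rollbacks

def healthy(segment, state):
    """A fault-free copy simply evaluates the segment."""
    return segment(state)

class TransientFault:
    """A copy that flips the least significant result bit on one chosen call."""
    def __init__(self, fail_on_call):
        self.calls = 0
        self.fail_on_call = fail_on_call

    def __call__(self, segment, state):
        self.calls += 1
        result = segment(state)
        return result ^ 1 if self.calls == self.fail_on_call else result

# Hypothetical three-segment computation on an integer state
segments = [lambda x: x + 3, lambda x: x * 2, lambda x: x - 1]
print(run_self_recovering(segments, 5, healthy, TransientFault(fail_on_call=2)))
```

Because both copies restart from the same checkpointed state, a transient fault costs only the re-execution of one segment, which is why the retry period bounds the recovery time.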
Self-Recovering Data Path Synthesis
Towards optimizing the area overhead attendant upon checkpointing we have developed a self-recovering microarchitecture synthesis system [16, 14]. A novel self-recovery scheduler [16] has been developed to optimize the register and voter overhead. Furthermore, the scheduler inserts checkpoints into the computation. In contrast to traditional applications, the data transfers are as important as the operations in a computation. This is because these data transfers represent (i) the amount of state information to be stored between checkpoints and (ii) the voters required to detect erroneous computations. In this manner, elimination of bad checkpoints is intertwined with constraining of the low entropy edges to non-checkpoint clock cycle regions. Finally, after all checkpoints are uniquely determined, high entropy edges are forced to straddle the checkpoints. This ensures that only good edges are assigned to checkpoints and bad edges are prevented from crossing a checkpoint.
In turn, this minimizes the register and voting overhead of checkpointing. The various steps of the algorithm are summarized in figure 3.
Self-Recovering 16-point FIR Filter
Towards validating the self-recovery synthesis system described earlier, we have designed a self-recovering symmetric 16-point FIR filter chip. Initially, an algorithmic representation of the symmetric 16-point FIR filter shown in figure 4(a) is mapped into a self-recovering schedule subject to a recovery time constraint (specified as a retry period of three clock cycles) and a performance constraint of nine clock cycles. For simplicity, we assumed that back-to-back chaining of operations in a clock cycle is not permissible and that detection is based on straightforward duplication. Whereas the schedule of the original computation entails two adders and two multipliers, straightforward duplication requires twice as many pieces
The net list of the self-recovering microarchitecture is created in the bdnet [2] format and then transformed into an equivalent physical design using the macrocell place-and-route tools.
The area related characteristics of the self-recovering chip obtained using chipstat [2] are summarized in table 2. Interestingly, the chip is dominated by interconnect (which constitutes about 85% of the total area).
Since voting is performed in a separate clock cycle, the total number of clock cycles is the sum of the number of functional clock cycles ($C_{num}$) and the number of checkpoint clock cycles ($C_{check}$). The throughput characteristics of the design are obtained as the total number of clock cycles ($C_{num} + C_{check}$) times the duration of a clock cycle ($C_{dur}$). In turn, $C_{dur}$ of the self-recovering design is estimated as $Delay_{mult} + Delay_{reg} + Delay_{mux}$. Consequently, one pass through the filter takes $12 \times 131.7 = 1580.4$ ns.
Synthesis of Reliable VLSICs
Section 3 has focused on incorporating mechanisms for tolerating transient faults that arise mainly from phenomena extrinsic to the IC (for example, electromagnetic radiation and vibrations [18]). In this section we devise synthesis techniques for reliability that are tailored towards transient and permanent faults intrinsic to the IC (for example, the hot electron effect and electromigration). We show how reliability constraints are ingrained into the module set selection phase of microarchitecture synthesis [15] and follow it up by deriving a reliable implementation of the 16-point FIR filter. Specifically, we outline a technique for deriving reliable module sets subject to input constraints on chip area and system throughput. A reliable module-set comprises a set of modules, each having a possibly different redundancy structure. A triple modular redundant (3MR) structure has three modules of the same type and a majority voter. Such a 3MR structure can mask the effects of at most one faulty module.
{two 3MR adders, one 3MR multiplier} is an example of a fault-tolerant module-set.
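The masking behaviour of a 3MR structure can be illustrated with a short behavioural sketch. It assumes a bitwise 2-of-3 majority voter and models a faulty module by inverting all of its output bits; `majority_vote` and `tmr_add` are hypothetical names introduced only for this illustration.

```python
def majority_vote(a, b, c):
    """Bitwise 2-of-3 majority of three equal-width words."""
    return (a & b) | (b & c) | (a & c)

def tmr_add(x, y, faulty_copy=None, width=16):
    """Model a 3MR adder: three adder copies feeding a majority voter.

    faulty_copy: index (0, 1 or 2) of a copy whose output is corrupted,
                 or None for the fault-free case. Inverting every output bit
                 is an illustrative fault model, not one taken from the paper.
    """
    mask = (1 << width) - 1
    outputs = [(x + y) & mask for _ in range(3)]
    if faulty_copy is not None:
        outputs[faulty_copy] ^= mask
    return majority_vote(*outputs)

# A single faulty copy is masked regardless of which copy fails.
for copy in (None, 0, 1, 2):
    print(copy, tmr_add(1200, 34, faulty_copy=copy))
```

If two copies fail in the same way the voter follows the faulty majority, which is why a 3MR structure masks at most one faulty module.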
Derivation of Module Reliabilities
In order to synthesize reliable VLSICs, it is first necessary to characterize the individual module reliabilities. Towards this end we derived the failure rates using the empirical relationship of MIL-HDBK-217E [18] given in equation 1:
$$\lambda = \pi_l \, \pi_q \left( C_1 \, \pi_t \, \pi_v + C_2 \, \pi_e \right) \qquad (1)$$
where $\pi_l$ (range: 1-10) is the learning factor reflecting the maturity of the fabrication process, $\pi_q$ (range: 0.25-20) is the quality factor depending on the burn-in procedures applied, $\pi_t$ (range: 0.1-1000) is the temperature factor based on the ambient operating temperature, $\pi_v$ (range: 1-10) is the voltage derating factor for CMOS devices, $\pi_e$ is the application environment factor, and $C_1$ ($0.00085\sqrt{\text{gates}}$) and $C_2$ are complexity failure rates dependent on the number of gates and the number of pins [9]. In addition to choosing optimistic values for each of these factors (i.e. $\pi_l = 1$, $\pi_q = 0.25$, $\pi_t = 1$, $\pi_v = 1$), we set $C_2 = 0$, as these modules are standard cells and not packaged ICs. Finally, we assumed that the system has a mission time of $10^6$ hours.
The module reliabilities are summarized in table 3.
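As an illustration of how such module reliabilities might be obtained, the sketch below evaluates the handbook-style failure rate and converts it to a reliability over the mission time. Several points are assumptions rather than statements from the text: the model form (learning and quality factors multiplying the sum of a gate-dependent and a pin-dependent term), the unit of failures per $10^6$ hours, the exponential failure law $R = e^{-\lambda t}$, and the gate counts, which are made up.

```python
import math

def failure_rate(gates, pi_l=1.0, pi_q=0.25, pi_t=1.0, pi_v=1.0, pi_e=1.0, c2=0.0):
    """Failure rate in failures per 10^6 hours (assumed handbook-style model).

    Defaults follow the optimistic factor values quoted in the text;
    c2 = 0 because the modules are standard cells, not packaged ICs.
    """
    c1 = 0.00085 * math.sqrt(gates)  # gate-count dependent complexity term
    return pi_l * pi_q * (c1 * pi_t * pi_v + c2 * pi_e)

def reliability(rate_per_1e6_hours, mission_hours=1e6):
    """Reliability over the mission, assuming an exponential failure law."""
    return math.exp(-rate_per_1e6_hours * mission_hours / 1e6)

# Hypothetical gate counts, chosen only to show the trend with module size
for name, gates in (("adder", 200), ("multiplier", 2000)):
    lam = failure_rate(gates)
    print(name, round(lam, 5), round(reliability(lam), 5))
```

Under these assumptions a larger module has a higher failure rate and hence a lower reliability, so it is the more attractive candidate for redundancy only if its area cost is justified.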
Reliable Data Path Synthesis
The reliability of a microarchitecture can be approximated as the product of the reliabilities of the components in its module-set. Let $R_{sys}$ be the desired system reliability and let $R_i^1, R_i^2, \ldots, R_i^{n_i}$ be the reliabilities of the $n_i$ modules of the $i$th type. These can be computed using the formula for
Since the reliable module set selection as stated is a difficult optimization problem, we use a greedy heuristic for redundancy injection. For each module $M$ in the original module set, let $R_M^{+}$ be the improvement in reliability resulting from adding the minimum additional redundancy and let $A_M$ be the corresponding increase in area. Redundancy is injected into the module that affords the maximum improvement in reliability per unit increment in area (i.e. the module with the maximum value of $R_M^{+} / A_M$). This can be used to maximize reliability subject to an area constraint or to minimize area subject to a reliability constraint.
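The greedy heuristic can be sketched as follows for the area-constrained case. The sketch rests on assumptions not stated in the text: redundancy grows in steps of two copies (1, 3, 5, ...), the majority voter is perfect, and the module reliabilities and areas are made-up numbers. `nmr_reliability` and `greedy_inject` are hypothetical names.

```python
from math import comb, prod

def nmr_reliability(r, n):
    """Reliability of n-modular redundancy (n odd) with a perfect voter (assumption)."""
    need = n // 2 + 1  # a majority of the copies must work
    return sum(comb(n, k) * r**k * (1 - r)**(n - k) for k in range(need, n + 1))

def greedy_inject(modules, area_budget, voter_area=0.0):
    """modules: {name: (module reliability, module area)}.

    Repeatedly adds two copies to the module with the best reliability gain
    per unit of added area, while the area budget allows it.
    Returns (redundancy level per module, system reliability, total area).
    """
    level = {m: 1 for m in modules}

    def rel(m, n):
        return nmr_reliability(modules[m][0], n)

    def area(m, n):
        return n * modules[m][1] + (voter_area if n > 1 else 0.0)

    total_area = sum(area(m, 1) for m in modules)
    while True:
        best, best_ratio = None, 0.0
        for m in modules:
            n = level[m]
            d_area = area(m, n + 2) - area(m, n)
            d_rel = rel(m, n + 2) - rel(m, n)
            if total_area + d_area <= area_budget and d_rel / d_area > best_ratio:
                best, best_ratio = m, d_rel / d_area
        if best is None:
            break  # nothing affordable improves reliability any further
        total_area += area(best, level[best] + 2) - area(best, level[best])
        level[best] += 2
    system_rel = prod(rel(m, level[m]) for m in modules)
    return level, system_rel, total_area

# Hypothetical module set: reliabilities and areas are illustrative only
modules = {"adder": (0.95, 1.0), "multiplier": (0.90, 4.0)}
print(greedy_inject(modules, area_budget=16.0))
```

Swapping the stopping test (stop once the system reliability reaches a target instead of when the area budget is exhausted) gives the dual use mentioned above, minimizing area subject to a reliability constraint.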
Reliable module-set derivation is then integrated into scheduling disciplines that maximize either system reliability (time-and-area constrained scheduler) or performance (reliability-and-area constrained scheduler) or minimize chip area (time-and-reliability constrained scheduler).
For example, the time-and-reliability constrained scheduler synthesizes a microarchitecture that is area-efficient without violating the designer-specified performance ($C_{sys}$), and reliability
Next, the net list of the reliable microarchitecture is created using bdnet. Subsequently, the macrocell place-and-route subsystem is invoked to generate the physical layout.
The chip area statistics of the reliable 16-point FIR filter IC are summarized in table 4. Contrast the 69.59% interconnect overhead of this implementation with that of the self-recovering implementation. The lower figure is due to the smaller amount of on-chip redundancy.
The clock duration ($C_{dur}$) is obtained as $Delay_{mult} + Delay_{majvoter} + Delay_{mux} + Delay_{reg}$.
The first two terms represent the delay of the 3MR multiplier. Although the $C_{dur}$ of the reliable implementation (145.7 ns) is larger than that of the equivalent self-recovering implementation (due to the larger propagation delay of the 3MR multiplier), the overall iteration time of the reliable implementation (1311.3 ns) is smaller than that of the self-recovering implementation.
This can be attributed to the smaller number of clock cycles required to complete one iteration.
Evaluation of Fault-Tolerant Designs
In the previous two sections, we have systematically derived VLSI implementations of two fault-tolerant mechanisms and summarized their time and area overheads. However, in order to provide a meaningful evaluation of these designs (in terms of area and performance overheads), we have designed a non-redundant version (or the basic design) of the FIR filter.
While the chip layout of this basic design is given in figure 7, the area statistics are summarized in table 5.
The clock duration ($C_{dur}$) as well as the overall iteration time is smaller than that of the fault-tolerant designs. This is because neither $C_{dur}$ ($= Delay_{mult} + Delay_{mux} + Delay_{reg}$) nor
Table 6: A summary of area overhead
Towards gaining insight into the specific apportionment of this hardware overhead, we conducted a detailed evaluation that focused on the overhead due to each of its several sources: the redundant hardware and voting circuitry in the data path, the additional control, and the extra interconnect.
The overhead due to redundant hardware (in the case of fault-recovery it is equal to the additional functional units and checkpoint registers together with the voting circuitry, while in the case of reliability enhancement it is equal to the additional functional units and voting circuitry) is summarized in table 7.
Table 7: Overhead of redundant hardware
Now we evaluate the increase in complexity of the controller for each of the fault-tolerant designs vis-a-vis the complexity of the controller for the non-redundant design.
It is important to note that the controllers themselves are not fault-tolerant. Although the control overhead with respect to the controller for the non-redundant design is significant (12% for the reliable design and 58% for the self-recovering design), the control area as a percentage of the total chip area is by itself negligible (less than 2%).
On average, about 70% of the chip area is occupied by interconnect. The situation is even more acute for the self-recovering design, wherein 85% of the chip area is interconnect.
Two useful metrics for evaluating the increase in interconnect due to fault-tolerance are total net length and average net length. The interconnect of the reliable design is about 1.33 times that of the basic design, while the interconnect of the self-recovering design is about 2.33 times.
There is also a corresponding increase in the average net length.
Finally, the increase in pin count (from 26 in the basic design to 28 in the fault-tolerant designs) is negligible and hence is not a bottleneck for implementing on-chip fault-tolerance.
The two additional I/O pins are used for the additional power and ground lines.
On the Performance Penalty of Fault-Tolerance
The voting circuitry in the fault-tolerant design contributes to the performance penalty.
The increase in clock period for the reliable VLSIC is 10.6%, while there is no increase in clock period for the self-recovering design. This is because, in the latter case, the number of clock cycles was increased instead of the duration of the clock cycle. The performance of the system is measured as the time for completing one computation. Consequently, there is a performance penalty of 10.6% associated with reliability injection and 33.33% associated with self-recovery incorporation.
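These penalties can be cross-checked from the figures reported earlier. The clock durations (131.7 ns and 145.7 ns) and the 12-cycle self-recovering schedule come from the text; the split into nine functional cycles plus three checkpoint cycles, and the 131.7 ns clock of the basic design, are inferred from the nine-cycle performance constraint and the statement that self-recovery does not lengthen the clock. The function names are hypothetical.

```python
def iteration_time_ns(functional_cycles, checkpoint_cycles, clock_ns):
    """Time for one pass through the filter: (C_num + C_check) * C_dur."""
    return (functional_cycles + checkpoint_cycles) * clock_ns

def penalty_percent(design_ns, baseline_ns):
    """Relative slowdown of a design with respect to the baseline."""
    return 100.0 * (design_ns - baseline_ns) / baseline_ns

# Cycle counts for the basic and reliable designs are inferred, see above.
basic = iteration_time_ns(9, 0, 131.7)
reliable = iteration_time_ns(9, 0, 145.7)
self_recovering = iteration_time_ns(9, 3, 131.7)

print(f"basic:           {basic:.1f} ns")
print(f"reliable:        {reliable:.1f} ns, penalty {penalty_percent(reliable, basic):.1f}%")
print(f"self-recovering: {self_recovering:.1f} ns, penalty {penalty_percent(self_recovering, basic):.2f}%")
```

The reliable design pays its penalty in a longer clock period (voter delay in the critical path), whereas the self-recovering design pays it in extra checkpoint cycles.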
Self-Recovering Fifth Order Elliptic Filter
In this section we evaluate the self-recovery overheads on a fifth-order elliptic filter.
Toward this end we have designed a self-recovering fifth-order elliptic filter with the Mentor Graphics toolkit, using a 1.2 micron technology and a standard cell approach. For comparison purposes a non-redundant version of the design has also been implemented. The retry period was set to the number of clock cycles (16 clock cycles). 8-bit fixed point two's complement arithmetic was used. In order to estimate the maximum clock frequency for the two layouts, timing simulation was performed using QuickSimII. The results are summarized in
