In this paper we present an area-efficient register transfer level technique for gracefully degradable data path synthesis called phantom redundancy. In contrast to spare-based approaches, phantom redundancy is a recovery technique that does not use any standby spares. Phantom redundancy uses extra interconnect to make the resulting data path reconfigurable in the presence of any (single) functional unit failure. When phantom redundancy is combined with a concurrent error detection technique, concurrent error detection followed by reconfiguration is automatic. We developed a register transfer level synthesis algorithm that incorporates phantom redundancy constraints. There is a tight interdependence between reconfiguration of a (faulty) data path and scheduling and operation-to-operator binding tasks during register transfer level synthesis. We developed a genetic algorithm based register transfer level synthesis approach to incorporate phantom redundancy constraints. The algorithm minimizes the performance degradation of the synthesized data path in the presence of any single faulty functional unit. The effectiveness of the technique and the algorithm are illustrated using high level synthesis benchmarks. 
Introduction
Advances in VLSI have made it possible to implement complex algorithms on a single integrated circuit (IC) with the attendant advantages of reduced power consumption, higher reliability and reduced size and weight. While increasing device densities have made it possible to implement such complex VLSI systems, they have also rendered the ICs highly susceptible to a variety of fabrication-time fault mechanisms. In many VLSI applications, it is not uncommon to experience circuit yields on the order of 10% or even less, which increases the cost of manufacturing the circuit.
A number of researchers have examined fabrication-time reconfiguration approaches to enhance the yield of ICs. These techniques identify failed functional units in a fabricated IC and program the wires to reconfigure the fault free functional units into a working IC.
Built-In-Self-Repair (BISR) is a popular reconfiguration technique. BISR approaches have been applied mostly to regular architectures such as memory [1]. In BISR, reconfiguration is realized by providing a set of spare modules in addition to the core operational modules [1].
In this paper we present a register transfer level technique for reconfiguration of ICs called phantom redundancy that does not use spare modules. Rather, phantom redundancy uses redundant programmable interconnect. When a functional unit is faulty, the interconnection network in the data path is reprogrammed to configure the fault-free functional units into an operational data path albeit with a degraded performance.
Phantom redundancy is applicable to both regular and non-regular data paths and entails small area overhead. Phantom redundancy does not itself perform concurrent error detection (CED). When combined with a CED and faulty unit location technique such as introspection [6], phantom redundancy can be used for dynamic reconfiguration in the field.
Related Research
VLSI reconfiguration techniques have been developed to make regular processor arrays tolerant to faults occurring during operation. Using a spare row (column) of processing elements, Negrini et al. developed a rippling replacement strategy [14]. A faulty module is replaced with its neighbor in the same row (column). When both a spare row and a spare column are available, the fault stealing strategy can be used. In fault stealing, a faulty module is replaced with a neighbor either in the same row or in the same column [14]. When multiple spare rows and columns are present, a repair-most strategy can be used [15]. The repair-most strategy is based on a graph theoretic formulation and a bipartite matching approach. An RT level reconfigurable data path synthesis technique based on spare functional units, called built-in-self-repair (BISR), has been proposed by Guerra et al. [3]. Instead of one spare module for each active module, BISR uses one spare module for each module type. All of these approaches use spare modules.
Tolerance to IC fabrication process related defects can be improved using two techniques. Tuning the process parameters can reduce such fabrication-time defects in the device [19]. However, such process yield maximization does not totally eliminate the fabrication-related defects. Along an orthogonal dimension, defect-tolerant circuit design and layout techniques can maximize the circuit yield. While Chiluvuri and Koren [20] developed layout compaction algorithms to maximize defect-tolerance, Allan et al. [21] proposed selective relaxation of the layout design rules. Phantom redundancy complements these layout level defect-tolerance techniques.
RT level synthesis techniques for area optimal [8, 9], performance optimal [10, 11] and power optimal data path design have been explored [22, 24]. RT level data path synthesis targeting off-line testability [23, 26] and on-line testability [2, 3, 4, 5, 7, 12] has also been addressed. In [2, 5] it has been shown that recovery from transient faults can be done efficiently at the RT level by checkpointing and rollback in hardware. Before rollback-based recovery or reconfiguration can be carried out, the faulty unit must be identified.
Concurrent error detection (CED) and faulty unit location are hence important.
Straightforward duplication entails significant area overhead. RT level techniques for area-efficient CED based on fault security were developed in [4, 7]. RT level techniques using spare capacity in a design have also been proposed [6]. An RT level reconfigurable data path synthesis technique using spare functional units has been proposed by Guerra et al. [3]. On-line testable controller unit synthesis has been reported in [12]. The proposed technique can be used in combination with these CED techniques.
Issues in Gracefully Degradable Data path Synthesis

The Design Methodology
We propose to incorporate phantom redundancy reconfiguration constraints within a top-down VLSI design methodology. From among the various levels of abstraction in such a VLSI design methodology, the register transfer (RT) level is the right abstraction at which to incorporate phantom redundancy. This is because:
1. there is a tight interdependence between the synthesized data path and the reconfiguration of such a data path,
2. the fault model is at the RT level of functional units, and
3. data for reconfiguration such as the clock-by-clock schedule and operation-to-operator binding can be easily obtained at the RT level.
It has been shown that scheduling and binding are NP-hard [13]. Besides, scheduling and binding are interdependent. Hence numerous heuristics have been proposed to solve these problems [9]. Most RT level synthesis systems solve scheduling and binding independently of each other. Since the synthesized architecture profoundly influences its reconfigurability, reconfiguration should be addressed in an integrated manner with the other synthesis tasks.
In this paper we developed a genetic algorithm [18] based technique to solve the simultaneous scheduling, binding and reconfiguration problem. The schedule and binding in the presence of any single functional unit failure are constructed simultaneously. This yields an RT level data path with minimal degradation in performance.
Controller Issues
In a gracefully degradable data path the control unit is important since it orchestrates the reconfiguration. There are two viable options for designing a controller for reconfiguration.
1. Programmable Controllers: Although programmable controllers suffer from the disadvantage of slightly larger silicon area for implementation and slightly lower performance, they have a major advantage in terms of ease of reconfiguration. Even in the absence of faults in the system, the extra interconnect and the controller programmability give the user the option to implement new CDFGs on the architecture much more efficiently.
2. Composed Controllers: The controller for operating the fault-free data path is composed with the controllers for each of the single unit failure scenarios. Although these composed controllers are smaller in size and faster, they are hardwired.
Research Contributions
The important contributions of this paper are:
1. Phantom Redundancy: we present an area efficient technique for data path reconfiguration. Phantom redundancy adds extra programmable interconnect to make the resulting data path reconfigurable in the presence of functional unit failures.
2. Integrating reconfiguration constraints with scheduling and binding: We developed a genetic-algorithm-based global optimization approach for the synthesis of area-efficient gracefully degradable data paths. This is because the problem of reconfiguring a data path with minimal area overhead strongly depends on the original data path. The algorithm performs simultaneous scheduling, binding and reconfiguration to minimize the performance degradation in the presence of a functional unit failure. The reported technique is applicable to regular array architectures and non-regular data path based designs.
Phantom Redundancy
Phantom redundancy is an area-efficient approach to implement gracefully degradable data paths. Phantom redundancy uses additional interconnections and yields gracefully degradable data paths with low hardware overhead. Upon detecting a faulty functional unit, the interconnection network is programmed to perform the intended function on the fault-free functional units albeit at a reduced throughput. Phantom redundancy can be used for fabrication-time and real-time reconfiguration of data paths. This capability is crucial in military and space applications where replacement of a faulty module is either impossible or prohibitively expensive.
Towards illustrating and clarifying the concept of phantom redundancy, consider a CDFG consisting of six operations a, b, c, d, e, f shown in Figure 1 . Assuming that all operations are of the same type and no back-to-back chaining is allowed, the fastest schedule requiring two clock cycles and four functional units is shown in Figure 1 Figure 2 (a,b) ). This data path does not use any spare modules but uses two additional interconnections. This data path can also tolerate 50% of all two-unit faults ((F1, F2) and (F3, F4) ). For all these scenarios the reconfigured data path consumes twice as many clock cycles as the fault-free data path.
Consider another CDFG consisting of fifteen add operations (a1, ..., a15) as shown in Figure 3 (a). The schedule shown here uses three adders (A0, A1, A2). One possible operation-to-operator binding is shown in Figure 3. The functional unit on which an operation is carried out is shown in capital letters. In Figure 3 The two functional units forming a backup pair need not be identical in terms of performance, although they should be capable of carrying out the same function. For example, a fast multiplier unit can have a slow multiplier as its backup. It is the task of the RT level synthesis algorithms to explore these tradeoffs. Further, it has been shown that single unit tolerance is sufficient in most cases.
RT Level fault model
Our functional fault model is based on the observation that the critical area (i.e., the area susceptible to faults) of the functional units is much larger than that of the buses and the registers. Consequently, the probability of faults in functional units is much larger than that in the buses and register files. Hence, we initially target single functional unit failures only. A fault in a bus or a register file results in faulty data being fed to all the units that are connected to it. Hence, a fault in a bus or a register file is equivalent to multiple functional unit failures. Faults in those buses and register files that feed into a single functional unit can be targeted using phantom redundancy. Controller fault-tolerance can be implemented using the technique presented in [12] or by straightforward duplication.
Genetic Algorithm based Gracefully Degradable Data path Synthesis
We will outline a genetic algorithm [18] based approach to synthesizing gracefully degradable data paths. GAs have been used in a wide variety of optimization tasks, including the traveling salesman problem, circuit design and job shop scheduling [27, 28] .
A genetic algorithm is based on the principles of evolution via natural selection. It employs a population of individuals that undergo selection in the presence of mutation and recombination (i.e., crossover) operators. These two operators introduce variation into the individuals in a population. A fitness function is then used to evaluate individuals, and reproductive success varies with this fitness.
An initial population M(0) is randomly generated. The fitness f(i) of each individual i in the current population M(t) is computed. Selection probabilities p(i) are defined for each individual i in M(t) such that p(i) is proportional to f(i). Population M(t+1) is generated by probabilistically selecting individuals from M(t), which are then combined to produce offspring via the mutation and crossover genetic operators. Applying crossover and mutation probabilistically modifies the individuals in a population. The crossover and mutation operations depend not only on the problem structure but also on the way the solution is encoded as chromosomes. Crossover exchanges partial solutions from two chromosomes that have been probabilistically selected based on their fitness. Mutation is applied with a very low probability to introduce new search points. This process is repeated until either the best solution is found or the maximum number of generations is reached.
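The generational loop described above can be sketched as follows. This is a minimal illustration on a toy bit-string problem, not the synthesis algorithm of this paper: the names `evolve`, `onemax`, `random_bits`, `one_point` and `flip_one`, the population size, and the crossover and mutation probabilities are all assumptions chosen for the example.

```python
import random

def evolve(fitness, random_individual, crossover, mutate,
           pop_size=20, generations=50, p_cross=0.8, p_mut=0.02, seed=0):
    """Generic fitness-proportionate GA loop (illustrative sketch only)."""
    rng = random.Random(seed)
    population = [random_individual(rng) for _ in range(pop_size)]  # M(0)
    best = max(population, key=fitness)
    for _ in range(generations):
        scores = [fitness(ind) for ind in population]  # f(i) for each i in M(t)
        next_pop = []
        while len(next_pop) < pop_size:
            # p(i) proportional to f(i): weighted sampling with replacement
            a, b = rng.choices(population, weights=scores, k=2)
            child = crossover(a, b, rng) if rng.random() < p_cross else list(a)
            if rng.random() < p_mut:  # mutation applied with low probability
                child = mutate(child, rng)
            next_pop.append(child)
        population = next_pop  # M(t+1)
        best = max(population + [best], key=fitness)
    return best

# Toy problem (assumption): maximise the number of ones in a 16-bit string.
def onemax(bits):
    return sum(bits) + 1  # +1 keeps every selection weight positive

def random_bits(rng, n=16):
    return [rng.randint(0, 1) for _ in range(n)]

def one_point(a, b, rng):
    cut = rng.randrange(1, len(a))  # single crossover site
    return a[:cut] + b[cut:]

def flip_one(bits, rng):
    i = rng.randrange(len(bits))
    return bits[:i] + [1 - bits[i]] + bits[i + 1:]

best = evolve(onemax, random_bits, one_point, flip_one)
```

In the synthesis setting the individual would be the chromosome set describing a schedule and binding, and the fitness would be derived from the cost function rather than from a bit count.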
A genetic algorithm can be applied to a problem once the solutions to the problem are encoded as chromosomes. An effective genetic algorithm representation and a simple and meaningful fitness function are key to the successful deployment of a GA.
The gracefully degradable data path synthesis problem can be formulated as follows:
Given a CDFG and a hardware model, synthesize a gracefully degradable data path such that:
1. Performance of the unimpaired system is not compromised.
2. Performance degradation in the event of any single functional unit failure is minimal.
3. Area overhead of reconfiguration is minimal.
We will now outline the fitness function, the problem specific coding of the solutions, the genetic operators and the fitness proportionate stochastic selection scheme of the genetic algorithm to solve this RT level synthesis problem.
The Fitness Function
The proposed algorithm simultaneously explores the time and space domains of the design space. A candidate solution α is evaluated by the following fitness function:

$$C(\alpha) = \bigl(w_2 \times \mathrm{area}(F) + w_3 \times \mathrm{area}(M) + w_4 \times \mathrm{area}(X) + w_5 \times \mathrm{area}(IC)\bigr) \times (w_1 \times S)^{\gamma}$$

where the $w_i$ are user-defined weights, F is the set of functional units, M is the set of registers, X is the set of multiplexers, IC is the interconnect complexity, S is the number of clock cycles and γ is a user-defined parameter. Thus the fitness function is of the form area × time^γ.
The interconnect complexity IC is obtained as the weighted sum of the area required for the links associated with the inputs and outputs of each functional unit and the number of buses required to provide the requisite data transfers as given below.
where, i , , and o denote the sets of variables assigned to the left input, the right input and the output, respectively of functional unit
multiplexer is provided at input f i of the functional unit if the number is greater than one.
The area cost of the multiplexers is obtained from a table containing the area of the multiplexer for different number of multiplexer inputs. ρ(var) is the minimum number of registers required to store the set of variables var, given the lifetime table of the variables in var. This is obtained using the left-edge algorithm [25] . The number of links is computed as the sum of the number of links used by each of the functional units in the architecture. The number of buses required is the maximum number of distinct sources and sinks over all time steps.
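The register count ρ(var) obtained from a lifetime table can be illustrated with a small left-edge sketch. The function name `left_edge`, the half-open lifetime convention (a value written at `birth` and last read before `death`, so a register freed at time t can be reused by a value born at t), and the example lifetimes are assumptions made for illustration.

```python
def left_edge(lifetimes):
    """Pack variable lifetimes into registers, one register (track) at a time.

    lifetimes: dict mapping variable name -> (birth, death), treated as the
    half-open interval [birth, death). Returns the list of tracks; its length
    is the number of registers needed.
    """
    # Sort by left edge (birth time); the name breaks ties deterministically.
    remaining = sorted(lifetimes.items(), key=lambda kv: (kv[1][0], kv[0]))
    tracks = []
    while remaining:
        track, leftover = [], []
        free_at = float("-inf")  # time at which the current register is free
        for name, (birth, death) in remaining:
            if birth >= free_at:  # no overlap with the last value in the track
                track.append(name)
                free_at = death
            else:
                leftover.append((name, (birth, death)))
        tracks.append(track)
        remaining = leftover
    return tracks

# Hypothetical lifetime table for five variables.
example = {"a": (0, 2), "b": (0, 1), "c": (1, 3), "d": (2, 4), "e": (3, 4)}
tracks = left_edge(example)
register_count = len(tracks)
```

For this table at most two lifetimes overlap at any time step, and the sketch packs the five variables into two registers.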
Problem specific genetic coding
Each valid solution is encoded using four chromosomes (strings) $g_1$, $g_2$, $g_3$ and $g_4$. An operation that is bound to a faulty functional unit is re-bound to its backup unit. If the backup unit is free, the operation is scheduled in that clock cycle. Otherwise, the operation is deferred to a later clock cycle when the backup unit becomes available.
Construction of a schedule and binding
An operation bound to a fault-free functional unit may have to be re-scheduled and re-bound when its predecessor operations in the CDFG are re-scheduled and re-bound. If the functional unit is still free, then the operation is scheduled into the clock cycle. Otherwise, if the backup unit is free then the operation is re-bound to the backup unit. In the worst case, the operation is deferred until one of the units in the backup pair becomes available.
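The re-binding rules above can be sketched as a simple cycle-by-cycle list scheduler. This is an illustrative reading of the rules and not the algorithm used in the experiments: the function `reschedule`, the assumption that every operation takes one clock cycle, the priority order, and the six-operation CDFG with units U0 to U3 are all hypothetical (the CDFG is not the one in Figure 1).

```python
def reschedule(preds, binding, backup, failed, order):
    """Rebuild schedule and binding around one failed unit (single-cycle ops).

    preds:   operation -> list of predecessor operations in the CDFG
    binding: operation -> unit it uses in the fault-free design
    backup:  unit -> its backup unit (must differ from the unit itself)
    failed:  the faulty unit, or None for the fault-free case
    order:   operations in priority order (e.g. original schedule order)
    """
    done_at, new_binding = {}, {}
    pending, cycle = list(order), 0
    while pending:
        cycle += 1
        busy, deferred = set(), []
        for op in pending:
            ready = all(done_at.get(p, cycle) < cycle for p in preds[op])
            home = binding[op]
            # Try the original unit first, then its backup; skip the faulty one.
            candidates = [u for u in (home, backup[home]) if u != failed]
            unit = next((u for u in candidates if u not in busy), None)
            if ready and unit is not None:
                busy.add(unit)
                done_at[op] = cycle
                new_binding[op] = unit
            else:
                deferred.append(op)  # wait until a unit of the pair is free
        pending = deferred
    return done_at, new_binding

# Hypothetical example: four units, backup pairs (U0, U1) and (U2, U3).
preds = {"a": [], "b": [], "c": [], "d": [], "e": ["a", "b"], "f": ["c", "d"]}
binding = {"a": "U0", "b": "U1", "c": "U2", "d": "U3", "e": "U0", "f": "U2"}
backup = {"U0": "U1", "U1": "U0", "U2": "U3", "U3": "U2"}
order = ["a", "b", "c", "d", "e", "f"]

ok_schedule, _ = reschedule(preds, binding, backup, None, order)
bad_schedule, bad_binding = reschedule(preds, binding, backup, "U0", order)
degradation = max(bad_schedule.values()) - max(ok_schedule.values())
```

Repeating the call for each unit and taking the largest degradation gives the worst-case figure over all single-unit failures for this toy design.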
Faults in different functional units impact the performance degradation of the data path to a different extent. Hence, the worst-case performance degradation δ of the gracefully degrading data path should be considered. The worst-case performance degradation δ is measured as the number of additional clock cycles required to execute the CDFG. The cost function is modified to account for δ as follows:
$$C(\alpha) = \bigl(w_2 \times \mathrm{area}(F) + w_3 \times \mathrm{area}(M) + w_4 \times \mathrm{area}(X) + w_5 \times \mathrm{area}(IC)\bigr) \times (w_1 \times S + w_6 \times \delta)^{\gamma}$$

where $w_6$ is the user-defined weight for the worst-case performance degradation δ of the gracefully degradable data path and S is the number of clock cycles for the basic data path.
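As a quick numerical illustration of the modified cost function, the sketch below evaluates it for made-up inputs. The function name `cost`, the dictionary of weights keyed by index, and all numeric values are assumptions for the example; they are not taken from the experiments.

```python
def cost(area_f, area_m, area_x, area_ic, s, delta, w, gamma):
    """Evaluate C = (weighted area) * (w1*S + w6*delta) ** gamma."""
    area = w[2] * area_f + w[3] * area_m + w[4] * area_x + w[5] * area_ic
    time = w[1] * s + w[6] * delta
    return area * time ** gamma

# Hypothetical values: unit weights, gamma = 1, a 17-cycle basic schedule.
weights = {i: 1 for i in range(1, 7)}
fault_free = cost(10, 4, 2, 4, s=17, delta=0, w=weights, gamma=1)
degraded = cost(10, 4, 2, 4, s=17, delta=2, w=weights, gamma=1)
```

With unit weights and γ = 1 the two values differ only through the δ term, which is how a larger worst-case degradation penalizes a candidate; raising γ increases the emphasis on time relative to area.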
Genetic Operators
The genetic encoding employed allows the use of simple crossover and mutation operators. We use single point crossover for the strings $g_1$, $g_2$ and $g_3$. The minimal area overhead constraint requires that each functional unit have only one backup unit. This implies that the index of each functional unit must appear only once in string $g_4$, which cannot be enforced using single point crossover. Hence we employ partially matched crossover (PMX) [18] for string $g_4$. PMX ensures that every functional unit index used in a design appears exactly once.
Figure 8: Partially matched crossover (PMX)
In PMX, two parent strings are aligned and two crossover sites are selected at random along the strings. These two points define the matching section that is used to effect a crossover through position-by-position exchange operations. This is shown in Figure 8 .
PMX proceeds by position-by-position exchange between the two strings. First, the string Parent 2 is mapped to Parent 1 , and the entries in the matching section are exchanged. In Figure 8 , the sub string A2A3A0 Parent 2 is mapped on to the corresponding sub string in 
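The position-by-position exchange can be sketched as below. This is a minimal illustration of the standard PMX operator producing one child, assuming both parents are permutations of the same unit indices; the function name `pmx`, the explicit cut points, and the four-unit example strings are hypothetical and do not reproduce Figure 8.

```python
def pmx(parent1, parent2, cut1, cut2):
    """Return a child of parent1 that takes parent2's matching section.

    The matching section is positions cut1 .. cut2-1. Each exchange swaps two
    entries inside the child, so the result is always a valid permutation.
    """
    child = list(parent1)
    for i in range(cut1, cut2):
        wanted = parent2[i]
        if child[i] != wanted:
            j = child.index(wanted)  # where the wanted entry currently sits
            child[i], child[j] = child[j], child[i]
    return child

# Hypothetical backup strings over four adders.
p1 = ["A0", "A1", "A2", "A3"]
p2 = ["A1", "A3", "A0", "A2"]
child1 = pmx(p1, p2, 1, 3)
child2 = pmx(p2, p1, 1, 3)
```

Calling the function twice with the parents swapped yields the two offspring. Because every step is a swap within one string, no unit index is duplicated or lost, which is the property single point crossover cannot guarantee.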
Selection scheme
An individual is selected for crossover and reproduction with a probability that is proportional to its fitness [18] . This selection scheme increases the representation of above average fitness chromosomes in the population and has a marked effect on the performance of the genetic algorithm. The genetic operators exploit this increased concentration of quality chromosomes in constructing better solutions as the generations evolve.
Results
We summarize the trade-off studies conducted on a set of benchmark examples including the fifth order elliptic wave digital filter (EWF5), an AR filter (ARMA), a third order bilinear loss-less discrete integrator filter (LDI3), and the FIR filter (FIR16).
Phantom Redundancy using Non-pipelined Functional Units
Initially, we used non-pipelined functional units with the multiplier taking 2 clock cycles and the adder taking one clock cycle. We also assume that the adder is capable of carrying out subtract operations. The system word length is 24 bits and the filter coefficients are 16 bits.
Table 1: Impact of phantom redundancy on designs synthesized using non-pipelined functional units.
Results of this experiment are summarized in Table 1 when compared to the clock cycle of the basic data path. The overall performance overhead when this is considered is 1.1x of that discussed in the rest of the paper. To compare the performance overhead of the proposed scheme vis-a-vis the BISR scheme, we assume that the performance of a BISR design is the same as that of the basic design. Hence, the reported performance degradation for phantom redundancy is also its performance degradation when compared to BISR. The number of registers in the original design and the number of registers in the gracefully degradable design appear in the next two columns. The chip area estimates for the original IC, the gracefully degradable IC, and the BISR IC are given in the columns titled Chip Area Orig., Chip Area Phant. and Chip Area BISR. The reported chip area estimates were obtained using the HYPER hardware database [11]. These area estimates include the controller area and are known to be within 15% of the actual layout areas in the worst case [11]. From Table 1, it can be seen that on average, phantom redundancy entails 28.19% less area (with a standard deviation of 11.96%) when compared to BISR. The figures for LDI3 and EWF5 indicate that large savings in area can be obtained at the cost of a small performance loss: 2 clock cycles for EWF5 and 1 clock cycle for LDI3.
While area overhead of phantom redundancy is negligible when compared to that of the original design, additional area required for BISR corresponds to a significant proportion of the original chip area. If the basic design has a large number of functional units, then the phantom redundant design has to support one backup schedule for each possible failure scenario. This is the main source of controller overhead and contributes to the overall area overhead. Performance degradation of these gracefully degradable data paths using non-pipelined functional units is quite significant. Performance degradation (of over 78%) is highest for the AR filter built from 2 adders and 2 multipliers. A closer look into this synthesized design and the AR algorithm reveals that this is because of two factors. Firstly, the number of functional units of a given type is very small resulting in a small number of operational functional units in the presence of a failure. Secondly, the multiplication and the addition operations in the AR filter are clustered. This dramatically increases the critical path in the algorithm.
The schedule and binding corresponding to the seventh row of Table 1, for the 16-tap FIR filter with 4 adders and 4 multipliers, is shown in Figure 9 (a). The schedule and binding when adder A0 fails is shown in Figure 9 (b). The performance degradation is one clock cycle. The schedule and binding when multiplier M0 fails is shown in Figure 9 (c). The performance degradation is four clock cycles. Similarly, a 17-clock cycle schedule for EWF5 with 3 adders and 3 multipliers (corresponding to the second row of Table 1) is shown in Figure 10 (a). The reconfigured 18-clock cycle schedule and binding when adder A0 fails is shown in Figure 10 (b). The reconfigured 19-clock cycle schedule and binding when multiplier M0 fails is shown in Figure 10 (c).
Impact of Pipelined Functional Units on Phantom Redundancy
We have seen that non-pipelined functional units entail large performance degradations. However, all these reconfigured ICs are obtained at no or minimal additional cost, thereby increasing the effective number of usable ICs (perfect ICs + partially good ICs).
We will now assess the impact of pipelined functional units on performance degradation.
For this experiment we use a two stage pipelined multiplier with a latency of 2 clock cycles and an initiation rate of 1 clock cycle. The results are summarized in Table 2. The best results were obtained for the EWF5 example, where the performance degradation is 11.76% while the area overhead incurred is only 2.29%. The BISR strategy leads to no performance degradation but the area overhead amounts to 45.53%. The worst performance degradation occurs for the ARMA example with 2 multipliers and 2 adders. The increase in performance degradation is largely due to the availability of only one adder (multiplier) unit in the event of a failure in one of the 2 adder (multiplier) units present. This leads to a 45.45% penalty in performance when one of the adders fails. On average, the savings in area over the spare-based BISR approach is 30.34%, with a standard deviation of 13.28%. The average degradation in performance is 25.61%, with a standard deviation of 14.85%.
There is a marked reduction in performance degradation when pipelined functional units are used. This is because pipelined units permit the initiation of operations at a much higher rate when a functional unit fails. Thus operations assigned to a failed unit are scheduled in earlier steps as opposed to waiting for a multi-cycle operation to be completed before the next operation can be initiated. Pipelining is particularly effective when operations of a type are clustered in the CDFG. The reduction in performance degradation will be even more significant with deeply pipelined functional units (multiple pipeline stages).
Phantom Redundancy using multifunctional ALUs
In Table 3 we present the results when multifunctional ALUs are employed. The ALU is assumed to carry out multiplication, addition and subtraction in a single clock cycle. Table 1 and Table   2 because the number of functional units used in the examples is much smaller.
Phantom redundancy based on an enhanced fault model
A closer look at the components of the area of an IC shows that interconnect uses about 50% of the total area. Hence, targeting faults in single functional units alone may not be sufficient. Consequently, we adopted an enhanced functional fault model proposed in [3] that targets single faults in functional units, register files and interconnections. A fault in a register file is considered as a fault in the functional unit that the register file feeds, while a fault in an interconnect line is considered as a fault in the functional unit/register file from which it emanates. These constraints are then modeled as register and bus allocation constraints. Phantom redundancy can be used for defect tolerance in mature fabrication processes where the process yield is sufficiently high to make the area overhead of BISR unreasonable. In a mature fabrication process, the phantom redundancy technique provides a low cost means of improving the yield of ICs by yielding partially good chips. In application scenarios where the targeted performance quality is not stringent, phantom redundancy is once again a good alternative. BISR does not entail performance degradation and may be preferable in designs where tight performance constraints on the system must be met. BISR is good for new process lines where stringent performance standards must be satisfied.
The synthesis results showed that the interconnect area is about 50% of the total area.
Phantom redundancy technique presented in this paper that is based on enhanced fault
1. In data dominated designs, the data path tends to occupy far more area than the controller. So the impact of control on the overall area of the chip is small.
2. In the hardware model that we use the control is distributed throughout the design and any increase in controller area is accommodated in the dead areas of the layout.
3. While the complexity of the control logic increases, it does not correspond to the controller area in a one-to-one fashion. The controller area is a non-linear function of the schedule, allocation and mapping. Moreover, state assignment for the new controller results in a different encoding of the states and hence is a factor in keeping the control overheads low.
If controller fault-tolerance is necessary, then existing techniques such as [12] or straightforward duplication can be used.
Conclusions
In this paper we presented a low-cost RT level technique for designing gracefully 
