We will present the area-efficient fault-detection synthesis component of SYNCERE, an integrated system for synthesizing area-efficient self-recovering microarchitectures. In the SYNCERE model for self-recovery, transient fault detection is based on duplication and comparison, while recovery from transient faults is accomplished via checkpointing and rollback. SYNCERE minimizes the overhead of duplication using two complementary area-optimization techniques. Whereas imposing inter-copy hardware disjointness at a sub-computation level instead of at the overall computation level ameliorates the dedicated hardware required for the original and duplicate computations, restructuring the pliable input representation of the duplicate computation further moderates the overall hardware.
31st ACM/IEEE Design Automation Conference
Introduction
The current generation of automotive electronic ICs has to meet military-quality and fault-tolerance goals at commodity prices [1]. Likewise, life-critical applications such as medical life-support units and industrial process controls mandate fault-tolerance. This growing demand for fault-tolerance, coupled with the inherent unreliability attendant upon VLSI, has elevated the design of fault-tolerant VLSI systems into a research problem of immediate practical relevance. SYNCERE also restructures an input representation so as to exploit the raggedness of its hardware utilization profile. Restructuring of an input representation is automatic and is steered by a multidimensional force metric. Finally, it exploits the unique features of a checkpointed computation by identifying selected intermediate computations and dedicating hardware for each of them and their counterparts. Such a strategy fosters inter-copy hardware sharing without, however, compromising the 100% fault detection capability.
This research was supported by an NSF grant.
In most high-level synthesis systems, only trade-offs between performance and area are explored [2]. Only recently have newer quality metrics such as testability and fault-tolerance been considered at the microarchitectural level. In [5, 3] testability issues were explicitly addressed at the microarchitectural level, while in [4, 6, 8] fault-tolerance issues were explored. Orailoğlu and Karri [6] have developed heuristic and optimal strategies for coactive scheduling and checkpoint insertion during self-recovering microarchitecture synthesis. This is in contrast to the heuristic techniques for scheduling followed by checkpoint insertion proposed in [8]. Along a different dimension, Karri and Orailoğlu [4] have also developed a system for synthesizing reliable microarchitectures.
The rest of this paper is organized as follows. Section 2 outlines our model for self-recovery and presents the methodology. Section 3 describes an area-efficient fault detection mechanism and a scheduling algorithm that performs area optimization. Additional area optimization via flowgraph restructuring is explored in Section 4. Section 5 then describes incorporation of essential fault-detection constraints such as hardware disjointness between the original and the duplicate computations. Section 6 summarizes the results of synthesis experiments.
The model and the methodology
In our model for self-recovery, partial results from two copies are compared at a checkpoint (duplication and comparison) and, if they agree, are written into the checkpoint registers (checkpointing). If, on the other hand, the results disagree, the computation rolls back to the previous checkpoint and retries. Checkpoint insertion [6] completely determines the checkpoints as well as the edges assigned to each of them. Checkpointing groups clock cycles into checkpoint zones. A checkpoint zone is the set of clock cycles between two adjacent checkpoints. Nodes belonging to the same checkpoint zone are coeval, while nodes that are voted upon at a checkpoint are secured. Finally, a secured node together with its coeval predecessors forms a r subgraph. We will illustrate these concepts. Although each secured node belongs to one and only one r subgraph, two or more r subgraphs can have common nodes. For example, r subgraphs {6,8} and {6,7,9} have node 6 in common.

SYNCERE uses a two-pass scheduler. In the first pass, checkpoints are inserted using an edge-based scheduler [6]. Moreover, the voting area overhead is explicitly optimized. In the second pass, fault-detection constraints are incorporated. Initially, the original flowgraph is scheduled using an aggressive multidimensional force-directed scheduler which performs fine-grain area optimization in addition to coarse-grain area optimization. Subsequently, the duplicate flowgraph is restructured using a force-directed transformational subsystem. However, in order to compare results from identical computations at a checkpoint, such flowgraph restructuring is confined to coeval subgraphs. Finally, the duplicate flowgraph is scheduled using a retentive scheduler that considers (i) hardware utilization characteristics of the original flowgraph, (ii) sharing between the original and the duplicate computations, and (iii) disjointness between the original and the duplicate r subgraphs.
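The definition of a r subgraph can be made concrete with a small Python sketch. It assumes that "coeval predecessors" means all transitive predecessors lying in the same checkpoint zone as the secured node; the edge list and zone assignment below are hypothetical, chosen only so that the result matches the {6,8} and {6,7,9} example in the text.

```python
from collections import defaultdict

def r_subgraphs(edges, zone, secured):
    """Map each secured node to the node set of its r subgraph.

    edges   : iterable of (src, dst) data dependencies
    zone    : dict node -> checkpoint zone index
    secured : nodes that are voted upon at a checkpoint
    """
    preds = defaultdict(set)
    for src, dst in edges:
        preds[dst].add(src)

    result = {}
    for s in secured:
        members, stack = {s}, [s]
        while stack:
            node = stack.pop()
            for p in preds[node]:
                # Follow only predecessors in the same checkpoint zone (coeval).
                if zone[p] == zone[s] and p not in members:
                    members.add(p)
                    stack.append(p)
        result[s] = members
    return result

# Hypothetical flowgraph fragment: node 6 feeds both secured nodes 8 and 9.
edges = [(6, 8), (6, 9), (7, 9)]
zone = {6: 1, 7: 1, 8: 1, 9: 1}
subgraphs = r_subgraphs(edges, zone, secured=[8, 9])
shared = subgraphs[8] & subgraphs[9]  # nodes common to both r subgraphs
```

Each secured node owns exactly one entry in the result, while a node such as 6 may appear in several entries, which is the overlap the text describes.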
Multidimensional Scheduling
Traditionally, scheduling algorithms have optimized the peak hardware consumption of the overall flowgraph (we will refer to this as coarse-grain area optimization). However, since the area-efficient fault detection mechanism enforces hardware disjointness at the r subgraph level, the peak hardware of each of these r subgraphs has to be minimized as well.
We will outline a multidimensional analog of the well-known force-directed scheduling algorithm [7] that minimizes the maximum hardware used both by the original flowgraph and by all of its constituent r subgraphs (we will refer to this as fine-grain area optimization).
In the basic force-directed scheduler, a distribution graph is set up for each clock cycle $i$ and for each operation type $j$ as given by equation 1, where $F_k$ is the number of clock cycles to which node $k$ (of type $j$) can be feasibly assigned.
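As an illustration of how such a distribution graph is built, the sketch below assumes the usual formulation of force-directed scheduling [7], in which a node with $F_k$ feasible clock cycles contributes a probability of $1/F_k$ to each of them. The function name, the time-frame encoding, and the sample nodes are hypothetical.

```python
from collections import defaultdict
from fractions import Fraction

def distribution_graphs(frames, op_type):
    """Return dg[(cycle, type)], the expected number of type-j units in cycle i.

    frames  : dict node -> (earliest, latest) feasible clock cycle, inclusive
    op_type : dict node -> operation type
    """
    dg = defaultdict(Fraction)
    for node, (lo, hi) in frames.items():
        f_k = hi - lo + 1  # number of feasible clock cycles for this node
        for cycle in range(lo, hi + 1):
            # Assumed uniform probability 1/F_k over the node's time frame.
            dg[(cycle, op_type[node])] += Fraction(1, f_k)
    return dg

# Hypothetical example: three additions with different mobilities.
frames = {"a": (1, 1), "b": (1, 2), "c": (2, 3)}
op_type = {"a": "add", "b": "add", "c": "add"}
dg = distribution_graphs(frames, op_type)
```

Here cycle 1 carries the fixed node plus half of node b, so it has the highest expected adder demand; the entries of a given type always sum to the number of nodes of that type.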
Next, a node is assigned to a clock if it yields the minimum force [7]. The multidimensional force-directed scheduler extends and supplements the traditional coarse-grain area optimization with fine-grain area optimization. Fine-grain area optimization is accomplished by assigning a node to a clock that additionally minimizes the peak hardware usage of each of the r subgraphs to which the node belongs. Fine-grain distribution graphs, called the r distribution graphs, are computed for each r subgraph. The r forces of assigning a node to a clock are computed (one for each r subgraph to which the node belongs) by using the corresponding r distribution graphs. Let $R_1, R_2, \ldots, R_n$ be the r subgraphs to which the node belongs. Let $F_1, F_2, \ldots, F_n$ be the corresponding r forces resulting from assigning node to clock, and let $F_O$ be the overall force. The total force $F(node, clock)$ of assigning node to clock is given by equation 2.
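The following sketch shows one way the overall force and the r forces could be combined. Since equation 2 is not reproduced here, the plain sum $F_O + F_1 + \cdots + F_n$ is an assumption, as is the use of the standard self force from [7] (predecessor and successor forces are omitted, and a single operator type is assumed for brevity). All names and the sample flowgraph are hypothetical.

```python
from fractions import Fraction

def build_dg(frames):
    """Distribution graph for one operator type: cycle -> expected usage."""
    dg = {}
    for lo, hi in frames.values():
        for c in range(lo, hi + 1):
            dg[c] = dg.get(c, Fraction(0)) + Fraction(1, hi - lo + 1)
    return dg

def self_force(dg, frame, cycle):
    """Self force of fixing a node with time frame `frame` into `cycle`."""
    lo, hi = frame
    p = Fraction(1, hi - lo + 1)
    return sum(dg[c] * ((1 if c == cycle else 0) - p) for c in range(lo, hi + 1))

def total_force(node, cycle, frames, r_subgraphs):
    """Overall force plus one r force per r subgraph containing the node.
    The unweighted sum is an assumption, not the paper's equation 2."""
    force = self_force(build_dg(frames), frames[node], cycle)
    for members in r_subgraphs:
        if node in members:
            r_dg = build_dg({n: frames[n] for n in members})
            force += self_force(r_dg, frames[node], cycle)
    return force

# Hypothetical zone: p is fixed in cycle 1, q in cycle 2; x and y are mobile.
frames = {"p": (1, 1), "q": (2, 2), "x": (1, 2), "y": (1, 2)}
r_subgraphs = [{"p", "x"}, {"q", "y"}]
force_x1 = total_force("x", 1, frames, r_subgraphs)
force_y1 = total_force("y", 1, frames, r_subgraphs)
```

In this example the overall force of placing x or y in cycle 1 is identical, so a coarse-grain scheduler cannot tell the two assignments apart; the r forces break the tie in favor of y, whose r subgraph has no other operation in cycle 1.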
The benefits of such fine-grain area optimization can be illustrated. Consider figure 2a with a checkpoint inserted at the clock boundary 4-5. A coeval subgraph (belonging to the first checkpoint zone) comprising three distinct r subgraphs (r subgraph1 = {8,6,3,4,1,2}, r subgraph2 = {9,6,7,3,4,5,1,2}, and r subgraph3 = {13,12,11,10}) is also shown.
All nodes are color coded to highlight their respective r subgraph membership. Observe that r subgraph1 and r subgraph2 have some operations in common. All of these r subgraphs should be scheduled into the first checkpoint zone alone. Traditional force-directed scheduling does not discriminate between the assignment of operation 3 to clock cycle 1, operation 5 to clock cycle 1, operation 10 to clock cycle 1, and operation 11 to clock cycle 1, as all of these assignments have the same force. Hence, a possible schedule resulting from such a scheduler is shown in figure 2b. Such an assignment, however, results in peak hardware usage of 3 (for r subgraph1), 4 (for r subgraph2), and 2 (for r subgraph3). A superior schedule that additionally minimizes the peak hardware usage of each of the r subgraphs (2 for r subgraph1, 3 for r subgraph2, and 1 for r subgraph3) is shown in figure 2c.
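The per-subgraph peak that this comparison relies on can be computed as in the sketch below. It assumes a single operator type, and the two schedules are hypothetical stand-ins, not the schedules of figures 2b and 2c.

```python
from collections import Counter

def peak_usage(schedule, members):
    """Largest number of operations from `members` sharing one clock cycle.

    schedule : dict node -> assigned clock cycle
    members  : nodes of one r subgraph (all of one operator type here)
    """
    per_cycle = Counter(schedule[n] for n in members)
    return max(per_cycle.values())

# Hypothetical r subgraph of four same-type operations over two cycles.
members = {1, 2, 3, 4}
schedule_coarse = {1: 1, 2: 1, 3: 1, 4: 2}  # three operations collide in cycle 1
schedule_fine = {1: 1, 2: 1, 3: 2, 4: 2}    # usage balanced across cycles
peak_coarse = peak_usage(schedule_coarse, members)
peak_fine = peak_usage(schedule_fine, members)
```

Because hardware disjointness is enforced per r subgraph, it is this per-subgraph peak, and not only the peak of the whole flowgraph, that determines how much dedicated hardware each copy needs.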
Force-Directed Restructuring
The benefits of flowgraph restructuring in the context of self-recovering microarchitecture synthesis can respectively (in the left hand side box). Straightforward duplication requires four subtractors. However, distributivity followed by associativity on the left hand side r subgraph and associativity on the right hand side r subgraph has resulted in a functionally equivalent flowgraph (shown in the right hand side box) requiring only two subtractors. Also, transformations have been confined to within a checkpoint zone. Furthermore, we have annotated all nodes with a functional unit to demonstrate the existence of a feasible operator binding satisfying the hardware disjointness constraint.
We implemented a force-directed approach to restructuring the duplicate coeval subgraphs. The candidate transformations are evaluated using a multidimensional force metric, and a transformation with the minimum negative force is then invoked. Initially, global transformations are applied, as they improve hardware utilization across clock cycles in addition to uncovering flowgraph structures amenable to local transformations. The multidimensional force computation described in section 3 is generalized to derive the force associated with the invocation of a transformation as follows. Let $DG_{untrans}(i,j)$ and $DG_{trans}(i,j)$ be the distributions of the $j$th operator type in the $i$th clock cycle prior to and after the invocation of a transformation, respectively, and let $\Delta DG(i,j)$ be the difference between these distributions, computed using equation 3.

$$\Delta DG(i,j) = DG_{trans}(i,j) - DG_{untrans}(i,j) \tag{3}$$

In order to incorporate the hardware utilization characteristics of the original scheduled flowgraph, the distribution graph of the duplicate flowgraph is supplemented with the hardware utilized by the original copy (say $H(i,j)$). The force $F$ of the transformation is then obtained using equation 4.

$$F = \sum_{\forall i} \sum_{\forall j} \bigl(H(i,j) + DG(i,j)\bigr)\, \Delta DG(i,j) \tag{4}$$

Since the overall hardware is influenced by the peak rather than the per-clock utilization, the force computation is further enhanced to explicitly optimize the peak hardware utilization as follows: $\Delta DG(i,j)$ is modified by substituting the per-clock hardware distribution $DG(i,j)$ with the current peak committed hardware as the reference (approximated by the peak hardware $H(j) = \max_{\forall i} H(i,j)$ of the scheduled original flowgraph), as given by equation 5.

$$\Delta DG(i,j) = \begin{cases} DG_{trans}(i,j) - DG_{untrans}(i,j), & \text{if } DG_{trans}(i,j) > H(j) \\ 0, & \text{otherwise} \end{cases} \tag{5}$$

Next, we will describe the incremental derivation of $DG_{trans}(i,j)$ for the distributivity transformation.
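Before turning to that derivation, equations 3 to 5 can be read as the following minimal Python sketch. It assumes that the unsubscripted distribution in equation 4 is the duplicate's distribution before the transformation; the function name, the dictionary layout keyed by (cycle, type), and the example numbers are hypothetical.

```python
def transformation_force(dg_untrans, dg_trans, h, peak_aware=False):
    """Force of a candidate transformation on the duplicate flowgraph.

    dg_untrans, dg_trans : dict (cycle, type) -> distribution before / after
    h                    : dict (cycle, type) -> hardware used by the original
    peak_aware           : apply the thresholding of equation 5 if True
    """
    # Peak hardware per type in the scheduled original: H(j) = max over i.
    peak = {}
    for (_, j), used in h.items():
        peak[j] = max(peak.get(j, 0), used)

    force = 0
    for key in set(dg_untrans) | set(dg_trans):
        _, j = key
        before = dg_untrans.get(key, 0)
        after = dg_trans.get(key, 0)
        delta = after - before                     # equation 3
        if peak_aware and after <= peak.get(j, 0):
            delta = 0                              # equation 5, "otherwise" branch
        force += (h.get(key, 0) + before) * delta  # equation 4 (assumed DG = before)
    return force

# Hypothetical case: the transformation moves one subtraction from cycle 1 to 2.
h = {(1, "sub"): 2, (2, "sub"): 0}
dg_untrans = {(1, "sub"): 2, (2, "sub"): 0}
dg_trans = {(1, "sub"): 1, (2, "sub"): 1}
plain = transformation_force(dg_untrans, dg_trans, h)
thresholded = transformation_force(dg_untrans, dg_trans, h, peak_aware=True)
```

With the per-clock reference the move is rewarded with a negative force, because it relieves the cycle in which the original copy is busiest. With the peak reference the same move is neutral, since the transformed distribution never exceeds the two subtractors already committed.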
Consider the untransformed flowgraph on the left hand side of the arrow in figure 4a. Assuming a time constraint of two clock cycles, the distribution graphs for the adder and the multiplier are shown below the flowgraph. On the right hand side of the arrow, a transformed flowgraph (resulting from the invocation of distributivity) and the corresponding distribution graph are shown. Note that the transformed flowgraph has a better distribution graph and consequently a lower force when compared to the original flowgraph. Consider the more general flowgraph structure within which distributivity can be applied. In figure 4b, if L1 and R1 are the left and right predecessor subgraphs respectively of the head node (of type ) and if R2 is the right predecessor subgraph of the tail node (of type )

In this final step, the multidimensional force-directed scheduler described in section 3 is supplemented with retention and discrimination capabilities. The retention capability fosters hardware sharing between the original and the duplicate flowgraph by exploiting the compatibility of their hardware utilization profiles. If $H^{1}(i,j)$ is the hardware usage of the $j$th type in clock $i$ in the original schedule, and $DG^{2}(i,j)$ is the hardware distribution graph of the duplicate computation alone, then $DG(i,j)$, the retentive distribution graph, is given by equation 6.

$$DG(i,j) = DG^{2}(i,j) + H^{1}(i,j) \tag{6}$$

Similarly, if $H^{1}_{k}(i,j)$ is the usage of the $j$th type of hardware in clock $i$ in the schedule of the $k$th r subgraph in the original computation, and $DG^{2}_{k}(i,j)$ is the hardware distribution graph of the corresponding duplicate r subgraph alone, then $DG_{k}(i,j)$, the retentive r distribution graph of the $k$th r subgraph, is given by equation 7.

$$DG_{k}(i,j) = DG^{2}_{k}(i,j) + H^{1}_{k}(i,j) \tag{7}$$

The retention capability is supplemented with discrimination, which promotes hardware sharing only if such sharing does not compromise the 100% fault-detection capability of the resulting microarchitecture.

100% fault detection necessitates dedicated hardware for the original and the duplicate computations, and can be enforced by substituting the per-clock-cycle hardware usage in equation 6 with the peak hardware usage. However, since straightforward duplication entails significant hardware overhead, SYNCERE imposes the inter-copy hardware disjointness only at the granularity of r subgraphs, as given in equation 8.

$$DG_{k}(i,j) = DG^{2}_{k}(i,j) + \max_{\forall i} H^{1}_{k}(i,j) \tag{8}$$
Such discriminating retention promotes hardware sharing between the original and the duplicate computations without, however, compromising the 100% fault-detection capability of the resulting microarchitecture.
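The difference between plain retention (equations 6 and 7) and discriminating retention (equation 8) can be sketched as follows. This is an illustrative reading under the notation above; the function name, the dictionary layout keyed by (cycle, type), and the sample values are hypothetical.

```python
def retentive_dg(dg_dup, h_orig, discriminate=False):
    """Retentive distribution graph of a duplicate (r sub)graph.

    dg_dup       : dict (cycle, type) -> distribution of the duplicate alone
    h_orig       : dict (cycle, type) -> hardware used by the original schedule
    discriminate : False adds the per-clock usage (equations 6 and 7);
                   True adds the peak usage over all clocks (equation 8)
    """
    peak = {}
    for (_, j), used in h_orig.items():
        peak[j] = max(peak.get(j, 0), used)

    out = {}
    for key in set(dg_dup) | set(h_orig):
        _, j = key
        offset = peak.get(j, 0) if discriminate else h_orig.get(key, 0)
        out[key] = dg_dup.get(key, 0) + offset
    return out

# Hypothetical r subgraph: the original uses two adders in cycle 1, none in cycle 2.
h_orig = {(1, "add"): 2, (2, "add"): 0}
dg_dup = {(1, "add"): 0.5, (2, "add"): 0.5}
shared = retentive_dg(dg_dup, h_orig)
disjoint = retentive_dg(dg_dup, h_orig, discriminate=True)
```

Under plain retention, cycle 2 looks cheap, which steers the duplicate toward adders the original leaves idle. Under discrimination every cycle is charged the original's peak, so no cycle offers free hardware from the corresponding original r subgraph and the duplicate is pushed onto disjoint units.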
Experimental Results
We will now evaluate the SYNCERE approach to optimizing the area overhead of hardware-redundancy-based detection on three high-level synthesis benchmarks, namely the fifth-order elliptic filter, the AR filter, and the 16-tap FIR filter. The reduction in hardware due to the proposed fault-detection scheme is summarized in table 1. The voting overhead reported in column 3 is an artifact of the checkpoint insertion phase. Reduction in hardware vis-a-vis straightforward duplication is then computed in column 6. A schedule for the elliptic filter, corresponding to the second row in table 1, is shown in figure 6. The hardware savings were 16.67% for the adders and 25% for the multipliers. From this experi- the number of inserted checkpoints. This is because, as more and more checkpoints are inserted, the size of a checkpoint zone as well as the mobility of the nodes in the resulting coeval subgraphs decreases.
We will now evaluate the benefit of flowgraph restructuring. The results of this set of synthesis experiments are summarized in table 2. Reduction in hardware due to flowgraph restructuring vis-a-vis area-efficient duplication is computed in column 6. Restructuring proved extremely beneficial in synthesizing area-efficient designs in the presence of tight performance constraints. The benefits of restructuring are, however, limited by the inserted checkpoints. Specifically, both the clock cycle boundaries where checkpoints are inserted and the number of checkpoints inserted impact the effective application and exploitation of restructuring. Insertion of too many checkpoints precludes the application of global transformations such as associativity.
Conclusion
In SYNCERE, the synergies arising out of an area-efficient fault-detection technique, efficient manipulation of algorithmic-level design structures, and aggressive area optimization at the microarchitectural level have been exploited to synthesize area-efficient
