Abstract: This paper introduces a technique to transform a given RT-Level design, consisting of control logic and data path, into a functionally equivalent, minimized design which is 100% testable under full-scan at the gate level. The proposed RT-Level optimization technique uses the RT-Level structure and exploits the interaction between the control and the data path. Our approach maintains the RT-Level design hierarchy while performing RT-Level transformations of the initially specified data path, followed by resynthesis of the control using don't cares extracted from the data path. Experiments with several RTL benchmarks demonstrate the effectiveness of the technique in generating fully testable designs. In addition, comparison with logic-level techniques shows the advantages of the proposed technique as an optimizing tool to produce circuits with reduced area and delay.
Introduction
The goal of behavioral synthesis is to produce an RTL hardware description that meets various design constraints. The major tasks involved are scheduling of operations into control steps and allocating resources (functional units and registers) to perform operations and storage. A summary of recent approaches on these topics is available in [1], [2], [3].
In most approaches, scheduling is performed in a way that meets the resource and timing constraints before the actual RTL realization. Since we want to consider general designs consisting of control as well as data operations, as opposed to data-intensive designs like DSP applications, we examined two scheduling strategies applicable to control-dominated designs in greater detail: the AFAP schedule as described in [4] and an alternative MINC schedule, introduced and described in [5]. Our experiments confirm the experience of other researchers: regardless of which schedule one uses, none of the RTL designs that are produced are 100% testable at the gate level. Also, the initial RTL designs can be significantly minimized before they are submitted to final technology mapping.
A general RT-Level design consists of control and data paths. Since the structural description of the data path is not amenable to two-level and multi-level minimization, and since extracting its state transition graph is not feasible, the logic-level synthesis for testability techniques are applicable only to the control part of the RT-Level description [6], [7]. More recently, techniques have been proposed to synthesize testable data paths from high level descriptions [8], [9], [10], [11]. While the control and data path can be made fully testable, the disadvantage of considering two parts of the synthesis process separately is that the whole design may not be fully testable.
(Footnote: The research of Subhrajit Bhattacharya has been supported by a benchmark grant from ACM/SIGDA and a grant from C&C Research Laboratories, NEC USA.)
An RTL description consisting of control and data path can be optimized at the logic level using known heuristics which minimize multi-level logic: either by direct redundancy removal [12], by using don't cares and factorization techniques [13], [14] or by combining both techniques on selected partitions [15], [16]. However, lack of knowledge of the RTL structure at the logic level may make the logic-level techniques, like extracting useful don't cares, prohibitively expensive. Also, application of the logic-level techniques to the complete design destroys the RT-Level hierarchy.
An alternate and more efficient approach to optimize RT-Level designs is to use RT-Level transformations. The CALLAS high-level synthesis system uses RT-Level transformations to minimize the number of multiplexors in an RTL design [17]. In this paper, we propose a general methodology to generate optimized RTL control-data path designs which are 100% testable under full-scan at the gate level. Also, all false paths are implicitly removed from the data path. Our approach uses the hierarchy of the RT-Level design, and the interaction between control and data path. Using the knowledge of the RTL structure allows application of simple RT-Level transformations on the data path. The RTL hierarchy also allows for easy calculations of exact observabilities of data signals, and exact observability don't cares for control signals. The resultant synthesis process consists of transformations applied to the data path, followed by extracting don't cares from the data path to optimize the control.
The attractive features of the proposed RT-Level transformations are the following.
1. The RTL hierarchy is maintained.
2. The data path undergoes RT-Level transformations instead of gate level transformations. For instance, RTL units like functional units, multiplexors and registers are optimized.
3. Using the RTL structure simplifies extracting don't cares from the data path to minimize the control logic. Lack of knowledge of the RTL structure at the logic level makes extracting the same don't cares at the logic level computationally very expensive, if not impossible.
4. All combinational false paths in the data path are implicitly removed. This too is simpler at the RT-Level than at the gate level. A circuit with false paths can be fully testable for single stuck-at faults. However, removing false paths in the data path optimizes logic as well as reduces the cost of test pattern generation.
5. The final circuit is 100% testable under full scan.
We have applied our RT-Level optimization techniques on several RT-Level benchmarks [5]. We report on our results in the context of the overall synthesis process as depicted in Figure 1. Given a behavioral specification, we will have synthesized different RTL implementations, depending on the schedule we use. Experimental results with benchmarks demonstrate that while a scheduling strategy does affect the initial standard cell layout area, critical path delay and gate level testability, the proposed RTL transformations and resynthesis reduce this variance significantly while consistently achieving 100% fault coverage at gate level, regardless of the initial schedule.
The paper is organized as follows. In Section 2, we use a benchmark example to illustrate the RT-Level hierarchy and demonstrate the major steps of the proposed approach. In Sections 3 and 4, we formalize two important transformations that allow us to generate a canonical representation of the data path. We show in Section 5 how to efficiently extract and use don't cares associated with the canonical form of the data path to minimize the control logic such that the network combining both becomes fully testable, while also removing all false paths in the data path. The section on experimental results tabulates interesting effects of the transformations on area, delay and testability of several RT-Level benchmarks. The experiments show that while none of the initial designs are fully testable, the transformations almost always produce fully testable designs, while consistently reducing area and delay.
RT-Level Hierarchy and Transformations: An Illustration
A typical RT-Level design consists of a data path (DP) and control logic. We partition the control logic into three blocks, the encoding logic EL, the decoding logic DL and the control logic CL. The RT-Level hierarchy is illustrated by Figure 2, which shows a partial RT-Level specification of the benchmark Fancy.b [5]. The RT-Level description was obtained by scheduling the Control Data Flow Graph (CDFG) from a behavioral description of the benchmark, shown in Appendix B, using the MINC scheduling technique [5]. For simplicity, not all of the control signals in the block DL are specified. Besides the four major partitions, the data path registers P and state registers S are the other constituents of the RTL specification. The data path (DP) block consists of a structural description of multiplexors (muxes) and functional units (FUs). The muxes are typically used to route the output of data path registers and FUs in the data path back to the data path registers. This corresponds to assignment statements in a high level description. The data path network follows two design rules. The data input signals to the FUs and muxes are outputs of registers, muxes or FUs. The control signals to the muxes can only come from either the DL block or the data path registers. As will be seen later, the design rules allow simple algorithms for data path optimization.
The control of a general design consists of three parts, a functional description of an FSM implementing the schedule (CL), a structural description of comparators implementing the conditionals in the high level description (EL) and a decoding logic (DL). The DL uses the state information from the FSM state registers S and the evaluation of the conditionals in EL to control the data path by switching the muxes. While the CL and DL blocks are amenable to two-level and multi-level minimization, the structural description of the comparators in the EL block is not. Hence, the EL block is maintained separately while minimizing CL and DL using don't cares derived from the data path. However, any comparator module with constant inputs is made irredundant separately before using it in the circuit.
While each block in an RT-Level description may be in itself irredundant, the overall circuit may not be fully testable, in spite of all the scan registers shown. The initial RT-Level design of benchmark Fancy.b is shown in Figure 2. The complete design is not 100% testable, as shown in Table 5. Using RT-Level transformations introduced in this paper, the initial RT-Level design (Figure 2) can be transformed into the optimized RT-Level design shown in Figure 4(b), which is 100% testable under full-scan at the gate level.
The dependencies of the controlling inputs in this example give rise both to redundancies and to false paths. The initial data path is first transformed into a cascade of multiplexors and functional units, thereby reducing the number of functional units to its minimum, as shown in Figure 3(a). The second transformation converts the data path into a canonical form, minimizing the number of multiplexors. The two chains of multiplexors in Figure 3(b) represent the canonical form which is equivalent to the network of multiplexors in Figure 3(a). The new control signals are simply data input observabilities, e.g. signal $O^2_a$ represents the observability of data input a at output $F_2$ of the mux network. Consequently, the control logic block DL changes as shown in Figure 3(b). The next step is identifying don't cares from the data path, and using the don't cares to minimize the control logic. For example, when control signal $O^2_c = 1$, the value of the control signal $O^2_a$ does not affect the output of the data path. Hence, the ON-set minterms of $O^2_c$ are don't cares for signal $O^2_a$, and can be used to minimize $O^2_a$. Minimizing the control logic using data path don't cares identifies that $O^1_d = 0$. This indicates that the data input d is redundant, allowing for removal of the last false path, and thereby further minimizing the data path. The multiplexor corresponding to the constant signal e is also pruned. The optimized control (DL) and data path of the benchmark Fancy.b are shown in Figure 4(a). Finally, the multiplexor chains are balanced, generating the final RT-Level design shown in Figure 4(b).
While the initial RTL design (Figure 2) has 10 multiplexors and 3 adders, the final RTL design (Figure 4(b)) has only 5 multiplexors and 1 adder. The transformations thus reduce the number of FUs required, thereby increasing resource sharing.
A brief word on notation. The nodes of the data path network represent muxes and functional units. We designate the nodes connected to the "0" and "1" data inputs, control input and output of the multiplexor $M_i$ as $l_i$, $r_i$, $c_i$, $m_i$ respectively. The operator corresponding to an FU node "x" is referred to as "op(x)".
In this section, we give an algorithm to transform a data path of multiplexors and functional units into a Mux-FU network. It is assumed that functional units have two inputs (operands) and the operations they perform have an identity element. The data path transformation can be achieved in terms of two operations, propagate and duplicate, which we describe below.
The propagate operation is invoked when each input of a multiplexor $M_i$ has a functional unit of the same type, that is, $op(l_i) = op(r_i)$. It simply propagates the FUs through the mux $M_i$, effectively merging the functional units and introducing a multiplexor at each input of the propagated functional unit. The multiplexors now multiplex the inputs to the two functional units in the initial network and both muxes are controlled by $c_i$. If any of the FUs has multiple fanout, the FU is duplicated, one copy for the fanout feeding the multiplexor $M_i$, and the other copy for the other fanouts. The propagation of the FUs from the inputs to the output of multiplexor $M_i$ is done subsequently. The propagate operation is illustrated in Figure 5, which shows two adders propagated to the output of the multiplexor.
In a general data path, both the inputs of a multiplexor may not have FUs, or the FUs on the inputs may be of different types. This is illustrated by the data path shown in Figure 6(a). Multiplexor $M_1$ has an adder on one input and a multiplier on the other input. In that case, the duplicate operation is invoked so that both inputs of the multiplexor have an FU of the same type. The duplicate operation is invoked either when one of the inputs has an FU node and the other one does not, or when the FUs on the left and right inputs of the mux are different.
In the latter case, either the left FU or the right FU could be duplicated before invoking the propagate operation. The FU chosen could affect the final delay or area but not the testability. Since our primary goal in this paper is testability, we arbitrarily choose the left FU to simplify the algorithm. In the former case, the operation introduces an FU at the mux input which does not have an FU node. This new FU node has one input corresponding to the input node to the mux (before introduction of the FU); the other input is the identity element of the operation the FU performs. This transformation enables propagation through the multiplexor, while preserving the functionality of the data path. The second case of asymmetric FUs is dealt with similarly.
Figure 6 illustrates the process of propagating FUs to produce a Mux-FU network. A data path consisting of multiplexors and FUs is shown in Figure 6(a). Since there are FUs on the path from multiplexor N to multiplexor M, the data path needs to be transformed into a Mux-FU network. Figure 6(b) shows the effect of the duplicate operation on multiplexor M in Figure 6(a). Following the duplicate operation, the adders are propagated by a subsequent invocation of the propagate operation, as illustrated in Figure 6(c). Finally, a Mux-FU network is achieved by propagating the two multipliers as shown in Figure 6(d).
We outline the algorithm which transforms a network of muxes and FUs into a Mux-FU network, using the propagate and duplicate operations described above.
[...] path, it can be shown that the transformation will always decrease the number of FUs, thus improving resource sharing. However, in a more general case, the number of FUs may increase on propagation, as for the QRS benchmark. In this situation, we do not propagate the FUs. Similarly, the delay of the RTL design may be affected while propagating FUs.
For example, propagating two FUs of different types to the output of the multiplexor may increase the length of the critical path. Any increase in the number of FUs or in delay can be viewed as a penalty for improved testability. We are currently investigating the trade-off between resource sharing, delay and testability.
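The propagate and duplicate operations described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the paper's Algorithm 1: the tuple encoding of nodes, the names `push_mux`, `to_mux_fu`, `evaluate` and `count_fus`, and the restriction to `add` and `mul` are all hypothetical. The data path is modelled as a tree, so an FU with multiple fanout is implicitly already duplicated.

```python
# Hypothetical tree encoding (not from the paper):
#   ("in", name) | ("const", value) | ("fu", op, a, b) | ("mux", ctrl, in0, in1)
IDENTITY = {"add": 0, "mul": 1}  # identity element of each two-input operation


def push_mux(ctrl, in0, in1):
    """Build a mux over two operands that are already in Mux-FU form."""
    if in0[0] != "fu" and in1[0] != "fu":
        return ("mux", ctrl, in0, in1)            # no FU to propagate
    op = in0[1] if in0[0] == "fu" else in1[1]     # prefer the left ("0") FU type
    # duplicate: wrap an operand lacking an FU of type `op` as op(x, identity)
    if in0[0] != "fu" or in0[1] != op:
        in0 = ("fu", op, in0, ("const", IDENTITY[op]))
    if in1[0] != "fu" or in1[1] != op:
        in1 = ("fu", op, in1, ("const", IDENTITY[op]))
    # propagate: merge both FUs behind the mux and mux their operands instead
    return ("fu", op, push_mux(ctrl, in0[2], in1[2]), push_mux(ctrl, in0[3], in1[3]))


def to_mux_fu(node):
    """Rewrite a mux/FU tree so that no FU remains below a mux."""
    if node[0] in ("in", "const"):
        return node
    if node[0] == "fu":
        return ("fu", node[1], to_mux_fu(node[2]), to_mux_fu(node[3]))
    return push_mux(node[1], to_mux_fu(node[2]), to_mux_fu(node[3]))


def evaluate(node, env):
    """Evaluate a tree for given data and control values."""
    if node[0] == "in":
        return env[node[1]]
    if node[0] == "const":
        return node[1]
    if node[0] == "fu":
        a, b = evaluate(node[2], env), evaluate(node[3], env)
        return a + b if node[1] == "add" else a * b
    return evaluate(node[3] if env[node[1]] else node[2], env)


def count_fus(node):
    if node[0] in ("in", "const"):
        return 0
    return (node[0] == "fu") + count_fus(node[2]) + count_fus(node[3])


A, B, X, Y = (("in", n) for n in "abxy")
same_type = ("mux", "c", ("fu", "add", A, B), ("fu", "add", X, Y))
mixed = ("mux", "c", ("fu", "add", A, B), ("fu", "mul", X, Y))
```

For `same_type`, the two adders merge into one adder fed by two muxes. For `mixed`, the adder is duplicated onto the multiplier side with the identity 0, the adders are propagated, and the multiplier is then handled the same way with the identity 1.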
In the next section, we show how we can transform a multiple output multiplexor network as derived by Algorithm 1 into a canonical form. The canonical form can be resynthesized, as discussed in Section 5, to result in a circuit which is not only fully testable, but also free from false paths.
Transforming a Multiplexor Network into a Canonical Form
After the data path has been transformed into a Mux-FU cascade, the mux network is transformed into a canonical form. The canonical network consists of disjoint mux subnetworks, one for each output of the Mux-FU cascade. These subnetworks can be realized as a chain of muxes or as partially or fully balanced mux trees. Transforming the mux network into a canonical form makes the following tasks simple: removal of false paths in the data path and extracting don't cares from the data path to minimize the control. Figure 3(a) shows the Mux-FU cascade generated from the data path shown in Figure 2. Given the mux network in Figure 3(a), the equivalent canonical mux networks corresponding to outputs $F_1$ and $F_2$ are shown in Figure 3(b). In the canonical mux network corresponding to output $F_1$, the control signals to the muxes are the observabilities of the data inputs at the output $F_1$ in the initial mux network of Figure 3(a). The same is true for the canonical mux network corresponding to $F_2$. In this section, we first show how the observabilities of the data inputs of a Mux-FU cascade can be easily calculated. Next, we show how the data observabilities can be used to derive the desired canonical mux networks. Algorithm 2 uses the following properties to calculate the exact observabilities of the data inputs at the outputs of the mux network.
Computing Data Observabilities
P1. For any node x, the observability of x at the output [...]
When deriving an expression for the output of a mux network, it should be noted that not all data inputs to a network need have paths to all the outputs of the multiplexor network. However, for simplicity of notation, we consider all the data inputs to all the mux networks when deriving expressions for the output of the $k$-th mux-chain, $F_k$. If an input does not have any path to the particular output in question, its observability function is identically zero. In general, the observability functions have a support set which is a subset of the set of control signals to the muxes, $c_1, c_2, \ldots, c_n$.
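As a hedged illustration of how such observabilities can be computed, the sketch below walks a mux tree from the output towards the data inputs. It assumes ordinary mux semantics (the "0" input is observable when the control is 0 and the mux output is observable, and likewise for the "1" input) rather than the paper's own properties, which are not reproduced here. The tuple encoding and the names `observabilities` and `net` are hypothetical; each observability function is represented by its ON-set over the control assignments.

```python
from itertools import product


def observabilities(net, controls):
    """Map each data input to the set of control assignments that make it
    observable at the output. net is ("in", name) or ("mux", ctrl, in0, in1)."""
    index = {c: i for i, c in enumerate(controls)}
    obs = {}

    def walk(node, on_set):
        if node[0] == "in":
            # several paths to the same input: OR (union) of their conditions
            obs.setdefault(node[1], set()).update(on_set)
            return
        _, ctrl, in0, in1 = node
        walk(in0, {a for a in on_set if a[index[ctrl]] == 0})
        walk(in1, {a for a in on_set if a[index[ctrl]] == 1})

    # the output is always observable at itself
    walk(net, set(product((0, 1), repeat=len(controls))))
    return obs


# Input d sits behind c1 = 1 at the top mux and c1 = 0 below it: a false path.
net = ("mux", "c1",
       ("mux", "c2", ("in", "a"), ("in", "b")),
       ("mux", "c1", ("in", "d"), ("in", "c")))
obs = observabilities(net, ["c1", "c2"])
```

In this example the ON-sets are pairwise disjoint and together cover all four assignments of (c1, c2), and the observability of d is identically zero even though a structural path from d to the output exists.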
Changing the ordering of the data inputs to the mux chain changes the order of the don't care sets generated for the control signals and hence changes the control signals. This can affect the final delay or area of the circuit but not the testability. Further research needs to be done on ordering algorithms for the muxes to synthesize faster and smaller circuits.
[...] $\ldots, O^k_{d_p}$, we state the following properties.
Transforming into Canonical Form
[...] $\ldots\, \neg O^k_{d_{p-2}}\, \neg O^k_{d_{p-1}})$
Property 1.1 expresses each output $F_k$ of the mux network in the Mux-FU cascade in terms of the data inputs and their observabilities. Properties 1.1 and 1.2 indicate that for a given set of control input conditions, the output cannot be a function of more than one data input. Property 1.3 states that for any value of the control signals, the output of the multiplexor network has to be equal to one of the data inputs. Properties 1.4 and 1.5 can be derived from Properties 1.2 and 1.3.
The canonical network for any output $F_k$ of the mux network is derived using the following lemma.
Lemma [...]
The transformation to the canonical form is done in the following steps. The data inputs which have a path through the mux network to the output $F_k$ are identified. The observabilities for each of these data inputs with respect to $F_k$ are computed. The canonical chain of muxes is constructed, as shown in Figure 3(b). The "1" input to each mux is one of the data inputs which had a path to the output and the control signal is the corresponding data observability. The transformation has the effect of re-expressing each output $F_k$ in the equivalent canonical form $F^{canonical}_k$.
The control signals to the data path are now the observabilities of data inputs. Hence, the DL block of the control logic is changed.
We give an outline of Algorithm 3 which uses Lemma 1 to transform the mux network with outputs $F_k$ into a functionally equivalent canonical mux network with outputs $F^{canonical}_k$. We designate the "0" and "1" inputs of the multiplexor $M_i$ as $l_i$ and $r_i$ respectively, the output as $m_i$ and the control input as $c_i$.
Algorithm 3 (Deriving Canonical Mux-Chains) [...]
Figure 3(b) illustrates the canonical mux network generated from the mux network in Figure 3(a). Note the change in the control signals to the data path, and hence the DL part of the control logic. The transformation into the canonical mux network prepares the RTL circuit for the final resynthesis phase. In this phase, the constant inputs are pruned and the control logic is minimized using don't cares derived from the data path, as discussed in the next section.
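The steps listed in this section (identify the data inputs with a path to the output, compute their observabilities, build the chain) can be sketched as follows. This is an illustrative sketch only and not the paper's Algorithm 3: observabilities are obtained here by brute-force enumeration of the control assignments, and feeding the last data input to the "0" input of the final mux is an assumption. The names `select`, `data_inputs`, `canonical_chain` and `chain_select` are hypothetical.

```python
from itertools import product


def select(net, assign):
    """Name of the data input routed to the output under `assign`
    (a dict mapping each control name to 0 or 1)."""
    while net[0] == "mux":
        net = net[3] if assign[net[1]] else net[2]
    return net[1]


def data_inputs(net):
    """Data inputs with a structural path to the output, left to right."""
    if net[0] == "in":
        return [net[1]]
    return data_inputs(net[2]) + data_inputs(net[3])


def canonical_chain(net, controls):
    """Return ([(observability ON-set, data input), ...], default input)."""
    space = [dict(zip(controls, bits))
             for bits in product((0, 1), repeat=len(controls))]
    names = list(dict.fromkeys(data_inputs(net)))          # deduplicated
    onset = {n: [a for a in space if select(net, a) == n] for n in names}
    *muxed, default = names       # assumption: last input needs no mux of its own
    return [(onset[n], n) for n in muxed], default


def chain_select(chain, default, assign):
    """Evaluate the chain: the first mux whose control is 1 supplies the output."""
    for on_set, name in chain:
        if assign in on_set:
            return name
    return default


net = ("mux", "c1",
       ("mux", "c2", ("in", "a"), ("in", "b")),
       ("mux", "c1", ("in", "d"), ("in", "c")))
chain, default = canonical_chain(net, ["c1", "c2"])
```

The chain for this network has one mux each for a, b and d, with c as the default. The control of the mux for d has an empty ON-set, which is the kind of constant control signal that the resynthesis phase later prunes.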
5 Exploiting Don't Cares Derived from the Data Path to Synthesize Irredundant Circuits
This section exploits the interaction between control logic and data path to obtain an optimized RTL structure which is fully testable at the gate level. We introduce an efficient technique to extract don't cares from the data path and use the don't cares to optimize the control logic. Resynthesis of the control logic also provides opportunities to further minimize the data path. This process eliminates all the redundancies in the control logic and all the false paths in the data path.
Minimizing Control Logic Using Data Path Don't Cares
We first address the issue of using don't cares derived from the data path to synthesize minimized and 100% testable control logic. In Figure 7(a), we show the control logic (DL) and the canonical multiplexor network (MC) of a typical RTL design. The DL block has been made irredundant by logic synthesis techniques. The MC block is also irredundant. However, cascading the two blocks together gives rise to redundancies in the control logic (as shown by "X" in Figure 7(a)). Let $f_i$ be the boolean function corresponding to the control signal $c_i$. In Figure 7(a), note that when $c_1$ is "1", the value of the control signal $c_2$ does not affect the output of MC, "z". Consequently, the on-set minterms of function $f_1$ are observability don't cares for function $f_2$. Similarly, the on-set minterms of the function $(f_1 + f_2)$ are the observability don't cares for $f_3$.
Optimizing control logic DL using the observability don't cares derived from the data path produces a minimized control logic as shown in Figure 7(b) . An important consequence of the optimization is that the cascaded DL-MC network shown in Figure 7 (b) is fully testable.
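A small sketch may make the don't-care argument concrete. It assumes a mux chain in which the first control signal that evaluates to 1 selects the output, and it represents each control function by its ON-set over the DL input space; the names `dont_cares`, `is_valid_cover` and `chain_output`, and the two-variable example, are hypothetical and not taken from the paper.

```python
from itertools import product

SPACE = list(product((0, 1), repeat=2))   # assignments to DL inputs (x1, x2)


def dont_cares(on_sets):
    """Don't-care set of f_i = union of the ON-sets of f_1 .. f_(i-1)."""
    result, seen = [], set()
    for f in on_sets:
        result.append(set(seen))
        seen |= f
    return result


def is_valid_cover(f_new, f, dc):
    """f_new may differ from f only on don't-care minterms."""
    return (f - dc) <= f_new <= (f | dc)


def chain_output(on_sets, data, default, assign):
    for f, d in zip(on_sets, data):
        if assign in f:
            return d
    return default


f1 = {(1, 0), (1, 1)}          # f1 = x1
f2 = {(0, 0)}                  # f2 = x1' x2'  (two literals)
dcs = dont_cares([f1, f2])     # DC(f2) is the ON-set of f1
f2_min = {(0, 0), (1, 0)}      # f2 simplified to x2' (one literal)
```

Here `f2_min` lies between `f2 - dc` and `f2 | dc`, so replacing `f2` by `f2_min` removes a literal from the control logic without changing the chain output for any assignment.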
Knowledge of the RTL structure facilitates easy extraction of the data path don't cares. Consider the canonical mux network as derived from the initial data path using Algorithm 1 and Algorithm 3. As mentioned before, the boolean function corresponding to control signal $c_i$ is called $f_i$. [...]
Minimizing Data Path by Eliminating False Paths
A false path in a data path is a path from a data input to the output of the data path which is never sensitized. In an RTL network, there may be false paths in the data path but no redundancies (an illustrative example is the RTL circuit of Figure 8(a)). Deriving the canonical multiplexor network implicitly eliminates some false paths. Subsequent control minimization using data path don't cares produces a constant control signal for each false path that still remains in the data path. Eliminating false paths leads to further optimization of the data path. Also, the cost of test pattern generation is reduced. Instead of explicitly identifying false paths and eliminating them as proposed in [18], we remove the false paths from the canonical network by identifying control signals which are constant. After don't care minimization of the control logic, if a control signal $c_i$ is reduced to constant "0", the path from the data input corresponding to the "1" input of $M_i$ is false and the data input can be removed along with mux $M_i$. If $c_i$ reduces to "1", all the paths from the data inputs which pass through the "0" input of $M_i$ are false and all muxes $M_j$, $j \ge i$, can be removed. This process ensures an efficient elimination of all false paths in the data path.
We give a simple example of how false paths in the data path are eliminated by our algorithms. Though the RTL structure shown in Figure 8 [...]
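The pruning rule for constant control signals can be sketched as below. This is a minimal illustration under assumptions, not the paper's Algorithm 4: the chain is a list of (control, data input) pairs ordered from the output side, a control is either the constant 0 or 1 or a signal name, and the names `prune` and `chain_output` are hypothetical.

```python
from itertools import product


def prune(chain, default):
    """Remove muxes whose control signal has been reduced to a constant."""
    kept = []
    for ctrl, data in chain:
        if ctrl == 0:
            continue               # path from this data input is false: drop the mux
        if ctrl == 1:
            return kept, data      # everything behind the "0" input is unreachable
        kept.append((ctrl, data))
    return kept, default


def chain_output(chain, default, env):
    """The first mux whose control evaluates to 1 supplies the output."""
    for ctrl, data in chain:
        value = ctrl if ctrl in (0, 1) else env[ctrl]
        if value:
            return data
    return default


chain = [("O_a", "a"), (0, "d"), ("O_b", "b"), (1, "c"), ("O_e", "e")]
pruned, new_default = prune(chain, "g")
```

In this example the mux for d disappears because its control is constant 0, and the constant 1 on the mux for c makes c the new default, so the inputs e and g and their muxes are removed.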
Algorithm to Eliminate Redundant Faults and False Paths
We present an algorithm which exploits the interaction between control and data path to synthesize an optimized RTL structure using the two techniques outlined in Sections 5.1 and 5.2. The multiplexors in the target architecture are in a canonical mux-chain form cascaded with the FU network as shown in Figure 3(b). The algorithm extracts don't cares from the canonical mux-chain to optimize the decoding logic (DL). Subsequently, it also optimizes the data path (DP) by implicitly eliminating the false paths. The EL and CL blocks are optimized independently of the other blocks. It may be observed that don't cares from the EL and CL blocks may be used to further minimize the design. This is the subject of ongoing research. The notation used for the mux inputs, control and output remains the same as in Algorithm 3. The structure of the mux network produced by Algorithm 1 is shown in Figure 9. Algorithm 4 optimizes the DP and prunes the constant inputs to the muxes to produce the mux network of Figure [...].
The effect of identifying data path don't cares and minimizing the control logic DL of the benchmark Fancy.b is shown in Figure 4(a). Note the reduction in the size of DL after the minimization. Minimizing DL using data path don't cares identifies control signal $O^2_d$ as the constant 0. This indicates that the path from data input d is false, allowing for removal of the corresponding mux, and thereby further minimizing the data path. The mux corresponding to the constant signal e is also pruned. The optimized control (DL) and data path of the benchmark Fancy.b are shown in Figure 4(a).
Need for Test Point Insertion
Instead of minimizing the whole control logic (DL, CL and EL) using the data path don't cares, Algorithm 4 is used to minimize the DL block only. This may be necessitated because the EL block typically contains comparators and is not amenable to two level or multilevel minimization. Consequently, there may still be redundant faults when EL and DL are cascaded. If the complete RTL design is not fully testable after applying our optimization techniques, we insert test points by adding scannable registers at the output of EL which are used only in the test mode. As can be seen from the experimental results in the next section, we are not required to insert test points for most of the designs.
Transformation for Performance: Balancing the Multiplexor Chain
The RTL technique presented in the paper generates multiplexor chains before extracting and using data path don't cares for control logic minimization. However, balancing the mux chains may reduce the critical path and improve the performance of the final RTL design. The balanced mux tree corresponding to Figure 7 is shown in Figure 11. Assuming the muxes have a delay of two units, it can be easily seen that the balanced mux tree in Figure 11 has a smaller delay than the corresponding multiplexor chain in Figure 7. We give a simple recursive algorithm for balancing. Given a multiplexor chain, the algorithm breaks the chain into two equal chains. It balances each chain separately and then introduces a new multiplexor with the two balanced chains as its two inputs. We denote by mux_chain an array of muxes of the canonical multiplexor chain, where mux_chain[i] refers to the $i$-th mux. Similarly, mux_tree is the resultant balanced tree of muxes. [...]
The mux balancing can be done before optimizing with don't cares. Instead of adding don't cares to the mux-chain, appropriate don't cares can be added to the control signals of the balanced mux tree. As in Theorem 1, we can show that the resultant DL cascaded with the multiplexor tree (as opposed to the multiplexor chain) is irredundant. Transforming a mux chain into a balanced mux tree may change the control logic, introducing a new critical path through the control and data path. Consequently, the performance of the RTL design may actually degrade. We are currently investigating the issue of balancing the mux chains for improved performance.
Different scheduling and resource allocation strategies produce distinct RTL specifications which map into a specific technology with varying degrees of efficiency in terms of area, speed and testability.
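The recursive balancing scheme described for mux chains (split the chain in two, balance each half, join the halves with a new mux) can be sketched as follows. This is a sketch under assumptions rather than the paper's balancing algorithm: the control of each new mux is taken to be the OR of the controls in the first half of the chain, control functions are given as ON-sets of minterm indices, and the names `balance`, `tree_output`, `chain_output` and `depth` are hypothetical. Only mux delay is counted; any change in control logic delay is ignored.

```python
MUX_DELAY = 2   # delay units per mux, as assumed in the text


def balance(leaves):
    """leaves: [(on_set, data), ..., (None, default)] in chain priority order.
    Returns ("in", data) or ("mux", select_on_set, in0, in1)."""
    if len(leaves) == 1:
        return ("in", leaves[0][1])
    half = len(leaves) // 2
    first, rest = leaves[:half], leaves[half:]
    # assumption: the new mux picks the first half when any of its controls is 1
    select = set().union(*(f for f, _ in first))
    return ("mux", select, balance(rest), balance(first))


def tree_output(node, minterm):
    while node[0] == "mux":
        node = node[3] if minterm in node[1] else node[2]
    return node[1]


def chain_output(leaves, minterm):
    for on_set, data in leaves[:-1]:
        if minterm in on_set:
            return data
    return leaves[-1][1]


def depth(node):
    return 0 if node[0] == "in" else 1 + max(depth(node[2]), depth(node[3]))


leaves = [({0, 1}, "d1"), ({1, 2}, "d2"), ({2, 3, 4}, "d3"), (None, "d4")]
tree = balance(leaves)
chain_delay = MUX_DELAY * (len(leaves) - 1)   # three muxes in series
tree_delay = MUX_DELAY * depth(tree)          # two mux levels
```

For this three-mux chain the tree uses the same number of muxes but only two levels, so the mux delay drops from 6 to 4 units while the selected data input is unchanged for every minterm.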
Since we want to consider general designs consisting of control as well as data operations, as opposed to data-intensive designs like DSP applications, we examined two scheduling strategies applicable to control-dominated designs in greater detail: the AFAP schedule [4] and an alternative MINC schedule [5].
The results for the benchmarks are shown in Tables 1, 2, 3, 4, 5 and 7. For the GCD, Barcode and FalsePath benchmarks, we applied both scheduling strategies, AFAP and MINC, to derive two distinct initial RT-Level specifications. The columns AFAP and MINC tabulate results for the RTL specifications produced by the AFAP schedule and the MINC schedule respectively. MINC-ndc refers to the MINC schedule with no data chaining. Similarly, AFAP-dc and AFAP-ndc refer to the AFAP schedule with and without data chaining. Note that the AFAP schedule has been obtained by us and not by using the AFAP scheduler in the IBM system. For each schedule, we report parameters of interest of the initial design. We report the same parameters after applying our RTL optimization tool WONDER. To demonstrate the effectiveness of optimization at the RT-level, we also report results using SIS, a logic level optimization tool from UC-Berkeley [23].
The RT-Level parameters reported are the number of register bits, muxes, FUs and control states. The transformations always reduce the number of muxes significantly, while the number of FUs is either reduced or remains the same. The number of control states always remains the same, except for the design derived from the MINC schedule of FalsePath [5].
To observe the effect of the transformations at the logic level, we further synthesized each benchmark to a standard cell layout using OASIS [24], with the scalable SCMOS 2.0 micron library supplied with the Logic Synthesis'91 Workshop benchmarks [25]. Parameters that measure the area, like the number of combinational gates, literals, register bits, transistor pairs and layout area, are reported. To demonstrate that optimization at the RT-Level is more effective than optimization at the logic level, we optimize the initial designs at the logic level using SIS. The standard script of SIS was used for logic minimization (LM), followed by the red_removal command to remove redundancies (RR) to the extent possible.
The experimental results show that in almost all cases the final design produced by RTL optimization has significantly less area and delay than both the initial design and the optimized design produced by SIS. A possible reason for the better performance of WONDER is that transformations involving RTL structures like FUs and muxes are possible at the RT-level, not at the logic level. Logic level tools do not recognize FUs so as to be able to propagate and merge them. However, the effectiveness of our RT-level techniques even when the number of FUs in the initial and final RTL descriptions is the same is demonstrated by the results on BarCode, QRS and the AFAP version of the FalsePath benchmark. Though SIS extracts and uses don't cares while optimizing the entire design, WONDER produces better results because it utilizes the RT-level structure to extract and use the exact set of data path don't cares.
The effect of the RT-Level transformations on the logic level testability of the designs is shown in the rows stuck-at faults, untested faults, and fault coverage. The number of stuck-at faults which are not testable is reported as untested faults. These include faults proven redundant or aborted after the nominal default backtrack limit of 100 backtracks, using the PODEM-based test generation algorithm in OASIS. Fault coverage gives the percentage of total faults which are testable in the scan mode. The results show that while none of the initially scheduled designs are fully testable, all final designs produced by WONDER, except for QRS, achieved 100% fault coverage without degradation in area or critical path delay. Increasing the test generation backtrack limit for initial designs only increased the CPU costs without significantly improving the fault coverage. Also, while RTL optimization produces circuits which are 100% testable, circuits produced by SIS are not.
It is also demonstrated in Table 6 that as the bit width of the data increases, the CPU time required for logic optimization increases substantially, while the time for RT-Level optimization remains independent of the data bit width. In fact, for the 16 bit example, the standard script of SIS did not complete in 3 days. We used a modified script (footnote 1) to complete the optimization process. Table 6 also demonstrates that as the circuit size increases, the gap between the optimization achieved at the RT-Level and at the logic level widens.
The experimental results further demonstrate that while a scheduling strategy does affect the initial standard cell layout area, critical path delay and logic level testability, the proposed RT-Level transformations reduce this variance significantly while consistently achieving 100% fault coverage at the logic level, regardless of the initial scheduling algorithm used.
QRS: A Chip for Biomedical Applications
We briefly discuss the RTL optimization of the QRS chip [26], which is significantly larger than the other benchmarks reported [22]. QRS is a real-life design for a specific biomedical application, with a large number of control as well as arithmetic operations. The initial RTL description satisfies the constraint of three adders and two counters. The statistics of the initial description are given in Table 7 under column Init. It is comparable in size (transistors) to the circuit synthesized by CALLAS from VHDL [22]. The initial circuit cannot be optimized at the logic level using the standard script of SIS due to prohibitive memory/CPU requirements. The optimization achieved using a modified script of SIS, followed by redundancy removal, is reported in column SIS. Note that the optimized circuit has many untestable faults. Next, we applied the RT-level optimization technique to the initial circuit.
Table 7: QRS (16 bit case) - results of logic-level and RT-level transformations

Propagating the FUs in QRS leads to a substantial increase in the number of FUs, a possible outcome of Algorithm 1 anticipated in Section 3. Hence, we did not transform the initial design to obtain a cascade of MUX-FU networks, as required by our algorithms. Instead, we identified mux subnetworks which had no node with multiple fanouts. Subsequently, the mux subnetworks were used for the derivation of canonical mux chains and for don't care minimization, performed by Algorithms 3 and 4 respectively. The parameters of the resultant circuit are reported in the column WONDER in Table 7. While the number of functional units and control states remains the same as in the initial design, the number of muxes is reduced significantly. The marginal increase in the number of register bits is due to the extra test points that were added to the outputs of the EL logic to improve testability. The logic-level parameters reveal a significant level of optimization achieved by our technique. For instance, the number of literals has been reduced from 12245 to 9306, a reduction of 24%. The delay of the circuit remains almost the same. The testability of the circuit also improved significantly: while the initial circuit had 417 untested faults, the final circuit has only 12 untested faults.
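The reductions quoted for QRS follow directly from the reported counts. A short Python check is sketched below; the helper name `percent_reduction` is introduced here only for illustration.

```python
def percent_reduction(before: int, after: int) -> float:
    """Relative reduction from `before` to `after`, in percent."""
    if before <= 0:
        raise ValueError("before must be positive")
    return 100.0 * (before - after) / before


# Counts quoted in the text for the QRS circuit.
literals = percent_reduction(12245, 9306)   # literal count, Init vs WONDER
untested = percent_reduction(417, 12)       # untested faults, Init vs WONDER
print(f"literals: {literals:.1f}%  untested faults: {untested:.1f}%")
```

The literal figure rounds to the 24% stated above.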
To make QRS 100% testable, we added extra test points at the outputs of the three adders. This effectively broke the long data chains and made most of the aborted faults testable. The inputs to the logic blocks were made controllable in the full scan mode by propagating the data path registers to their fanouts. The resultant circuit is 100% testable, and its parameters are reported in column WONDER+ of Table 7.
Conclusions
We successfully addressed the problem of transforming a given RTL specification into a functionally equivalent, minimized and 100% testable design. We maintained the RT-Level design hierarchy while performing RT-Level transformations of the initially specified data path, followed by resynthesis of the control logic using data path don't cares. The transformations use information about the RT-Level structure, making several computations, like observability don't cares, very simple. Notably, upon resynthesis of the control, we implicitly removed all redundancies in the control as well as all false paths in the data path itself. Experiments with several RTL designs demonstrate that the transformations significantly reduce layout area and delay, while consistently achieving 100% fault coverage at the gate level. The experimental results also demonstrate that optimizing designs at the RT-level is more effective than the traditional optimization techniques performed at the logic level.
The transformations and resynthesis process described in this paper were aimed towards deriving a fully testable design. Some of the transformations, like propagating functional units and balancing multiplexor chains, may adversely affect the area and delay of the final design. We are currently investigating various RT-Level transformations for optimizing area and performance while preserving testability.

Proof If there is any false path through the multiplexor chain from the data input $d_i$, $i \neq p$, we show that $c_i$ is stuck-at-0 and hence is removed by step 5 of Algorithm 4. There are two possibilities. It is possible that $c_i$ is stuck-at-0 even before adding don't cares, in which case it would be removed in step 5 of Algorithm 4. Otherwise, it would mean that whenever $f_i$ was true, some $f_j$, $j < i$, was true. However, $f_i = O^k_{d_i}$ and $f_j = O^k_{d_j}$. But from Property 1.2, we see that if $i \neq j$, $f_i$ and $f_j$ cannot be "1" at the same time. Hence we have a contradiction. If the false path was from data input $d_p$, it can be shown that either $c_{p-1}$ would be stuck-at-1, or some $c_j$, $j < (p-1)$, would be stuck-at-1. In both cases, $d_p$ and the corresponding mux would be removed in step 5 of Algorithm 4. We discuss the proof for the case when none of the $c_j$, $j < (p-1)$, reduces to "1" after step 4 of Algorithm 4. In this case, whenever $f_{p-1}$ is "0", some $f_j$, $j < (p-1)$, is "1", because the path from $d_p$ to the output of the mux chain is false. Hence, $\bar{f}_{p-1}$ is a don't care for $f_{p-1}$. Thus, after don't care minimization in step 4, $f_{p-1}$ reduces to "1", i.e. $c_{p-1}$ is stuck-at-1. $\Box$

Lemma 4 After applying Algorithm 4, there are no redundancies in the combined decode logic (DL) and multiplexor chain (MC) if they are isolated from the other modules. Moreover, if the inputs to the FUs are multi-fault testable and independently controllable and the FUs are irredundant, then the cascaded blocks of DL, MC and FU are irredundant.
Proof (1) We first show that the combined DL and MC block is irredundant. In part (a), we show that the DL block has no redundant fault. We next show in part (b) that the MC block has no redundant fault. We show that if there is a redundant fault in DL when cascaded with the mux chain (MC), then DL had not been made prime and irredundant with respect to the don't cares as claimed in step 4 of Algorithm 4.
(a) Suppose there is some redundant fault inside DL after connecting it back to the MC. Let the fault be at node X. We discuss the case X stuck-at-1 first. The argument for X stuck-at-0 follows similarly. Let $c_i$ be the control signal with the smallest index in the transitive fanout of the node X. We assume for the moment that neither $f_0$ nor $\bar{f}_0$ is in the transitive fanout of X, where $f_0$ is the boolean function corresponding to $c_0$ in Figure 10. The fault D at X can propagate to $c_i$ because it has been claimed in step 4 of Algorithm 4 that DL is irredundant. However, D cannot propagate to the output $F_k$ of the multiplexor chain. This implies that each minterm that sets X to "0" and propagates D to $c_i$ also sets some $f_j$, $j < i$, to "1". Hence these minterms are in the onset of some control signal $c_j$, $j < i$. Consequently, from Section 5.1 it can be seen that these minterms are don't cares for $c_i$ and hence for the corresponding boolean function $f_i$. Let $f_X$ be the function corresponding to node X. The function $f_i$ can be written as

$$f_i = f_X \cdot f_i|_{f_X=1} + \bar{f}_X \cdot f_i|_{f_X=0}$$

and the corresponding observability function for X at $c_i$ as

$$f_i|_{f_X=1} \cdot \overline{f_i|_{f_X=0}} + \overline{f_i|_{f_X=1}} \cdot f_i|_{f_X=0}$$

The minterms that set X to "0" and also propagate it to $c_i$ are given by

$$\left( f_i|_{f_X=1} \cdot \overline{f_i|_{f_X=0}} + \overline{f_i|_{f_X=1}} \cdot f_i|_{f_X=0} \right) \cdot \bar{f}_X$$

We have found that these minterms are don't cares for $f_i$.
We draw the Karnaugh map of $f_i$ along with its don't care set in Figure 12.
From the Karnaugh map, we see that $f_i|_{f_X=1}$ is a minimal cover for $f_i$ with the don't cares added. If X were to be present in DL after step 4 of Algorithm 4, then the cover derived for $c_i$, and hence for DL, was not prime and irredundant. This is a contradiction. Hence, the assumption that there was a redundant fault at X is false. The proof for faults in the cones of $f_0$ and $\bar{f}_0$ is similar and is hence not discussed.
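The cofactor argument of part (a) can be checked by exhaustive enumeration on a small example. The Python sketch below is only an illustration: the functions `f_X`, `g` and `f_j` and the three inputs `a`, `b`, `c` are hypothetical and do not come from the benchmarks, and `g` is an assumed way of writing $f_i$ with the value of node X exposed as an explicit argument.

```python
from itertools import product

# Hypothetical three-input example; inputs are 0/1 integers.
def f_X(a, b, c):
    """Function computed at the internal node X."""
    return a | b

def g(a, b, c, x):
    """f_i with the value of node X passed in explicitly as x."""
    return x & c

def f_i(a, b, c):
    return g(a, b, c, f_X(a, b, c))

def f_j(a, b, c):
    """A higher-priority select function (j < i)."""
    return (1 - a) & (1 - b)

minterms = list(product((0, 1), repeat=3))

# Shannon expansion of f_i about f_X.
shannon_ok = all(
    f_i(*m) == (f_X(*m) & g(*m, 1)) | ((1 - f_X(*m)) & g(*m, 0))
    for m in minterms
)

# Minterms that set X to 0 and propagate the fault effect to c_i:
# (cofactor difference) AND (complement of f_X).
sa1_tests = [m for m in minterms
             if (g(*m, 1) ^ g(*m, 0)) and not f_X(*m)]

# Don't cares for f_i: the onset of the higher-priority select f_j.
dont_cares = {m for m in minterms if f_j(*m)}

# Every test for X stuck-at-1 is a don't care, so the fault is masked.
masked = all(m in dont_cares for m in sa1_tests)

# The cofactor with X = 1 agrees with f_i on the whole care set,
# so f_i as written was not a minimal cover.
cofactor_covers = all(g(*m, 1) == f_i(*m)
                      for m in minterms if m not in dont_cares)

print(shannon_ok, sa1_tests, masked, cofactor_covers)
```

In this example the only test for X stuck-at-1 lies in the onset of `f_j`, and the simpler cover `c` replaces `(a | b) & c` on the care set, which mirrors the contradiction used in the proof.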
(b) To prove that there are no redundant faults inside the multiplexor chain (MC), we refer to Figure 13, which shows a typical multiplexor in the multiplexor chain of Figure 10 produced by Algorithm 4. The nodes I, $c_j$ and $d_j$ are independent and controllable, where I is either the output $m_{j+1}$ of multiplexor $M_{j+1}$ or a data input. It is not possible that $f_j$ is a constant, because step 5 of Algorithm 4 removes constant outputs of the DL block. If node 1 is redundant, it can be shown that $f_j$ would have been reduced to a constant after step 4 of Algorithm 4. Let us assume that node 1 is stuck-at-1.
Hence all the minterms that propagate a "0" to $c_j$ make some $f_i$, $i < j$, "1". Hence $\bar{f}_j$ is a don't care for $f_j$, i.e. $f_j$ can be reduced to "1". The corresponding $c_j$ must have been removed in step 5 of Algorithm 4. Hence redundant nodes like 1 could not exist after application of Algorithm 4. Similarly, it can be shown that the faults at nodes 2, 3, 4 and 5 are not redundant.
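The role of the higher-priority onsets as don't cares can be illustrated by brute-force enumeration over a small priority mux chain. The Python sketch below is not Algorithm 4 itself: the helper `classify_controls` and the select functions are hypothetical, and the check only reports which controls would collapse to a constant once the onsets of all $f_j$, $j < i$, are treated as don't cares.

```python
from itertools import product

def classify_controls(select_fns, n_inputs):
    """For each select function f_i of a priority mux chain, report
    whether it collapses to a constant on its care set:
    0, 1, or None when it stays non-constant.

    The care set of f_i is every minterm on which no f_j, j < i, is 1.
    An empty care set is reported as 0.
    """
    minterms = list(product((0, 1), repeat=n_inputs))
    result = []
    for i, f in enumerate(select_fns):
        care = [m for m in minterms
                if not any(fj(*m) for fj in select_fns[:i])]
        values = {f(*m) for m in care}
        if values <= {0}:
            result.append(0)       # control behaves as stuck-at-0
        elif values == {1}:
            result.append(1)       # control behaves as stuck-at-1
        else:
            result.append(None)
    return result


# Hypothetical chain with three controls over inputs (s1, s0).
# Whenever the last select is 0, an earlier select is 1.
chain_a = [lambda s1, s0: s1 & s0,
           lambda s1, s0: 1 - s1,
           lambda s1, s0: s1]

# Hypothetical chain where the second select is covered by the first.
chain_b = [lambda s1, s0: s1,
           lambda s1, s0: s1 & s0]

print(classify_controls(chain_a, 2))
print(classify_controls(chain_b, 2))
```

A control reported as 1 corresponds to the stuck-at-1 case discussed above, where the data input below it in the chain can never reach the output; a control reported as 0 corresponds to a data input whose own path is false.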
