Abstract: Test data propagation through modules and test vector translation are two major challenges encountered in hierarchical testing. We propose a new synthesis-for-test approach in which multiplexers are embedded in the behavioral models of the various modules constituting a hierarchical system. This approach can also be applied to system-on-a-chip designs in which synthesizable models are available for the embedded cores. The embedded multiplexers provide complete single-cycle transparency, thereby offering a straightforward yet effective solution to the problems of test data propagation and test vector translation. In order to determine module I/O bitwidths for single-cycle transparency, a global analysis is carried out using a graph-theoretic framework and an optimization method based on integer linear programming. Case studies using high-level synthesis benchmarks and an industrial-strength benchmark show that synthesis for transparency introduces very little area and performance overhead.
I. INTRODUCTION

Hierarchical test methodologies handle large systems in a divide-and-conquer fashion by relying on precomputed test patterns for each module, which are subsequently translated and applied from the system's primary inputs [16]-[19], [23]. Hierarchical testing techniques offer lower test generation costs and increased test reuse. Tests developed for the individual modules can be reused and there is no need for gate-level test generation for the entire system. However, these techniques must provide mechanisms for justifying the precomputed test vectors for a module from the system inputs to the module inputs. Similarly, the test responses must also be propagated to observable system outputs. This problem is becoming increasingly important with the emergence of core-based system-on-a-chip (SOC) designs [14], [29]. Since the embedded cores in an SOC are not directly accessible from the chip I/Os, the system must devise methods for providing test access.
A number of hierarchical testing techniques have been presented in the literature for justifying test patterns and propagating test responses. Early work in this area was based on the use of F-paths [12] and I-paths [1], which utilized existing propagation paths between module I/Os and chip I/Os. In [25], a technique based on Boolean difference was presented for test generation and test translation in modular combinational circuits. More recent work has focused on design for testability (DFT) methods. For example, the FScan-BScan method utilizes a combination of full scan and boundary scan [9]. In this method, every core (or module) is made testable by full scan and system-level testability is achieved by isolating all modules using boundary scan. This provides full controllability and observability for every module, and eliminates the need for test translation. However, FScan-BScan introduces high area and delay overheads, and requires enormous test application time due to serial test access. In the related FScan-TBus method, full scan is combined with a system-level test bus [9]. However, this method also suffers from area and delay overheads; moreover, the interconnect testing problem cannot be directly addressed using this technique. DFT is also used in [4] to test macro blocks within a circuit using a variety of test techniques. It relies on full, partial, and boundary scan, and uses ad hoc module transparency to reduce test overhead. To reduce test generation time, a hierarchical test generation strategy was presented in [19]. This approach is based on the notion of test plans, which provide control sequences to propagate test data through modules.
Manuscript received July 11, 2001; revised February 11, 2002. This work was supported in part by the National Science Foundation under Grant CCR-9875324. A preliminary and abridged version of this paper appeared in Proc. Int
Recent work on hierarchical testability analysis (HTA) has been directed at the efficient generation of such test plans [16]-[19], [26]. The key idea in these methods is to identify control sequences that allow test data to be transparently propagated through modules. DFT enhancements are usually necessary to augment the amount of transparency that can be achieved using HTA. In [14], a DFT approach was presented to make cores in an SOC transparent by extracting their test control/data flow information. While such an approach provides parallel test access and allows any test sequence at the inputs of a module to be propagated to the outputs, it suffers from two main drawbacks: 1) high test application time due to the transparency latency (multicycle transparency) required to propagate test data through a module, and 2) the need to extract the test control/data flow in order to provide transparency.
Testability-driven behavioral synthesis can produce area-efficient designs with low test-related overhead [23] . However, synthesis for testability approaches have rarely addressed hierarchical or SOC test [5] . These methods have focused primarily on selection of registers for BIST, selection of scan flip-flops, reduction in the number of loops in the datapath, and hierarchical testability enhancement of RTL circuits obtained through behavioral synthesis [10] , [13] . In order to provide hierarchical test capabilities, they typically require behavioral descriptions in the form of control-data flow graphs [5] .
Instead of providing complete controllability and observability to core I/Os, more recent work has focused on providing controllability and observability to SOC cores on an "as needed basis" [24] . This is achieved through transparency analysis using nondeterministic finite-state automata, and transparency enhancement through symbolic justification and propagation analysis based on a regular expression framework. However, this approach requires test sequence composition from symbolic tests.
Transparency of modules can be trivially achieved by providing a direct path from each module input to output with the help of multiplexers. However, the area/delay overhead of such a solution is usually prohibitive. In particular, the interconnect required for routing the additional bypass lines can be excessive. In this paper, we present a synthesis-for-transparency approach that alleviates the above problems. The synthesis method is based on the insertion of multiplexers in the behavioral models of the modules making up the system. These models can describe both combinational and sequential circuits. A behavioral synthesis tool can then be used to derive transparent modules with embedded multiplexers that ensure complete reuse of module-level tests and easy system-level test application. The proposed synthesis for testability methodology is also applicable to SOC designs composed of soft cores for which behavioral, synthesizable models are readily available.
A related approach for SOC testing is based on adding a bypass mode using multiplexers and registers [20]. However, this requires packetization of test data, and serial-to-parallel and parallel-to-serial bit-matching circuits; hence, it does not provide single-cycle transparency. In the proposed method for single-cycle transparency, every module in the system is synthesized to operate in two modes: a normal functional mode, and a transparent mode in which all inputs are passed unchanged to the outputs. An external control input is used to select the appropriate mode of operation for a module. In order to test a given module, it is set to the functional mode while all other modules are set to the transparent mode. To allow the complete flow of test data through a module in its transparent mode, the bitwidths of its I/O ports may need to be increased. While full transparency can be ensured by making the input and output bitwidths of the module equal, this is often not necessary. The test-related area overhead can be reduced if the bitwidth of each module is determined by analyzing the test propagation requirements of the other modules in the system. We formulate this problem using the notion of test graphs and determine the bitwidths by solving an integer linear programming model. While the insertion of embedded multiplexers during synthesis is trivial, bitwidth sizing for the various modules in the system is nontrivial.
The proposed approach offers a number of unique advantages. It provides single-cycle transparency without the addition of registers for bypass or the extraction of control/data flow information. Its conceptual simplicity makes it easy to implement and integrate in the design flow. It does not require any test composition or sophisticated test scheduling algorithms. Precomputed test sets for every module can be readily applied without requiring any transparency latency or test vector translation. These test sets may contain functional vectors, scan vectors, or ordered test sequences for nonscan sequential circuits. The test methodology provides parallel test access to the embedded modules, thereby facilitating at-speed test and increasing the coverage of nonmodeled and performance-related defects. The synthesis of explicit transparent paths obviates the need for complex data decoding in test generation for microprocessor circuits. Finally, interconnect testing can be carried out by simply setting all the modules to the transparent mode.
We consider two representative SOC designs as case studies. First, we present experimental results on applying the synthesis-for-test methodology to a nontrivial example system constructed by stitching together several high-level synthesis benchmarks [6], [22]. We show that embedded multiplexers offer a more efficient solution to hierarchical test than the explicit insertion of multiplexers at the gate level. We begin by using Synopsys Design Compiler to synthesize each of the benchmarks to be transparent and determine the impact of transparency on area and delay. We then make the overall system transparent by reformulating the transparency requirements for the individual modules. Once again, we determine the effect of the synthesis procedure on the system area and delay. Our results demonstrate that transparency can be achieved with negligible impact on system area and performance. For the overall system, the area increase was 3.2% and the increase in delay (measured by the clock period) was only 1.6%.
We also apply the proposed synthesis approach to two VHDL modules in the LEON core [30] . The LEON core is a hierarchical SPARC-compatible processor developed by the European Space Agency for future space missions. It is available as a highly configurable, synthesizable VHDL model. We show that the two synthesizable modules can be made transparent with no delay overhead and area overhead of no more than 5.2%.
The paper is organized as follows. In Section II, we present the basic concept of transparency for individual modules using embedded multiplexers, and we explain how transparency can be achieved using embedded multiplexers for combinational circuits and for finite-state machines. In Section III, we present our example system composed of several high-level modules and describe the proposed hierarchical test method for this system. We also investigate the problem of determining the bitwidths required for the modules in order to achieve single-cycle transparency. We present experimental results in Section IV.
II. TRANSPARENCY USING EMBEDDED MULTIPLEXERS
Transparency can be achieved by embedding multiplexers in the behavioral models described using a hardware description language such as Verilog or VHDL. For example, consider the Verilog model of a combinational module shown in Fig. 1(a) .
The module has one 4-bit input port and one 2-bit output port. A transparent behavioral model for the module is shown in Fig. 1(b). An additional control input switches the module from the normal mode (control input at 0) to a pass-through, transparent mode (control input at 1). An additional 2-bit output port is added to ensure complete transparency of the module. Alternatively, the bitwidth of an existing port can be increased to provide single-cycle transparency. In general, the bus widths of input ports may also have to be expanded to provide transparent access to other modules in the system. In Section III, we describe how the overhead of additional ports and associated wiring/interconnect area can be minimized by analyzing the system-level hierarchical testing needs.
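The mode selection described above can be mimicked in software. The following Python sketch is only a behavioral illustration and is not the Verilog model of Fig. 1: the function `functional_logic`, the names `t`, `out`, and `out_extra`, and the encoding of the control input (0 for normal mode, 1 for transparent mode) are assumptions made for this example.

```python
from typing import Tuple


def functional_logic(data_in: int) -> int:
    """Hypothetical 4-bit to 2-bit function standing in for the normal behavior."""
    return ((data_in >> 2) ^ data_in) & 0b11


def transparent_module(data_in: int, t: int) -> Tuple[int, int]:
    """Return (out, out_extra) for a 4-bit input.

    t = 0: normal mode, out carries the functional result and out_extra is 0.
    t = 1: transparent mode, the four input bits appear unchanged on the two
           2-bit output ports (low bits on out, high bits on out_extra).
    """
    if not 0 <= data_in <= 0b1111:
        raise ValueError("data_in must fit in 4 bits")
    if t == 0:
        return functional_logic(data_in), 0
    return data_in & 0b11, (data_in >> 2) & 0b11


def recombine(out: int, out_extra: int) -> int:
    """Reassemble the 4-bit value observed on the two output ports."""
    return (out_extra << 2) | out
```

In transparent mode, every 4-bit pattern applied at the input can be read back from the two output ports in the same evaluation, which is the software analogue of single-cycle transparency.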
Multiplexers can similarly be embedded in the behavioral models of sequential circuits. Fig. 2(a) shows the VHDL model for a finite-state machine (FSM). In order to make it transparent, a multiplexer is embedded in the behavioral model, as shown in Fig. 2(b). The FSM operates in the normal functional mode when the control input is 0. However, when the control input is 1, it works in a transparent mode and "out" is connected directly to "in." Therefore, it now operates as a pass-through combinational circuit. The state transitions of the FSM are not specified for the transparent mode in Fig. 2(b). They can either be interpreted as don't-cares or the clock to the FSM can be disabled by gating it with the control input to save power during testing. Since the next-state functions are not affected by the embedded multiplexer, synthesis tools can be expected to provide efficient implementations of transparent FSMs. This approach is similar to the design of testable FSMs based on multiplexer embedding in flip-flops [21].
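A small Python sketch can illustrate the two modes of a transparent FSM. It is not the VHDL of Fig. 2: the class name, the modulo-4 counter used as the next-state function, and the choice to hold the state in transparent mode (modeling the gated-clock option rather than the don't-care option) are all assumptions made for illustration.

```python
class TransparentFSM:
    """Toy FSM with an embedded output multiplexer (illustrative only)."""

    def __init__(self) -> None:
        self.state = 0  # reset state

    def step(self, data_in: int, t: int) -> int:
        """Apply one clock cycle and return the output.

        t = 0: functional mode; the state advances and is driven on the output.
        t = 1: transparent mode; the output equals the input and the clock is
               treated as gated, so the state is held.
        """
        if t == 1:
            return data_in
        # Hypothetical next-state function: a modulo-4 counter enabled by
        # bit 0 of the input.
        self.state = (self.state + (data_in & 1)) % 4
        return self.state
```

Because the multiplexer sits only on the output, the next-state logic in functional mode is the same as it would be without the transparent branch.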
In high-performance designs, multiplexer embedding should not lead to any performance degradation. For such designs, we first synthesize the high-level modules without embedding any multiplexers. This allows us to determine the critical paths in the system, and the (critical) primary outputs on which these critical paths terminate. Once this is done, we partition the set of primary outputs of each module into the set of critical outputs and the set of noncritical outputs. Multiplexers are then embedded only at the noncritical outputs in the high-level behavioral descriptions of these modules. Unless there is significant retiming during synthesis, multiplexer embedding introduces very little change in the internal latch-to-latch delays within the modules. While this approach does not guarantee that the delay of the synthesized transparent circuit will not exceed that of the original circuit, it increases the likelihood that the performance will not be significantly affected, especially if the synthesis script is run with higher weight assigned to timing optimization. (Most synthesis tools allow the user to assign relative weights to timing and area goals.) A consequence of selective multiplexer embedding is that some amount of test translation is necessary at the outputs of a module that are not fed by an implicit bypass path from the module inputs. The critical outputs cannot be used for transparent test data propagation.
III. SYSTEM-LEVEL TEST STRATEGY
In this section, we describe the hierarchical test methodology using a nontrivial example system that has two 32-bit input ports and two 32-bit output ports and is composed hierarchically of several synthesizable modules (Fig. 3). The example was constructed using four high-level benchmark circuits (GCD, Barcode, Kalman, and am2910) [22] and a 32-bit combinational multiplier. Each benchmark is a sequential circuit with clock and reset inputs, which are not shown explicitly in Fig. 3. In order to make the example nontrivial and realistic, we introduced a feedback loop and reconvergent fanout, and used bus lines of unequal widths. We also introduced bus truncation and fanout at several places in the system.
In order to apply the proposed test methodology to a hierarchical system, we first construct a weighted directed system graph whose vertices are the synthesizable modules in the system and whose edges represent functional interconnections between the modules. The weight of an edge denotes the total width of the buses (including all ports) connecting the module at its tail to the module at its head. The system graph for the example system is shown in Fig. 4(a). Additional vertices denote fanout branches in the system. The source and sink vertices in the graph represent system-level primary inputs and primary outputs, respectively. Note that if selective multiplexer embedding is employed, the module outputs that are marked as critical (as described in Section II) do not contribute to the edge weights in the system graph.
Next, we break all cycles in the system graph. This problem is related to the minimum feedback vertex set problem, which, despite being NP-hard, can be solved efficiently for large problem instances using heuristic methods [2], [8], [11]. In order to reduce the overhead due to system-level I/O pins and interconnect, the feedback loops should be broken in such a way that buses with the least bitwidth are multiplexed to primary I/Os. For many hierarchical systems, the problem instance is small enough to be solved by inspection. For example, the cycle in the example system can be broken by removing a single edge; this implies that the 5-bit module input fed by that edge must be multiplexed to a primary input. The acyclic system graph corresponding to Fig. 5 is shown in Fig. 4(b), where the edge weights are labeled for notational convenience. Ad hoc sharing methods can be used to reduce system-level I/O overhead. For example, the 5-bit input in Fig. 5 can be multiplexed with five lines from another bus.
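The graph construction and cycle-breaking step can be sketched in Python as follows. This is a simple greedy heuristic written for illustration, not one of the methods of [2], [8], [11]: it repeatedly finds a directed cycle and removes the narrowest bus on it. The module names `A`, `B`, `C` and all bus widths are hypothetical and do not correspond to Fig. 4.

```python
from typing import Dict, List, Optional, Tuple

Edge = Tuple[str, str]


def find_cycle(edges: Dict[Edge, int]) -> Optional[List[Edge]]:
    """Return the edges of one directed cycle, or None if the graph is acyclic."""
    succ: Dict[str, List[str]] = {}
    for u, v in edges:
        succ.setdefault(u, []).append(v)
        succ.setdefault(v, [])
    state = {v: 0 for v in succ}  # 0 = unvisited, 1 = on DFS stack, 2 = finished
    path: List[str] = []

    def dfs(u: str) -> Optional[List[Edge]]:
        state[u] = 1
        path.append(u)
        for v in succ[u]:
            if state[v] == 1:  # back edge closes a cycle
                cyc = path[path.index(v):] + [v]
                return list(zip(cyc, cyc[1:]))
            if state[v] == 0:
                found = dfs(v)
                if found:
                    return found
        path.pop()
        state[u] = 2
        return None

    for start in succ:
        if state[start] == 0:
            found = dfs(start)
            if found:
                return found
    return None


def break_cycles(edges: Dict[Edge, int]) -> Tuple[Dict[Edge, int], List[Edge]]:
    """Greedily remove the narrowest bus on each remaining cycle."""
    acyclic = dict(edges)
    removed: List[Edge] = []
    while (cycle := find_cycle(acyclic)) is not None:
        narrowest = min(cycle, key=lambda e: acyclic[e])
        removed.append(narrowest)
        del acyclic[narrowest]
    return acyclic, removed


# Hypothetical system graph: edge weights are total bus widths in bits.
system = {
    ("src", "A"): 32,
    ("A", "B"): 16,
    ("B", "C"): 8,
    ("C", "A"): 5,  # feedback bus
    ("C", "sink"): 32,
}
acyclic, removed = break_cycles(system)
```

In this toy graph the 5-bit feedback bus is the edge removed, so it is the one that would have to be multiplexed to a primary input. Removing the narrowest edge per cycle is not guaranteed to minimize the total width removed when cycles share edges.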
The proposed test methodology applies precomputed test sets to the modules in the system in multiple test sessions. Exactly one module is tested in one session: the module under test is set to the functional mode while all other modules are set to the transparent mode. This is achieved using a separate control input for each module, which suggests that one control input is necessary per module. However, the number of control inputs can be reduced to a number logarithmic in the number of modules using a decoder, since at most one module is tested in any session, which implies that at most one of the control inputs is 0 in any test session. An additional test session is necessary for interconnect testing; all modules in the system are set to the transparent mode for this session.
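The session decoder can be sketched in Python as follows. The session numbering and the bit count are assumptions made for this sketch: with one session per module plus one interconnect session, the encoded control word needs enough bits to distinguish that many sessions. The source does not specify the decoder's encoding.

```python
from typing import List


def control_width(num_modules: int) -> int:
    """Bits needed to encode one session per module plus the interconnect session."""
    # Session numbers range over 0 .. num_modules, so the width is the bit
    # length of the largest session number.
    return num_modules.bit_length()


def decode(session: int, num_modules: int) -> List[int]:
    """Expand an encoded session number into per-module control inputs.

    Sessions 0 .. num_modules-1 test one module each (its control input is 0,
    all others are 1). Session num_modules is the interconnect session, in
    which every module is transparent.
    """
    if not 0 <= session <= num_modules:
        raise ValueError("unknown test session")
    return [0 if i == session else 1 for i in range(num_modules)]
```

For a system with five modules, three encoded control pins would suffice under this scheme instead of five separate ones.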
As discussed in Section II and illustrated in Fig. 1, if a module's output bitwidth is less than its input bitwidth, then additional outputs must be added to the module to make it fully transparent. However, this is not always necessary when the module is embedded in a larger system. In order to reduce overhead, the increase in the I/O bitwidths of each module in a hierarchical system must be minimized by analyzing the propagation requirements of the other modules. The global analysis that we present next leads to a system graph with modified edge weights, which are distinguished from the original edge weights labeled in Fig. 4(b). For each test graph, we associate a set of constraints on the weights of its edges. These constraints, which provide sufficient conditions for transparency, are of two types: i) justification constraints, which ensure that test data for the module under test can be transparently propagated from the source vertex to that module through other modules, and ii) propagation constraints, which ensure that test responses for the module under test can be transparently propagated through other modules to the sink vertex. The constraints are obtained as follows.
1) Justification constraints: If a module lies on a path from the source vertex to the module under test in a test graph, then the sum of the weights of the edges directed away from that module must not exceed the sum of the weights of the edges incident on it. This ensures that the bitwidth at the inputs of the module is adequate for justifying tests for the module under test. If the module corresponds to a reconvergence point, then it is quite likely that test data for [...]
2) Propagation constraints: If a module lies on a path from the module under test to the sink vertex in a test graph, then the sum of the weights of the edges incident on that module must not exceed the sum of the weights of the edges directed away from it. This ensures that if a test response for the module under test is incident on one or more inputs of the module, then it can be propagated through one or more of its output ports.
Note that the above constraints only reflect sufficient conditions. They can handle any combination of bus truncations and reconvergent fanouts. These conditions are not necessary, though, since transparency for any module can be ensured by providing exactly one justification path from system inputs and exactly one propagation path to system outputs. We only consider sufficient conditions here since we are attempting to minimize the amount of computation necessary to ensure module transparency. Even though some amount of global information is encapsulated in the test graphs, the constraints are local in the sense that we examine only one module at a time in every test graph.
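The two local constraints compare the total input width and total output width of one module at a time. A minimal Python sketch of these checks is given below; the function names and the example graph are hypothetical, and the graph is a simple chain rather than the system of Fig. 4.

```python
from typing import Dict, Tuple

Edge = Tuple[str, str]


def in_width(weights: Dict[Edge, int], v: str) -> int:
    """Total bitwidth of the edges incident on vertex v."""
    return sum(w for (_, dst), w in weights.items() if dst == v)


def out_width(weights: Dict[Edge, int], v: str) -> int:
    """Total bitwidth of the edges directed away from vertex v."""
    return sum(w for (src, _), w in weights.items() if src == v)


def satisfies_justification(weights: Dict[Edge, int], v: str) -> bool:
    """v lies between the source and the module under test:
    its outputs must not be wider than its inputs."""
    return out_width(weights, v) <= in_width(weights, v)


def satisfies_propagation(weights: Dict[Edge, int], v: str) -> bool:
    """v lies between the module under test and the sink:
    its inputs must not be wider than its outputs."""
    return in_width(weights, v) <= out_width(weights, v)


# Hypothetical chain src -> A -> B -> C -> sink with unequal bus widths.
chain = {
    ("src", "A"): 32,
    ("A", "B"): 16,
    ("B", "C"): 8,
    ("C", "sink"): 32,
}
```

In this chain, module `B` fails the propagation check when `A` is under test, because 16 response bits arrive at `B` but only 8 bits leave it. Widening the `B`-to-`C` bus to 16 bits would satisfy the check.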
If additional computation for global analysis is permitted, we can determine a unique path to system I/Os from the module under test in each test graph. We can then attempt to determine bitwidth constraints for each edge on these paths. However, path selection is complicated by the presence of fanout vertices, and an appropriate cost function must be used for path selection. For example, the propagation path for the module under test in Fig. 6(c) can follow either of two alternative routes or a combination of the two. We do not consider an explicit path selection procedure in this paper.
The various constraints on the edge weights are shown beside the test graphs in Figs. 6 and 7(a). The constraints derived from the various test graphs typically overlap; thus, they must be combined to obtain the set of global constraints. The total increase in the system-level interconnect is given by the sum, over all edges of the acyclic system graph, of the difference between the modified edge weight and the original edge weight, where the modified weights are variables whose values are to be determined and the original weights are known constants. Our objective is to minimize this sum subject to the constraints on the edge weights. This can be expressed as the following integer linear programming (ILP) model:
Objective: Minimize the total increase in edge weights, subject to the combined constraints of Figs. 6 and 7(a).
The above ILP model can be easily solved using a standard public-domain solver (we used lpsolve [3]) to obtain the system graph with modified edge weights shown in Fig. 7(b). The solver run time was only a few seconds for this example. The edges whose weights have been updated are shown in bold; for the sake of comparison, their original values are also shown. In order to ensure transparency, the high-level modules must be synthesized with the number of I/Os corresponding to the modified edge weights. The resulting system incorporating these transparent modules can be tested in hierarchical fashion by making complete reuse of the precomputed tests for the individual modules. Note, however, that these precomputed tests must be generated for the transparent modules since we expect the tests to change (relative to the original circuit) after synthesis with embedded multiplexers. The impact of the resynthesis process on module test sets remains an interesting open problem. An alternative approach is to introduce test points for greater test access and perform test generation for the flattened SOC. However, we expect the number of test points and the test generation time to be high for SOC designs.
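To make the bitwidth optimization concrete, the following Python sketch solves a tiny instance by exhaustive search instead of calling an ILP solver such as lpsolve. The chain graph, the constraint list, the upper bound `max_width`, and the assumption that a modified weight is never smaller than the original weight are all hypothetical choices made for this illustration.

```python
from itertools import product
from typing import Callable, Dict, List, Optional, Tuple

Edge = Tuple[str, str]
Weights = Dict[Edge, int]


def solve_bitwidths(original: Weights,
                    constraints: List[Callable[[Weights], bool]],
                    max_width: int) -> Optional[Weights]:
    """Exhaustively search for the cheapest widened edge weights.

    Each edge may keep its original width or grow up to max_width. The cost
    is the total increase in width, i.e. sum(new - original) over all edges.
    """
    edges = list(original)
    ranges = [range(original[e], max_width + 1) for e in edges]
    best: Optional[Weights] = None
    best_cost: Optional[int] = None
    for combo in product(*ranges):
        candidate = dict(zip(edges, combo))
        if not all(check(candidate) for check in constraints):
            continue
        cost = sum(candidate[e] - original[e] for e in edges)
        if best_cost is None or cost < best_cost:
            best, best_cost = candidate, cost
    return best


# Hypothetical chain src -> A -> B -> C -> sink.
original = {("src", "A"): 32, ("A", "B"): 16, ("B", "C"): 8, ("C", "sink"): 32}
SA, AB, BC, CS = ("src", "A"), ("A", "B"), ("B", "C"), ("C", "sink")

# Combined local constraints from the three test sessions (A, B, C under test).
constraints = [
    lambda w: w[AB] <= w[SA],  # A justifies tests for B and C
    lambda w: w[BC] <= w[AB],  # B justifies tests for C
    lambda w: w[AB] <= w[BC],  # B propagates responses of A
    lambda w: w[BC] <= w[CS],  # C propagates responses of A and B
]
solution = solve_bitwidths(original, constraints, max_width=32)
```

For this toy chain, the search widens only the `B`-to-`C` bus, from 8 to 16 bits. Exhaustive search is practical only for very small graphs; an ILP solver is the appropriate tool for realistic systems.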
IV. EXPERIMENTAL RESULTS
In this section, we present experimental results on the proposed synthesis-for-test method for high-level synthesis benchmarks [6], [22]. In addition to the four benchmarks included in the example system, we also use the benchmarks lru, diffeq, and dhrc. All these benchmarks are available as behavioral VHDL models. In order to illustrate the insertion of the embedded multiplexer, we present the GCD example in the Appendix. We carried out three sets of experiments using the Synopsys Design Compiler (lsi10k library) running on a Sun Ultra 10 workstation with a 333 MHz processor and 256 MB of DRAM. The synthesis time varied from a few seconds for the individual modules to less than 10 minutes for the complete system. The area figures were obtained by initially setting the Design Compiler's wire load model parameter to 0.5 and then changing it to 0.1. This parameter is used by the Synopsys synthesis tool to estimate interconnect area relative to gate area.
First, we synthesized nontransparent and transparent versions of each of the high-level benchmarks to evaluate the impact of embedding multiplexers on their area and performance. We then carried out a case study by synthesizing the example system formed from four benchmark circuits and a 32-bit combinational multiplier. We then synthesized an easily testable version of this system by making each module in it transparent. The optimization model of Section III was used to derive a transparent behavioral model that minimized additional interconnect area. Next, we examined all the benchmarks with explicit multiplexers (inserted after synthesis) to evaluate their impact on area and performance. This allowed us to compare embedded multiplexers with the trivial approach of adding explicit multiplexers. Finally, we applied the synthesis-for-transparency approach to the LEON processor core.
Table I presents experimental results on synthesis using embedded multiplexers for seven high-level benchmarks. Some of these benchmarks are included in the example system. The goal of this experiment was to evaluate the effect of synthesis for transparency on the area and delay of individual modules. The average increase in area due to multiplexer embedding is only 3.96%. Interestingly, in many cases, the delay of the circuit decreased (due to efficient resynthesis) despite the multiplexer added to its behavioral description. This is in sharp contrast to external multiplexers inserted at the gate level, which usually increase the delay.
In Table II, we present experimental results on the synthesis of the hierarchical example system using transparent modules with embedded multiplexers, and using the graph model and optimization framework described in Section III. The results show that complete hierarchical testability of the system can be achieved with an area overhead of 3.2% and performance loss of only 1.6%. As discussed in Section III, the additional system I/O pins for test data can be multiplexed using ad hoc methods based on the functional interconnections between the modules. A systematic strategy for minimizing the number of I/Os needs further investigation.
In Table III, we present the impact on area and performance of adding explicit multiplexers. As expected, the trivial method of adding multiplexers at the gate level leads to much higher overhead. The area overhead (9% on average) and the penalty on system performance (also 9% on average) are both high.
Finally, we present the results of applying the synthesis approach to the proc and peri modules in the LEON core. Unfortunately, we were unable to make the synthesis script handle other LEON modules. A block diagram of the LEON core is shown in Fig. 8, and block diagrams of the proc and peri modules are shown in Fig. 9. The proc module contains the integer unit, clock/reset generation, and an optional floating-point unit (not considered in our experiments). The peri module is a smaller controller unit used for instantiating all peripherals. Table IV shows the impact of transparency synthesis on these two LEON modules. The CPU time for synthesis was less than seven minutes in each case. We also conducted an experiment in which we treated peri as a hierarchical system composed of synthesizable blocks. The area overhead in this case was only 0.3%.
V. CONCLUSION
We have presented a new synthesis-for-test approach in which multiplexers are embedded in the behavioral models of the various modules constituting a hierarchical system. This approach can also be used to design a hierarchically-testable system-on-a-chip using synthesizable embedded cores. The embedded multiplexers provide single-cycle transparency, thereby offering a straightforward yet effective solution to the problems of test data propagation and test vector translation. In order to reduce area/performance overhead and maximize test reuse, we have presented a graph-theoretic framework and an optimization method based on integer linear programming. We have presented case studies using high-level synthesis benchmarks to show that synthesis for transparency introduces very little area and performance overhead.
The results presented in this paper raise a number of interesting open questions. While we can obtain parallel access to module I/Os for at-speed testing, the clock frequency at which patterns are applied is affected by the longest combinational path that spans several modules in transparent mode. This limits the speed at which the test patterns can be applied even though parallel access is available. We are investigating the insertion of observation points at internal nodes in the system to overcome this problem. We are also developing an algorithm that will facilitate the sharing of system-level I/Os since the entire I/O bitwidth is not necessary for any single test session. In addition, we are studying how BIST can be incorporated in the synthesis framework. SOC pin count limitations can be overcome using a combination of width compression [7] , [15] and output space compaction [26] . Even though these approaches will restrict the synthesis approaches to specific test sets, they will provide single-cycle transparency with a small number of SOC pins.
Despite the numerous benefits offered by the proposed synthesis methodology, a few limitations restrict its applicability to specific types of hierarchical systems. The embedded multiplexer approach depends on the availability of synthesizable models. However, these models are usually not available for intellectual property (IP) cores. For systems that use a combination of synthesizable and IP modules, we can combine synthesis for transparency with other known DFT methods. For example, IP modules and hard cores, which are usually wrapped for ease of testing, can be treated as pseudo-system I/Os for determining transparency paths. The transparency paths can begin and end at the scan cells within the wrappers.
APPENDIX
We present an example of the GCD benchmark to illustrate how transparency is inserted at the behavioral level. Fig. 10 shows the original benchmark's synthesizable behavioral VHDL description and Fig. 11 shows the behavioral model with the embedded multiplexer.
