Multiple levels of design hierarchy are common in current-generation system-on-chip (SOC) 
Introduction
The integration of a complete electronic system on a single chip is now commonplace. A system-on-chip (SOC) typically integrates a heterogeneous mix of digital logic, embedded memories, and analog blocks. To reduce test cost, SOCs are increasingly tested in a modular fashion, i.e., their various design modules are tested separately [1, 2]. Such an approach is mandatory for embedded non-logic components such as memories and analog modules, as well as for black-boxed third-party cores. Modular testing also brings advantages for the other parts of the SOC, including reduced ATPG complexity and greater test reuse. Modular testing uses an on-chip test access infrastructure, which consists of test wrappers and test access mechanisms (TAMs) [1]. Test wrappers isolate the various modules from their surrounding circuitry during test. TAMs transport test stimuli from the SOC pins to the module terminals, and test responses in the opposite direction.
Multiple levels of design and test hierarchy are quite common in current-generation SOCs. A 'parent' core may contain several 'child' cores, which in turn may contain their own embedded ('grandchild') cores. For example, [3, 4] describe SOCs for digital video, where the SOC design is partitioned into chiplets, which in turn consist of embedded cores. Despite this, most prior research on SOC testing (unrealistically) assumed that there is no design and test hierarchy, i.e., that all cores are at the same hierarchy level [6, 7, 2]. Test wrappers, TAMs, and test schedules designed for non-hierarchical SOCs are typically not valid for SOCs with hierarchical cores. The hierarchy imposes a number of constraints on the manner in which tests must be applied to parent cores and their embedded child cores [8]; hierarchy-oblivious methods make no attempt to satisfy these constraints.
In this paper, we describe a test infrastructure design approach for hierarchical SOCs; our approach is based on hierarchy-aware wrapper design for parent cores, TAM optimization techniques for the SOC and the parent cores, and chip-level test scheduling. We consider two different test infrastructure design scenarios. In Scenario 1, we assume that the wrappers and TAM architectures for the child cores are given and fixed (hard), while the wrappers and TAM architectures for the parent cores are to be determined by our approach (soft). In Scenario 2, the wrapper and TAM for both parent and child cores are assumed to be soft.
The remainder of this paper is organized as follows. In Section 2, we review the limitations of prior work. In Section 3, we discuss various DfT techniques that can be used to reduce test length. Section 4 describes our approach for Scenario 1, while Section 5 addresses Scenario 2. We derive lower bounds on the test time in Section 6. In Section 7, we present test application times and lower bounds for the ITC'02 SOC test benchmarks [5]. Finally, Section 8 concludes the paper.
Limitations of Prior Work
Most prior work on wrapper/TAM optimization for SOCs has assumed a non-hierarchical test infrastructure [2, 6, 7, 10, 11]. In comparison, only a limited amount of work has been done on wrapper design and TAM optimization for hierarchical SOCs. More recently, wrapper design for hierarchical cores was addressed in [8], and wrapper design and TAM optimization techniques for hierarchical cores were explored in [9, 12, 13].
Wrapper Design Issues
To understand the differences between the testing of non-hierarchical and hierarchical SOCs, it is important to understand the functionality of some elements of the IEEE P1500 wrapper architecture [11]. The main difference arises from the functionality of the wrapper cells that buffer the functional inputs and outputs of the core. The input wrapper cell shifts and applies test stimuli in the INTEST mode, and captures and shifts test responses in the EXTEST mode. The operation of the output wrapper cells is the reverse of that of the input wrapper cells in the two modes. Figure 1(a) shows a basic "default" IEEE P1500 wrapper cell; CFI and CFO are the functional input and output of the wrapper cell, respectively, and CTI and CTO are the test data input and output, respectively. During the INTEST mode, the path from CTI to CFO through the flip-flop and the two multiplexers is used, and during the EXTEST mode, the path from CFI to the flip-flop is used to capture the test response from the chip interconnect. Consequently, the path from CFI to CFO through the upper selection of the multiplexer when m1=1 remains untested. This drawback is highly undesirable, since the untested path is used in functional mode. Hence an alternative P1500-compliant wrapper cell configuration, shown in Figure 1(b), is more commonly used in practice, e.g., at Philips. The paths used during the INTEST-apply and EXTEST-capture cycles are highlighted in Figure 1(b) and (c), respectively. Both paths of the second multiplexer are exercised in the test mode, thereby allowing better testability of the cell. Also, during the INTEST-apply cycle, the feedback path causes the test data bit of the previous cycle to be fed back into the flip-flop. This is an added advantage of this cell, since it can prevent data corruption due to clock skew in back-to-back wrapper cells.
However, in this wrapper cell configuration, the feedback path also implies that the apply cycle of the INTEST mode cannot coincide with the capture cycle of the EXTEST mode. The multiplexer control signal values for the two modes are conflicting, as shown in Figure 1(d).
In this paper, and in [8], the use of the wrapper cell configuration of Figure 1(b) is assumed. For a hierarchical core, suppose that the parent core and the child cores are tested on different TAM partitions. When the parent core is in the INTEST mode, the embedded child cores have to be in EXTEST mode to ensure complete internal testing of the parent core [8, 9]. Thus, the parent and the child INTEST testing cannot be done concurrently, since the child cores cannot be configured simultaneously in INTEST and EXTEST mode. Due to this added constraint, the wrapper/TAM optimization techniques that are used for non-hierarchical SOCs cannot be directly used for hierarchical SOCs that rely on the wrapper cell configuration shown in Figure 1(b). Nevertheless, the design in Figure 1(b) is still the preferred wrapper cell due to its better testability.
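The constraint above can be illustrated with a small schedule check. This is a minimal sketch, not a procedure from [8] or [9]: it assumes each core's INTEST session is described by a half-open interval of clock cycles, and the names `intest_conflicts`, `schedule`, and `parent_of` are hypothetical.

```python
from typing import Dict, List, Tuple

Interval = Tuple[int, int]  # half-open [start, end) in clock cycles (assumed model)


def overlaps(a: Interval, b: Interval) -> bool:
    """True if the two half-open intervals share at least one cycle."""
    return a[0] < b[1] and b[0] < a[1]


def intest_conflicts(schedule: Dict[str, Interval],
                     parent_of: Dict[str, str]) -> List[Tuple[str, str]]:
    """Return (child, parent) pairs whose INTEST sessions overlap in time.

    A child core must be in EXTEST while its parent is in INTEST, so any
    overlap between the two INTEST sessions violates the constraint.
    """
    conflicts = []
    for child, parent in parent_of.items():
        if child in schedule and parent in schedule:
            if overlaps(schedule[child], schedule[parent]):
                conflicts.append((child, parent))
    return conflicts
```

A hierarchy-oblivious scheduler could produce a schedule for which this check returns a non-empty list; a hierarchy-aware one should not.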
Wrapper/TAM Optimization
In [12], a wrapper/TAM optimization approach is presented, which uses an existing hierarchy-oblivious TAM optimization approach iteratively to solve the problem of TAM optimization for hierarchical SOCs. However, in this approach, the constraint that child and parent INTEST testing must be time-multiplexed is ignored. In [13], a TAM design technique for hierarchical SOCs is presented in which the hierarchical cores are assumed to be hard wrapped cores. This approach requires large area overhead due to the added registers for bandwidth matching, and it also requires synchronization of the clock signals.
In [9], modified input and output wrapper cells are used to allow concurrent INTEST testing of the parent and child cores. An extra flip-flop and a multiplexer are used in this configuration. Therefore, the wrapper cells used in this approach occupy more area than the conventional IEEE P1500 wrapper cells; thus they may not always be feasible.
In [8], the problem of designing an IEEE P1500-compliant wrapper for hierarchical cores has been addressed. The IEEE P1500-compliant wrapper for hierarchical cores has a new set of terminals, called the CTAM terminals, which are used to access the child cores embedded in a parent core. The wrapper for the parent core operates in two complementary test modes. In the INTEST P mode, internal testing of the parent core is carried out. Test data is scanned through the parent core scan chains, the parent core wrapper cells, and the child core wrapper cells. The child cores have to be in the EXTEST mode to ensure complete core-internal testing of the parent core. Hence, in the INTEST P mode, the available TAM wires are distributed among the parent core scan chains and wrapper cells and the child core TAM architecture. This was ignored in the prior work on TAM optimization for hierarchical cores in [12]. In the INTEST C mode, child core internal testing is carried out; all child cores are in INTEST mode. In this mode, the available TAM width at the wrapper interface can be utilized for child core testing alone. Since the two modes operate independently, they are time-multiplexed using multiplexers.
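Because the two modes are time-multiplexed, a natural first-order reading is that the test time of a hierarchical core is the sum of the time spent in each mode. The sketch below shows only that arithmetic; the function name and the cycle counts used with it are hypothetical, and the additive model is an assumption rather than a formula quoted from [8].

```python
def hierarchical_core_test_time(t_intest_p: int, t_intest_c: int) -> int:
    """Total test time (clock cycles) of a parent core, assuming the
    INTEST P and INTEST C modes run back to back and never overlap."""
    if t_intest_p < 0 or t_intest_c < 0:
        raise ValueError("test times must be non-negative")
    return t_intest_p + t_intest_c
```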
The wrapper design approach presented in [8] is suitable only for hierarchical cores with a hard TAM architecture. Moreover, the wrapper design approach of [8] has not been used for TAM optimization. In this paper, we solve the wrapper and TAM optimization problem in conjunction for two different design scenarios and evaluate the approach using the metrics of area impact and test application time.
In this section, we discuss the impact of using (a) scan chain bypass, (b) core bypass, and (c) scan chain reordering, on the test time for a hierarchical core. We also discuss the impact of these DfT techniques on chip area. Scan chain bypass and core bypass have been studied in prior work [14], but they have not been discussed in the context of hierarchical cores.
Scan chain bypass: The test time $T_i(w)$ for a core $C_i$ is defined as $T_i(w) = (1 + \max(si_i, so_i)) \times p_i + \min(si_i, so_i)$, where $si_i$ and $so_i$ are the maximum scan-in and scan-out lengths of core $C_i$ for a TAM width of $w$, and $p_i$ is the number of test patterns. Typically, it is assumed that for any given core, the test data volume in the INTEST mode is much greater than the test data volume in the EXTEST mode. Thus, the child core architecture, which consists of the child cores' wrappers and the embedded TAM, is optimized for the INTEST mode only. During INTEST, the test stimuli are scanned into the input wrapper cells and scan chains, and the test responses are scanned out of the scan chains and the output wrapper cells. Hence, the scan chains are sandwiched between the input and output wrapper cells. This ensures smaller scan-in and scan-out lengths during INTEST, compared to the scan chains being ordered before or after the wrapper cells. However, in the EXTEST mode, the scan chains of the child cores add to the scan-in and scan-out length even though they do not participate in the test. This additional contributor to the scan length can be eliminated by using a scan chain bypass. A flip-flop or register is not required in the bypass path if the scan chains are local to the same module. Typically, a scan chain bypass has an area overhead of one multiplexer. Thus, this trade-off between scan length reduction and increased DfT overhead has to be carefully evaluated. In the case of a hierarchical core, the scan length reduction in child core EXTEST mode pays off only if the scan chain for the child core architecture is long enough to determine the maximum scan-in and scan-out length of the parent core.
Core bypass: The core bypass feature allows the bypass of an entire core in one clock cycle by means of a bypass register or flip-flop across the core. It has been shown that this feature can reduce the test time significantly in a TestRail architecture [2]. In [8], it is shown that core bypasses can reduce the test time of hierarchical cores as well. If the TAM width at the parent core's interface is less than the number of CTAMs, test length is likely to be reduced if one or more CTAM chains of the child core architecture are daisy-chained. This results in an increase in the scan-in and scan-out lengths of all the cores. A core bypass can reduce the impact of daisy-chaining on the scan lengths of the cores. Scan chain bypasses can also reduce the scan-in and scan-out lengths in this situation; however, core bypasses are more effective in the INTEST C mode since they eliminate the wrapper cells in the scan path as well. Scan chain bypasses are more useful for the INTEST P mode, because the scan chains are excluded from the scan path without excluding wrapper cells, which are essential for the INTEST P mode.
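The test time expression above is easy to evaluate directly. The following sketch transcribes it; the function name and the numbers in the usage note are illustrative only.

```python
def core_test_time(scan_in: int, scan_out: int, patterns: int) -> int:
    """Test time in clock cycles: (1 + max(si, so)) * p + min(si, so).

    scan_in / scan_out are the maximum scan-in and scan-out lengths of the
    core for the TAM width under consideration; patterns is p.
    """
    return (1 + max(scan_in, scan_out)) * patterns + min(scan_in, scan_out)
```

For example, with a scan-in length of 100, a scan-out length of 80, and 10 patterns, the expression gives (1 + 100) × 10 + 80 = 1090 cycles. If a bypass shortened both lengths by 50 bits, the same call with 50 and 30 would give 540 cycles.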
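The effect of a core bypass on daisy-chained scan lengths can be sketched with a deliberately simplified model. The assumptions here are ours: each core on the chain is represented by a single scan length, a bypassed core contributes exactly one bit (the bypass flip-flop), and only the core at index `target` is being tested. The function name is hypothetical.

```python
from typing import List, Tuple


def daisy_chain_scan_lengths(lengths: List[int], target: int,
                             bypass: bool) -> Tuple[int, int]:
    """Return (scan_in, scan_out) for the core at index `target`.

    lengths[i] is the scan length of the i-th core on the daisy chain.
    Without bypass, data for the target shifts through every other core in
    full; with bypass, each non-target core costs a single clock cycle.
    """
    def hop(n: int) -> int:
        return 1 if bypass else n

    upstream = sum(hop(n) for n in lengths[:target])
    downstream = sum(hop(n) for n in lengths[target + 1:])
    own = lengths[target]
    return upstream + own, own + downstream
```

With chain lengths 50, 200, and 30 and the middle core under test, the model gives (250, 230) without bypass and (201, 201) with bypass.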
Scan chain reordering: This is another technique that can be explored to reduce the CTAM scan length of the child cores during the INTEST P mode. Compared to the INTEST mode, the roles of the input and output wrapper cells in the EXTEST mode are reversed. Ideally, in the EXTEST mode, the output wrapper cells should precede the input wrapper cells to minimize the scan-in and scan-out lengths. By using additional hardware, it is possible to make the wrapper reconfigurable in the EXTEST mode, such that the order of the wrapper cells is reversed. This technique requires three two-input multiplexers for each scan path of the core. The benefits of using this type of design optimization, as in the case of scan chain and core bypasses, depend on the child core TAM architecture available. The improvements in the CTAM lengths depend on the difference in the number of input and output wrapper cells of the cores that interface directly with the CTAM terminals.
Design Scenario I
In this scenario, the parent core provider implements the wrapper and TAM architecture for the child cores before core delivery. The system integrator designs the overall wrapper/TAM architecture for the hierarchical SOC. The information provided for the hierarchical core by the core provider includes information about the test infrastructure for the child cores. This information is used for wrapper design of the hierarchical core in the INTEST P and INTEST C modes.
The problem of wrapper/TAM optimization for this scenario is stated as follows. Problem $P_I$: Given (i) a hierarchical SOC with $N$ top-level cores, where one or more top-level cores may be parent cores with embedded child cores, (ii) the wrapper and TAM architecture for each embedded child core, (iii) the test set parameters for each top-level core $C_i$, $1 \le i \le N$, and (iv) the SOC-level TAM width $W$, determine: 1. The number of TAM partitions and the partition widths; 2. Wrapper design for each top-level core $C_i$, $1 \le i \le N$; 3. Assignment of top-level cores to TAM partitions, such that the testing time for the SOC is minimized.
The test set parameters for each core $C_i$, $1 \le i \le N$, include the number of primary inputs $fi_i$, primary outputs $fo_i$, bidirectional I/Os $b_i$, test patterns $p_i$, scan chains $k_i$, and the different scan chain lengths. The cores are assumed to have hard scan chains, i.e., the number and lengths of scan chains are fixed. If $C_i$ is a hierarchical (parent) core, the test data for it also includes the test data and test infrastructure parameters for its child cores. For any given parent core $C_i$, the test data and test infrastructure for its embedded child cores include the following: (i) the set $M_i$ of CTAM chains; (ii) the set $CC_i$ of embedded child cores; (iii) for each child core $c \in CC_i$, the number of test patterns $p_c$, and the total scan length, scan-in length, and scan-out length on CTAM chain $k$, $1 \le k \le |M_i|$, denoted by $sl_{c,k}$, $si_{c,k}$, and $so_{c,k}$, respectively. If core $C_i$ is not hierarchical, $M_i = \emptyset$ and $CC_i = \emptyset$.
If all cores in the SOC are non-hierarchical, i.e., $M_i = \emptyset$, $1 \le i \le N$, $P_I$ reduces to Problem $P_{NPAW}$, the wrapper/TAM optimization problem for non-hierarchical SOCs from [6]. Since $P_{NPAW}$ was shown to be NP-hard in [6], $P_I$ is also NP-hard. Hence, we resort to efficient heuristics to solve this problem.
In the IEEE P1500-compliant wrapper design technique presented in [8], wrapper design allows a hierarchical core to interface with a TAM partition of any width. Thus, using this wrapper design technique, hierarchical cores can be integrated into an existing wrapper/TAM optimization technique for non-hierarchical SOCs. We implement wrapper/TAM optimization for this scenario using the TR-ARCHITECT tool [2].
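One way to hold the per-core inputs listed above is sketched below. The class and field names are hypothetical, chosen only to mirror the symbols in the text; this is not a file format or an API from any tool.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class ChildCoreData:
    patterns: int  # p_c
    # CTAM chain index k -> (sl_{c,k}, si_{c,k}, so_{c,k})
    chain_lengths: Dict[int, Tuple[int, int, int]] = field(default_factory=dict)


@dataclass
class CoreTestData:
    inputs: int    # fi_i
    outputs: int   # fo_i
    bidirs: int    # b_i
    patterns: int  # p_i
    scan_chain_lengths: List[int] = field(default_factory=list)  # k_i entries
    ctam_chains: List[int] = field(default_factory=list)         # M_i
    child_cores: Dict[str, ChildCoreData] = field(default_factory=dict)  # CC_i

    @property
    def is_hierarchical(self) -> bool:
        # A non-hierarchical core has M_i and CC_i both empty.
        return bool(self.ctam_chains) or bool(self.child_cores)
```

The empty defaults make a flat core the base case, matching the convention that both sets are empty when the core is not hierarchical.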
The TR-ARCHITECT tool includes four main procedures, namely: (1) CREATESTARTSOLUTION; (2) OPTIMIZE-BOTTOMUP; (3) OPTIMIZE-TOPDOWN; (4) RESHUFFLE. In CREATESTARTSOLUTION, an initial TAM partition is created by assigning to each core at least one TAM wire based on its test data volume. The other procedures optimize the initial solution. The procedure OPTIMIZE-BOTTOMUP attempts to minimize the test time for a given test architecture by merging the TAM partition with the shortest test time with another TAM partition, such that the current test time of the SOC is not exceeded. The wires that are thus freed up can be used for overall test time reduction. In OPTIMIZE-TOPDOWN, first, the TAM partition with the largest test time is merged in an iterative manner with another TAM partition. If the first step does not reduce the test time any more, two TAM partitions that are not the partitions with the largest test time are merged to free up wires, provided the test time does not increase. The freed-up wires are then used to reduce the overall SOC test time further. The procedure RESHUFFLE minimizes the test time for a given test architecture by moving individual cores assigned to the TAM partition with the largest test time to another TAM partition, such that the current test time of the SOC is not exceeded. The test times for the cores, for any given TAM width, are obtained using procedure WRAPPERDESIGN. In our work, the wrapper design technique from [8] is used for hierarchical cores and WRAPPERDESIGN is used for non-hierarchical cores.
For our experimental setup, we use TR-ARCHITECT to create a TAM architecture for the child cores embedded in the parent cores in the ITC'02 SOC benchmarks. We assume an internal CTAM width for the hard TAM architectures of the child cores, and use TR-ARCHITECT to design the wrapper and TAM architectures for the child cores. The TAM architecture is designed to minimize the overall test length of the child cores. From the results thus obtained, it is possible to determine the scan-in, scan-out, and total scan lengths of the cores on each CTAM chain. This information is sufficient for the wrapper design procedure used in the overall design flow of the hierarchical SOC. We present experimental results for two of the ITC'02 benchmark SOCs in Section 7.
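To make the RESHUFFLE idea concrete, here is a simplified sketch. It is not the TR-ARCHITECT implementation from [2]: it assumes that cores on one TAM partition are tested one after another (so a partition's time is the sum of its cores' times), that every core has a positive test time, and that a move is accepted only if the destination stays strictly below the current SOC time. The data layout and names are hypothetical; `t` stands in for a wrapper design procedure that returns a core's test time at a given width.

```python
from typing import Callable, Dict, List

TimeFn = Callable[[str, int], int]  # (core name, TAM width) -> test time


def tam_time(cores: List[str], width: int, t: TimeFn) -> int:
    # Assumption: cores sharing a TAM partition are tested serially.
    return sum(t(c, width) for c in cores)


def reshuffle(tams: List[Dict], t: TimeFn) -> int:
    """Move cores off the bottleneck partition while that helps.

    tams is a list of {"width": int, "cores": [names]} and is modified in
    place. Returns the resulting SOC test time (the largest partition time).
    """
    while True:
        times = [tam_time(x["cores"], x["width"], t) for x in tams]
        soc_time = max(times)
        worst = times.index(soc_time)
        move = None
        if len(tams[worst]["cores"]) > 1:  # never empty the bottleneck
            for core in tams[worst]["cores"]:
                for j, dst in enumerate(tams):
                    if j != worst and times[j] + t(core, dst["width"]) < soc_time:
                        move = (core, j)
                        break
                if move:
                    break
        if move is None:
            return soc_time
        core, j = move
        tams[worst]["cores"].remove(core)
        tams[j]["cores"].append(core)
```

The strict inequality is a design choice of this sketch: it guarantees that each accepted move either lowers the SOC time or reduces the number of partitions tied at that time, so the loop terminates.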
Design Scenario II
In this scenario, since the embedded child core test infrastructure is soft, the system integrator can design the wrapper and TAM architecture of the child cores in accordance with the TAM width available at the parent core interface. Thus, if $w$ TAM wires are available at the parent core level, the child core architecture can be designed to have up to $w$ CTAM terminals. Consequently, the CTAM terminals can be directly connected to the available TAM width, and external daisy chaining of the CTAM chains is not required. In this case, the problem of designing the wrapper and TAM architecture for the child cores corresponds to that of designing the wrapper and TAM architecture for an SOC; the TAM architecture is designed for an SOC-level TAM width of $w$. However, wrapper design for the parent core and the overall wrapper/TAM optimization problem for the hierarchical SOC are still different from the corresponding problems for non-hierarchical cores.
The problem statement for this scenario is as follows. Problem $P_{II}$: Given (i) a hierarchical SOC $S$ with a set $C$ of $N$ cores, i.e., $|C| = N$, (ii) the test parameters and the parent core $PC_i$ for each core $C_i$, where $1 \le i \le N$, $C_i \in C$, and $PC_i \in \{S\} \cup C \setminus \{C_i\}$, and (iii) the total SOC-level TAM width $W$, determine: 1. The number of TAM partitions and the partition widths; 2. Wrapper design for each core $C_i$; 3. Assignment of cores to TAM partitions, such that the total testing time for the SOC is minimized.
The test parameters for each core $C_i$, $1 \le i \le N$, include the number of primary inputs $fi_i$, primary outputs $fo_i$, bidirectional I/Os $b_i$, test patterns $p_i$, scan chains $k_i$, and the different scan chain lengths. The cores are assumed to have hard scan chains, i.e., the number and lengths of scan chains are fixed. The hierarchy tree is determined from the parent module information given for each core. The parent for the top-level modules is the SOC itself. If the parent module for every core in an SOC is the SOC itself, i.e., $PC_i = S$, $1 \le i \le N$, then $P_{II}$ for that SOC reduces to Problem $P_{NPAW}$ from [6]. Thus, by the method of restriction, Problem $P_{II}$ is also an NP-hard problem.
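Deriving the hierarchy tree from per-core parent information is a small bookkeeping step, sketched here. The function names and the use of the string "S" for the SOC are assumptions made for illustration.

```python
from collections import defaultdict
from typing import Dict, List


def build_hierarchy(parent_of: Dict[str, str]) -> Dict[str, List[str]]:
    """Map each parent (a core name or the SOC) to its direct children."""
    children = defaultdict(list)
    for core, parent in parent_of.items():
        children[parent].append(core)
    return dict(children)


def top_level_cores(parent_of: Dict[str, str], soc: str = "S") -> List[str]:
    """Cores whose parent is the SOC itself."""
    return [core for core, parent in parent_of.items() if parent == soc]


def is_flat(parent_of: Dict[str, str], soc: str = "S") -> bool:
    """True when every core's parent is the SOC, i.e. there is no hierarchy."""
    return all(parent == soc for parent in parent_of.values())
```

When `is_flat` holds, the instance is the restricted case in which the problem coincides with the non-hierarchical one.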
We extend TR-ARCHITECT in two main ways to solve $P_{II}$: (a) we pre-process the SOC specifications, such that only the top-level cores are presented to the TR-ARCHITECT tool as described in [2], and (b) we replace WRAPPERDESIGN with another procedure called HIERWRAPPER. The HIERWRAPPER procedure makes use of TR-ARCHITECT, WRAPPERDESIGN, and WRAPP, where WRAPP is the wrapper design technique for INTEST P mode, as presented in [8].
Table 3. Comparison of the test time and area overhead of the proposed approach with the approach from [9].
... parent core level for the bottleneck core. Table 2 also shows the test time results with and without core bypasses. As expected, for many values of $W$, the test time for the TAM architecture with core bypasses is less than that for the TAM architecture without core bypasses. The reduction in test time with the use of core bypasses decreases with an increase in $W$. This happens because the daisy chaining of the child core CTAM lengths typically reduces with an increase in TAM width. The core bypass implementation requires 46 and 96 additional 2-to-1 multiplexers in p22810 and p34392, respectively. If $N_c$ is the set of child cores in an SOC, and $w_i$ is the TAM width available to each child core $C_i$, $i \in N_c$, the number of extra multiplexers required for the implementation of core bypasses is given by $\sum_{i=1}^{|N_c|} w_i$. Since the child core architecture is fixed, this number is independent of the SOC-level TAM width $W$. Thus, core bypasses can reduce the test time of an SOC significantly for smaller TAM width values at the expense of a small increase in chip area.
Next, we present results for Design Scenario II. In Table 3, we compare the test application time and chip area of the proposed approach to those of the method presented in [9]. In [9], modified wrapper cells are used, which allow parallel INTEST testing of the child cores and the parent cores, at the expense of an extra flip-flop and an extra multiplexer in each wrapper cell. The chip area is quantified in terms of the total number of NAND2 gates used in the wrapper cells of the SOC. It is assumed that a 2-to-1 multiplexer and a flip-flop are equivalent in area to 3 and 7 NAND2 gates, respectively. The derived lower bounds on test time are also presented for the proposed approach in Table 3.
In Table 3, the increase in test time for the proposed approach is accompanied by a decrease in chip area compared to [9] for all the SOCs. In SOC p22810, the test time does not decrease as much with an increase in $W$ because of hierarchical Core 1. Although the SOC-level TAM width increases, the increase in the TAM width of Core 1 is not sufficient to reduce the test time of its child core architecture significantly. As a result, Core 1 dominates the test time of p22810. In p34392, the lower bounds on test time are reached for several TAM width values because the bottleneck Core 18 dominates the SOC test time. In this SOC, the reduction in chip area is significant compared to the increase in test time for all TAM width values. From these results, we conclude that for most SOCs, the proposed approach yields substantial area savings over [9] at the expense of a small increase in the overall SOC test application time.
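The area accounting described above reduces to simple arithmetic, sketched below using the stated equivalences of 3 NAND2 gates per 2-to-1 multiplexer and 7 per flip-flop. The function names are hypothetical. Converting the core bypass multiplexer counts into NAND2 equivalents is our own illustration and not a figure reported in the tables.

```python
from typing import Iterable

NAND2_PER_MUX = 3   # 2-to-1 multiplexer, as assumed in the text
NAND2_PER_FLOP = 7  # flip-flop, as assumed in the text


def nand2_area(muxes: int, flops: int = 0) -> int:
    """Area in NAND2-gate equivalents for the given cell counts."""
    return muxes * NAND2_PER_MUX + flops * NAND2_PER_FLOP


def core_bypass_mux_count(child_tam_widths: Iterable[int]) -> int:
    """Extra multiplexers for core bypasses: the sum of the TAM widths
    available to the child cores."""
    return sum(child_tam_widths)
```

Under these equivalences, the extra flip-flop and multiplexer in each modified wrapper cell of [9] cost `nand2_area(1, 1)`, i.e. 10 NAND2 gates per cell, while the 46 and 96 bypass multiplexers quoted for p22810 and p34392 would correspond to 138 and 288 NAND2 gates.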
Conclusion
We have addressed the problem of test infrastructure design for hierarchical SOCs. We have shown how a hierarchy-aware test planning method can be used for TAM optimization and test scheduling for hierarchical SOCs in two practical design scenarios. We have derived lower bounds on test time and presented experimental results for four ITC'02 SOC test benchmarks. We have shown that, in the first design scenario, core bypasses can reduce the test time of hierarchical SOCs significantly with a small increase in chip area. We have also shown that, in the second design scenario, the wrapper area is much smaller than in [9], at the cost of only a small increase in test application time.
