Test application time and core accessibility are two major issues in System-On-Chip (SOC) 
Introduction
Recent advances in IC design methods and manufacturing technologies have led to the integration of a complete system onto a single IC, called system on chip (SOC). These system chips offer advantages such as higher performances, lower power consumption, and decreased size and weight, when compared to their traditional multichip equivalents. Many system chips are designed by embedding large reusable building blocks, commonly called cores. Such a design reuse approach speeds up the design process, and allows import of external design expertise. However, the increased system complexity leads to high test data volumes, which means long testing times [Harr99] . Furthermore, traditionally, the chips are tested before they are integrated into a system. The interconnections are tested separately in system test when fault-free chips are already integrated. For system chips, on the other hand, testing of cores and interconnections is performed in a single system test step. Test access becomes also a problem for system chips since the cores are not directly accessible via chip inputs/outputs.
Power consumption, interconnection tests, test conflicts and precedence relations are important issues that must be addressed during the design of the test schedule. The power dissipation during the test mode is often higher than that in system mode because of higher number of signal switches. Thus, it is necessary to organize the tests in such a way that power dissipation does not exceed a given threshold. It is also important to test interconnections between cores. One problem here is test resource sharing since interconnections usually do not have a direct interface to the test data transportation mechanism. Therefore an important conflict that must be considered for interconnection test is the sharing of test resources. On the other hand, each core (e.g. hard core) may come with its dedicated BIST (Built-In Self Test) resource. The BIST resources can in some cases be shared between several cores, but this is not always the case. If they are shared, it is usually not possible to test more than one core at a time for a given resource. Furthermore, depending on the design, some tests may have to be applied before others. The order in which the tests are applied is therefore important, and precedence constraints must be considered.
In [MaGo00] , the authors present two heuristics aiming at designing a wrapper for cores. The presented technique minimizes the longest wrapper chain for a given number of wrapper chains (i.e. the number of connections to the TAM). Iyengar and Chakrabarty [IyCh01] , on the other hand, present a heuristic similar to the one we describe in this paper, which tries to minimize also the number of wrapper chains.
Once the wrapper is designed, a TAM must be built, and several approaches can be used. In [NoPa01] , the authors use existing resources to implement the TAM. The approach searches, in an exhaustive way, all the controllable components connecting two points in the SoC (muxes, tristates, bypass, etc.). It deals with test scheduling and TAM architecture design simultaneously. In [CoCa02] , the presented method uses different kinds of TAMs during the scheduling phase. The authors define a connection model of cores to neighbor cores with a cost for each of them. The aim of the approach is then to minimize this cost.
The more standardized approaches [IyCh01] [IyCh02] use specific test buses. The TestRail used by [IyCh01] [IyCh02a] mixes the Daisy Chained and the distributed architectures. Cores on the same TestRail are tested simultaneously. Another approach, called TestBus, is based on the mixture of the multiplexed and the distributed architectures. The cores connected to the same TestBus are tested sequentially, each core being connected to the whole TestBus bandwidth. The bus-based approaches are easy to implement and are more flexible than the other ones. They provide a direct compatibility with P1500 requirements.
In this paper we propose a technique for the design of the wrappers, the selection of the TAM configurations and the scheduling of the tests. The main advantages of our approach are that we design a test schedule to minimize the test application time while considering power consumption and test conflicts. The conflicts we consider include test resource sharing and precedence relations. Furthermore, we consider also the testing of interconnections.
The rest of this paper is organised as follows. Section 2 presents our wrapper design algorithm, which generates a set of design alternatives. Section 3 describes the implementation of the TAM architecture and the heuristic we use in parallel to minimise the total test time. Section 4 summarises the experimental results and discusses the features of our approach. Finally, section 5 presents the conclusions and limitations of the proposed algorithms.
Wrapper Design
Before designing an overall test architecture for a given SOC, each core has to be wrapped considering the P1500 restrictions [MaIy02] for signals and functionalities. This interface allows us to isolate the cores during testing and to apply test vectors in an optimal way in terms of test time.
The purpose of our wrapper design algorithm is to develop a set of wrapper chains at each core. A wrapper chain includes a set of the scanned elements (scan-chains, wrapper input cells and wrapper output cells). The main objective of the algorithm is, for a given bandwidth, to organize the wrapper chains in such a way that the test time is minimized. The test time is related to the length of the wrapper chains, which means that we should minimise the longest wrapper chain (internal or external or both), i.e. max{si, so}, where si (so) denotes the number of scan cycles required to load (unload) a test vector (test response). The test time at a core is given by:
where p is the number of test vectors to apply to the core [IyCh02] .
In our approach, a TestBus model for the TAM is used. The consequence of this is that we need connections to the TAM for the inputs and the outputs of the wrapper (Fig. 1) . Our heuristic can be divided in two main parts; the first one for combinational cores and the second one for sequential cores. For combinational cores, there are two possibilities. If the TAM bandwidth limit, W, is above or equal to I+O (where I is the number of functional inputs and O the number of functional outputs), then nothing is done and the number of connections to the TAM is I+O. If W is below I+O, then some of the cells on the I/Os are chained.
For sequential cores (one example is given in Fig.1 ), if W is above or equal to {#SC×2 + 2} (#SC is the number of scan chains), then a pre-process 'Internal Chaining' will chain the internal scan chains. The value given by {#SC×2 + 2} defines the minimal number of bits needed to connect a core with scan chains using our approach (2 per scan chain, one to link the functional inputs and one to link the functional outputs). Then a 'fill' process will connect wrapper cells to internal scan chains until the total number of FFs (flip-flops) reaches the number of FFs of the longest scan chain. In Fig. 1 , the scan chain of length 6 and that of length 2 are chained together. Two wrapper input cells are then connected to it, so that the total number of FFs is 10, which equals the number of FFs of the longest scan chain. If W is below {#SC×2 + 2}, the internal scan chains are chained together in order to reduce the number of needed connections to the TAM. Then the 'fill' process is applied. The simplified algorithm is presented in Fig. 2 . The algorithm defines the whole curve T = f(W) which has the form of a staircase (several architectures lead to the same test time). It returns as results the Pareto-optimal points that are the ones on the left most edges at each staircase level.
We have applied the proposed algorithm to the benchmarks from ITC'02 [Benc02] . Fig. 3 shows the results for core 5 from the d695 system, which contains 32 scan chains, 38 inputs and 304 outputs with 110 vectors. We obtain the different stages on the curve corresponding to an identical test time for several W values (i.e. several architectures). For example, the Pareto-optimal point {t=10100, W=35} is optimal with respect to all points that have the same test time (t=10100) but different numbers of TAM connections, such as {t=10100, W=51}. The algorithm computes the curve for each core on a system in a very short computation time (typically a few seconds). Depending on the system test requirements, the designer can then choose among the different widths and test times for each core. We have applied this algorithm to the whole ITC'02 benchmarks [MaIy02] [Benc02], and obtained these curves for every core [Poug02] .
W=1000000; //limit for the number of used bits

TAM Architecture and Test Scheduling
Once the wrapper design is completed or a set of wrapper design alternatives is available, as provided by the wrapper design algorithm, the designer has to deal with two issues, namely test scheduling of the cores, and the design of test access architecture. The test access architecture is responsible for the transportation of the test data from the system inputs to the core inputs and from the core outputs to the system outputs. For this purpose, we make use of a generalization of the TestBus architecture.
Test Schedule
The test scheduling problem consists of two interleaved NP-complete problems (bin-packing and minimal graphcoloring) [FlPo01] . The graph-coloring and bin-packing problems cannot be approximated in bounded limits when the graph has no special structure [GaJo79] . One way to simplify the test scheduling problem is to organize tests for the target modules into so-called test sessions [Mur00] [RaVe99] [ChSa97] . An alternative is to use a no session scheme, which allows minimizing the test time at the expense of area overhead and scheduling complexity. To reduce the scheduling time, a very fast heuristic based on the work by [FlPo01] has been used in the proposed approach (Fig. 4) .
The complexity of this algorithm is O(n 3 ). It can handle, for example, a schedule for more than 100 cores within one second. The algorithm works in the following way.
L1 = list of cores sorted by decreasing Di values (Di = test time of core i) L2 = Ø Tmax=0
While L1 ≠ Ø First, the core tests are sorted in decreasing order of test time in a list, L1. While all the tests are not fully scheduled, it checks the characteristics for each core test in order, and places them as soon as possible making sure that all the constraints are satisfied. If a core cannot be scheduled due to the violation of certain constraints, it is moved to an auxiliary list L2 to be scheduled later. When the L1 list is empty, L2 moves to L1 and the process is re-iterated.
TAM Design -Pseudo Exhaustive Approach
We have first developed an approach based on a generalization of the TestBus architecture to build the TAM. The wrapper design algorithm presented in section 2 allows us to obtain different possible costs for each core. We use a hyper-graph representation for the incompatibility of the tests we have to schedule. Our approach generates all the possible incompatibility configurations (i.e. all the hyper-graphs) in order to calculate their minimal cost using a mapping heuristic ('mapping' denotes the assignment of cores to busses), and the associated schedule. Each hypergraph corresponds to a different mapping and schedule. The algorithm memorizes and finally returns the best solution in terms of test time under a given TAM constraint. It chooses the Pareto-optimal points for each core (from the wrapper design algorithm), and then generates in an exhaustive way all the possible hypergraphs (configurations) for the TAM.
The vertices of the hyper-graph correspond to the tests and are weighted by the number of connections they need on the TAM. In a similar approach in [ChSa97] , the authors consider the test resources on the system and build the incompatibility based on them. Thus, their graph is just an expression of the existing resources. In our scheme, incompatibilities, represented by the hyper-edges, can be used to capture different test conflicts and constraints. An example of hyper-graph is given in Fig. 6 , where a vertex is labeled by a unique number and its weight in a parenthesis. If cores 1, 2, 3 and 4 cannot be tested pair-wisely at the same time, this information is captured by a hyper-edge connecting vertices 1, 2, 3 and 4.
We generate in theory the whole set of possible hypergraphs for a system, and therefore all the possible architectures for the test bus. In practice, a strategy based on space pruning is used to cut the branch-and-bound search to reduce the computation time. For each solution, we generate the mapping and consider the new incompatibility set to schedule the tests. The simplified algorithm is given on Fig. 5. Fig. 6 and Fig.  7 illustrate a hyper-graph example and its corresponding TAM architecture. We can also add incompatibilities (hyper-edges) to capture the conflicts due to the test of the glue logic between cores. For example, we can add a hyperedge between cores 5 and 9, and another between core 6 and 9, with the new core 9 representing the glue logic between cores 5 and 6.Once the mapping is ready, the algorithm uses a fast heuristic to schedule the tests. The scheduling heuristic takes into account all the incompatibilities, including the added ones.
We have applied the pseudo-exhaustive algorithm to the ITC'02 benchmarks. The used width w i and test time t i for each core are chosen manually on one of the Pareto-optimal points of the curves of each core. In Fig. 8, we show the results for the system h953 from [Benc02] . In this example, power dissipation of cores during test is taken into account and a width limit (32 bits) is fixed for the system test bus. Our scheduling algorithm takes also into the system power limit, even though the power data are not shown in the figure. Note that the test time in this example is determined by the largest core. Nevertheless, it is possible to choose other Pareto-optimal points for each core, which may be carried out manually. Checking all the possibilities for a system with ten cores with ten Pareto-optimal points for each core means checking 10 10 possibilities. Automatic checking for all these possibilities is therefore only efficient for systems involving a small number of cores. For larger systems, the computation time becomes too prohibitive.
TAM Design -Efficient Heuristic
We have implemented a fast heuristic in order to reduce the computation complexity. The heuristic first sorts the design alternatives obtained with the wrapper design algorithm using the cost function C i = t i ×w i . Then, it uses the scheduling approach presented in section 3.1 taking into account now also constraints on W. The algorithm, illustrated in Fig. 9 (named ScheduleW) selects appropriate wrapper designs to schedule the tests. In this algorithm, "i" is the wrapper design index. If the constraint on W is not respected, the algorithm searches the next value for (w i t i ) with the lowest possible cost C i . It schedules the test as soon as possible with respect to the whole set of constraints, and maximizes the TAM use (i.e. the number of TAM bits used).
…. For all intervals
For 
Experimental Results and Discussions
We have applied our wrapper design, TAM design and test scheduling algorithms to the ITC'02 benchmarks. The results for systems q12710, d695, p22810, and p34392 are given in Tables 2-5, respectively. For each system, results for a set of design alternatives, corresponding to different TAM widths, are reported. A row in the tables corresponds to a given TAM width. For each TAM width, the first group of columns gives the results of the pseudo-exhaustive approach, in terms of the test time indicating the quality of the test schedule, and the CPU time used to generate the solution. The last group of columns gives the corresponding results of our efficient heuristic, as well as how they compare to the other approaches.
For comparison, we have implemented another fast way to solve the TAM design and test scheduling problems by adding multiplexors to the test bus. The schedule is generated in a straightforward way: the tests are performed sequentially using the maximum TAM width available. Therefore, the Pareto-optimal points are chosen as close as possible to the TAM width limit. The results of this multiplexed approach are given in the middle column of Tables 2-5. The experimental results show clearly that our approach outperforms the multiplexed approach in terms of test scheduling lengths (on average, the test time of our approach is 31,1% smaller than that of the multiplexed approach). When compared with the pseudo-exhaustive approach, our algorithm consumes only a tiny fraction of CPU time needed, while producing relatively comparable test scheduling results. Note also that the pseudoexhaustive approach does not work for larger systems, such as p22810 and p34392. Therefore there are not pseudoexhaustive results in Tables 4 and 5, nor are there any in  Table 3 for TAM widths between 12 and 48 bits.
Let us look closely at one example, the system d695, which is illustrated in Fig. 10 . The TAM width limit is 32 for this example and the power limitation is 1300mW. The characteristics of the cores are given in Table 1 . In the ITC'02 benchmark specification, no power data are given for this system. Therefore, we add power values for each core. We have also added two precedence constraints for this system: cores 7 and 5 have to be tested before core 10, and core 6 can only be tested after cores 7 and 8 have been tested. These constraints are given in the last column of Table 1 . One possible motivation for the constraints is that core 5 has to be tested first because it is a core containing potentially more faults than other cores. The generated schedule, given in Fig. 10 , respects the TAM width constraint (32 bits) on the system test bus, as well as the specified precedence constraints. The dashed line in Fig. 10 represents the instantaneous dissipated power during test.
In [IyCh02] and [IyCh02a] , a similar approach is presented for wrapper/TAM co-optimisation assuming that all tests are compatible. Therefore, that approach cannot deal with the cases when some tests cannot be carried out simultaneously, due to, for example, design hierarchy constraints.
Our algorithms deal, on the other hand, with the incompatible properties explicitly. Additional features of our approach are that it addresses the issues of power dissipation, precedence constraints, incompatibilities from test resource sharing, and interconnection test in a systematic manner. The issue of interconnection test is in particular important considering the increasing importance of interconnection in SoC designs. Using the wrappers, it is possible to test interconnections between two cores (which is included as a part of the wrapper design algorithm) and to add this test as a new item into the schedule (no extra wires are necessary for this test), with our approach.
For example, on the resulting schedule of the system d695 in Fig. 10, 164 cycles are necessary to test interconnections between cores 5 and 6 and 19 wires are used (these data are extracted from the wrapper design algorithm).
Conclusions
We have presented several techniques developed to help the system designers to implement and optimise a test structure for the whole system on silicon. Firstly, the wrapper design is built by a fast algorithm, optimising the test time of a core for a given number of connections to the TAM. This algorithm gives all the possible implementations of the wrapper for a given core. We can then select one of the three approaches for TAM design and test scheduling. The first one uses a pseudo-exhaustive search, which entails prohibitive computation times for large systems. The other two are fast heuristics, one is based on an implementation of a multiplexed approach, and the other is a heuristic developed by us. We have demonstrated, with experiments on the ITC'02 benchmarks, the efficiency of our approach, and compared it with other approaches. Additionally, our algorithms have several interesting features, such as that it addresses precedence and power constraints and interconnection test, which are not all implemented on other approaches. In particular, the interconnection test issue is becoming more and more important for the next generation SoCs.
Cores Ci ti wi Pi Prec. i   1  416  2  30mW  -2  7992  3  150mW  -3  5167  2  250mW  -4  11129  6  100mW  -5  10100  19  400mW  -6  9869  19  950mW  7, 8  7  12959  10  700mW  -8  4605  11  450mW  -9  2820  19  350mW  -10 7106 17 550mW 7, 5 
