SUMMARY Connection of internal scan chains in core wrapper design (CWD) is necessary to handle the width mismatch between the TAM and the internal scan chains. However, conventional serial connection of internal scan chains incurs power and time penalties. Our study shows that the distribution and high density of don't-care bits (X-bits) in test patterns make overlapping and partial overlapping of scan slices possible. A novel parallel CWD (pCWD) approach is presented in this paper for lowering test power by shortening wrapper scan chains and adjusting test patterns. To achieve shift time reduction from overlapping in pCWD, a two-phase process on test patterns, partition and fill, is presented. Experimental results on d695 of the ITC2002 benchmarks demonstrate that the shift time and test power are decreased by 1.5 and 15 times, respectively. In addition, the proposed pCWD can be used as a stand-alone time reduction technique, which has better performance than previous techniques.
Introduction
Pre-designed intellectual property (IP) cores are increasingly used in System-on-a-chip (SOC) integration. Complex SOC integration creates serious challenges for test. These challenges include test time, test data volume, testing power, and expensive testers [1]. Among these factors, test time is dominant in high-volume testing of ICs. Hence test time reduction is highly desirable.
Test time reduction can be obtained through scan chain reconfiguration and test data compression. Lee [2] proposed a methodology that uses one scan-in signal to feed multiple circuits. The Illinois scan architectures in [3] [4] showed a broadcast architecture in which one scan-in signal is fed to multiple scan chains. [5] proposed a type of reconfigurable shared scan-in architecture. Using one scan-in to drive multiple scan chains can shorten scan chains, thus reducing the shift time of scan data. Test data compression is another way to reduce test application time. Golomb and VIHC codes have been presented in [6] and [7] to compress test data of embedded cores in SOCs. [8, 9, 10] presented a tree architecture to organize the scan cells to reduce the input data volume and test application time. [11, 12] described methods of test data compression using statistical encoding by first converting some specified input values in the test set to unspecified logic values. Reda [13] presents a mutation decoder to compress test data into a data stream that indicates which bits need to be flipped in the current test slice to obtain the subsequent one. [14] uses a linear mapping network to drive a high number of internal scan chains with a short external scan chain. [15] attempts to overlap consecutive test vector seeds based on the technique of [14], which can reduce both test time and test data volume.
Another problem associated with testing SOCs is the high testing power. Testing SOC cores in parallel poses the risk of exceeding certain power thresholds, which places the chip at risk of damage. Reconfiguration of the scan architecture can reduce testing power drastically. The utilization of externally controlled gates [16] and modification of scan chains through logic gate insertion between the scan cells [17] [18] have been shown to be effective for testing power reduction. The techniques of partitioning [19] and freezing [20] scan chains have been proposed to decompose the scan chains into several partitions, of which only some are active at any time. [21] presents a 2D scan tree to shorten the scan chains to reduce the testing power. The modification of the hardware comes at the price of performance degradation. On the other hand, some techniques [22] [23] do not modify the scan chains but adjust test patterns.
Reduction of test time and power can be achieved simultaneously in [24] [25] and [26]. However, these techniques are only suited to core providers, not integrators, since they aim to reconfigure the scan architecture of the embedded core. In an SOC, reduction of test time can be achieved by core wrapper design (CWD). CWD is a kind of DFT logic that provides test access for both core-internal and core-external testing. Width adaptation is the main function of CWD. When the width of the TAM is insufficient for loading test patterns in parallel, serialization is necessary at the input and output of the internal scan chains (ISCs) [27] [28]. However, serialization prolongs the path of loading a bit into the ISCs, which multiplies the scan-in power and shift time, as we show in the next section. To overcome the power problem, a novel architecture of parallel wrapper scan chains is presented in this paper, which focuses on the connection of ISCs. Parallel connection can reduce testing power drastically. The high density of X-bits in test data is exploited to make consecutive test slices overlap, thereby reducing shift time and testing power simultaneously.
The rest of this paper is organized as follows: Section 2 discusses the test power and time penalty of serial connection when the system integrator wraps the bare cores. Section 3 introduces the overlapping and partial overlapping of scan slices. Section 4 presents the parallel CWD and a two-phase process. Section 5 shows experimental results to demonstrate the effectiveness of the new wrapper architecture.
Power and Time Penalty of Serial Connection
Fig. 1 shows a conventional serial connection in a core wrapper design. Because the available scan width of the TAM is 2, eight internal scan chains (ISCs) are connected into two wrapper scan chains (WSCs). The length of the new WSCs is 4 times that of the ISCs in the bare core. The prolonged scan path increases not only the test application time but also the power dissipation in the scan chains during shift, as shown below. In this paper, the Weighted Transition Metric (WTM) proposed in [23] is used to estimate test power dissipation in scan chains during scan test. Only the scan-in power is considered, since the analysis of the scan-out power is similar.
We assume the length of all N ISCs is l. A scan vector for an internal scan chain s in the core is 
We define the transition probability of two consecutive bits in one test vector as the probability that a transition occurs between these two bits when the whole test pattern is applied. If the test patterns are randomly filled, the transition probability of any two consecutive bits is assumed to be the same.
If N is the number of internal scan chains and M is the number of wrapper scan chains, the following property shows the relation between power consumption and the width of the TAM in a wrapped core:
Property 1: With the assumption that the transition probability of any two consecutive bits is the same in serial connection (N > M ), the following equation holds,
Here the quantity subscripted "core" is the scan-in power dissipated in a wrapped core. We use P_R to denote the transition probability of any two consecutive bits in test vectors. For a bare core, the Hamming distance between two slices can now be formulated as HD = N × P_R. Then the formulation of the scan-in power of the bare core can be transformed into:
The power of wrapped core can be calculated as:
Property 1 means that, when N ISCs are wrapped into M WSCs, the test power of the wrapped core becomes (N/M) times that of the bare core. Property 1 is based on the assumption of random patterns. However, in our experiments the patterns generated by MINTEST for some ISCAS'89 circuits exhibit the same behavior.
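A rough numerical check of Property 1 can be sketched as follows. The sketch assumes that the weighted transition metric charges a transition between bits i and i+1 of a chain of length L with weight L − i, and that every pair of adjacent bits toggles with the same probability P_R. The function names and the example sizes are illustrative and are not taken from the paper.

```python
def expected_chain_wtm(length, p_r):
    """Expected WTM of one chain: sum over i = 1..length-1 of (length - i) * p_r."""
    return p_r * length * (length - 1) / 2.0


def expected_core_wtm(num_chains, length, p_r):
    """Expected scan-in WTM of a core with num_chains equal-length chains."""
    return num_chains * expected_chain_wtm(length, p_r)


def serial_power_ratio(n_isc, m_wsc, isc_len, p_r=0.5):
    """Ratio of wrapped-core to bare-core expected WTM under serial connection.

    Assumes n_isc is a multiple of m_wsc, so each wrapper chain concatenates
    n_isc / m_wsc internal chains of length isc_len.
    """
    wsc_len = isc_len * n_isc // m_wsc
    bare = expected_core_wtm(n_isc, isc_len, p_r)
    wrapped = expected_core_wtm(m_wsc, wsc_len, p_r)
    return wrapped / bare


# Eight ISCs of length 100 wrapped into two WSCs: the ratio is close to N/M = 4.
ratio = serial_power_ratio(n_isc=8, m_wsc=2, isc_len=100)
```

Under these assumptions the ratio equals (l·N/M − 1)/(l − 1), which approaches N/M as the chain length l grows.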
Test application time is also related to the width of a TAM. The test application time for an embedded core can be obtained by T = (1 + max(S_i, S_o)) × P + min(S_i, S_o) [28], where P is the number of test patterns, and S_i and S_o denote the lengths of the longest scan-in and scan-out chains. [29] shows that the test application time is also approximately inversely proportional to the width of the TAM. However, there are usually numerous X-bits in test patterns. Since X-bits do not contribute to fault coverage, the time spent shifting them is wasted and should be reduced.
Slices Overlapping and Partial Overlapping
Shift time, the number of cycles needed to transfer the test data to the internal scan chains, is the major part of the test application time. Since a test pattern may contain a large number of X-bits, the X-bits deteriorate the shift efficiency. The shift efficiency metric is defined as the ratio of the total number of specified bits to the total number of shift cycles, and describes the effectiveness of overall shifting. For the test patterns shown in Fig. 2, there are 16 specified bits and 24 X-bits. If these test patterns are shifted into scan chains using serial CWD (sCWD) through one input, 40 cycles are needed. In this case, the shift efficiency is only 0.4. That is, more than half of the time is wasted in shifting X-bits.
Fig. 2 Slices overlapping in test patterns
Observing the distribution of X-bits in test patterns can help us find a method to handle this waste. Some experiments on the large ISCAS'89 circuits were conducted to study the distribution of X-bits. Fig. 3 illustrates the distribution of X-bits in test patterns of s13207, which has 700 PIs. The number of X-bits in each test pattern of the whole test data is listed. The test data of the other ISCAS'89 circuits are similar to those of s13207. From Fig. 3, we can observe the following two facts:
• Fact 1: The density of X-bits in test patterns is high. For s13207, X-bits occupy 93.2% of the total test bits.
• Fact 2: X-bits are clustered and asymmetrical. The X-bit ratio in test patterns generated later is higher than in those generated earlier.
These facts indicate that two consecutive scan slices in test patterns have a high probability of overlapping. Two scan slices overlap when they become equal after their X-bits are assigned. In the example shown in Fig. 2, the test patterns contain 10 scan slices. Each slice consists of 4 bits. The first and second slices in the original test pattern are {1 X X 1} and {X X 1 1}. If the second and third bits in the first slice and the first and second bits in the second slice are assigned to 1, the resulting scan slices will be {1 1 1 1} and {1 1 1 1}, and they overlap. If consecutive slices overlap, an efficient method for shifting the slices into the scan chains is to load the first slice serially into an additional scan chain and then freeze this scan chain. In the following cycles, we only need to load the test data into the internal scan chains in parallel through the additional scan chain. Thus, many redundant cycles can be avoided. The corresponding scan architecture is shown in Fig. 4. Four inputs of internal scan chains are connected to one external scan chain (ESC) in parallel. Using the overlapping signature of the test patterns shown in Fig. 2, after partition there are three slices to seed: {1 1 1 1}, {0 1 0 0}, and {0 0 0 1}. Note that the first and second bits of the second slice overlap with the third and fourth bits of the third slice, respectively. Instead of fully scanning in every different slice, partial overlapping can be applied to two adjacent and different slices to reduce shifting cycles. In this example, if partial overlapping is used, the third slice can be shifted into the input external scan chains (iESCs) in only two cycles based on the second slice. The total number of shift cycles becomes 17. That is, the shift efficiency is improved to 0.94.
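The notions of slice overlapping and shift efficiency used above can be sketched in a few lines. This is a minimal illustration that assumes slices are written as strings over the characters 0, 1 and X; the helper names are hypothetical.

```python
X = "X"


def compatible(a, b):
    """Two slices can overlap if no position holds conflicting specified bits."""
    return len(a) == len(b) and all(p == q or X in (p, q) for p, q in zip(a, b))


def merge(a, b):
    """Combine two compatible slices, keeping every specified bit."""
    if not compatible(a, b):
        raise ValueError("slices conflict in at least one position")
    return "".join(q if p == X else p for p, q in zip(a, b))


def shift_efficiency(specified_bits, shift_cycles):
    """Ratio of specified bits to shift cycles, as defined in the text."""
    return specified_bits / shift_cycles


# First two slices of the example: {1 X X 1} and {X X 1 1}.
seed = merge("1XX1", "XX11")          # the remaining X can be filled freely
serial = shift_efficiency(16, 40)     # serial CWD through one input
parallel = shift_efficiency(16, 17)   # with overlapping and partial overlapping
```

The merged slice keeps one X-bit, which the text fills with 1 so that both slices become {1 1 1 1}.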
Proposed Approach

Proposed Core Wrapper Design
In order to load test data in parallel, each scan-in or scan-out terminal should also be wrapped with a scan cell. These scan cells, also called items, are proposed in the normal architecture of the IEEE P1500 standard. The parallel CWD (pCWD) is shown in Fig. 4. For clarity, other hardware included in IEEE P1500 is not shown in this figure. Some XOR gates are inserted between the standard wrapper cells of the scan-out terminals. The XOR gates and scan cells are combined into several MISRs. All these MISRs are concatenated to form one large MISR, which helps to reduce error aliasing.
Fig. 4 The proposed pCWD architecture (with the logic of U for width = 2)
The concatenated MISR is controlled by a normal clock: CLK. It is always active during test application. A simple logic unit, U , is used to generate the clock of input external scan chains (iESCs) and internal scan chains (ISCs). The logic test application of pCWD then proceeds by looping over the following two steps:
Step 1: Seed the iESCs: First, CLK1 is activated and CLK2 is disabled. Then the scan slices are shifted into the iESCs from the TAM, and at the same time the MISR is reconfigured into multiple scan chains. The results accumulated in the concatenated MISR are unloaded into the TAM and transported to the ATE for comparison.
Step 2: Load the scan slices: CLK1 is disabled and CLK2 is activated. Then the scan slices stored in the iESCs are loaded into the ISCs in parallel. If multiple consecutive scan slices overlap, this step is repeated until a collision arises. In this procedure, the concatenated MISR is configured as a signature generator that accumulates and compacts the contents of the scan chains to be scanned out.
If the ISCs and iESCs are triggered by different edges of a clock, the last cycle of Step 1 and the first cycle of Step 2 can be overlapped in the same scan clock cycle. This condition can easily be satisfied by inserting an inverter in the clock tree.
The pCWD is fully compatible with the IEEE P1500 standard. The bypass mechanism can easily be added to our design, though it does not appear in Fig. 4. The clock CLK can be the normal scan clock needed in the conventional serial connection. Compared with the standard architecture, an additional signal "mode" is needed to switch between the states. It can be included in the TAM.
Two-Phase Processing on Test Patterns
To achieve the minimum shift time, test patterns need to be partitioned into some overlapping slice sets. 
From this formulation, we can see that two attempts can be conducted to achieve minimum shift time. The first one is to minimize the number of blocks. The second one is to maximize the partial overlapping between two adjacent and different slices. To accomplish these two attempts, we use a two-phase process: First, a test pattern is partitioned in order to find a minimal number of blocks to cover all slices; second, fill the remaining X-bits in blocks to find the maximum partial overlapping results.
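The first phase can be illustrated with a greedy left-to-right partition that extends the current block for as long as the next slice is compatible with everything already in it. This is only a sketch under stated assumptions: it is a simple greedy stand-in and not necessarily the MBLP procedure, the slice strings are hypothetical data rather than the patterns of the figures, and the function names are illustrative.

```python
def merge_or_none(a, b):
    """Merge two slices over {0, 1, X}; return None if specified bits conflict."""
    out = []
    for p, q in zip(a, b):
        if p == "X":
            out.append(q)
        elif q == "X" or p == q:
            out.append(p)
        else:
            return None
    return "".join(out)


def greedy_partition(slices):
    """Split consecutive slices into blocks of mutually compatible slices.

    Returns a list of (first_index, last_index, seed), where seed is the merge
    of all slices in the block and may still contain X-bits.
    """
    if not slices:
        return []
    blocks = []
    start, seed = 0, slices[0]
    for i in range(1, len(slices)):
        merged = merge_or_none(seed, slices[i])
        if merged is None:            # collision: close the current block
            blocks.append((start, i - 1, seed))
            start, seed = i, slices[i]
        else:
            seed = merged
    blocks.append((start, len(slices) - 1, seed))
    return blocks


# Hypothetical ten-slice pattern, four bits per slice.
pattern = ["1XX1", "XX11", "1X1X", "0X00", "0XX0",
           "XX00", "0X0X", "X1X1", "XX01", "X10X"]
blocks = greedy_partition(pattern)
```

For this input the sketch yields three blocks whose seeds still contain X-bits; those remaining X-bits are what the second phase fills.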
The first phase is to find a minimal number of blocks. By checking the compatibility between all pairs of scan slices, we can create an incompatibility graph. The incompatibility graph G(V, E) is an undirected graph, which consists of a set of nodes V and a set of edges E. Each node corresponds to a scan slice. The index of a node is the order of the corresponding slice in the test pattern. There is an edge E_ij between nodes i and j when the corresponding slices are incompatible.
So the first maximal c-clique is B_1 = {s_1, s_2, s_3}. The step 3 in MBLP fills all X-bits with 1 in B_1 in order to keep overlapping. In step 3, the same partition procedure is applied and the second maximal c-clique B_2 = {s_4, s_5, s_6, s_7} is found. All X-bits in the set v_2 (the second row in B_2) remain since no specified bit in this row is found.
Step 4 finds the third maximal c-clique B_3 = {s_8, s_9, s_10} and fills its X-bits. The X-bits in v_1 (the first row in B_3) remain. Next, the second-phase process fills the remaining X-bits in the seed slices after the first partition phase is finished. Because the seed slices with X-bits of each block have been determined by the MBLP algorithm, the X-filling procedure of the second phase is local. An efficient way of filling the X-bits in seed slices can result in shorter seeding time. A simple longest string match algorithm can be used to find the maximum compatible bits between two adjacent and different seeds. The time complexity of this algorithm
Example 2: An example of the second-phase X-filling process is shown in Fig. 5 (2). After the first-phase process, 3 seed slices with X-bits are obtained: {1 1 1 1}, {0 X 0 0}, and {X 1 0 1}. Examining the second and third seed slices, {0 X} is compatible with {0 1}. Thus, the remaining X-bit in the second seed slice is filled with 1 to achieve two partially overlapping bits.
The partitions that lead to the minimal number of c-cliques are not unique. Some other partition schemes may achieve a better result than MBLP because they consider the optimization space of the second phase when doing the first-phase optimization. In the example of Fig. 5, if all test patterns are divided into the three blocks {s_1, s_2, s_3}, {s_4, s_5, s_6}, and {s_7, s_8, s_9, s_10}, the three seed slices will be {1 1 1 1}, {0 X X 0}, and {X 0 0 1}. If the second seed slice is filled as {0 0 1 0}, there are three overlapping bits between the second and third seed slices. The total shift time will be reduced to 16. To achieve the solution with the highest time reduction, a branch-and-bound algorithm can be used. We experimented with three algorithms: MBLP, branch-and-bound, and the theoretical bound, and the results are shown in Fig. 6. Each core has 8 scan chains. The maximum theoretical bound is achieved by assuming all consecutive slices overlap. From the figure, for five cores in d695, the branch-and-bound algorithm achieved an average improvement of 5% in reduction compared to MBLP. However, its search space is exponential in the number of scan slices in the worst case. The CPU running time of the MBLP and branch-and-bound algorithms is reported in Table 1. It is clear that the tradeoff between time complexity and efficiency must be considered when a large number of test patterns need to be handled in practice.
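The second-phase matching in Example 2 can be sketched as follows. The sketch assumes, following the example, that the leading bits of the earlier seed are matched against the trailing bits of the later seed, and that a full-width match is excluded because adjacent seeds are different; the function names are hypothetical.

```python
def bits_compatible(p, q):
    """A pair of bits is compatible if equal or if either one is an X-bit."""
    return p == q or "X" in (p, q)


def partial_overlap(prev_seed, next_seed):
    """Largest k < width such that the first k bits of prev_seed are
    compatible with the last k bits of next_seed."""
    n = len(prev_seed)
    for k in range(n - 1, 0, -1):
        if all(bits_compatible(p, q)
               for p, q in zip(prev_seed[:k], next_seed[n - k:])):
            return k
    return 0


def fill_for_overlap(prev_seed, next_seed, k):
    """Fill X-bits inside the k overlapping positions so that both seeds agree.

    Positions where both seeds hold an X-bit are left unfilled.
    """
    n = len(prev_seed)
    prev, nxt = list(prev_seed), list(next_seed)
    for i in range(k):
        j = n - k + i
        if prev[i] == "X":
            prev[i] = nxt[j]
        elif nxt[j] == "X":
            nxt[j] = prev[i]
    return "".join(prev), "".join(nxt)


# Seeds from Example 2: {0 X 0 0} and {X 1 0 1}.
k = partial_overlap("0X00", "X101")
filled_prev, filled_next = fill_for_overlap("0X00", "X101", k)
```

Only X-bits inside the overlapping window are touched, so the filling stays local to the two adjacent seeds.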
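The shift-time totals quoted for the two partitions can be reproduced with a simple cycle count. The accounting below is an assumption chosen because it matches the totals of 17 and 16 cycles in the text, and it is not necessarily the paper's own formulation: the first seed costs one cycle per bit, each later seed costs only its non-overlapping bits, every slice costs one parallel load cycle, and one cycle per block is saved because the last seeding cycle and the first load cycle coincide. The function name and parameters are hypothetical.

```python
def total_shift_cycles(num_slices, slice_width, overlaps):
    """Estimate shift cycles for a partitioned test pattern.

    overlaps[i] is the number of partially overlapping bits between
    seed i and seed i + 1, so there are len(overlaps) + 1 blocks.
    """
    num_blocks = len(overlaps) + 1
    seed_cycles = slice_width + sum(slice_width - k for k in overlaps)
    load_cycles = num_slices
    return seed_cycles + load_cycles - num_blocks


# Ten 4-bit slices in three blocks; no overlap between the first two seeds.
mblp_like = total_shift_cycles(10, 4, [0, 2])     # two overlapping bits
alternative = total_shift_cycles(10, 4, [0, 3])   # three overlapping bits
```

With this accounting, one extra overlapping bit between adjacent seeds removes exactly one shift cycle.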
Test Power Reduction (Structural and Algorithmic)
As described in Section 2, test power increases when multiple ISCs are wrapped into fewer WSCs. High test power increases the risk of heat damage and limits the number of cores that can be tested in parallel. In sCWD, the major part of the test power dissipation is caused by the longer wrapper scan chains. Our pCWD addresses this problem by shortening the shift path of a transition. The next property shows that the test power of pCWD is close to that of a bare core.
Property 2:
If the length of the iESCs is far smaller than the length of the ISCs, then, with parallel connection, the test power of the wrapped core approximates that of the original test on the bare core. The test power of a bare core for all test patterns is:
The scan-in power of the wrapped core is:
where d is the length of iESCs. If d << l, we can omit the scan power consumed in iESCs, so the resulting relation is:
The above property indicates that the reduction of scan-in power will benefit from structural modification. The further optimization of scan-in power can be obtained from proper assignment of X-bits.
To minimize test power, a looking-forward algorithm [22, 30] can be used to specify X-bits. It has been shown that the scan-in power of a vector depends not only on the number of transitions in it but also on their relative positions. For example, considering a vector t_1 t_2 t_3 t_4 t_5 = 10010, where t_1 is shifted into the scan chain first, the 1→0 transition between t_1 and t_2 causes more switching activity than the 1→0 transition between t_4 and t_5. This example shows that transitions in the front part of a vector should be avoided as far as possible. The looking-forward algorithm requires that, to minimize scan-in power, an X-bit be assigned the value of the nearest specified bit ahead of it in the test vector.
The goals of minimizing test time and test power can sometimes conflict. For example, consider consecutive bits in two blocks: {..., (0 0), (X X 1), ...}. To obtain overlapping, the two X-bits in the second block must be assigned 1. However, to minimize test power, they must be assigned 0, since the nearest specified bit ahead of the X-bits is 0. This collision is rare, however, since the X-filling procedure in the two-phase process tends to avoid transitions within a block. Avoiding transitions in consecutive slices helps to reduce test power drastically. [30] shows the conflict between test data compression and test power reduction. It indicates that filling 0 into all X-bits (0-Fill) can achieve a better tradeoff between test data volume reduction and test power reduction. Our experimental results in Fig. 7 show that the two-phase X-fill process reduces more test power than the 0-Fill scheme. The number of scan chains in s13207 and s15850 is 16, while the number of scan chains in every other circuit is 32. The 0-Fill algorithm is not very efficient since it results in more test power dissipation. The results of our X-filling procedure are similar to those of the looking-forward algorithm. The remaining room for power optimization using X-bit filling is smaller than 10% after our two-phase X-filling process is conducted.
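The position dependence described above, and the looking-forward fill rule, can be sketched as follows. The sketch assumes the usual form of the weighted transition metric, in which a transition between t_i and t_(i+1) in a vector of length l has weight l − i, and it reads "ahead of" as "earlier in the vector", which matches the { (0 0), (X X 1) } example in the text. Copying the first specified bit into leading X-bits is an extra assumption, and the function names are hypothetical.

```python
def wtm(vector):
    """Weighted transitions of a fully specified scan-in vector (t_1 first)."""
    n = len(vector)
    return sum(n - i for i in range(1, n) if vector[i - 1] != vector[i])


def looking_forward_fill(vector, default="0"):
    """Give each X-bit the value of the nearest specified bit before it.

    Leading X-bits have no earlier specified bit; as an assumption they copy
    the first specified bit, or `default` if the vector is all X-bits.
    """
    bits = list(vector)
    last = None
    for i, b in enumerate(bits):
        if b == "X":
            if last is not None:
                bits[i] = last
        else:
            last = b
    first = next((b for b in bits if b != "X"), default)
    return "".join(first if b == "X" else b for b in bits)


example = wtm("10010")                   # transitions weigh 4, 2 and 1
low_power = looking_forward_fill("00XX1")
for_overlap = "00111"                    # the same bits with the X-bits set to 1
```

Comparing wtm(low_power) with wtm(for_overlap) shows the conflict between filling for power and filling for overlap in this small case.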
Experimental Results
This section shows some experimental results to verify the efficiency of the proposed wrapper design methodology. The source code implementing the algorithms in this paper is about 3000 lines of C. The program is executed on a PC with a 1.6 GHz P4 processor and 384 MB of RAM. The first series of experiments is performed on some large cores of the academic SOC benchmark d695 from Duke University [33]. The specification of these cores is listed in Table 2. Pat#, X-ratio and SC# in Table 2 are the number of patterns, the ratio of X-bits in all test patterns, and the number of scan chains. The test patterns of all these cores are obtained from the MINTEST ATPG program [31]. Table 3 shows the experimental results for d695. In this table, ST# is shift time and SIP# is scan-in power. w is the input width of the data TAM. We consider the mode signal as a data line, so it is included in w. Thus, the number of external scan chains is w − 1. We use scan% to measure the area overhead, where scan% = (the number of scan cells in external scan chains) / (the number of scan cells in internal scan chains). In sCWD, the X-bits are randomly filled. In our pCWD scheme, the two-phase process based on the MBLP algorithm is used to optimize the overlapping. From the last column of Table 3, we can see that the test time of pCWD has decreased to 75% while the area overhead is very low. The test power of pCWD has been reduced to 5% with scan chain reconfiguration and the X-Fill algorithm. From the data of Table 2 and Table 3, we found that the test time reduction is related to the ratio and the distribution of X-bits in test patterns. A high density of X-bits gives better results. The density of X-bits in s35932 and s38417 is lower than that of the other cores, so less test time reduction is obtained.
The proposed pCWD architecture can be used as a stand-alone time reduction technique. We show some experimental results compared with previous techniques, such as Golomb encoding and mutation encoding. The test patterns were generated by the MINTEST ATPG tool and are the same as in [6, 13]. Table 4 shows the experimental results. In pCWD, the scan chains are divided into 8 scan chains and wrapped with one external scan chain. Unlike in the first series of experiments, the two-phase process based on the branch-and-bound algorithm is used here in order to achieve the best results. Our technique obtains higher time reduction for three of the five cores. As for total test time, our pCWD achieves an 11.7% improvement compared with the FDR coding method.
In Table 5, the results of pCWD are compared with previous techniques that are based on different test sets. The ATPG and compaction procedures used in these previous techniques strongly affect the results. In our pCWD, we still use MINTEST patterns. However, we try to get the best results by selecting an optimal length of ESCs to minimize the number of blocks and maximize the partially overlapping bits. l in Table 5 is the optimal length of ESCs used to obtain the data listed in the ST# (shift time) column. SE# is the shift efficiency. It is clear that, even though the set of test cubes of our pCWD (MINTEST) is not optimized, the results still compare favorably. The average shift efficiency of each core is improved to 1.11.
Conclusions
A new approach to core wrapper design based on parallel connection of internal scan chains has been proposed. It is compatible with IEEE P1500 and could be viewed as a partial extension of the IEEE P1500 wrapper. Because of scan slice overlapping and partial overlapping, pCWD can reduce test time, since this architecture can load multiple useful bits into the internal scan chains in one cycle. To obtain this reduction, a two-phase process is presented to minimize the number of overlapping partitions and maximize the partial overlapping.
The parallel connection reduces test power by shortening the path along which a transition propagates. Furthermore, the X-filling procedure in the two-phase process reduces test power at the algorithmic level. With all of these measures applied, about a 1.5 times reduction of test time and about a 15 times reduction of test power can be obtained compared to conventional serial connection.
