Abstract-We present a selective encoding method that reduces test data volume and test application time for scan testing of intellectual property (IP) cores. This method encodes the slices of test data that are fed to the scan chains in every clock cycle. Unlike many prior methods, the proposed method does not encode all the specified (0s and 1s) and unspecified (don't-care) bits in a slice. For example, if a slice contains more 1s than 0s, only the 0s are encoded and all don't-cares are mapped to 1. We use only c tester channels, where c = ⌈log_2(N + 1)⌉ + 2, to drive N scan chains. In the best case, we can achieve compression by a factor of N/c using only one tester clock cycle per slice. We derive an upper bound on the density of care bits (either 1s or 0s) that allows us to achieve the best-case compression. The pattern decompression is of the continuous-flow type because no complex handshakes are required between the tester and the chip. Unlike popular compression methods such as EDT and SmartBIST, the proposed approach is suitable for IP cores because it does not require structural information for fault simulation, dynamic compaction, or interleaved test generation. The on-chip decoder is small, independent of the circuit under test and the test set, and it can be shared between different circuits. We present compression results for a number of industrial circuits, and compare our results to other recent compression methods targeted at IP cores. We show that up to 28x reduction in test data volume and 20x reduction in testing time are obtained for these circuits.
I. INTRODUCTION
Test data volume is now recognized as a major contributor to the cost of manufacturing testing of integrated circuits (ICs) [1]-[4]. Recent growth in design complexity and the integration of embedded cores in system-on-chip (SoC) ICs has led to a tremendous growth in test data volume; industry experts predict that this trend will continue over the next few years [5]. For example, the test data volume in 2014 is expected to be as much as 150 times the data volume in 1999 [6].
High test data volume leads to an increase in testing time. In addition, high test data volume may also exceed the limited memory depth of automatic test equipment (ATE). Multiple ATE reloads are time-consuming because data transfers from a workstation to the ATE hard disk, or from the ATE hard disk to ATE channels, are relatively slow; the upload time ranges from tens of minutes to hours [7]. Test application time for scan testing can be reduced by using a large number of internal scan chains. However, the number of ATE channels that can directly drive scan chains is limited due to pin count constraints.
Logic built-in self-test (LBIST) [8] has been proposed as a solution for alleviating the above problems. LBIST reduces dependencies on expensive ATEs and allows precomputed test sets to be embedded in test sequences generated by BIST hardware to target random-pattern resistant faults. However, the memory required to store the top-up patterns for LBIST can exceed 30% of the memory used in a conventional ATPG approach [8]. With increasing circuit complexity, the storage of an extensive set of ATPG patterns on-chip becomes prohibitive [1]. Moreover, BIST can be applied directly to SoC designs only if the embedded cores are BIST-ready; considerable redesign may be necessary for incorporating BIST in cores that are not BIST-ready.
Test data compression offers a promising solution to the problem of increasing test data volume. A test set T_D for the circuit under test (CUT) is compressed to a much smaller data set T_E, which is stored in ATE memory. An on-chip decoder is used to generate T_D from T_E during test application. A popular class of compression schemes relies on the use of a linear decompressor. These techniques are based on LFSR reseeding [9]-[12] and combinational linear expansion networks consisting of XOR gates [13]-[15], and they have been implemented in commercial tools such as TestKompress from Mentor Graphics [1], SmartBIST from IBM/Cadence [3], and DBIST from Synopsys [16]. These compression schemes exploit the fact that scan test vectors typically contain a large fraction of unspecified bits even after compaction. However, the on-chip decoders for these techniques are specific to the test set, which necessitates decompressor redesign if the test set changes during design iterations. Moreover, since the decoder is not generic, it cannot be shared by multiple cores in an SoC. Finally, in order to achieve the best compression, these methods resort to fault simulation and test generation. As a result, they are less suitable for test reuse in SoC designs based on IP cores.
Another category of compression methods uses statistical coding, variants of run-length coding, dictionary-based coding, and hybrid techniques [17]-[26]. These methods exploit the regularity inherent in test data to achieve high compression. However, most of these schemes target single scan chains and they require synchronization between the ATE and CUT.
We present a selective encoding method that reduces test data volume and test application time for scan testing of intellectual property (IP) cores. This method encodes the slices of test data that are fed to the scan chains in every clock cycle. Unlike many prior methods, the proposed method does not encode all the specified (0s and 1s) and unspecified (don't-care) bits in a slice. For example, if a slice contains more 1s than 0s, only the 0s are encoded and all don't-cares are mapped to 1. We use only c tester channels, where c = ⌈log_2(N + 1)⌉ + 2, to drive N scan chains. The logarithmic reduction in the number of tester channels allows us to use a large number of internal scan chains, thereby reducing test application time significantly. In the best case, we can achieve compression by a factor of N/c using only one tester clock cycle per slice. We derive an upper bound on the density of care bits (either 1s or 0s) that allows us to achieve the best-case compression. The pattern decompression is of the continuous-flow type because no complex handshakes are required between the tester and the chip, and there is no need to introduce tester stall cycles. Unlike popular compression methods such as EDT and SmartBIST, the proposed approach is suitable for IP cores because it does not require structural information for fault simulation, dynamic compaction, or interleaved test generation. The on-chip decoder is small, independent of the circuit under test and the test set, and it can be shared between different circuits. We present compression results for a number of industrial circuits, and compare our results to other recent compression methods targeted at IP cores. We show that up to 28x reduction in test data volume and 20x reduction in testing time are obtained for these circuits.
The steady increase in clock frequencies over recent years has led to designs with a small number of gates between latches, or between latches and I/O pins. As a result, logic circuits today have very short combinational logic depth, and many logic cones with very little overlap. This is in contrast to older circuits such as the ISCAS-85 benchmarks that tend to have a smaller number of overlapping logic cones. A consequence of the shallow logic depth is that test patterns in present-day circuits contain many don't-care bits; e.g., it has been reported recently that test sets for industrial circuits contain only 1%-5% care bits [27]. After a desired stuck-at coverage is obtained, a commercial test pattern generator typically uses random fill to increase the likelihood of fortuitous detection of unmodeled faults. However, if the test sets for the cores are delivered with the don't-care bits to the system integrator, an appropriate compression method can be used at the system level to reduce test data volume and testing time. This imposes no additional burden on the core vendor. Unmodeled faults can still be detected if the compression method does not arbitrarily map all don't-cares to either 1s or 0s.
We do not address the problem of output compaction in this work. The proposed input compression method can be used with recent output compaction methods such as X-compact [2], convolutional compaction [28], and i-compact [29] to further reduce test data volume.
The rest of the paper is organized as follows. The details of the proposed approach are described in Section II. Section III presents the decompression architecture and Section IV presents compression results for industrial circuits.
II. PROPOSED APPROACH
As shown in Fig. 1, the proposed approach encodes the slices of test data (scan slices) that are fed to the internal scan chains. The on-chip decoder contains an N-bit buffer, and it manipulates the contents of the buffer according to the compressed data that it receives. After all the compressed data for a single slice is received, the data in the buffer is delivered to the scan chains.
Each slice is encoded as a series of c-bit slice-codes, where c = K + 2, K = ⌈log_2(N + 1)⌉, and N is the number of internal scan chains in the CUT. As shown in Fig. 2, the first two bits of a slice-code form the control-code that determines how the following K bits, referred to as the data-code, are interpreted.
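The relationship between N, K, and c can be checked with a few lines of Python. This is a minimal sketch: the function name `channel_count` is hypothetical, and the ceiling on the logarithm is assumed from the integer values quoted later in the paper (K = 7 for N = 64, K = 11 for N = 1024).

```python
def channel_count(n_chains: int) -> tuple[int, int]:
    """Return (K, c) for N scan chains: K = ceil(log2(N + 1)), c = K + 2."""
    if n_chains < 1:
        raise ValueError("need at least one scan chain")
    # For N >= 1, ceil(log2(N + 1)) equals the bit length of N,
    # which avoids floating-point rounding in math.log2.
    k = n_chains.bit_length()
    return k, k + 2


for n in (8, 64, 255, 1024):
    k, c = channel_count(n)
    print(f"N={n:5d}  K={k:2d}  c={c:2d}")
```

The extra data-code value (N itself) is what makes K depend on N + 1 rather than N; it is used as the dummy code described below.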
As described in Section I, the proposed approach only encodes a subset of the specified bits in a slice. First, the encoding procedure examines the slice and determines the number of 0-and 1-valued bits. If there are more 1s (0s) than 0s (1s), then all X's in this slice are mapped to 1 (0), and only 0s (1s) are encoded. The 0s (1s) are referred to as target-symbols and are encoded into data-codes in two modes: 1) single-bit-mode and 2) group-copy-mode.
In the single-bit-mode, each bit in a slice is indexed from 0 to N − 1. A target-symbol is represented by a data-code that takes the value of its index. For example, to encode the slice "XXX10000", the X's are mapped to 0 and the only target-symbol of 1 at bit position 3 is encoded as "0011". In this mode, each target-symbol in a slice is encoded as a single slice-code. Obviously, if there are many target-symbols that are adjacent or near to each other, it is inefficient to encode each of them using separate slice-codes. Hence the group-copy-mode has been designed to increase compression efficiency.
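The single-bit-mode example above can be reproduced with a short Python sketch. The function name `single_bit_codes` and the string representation of a slice (characters '0', '1', 'X') are assumptions made for illustration; control-codes are left out here.

```python
def single_bit_codes(slice_bits: str, k: int) -> tuple[str, list[str]]:
    """Pick the target-symbol and return one K-bit data-code per target bit."""
    zeros, ones = slice_bits.count("0"), slice_bits.count("1")
    # More 0s than 1s: X's become 0 and the 1s are the target-symbols;
    # otherwise X's become 1 and the 0s are encoded.
    target = "1" if zeros > ones else "0"
    # Each data-code is the K-bit binary index of a target-symbol.
    codes = [format(i, f"0{k}b") for i, b in enumerate(slice_bits) if b == target]
    return target, codes


print(single_bit_codes("XXX10000", 4))
```

For the slice "XXX10000" with K = 4, the only target-symbol is the 1 at index 3, so a single data-code "0011" results.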
In the group-copy-mode, an N-bit slice is divided into M = ⌈N/K⌉ groups, and each group (with the possible exception of the last group) is K bits wide. If a group contains more than two target-symbols, the group-copy-mode is used and the entire group is copied to a data-code. Two data-codes are needed to encode a group. The first data-code specifies the index of the first bit of the group, and the second data-code contains the actual data. In the group-copy-mode, don't-cares can be randomly filled instead of being mapped to 0 or 1 by the compression scheme.
For example, let N = 8 and K = 4, i.e., each slice is 8-bits wide and consists of two 4-bit groups. To encode the slice "X1110000", the three 1s in group 0 are encoded. The resulting data-codes are "0000" and "X111", which refer to bit 0 (first bit of group 0) and the content of the group, respectively.
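A minimal Python sketch of the group-copy-mode rule follows. The function name `group_copy_codes` is hypothetical, the target-symbol is passed in rather than derived, and a short last group is returned unpadded.

```python
def group_copy_codes(slice_bits: str, k: int, target: str) -> list[tuple[str, str]]:
    """Return (index data-code, content data-code) for each group with > 2 targets."""
    out = []
    for start in range(0, len(slice_bits), k):
        group = slice_bits[start:start + k]  # the last group may be shorter than K
        if group.count(target) > 2:
            # First data-code: index of the group's first bit.
            # Second data-code: the group content, don't-cares left as 'X'.
            out.append((format(start, f"0{k}b"), group))
    return out


print(group_copy_codes("X1110000", 4, "1"))
```

For "X1110000" only group 0 qualifies, giving the pair ("0000", "X111") from the example above.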
Since data-codes are used in both modes, control-codes are needed to avoid ambiguity. Control-codes "00", "01" and "10" are used in the single-bit-mode and "11" is used in the group-copy-mode. Control-codes "00" and "01" are referred to as initial-control-codes and they indicate the start of a new slice.
The encoding procedure is summarized in Fig. 3. In Step (1), each test vector is divided into a series of slices. We then encode each slice as a series of slice-codes. In Steps (3)-(7), the numbers of 0s and 1s are calculated, and the target-symbol as well as the control-code of the first slice-code are set. The first slice-code of each slice must contain an initial-control-code.
Steps (8)-(14) encode all the groups of a slice. For each group of a slice, if it contains more than two target-symbols, it is encoded using the group-copy-mode; otherwise it is encoded using the single-bit-mode. A single-bit-mode slice-code contains a control-code of "00", "01", or "10". The data-code for the single-bit-mode ranges from 0 to N. However, since there are only N scan chains, a data-code of N is interpreted as a dummy and it implies that no bits are to be set. For example, if a slice contains no target-symbols, it is encoded as an initial-control-code and a dummy data-code.
Step (15) generates slice-codes for the entire slice. The first slice-code is a single-bit-mode code and it can encode any group that contains only one target-symbol. To improve compression efficiency, if n ≥ 2 adjacent groups are to be encoded using the group-copy-mode, they are merged together. Only one data-code is used to specify the index of the first bit of these groups, followed by n data-codes carrying the contents of the n groups. Obviously, group-copy-mode slice-codes should be interleaved with single-bit-mode slice-codes. Since the data-code carrying the bit index of the first group and the data-codes carrying the actual data are all associated with the same control-code of "11", only the occurrence of a single-bit-mode slice-code can terminate a series of group-copy-mode slice-codes.
As can be seen from Fig. 3, the encoding procedure consists of two nested loops: the outer loop is for scan slices, and the inner loop is for groups of a slice. Hence its time complexity is O(S · N) = O(V), since the test data volume is V = S · N bits. Table I shows a complete example to further illustrate the encoding procedure.
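The per-slice part of the procedure can be sketched in Python as follows. This is an illustrative reading of Steps (3)-(15), not the authors' implementation: the function name `encode_slice` is hypothetical, and two details are assumptions where the text leaves room, namely that a dummy single-bit-mode code with control-code "10" separates two non-adjacent runs of group-copy-mode codes when no real single-bit code is available, and that a short last group is padded with 'X'.

```python
def encode_slice(bits: str, k: int) -> list[tuple[str, str]]:
    """Encode one scan slice (string over '0', '1', 'X') as (control, data) codes."""
    n = len(bits)

    def code(value: int) -> str:
        return format(value, f"0{k}b")  # K-bit binary data-code

    # Steps (3)-(7): pick the target-symbol and the initial control-code.
    k0, k1 = bits.count("0"), bits.count("1")
    target, first_ctrl = ("1", "00") if k0 > k1 else ("0", "01")

    # Steps (8)-(14): classify every group.
    singles: list[int] = []     # bit indices encoded in single-bit-mode
    runs: list[list[int]] = []  # start indices of merged adjacent groups
    for start in range(0, n, k):
        group = bits[start:start + k]
        if group.count(target) > 2:
            if runs and runs[-1][-1] + k == start:
                runs[-1].append(start)  # merge with the adjacent group
            else:
                runs.append([start])
        else:
            singles += [start + j for j, b in enumerate(group) if b == target]

    def next_single() -> str:
        # Data-code N is the dummy: it sets no bit.
        return code(singles.pop(0)) if singles else code(n)

    # Step (15): the first slice-code carries the initial control-code.
    codes = [(first_ctrl, next_single())]
    for r, run in enumerate(runs):
        if r > 0:
            codes.append(("10", next_single()))  # assumed separator between runs
        codes.append(("11", code(run[0])))       # index of the first bit
        codes += [("11", bits[s:s + k].ljust(k, "X")) for s in run]
    codes += [("10", code(i)) for i in singles]  # remaining single-bit codes
    return codes


print(encode_slice("X1110000", 4))
```

With this reading, "XXX10000" needs one slice-code, while "X1110000" needs three: an initial code with a dummy data-code, followed by the group index and the group content.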
Fig. 3. Encoding procedure:
(1)  Format the given test vectors into slices;
(2)  for each slice
(3)    determine the number of 0s (k0) and 1s (k1) in the slice;
(4)    if k0 > k1 then
(5)      target-symbol := 1; 1st control-code := 00;
(6)    else
(7)      target-symbol := 0; 1st control-code := 01;
(8)    for each group of the slice
(9)      calculate the number of target-symbols;
(10)     if number-of-target-symbols > 2 then
(11)       encode the group using the group-copy-mode;
(12)     else
(13)       encode the group using the single-bit-mode;
(14)   end for (group);
(15)   generate slice-codes for the current slice.
The following two theorems provide more insights into the proposed compression method based on selective encoding of scan slices.
Theorem 1: Let the test data volume for the CUT with N internal scan chains be V bits. Let the compressed test data volume obtained after selective encoding of scan slices be U bits. The maximum value of the compression factor V/U that can be achieved is given by f = N/c, where c = ⌈log_2(N + 1)⌉ + 2 corresponds to the number of ATE channels used to drive the decompression logic.
Proof: The maximum compression is achieved when each slice is encoded as a single slice-code. Since a total of c = ⌈log_2(N + 1)⌉ + 2 ATE channels is used in the proposed method, the test data volume for every scan slice is reduced by a factor of N/c.
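As a worked instance of Theorem 1, take N = 255, the configuration quoted for ckt-8 in Section IV. The ceiling in K is assumed from the integer values of K reported in Section III.

```latex
\[
K = \lceil \log_2 (255 + 1) \rceil = 8, \qquad
c = K + 2 = 10, \qquad
f = \frac{N}{c} = \frac{255}{10} = 25.5 .
\]
```

This agrees with the theoretical maximum of 25.5x cited in the experimental results.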
Theorem 2: Let the number of scan slices be S, and let k_{0,i} (k_{1,i}) be the number of 0s (1s) in scan slice i, 0 ≤ i ≤ S − 1. A necessary condition for the maximum compression factor of Theorem 1 to be achieved is given by:
Proof: The maximum compression factor of Theorem 1 can only be achieved if every slice is encoded as a single slice-code. This in turn is only possible if every scan slice contains either zero or one target-symbol, i.e., min(k_{0,i}, k_{1,i}) ≤ 1, 0 ≤ i ≤ S − 1. The necessary condition given by (1) therefore follows.
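The per-slice condition used in the proof can be written out explicitly. Summing it over all S slices gives an aggregate bound on the total number of target-symbols; this summed form is a direct consequence of the per-slice condition and is offered only as an illustration, not as a restatement of (1).

```latex
\[
\min(k_{0,i},\, k_{1,i}) \le 1 \quad (0 \le i \le S-1)
\;\Longrightarrow\;
\sum_{i=0}^{S-1} \min(k_{0,i},\, k_{1,i}) \le S .
\]
```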
The significance of Theorem 2 lies in the fact that the maximum compression factor of N/c can only be achieved if the test generator is tailored to satisfy (1). Test set relaxation methods as in [30] can also be used to satisfy (1). Such methods, however, require structural information about the circuit.
A property of the proposed compression method is that consecutive c-bit compressed slices fed by the ATE are often identical or compatible. Therefore, ATE pattern-repeat can be used to further reduce test data volume after selective encoding of scan slices. In the uncompressed data sets, especially among the test vectors that lie near the end of a test set, there are a large number of consecutive slices that contain no target-symbols. These slices are encoded as identical single slice-codes that have only dummy data-codes. With ATE pattern-repeat, these slice-codes can be further compacted. Additionally, consecutive group-copy-mode slice-codes can also be compacted if they are compatible. Fig. 4 shows how a set of scan slices are encoded. The example shows that some slice-codes, e.g., the first two in the encoded test set, can be combined and applied using ATE pattern-repeat.
III. DECOMPRESSION ARCHITECTURE
Fig. 5 shows the state transition diagram of the decoder. The decoder enters designated states and performs different operations as specified by the control-codes that it receives. Initially the decoder is in the init state; when it receives an "initial-control-code", it enters the single-bit-mode and performs a series of operations referred to as P1. Table II explains the five groups of operations (P1-P5) in Fig. 5.
Table II. Operations P1-P5 in Fig. 5:
P1: If the control-code is 00 (01), the target-symbol is 1 (0). Set the bit specified by the associated data-code to the target-symbol and all other bits to the complemented value.
P2: Shift the current content of the buffer to the internal scan chains, then perform P1.
P3: Save the value of the data-code to an address register, which is the index of the first bit of the group.
P4: Copy the value of the data-code to the group specified by the address register, then increment the address register by K.
P5: Set the bit specified by the data-code to the target-symbol.
Fig. 6 shows the block diagram of the decoder. The finite-state machine (FSM) generates control signals for the other components. The K-bit address register is used in the group-copy-mode to store the index of the first bit of the target group. This register can be incremented by K to address a series of adjacent groups. The K-to-N address decoder generates selection signals to address a single bit of the buffer. The N-bit buffer contains combinational logic that provides the following functionalities: (1) each bit in the buffer can be individually addressed, and (2) all bits in the same group can be addressed in parallel. These two functions are used in the single-bit-mode and the group-copy-mode, respectively. The 2-bit input signal control is the control-code from the tester. The signal rst, when asserted to 0, resets the FSM to its initial state. The signal v is set to high when the decoding process for a slice is finished and the content of the buffer is shifted to the internal scan chains.
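The behaviour of operations P1-P5 can be modelled in software. The Python sketch below is an assumption-laden behavioural model, not the hardware of Fig. 6: the function name `decode` is hypothetical, the mapping of control-codes to P1-P5 is inferred from the operation descriptions rather than from the state diagram, and the final flush of the buffer stands in for the shift that the next initial-control-code would trigger.

```python
def decode(codes: list[tuple[str, str]], n: int, k: int) -> list[str]:
    """Decode (control, data) slice-codes into N-bit slices."""
    slices: list[str] = []
    buf = None     # N-bit buffer; None while in the init state
    target = None  # current target-symbol
    addr = None    # address register; None outside group-copy-mode
    for ctrl, data in codes:
        if ctrl in ("00", "01"):
            if buf is not None:
                slices.append("".join(buf))  # P2: shift out the finished slice
            # P1: set the addressed bit to the target, all others to its complement.
            target = "1" if ctrl == "00" else "0"
            buf = ["0" if target == "1" else "1"] * n
            idx = int(data, 2)
            if idx < n:  # data-code N is the dummy
                buf[idx] = target
            addr = None
        elif buf is None:
            raise ValueError("stream must start with an initial-control-code")
        elif ctrl == "11":
            if addr is None:
                addr = int(data, 2)  # P3: load the address register
            else:
                for j, bit in enumerate(data):  # P4: copy the group content
                    if addr + j < n:
                        buf[addr + j] = bit
                addr += k
        else:  # control-code "10"
            idx = int(data, 2)
            if idx < n:
                buf[idx] = target  # P5: set one more target bit
            addr = None
    if buf is not None:
        slices.append("".join(buf))  # assumed flush of the last slice
    return slices


print(decode([("00", "1000"), ("11", "0000"), ("11", "X111")], 8, 4))
```

Don't-care bits carried in a group-copy-mode data-code are left as 'X' in this model; in practice they would already have been filled before the data reaches the tester.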
If the signal is_grp is asserted, the decoder works in the group-copy-mode. The signal inc_grp is used to increment the address register by K. In the group-copy-mode, the K-to-N address decoder receives input from the address register; in the single-bit-mode, it receives input from the data-code input. The N-bit selection (sel) signal is used to address a single bit in the buffer. At any given time only one of the N wires is asserted.
The second bit of the control-code (control), i.e., the target-symbol, is latched to ts. In the single-bit-mode, the specified bit is set to ts. If the control-code is "00" or "01", the signal set_buf is asserted, and all other bits except the specified bit are set to the complement of ts. The signal den is asserted whenever the buffer contents are to be changed. The signal den is set to 0 only during the first clock cycle of the group-copy-mode; at the same time the address of the first bit of the group is loaded to the address register.
The sel signal from the address decoder can only address a single bit of the buffer. However, in the group-copy-mode, all the K bits in the target group should be addressed (the last group may contain fewer bits). Therefore, in the group-copy-mode, additional combinational logic is needed to address the other bits together with the first bit. For each bit i of the buffer, an N-bit signal en(i) is defined, where
where i = 0, 1, ..., N − 1. Fig. 7 shows the structure of a single bit of the buffer. Each bit is represented by a falling edge-triggered D flip-flop (DFF) with enable input (EN). The DFF receives input data from a multiplexer and receives the enable signal from the combinational logic. In the group-copy-mode, the input is from the data-code from the tester. Bit i of the buffer is connected to bit i mod K of the data-code. In the single-bit-mode, the input is either ts or the complement of ts, as determined by en(i). The multiplexer implements the following function:
if is_grp = 0, set_buf = 1. The DFF is changed only when EN is high. The combinational logic implements the Boolean function
The DFFs are falling edge-triggered and the FSM is rising edge-triggered. Therefore, upon the rising edge of the clk signal, all control signals are generated and become stable before the falling edge of clk. The buffer is updated at the falling edge of clk.
The state diagram of the FSM is shown in Fig. 8. The state S0 is the initial state. States S1 and S2 correspond to the single-bit-mode and the group-copy-mode, respectively. We simulated the decoder using VHDL and Synopsys tools to ensure its correct operation. We also synthesized the decoder using Synopsys Design Compiler to assess the hardware overhead. The synthesized FSM contains only 5 flip-flops and 23 combinational gates. For the lsi_10k library, the reported area is 55 units. The other parts of the decoder are synthesized separately since they depend on K and N. For N = 64 and K = 7, the synthesized circuit contains 536 gates and 71 flip-flops, and the area is 1341 units. If N = 1024 and K = 11, the synthesized circuit contains 6409 gates and 1035 flip-flops, and the area is 18,877 units. For the designs with more than one million gates considered in our experiments, this corresponds to an area overhead of only 1%. The schematic of the FSM is shown in Fig. 9.
IV. EXPERIMENTAL RESULTS
In this section, we apply the proposed approach to eight representative industrial circuits. These circuits vary in size from approximately 50K gates to over 1.4M gates. For each circuit, we compress test sets with high fault coverage that were provided to us by industrial partners. These test sets are generated by commercial ATPG tools with dynamic compaction turned on and random-fill turned off. The percentages of specified bits in these test sets are approximately in the range 1%-4%. We do not report compression results for the ISCAS-89 benchmark circuits because they are too small to be representative of today's designs. Moreover, their test sets contain too many care bits, due in part to their large logic depth.
Table IV shows the compression results obtained using the proposed method for the different test cases. Column N refers to the number of internal scan chains and column c denotes the number of ATE channels. We consider a varying number of internal scan chains N to show how an appropriate value of N can be determined for designs with flexible scan chains. The test application time and the size of compressed data are shown in columns TAT and T_E, respectively. The parameter Υ_V = |T_D| / |T_E| refers to the data volume reduction factor. The parameter Υ_TAT is the test application time reduction factor over standard scan testing based on c ATE channels. Without ATE pattern-repeat, Υ_V and Υ_TAT are as high as 22.96x and 22.91x, respectively. With ATE pattern-repeat, Υ_V is as high as 28.82x. For ckt-8 with 255 internal scan chains, the compression of 20.12x is close to the theoretical maximum of 25.5x (without ATE pattern-repeat) predicted by Theorem 1. The CPU time for generating the compressed data is at most two minutes, even for the largest circuits.
We compare the proposed compression method to two other recent compression methods that have been proposed for IP cores. These methods also do not require fault simulation or test generation. To ensure fairness of comparison, we do not consider compression methods that require structural information. Table V presents comparative data for two-dimensional compression [25] . The compression method in [25] was implemented and applied to a number of industrial test cases. In every case considered, the number of ATE channels required is much less for the proposed method compared to [25] . Out of the 21 cases considered, |T E | for [25] is higher in 20 cases. The value of Υ T AT in [25] is smaller in 18 cases.
The test sets described in Table III were obtained using dynamic compaction during ATPG. As is the case of other compression methods, these test sets were not compressed further using static compaction after ATPG. In some cases, e.g. ckt-3, a commercial ATPG tool was given a maximum number of care bits per vector as a constraint. It was reported in [25] that the compressed test sets for ckt-1 and ckt-2 are an order of magnitude smaller than the compacted test sets used during production testing for these circuits. Therefore, the proposed method can achieve significant reduction in data volume over ATPG-compacted test sets.
Table VI compares the proposed method to the recent compression technique based on dictionaries with corrections [21]. We implemented the procedures from [21] and applied this technique to several industrial test cases. Table VI is similar to Table V, with an additional column mem that shows the size of the on-chip memory. Out of the 24 cases considered, |T_E| in [21] is higher in 15 cases. Note that for the 9 cases where |T_E| in [21] is less, an excessive amount of on-chip storage (as high as 8M bits) is needed for [21]. Hence it is difficult to use [21] in practice for these cases. TAT is lower in [21] in most cases, but it also requires a much larger number of ATE channels. If the number of ATE channels and the amount of on-chip storage are limited (or constrained), the proposed method outperforms [21] both in terms of T_E and TAT.
Finally, we determine, for each circuit, the number of internal scan chains N that leads to the maximum data volume reduction factor Υ_V as well as the value of N that leads to the maximum TAT reduction factor Υ_TAT. For IP cores with flexible scan chains, this information can allow appropriate scan chain configurations. Fig. 10 and Fig. 11 show how Υ_V and Υ_TAT vary with N for the test cases. For some circuits, namely ckt-1, ckt-2, ckt-3-200, and ckt-7, the best value of N for the highest Υ_V also leads to the highest Υ_TAT. However, for the other test cases, the best value of N for the highest Υ_V does not maximize Υ_TAT. For example, for ckt-8, Υ_TAT is maximum for N = 255 while Υ_V is maximum for N = 511.
V. CONCLUSION
We have presented a test data compression technique for designs with multiple scan chains. This method does not require detailed structural information about the circuit under test (CUT), and utilizes a generic on-chip decoder that is independent of the CUT and the test set. While the hardware overhead depends on the number of internal scan chains, we have seen that for an industrial circuit with over 1M gates, the overhead is only 1% for as many as 1024 internal scan chains. If a small amount of circuit redesign is permitted, we can reduce the hardware overhead by modifying the first scan cells of each scan chain such that they can be used as the N-bit on-chip buffer. The clock inputs of these scan cells need to be appropriately gated so that they can be triggered separately from other cells in the same scan chain. Experimental results for eight industrial circuits show that compared to dynamically compacted test sets, up to 28x reduction in test data volume and 20x reduction in test application time can be obtained.
