Abstract-We present a data compression method and decompression architecture for testing embedded cores in a system-on-a-chip (SOC). The proposed approach makes effective use of Golomb coding and the internal scan chain(s) of the core under test and provides significantly better results than a recent compression method that uses Golomb coding and a separate cyclical scan register (CSR). The major advantages of Golomb coding of test data include very high compression, analytically predictable compression results, and a low-cost and scalable on-chip decoder. The use of the internal scan chain for decompression obviates the need for a CSR, thereby reducing hardware overhead considerably. In addition, the novel interleaving decompression architecture allows multiple cores in an SOC to be tested concurrently using a single ATE I/O channel. We demonstrate the effectiveness of the proposed approach by applying it to the ISCAS 89 benchmark circuits.
I. INTRODUCTION
Embedded cores are becoming commonplace in large system-on-a-chip (SOC) designs [1] . Along with the benefits of higher integration and shorter time to market, intellectual property (IP) cores pose several difficult test challenges. The volume of test data for an SOC is growing rapidly as IP cores become more complex and an increasing number of these cores are integrated in a chip. In order to effectively test these systems, each core must be adequately exercised with a set of precomputed test patterns provided by the core vendor. However, the input/output (I/O) channel capacity, speed and accuracy, and data memory of automatic test equipment (ATE) are severely limited.
The testing time for an SOC depends on the test data volume, the time required to transfer the data to the cores, and the rate at which it is transferred (measured by the cores' test data bandwidth and the ATE channel capacity). Lower testing time increases production capacity and reduces both test cost and time to market for an SOC. New techniques are therefore needed for decreasing test data volume in order to overcome memory bottlenecks and to reduce testing time.
An attractive approach for reducing test data volume for SOCs is based on the use of data compression techniques [2]-[4]. In this approach, the precomputed test set $T_D$ provided by the core vendor is compressed (encoded) to a much smaller test set $T_E$ and stored in the ATE memory. An on-chip decoder is used for pattern decompression to generate $T_D$ from $T_E$ during pattern application. Test data compression using statistical coding of test sequences for synchronous sequential (nonscan) circuits was presented in [2] and for full-scan circuits in [3]. While the compression method in [2] is restricted to sequential circuits with a large number of flip-flops and relatively few primary inputs, the work presented in [3] does not conclusively demonstrate that statistical coding provides greater compression than standard ATPG compaction methods for full-scan circuits [5], [6]. Test data can be more efficiently compressed by taking advantage of the fact that the number of bits changing between successive test patterns in a test sequence is generally very small. This observation was used in [4], where a "difference vector" sequence $T_{\text{diff}}$ determined from $T_D$ was compressed using run-length coding. A drawback of the compression method described in [4] is that it relies on variable-to-fixed-length codes, which are less efficient than more general variable-to-variable-length codes [7], [8]. Furthermore, it is inefficient for cores with internal scan chains that are used to capture test responses; in these circuits, separate CSRs must be added to the SOC, thereby increasing hardware overhead. A more efficient compression and decompression method was used in [9], where $T_{\text{diff}}$ was compressed using variable-to-variable-length Golomb codes. However, this approach requires separate CSRs and is therefore also inefficient for cores that use the same internal scan chains for applying test patterns and capturing test responses.
Publisher Item Identifier S 0278-0070(02)04700-0.
In this companion paper to [9], we present an improved test data compression and decompression method for IP cores in an SOC. The proposed approach makes effective use of Golomb codes and the internal scan chain(s) of the core under test. No separate CSR is required for pattern decompression. The difference sequence $T^{R}_{\text{diff}}$ is derived from the given precomputed test set $T_D$ using the fault-free responses $R$ of the core under test to $T_D$. Golomb coding is then applied to $T^{R}_{\text{diff}}$. The resulting encoded test set $T_E$ is much smaller than the original precomputed test set $T_D$. We apply our compression approach to test sets for the ISCAS 89 benchmark circuits and show that $T_E$ is not only considerably smaller than the smallest test sets obtained using ATPG compaction [5], but also significantly smaller than the compressed test sets obtained using Golomb coding in [9].
The proposed compression approach for reducing test data volume is especially suitable for SOCs containing IP cores since it does not require gate-level models for the embedded cores. Precomputed test sets can be directly encoded without any fault simulation or subsequent test generation. This is in contrast to other recent techniques, such as LFSR-based reseeding for BIST [10] and scan broadcast [11], which require structural models for fault simulation and test generation. The mixed-mode BIST technique in [10] relies on fault simulation for identifying hard faults and on test generation to determine test cubes for these faults. The scan broadcast technique in [11] also requires test generation.
We extend the decompression architecture of [9] to an interleaving scheme that allows multiple cores to be tested in parallel with a single ATE I/O channel. We also present analytical results for test data compression and testing time. Finally, we show that test data compression not only reduces the volume of test data but it also allows a slower tester to be used without any penalty on testing time. 
II. COMPRESSION METHOD AND TEST ARCHITECTURE
We first review Golomb coding and its application to test data compression [9]. The first step in the encoding procedure is to select the Golomb code parameter $m$. The choice of $m$ has received a lot of attention in the information theory literature; for certain distributions of the input data stream ($T^{R}_{\text{diff}}$ in our case), the group size $m$ can be optimally determined. For example, if the input data stream is random and each bit is a zero with probability $p$, then $m$ should be chosen such that $p^m \approx 0.5$ [8]. However, since the difference vectors for precomputed test sets do not satisfy the randomness assumption, the best value of $m$ for test data compression must be determined experimentally. Nevertheless, it has been shown in [9] that the best value of $m$ can be approximated analytically.
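As a rough illustration of the random-source rule $p^m \approx 0.5$ quoted above, the following sketch solves for the real-valued $m$ and rounds it to a power of two. The function names and the log-scale rounding rule are assumptions made for this example only; they are not taken from [8] or [9], and for real test sets the parameter still has to be tuned experimentally.

```python
import math


def ideal_group_size(p_zero: float) -> float:
    """Real-valued m that satisfies p_zero ** m == 0.5 (random-source assumption)."""
    if not 0.0 < p_zero < 1.0:
        raise ValueError("p_zero must lie strictly between 0 and 1")
    return -1.0 / math.log2(p_zero)


def nearest_power_of_two(x: float) -> int:
    """Power of two (at least 2) closest to x on a logarithmic scale.

    Rounding on a log scale is an illustrative choice, not a rule from the paper.
    """
    return 2 ** max(1, round(math.log2(x)))


# Example: a stream in which 90% of the bits are zeros.
m_real = ideal_group_size(0.9)
m_pow2 = nearest_power_of_two(m_real)
```

Restricting $m$ to a power of two keeps the tail a fixed $\log_2 m$ bits wide, which is the case the rest of this section assumes.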
Once the group size $m$ is determined, the runs of zeros in the precomputed test set are mapped to groups of size $m$ (each group corresponding to a run length). The number of such groups is determined by the length of the longest run of zeros in the precomputed test set.
The set of run lengths $\{0, 1, 2, \ldots, m-1\}$ forms group $A_1$; the set $\{m, m+1, m+2, \ldots, 2m-1\}$, group $A_2$; etc. In general, the set of run lengths $\{(k-1)m, (k-1)m+1, (k-1)m+2, \ldots, km-1\}$ comprises group $A_k$ [8]. To each group $A_k$, we assign a group prefix of $(k-1)$ ones followed by a zero. We denote this by $1^{(k-1)}0$. If $m$ is chosen to be a power of two, i.e., $m = 2^N$, each group contains $2^N$ members and a $\log_2 m$-bit sequence (tail) uniquely identifies each member within the group. Thus, the final code word for a run length $L$ that belongs to group $A_k$ is composed of two parts, a group prefix and a tail. The prefix is $1^{(k-1)}0$ and the tail is a sequence of $\log_2 m$ bits. It can be easily shown that $(k-1) = \lfloor L/m \rfloor$, i.e., $k = \lfloor L/m \rfloor + 1$, while the tail is the binary representation of $L \bmod m$. The encoding process is illustrated in Fig. 1.
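The prefix-and-tail construction described above can be sketched in a few lines of Python. This is a minimal illustration that assumes $m$ is a power of two (at least 2) and that the input bit string ends in a one; the function names are hypothetical and the code is not the authors' implementation.

```python
def golomb_codeword(run_length: int, m: int) -> str:
    """Code word for a run of `run_length` zeros terminated by a single one."""
    if m < 2 or m & (m - 1):
        raise ValueError("m must be a power of two, at least 2")
    tail_bits = m.bit_length() - 1                   # log2(m)
    prefix = "1" * (run_length // m) + "0"           # (k - 1) ones, then the separator zero
    tail = format(run_length % m, f"0{tail_bits}b")  # position of the run within group A_k
    return prefix + tail


def golomb_encode(bits: str, m: int) -> str:
    """Concatenate the code words of all zero-runs in a 0/1 string ending in '1'."""
    if not bits.endswith("1"):
        raise ValueError("the sequence is assumed to end in a one")
    runs = bits.split("1")[:-1]                      # one (possibly empty) zero-run per '1'
    return "".join(golomb_codeword(len(run), m) for run in runs)


# A run of 5 zeros followed by a run of 8 zeros, encoded with m = 4.
encoded = golomb_encode("000001" + "000000001", 4)
```

With $m = 4$, the run of length 5 falls in group $A_2$ and the run of length 8 in group $A_3$, so the two code words carry one and two prefix ones, respectively.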
The next step in the compression procedure is to derive the difference vector set $T_{\text{diff}}$ from $T_D$, where $T_D = \{t_1, t_2, t_3, \ldots, t_n\}$ is the (ordered) precomputed test set. The ordering is determined using a heuristic procedure described later. $T_{\text{diff}}$ is defined as follows:

$$T_{\text{diff}} = \{d_1, d_2, \ldots, d_n\} = \{t_1,\; t_1 \oplus t_2,\; t_2 \oplus t_3,\; \ldots,\; t_{n-1} \oplus t_n\}$$

where a bit-wise exclusive-or operation is carried out between successive patterns $t_i$ and $t_{i+1}$. This assumes that the CSR starts in the all-zero state. (Other starting states can be considered similarly.) The details of the Golomb coding procedure are presented in the companion paper [9] and are therefore omitted here.
The proposed method differs from [9] in that no separate CSR is used; instead, the internal scan chain is used for pattern decompression and the fault-free responses of the core under test are used to generate a difference vector set $T^{R}_{\text{diff}}$. Given an (ordered) precomputed test set $T_D$ and the corresponding fault-free responses $R = \{r_1, r_2, \ldots, r_n\}$, each pattern is combined with the fault-free response to the preceding pattern:

$$T^{R}_{\text{diff}} = \{t_1,\; r_1 \oplus t_2,\; r_2 \oplus t_3,\; \ldots,\; r_{n-1} \oplus t_n\}.$$

The test architecture is shown in Fig. 2. As observed in [9], test data compression is more effective if $T_D$ consists of test cubes containing don't-care bits. In order to determine $T^{R}_{\text{diff}}$ in such cases, we need to assign appropriate binary values to the don't-care bits and perform logic simulation to obtain the corresponding fault-free responses. (In general, the simulation model for the core provided by the core vendor can be used to obtain the fault-free responses.) First, we set all don't-care bits in $t_1$, the first pattern in $T_D$, to zeros and use the logic simulation engine of FSIM [12] to generate the fault-free response $r_1$. The ordering algorithm described below is then used to generate the successive test patterns.
The problem of determining the best ordering is equivalent to the NP-complete Traveling Salesman problem. Therefore, a greedy algorithm is used to generate an ordering and the corresponding $T^{R}_{\text{diff}}$. Suppose a partial ordering $t_1 t_2 \cdots t_i$ has already been determined for the patterns in $T_D$. To determine $t_{i+1}$, we first determine $r_i$ using FSIM and then calculate the Hamming distance $HD(r_i, t_j)$ between $r_i$ and all patterns $t_j$ that have not been placed in the ordered list. We define $t_{i+1}$ to be the pattern $t_j$ for which $HD(r_i, t_j)$ is minimum.

An on-chip decoder decompresses the encoded test set $T_E$ and produces $T^{R}_{\text{diff}}$. The exclusive-or gate and the internal scan chain are used to generate the test patterns from the difference vectors. As discussed in the companion paper [9], the decoder can be efficiently implemented by a $\log_2 m$-bit counter and a finite-state machine (FSM). The synthesized decode FSM circuit contains only four flip-flops and 34 combinational gates [9]. For any circuit whose test set is compressed using $m = 4$, this logic is the only additional hardware required other than the two-bit counter.
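The greedy ordering step described above can be sketched as follows. This is a simplified illustration under several assumptions that are not from the paper: patterns are fully specified bit strings (don't-care handling is omitted), patterns and responses have equal length, the scan chain starts in the all-zero state so the first difference vector is the first pattern itself, and `simulate` is a hypothetical stand-in for the fault-free logic simulation that the paper performs with FSIM.

```python
from typing import Callable, List, Tuple


def hamming(a: str, b: str) -> int:
    """Number of bit positions in which two equal-length bit strings differ."""
    return sum(x != y for x, y in zip(a, b))


def xor_bits(a: str, b: str) -> str:
    """Bit-wise exclusive-or of two equal-length bit strings."""
    return "".join("1" if x != y else "0" for x, y in zip(a, b))


def greedy_order(patterns: List[str],
                 simulate: Callable[[str], str]) -> Tuple[List[str], List[str]]:
    """Greedily order the patterns and build the response-based difference vectors."""
    remaining = list(patterns)
    current = remaining.pop(0)          # the first pattern is taken as given
    ordered, diffs = [current], [current]
    while remaining:
        response = simulate(current)    # fault-free response to the last placed pattern
        nxt = min(remaining, key=lambda t: hamming(response, t))
        remaining.remove(nxt)
        diffs.append(xor_bits(response, nxt))
        ordered.append(nxt)
        current = nxt
    return ordered, diffs


# Toy "core" whose response is the bit-reversed pattern (purely illustrative).
ordered, diffs = greedy_order(["0011", "1111", "1101"], lambda t: t[::-1])
```

Choosing the unplaced pattern closest to the current response keeps the number of ones in each difference vector small, which lengthens the runs of zeros seen by the Golomb encoder.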
III. ANALYSIS OF TEST APPLICATION TIME AND TEST DATA COMPRESSION
In this section, we analyze the testing time for a single scan chain when Golomb coding is employed with the test architecture shown in Fig. 2 . From the state diagram of the Golomb decoder [9] , we note the following.
• Each "1" in the prefix part takes m cycles for decoding.
• Each separator "0" takes one cycle.
• The tail part takes a maximum of $m$ cycles and a minimum of $\log_2 m + 1$ cycles.
Let $n_c$ be the total number of bits in $T_E$ and $r$ be the number of ones in $T^{R}_{\text{diff}}$. $T_E$ contains $r$ tail parts and $r$ separator zeros, and the number of prefix ones in $T_E$ equals $n_c - r(1 + \log_2 m)$. Therefore, the maximum and minimum testing times ($T_{\max}$ and $T_{\min}$, respectively), measured by the number of cycles, are given by

$$T_{\max} = m\,[n_c - r(1 + \log_2 m)] + r + mr = m n_c - r(m \log_2 m - 1)$$

$$T_{\min} = m\,[n_c - r(1 + \log_2 m)] + r + r(\log_2 m + 1) = m n_c - r(m \log_2 m + m - \log_2 m - 2)$$

so that $T_{\max} - T_{\min} = r(m - \log_2 m - 1)$. We will make use of this result in Section IV.

A major advantage of Golomb coding is that on-chip decoding can be carried out at the scan clock frequency $f_{\text{scan}}$ while $T_E$ can be fed to the core under test with external clock frequency $f_{\text{ext}} < f_{\text{scan}}$. This allows us to use slower testers without increasing the test application time. The external and scan clocks must be synchronized, e.g., using the scheme described in [13], and $f_{\text{scan}} = m f_{\text{ext}}$, where the Golomb code parameter $m$ is usually a power of 2. This allows the bits of $T^{R}_{\text{diff}}$ to be generated by the decoder at the frequency $f_{\text{scan}}$. We now present an analysis of testing time using $f_{\text{scan}} = m f_{\text{ext}}$ and compare the testing time for our method with that of external testing in which ATPG-compacted patterns are applied using an external tester.
Let the ATPG-compacted test set contain $p$ patterns and let the length of the scan chain be $n$ bits. Therefore, the size of the ATPG-compacted test set is $pn$ bits and the testing time $T_{\text{ATPG}}$ equals $pn$ external clock cycles. Next, suppose the difference vector $T^{R}_{\text{diff}}$ obtained from the uncompacted test set contains $r$ ones and its Golomb-coded test set $T_E$ contains $n_c$ bits. The maximum number of scan clock cycles required for applying the test patterns using the Golomb coding scheme is $T_{\max} = m n_c - r(m \log_2 m - 1)$. Now, the maximum testing time (in seconds) when Golomb coding is used is given by

$$\frac{T_{\max}}{f_{\text{scan}}} = \frac{m n_c - r(m \log_2 m - 1)}{m f_{\text{ext}}}.$$
We next analyze the amount of compression that is achieved using Golomb coding of a precomputed test set $T_D$. The following three lemmas lead to the main result in Theorem 1. As in [9], we assume without loss of generality that the difference sequence always ends in a one.
Lemma 1: Let $T_D$ be the given precomputed test set, and let $T^{R}_{\text{diff}}$ be the bit stream derived from $T_D$ and the set of fault-free responses. Let the number of don't cares in $T_D$ be $n_x$. The number of zeros in $T^{R}_{\text{diff}}$ is at least $n_x$.

Proof: The lemma follows from the fact that every don't care in $T_D$ can be mapped to a zero in $T^{R}_{\text{diff}}$, while ones and zeros in $T_D$ must be selectively mapped to ones or zeros in $T^{R}_{\text{diff}}$, depending on the fault-free response.
Lemma 2 [9]: If an $n$-bit data stream $S$ containing $r$ ones is encoded using Golomb code with parameter $m$, an upper bound on the length $G_S$ of the encoded sequence is given by

Lemma 3: Let $S$ be any binary sequence and let $S^{*}$ be a binary sequence derived from $S$ by replacing one or more ones in it by zeros. Let $S_E$ ($S^{*}_E$) be the Golomb-coded sequence corresponding to $S$ ($S^{*}$). Then $\mathrm{len}(S_E) > \mathrm{len}(S^{*}_E)$, where $\mathrm{len}(S_E)$ and $\mathrm{len}(S^{*}_E)$ are the number of bits in $S_E$ and $S^{*}_E$, respectively.

Proof: Suppose we complement a one in $S$ that separates two runs of zeros of length $l_1$ and $l_2$ ($l_1, l_2 \ge 0$), respectively, to obtain $S^{*}$. We now have a run of $(l_1 + l_2 + 1)$ zeros in $S^{*}$. Since a run of length $L$ is encoded with $\lfloor L/m \rfloor + 1 + \log_2 m$ bits, the number of bits $N$ required to encode the two runs of zeros of length $l_1$ and $l_2$ is given by

$$N = \lfloor l_1/m \rfloor + \lfloor l_2/m \rfloor + 2(1 + \log_2 m).$$

Similarly, the number of bits $N^{*}$ required to encode the single run of $(l_1 + l_2 + 1)$ zeros in $S^{*}$ is given by

$$N^{*} = \lfloor (l_1 + l_2 + 1)/m \rfloor + 1 + \log_2 m.$$

Because $\lfloor (l_1 + l_2 + 1)/m \rfloor \le \lfloor l_1/m \rfloor + \lfloor l_2/m \rfloor + 1$, we have $N - N^{*} \ge \log_2 m > 0$ for $m \ge 2$. Therefore, complementing a single one to a zero always decreases the length of the Golomb-coded sequence. This argument can be easily extended using transitivity to show that $\mathrm{len}(S_E) > \mathrm{len}(S^{*}_E)$ whenever one or more ones in $S$ are changed to zeros to obtain $S^{*}$.
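Lemma 3 can be checked exhaustively for short sequences. The following sketch is only a sanity check, not a proof: it assumes $m$ is a power of two (at least 2), uses the code-word length $\lfloor L/m \rfloor + 1 + \log_2 m$ for a run of length $L$, and keeps the final bit fixed at one so that every sequence still ends in a one. The function names are hypothetical.

```python
from itertools import product


def encoded_length(bits: str, m: int) -> int:
    """Length in bits of the Golomb encoding of a 0/1 string that ends in '1'."""
    tail_bits = m.bit_length() - 1                 # log2(m)
    runs = bits.split("1")[:-1]                    # one zero-run per terminating '1'
    return sum(len(run) // m + 1 + tail_bits for run in runs)


def lemma3_holds(n: int, m: int) -> bool:
    """Check that turning any single interior one into a zero shortens the code."""
    for head in product("01", repeat=n - 1):
        s = "".join(head) + "1"                    # keep the trailing one fixed
        base = encoded_length(s, m)
        for i, bit in enumerate(s[:-1]):
            if bit == "1":
                flipped = s[:i] + "0" + s[i + 1:]
                if encoded_length(flipped, m) >= base:
                    return False
    return True


checks = [lemma3_holds(8, m) for m in (2, 4, 8)]
```

Each flip merges two runs into one, which removes one separator and one tail at the cost of at most one extra prefix bit.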
We now present an upper bound on the size of the encoded test set, and hence a worst-case bound on the amount of compression, that is obtained via Golomb coding of $T^{R}_{\text{diff}}$. The proof of the theorem follows from Lemmas 1-3.
Theorem 1: Let $T_D$ be the given precomputed test set, and let $T^{R}_{\text{diff}}$ be the $n$-bit data stream derived from $T_D$ and the set of fault-free responses. Let the number of don't cares in $T_D$ be $n_x$. If $T^{R}_{\text{diff}}$ is encoded using Golomb code with parameter $m$, an upper bound on the length $G$ of the encoded sequence is given by

Theorem 1 provides an easy-to-compute bound on the size of the encoded test set $T_E$. This bound depends only on the precomputed test set $T_D$ and is independent of the fault-free response. It can therefore be obtained without any logic simulation. We list these bounds for several ISCAS 89 circuits in Section V.
IV. INTERLEAVING DECOMPRESSION ARCHITECTURE
We now present a novel interleaving decompression architecture, which enables testing of multiple cores or the loading of multiple balanced scan chains in parallel. The same decoder can be used to drive equal-length scan chains in one or more cores in parallel. An important constraint here is that the same value of m must encode test sequences for all the scan chains. The proposed decompression architecture not only reduces the testing time and the size of the test data to be stored in the ATE memory, but also allows testing of multiple cores using a single ATE I/O channel, thereby increasing the ATE I/O channel capacity.
As discussed in Section II, when Golomb coding is applied to a block of data containing a run of zeros followed by a single one, the code word contains two parts: a prefix and a tail. For a given code parameter $m$ (group size), the length of the tail ($\log_2 m$) is independent of the run length. Note further that every one in the prefix corresponds to $m$ zeros in the decoded difference vector. Thus the prefix consists of a string of ones followed by a zero, and the zero can be used to identify the beginning of the tail.
As shown in [9], the FSM in the decoder runs the counter for $m$ decode cycles whenever a one is received and starts decoding the tail as soon as a zero is received. The tail decoding takes at most $m$ cycles. During prefix decoding, the FSM has to wait for $m$ cycles before the next bit of the prefix can be decoded. Therefore, we can use interleaving to test $m$ cores together, such that the decoder corresponding to each core is fed with encoded prefix data after every $m$ cycles. (This can also be used to feed multiple scan chains in parallel as long as the capture cycles of the scan chains are synchronized.) Whenever the tail is to be decoded (identified by a zero in the encoded bit stream), the respective decoder is fed with the entire tail of $\log_2 m$ bits in a single burst of $\log_2 m$ cycles. The SOC channel selector, consisting of a demultiplexer, a $\log_2 m$-bit counter, and an FSM, is used for interleaving; see Fig. 4.

This interleaving scheme works as follows. First, the encoded test data for $m$ cores are combined to generate a composite bit stream $T_C$ that is stored in the ATE. Next, $T_C$ is fed to the FSM, which is used to detect the beginning of each tail and to feed the demultiplexer. An $i$-bit counter ($i = \log_2 m$) is used to select the outputs to the decoders of the various cores. $T_C$ is obtained by interleaving the prefix parts of the compressed test sets of each core, but the tails are included unchanged in $T_C$. An example is shown in Fig. 5, where compressed data for two cores (generated using group size $m = 2$) have been interleaved to obtain the final encoded test set to be applied through the decompression scheme for multiple cores.
We now describe the SOC channel selector in more detail. The FSM, the $i$-bit counter, and the demultiplexer together constitute the SOC channel selector. The FSM is used to detect the beginning of the tail and generates the clk_stop signal to stop the $i$-bit counter. The signal data_in is the input to the FSM, data_out is the output, and the signals vin and vout are used to indicate that the input and output data are valid. The $i$-bit counter is connected to the select lines of the demultiplexer, and the demultiplexer outputs are connected to the decoders of the different scan chains. Every scan chain has a dedicated decoder. This decoder receives either a one or the tail of the compressed data corresponding to the various cores connected to the scan chain. If the FSM detects that a portion of the tail has arrived, the zero that is used to identify the tail is passed to the decoder and clk_stop goes high for the next $m$ cycles. The output of the demultiplexer does not change for this period, and the entire tail of length $\log_2 m$ bits is passed on continuously to the appropriate core.
The state diagram of the FSM for $m = 4$ and the corresponding timing diagram are shown in Figs. 6 and 7, respectively. The FSM is fed with $T_C$ corresponding to four different cores. It remains in state $S_0$ as long as it receives the ones corresponding to the prefixes. As soon as a zero is received, it outputs the entire tail unchanged and makes clk_stop high. This stops the $i$-bit counter and prevents any change at the demultiplexer output. The timing diagram (Fig. 7) shows that whenever a zero is received, the SOC channel selection remains unchanged for the next $(1 + m)$ cycles.
As discussed in Section III, the difference between $T_{\max}$ and $T_{\min}$ is given by $\delta T = r(m - \log_2 m - 1)$. Therefore, the difference between the maximum and minimum testing times for a single tail is $\delta t = m - \log_2 m - 1$. If we restrict $m$ to be small, $m \le 8$, then $\delta t \le 4$. In this case, the decode FSM can be easily modified by introducing additional states to the Golomb decoder FSM of [9] such that the tail decoding always takes $m$ cycles and $\delta t = 0$. To make the tail and prefix decoding times equal for $m = 4$, three additional states are required, as shown in Fig. 8. The additional states do not significantly affect the testing time or the hardware overhead.

There are $m$ cores in parallel, and each separator zero and tail takes $(1 + m)$ cycles to decode. Therefore, for $m$ cores, the decoding time $t_{\text{tail}}$ for the separators and the tails is given by

$$t_{\text{tail}} = (1 + m) \sum_{j=1}^{m} r_j = (1 + m) R$$

where $R = \sum_{j=1}^{m} r_j$. Since all the prefixes of the cores are decoded in parallel, the number of cycles $t_{\text{prefix}}$ required for decoding all the prefixes in $T_C$ is determined by the number of prefix ones of the core with the largest encoded test data. Therefore

$$t_{\text{prefix}} = \max_i \{(n_{C,i} - r_i(1 + \log_2 m))\,m\} = (n_{C,\max} - r_{\max}(1 + \log_2 m))\,m$$

where $n_{C,i}$ and $r_i$ are the number of encoded bits in $T_E$ and the number of ones in $T_{\text{diff}}$ for the $i$th core, respectively, and $n_{C,\max}$ and $r_{\max}$ are the number of encoded bits in $T_E$ and the number of ones in $T_{\text{diff}}$ for the core with the largest encoded test data. Therefore, the total testing time $T_I$ for $m$ cores when tested in parallel using the interleaving architecture is given by

$$T_I = t_{\text{prefix}} + t_{\text{tail}} = (n_{C,\max} - r_{\max}(1 + \log_2 m))\,m + (1 + m) R.$$

When the same $m$ cores are tested one after another without interleaving, the corresponding testing time $T_{NI}$ is the sum of the individual testing times, so that

$$T_{NI} - T_I = m\,[(|T_C| - n_{C,\max}) - (1 + \log_2 m)(R - r_{\max})]$$

where $|T_C| = \sum_i n_{C,i}$. Consider a hypothetical example of four cores with encoded test data sizes $n_{C,1} = 40$, $n_{C,2} = 60$, $n_{C,3} = 80$, $n_{C,4} = 100$ and numbers of ones $r_1 = 4$, $r_2 = 6$, $r_3 = 8$, $r_4 = 10$. Then $n_{C,\max} = 100$, $r_{\max} = 10$, $m = 4$, $R = 28$, and $|T_C| = 280$. Therefore $T_{NI} - T_I = 4((280 - 100) - (1 + 2)(28 - 10)) = 504$.
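The four-core example above can be reproduced numerically. The sketch below assumes that tail decoding always takes $m$ cycles (the modified decoder FSM) and that the non-interleaved time is the sum of the per-core times when the cores are tested one after another; the function name is hypothetical, and the non-interleaved model is an assumption chosen to be consistent with the stated difference of 504 cycles.

```python
from typing import List, Tuple


def interleaving_times(n_c: List[int], r: List[int], m: int) -> Tuple[int, int]:
    """Return (T_NI, T_I) in scan clock cycles for m cores sharing one ATE channel.

    n_c[i]: encoded test data size of core i, in bits
    r[i]:   number of ones in the difference sequence of core i
    """
    log_m = m.bit_length() - 1                     # log2(m), m assumed a power of two
    prefix = [(n - ri * (1 + log_m)) * m for n, ri in zip(n_c, r)]
    tails = [(1 + m) * ri for ri in r]             # separator plus fixed m-cycle tail
    t_non_interleaved = sum(prefix) + sum(tails)   # cores tested one after another
    t_interleaved = max(prefix) + sum(tails)       # prefixes decoded in parallel
    return t_non_interleaved, t_interleaved


t_ni, t_i = interleaving_times([40, 60, 80, 100], [4, 6, 8, 10], m=4)
saving = t_ni - t_i
```

Only the prefix decoding overlaps across cores in this model; the tails are still delivered one burst at a time, which is why the tail term appears unchanged in both totals.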
It is evident from the above analysis that the interleaving architecture reduces testing time and increases the ATE channel bandwidth. We developed a Verilog model for the FSM for $m = 4$ and simulated it for several $T_C$ sequences. The gate-level schematic (derived using Synopsys Design Compiler) of the channel selector FSM consists of only four flip-flops and 17 gates. The additional hardware overhead is therefore very small.
V. EXPERIMENTAL RESULTS
In this section, we present experimental results on Golomb coding of the precomputed test sets for the six largest ISCAS 89 benchmark circuits. We used test cubes (with dynamic compaction) obtained using the Mintest ATPG program [5]. The difference between the size of the test sequences here and in [9] can be explained as follows. Since the number of inputs driven by the scan chains is less in every case than the number of outputs that feed the scan chains, additional (dummy) zeros are inserted in the difference vector sequence $T^{R}_{\text{diff}}$. This procedure was explained in Section II.
The results shown in Table I demonstrate that a significant amount of compression is achieved if Golomb coding is applied to difference vectors obtained from the test set and the fault-free responses. In five out of six cases, we achieve better results than ATPG compaction using Mintest. In addition, the proposed method outperforms [9] in five out of the six cases. The upper bound values (derived from Theorem 1) represent the worst-case compression that can be achieved using Golomb codes. The upper bound is an important parameter that can be used to determine the suitability of the proposed method.

Table II demonstrates that Golomb coding allows us to use a slower tester without incurring any testing time penalty. As discussed in Section III, Golomb coding provides three important benefits: 1) it significantly reduces the volume of test data; 2) the test patterns can be applied to the core under test at the scan clock frequency $f_{\text{scan}}$ using an external tester that runs at frequency $f_{\text{ext}} = f_{\text{scan}}/m$; and 3) in comparison with external testing using ATPG-compacted patterns, the same testing time is achieved using a much slower tester. The third benefit is highlighted in Table II.

We next compare our results with a recent parallel scan design technique aimed at reducing test data volume and testing time [11]. A direct comparison is difficult since the two methods employ different strategies. Nevertheless, a comparison with the published results in [11] shows that the proposed method outperforms [11] for five out of the six largest ISCAS 89 benchmark circuits; see Table III. Moreover, the scan broadcast approach in [11] requires a structural model of a core for test generation and for determining aliased faults, a restriction that does not affect the proposed compression technique.

Testing time comparison between the two methods is especially difficult. The testing time in [11] is presented in terms of clock cycles, whereas the proposed method employs two different clock rates: a faster on-chip decode clock and a slower off-chip tester clock for feeding $T_E$. Hence, even though four to six times more clock cycles are required here compared to [11], we claim that the testing time is less since $f_{\text{scan}}$ is much larger than $f_{\text{ext}}$ and the use of 16 scan chains as in [11] can offer significantly more parallelism. (We assumed single scan chains for all the benchmarks in our experiments.)
The compression method presented here is directed at IP cores in SOCs, which are not BIST-ed and whose structural models are not available. Only the precomputed test sets are available to the system integrator. Greater compression can be achieved with LFSR-based reseeding in a BIST environment [10] . This, however, requires that the cores be BIST-ed and structural models be made available for fault simulation to identify easy faults and for test generation to determine test cubes for hard faults.
VI. CONCLUSION
We have presented a new test data compression and decompression method for testing embedded cores in an SOC. We have shown that the proposed scheme makes efficient use of Golomb codes and the internal scan chain(s) of the core under test to achieve high test data compression for SOCs and to save ATE memory and testing time.
We have also presented a novel interleaving decompression architecture that allows testing of multiple cores in parallel using a single ATE I/O channel. This further reduces the testing time of an SOC and increases the ATE I/O channel capacity. The additional logic for the SOC channel selector is small and easy to implement. In addition, it is independent of the multiple cores under test and their corresponding precomputed test sets. We have also shown that, apart from reducing the volume of test data, test data compression allows a slower tester to be used without any increase in testing time.
Experimental results for the ISCAS benchmarks show that the proposed scheme is very efficient for compressing test data. The results also show that ATPG compaction may not always be necessary for saving ATE memory and reducing testing time.
