We present a data compression method and decompression architecture for testing embedded cores in a system-on-a-chip (SOC).
Introduction
System-on-a-chip (SOC) designs consisting of intellectual property (IP) cores present a number of difficult test challenges [1]. The volume of test data for an SOC is growing rapidly as IP cores become more complex and an increasing number of these cores are integrated in a chip.
However, the I/O channel capacity, speed and accuracy, and data memory of automatic test equipment (ATE) are severely limited. New techniques are therefore needed for decreasing test data volume in order to overcome memory bottlenecks and to reduce testing time.
A promising approach for reducing test data volume for SOCs is based on data compression techniques [2, 3]. In this approach, the precomputed test set $T_D$ provided by the core vendor is compressed (encoded) to a much smaller test set $T_E$ and stored in ATE memory. An on-chip decoder is used for pattern decompression to generate $T_D$ from $T_E$ during pattern application. Test data can be more efficiently compressed by exploiting the fact that the number of bits changing between successive patterns in a test sequence is generally very small. This observation was used in [3], where a "difference vector" sequence $T_{diff}$ determined from $T_D$ was compressed using run-length coding. A test architecture employing difference vectors and based on cyclical scan registers (CSRs) is sketched in Figure 1. A drawback of the compression method described in [3] is that it relies on variable-to-fixed-length codes, which are less efficient than more general variable-to-variable-length codes [5, 6]. Furthermore, it is inefficient for cores with internal scan chains that are used to capture test responses; in these circuits, separate CSRs increase hardware overhead.

This research was supported in part by the National Science Foundation under grant number CCR-9875324.

1530-1591/01 $10.00 © 2001 IEEE
A more efficient compression and decompression method was used in [7, 8], where $T_{diff}$ was compressed using variable-to-variable-length Golomb codes. However, this approach also requires separate CSRs and is therefore inefficient for cores that use the same internal scan chains for applying test patterns and capturing test responses.
In this paper, we present an improved test data compression/decompression method that makes effective use of Golomb codes and the internal scan chains of the core under test. No separate CSR is required for pattern decompression. The resulting encoded test set $T_E$ is much smaller than the original precomputed test set $T_D$. We apply our compression approach to test sets for the ISCAS 89 benchmark circuits and show that $T_E$ is not only considerably smaller than the smallest test sets obtained using ATPG compaction [4], but also significantly smaller than the compressed test sets obtained using Golomb coding in [7, 8]. We extend the decompression architecture of [7] to an interleaving scheme that allows multiple cores to be tested in parallel with a single ATE I/O channel. Finally, we present analytical results to show that test data compression not only reduces the volume of test data but also allows a slower tester to be used without any penalty on testing time.
Compression method and test architecture
We first review Golomb coding and its application to test data compression [5, 7]. In a Golomb code with parameter $m$, the run-lengths of 0s are partitioned into groups of $m$ members each; a prefix of 1s followed by a separator 0 identifies the group, and a $\log_2 m$-bit sequence (tail) uniquely identifies each member within the group.

The proposed method differs from [7] in that no separate CSR is used; instead, the internal scan chain is used for pattern decompression and the fault-free responses of the core under test are used to generate a difference vector set $T_{diff}$. As observed in [7], test data compression is more effective if $T_D$ consists of test cubes containing don't-care bits.
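The Golomb code reviewed at the start of this section can be illustrated in software. The following Python sketch encodes runs of 0s for a power-of-2 parameter m. The function names and the representation of bit streams as strings are assumptions made for illustration; they are not part of the proposed hardware.

```python
def golomb_encode_run(run_length: int, m: int) -> str:
    """Encode one run of 0s (terminated by a 1) as a Golomb codeword."""
    if m < 2 or m & (m - 1):
        raise ValueError("this sketch assumes m is a power of 2, m >= 2")
    tail_bits = m.bit_length() - 1            # log2(m)
    group, member = divmod(run_length, m)     # group index, position within group
    prefix = "1" * group + "0"                # prefix 1s plus the separator 0
    tail = format(member, f"0{tail_bits}b")   # log2(m)-bit tail
    return prefix + tail


def golomb_encode(bits: str, m: int) -> str:
    """Encode a 0/1 string as a concatenation of Golomb codewords.

    Assumption of this sketch: a non-empty input ends with a 1, so that
    every run of 0s is terminated.
    """
    if bits and not bits.endswith("1"):
        raise ValueError("input must end with a 1")
    codewords = []
    run = 0
    for bit in bits:
        if bit == "0":
            run += 1
        else:
            codewords.append(golomb_encode_run(run, m))
            run = 0
    return "".join(codewords)
```

For example, with m = 4 a run of five 0s falls in the second group and is encoded as the prefix 10 followed by the tail 01.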
In order to determine $T_{diff}$ in such cases, we need to assign appropriate binary values to the don't-care bits and perform logic simulation to obtain the corresponding fault-free responses. (In general, the simulation model for the core provided by the core vendor can be used to obtain the fault-free responses.) First, we set all don't-care bits in $t_1$, the first pattern in $T_D$, to 0s and use the logic simulation engine of FSIM [9] to generate the fault-free response $r_1$.
The problem of determining the best test ordering is equivalent to the NP-complete Traveling Salesman problem. Therefore, we use a greedy procedure. Suppose a partial ordering $t_1 t_2 \ldots t_i$ has already been determined for the patterns in $T_D$. To determine $t_{i+1}$, we first determine $r_i$ using FSIM and then calculate the Hamming distance $HD(r_i, t_j)$ between $r_i$ and all patterns $t_j$ that have not been placed in the ordered list. We select the pattern $t_j$ for which $HD(r_i, t_j)$ is minimum and add it to the ordered list, denoting it by $t_{i+1}$. All don't-care bits in $t_{i+1}$ are set to the corresponding specified bits in $r_i$. We continue this process until all test patterns in $T_D$ are placed in the ordered list. Figure 3 illustrates the procedure for obtaining $T_{diff}$ from $T_D$.
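The greedy ordering procedure can be sketched in Python as follows. Test cubes are represented as strings over 0, 1 and x, and `simulate` is a hypothetical stand-in for the logic simulator that returns the fault-free response to a fully specified pattern. The sketch assumes that patterns and responses have equal length. It also takes the first difference vector to be the first pattern itself and each later one to be the bitwise XOR of the previous response and the next pattern; this is an assumption of the sketch rather than a detail stated above.

```python
from typing import Callable, List, Tuple


def hamming_distance(response: str, cube: str) -> int:
    """Count positions where a specified bit of the cube differs from the response."""
    return sum(1 for r, t in zip(response, cube) if t != "x" and t != r)


def fill_from(response: str, cube: str) -> str:
    """Set every don't-care bit of the cube to the corresponding response bit."""
    return "".join(r if t == "x" else t for r, t in zip(response, cube))


def greedy_order(
    cubes: List[str], simulate: Callable[[str], str]
) -> Tuple[List[str], List[str]]:
    """Return (ordered fully specified patterns, difference vectors)."""
    if not cubes:
        return [], []
    remaining = list(cubes)
    first = remaining.pop(0).replace("x", "0")   # don't-cares of the first pattern set to 0
    ordered = [first]
    diffs = [first]                              # assumed: first difference vector is t_1
    while remaining:
        response = simulate(ordered[-1])         # fault-free response r_i
        best = min(
            range(len(remaining)),
            key=lambda j: hamming_distance(response, remaining[j]),
        )
        pattern = fill_from(response, remaining.pop(best))
        ordered.append(pattern)
        # assumed: difference vector is r_i XOR t_{i+1}
        diffs.append("".join("1" if a != b else "0" for a, b in zip(response, pattern)))
    return ordered, diffs
```

Filling the don't-care bits from the response means that those positions contribute 0s to the difference vector, which lengthens the runs of 0s seen by the Golomb encoder.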
For most scan chains in IP cores, the number of inputs driven by the scan cells is not equal to the number of outputs which feed the scan chain. The compression procedure can be easily augmented to handle these cases. However, the details are not presented here due to lack of space.
An on-chip decoder decompresses the encoded test set $T_E$ and produces $T_{diff}$.
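The function of the decoder can be modelled in software. The following Python sketch recovers a 0/1 sequence from a Golomb-encoded string for a power-of-2 parameter m. It is a behavioural illustration only: the name `golomb_decode` and the string representation are assumptions, and the cycle-level behaviour of the on-chip FSM is not modelled.

```python
def golomb_decode(code: str, m: int) -> str:
    """Expand a concatenation of Golomb codewords into runs of 0s, each ended by a 1."""
    if m < 2 or m & (m - 1):
        raise ValueError("this sketch assumes m is a power of 2, m >= 2")
    tail_bits = m.bit_length() - 1      # log2(m)
    out = []
    i = 0
    while i < len(code):
        run = 0
        while i < len(code) and code[i] == "1":
            run += m                    # each prefix 1 stands for m 0s
            i += 1
        if i >= len(code):
            raise ValueError("codeword ends inside the prefix")
        i += 1                          # consume the separator 0
        tail = code[i:i + tail_bits]
        if len(tail) < tail_bits:
            raise ValueError("codeword ends inside the tail")
        run += int(tail, 2)             # the tail selects the member within the group
        i += tail_bits
        out.append("0" * run + "1")
    return "".join(out)
```

With m = 4, the encoded string 1001011 expands to a run of five 0s and a run of three 0s, each terminated by a 1.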
Analytical results
In this section, we analyze the testing time for a single scan chain when Golomb coding is employed with the test architecture shown in Figure 2. From the state diagram of the Golomb decoder [7], we note that each '1' in the prefix part takes $m$ cycles for decoding, each separator '0' takes one cycle, and the tail part takes a maximum of $m$ cycles and a minimum of $\gamma = \log_2 m + 1$ cycles.
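Let $n_c$ be the total number of bits in $T_E$, and let $r$ be the number of 1s in $T_{diff}$. $T_E$ contains $r$ tail parts and $r$ separator 0s, and the number of prefix 1s in $T_E$ equals $n_c - r(1 + \log_2 m)$. Therefore, the maximum and minimum testing times ($T_{max}$ and $T_{min}$, respectively), measured by the number of cycles, are given by:

$$T_{max} = m\,(n_c - r(1 + \log_2 m)) + r + mr = m n_c - r(m \log_2 m - 1)$$

$$T_{min} = m\,(n_c - r(1 + \log_2 m)) + r + r(\log_2 m + 1) = m n_c - r(m \log_2 m + m - \log_2 m - 2)$$

Hence, the difference between $T_{max}$ and $T_{min}$ is given by $\delta\tau = T_{max} - T_{min} = r(m - \log_2 m - 1)$. We will make use of this result in Section 4.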
A major advantage of Golomb coding is that on-chip decoding can be carried out at the scan clock frequency $f_{scan}$, while $T_E$ can be fed to the core under test with external clock frequency $f_{ext} < f_{scan}$. This allows us to use slower testers without increasing the test application time. The external and scan clocks must be synchronized, e.g. using the scheme described in [10], and $f_{scan} = m f_{ext}$, where the Golomb code parameter $m$ is usually a power of 2. We now present an analysis of testing time using $f_{scan} = m f_{ext}$.
Let the ATPG-compacted test set contain $p$ patterns and let the length of the scan chain be $n$ bits. Therefore, the size of the ATPG-compacted test set is $pn$ bits and the testing time is $pn$ cycles. If testing is to be accomplished in $\tau^*$ seconds using Golomb coding, $f_{scan}$ must equal $T_{max}/\tau^*$, i.e. $f_{scan} = (m n_c - r(m \log_2 m - 1))/\tau^*$. This is achieved using a slow external tester operating at frequency $f_{ext} = f_{scan}/m$. On the other hand, if only external test is used with the $p$ ATPG-compacted patterns, the required external tester clock frequency $f'_{ext}$ equals $pn/\tau^*$. Let us take the ratio of $f'_{ext}$ to $f_{ext}$:

$$\frac{f'_{ext}}{f_{ext}} = \frac{pn/\tau^*}{T_{max}/(m\tau^*)} = \frac{mpn}{m n_c - r(m \log_2 m - 1)}$$

We next analyze the amount of compression that is achieved using Golomb coding of a precomputed test set $T_D$. Theorem 1 provides an easy-to-compute bound on the size of the encoded test set $T_E$. (The proof is omitted due to lack of space.) This bound depends only on the precomputed test set $T_D$ and is independent of the fault-free response. It can therefore be obtained without any logic simulation. We list these bounds for several ISCAS 89 circuits in Section 5.
Theorem 1 Let the number of don't cares in $T_D$ be $n_\phi$. If $T_{diff}$ is encoded using a Golomb code with parameter $m$, an upper bound on the length $G$ of the encoded sequence is given by $G \le n/m + (n - n_\phi) \log_2 m + (n - n_\phi)(1 - 1/m)$. Here $n$ denotes the total number of bits in $T_D$.
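The bound of Theorem 1 is straightforward to evaluate. The following Python sketch computes it from the total number of bits, the number of don't-care bits and the code parameter; the function and parameter names are assumptions made for illustration, and `n` is taken to be the total number of bits in the test set.

```python
def golomb_length_upper_bound(n: int, n_dont_care: int, m: int) -> float:
    """Upper bound on the encoded length G for a power-of-2 Golomb parameter m."""
    if m < 2 or m & (m - 1):
        raise ValueError("this sketch assumes m is a power of 2, m >= 2")
    if not 0 <= n_dont_care <= n:
        raise ValueError("n_dont_care must lie between 0 and n")
    log_m = m.bit_length() - 1
    specified = n - n_dont_care     # upper limit on the number of 1s in the difference vectors
    return n / m + specified * log_m + specified * (1 - 1 / m)
```

The bound shrinks as the fraction of don't-care bits grows, since every don't-care bit can be filled so that it contributes a 0 to the difference vector.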
Interleaving decompression architecture
We next present an interleaving decompression architecture which allows the concurrent testing of multiple cores using a single ATE I/O channel, thereby increasing the ATE I/O channel capacity.
As shown in [7], the FSM in the decoder runs the counter for $m$ decode cycles whenever a 1 is received and starts decoding the tail as soon as a 0 is received. The tail decoding takes at most $m$ cycles. During prefix decoding, the FSM has to wait for $m$ cycles before the next bit of the prefix can be decoded. Therefore, we can use interleaving to test $m$ cores (or feed $m$ scan chains) together, such that the decoder corresponding to each core is fed with encoded prefix data after every $m$ cycles. Whenever the tail is to be decoded (identified by a 0 in the encoded bit stream), the respective decoder is fed with the entire tail of $\log_2 m$ bits in a single burst of $\log_2 m$ cycles. The SOC channel selector, consisting of a demultiplexer, a $\log_2 m$-bit counter and an FSM, is used for interleaving; see Figure 4.
First, the encoded test data for $m$ cores are combined to generate a composite bit stream $T_C$ that is stored in the ATE. $T_C$ is obtained by interleaving the prefix parts of the compressed test sets of each core, but the tails are included unchanged. Next, $T_C$ is fed to the FSM, which is used to detect the beginning of each tail and to feed the demultiplexer.

An i-bit counter ($i = \log_2 m$) is used to select the outputs to the decoders of the various cores. The FSM generates the clk-stop signal to stop the i-bit counter. The data-in is the input to the FSM, data-out is the output, and signals win and wout are used to indicate that the input and output data are valid. The i-bit counter is connected to the select lines of the demultiplexer and the demultiplexer outputs are connected to the decoders of the different scan chains. Every scan chain has a dedicated decoder. If the FSM detects that a portion of the tail has arrived, the 0 that is used to identify the tail is passed to the decoder and the clk-stop goes high for the next $m$ cycles.
The output of the demultiplexer does not change for this period and the entire tail of $\log_2 m$ bits is passed on continuously to the appropriate core.
The state diagram of the FSM for $m = 4$ is shown in Figure 5. The FSM is fed with $T_C$ corresponding to four different cores. It remains in state S0 as long as it receives the 1s corresponding to the prefixes. As soon as a 0 is received, it outputs the entire tail unchanged and makes clk-stop high. This stops the i-bit counter and prevents any change at the demultiplexer output.

It is evident from the above analysis that the interleaving architecture reduces testing time and increases the ATE channel bandwidth.
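The construction of the composite bit stream and the routing performed by the channel selector can be modelled functionally. The following Python sketch is an illustration under simplifying assumptions: it treats each encoded stream as a sequence of slots (a single prefix 1, or a separator 0 together with its tail), visits the m cores in round-robin order, and requires every core to contribute the same number of slots. The function names are hypothetical, and the cycle-level behaviour of the FSM, the counter and the clk-stop signal is not modelled.

```python
from typing import List


def _tail_bits(m: int) -> int:
    if m < 2 or m & (m - 1):
        raise ValueError("this sketch assumes m is a power of 2, m >= 2")
    return m.bit_length() - 1


def split_slots(code: str, m: int) -> List[str]:
    """Split an encoded stream into slots: '1', or '0' followed by its tail."""
    width = 1 + _tail_bits(m)
    slots, i = [], 0
    while i < len(code):
        step = 1 if code[i] == "1" else width
        if i + step > len(code):
            raise ValueError("stream ends inside a tail")
        slots.append(code[i:i + step])
        i += step
    return slots


def interleave(codes: List[str], m: int) -> str:
    """Build the composite stream for m cores, one slot per core per round."""
    if len(codes) != m:
        raise ValueError("this sketch interleaves exactly m streams")
    per_core = [split_slots(code, m) for code in codes]
    if len({len(slots) for slots in per_core}) > 1:
        raise ValueError("this sketch assumes equal slot counts for all cores")
    return "".join("".join(round_slots) for round_slots in zip(*per_core))


def deinterleave(composite: str, m: int) -> List[str]:
    """Route the composite stream back to the m per-core encoded streams."""
    width = 1 + _tail_bits(m)
    outputs: List[List[str]] = [[] for _ in range(m)]
    i, select = 0, 0
    while i < len(composite):
        step = 1 if composite[i] == "1" else width   # a 0 carries its tail along
        if i + step > len(composite):
            raise ValueError("stream ends inside a tail")
        outputs[select].append(composite[i:i + step])
        i += step
        select = (select + 1) % m                    # counter advances to the next core
    return ["".join(parts) for parts in outputs]
```

In this model the selector advances by one core after each prefix bit and after each complete tail, which mirrors the way the tail is passed on unchanged to a single decoder.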
Experimental results
In this section, we present experimental results on the ISCAS 89 benchmark circuits. The upper bound values (derived from Theorem 1) represent the worst-case compression that can be achieved using Golomb codes. The upper bound is an important parameter which can be used to determine the suitability of the proposed method. Table 2 demonstrates that Golomb coding allows us to use a slower tester without incurring any testing time penalty. As discussed in Section 3, Golomb coding provides three important benefits: (i) it significantly reduces the volume of test data, (ii) the test patterns can be applied at the scan clock frequency $f_{scan}$ using an external tester that runs at frequency $f_{ext} = f_{scan}/m$, and (iii) in comparison with external testing using ATPG-compacted patterns, the same testing time is achieved using a much slower tester. The third benefit is highlighted in Table 2.
Conclusions
We have shown that the use of Golomb codes and the internal scan chains of the embedded cores offers significant test data compression for SOCs, leading to a reduction in ATE memory and testing time. We have also presented a novel interleaving decompression architecture that allows testing of multiple cores in parallel using a single ATE I/O channel. This reduces the testing time of an SOC further and increases the ATE I/O channel capacity. We have shown that test data compression also allows a slower tester to be used without any increase in testing time.
Experimental results for the ISCAS benchmarks show that the proposed scheme is very efficient for compressing test data. We are currently extending the test architecture to ensure that certain patterns are not applied to the core under test due to constraints such as bus contention.
