effect on the production schedule is also becoming increasingly difficult. High transistor counts and aggressive clock frequencies require expensive automatic test equipment (ATE). More important, they introduce many problems into test development and manufacturing test that decrease product quality and increase cost and time to market. ATE costs have been rising steeply. A tester that can accurately test today's complex ICs costs several million dollars. According to the 1999 International Technology Roadmap for Semiconductors (http://public.itrs.net/Files/1999_SIA_Roadmap/Home.htm), the cost of a high-speed tester will exceed $20 million by 2010, and the cost of testing an IC with conventional methods will exceed fabrication cost. Conventional direct-probe testing methods have become inadequate and are no longer commercially practical. The increasing ratio of internal node count to external pin count makes most chip nodes inaccessible from system I/O pins, so controlling and observing these nodes and exercising the numerous internal states in the circuit under test is difficult. ATE I/O channel capacity, speed, accuracy, and data memory are limited. Therefore, design and test engineers need new techniques for decreasing data volume.
Test resource partitioning (TRP) offers a promising solution to these problems by moving some test resources from the ATE to the chip. Our new TRP approach, based on test data compression and on-chip decompression, reduces test data volume, decreases testing time, and allows the use of slower testers without decreasing test quality.
Overview
There are three main TRP techniques:
• Test set compaction. This technique reduces test data volume by compacting the partially specified test cubes generated by automatic test pattern generation (ATPG) algorithms. It requires no additional hardware investment. The test set is compacted through dynamic or static compaction procedures [1, 2]. However, test set compaction results in the application of fewer patterns to the SOC. Because every modeled fault is thus detected by fewer patterns, this approach can reduce unmodeled-fault coverage [3].
• Built-in self-test. BIST, an alternative to ATE-based external testing, offers several advantages: It lets precomputed test sets be embedded in test sequences generated by on-chip hardware, supports test reuse and at-speed testing, and protects intellectual property. Although BIST is now extensively used for memory testing, it is not as common for logic testing. This is particularly true for nonscan and partial-scan designs in which test vectors cannot be reordered and applying pseudorandom vectors can lead to serious bus contention problems during testing.
• Test data compression. Figure 2 shows test architectures based on $T_D$ and $T_{diff}$ and cyclical scan registers (CSRs). However, using $T_{diff}$ and CSRs is not always necessary. Directly encoding $T_D$ can also achieve significant compression.
Our TRP approach uses the third technique, which reduces test data volume more than test set compaction and is less expensive than BIST.
Run-length codes
To encode SOC test data, we first decompose it into either fixed-length or variable-length blocks. We then assign each block a code word, also of either fixed or variable length. Assigning a fixed-length code word to fixed-length data blocks doesn't lead to significant compression, so we must consider variable-to-fixed-length and variable-to-variable-length encoding.
Variable-to-fixed-length: conventional run-length coding
The first step in encoding test set $T_D$ is to generate a fully specified test set with long runs of 0s followed by a single 1. Run-length codes can be used to compress both difference-vector sequence $T_{diff}$ and $T_D$. Let $T_D = \{t_1, t_2, t_3, \ldots, t_n\}$ be the (ordered) precomputed test set. A straightforward heuristic procedure determines the ordering [6]. We define the difference-vector sequence as

$$T_{diff} = \{d_1, d_2, \ldots, d_n\} = \{t_1,\; t_1 \oplus t_2,\; t_2 \oplus t_3,\; \ldots,\; t_{n-1} \oplus t_n\},$$

where a bitwise exclusive-or operation is carried out between patterns $t_i$ and $t_{i+1}$. If uncompacted test set $T_D$ is used for compression, all the don't-care bits in $T_D$ are mapped to 0s to obtain a fully specified test set before compression.
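As a minimal sketch of this step, the Python below builds the difference-vector sequence from an ordered list of patterns written as bit strings. The function names and the use of 'x' as the don't-care symbol are assumptions made for illustration, and the pattern-ordering heuristic from [6] is not reproduced.

```python
def map_dont_cares_to_zero(cube: str) -> str:
    """Return a fully specified pattern by replacing each don't-care ('x') with 0."""
    return cube.lower().replace("x", "0")


def difference_vectors(patterns: list[str]) -> list[str]:
    """Return [t1, t1^t2, t2^t3, ...] for fully specified, equal-length patterns."""
    if not patterns:
        return []
    diffs = [patterns[0]]
    for prev, cur in zip(patterns, patterns[1:]):
        if len(prev) != len(cur):
            raise ValueError("all patterns must have the same length")
        # Bitwise exclusive-or of two consecutive patterns.
        diffs.append("".join("1" if a != b else "0" for a, b in zip(prev, cur)))
    return diffs


if __name__ == "__main__":
    t_d = ["0101", "0111", "0011"]          # hypothetical ordered test set
    print(difference_vectors(t_d))          # ['0101', '0010', '0100']
    print(map_dont_cares_to_zero("0x1x"))   # 0010
```

Consecutive patterns that differ in few bit positions produce difference vectors with long runs of 0s, which is what the run-length codes exploit.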
The next step is to select block size b. Once b is determined, the runs of 0s are mapped to groups of size $M + 1 = 2^b$. The length of the longest run of 0s determines the number of groups. The set of run lengths $\{0, 1, 2, \ldots, M - 1\}$ and a run of M 0s form group $A_1$; the set $\{M, M + 1, M + 2, \ldots, 2M - 1\}$ and a run of 2M 0s form group $A_2$; and so on. In general, the set of run lengths $\{(k - 1)M, (k - 1)M + 1, (k - 1)M + 2, \ldots, kM - 1\}$ and a run of kM 0s form group $A_k$. The code word size for the kth group is $k \log_2(M + 1) = kb$ bits. Table 1 shows the encoding.
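The grouping can be illustrated with a short Python sketch. It assumes one common code word assignment, since Table 1 is not reproduced here: the all-1s block means "M 0s, and the run continues," and the final block holds the remaining count in binary. The function name is hypothetical.

```python
def run_length_codeword(run: int, b: int) -> str:
    """Encode a run of `run` 0s terminated by a 1 using b-bit blocks."""
    if b < 1 or run < 0:
        raise ValueError("b must be >= 1 and run must be >= 0")
    big_m = 2**b - 1                       # M, where M + 1 = 2^b
    full_blocks, remainder = divmod(run, big_m)
    # Assumption: an all-1s block stands for M 0s with the run continuing.
    return "1" * b * full_blocks + format(remainder, f"0{b}b")


if __name__ == "__main__":
    for run in (0, 6, 7, 14):
        word = run_length_codeword(run, 3)
        print(run, word, len(word))        # lengths are 3, 3, 6, 9 bits
```

With b = 3, a run in group $A_k$ always produces 3k bits, so the code word grows by b bits each time the run length crosses a group boundary.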
Variable-to-variable-length: Golomb coding
The first step in the encoding procedure is to select Golomb code parameter m. For certain distributions of the input data stream ($T_{diff}$, in our case), group size m can be optimally determined. For example, if the input data stream is random with 0-probability p, then m should be chosen such that $p^m \approx 0.5$. However, because the difference vectors for precomputed test sets do not satisfy the randomness assumption, the best value of m for test data compression must be determined experimentally.
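For the random-stream case only, the rule $p^m \approx 0.5$ can be solved directly for m. The sketch below does that and rounds to a power of 2; both function names and the rounding rule are assumptions for illustration. For real test data it gives at most a starting point, since the best m must still be found by trying several values.

```python
import math


def ideal_group_size(p: float) -> float:
    """Solve p**m = 0.5 for m, where p is the probability of a 0 (0 < p < 1)."""
    if not 0.0 < p < 1.0:
        raise ValueError("p must lie strictly between 0 and 1")
    return math.log(0.5) / math.log(p)


def nearest_power_of_two(x: float) -> int:
    """Round x to the nearest power of 2 on a log scale, with a floor of 2."""
    if x <= 0:
        raise ValueError("x must be positive")
    return max(2, 2 ** round(math.log2(x)))


if __name__ == "__main__":
    m_real = ideal_group_size(0.84)        # hypothetical 0-probability
    print(round(m_real, 2), nearest_power_of_two(m_real))
```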
Once group size m is determined, the runs of 0s in the precomputed test set are mapped to groups of size m (each group corresponding to a run length). The length of the longest run of 0s in the precomputed test set determines the number of groups. The set of run lengths $\{0, 1, 2, \ldots, m - 1\}$ forms group $A_1$; the set $\{m, m + 1, m + 2, \ldots, 2m - 1\}$ forms group $A_2$; and so on. In general, the set of run lengths $\{(k - 1)m, (k - 1)m + 1, (k - 1)m + 2, \ldots, km - 1\}$ forms group $A_k$.
To each group $A_k$, we assign a group prefix of (k − 1) 1s followed by a 0. We denote this by $1^{(k-1)}0$. If m is chosen to be a power of 2 (that is, $m = 2^N$), each group contains $2^N$ members, and a sequence (a tail) of $\log_2 m$ bits uniquely identifies each member in the group. Thus, the final code word for run length L that belongs to group $A_k$ is composed of two parts: a group prefix and a tail. The prefix is $1^{(k-1)}0$, and the tail is a sequence of $\log_2 m$ bits. Table 2 shows an example of Golomb encoding.
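A minimal Python sketch of this encoding follows. It assumes that m is a power of 2, that the tail is the binary offset of the run length within its group (Table 2 is not reproduced here), and that the input stream ends with a 1. All function names are hypothetical.

```python
def zero_runs(bits: str) -> list[int]:
    """Split a bit string into lengths of runs of 0s, each terminated by a 1."""
    runs, count = [], 0
    for bit in bits:
        if bit == "0":
            count += 1
        elif bit == "1":
            runs.append(count)
            count = 0
        else:
            raise ValueError("input must contain only 0s and 1s")
    if count:
        raise ValueError("this sketch assumes the stream ends with a 1")
    return runs


def golomb_codeword(run: int, m: int) -> str:
    """Return prefix 1^(k-1) 0 followed by a log2(m)-bit tail."""
    if m < 2 or m & (m - 1):
        raise ValueError("this sketch assumes m is a power of 2, m >= 2")
    tail_bits = m.bit_length() - 1         # log2(m)
    group, offset = divmod(run, m)         # group = k - 1
    return "1" * group + "0" + format(offset, f"0{tail_bits}b")


def golomb_encode(bits: str, m: int) -> str:
    """Concatenate the code words of all runs in the stream."""
    return "".join(golomb_codeword(run, m) for run in zero_runs(bits))


if __name__ == "__main__":
    print(golomb_codeword(8, 4))                 # 11000
    print(golomb_encode("00010000011", 4))       # 0111001000
```

In the second call, the runs of length 3, 5, and 0 become 011, 1001, and 000, so 11 input bits are encoded in 10.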
Variable-to-variable-length: FDR coding
The need for FDR coding arises from the distribution of runs of 0s in typical test sets. We conducted a series of experiments on the large benchmark circuits from the International Symposium on Circuits and Systems (ISCAS) and studied the distribution of runs of 0s in $T_{diff}$ obtained from complete single stuck-at test sets for these circuits. Figure 3 illustrates this distribution for benchmark s9234. We found that the distributions were similar for other circuits' test sets. Figure 3 shows that the frequency of runs of 0s of length l
• is high for 0 ≤ l ≤ 20,
• is very low for l > 20, and
• decreases rapidly with increasing l even within the range 0 ≤ l ≤ 20.
If we use conventional run-length coding with block size b for compressing such test sets, every run of l 0s, $0 \le l \le 2^b - 1$, is mapped to a b-bit code word. This is clearly inefficient for the large number of short runs of 0s. Likewise, if we use Golomb coding with code parameter m, a run of l 0s is mapped to a code word with $\lfloor l/m \rfloor + 1 + \log_2 m$ bits. This is also inefficient for short runs of 0s. Clearly, test data compression is more efficient if the more frequently occurring runs of 0s are mapped to shorter code words. This leads us to the notion of FDR codes.
FDR code is constructed as follows: The runs of 0s are divided into groups $A_1, A_2, A_3, \ldots, A_k$, where k is determined by length $l_{max}$ of the longest run ($2^k - 2 \le l_{max} \le 2^{k+1} - 3$). Also, a run of length l is mapped to group $A_j$, where $j = \lceil \log_2(l + 3) \rceil - 1$. The ith group's size equals $2^i$; that is, $A_i$ contains $2^i$ members. Each code word consists of two parts: a group prefix and a tail. The group prefix identifies the group to which the run belongs, and the tail identifies the member within the group. Table 3 shows an example of FDR encoding. FDR code has the following properties:
• For any code word, prefix and tail are of equal length. For example, they are each one bit long for $A_1$, two bits long for $A_2$, and so on.
• The length of the prefix for group $A_i$ equals i. For example, the prefix is 2 bits long for group $A_2$.
• For any code word, the prefix is identical to the binary representation of the run length corresponding to the group's first element. For example, run-length 8 is mapped to group $A_3$, and this group's first element is run-length 6. Hence, the prefix of the code word for run-length 8 is 110.
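The construction and the three properties can be checked with a small Python sketch. It assumes the tail is the binary offset of the run length from the group's first element, which is consistent with the properties listed but is not confirmed here against Table 3. The function name is hypothetical.

```python
def fdr_codeword(run: int) -> str:
    """Return the FDR code word (prefix + tail) for a run of `run` 0s."""
    if run < 0:
        raise ValueError("run length must be non-negative")
    # Group index j; integer form of ceil(log2(run + 3)) - 1.
    j = (run + 2).bit_length() - 1
    first = 2**j - 2                       # first run length in group A_j
    prefix = format(first, f"0{j}b")       # binary form of the first element
    tail = format(run - first, f"0{j}b")   # assumed: offset within the group
    return prefix + tail


if __name__ == "__main__":
    for run in (0, 1, 2, 5, 6, 8):
        print(run, fdr_codeword(run))
    # Run length 8 falls in A_3 (first element 6), so its prefix is 110.
```

Because prefix and tail both have j bits, a run in group $A_j$ costs 2j bits, so the shortest runs get the shortest code words.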
September-October 2001
Run lengths are also mapped to groups in conventional run-length and Golomb coding. In run-length coding with block size b, the groups are of equal size, each containing $2^b$ elements. The number of code bits to which runs of 0s are mapped increases by b bits as we move from one group to another. In Golomb coding, the groups are also of equal size (m elements each), the tails of code words in different groups are of equal length ($\log_2 m$, where m is the code parameter), and the prefix increases by only 1 bit as we move from one group to another. Hence, Golomb coding is less effective when the runs of 0s spread far from an effective range determined by m. In FDR coding, on the other hand, the group size increases as the runs of 0s grow; that is, $A_i$ is smaller than $A_{i+1}$.
Figure 4 compares the three codes, showing the number of bits per code word for different-length runs of 0s. Conventional run-length code's performance is worse than that of Golomb code when run-length l exceeds 7. Golomb code's performance is worse than that of FDR code for l ≥ 24. FDR code outperforms the other two types for runs of lengths 0 and 1. Since these runs' frequencies are very high for precomputed test sets (Figure 3), FDR codes outperform run-length and Golomb codes for SOC test data compression.
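The comparison can be reproduced approximately by tabulating code word lengths. The sketch below assumes b = 3 for run-length coding and m = 4 for Golomb coding; the parameters actually used for Figure 4 are not stated in the text, so the crossover points printed here are illustrative only. The function names are hypothetical.

```python
def run_length_bits(run: int, b: int) -> int:
    """Bits used by conventional run-length coding: b per group crossed."""
    return b * (run // (2**b - 1) + 1)


def golomb_bits(run: int, m: int) -> int:
    """Bits used by Golomb coding: floor(run / m) + 1 + log2(m), m a power of 2."""
    return run // m + 1 + (m.bit_length() - 1)


def fdr_bits(run: int) -> int:
    """Bits used by FDR coding: prefix and tail of j bits each."""
    return 2 * ((run + 2).bit_length() - 1)


if __name__ == "__main__":
    print("run  run-length  Golomb  FDR")
    for run in (0, 1, 4, 7, 16, 24, 40, 64):
        print(f"{run:3d}  {run_length_bits(run, 3):10d}  "
              f"{golomb_bits(run, 4):6d}  {fdr_bits(run):3d}")
```

Under these assumed parameters, FDR needs 2 bits for runs of length 0 and 1 where the other two codes need 3, and Golomb coding never uses fewer bits than FDR once the run length reaches 24.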
Test data compression and decompression
Although the on-chip decoder designs are similar for the three codes we've described, we discuss only the Golomb decoder in this article. The decoder is simple, scalable, and independent of the core under test and the precomputed test set. Moreover, because it is small, it does not introduce significant hardware overhead.
The decoder decompresses encoded test set $T_E$ and outputs $T_{diff}$. The exclusive-or gate and the CSR generate test patterns from the difference vectors. A counter of $\log_2 m$ bits and a finite-state machine can efficiently implement the decoder. Figure 5 shows the decoder's block diagram. The input to the FSM is bit_in, and enable signal en is used to input the bit whenever the decoder is ready. Signal inc increments the counter, and rs indicates that the counter has finished counting. Signal out is the decoder output, and v indicates when the output is valid.
The decoder operates as follows:
• When the input is 1, the counter counts up to m. Signal en is low while the counter is busy counting and enables the input at the end of m cycles to accept another bit. The decoder outputs m 0s during this operation and makes v high.
Figure 6 shows the FSM state diagram corresponding to the decoder for m = 4. States S0 to S3 and S4 to S8 correspond to prefix and tail decoding, respectively. We synthesized the FSM using the Synopsys Design Compiler to assess the decoder's hardware overhead. The synthesized circuit contains only four flip-flops and 34 combinational gates. For a circuit whose test set is compressed using m = 4, the logic shown in the gate-level schematic is the only additional hardware required other than the counter. Thus, the decoder is independent of not only the core under test but also its precomputed test set. The amount of extra logic required for decompression is small and can be implemented easily, in contrast to the run-length decoder, which is not scalable and becomes increasingly complex for higher values of block length b.
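The decoder's input/output behavior can be modeled in software. The Python below is a behavioral sketch only, not a cycle-accurate model of the FSM in Figure 6: it assumes m is a power of 2 and that each tail is the binary offset of the run length within its group. The function name is hypothetical.

```python
def golomb_decode(encoded: str, m: int) -> str:
    """Expand a Golomb-coded bit string back into runs of 0s ended by 1s."""
    if m < 2 or m & (m - 1):
        raise ValueError("this sketch assumes m is a power of 2, m >= 2")
    tail_bits = m.bit_length() - 1             # log2(m)
    out, pos = [], 0
    while pos < len(encoded):
        if encoded[pos] == "1":
            out.append("0" * m)                # each prefix 1 yields m 0s
            pos += 1
        else:
            # Separator 0: the next log2(m) bits form the tail.
            tail = encoded[pos + 1 : pos + 1 + tail_bits]
            if len(tail) < tail_bits:
                raise ValueError("truncated tail")
            out.append("0" * int(tail, 2) + "1")
            pos += 1 + tail_bits
    return "".join(out)


if __name__ == "__main__":
    print(golomb_decode("0111001000", 4))      # 00010000011
```

The model depends only on m, which mirrors the point that the decoder is independent of both the core under test and its test set.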
Test application time
Here, we analyze the testing time for a single scan chain using Golomb coding with the test architecture shown in Figure 2. The Golomb decoder's state diagram indicates that
• each 1 in the prefix takes m cycles for decoding,
• each separator 0 takes one cycle, and
• the tail takes a maximum of m cycles and a minimum of $\gamma = \log_2 m + 1$ cycles.
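We let $n_C$ be the total number of bits in $T_E$, and r the number of 1s in $T_{diff}$. $T_E$ contains r tails and r separator 0s, and the number of prefix 1s in $T_E$ equals $n_C - r(1 + \log_2 m)$. Therefore, maximum and minimum testing times $T_{max}$ and $T_{min}$, measured by the number of cycles, are

$$T_{max} = m[n_C - r(1 + \log_2 m)] + r + mr = m n_C - r(m \log_2 m - 1),$$
$$T_{min} = m[n_C - r(1 + \log_2 m)] + r + \gamma r = m n_C - r(m + m \log_2 m - \log_2 m - 2).$$

Therefore, the difference between $T_{max}$ and $T_{min}$ is

$$\delta T = T_{max} - T_{min} = r(m - \log_2 m - 1).$$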
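These bounds follow directly from the per-bit cycle counts, as the short Python sketch below shows: each prefix 1 costs m cycles, each separator 0 costs one cycle, and each tail costs between $\log_2 m + 1$ and m cycles. The function name and the example values are hypothetical.

```python
def golomb_test_time_bounds(n_c: int, r: int, m: int) -> tuple[int, int]:
    """Return (T_max, T_min) in scan clock cycles for one scan chain."""
    if m < 2 or m & (m - 1):
        raise ValueError("this sketch assumes m is a power of 2, m >= 2")
    log2_m = m.bit_length() - 1
    prefix_ones = n_c - r * (1 + log2_m)       # bits that are prefix 1s
    if prefix_ones < 0:
        raise ValueError("n_c is too small for r tails and separators")
    separators = r                             # one cycle each
    t_max = m * prefix_ones + separators + m * r
    t_min = m * prefix_ones + separators + (log2_m + 1) * r
    return t_max, t_min


if __name__ == "__main__":
    t_max, t_min = golomb_test_time_bounds(n_c=100, r=10, m=4)
    print(t_max, t_min, t_max - t_min)         # 330 320 10
```

For these values the gap is r(m − log2(m) − 1) = 10 cycles, so the spread between best and worst case stays small when m is small.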
A major advantage of Golomb coding is that on-chip decoding can be carried out at scan clock frequency $f_{scan}$ while $T_E$ is fed to the core under test with external clock frequency $f_{ext} < f_{scan}$. This lets us use slower testers without increasing test application time. The external clock and the scan clocks must be synchronized such that $f_{scan} = m f_{ext}$, where Golomb code parameter m is usually a power of 2 [8]. Therefore, the decoder can generate the bits of $T_{diff}$ at $f_{scan}$.
Next, we analyze testing time, using $f_{scan} = m f_{ext}$, and compare our method's testing time with that of external testing applying ATPG-compacted patterns. We let the ATPG-compacted test set contain p patterns and the scan chain's length be n bits. Therefore, the ATPG-compacted test set's size is pn bits, and testing time $T_{ATPG}$ equals pn external clock cycles. Next, we let difference-vector sequence $T_{diff}$ obtained from the uncompacted test set contain r 1s and its Golomb-coded test set $T_E$ contain $n_C$ bits. Therefore, the maximum number of scan clock cycles required for applying test patterns in the Golomb coding scheme is $T_{max} = m n_C - r(m \log_2 m - 1)$.
If we are to accomplish testing in τ* seconds with Golomb coding, scan clock frequency $f_{scan}$ must equal $T_{max}/\tau^*$. That is, $f_{scan} = [m n_C - r(m \log_2 m - 1)]/\tau^*$. We meet this requirement using a slow external tester operating at frequency $f_{ext} = f_{scan}/m$. On the other hand, if we use only external testing with p ATPG-compacted patterns, the required external tester clock frequency $f'_{ext}$ is $pn/\tau^*$.
The ratio between $f'_{ext}$ and $f_{ext}$ is

$$\frac{f'_{ext}}{f_{ext}} = \frac{pn/\tau^*}{f_{scan}/m} = \frac{mpn}{m n_C - r(m \log_2 m - 1)}.$$
Our experimental results show that in all cases the ratio is greater than 1, demonstrating that Golomb-code-based TRP lets us decrease test data volume and use a slower tester without increasing testing time.
Interleaving decompression architecture
Our interleaving decompression architecture based on Golomb coding enables testing of multiple cores in parallel with a single ATE I/O channel. The architecture reduces testing time and test data volume stored in ATE memory and increases ATE I/O channel capacity.
As discussed earlier, when Golomb coding is applied to a data block containing a run of 0s followed by a single 1, the code word contains two parts: prefix and tail. For given code parameter m (group size), the tail's length $\log_2 m$ is independent of the run length. Every 1 in the prefix corresponds to m 0s in the decoded difference vector. Thus, the prefix consists of a string of 1s followed by a 0, and the 0 identifies the tail's beginning.
Whenever the decoder FSM receives a 1, it runs the counter for m decode cycles and starts decoding the tail as soon as it receives a 0. Tail decoding takes at most m cycles. During prefix decoding, the FSM must wait m cycles before the next prefix bit can be decoded. Therefore, we can test m cores in parallel by using interleaving to feed each core's decoder with encoded prefix data every m cycles. Interleaving can also be used to feed multiple scan chains in parallel, as long as their capture cycles are synchronized. Whenever a tail is to be decoded, the respective decoder is fed with the entire tail of $\log_2 m$ bits in a single burst of $\log_2 m$ cycles.
Figure 7 shows the SOC channel selector used for interleaving. It consists of a demultiplexer, an i-bit counter, and an FSM. Interleaving proceeds as follows: First, the SOC integrator combines the encoded test data for m cores to generate composite bit stream $T_C$, which is stored in the ATE. Next, $T_C$ is fed to the FSM, which detects the beginning of each tail and feeds the demultiplexer. An i-bit counter, where $i = \log_2 m$, selects the outputs to the various cores' decoders.
$T_C$ is obtained by interleaving the prefixes of each core's compressed test sets, but the tails are included unchanged. In the example in Figure 8, compressed data for two cores (generated using group size m = 2) have been interleaved to obtain $T_C$. The final encoded test set will be applied to multiple cores through decompression.
Let's describe the SOC channel selector in greater detail. The FSM detects the tail's beginning and generates clk_stop to stop the i-bit counter. Signals v in and v out indicate that data_in and data_out are valid. The i-bit counter connects to the demultiplexer's select lines, and the demultiplexer's outputs connect to the different scan chains' decoders. Each scan chain has a dedicated decoder. This decoder receives either a 1 or the tail of the compressed data corresponding to the various cores connected to the scan chain. If the FSM detects that a portion of the tail has arrived, the 0 that identifies the tail passes to the decoder, and clk_stop goes high for the next m cycles. The demultiplexer's output doesn't change during this period, and the entire tail passes continuously to the appropriate core.
Figures 9a and 9b show the FSM state diagram for m = 4 and the corresponding timing diagram. The FSM is fed $T_C$ corresponding to four different cores. It remains in state S0 as long as it receives 1s corresponding to the prefixes. As soon as it receives a 0, it outputs the entire tail unchanged and makes clk_stop high. This stops the i-bit counter and prevents any change at the demultiplexer output. The timing diagram shows that whenever the FSM receives a 0, SOC channel selection remains unchanged for the next m + 1 cycles.
The difference between maximum and minimum testing times for a single tail is $\delta t = m - \log_2 m - 1$. If we restrict m to be small (m ≤ 8), then δt ≤ 4. In this case, we can easily modify the Golomb decoder FSM by introducing additional states such that tail decoding always takes m cycles and δt = 0. As Figure 10 shows, three additional states are required to equalize tail and prefix decoding for m = 4. The additional states don't increase testing time and hardware overhead significantly.
There are m cores in parallel, and each separator 0 and tail takes m + 1 cycles to decode. For m cores, therefore, the decoding time for the separators and tails is

$$t_{tail} = (m + 1)R,$$

where

$$R = \sum_{i=1}^{m} r_i.$$

Because all cores' prefixes are decoded in parallel, the number of cycles $t_{prefix}$ required for decoding all the prefixes in $T_C$ is determined by the number of 1s in the prefixes of the core with the largest amount of encoded test data. Therefore,

$$t_{prefix} = m[n_{C,max} - r_{max}(1 + \log_2 m)],$$

where $n_{C,i}$ and $r_i$ are the number of encoded bits in $T_E$ and the number of 1s in $T_{diff}$ for the ith core, respectively. Moreover, $n_{C,max}$ and $r_{max}$ are the number of encoded bits in $T_E$ and the number of 1s in $T_{diff}$ for the core with the largest amount of encoded test data. Therefore, total testing time for m cores tested in parallel with interleaving is

$$T_I = t_{prefix} + t_{tail} = m[n_{C,max} - r_{max}(1 + \log_2 m)] + (m + 1)R.$$

We compare this with testing time $T_{NI}$ (NI denotes noninterleaved) required if we test all the cores one by one using a single ATE I/O channel:

$$T_{NI} = m[|T_C| - R(1 + \log_2 m)] + (m + 1)R,$$

where $|T_C|$ denotes the number of bits in $T_C$. The difference between the interleaved and noninterleaved testing times is

$$T_{NI} - T_I = m[(|T_C| - n_{C,max}) - (1 + \log_2 m)(R - r_{max})],$$

which is positive because $n_{C,max} \gg r_{max}$ and $|T_C| \gg R$. Consider a hypothetical example of four cores with an encoded test data size equal to $n_{C,1} = 40$, $n_{C,2} = 60$, $n_{C,3} = 80$, $n_{C,4} = 100$, and $r_1 = 4$, $r_2 = 6$, $r_3 = 8$, and $r_4 = 10$. Therefore, $n_{C,max} = 100$, $r_{max} = 10$, m = 4, R = 28, and $|T_C| = 280$. Finally, $T_{NI} - T_I = 4[(280 - 100) - (1 + 2)(28 - 10)] = 504$. This analysis shows that the interleaving architecture reduces testing time and increases ATE channel bandwidth.
We developed a Verilog model of the FSM for m = 4 and simulated it using $T_C$ = 1010110011011. We also synthesized the gate-level circuit of the channel selector FSM with the Synopsys Design Compiler. It contains only four flip-flops and 17 gates. Thus, additional hardware overhead is small.
Experimental results
We performed TRP on the large ISCAS benchmark circuits. We considered full-scan circuits for the proposed compression and decompression schemes. For full-scan circuits, we reordered patterns to achieve higher compression. For all full-scan circuits, we considered a single scan chain. We computed the compression percentage as $CP = \frac{|T_D| - |T_E|}{|T_D|} \times 100$, where $|T_D|$ is the test set size and $|T_E|$ is the encoded test set size.
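The compression percentage is simple to compute from the two sizes, as in the Python sketch below. The function name and the sizes in the usage lines are hypothetical and are not taken from the article's tables.

```python
def compression_percentage(td_bits: int, te_bits: int) -> float:
    """Return 100 * (|T_D| - |T_E|) / |T_D| for sizes given in bits."""
    if td_bits <= 0:
        raise ValueError("original test set size must be positive")
    return 100 * (td_bits - te_bits) / td_bits


if __name__ == "__main__":
    print(compression_percentage(1000, 400))   # 60.0
    print(compression_percentage(1000, 1100))  # -10.0: the encoding expanded the data
```

A negative value signals that the encoded test set is larger than the original, which can happen when the data contains few long runs of 0s.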
For our first experiment, we used difference-vector sequences ($T_{diff}$) obtained from partially specified test sets (test cubes). Table 4 presents results for test cubes obtained using dynamic compaction with the Mintest ATPG program [9]. The table compares the fully compacted Mintest test sets with the compression obtained from FDR, Golomb, and conventional run-length coding. The table lists precomputed (original) test set sizes ($T_D$), encoded test set sizes ($T_E$), and smallest ATPG-compacted (Mintest) test set sizes. We used a Sun Ultra 10 workstation with a 333-MHz processor and 256 Mbytes of DRAM.
Table 4 shows that FDR codes provide better compression than Golomb and conventional run-length codes in all cases. (Golomb code results reported here are better than those reported in an earlier publication [6] because we used an improved pattern-reordering heuristic for these experiments.) For circuit s38417, the increase for FDR codes was as much as 7% over Golomb codes. In all but one case, the encoded test set ($T_E$) size is much smaller than that of the Mintest-compacted test set.
The test cubes we used for s35932 were already highly compacted, so we didn't obtain high compression for this circuit. Nevertheless, in contrast to FDR codes, Golomb codes provided insignificant compression, and run-length codes provided no compression for this circuit. On average, the compression obtained with FDR codes was 7.49% higher than that obtained with Golomb codes and 19.56% higher than that obtained with conventional run-length codes. Test data compression always leads to encoded test sets smaller than ATPG-compacted test sets [6]. Moreover, test data compression decreases testing time by several orders of magnitude [10] and substantially reduces power consumption during scan testing.
Table 5 demonstrates that using test cubes $T_D$ (with all the don't-care bits mapped to 0s) also yields high compression. The advantage of using $T_D$ for compression is that the decompression architecture for on-chip pattern generation doesn't require a separate CSR. For circuits with long scan chains, equally long additional CSRs increase hardware overhead significantly. Therefore, compressing $T_D$ to generate the encoded test set not only yields smaller test sets but also reduces hardware overhead.
Table 6 shows that Golomb coding lets us use a slower tester without incurring a time penalty. In comparison with external testing using ATPG-compacted patterns, we achieved the same testing time using a much slower tester.
Overall, our experimental results for the ISCAS benchmarks show that the compression technique is very efficient for full-scan circuits and that ATPG compaction is not always necessary to save ATE memory and reduce testing time.
TEST DATA COMPRESSION offers a solution to the TRP problem for SOC designs. We are currently working on reduced-pin-count-testing (RPCT) and BIST techniques using test data compression.
