We showed recently that Golomb codes can be used for efficiently compressing system-on-a-chip test data. We now present a new class of variable-to-variable-length compression codes that are designed using the distributions of the runs of 0s in typical test sequences. We refer to these as frequency-directed run-length (FDR) codes. We present experimental results for the ISCAS 89 benchmark circuits to show that FDR codes outperform Golomb codes for test data compression. We also present a decompression architecture for FDR codes, and an analytical characterization of the amount of compression that can be expected using these codes. Analytical results show that FDR codes are robust, i.e. they are insensitive to variations in the input data stream.
Introduction
Test data volume is a major problem encountered in the testing of system-on-a-chip (SOC) designs [1]. A typical SOC consists of several intellectual property (IP) blocks, each of which must be exercised by a large number of precomputed test patterns. The increasingly high volume of SOC test data is not only exceeding the memory and I/O channel capacity of commercial automatic test equipment (ATE), but is also leading to excessively high testing times. Data compression techniques that reduce test data volume are therefore of considerable interest.
The testing time of an SOC directly impacts test cost. It is determined by several factors, including the test data volume, the time required to transfer test data to the cores, the rate at which the test patterns are transferred (measured by the test data bandwidth and the ATE channel capacity), and the maximum scan chain length. For a given ATE channel capacity and test data bandwidth, testing time can be reduced by reducing the test data volume and by redesigning the scan chains. While test data volume reduction techniques can be applied to both soft and hard cores, scan chains cannot be modified in hard (IP) cores. New techniques are therefore needed to reduce the test data volume, decrease testing time, and overcome ATE memory limitations for SOCs containing IP cores.

This research was supported in part by the National Science Foundation under grant number CCR-9875324.
Built-in self-test (BIST) has emerged as an alternative to ATE-based external testing [2]. BIST offers a number of key advantages. It allows precomputed test sets to be embedded in the test sequences generated by on-chip hardware, supports test reuse and at-speed testing, and protects intellectual property. While BIST is now extensively used for memory testing, it is not as common for logic testing. Test vectors for non-scan and partial-scan designs cannot be reordered, and they are harder to embed in a BIST generator. For full-scan designs, pseudorandom vectors can lead to serious bus contention problems during test application. Moreover, BIST can be applied to SOC designs only if the IP cores in them are BIST-ready. Since most currently available IP cores are not BIST-ready, BIST insertion in SOCs containing these circuits is expensive and requires considerable redesign.
An alternative approach for reducing test data volume for SOCs is based on the use of data compression techniques such as statistical coding, run-length coding, and Golomb coding [3-8]. In this approach, the precomputed test set T_D provided by the core vendor is compressed (encoded) to a much smaller test set T_E and stored in the ATE memory; see Figure 1. An on-chip decoder is used to generate T_D from T_E during pattern application. It was shown in [5, 6, 7] that compressing a "difference vector" sequence T_diff determined from T_D results in smaller test sets and reduced testing time. Figure 2 shows the test architecture based on T_diff and cyclical scan registers (CSRs).
While previous research has clearly demonstrated that data compression offers a practical solution to the problem of reducing test data volume, the compression codes used in prior work were derived from other application areas. For example, the statistical codes used in [3] and [4] are motivated by pattern repetitions in large text files. Similarly, the run-length and Golomb codes used in [5, 6, 7] are more effective for encoding large files containing image data. None of these codes is tailored to exploit the specific properties of precomputed test sets for logic circuits. While an attempt was made in [6, 7] to customize the Golomb code by choosing an appropriate code parameter, the basic structure of the code was still independent of the test set. We can therefore expect even greater reduction in test data volume by crafting compression codes that are based on the generic properties of test sets.

Figure 1. A conceptual architecture for testing a system-on-chip by storing the encoded test data T_E in ATE memory and decoding it using on-chip decoders.

Figure 2. Decompression architecture based on a cyclical scan register (CSR).
In this paper, we present a new class of variable-to-variable-length compression codes that are designed using the distributions of the runs of 0s in typical test sequences. In this way, the code can be tailored to our application domain, i.e. SOC test data compression. We refer to these as frequency-directed run-length (FDR) codes. For simplicity, we also refer to an instance of this class of codes as an FDR code. We show that the FDR code outperforms both Golomb codes and conventional run-length codes. We also show that the FDR code can be effectively applied to both the difference vector sequence T_diff and the precomputed test set T_D. The latter is especially attractive since it eliminates the need for a separate CSR for decompression. Additional contributions of this paper include a novel decompression architecture for FDR codes, and an analytical characterization of the amount of data compression that can be expected using these codes.
The organization of the paper is as follows. In Section 2, we first motivate the new FDR code and then describe its construction. In Section 3, we determine the best-case and the worst-case compression that can be achieved given some generic parameters of the precomputed test set. We describe some extensions to the basic FDR code and the decompression architecture in Section 4. Finally, in Section 5, we present experimental results for the large ISCAS 89 benchmark circuits.

The ordering is determined using a heuristic procedure described later. T_diff is defined as follows:

$$T_{\mathrm{diff}} = \{\, t_1,\; t_1 \oplus t_2,\; t_2 \oplus t_3,\; \ldots,\; t_i \oplus t_{i+1},\; \ldots \,\}$$

where a bit-wise exclusive-or operation is carried out between patterns t_i and t_{i+1}. This assumes that the CSR starts in the all-0 state (other starting states can be considered similarly). If the uncompacted test set T_D is used for compression, all the don't-care bits in T_D are mapped to 0s to obtain a fully-specified test set before compression.
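As a rough illustration of the two preprocessing steps above, the following Python sketch maps don't-cares to 0 and then forms T_diff by XOR-ing each pattern with its predecessor, starting from an all-0 CSR state. Patterns are assumed to be strings over {0, 1, X}, and the function names are hypothetical. Filling every don't-care with 0 before differencing is a simplifying assumption here; the text states this mapping only for the case where T_D is encoded directly.

```python
def fill_dont_cares(cube: str) -> str:
    """Map every don't-care bit ('X' or 'x') of a test cube to 0."""
    return cube.replace("X", "0").replace("x", "0")


def difference_vectors(patterns: list[str]) -> list[str]:
    """Return T_diff = [t1, t1 xor t2, t2 xor t3, ...] for fully specified patterns.

    The "previous" pattern starts as all 0s, modelling a CSR that starts in the all-0 state.
    """
    if not patterns:
        return []
    prev = "0" * len(patterns[0])
    diffs = []
    for p in patterns:
        if len(p) != len(prev):
            raise ValueError("all patterns must have the same length")
        # bit-wise exclusive-or between consecutive patterns
        diffs.append("".join("1" if a != b else "0" for a, b in zip(prev, p)))
        prev = p
    return diffs


cubes = ["1X0X", "10X1", "1101"]
t_diff = difference_vectors([fill_dont_cares(c) for c in cubes])
print(t_diff)  # ['1000', '0001', '0100']
```

With a single scan chain, the vectors in `t_diff` would be concatenated into one bit stream before run-length encoding.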
We now present some important observations about the distribution of runs of 0s in typical test sets. We conducted a series of experiments for the large ISCAS benchmark circuits and studied the distribution of the runs of 0s in T_diff obtained from complete single stuck-at test sets for these circuits. ... bits. This is also inefficient for short runs of 0s. Clearly, test data compression is more efficient if the runs of 0s that occur more frequently are mapped to shorter codewords. This leads us to the notion of FDR codes. ...

Each codeword consists of two parts: a group prefix and a tail. The group prefix identifies the group to which the run belongs, and the tail identifies the member within the group. The encoding procedure is shown in Figure 4. The FDR code has the following properties:
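The run-length distributions discussed above can be gathered with a few lines of code. This sketch (hypothetical helper names) splits a bit stream into runs of 0s, each terminated by a 1, and tallies how often each run-length occurs. It assumes the stream ends with a 1, since the handling of a trailing run of 0s is not specified here.

```python
from collections import Counter


def run_lengths(bits: str) -> list[int]:
    """Split a bit stream into runs of 0s, each terminated by a 1."""
    runs, current = [], 0
    for b in bits:
        if b == "0":
            current += 1
        elif b == "1":
            runs.append(current)  # the 1 closes the current run
            current = 0
        else:
            raise ValueError(f"unexpected symbol {b!r}")
    if current:
        raise ValueError("assumed: the stream ends with a 1")
    return runs


def run_length_histogram(bits: str) -> Counter:
    """Frequency of each run-length, i.e. k_i = number of runs of length i."""
    return Counter(run_lengths(bits))


print(run_lengths("1001010001"))  # [0, 2, 1, 3]
```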
• For any codeword, the prefix and tail are of equal length. For example, the prefix and the tail are each one bit long for A_1, two bits long for A_2, etc.
• The length of the prefix for group A_i equals i. For example, the prefix is 2 bits long for group A_2.
• For any codeword, the prefix is identical to the binary representation of the run-length corresponding to the first element of the group. For example, run-length 8 is mapped to group A_3, and the first element of this group is run-length 6. Hence the prefix of the codeword for run-length 8 is 110. The codeword size increases by two bits (one bit for the prefix and one bit for the tail) as we move from group A_i to group A_{i+1}.
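The three properties above determine every codeword, so the encoding of Figure 4 can be sketched in Python as follows. The sketch assumes, consistent with the examples above, that group A_i covers the run-lengths 2^i - 2 through 2^(i+1) - 3, so that A_1 = {0, 1}, A_2 = {2, ..., 5} and A_3 = {6, ..., 13}. The function names are hypothetical, and the input stream is assumed to end with a 1.

```python
def fdr_codeword(run_length: int) -> str:
    """FDR codeword for a run of `run_length` 0s (terminated by a 1)."""
    if run_length < 0:
        raise ValueError("run length must be non-negative")
    i = 1
    while run_length > 2 ** (i + 1) - 3:  # find the group A_i that contains the run
        i += 1
    first = 2 ** i - 2                    # first run-length in group A_i
    prefix = "1" * (i - 1) + "0"          # equals the binary representation of `first`
    tail = format(run_length - first, f"0{i}b")  # i-bit offset of the run within A_i
    return prefix + tail


def fdr_encode(bits: str) -> str:
    """Concatenate the FDR codewords of all runs of 0s in a bit stream."""
    out, run = [], 0
    for b in bits:
        if b == "0":
            run += 1
        elif b == "1":
            out.append(fdr_codeword(run))
            run = 0
        else:
            raise ValueError(f"unexpected symbol {b!r}")
    if run:
        raise ValueError("assumed: the stream ends with a 1")
    return "".join(out)


print(fdr_codeword(8))           # 110010
print(fdr_encode("1001010001"))  # 001000011001
```

Run-length 8 falls in A_3 (first element 6), so its prefix is 110 and its tail is the 3-bit offset 010, matching the example in the third property.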
Note that run-lengths are also mapped to groups in conventional run-length and Golomb coding. In run-length coding with block size b, the groups are of equal size, each containing 2^b elements, and the number of code bits to which runs of 0s are mapped increases by b bits as we move from one group to the next. In Golomb coding, the groups are also of equal size, each containing m run-lengths, where m is the code parameter: the tails of Golomb codewords in different groups are of equal length (log2 m), and the prefix increases by only one bit as we move from one group to the next. In FDR coding, on the other hand, the group size increases as we consider longer runs of 0s, i.e. A_i is smaller than A_{i+1}. Hence ...

Figure 5 shows the number of bits per codeword for runs of 0s of different lengths. It can be seen from the figure that the performance of the conventional run-length code is worse than that of the Golomb code when the run-length l exceeds seven. The performance of the Golomb code is worse than that of the FDR code for l ≥ 24. We also note that the FDR code outperforms the other two types of codes for runs of length zero and one. Since runs of length zero and one are very frequent in precomputed test sets (Figure 3), FDR codes outperform run-length and Golomb codes for SOC test data compression.
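To reproduce the kind of comparison plotted in Figure 5, the codeword lengths can be computed directly. In this sketch the FDR length is 2i bits for a run in group A_i, and the Golomb length for a power-of-two parameter m is the unary-coded quotient (floor(l/m) + 1 bits) plus log2 m tail bits. The value m = 4 is an assumption, since the parameter used for Figure 5 is not stated here, and the conventional run-length code is omitted.

```python
def fdr_length(l: int) -> int:
    """FDR codeword length: 2*i bits for a run of l 0s in group A_i."""
    group = (l + 2).bit_length() - 1  # i such that 2^i - 2 <= l <= 2^(i+1) - 3
    return 2 * group


def golomb_length(l: int, m: int = 4) -> int:
    """Golomb codeword length for a power-of-two parameter m."""
    if m < 1 or m & (m - 1):
        raise ValueError("m must be a power of two")
    return l // m + 1 + (m.bit_length() - 1)  # unary quotient + log2(m) tail bits


for l in (0, 1, 7, 16, 24, 40):
    print(l, fdr_length(l), golomb_length(l))
```

With m = 4, the FDR codeword is shorter for l = 0 and 1 (2 bits versus 3) and for l = 24 (8 bits versus 9), while the Golomb codeword is shorter for mid-range lengths such as l = 7 or l = 16. This is why the frequency of short runs matters so much for the comparison.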
Analysis of FDR codes
In this section, we develop an analysis technique to determine the worst-case and best-case compression that can be achieved using FDR codes for some generic parameters of precomputed test sets. Suppose T_diff (or T_D if it is encoded directly) contains r 1s and a total of n bits. We first determine C_max, the number of bits in the encoded test set T_E in the worst case, i.e. when the compression is the least effective. In doing so, we also determine a distribution of the runs of 0s that gives rise to this worst-case compression.
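Suppose T_diff contains k_i runs of length i, with maximum run-length l_max.
Let the size of the encoded test set T_E be F bits, and let δ = F - (n - r) measure the amount of compression achieved using FDR codes. If the FDR coding procedure of Figure 4 is applied to T_diff, then

$$\delta = \sum_{i=0}^{l_{\max}} \bigl(2\lfloor \log_2 (i+2) \rfloor - i\bigr)\, k_i = 2k_0 + k_1 + 2k_2 + k_3 - k_5 - k_7 - 2k_8 - 3k_9 - \cdots$$

where the coefficients of k_4 and k_6 are zero.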
This can be explained as follows: for each run of 0s of length i, we compare the run-length i with the size of the corresponding codeword. For example, the codeword for a run of length 0 contains two bits (two more than the run-length), the codeword for run-length 1 also contains two bits (one more than the run-length), the codeword for run-length 4 contains four bits (the same as the run-length), and so on. The difference between these two quantities contributes to δ, and it appears as the coefficient of the corresponding k_i term in the equation for δ.
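The bookkeeping behind the equation for δ can be checked numerically. The Python sketch below (hypothetical names; the input stream is assumed to end with a 1) computes δ from the run counts k_i using the per-run coefficient (codeword length - i), and compares it with F - (n - r) obtained by summing the codeword lengths directly.

```python
from collections import Counter


def fdr_length(i: int) -> int:
    """Length of the FDR codeword for a run of i 0s: 2 bits per group index."""
    return 2 * ((i + 2).bit_length() - 1)


def delta_from_counts(k: Counter) -> int:
    """delta = sum over i of k_i * (codeword length - i)."""
    return sum(count * (fdr_length(i) - i) for i, count in k.items())


def check_delta(bits: str) -> tuple[int, int]:
    """Return (delta from the k_i's, F - (n - r) from direct encoding)."""
    if not bits.endswith("1"):
        raise ValueError("assumed: the stream ends with a 1")
    runs = [len(chunk) for chunk in bits.split("1")[:-1]]  # i 0s before each 1
    n, r = len(bits), bits.count("1")
    encoded_size = sum(fdr_length(i) for i in runs)  # F
    return delta_from_counts(Counter(runs)), encoded_size - (n - r)


print(check_delta("1001010001"))  # (6, 6)
```

The coefficients `fdr_length(i) - i` for i = 0, ..., 9 come out as 2, 1, 2, 1, 0, -1, 0, -1, -2, -3, which are the coefficients of k_0 through k_9 in the equation for δ.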
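We next use the following simple integer linear programming (ILP) model to determine the maximum value of δ. This yields the worst-case compression (C_max) using FDR codes.

Maximize:

$$\delta = 2k_0 + k_1 + 2k_2 + k_3 - k_5 - k_7 - 2k_8 - 3k_9 - \cdots \quad (\text{up to } k_{l_{\max}})$$

subject to: (1) $\sum_{i=0}^{l_{\max}} i\,k_i = n - r$, and (2) $\sum_{i=0}^{l_{\max}} k_i = r$, with each k_i a non-negative integer.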
This ILP model can be easily solved, e.g. using a solver such as lpsolve [9], to obtain the worst-case values for the k_i's. Note that even though l_max appears in the above ILP model, we do not make any explicit use of it. Our goal here is to determine a worst-case distribution of the runs of 0s. Generally, short run-lengths yield the worst-case compression; however, if l_max must exceed a minimum value to satisfy constraints (1) and (2) ... The last column shows a distribution of runs for which the worst-case compression is achieved (a/b indicates a runs of length b). Note that this distribution is not unique, since a number of run-length distributions can yield the worst-case compression. Note also that the worst-case percentage compression is negative when r is high relative to n; this is unlikely to be the case for test sets (with don't-cares mapped to 0s) or for difference vector sequences, for which r is generally very small. It was shown in [6] that for the ISCAS 89 benchmark circuits, the typical value of r is only 10% of n.
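For small instances, the ILP models can be cross-checked without an LP solver. The Python sketch below swaps lpsolve for an exact dynamic program (a substitute technique, not the method used in the text): it chooses r runs, each of length at most l_max, whose lengths sum to n - r, and maximizes (worst case) or minimizes (best case) δ. The encoded size is then F = δ + (n - r), and percentage compression is taken here as 100 (n - F)/n, which is an assumed definition. Function names are hypothetical, and the running time grows as r (n - r) l_max, so this is only practical for small n.

```python
def fdr_weight(i: int) -> int:
    """Coefficient of k_i in delta: FDR codeword length minus run-length i."""
    return 2 * ((i + 2).bit_length() - 1) - i


def extreme_delta(n: int, r: int, l_max: int, maximize: bool) -> int:
    """Optimal delta subject to sum(i * k_i) = n - r and sum(k_i) = r, with i <= l_max."""
    zeros = n - r
    pick = max if maximize else min
    # dp[s]: best delta over the runs placed so far whose lengths sum to s (None = unreachable)
    dp = [0] + [None] * zeros
    for _ in range(r):  # place one run (one terminating 1) per iteration
        nxt = [None] * (zeros + 1)
        for s in range(zeros + 1):
            for i in range(min(s, l_max) + 1):
                prev = dp[s - i]
                if prev is not None:
                    cand = prev + fdr_weight(i)
                    nxt[s] = cand if nxt[s] is None else pick(nxt[s], cand)
        dp = nxt
    if dp[zeros] is None:
        raise ValueError("no run distribution satisfies the constraints")
    return dp[zeros]


def percent_compression(n: int, r: int, delta: int) -> float:
    """Percentage compression for an encoded size F = delta + (n - r)."""
    return 100.0 * (n - (delta + n - r)) / n


worst = extreme_delta(10, 5, l_max=5, maximize=True)
best = extreme_delta(10, 5, l_max=5, maximize=False)
print(worst, percent_compression(10, 5, worst))  # 9 -40.0
print(best, percent_compression(10, 5, best))    # 5 0.0
```

The toy case n = 10, r = 5 has r at half of n, far above the roughly 10% typical of the benchmark circuits, which is why the worst-case compression comes out negative.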
Next we analyze the best-case compression achieved using FDR codes for any given n and r. Since the compression is better for longer run-lengths, we also need to constrain the maximum run-length in this case. As before, we formulate this problem using ILP, and the following model can be solved using lpsolve to obtain a best-case distribution of runs and C_min, the number of bits in the encoded test set in the best case.
Minimize:

$$\delta = 2k_0 + k_1 + 2k_2 + k_3 - k_5 - k_7 - 2k_8 - 3k_9 - \cdots \quad (\text{up to } k_{l_{\max}})$$

subject to: (1) $\sum_{i=0}^{l_{\max}} i\,k_i = n - r$, and (2) $\sum_{i=0}^{l_{\max}} k_i = r$.

Table 1(b) lists the run-length distributions corresponding to the best-case compression using FDR codes. The corresponding percentage compression values are also listed. In Figure 6, we plot the lower and upper bounds on the percentage compression ...

... frequency of the runs in any group A_i exceeds the cumulative frequency of the runs in group A_{i+1}. However, for precomputed test sets, the run-length frequencies do not always decrease monotonically. For such non-monotonically decreasing run-length distributions, the compression can be increased by extending the basic FDR code as described below.
For each group A_i, we calculate the cumulative frequency of the run-lengths in that group. This is done by simply adding the frequencies of the run-lengths in that group. Next, instead of assigning the group prefix as shown in Figure 4, we assign the prefix based on the cumulative frequency of that group. A group with a large cumulative frequency is assigned a short prefix. In this way, the size of the encoded test set can be reduced by carrying out a small amount of pre-processing, and by using a mapping logic block (outlined later) in the decoder.
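A minimal sketch of this pre-processing step is shown below. It computes the cumulative frequency of each group, ranks the groups by that frequency, and gives the most frequent group the shortest prefix from the standard prefix set {0, 10, 110, ...}, while each tail keeps its original length i for group A_i. Reusing the standard prefix set and breaking ties by group index are assumptions, and the function names are hypothetical.

```python
from collections import Counter


def group_index(run_length: int) -> int:
    """Index i of the FDR group A_i that contains this run-length."""
    return (run_length + 2).bit_length() - 1


def standard_prefix(i: int) -> str:
    """Prefix of group A_i in the basic FDR code: (i - 1) 1s followed by a 0."""
    return "1" * (i - 1) + "0"


def reassign_prefixes(runs: list[int]) -> dict[int, str]:
    """Give shorter prefixes to groups with larger cumulative frequency."""
    freq = Counter(group_index(l) for l in runs)  # cumulative frequency per group
    ranked = sorted(freq, key=lambda g: (-freq[g], g))  # ties: lower group first
    return {g: standard_prefix(rank + 1) for rank, g in enumerate(ranked)}


def encoded_size(runs: list[int], prefix_of: dict[int, str]) -> int:
    """Total bits: assigned prefix plus an i-bit tail for each run in group A_i."""
    return sum(len(prefix_of[group_index(l)]) + group_index(l) for l in runs)


runs = [6, 7, 8, 0, 9]  # made-up example in which group A_3 dominates
remapped = reassign_prefixes(runs)
basic = {g: standard_prefix(g) for g in remapped}
print(remapped, encoded_size(runs, basic), encoded_size(runs, remapped))
# {3: '0', 1: '10'} 26 19
```

In this toy input the four runs in A_3 drop from 6 to 4 bits each, at the cost of one extra bit for the single run in A_1.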
We next describe the decompression architecture and the design of the on-chip decoder. The decoder is simple and scalable, and independent of both the core under test and the precomputed test set. Moreover, due to its small size, it does not introduce significant hardware overhead.
The decoder design is similar to the FSM-based decoder of [6, 7]. Issues related to data synchronization are described in [7]. The decoder decompresses the encoded test set T_E and outputs T_D. It can be efficiently implemented by a k-bit counter, a log2 k-bit counter and a finite-state machine (FSM). The block diagram of the decoder is shown in Figure 7. The signal bit_in is the input to the FSM, and an enable (en) signal is used to input encoded data when the decoder is ready. The FSM output counter_in is used to shift the prefix or the tail into the k-bit counter, and the signals shift, dec1 and rs1 are used to shift the data in, to decrement the counter, and to indicate its reset state, respectively. The second counter, of log2 k bits, is used to count the length of the prefix and the tail so as to identify the group. The signals inc and dec2 are used to increment and decrement this counter, respectively, and rs2 indicates that the counter has finished counting. Finally, the signal out is the decoder output and v indicates when the output is valid. The operation of the decoder is as follows:

• The tail part is shifted in until the log2 k-bit counter resets to zero. The dec2 signal then goes high, the counter is decremented, and the signal rs2 indicates when it is in the zero state.
• The FSM outputs 0s corresponding to the tail, followed by a 1 at the end of tail decoding.
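The decoding steps can be mirrored in software to check an encoded stream. The sketch below is a behavioral model only, not the 9-state FSM or its counters: it counts the 1s of the prefix to identify the group A_i, reads the i-bit tail, and emits (2^i - 2 + tail) 0s followed by a 1. The function name is hypothetical, and the prefix assignment of Figure 4 is assumed.

```python
def fdr_decode(encoded: str) -> str:
    """Behavioral model of FDR decoding (basic prefix assignment assumed)."""
    out, pos = [], 0
    while pos < len(encoded):
        i = 1
        while pos < len(encoded) and encoded[pos] == "1":  # each leading 1: next group
            i += 1
            pos += 1
        if pos >= len(encoded):
            raise ValueError("truncated prefix")
        pos += 1  # consume the 0 that ends the prefix
        tail = encoded[pos:pos + i]
        if len(tail) != i:
            raise ValueError("truncated tail")
        pos += i
        run = (2 ** i - 2) + int(tail, 2)  # first run-length of A_i plus the tail offset
        out.append("0" * run + "1")
    return "".join(out)


print(fdr_decode("001000011001"))  # 1001010001
```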
The state diagram for the FSM (not shown here due to lack of space) contains only 9 states. We synthesized the FSM using Synopsys Design Compiler; the synthesized circuit has only 4 flip-flops and 38 gates. Therefore, the additional hardware needed for the decoder is very small; moreover, existing counters on the SOC can be reused for decompression.
The above decoder can be easily modified for decompressing data encoded using cumulative frequencies for the groups. Since the use of the cumulative frequencies affects only the prefix, and not the tail, we only need to add a mapping logic block between the encoded data stream and the decode FSM. Thus the mapping logic feeds the decode FSM and transforms the prefixes in the encoded data to the prefix assignment of Figure 4 .
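The role of the mapping logic can likewise be modelled in software. In the sketch below (hypothetical names), `group_of_prefix` is the inverse of a cumulative-frequency prefix assignment, and the assigned prefixes are assumed to come from the prefix-free set {0, 10, 110, ...}. Each assigned prefix is rewritten as the basic prefix of Figure 4 and the tail is copied unchanged, so the result can be fed to the unmodified decoder.

```python
def map_to_standard(encoded: str, group_of_prefix: dict[str, int]) -> str:
    """Rewrite each remapped prefix as the basic FDR prefix; tails are copied unchanged."""
    out, pos = [], 0
    while pos < len(encoded):
        # assigned prefixes have the form 1...10, so the prefix ends at the next 0
        end = encoded.find("0", pos) + 1
        if end == 0:
            raise ValueError("truncated prefix")
        group = group_of_prefix[encoded[pos:end]]
        tail = encoded[end:end + group]
        if len(tail) != group:
            raise ValueError("truncated tail")
        out.append("1" * (group - 1) + "0" + tail)  # basic prefix of group A_i
        pos = end + group
    return "".join(out)


# Assumed remapping: A_3 gets prefix "0" and A_1 gets "10".
# Runs 6 and 0 then encode as "0"+"000" and "10"+"0".
print(map_to_standard("0000100", {"0": 3, "10": 1}))  # 11000000
```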
In our experiments with ISCAS 89 benchmark circuits, we observed that the run-lengths were never long enough to exceed group A_10. Therefore, in the worst case, the mapping logic is required for only ten prefixes. Moreover, as we show in Section 5, significant compression is almost always achieved without using the mapping logic.
Experimental results
We now present experimental results on test data compression for the large ISCAS benchmark circuits. We considered both full-scan and non-scan circuits for the proposed compression/decompression scheme. For full-scan circuits, patterns were reordered to achieve higher compression, whereas no reordering was done for the non-scan circuits. For all the full-scan circuits, we considered a single scan chain. We use ATPG test cubes generated by the Mintest ATPG program with dynamic compaction for our experiments.
The first set of experimental data that we present is based on the use of difference vector sequences T_diff obtained from partially-specified test sets (test cubes). Table 2 compares FDR coding and Golomb coding of the test cubes with the fully-compacted test sets generated using Mintest. The table lists the percentage compression, the sizes of the precomputed (original) test sets, the sizes of the encoded test sets, and the sizes of the smallest ATPG-compacted test sets. Table 2 shows that FDR codes provide better compression than Golomb codes in all cases. For the benchmark circuit s38417, there is as much as a 10% increase in compression. We also note that in all but one case, the size of the encoded test set T_E is much smaller than the compacted test set obtained using Mintest. The test cubes that we used for s35932 were already highly compacted, hence we did not obtain very high compression for this circuit. Nevertheless, it is significant that, in contrast to FDR codes, no compression was obtained using Golomb codes for this circuit in [6, 7].

Table 3 demonstrates that the use of test cubes T_D (with all the don't-cares mapped to 0) often yields higher compression than the use of T_diff. Moreover, in this case, the decompression architecture for on-chip pattern generation does not require a separate CSR. For circuits with long scan chains, additional CSRs of lengths equal to the scan chain lengths increase the hardware overhead significantly. Therefore, compressing T_D to generate the encoded test set not only yields smaller test sets but also reduces the hardware overhead.

Table 3. Compression obtained using T_D.
Finally, in Table 4, we present experimental results on test data compression for non-scan circuits. We obtained the test sequences for these circuits from HITEC [11]. No reordering of test patterns was done during compression. Not surprisingly, we found that more compression is obtained using the mapping logic. The results also show that very high compression is achieved for non-scan circuits.

Table 4. Compression obtained using T_D and FDR codes for non-scan circuits, with and without mapping logic.

Conclusions

We have presented a new class of compression codes, termed frequency-directed run-length (FDR) codes. Experimental results for the ISCAS 89 benchmark circuits show that FDR codes outperform Golomb codes for test data compression. In addition to difference sequences derived from scan vectors, these codes can also be used directly with precomputed test sets and ordered test sequences for non-scan circuits. We have presented a decompression architecture for FDR codes, as well as an analytical characterization of the amount of compression that can be expected using these codes. Analytical results show that FDR codes are robust, i.e. they are insensitive to variations in the input data stream.
