We present a new decompression architecture for embedded cores in SoCs that improves the download time by avoiding high internal-to-ATE clock ratios and by exploiting hardware parallelism. Bounded Huffman compression facilitates decompression hardware tradeoffs. Our technique is scalable in that the downloadable RAM-based decode table accommodates SoC cores with different characteristics, such as the number of scan chains and the test set data distribution.
Introduction
Data compression techniques are used to alleviate the ATE test data volume problem. The idea is to compress the precomputed test set T_D provided by the vendor into a smaller test set T_E and store it in the ATE memory. An on-chip decoder is then used to decompress T_E and reproduce T_D. Huffman compression using synthesized or ROM-based decoders has been applied to test compression [2, 3]. Many parallel Huffman architectures have been developed [4], mainly for multimedia. The focus of this work is an efficient hardware architecture that implements the Bounded Huffman compression scheme [5] for test decompression. The scheme is used to reduce the chip area overhead of the decompressor, in particular that of the RAM-based (not ROM-based) decode table. Hardware parallelism is exploited in order to reduce the download time, feed multiple scan chains simultaneously and avoid high internal-to-external clock ratios. The decompressor can directly use the ATE clock in a one-to-one ratio, avoiding the complexity of frequency multiplication of the internal clock. We stress that the decompression time is just as important as the compression ratio in testing. Since ATEs test SoCs in real time, compression is only useful if it reduces the download time or effective bandwidth and thereby reduces the SoC testing time.
Decompression Architecture
Our approach applies to embedded cores in an SoC. The major steps of our method are: a) merging ATPG scan patterns into test character bitstreams; b) assigning Don't Care bits in test cubes using an entropy optimization technique favoring Huffman encoding; c) Huffman-based compression; d) downloading compressed patterns through ATPG channels into scan pins of the SoC; e) hardware Huffman decompression of downloaded patterns and interfacing to the scan chains of the core under test.
In order to decompress, the decompressor requires the decode table [1], which contains the mapping of compressed codes to decompressed symbols, followed by the decompressed data. Fig. 1 shows an example of four characters (i.e. D_1, D_2, D_3 and D_4) encoded into four prefix codes (i.e. 0, 10, 110, 111). The decode table contains two fields, P_n and D, which represent the prefix length and the decoded character, respectively. The height of the tree, h, is the maximum prefix code length (here, 3). A code occupies every table location whose address begins with that code; for example, the prefix code '0' occupies four of the 2^3 = 8 table locations. The tradeoff of the decode table is decode time versus table size. The address width of the decode table equals the number of bits in the longest prefix code. This means that the RAM size is proportional to 2^h and may be impractical for some designs. Bounded Huffman compression allows one to optimally length-limit the prefix codes, thereby allowing the decode table to be smaller.
Fig. 2 shows the decompression architecture. The encoded test data from the ATE feeds the S_E register through k input scan pins. The external ATE clock, C_ATE, drives the FSM, T_E and Prefix registers. The amount of shift is determined by the current FSM state and the currently decoded prefix length, P_n. The Prefix register, P, holds the current encoded ATE data. Its width, P_w, must be large enough to contain the maximum prefix length, A_w, and it also acts as a buffer when k is less than A_w.
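The table lookup described for the Fig. 1 example can be modeled in software. The following is a minimal sketch, not the hardware of Fig. 2: the names `CODES`, `build_decode_table` and `decode` are hypothetical, and the serial loop only imitates the "look up, then shift by P_n" behavior one character at a time.

```python
# Software model of a Fig. 1 style decode table (illustrative only).
# All names here are hypothetical and not taken from the paper.

CODES = {"D1": "0", "D2": "10", "D3": "110", "D4": "111"}


def build_decode_table(codes):
    """Return (h, table), where table[address] = (P_n, D).

    h is the longest prefix code length, so the table has 2**h entries.
    A code of length L fills the 2**(h - L) addresses that start with it.
    """
    h = max(len(code) for code in codes.values())
    table = [None] * (1 << h)
    for symbol, code in codes.items():
        pad = h - len(code)            # number of unconstrained low bits
        base = int(code, 2) << pad     # first address starting with the code
        for low in range(1 << pad):
            table[base + low] = (len(code), symbol)
    return h, table


def decode(bits, codes):
    """Decode a string of '0'/'1' characters by repeated table lookup."""
    h, table = build_decode_table(codes)
    out = []
    pos = 0
    while pos < len(bits):
        # Take the next h bits as the address; pad the tail with zeros.
        window = bits[pos:pos + h].ljust(h, "0")
        prefix_len, symbol = table[int(window, 2)]
        if pos + prefix_len > len(bits):
            raise ValueError("bitstream ends in the middle of a code")
        out.append(symbol)
        pos += prefix_len              # shift by the decoded prefix length
    return out


if __name__ == "__main__":
    h, table = build_decode_table(CODES)
    print(h, table)
    print(decode("0101101110", CODES))
```

In this model the code '0' fills addresses 0 through 3, which matches the four table locations mentioned above, and every lookup costs one table access regardless of the code length.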
The shifter concatenates the P and S_E registers and shifts left in order to advance the encoded data stream. The output of the shifter is fed to the decode table address, A, via the mask register. The width of the decode table address, A_w, is selected by the designer to handle the maximum prefix code length desired. The mask register, M, uses a bitwise AND to mask all bits beyond length h. This minimizes the initialization time of the decode table. The decode table RAM outputs the [...]
Table 1 shows the download times for the ITC99 b14 benchmark. Four rows show the download time in milliseconds for f_ATE = 20 MHz as a function of k and n. The download time improves from 51.9 ms to 9.8 ms as k and n are increased. Clearly, the download time in clock cycles is bounded by T_E ≥ t_decode ≥ T_E/k. The T_compression row shows the compression ratio, i.e. (T_D − T_E)/T_D, achieved through the combination of Don't Care minimization and Huffman compression. T_compression is a function of n but not of k. T_E is the total encoded data sent to the decompressor and includes the decode table overhead. The overhead relative to T_E is shown in the next row. Clearly, the overhead is very small relative to the size of the data. The longest encoded character produced by the optimal Huffman compression algorithm is shown in the row labeled A_w. The total size of decode [...]
Table 2 shows the effect of limiting the Huffman code length for the case n = 4 by using a length-limited Huffman algorithm [5]. For the A_w = 5 case, we can reduce the RAM size requirement from 1024 words to 64 words. Thus, it is possible to reduce the RAM requirements or, at the loss of some compression ratio, to match the existing RAM size for all. This ability to limit the Huffman code length is due to the non-uniform distribution of the test set codes themselves.
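The quantities discussed above (the bound T_E ≥ t_decode ≥ T_E/k, the compression ratio (T_D − T_E)/T_D, and the exponential growth of the table address space) can be checked with a few lines of arithmetic. The sketch below is illustrative only: the function names are hypothetical, the input sizes are made-up values rather than figures from Table 1 or Table 2, and the address space is treated as 2^A_w entries, to which the text says the RAM size is proportional.

```python
import math

# Illustrative arithmetic only. Function names are hypothetical and the
# numbers in the demo are made up; they are not taken from Table 1 or 2.


def compression_ratio(t_d, t_e):
    """Compression ratio (T_D - T_E) / T_D, with sizes in bits."""
    return (t_d - t_e) / t_d


def download_cycle_bounds(t_e, k):
    """Return (best, worst) download time in ATE clock cycles.

    Best case: all k scan pins deliver a useful bit every cycle (T_E / k).
    Worst case: one useful bit per cycle (T_E).
    """
    return math.ceil(t_e / k), t_e


def cycles_to_ms(cycles, f_ate_hz):
    """Convert a cycle count to milliseconds at ATE clock frequency f_ATE."""
    return cycles / f_ate_hz * 1e3


def address_space(a_w):
    """Number of addresses reachable with an A_w-bit decode table address."""
    return 1 << a_w


if __name__ == "__main__":
    t_d, t_e, k = 1000, 400, 4          # hypothetical sizes in bits
    best, worst = download_cycle_bounds(t_e, k)
    print("compression ratio:", compression_ratio(t_d, t_e))
    print("cycles between", best, "and", worst)
    print("ms at 20 MHz:", cycles_to_ms(best, 20e6), cycles_to_ms(worst, 20e6))
    # Each address bit removed halves the address space.
    print("shrink factor for 4 fewer bits:",
          address_space(10) // address_space(6))
```

Removing four address bits shrinks the address space by a factor of 16, the same factor as the reduction from 1024 to 64 words quoted above.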
