A Fast Variable-Length Decoder Using Plane Separation

Jae Ho Jeon, Young Seo Park, and Hyun Wook Park

Abstract-This paper presents a fast variable-length decoder that uses a plane separation technique to reduce the processing time of the feedback path in the decoder. The decoder performs two shift processes and a decision process concurrently. The processing time in the feedback path is therefore determined by the longest of the three processes rather than by the sum of their processing times. Simulation results show that the total processing time of the developed decoder is about 30% shorter than that of Sun and Lei's decoder and their modified decoder when all are implemented on a field-programmable logic device.
Index Terms-Huffman coding, parallel processing, plane separation, variable-length coding (VLC), variable-length decoding (VLD).

I. INTRODUCTION
THE variable-length code (VLC), e.g., the Huffman code [1], is a lossless code whose average code length is close to the source entropy. The efficient representation provided by the VLC relaxes transmission bandwidth and storage capacity requirements, especially for high-speed applications such as real-time storage, editing, and broadcasting of high-definition television (HDTV) signals. The VLC has been chosen as a major component of a number of international standards for image and video compression in facsimile, teleconferencing systems, and HDTV.
The VLC coder can be implemented with a table-lookup process [2] and pipelined to meet high-speed requirements. VLC decoding, however, must be performed sequentially because the decoder is difficult to pipeline or parallelize. Because the code length is variable, the decoder cannot start decoding the next code word until the length of the current code word is determined. Therefore, although the VLC is optimized for compression efficiency, its variable length limits the decoding throughput through a recursive, data-dependent procedure [3].
Various decoding methods using parallel or pipelined architectures have been developed to reduce the decoding time. The tree-based searching algorithm of MARVLE [4], [5] decoded the input bitstream serially at one bit per cycle. Therefore, the decoding time depended on the code length, i.e., a long code word required a long decoding time. Sun and Lei [2] developed a bit-parallel decoder that could decode each code word in one clock cycle by matching the current bitstream in parallel with all possible code words in a lookup table (LUT). Sun and Lei [6], [7] improved the bit-parallel decoder by excluding an accumulator from its feedback path. Chang and Messerschmitt [3] analyzed the PLA-based pipelined tree-based architecture, which combined several techniques, such as flexible operation in the decoding process and high-level optimization, based on Sun and Lei's architecture. Lin and Messerschmitt [8] proposed two methods to create concurrency and to improve the decoder throughput: 1) the concurrent finite-state machine (FSM), which extended the tree-based searching algorithm to a multiple-bit FSM, and 2) the bit-positioning method, which divided the coded bitstream into blocks with overlapping windows. The divided bitstreams were decoded concurrently using Sun and Lei's decoder as a basic decoding unit, and the decoded data were merged at the last stage. However, the hardware complexity of this approach was high.
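The difference between bit-serial and bit-parallel decoding can be illustrated with a minimal Python sketch. The four-entry code table and the function names are hypothetical and chosen only for illustration; they are not taken from [2] or [4]. The sketch counts decoding steps, not hardware cycles.

```python
# Hypothetical prefix-free code table, for illustration only.
CODE_TABLE = {"0": "a", "10": "b", "110": "c", "111": "d"}
MAX_LEN = max(len(code) for code in CODE_TABLE)


def decode_bit_serial(bits):
    """Consume one bit per step; a k-bit code word costs k steps."""
    symbols, steps, prefix = [], 0, ""
    for bit in bits:
        prefix += bit
        steps += 1
        if prefix in CODE_TABLE:
            symbols.append(CODE_TABLE[prefix])
            prefix = ""
    return symbols, steps


def decode_bit_parallel(bits):
    """Match a MAX_LEN-bit window against every code word in one step."""
    symbols, steps, pos = [], 0, 0
    while pos < len(bits):
        window = bits[pos:pos + MAX_LEN]
        # A prefix-free code guarantees that at most one code word matches.
        code = next(c for c in CODE_TABLE if window.startswith(c))
        symbols.append(CODE_TABLE[code])
        pos += len(code)  # the next window depends on this code length
        steps += 1
    return symbols, steps
```

For the 10-bit stream `0101101110`, the serial version needs 10 steps and the parallel version needs 5, one per symbol. In both cases the position of the next code word is known only after the current one has been matched, which is the data dependence that limits throughput.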
In this paper, a new fast VLC decoder using plane separation is developed to reduce the processing time in the critical feedback path. Section II briefly describes the bit-parallel decoder by Sun and Lei [2] and their modified decoder [6], [7], and analyzes their critical feedback paths. Section III describes the operation of the developed decoder. In Section IV, the processing time of the developed decoder is compared with those of Sun and Lei's decoder and their modified decoder. Finally, Section V gives conclusions.
II. PARALLEL PLA-BASED VLC DECODER
A. Parallel PLA-Based Architecture (Sun and Lei's Decoder)
The parallel PLA-based architecture developed by Sun and Lei [2] is shown in Fig. 1. It consists of two barrel shifters and four registers. Its operation consists of the following three steps.
Step 1) At the initial step, two 16-bit words from the input buffer are placed in the two input registers, the first word in the first register and the second word in the second register. The accumulated code length is reset to 0.
Step 2) The output of the barrel shifter is obtained through the following operations.
i) The barrel shifter shifts its input data to the left by the accumulated code length.
ii) The most significant 16 bits of the shifted data are output.
Step 3) For the forward-path processing, the PLA in Fig. 1 performs parallel pattern matching of the barrel shifter output with all possible code words in the code-word table, and the accumulator updates the accumulated code length with the matched code length according to (1). At the same time, if the accumulation in (1) produces no carry, "Carry-out" is not issued and the operation returns to Step 2). Otherwise, the "Read" signal in Fig. 1 is activated to update the input registers, such that the content of the second input register is loaded into the first and the next 16-bit word of the input buffer is loaded into the second, and the operation returns to Step 2).

In Fig. 1, the "Ready" signal is active during the decoding process. The processing time $T_{SL}$ of the parallel PLA-based architecture is the sum of the times of the following processes, which are performed sequentially, step by step:

$$T_{SL} = T_{PLA} + T_{ACC} + T_{CO} + T_{R} + T_{BS} \tag{2}$$

where $T_{PLA}$ is the processing time for the PLA to perform pattern matching, $T_{ACC}$ is the processing time of the accumulator according to (1), $T_{CO}$ is the processing time for the carry-out decision in the accumulator, $T_{R}$ is the processing time to update the two input registers with input data, which is needed only when "Carry-out" is issued, and $T_{BS}$ is the processing time of the barrel shifter.
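The three steps above can be sketched behaviorally in Python. This is only an illustration under stated assumptions: the code table is hypothetical, the PLA is replaced by a simple prefix search, and the names `match`, `decode`, `r0`, `r1`, and `acc` are introduced here and do not come from [2]. The sketch models the data flow, not the timing.

```python
WORD = 16
CODE_TABLE = {"0": "a", "10": "b", "110": "c", "111": "d"}  # hypothetical


def match(window):
    """PLA stand-in: return (symbol, code length) for a 16-bit window."""
    bits = format(window, "016b")
    for code, symbol in CODE_TABLE.items():
        if bits.startswith(code):
            return symbol, len(code)
    raise ValueError("no code word matches")


def decode(words, count):
    """Decode `count` symbols from a list of 16-bit input words."""
    stream = iter(words)
    r0, r1 = next(stream, 0), next(stream, 0)  # Step 1: fill both registers
    acc = 0                                    # accumulated code length
    symbols, reads = [], 0
    for _ in range(count):
        # Step 2: the barrel shifter outputs 16 bits starting at offset acc.
        window = (((r0 << WORD) | r1) >> (WORD - acc)) & 0xFFFF
        # Step 3: match, then accumulate the matched code length.
        symbol, length = match(window)
        symbols.append(symbol)
        acc += length
        if acc >= WORD:                        # carry-out: refresh registers
            acc -= WORD
            r0, r1 = r1, next(stream, 0)
            reads += 1
    return symbols, reads
```

Each iteration runs matching, accumulation, the carry-out test, the optional register update, and the shift one after another, which mirrors the sequential sum of processing times described above.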
B. High-Speed Parallel VLC Decoder (Sun and Lei's Modified Decoder)
To improve the speed of the decoder, the accumulator is excluded from the critical path of the parallel PLA-based architecture at the expense of complicated interface circuitry [6], [7]. Because the carry-out decision process can be overlapped with the other processes in the modified parallel PLA-based architecture, the processing time of Sun and Lei's modified decoder is shorter than that of the parallel PLA-based decoder [2], as shown in (3), which gives separate expressions for the cases without and with carry-out.
III. VLD USING PLANE SEPARATION
In the proposed architecture, all the processes in the feedback path are performed in parallel; only the forward-path process, i.e., the matching process in the PLA, is excluded. The block diagram of the developed VLC decoder is shown in Fig. 2. It consists of two separate planes, an input plane and an OR plane. Each plane consists of a barrel shifter, a 32-bit 2:1 multiplexer, and a 32-bit output latch. The developed architecture uses exactly the same matching method as the PLA-based decoders [2], [6], [7]. The output data from the OR plane are matched with all possible code words in the code-word table. A matched symbol and the corresponding latched code length $L(n)$ are obtained from the matching process, where $n$ is the sequence number of decoded symbols.

After the matching process, the input plane rotates its data and the OR plane shifts its data, both to the left by $L(n)$. Bits shifted out of the left side of the OR plane are lost, while those of the input plane are reattached at the least significant bits of the input plane. At the same time, the bit length $B(n)$ of the data remaining in the OR plane is calculated as follows:

$$B(n+1) = \begin{cases} B(n) - L(n), & \text{if } B(n) - L(n) \ge M \\ B(n) - L(n) + 16, & \text{otherwise} \end{cases} \tag{4}$$

where $M$ is the maximum code length, which is 16 in our implementation. If the remaining bit length is smaller than the code-word length $M$ required for the next matching, the next matching can be performed only after the OR plane is updated by loading the next input bitstream. This update is simply a bitwise OR of the contents of the OR plane and the input plane, as shown in Fig. 2. If the remaining bit length is larger than or equal to $M$, the next matching process can be repeated without the OR operation.

Fig. 3 shows an example operation of the developed decoder with a simplified code table. At $n = 0$, the OR plane is loaded with two 16-bit words, i.e., $B(0) = 32$, and the next 16-bit word is stored into the upper half of the input plane.
The matching process can be repeated over the first several symbols without additional input to the OR plane. During this time, the input plane is rotated to the left by a total of 18 bits. The remaining bit length then becomes smaller than 16, and the OR plane is reloaded with the result of the OR operation between the OR plane and the input plane. The input plane is updated concurrently: with $B(n)$ denoting the remaining bit length, the most significant $(32 - B(n))$ bits of the next 16-bit input word are placed at the least significant $(32 - B(n))$ bits of the input plane, and the remaining least significant $(B(n) - 16)$ bits of the 16-bit input word are placed at the most significant $(B(n) - 16)$ bits of the input plane. In the example of Fig. 3, $B(6)$ is 27. Therefore, the most significant five bits of the 16-bit input word are placed in the least significant five bits of the input plane, and the other 11 bits of the 16-bit input word are placed in the most significant 11 bits of the input plane. Fig. 3 shows the data propagation from the input plane to the OR plane and from the input buffer to the input plane. The input alignment procedure requires additional operations. However, these operations do not affect the processing time in the critical path because they are performed concurrently with the matching process and their processing time must be shorter than that of the matching process.

(Figure note) *MSB: most significant bit of the 5-bit output of the subtracter.
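The operation described in this section can be sketched behaviorally in Python. This is a simplified illustration, not the circuit of Fig. 2. It assumes that the input plane holds only the pending 16-bit word with all other bits zero, that a reload adds 16 bits to the remaining bit length, and that the input plane is refilled in the same step as the reload. The code table and all names (`rotl32`, `match`, `decode`, `split_input_word`) are hypothetical.

```python
MASK32 = 0xFFFFFFFF
CODE_TABLE = {"0": "a", "10": "b", "110": "c", "111": "d"}  # hypothetical


def rotl32(value, amount):
    """Rotate a 32-bit value left by `amount` bits."""
    amount %= 32
    return ((value << amount) | (value >> (32 - amount))) & MASK32


def match(window):
    """PLA stand-in: return (symbol, code length) for a 16-bit window."""
    bits = format(window, "016b")
    for code, symbol in CODE_TABLE.items():
        if bits.startswith(code):
            return symbol, len(code)
    raise ValueError("no code word matches")


def split_input_word(remaining):
    """Bits of a 16-bit word placed at the LSB and MSB ends of the input plane."""
    return 32 - remaining, remaining - 16


def decode(words, count):
    """Decode `count` symbols from a list of 16-bit words (at least two)."""
    stream = iter(words[2:])
    or_plane = (words[0] << 16) | words[1]
    input_plane = next(stream, 0) << 16   # pending word in the upper half
    remaining = 32                        # valid bits in the OR plane
    symbols = []
    for _ in range(count):
        symbol, length = match(or_plane >> 16)
        symbols.append(symbol)
        # The two shifts and the decision do not depend on one another,
        # so hardware can evaluate them concurrently.
        or_plane = (or_plane << length) & MASK32
        input_plane = rotl32(input_plane, length)
        remaining -= length
        if remaining < 16:                # reload by bitwise OR
            or_plane |= input_plane
            remaining += 16
            # Align the next word directly behind the valid bits.
            input_plane = rotl32(next(stream, 0) << 16, 32 - remaining)
    return symbols
```

With a remaining bit length of 27, `split_input_word(27)` gives `(5, 11)`, which matches the five-bit and 11-bit placement in the example of Fig. 3. Because the input plane rotates by the same amount as the OR plane shifts, the pending word always sits directly behind the valid bits, which is why a bitwise OR is sufficient for the reload in this sketch.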
As shown in the example, the developed decoder concurrently performs two shift processes, one on the OR plane and one on the input plane, by using the latched code length of the current code word rather than the accumulated code length. In addition, the decision process, which calculates the remaining bit length and decides whether a new bitstream must be read, can be performed in parallel with the two shift processes. The total processing time of the developed decoder can be determined as follows:
$$T_{PS} = \begin{cases} T_{PLA} + \max(T_{SI}, T_{SO}, T_{D}), & \text{if carry-out} = 0 \\ T_{PLA} + \max(T_{SI}, T_{SO}, T_{D}) + T_{OR}, & \text{if carry-out} = 1 \end{cases} \tag{5}$$

where $T_{PS}$ is the total processing time of the developed decoder, $T_{PLA}$ is the processing time of the matching process in the PLA, $T_{SI}$ and $T_{SO}$ are the processing times for the shift processes on the input plane and the OR plane, respectively, $T_{D}$ is the processing time for the decision process, and $T_{OR}$ is the processing time for the OR operation in the OR plane. The total processing time in the feedback path of the developed decoder is given by the longest of $T_{SI}$, $T_{SO}$, and $T_{D}$, not by their sum. The processing times of Sun and Lei's decoder [2], their modified decoder [6], [7], and the developed decoder are shown graphically in Fig. 4. We assume that the time for latching data into a register is much smaller than the processing time of the barrel shifter. The processing time required for each shift process on each plane is then assumed to be almost the same as that of the barrel shift in (2). Therefore, the speed-up ratios $R_{1}$ and $R_{2}$ of the total processing time of the developed decoder to those of Sun and Lei's decoder and their modified decoder can be described, respectively, as follows:

$$R_{1} = \frac{T_{PS}}{T_{SL}} \tag{6}$$

$$R_{2} = \frac{T_{PS}}{T_{SLM}} \tag{7}$$

where $T_{SL}$ and $T_{SLM}$ are the total processing times of Sun and Lei's decoder in (2) and their modified decoder in (3), and each ratio is evaluated separately for carry-out = 0 and carry-out = 1.

Fig. 4. Processing times of Sun and Lei's decoder [2], their modified decoder [6], [7], and our developed decoder.

IV. SIMULATION RESULTS
The developed decoder and Sun and Lei's decoders [2], [6], [7] were implemented on an Altera FLEX 8000 field-programmable logic device (FPLD) operating at 100 MHz. The processing time of each decoder was estimated from several MPEG-2 compressed video sequences (Mobile and Calendar, Football, and Cheer Leaders). Each video sequence has 150 frames with an image size of 720 × 480. The video sequences were compressed with intra-frame coding of the MPEG-2 standard. The processing times of the operations in (2), (3), and (5) were obtained from the FPLD simulations, and the numbers of operations with and without carry-out [(2), (3), and (5)] required to decode all the frames of the MPEG-2 compressed video sequences were obtained from the software decoding simulation. The ratios of the total processing time of the developed decoder to those of Sun and Lei's decoder [2] and their modified decoder [6], [7] are shown in Figs. 5 and 6, respectively, for the quantization parameter fixed at a value of 1. Fig. 7 shows how the ratio of the processing time depends on the quantization parameter, where the processing time is the average over the 150 frames of the "Mobile and Calendar" sequence.
Because of the speed limitation of the FPLD, the average throughput of the developed decoder is about 15 million samples/s. The developed decoder requires about twice as many logic cells as Sun and Lei's decoder [2].
In a VLSI implementation, the barrel shifter is extremely fast and can shift by more than one bit in a clock cycle. The shift process can be performed in a unit machine cycle by the high-speed variable-length rotation shifter [8], which was designed using a crossbar switch. The processing time required for each shift process on each plane is almost the same as that of the barrel shift in (2). In addition, the decision process in the critical path is implemented with a simple adder, so that the processing time for the decision process can be less than the shift time. Therefore, (6) and (7) can be simplified to (8) and (9), respectively. Each process in (8) and (9) can be implemented within a unit machine cycle in the VLSI implementation. Therefore, the developed decoder is almost two times as fast as Sun and Lei's decoder and one and a half times as fast as Sun and Lei's modified decoder when no additional input is required for the next matching process, i.e., carry-out = 0. Also, the developed decoder requires only three-fifths of the total processing time of Sun and Lei's decoder and three-fourths of that of Sun and Lei's modified decoder when carry-out = 1. To further improve the decoding speed, the developed architecture can be used as a basic decoding unit for more sophisticated parallelism, e.g., the concurrent methods used in [3], [8].
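The ratios quoted above can be checked with a small Python sketch. The unit-cycle counts in `CYCLES` are assumptions: the developed decoder is taken as one matching cycle plus one parallel feedback cycle, with one extra cycle for the OR operation on carry-out, and the counts for the two Sun and Lei decoders are chosen only so that they reproduce the ratios stated in the text. The function names and the carry-out fraction are hypothetical.

```python
from fractions import Fraction

# Unit-cycle process counts per decoded symbol, keyed by the carry-out value.
# These counts are assumptions chosen to agree with the ratios in the text.
CYCLES = {
    "sun_lei": {0: 4, 1: 5},
    "modified": {0: 3, 1: 4},
    "proposed": {0: 2, 1: 3},
}


def ratio(baseline, carry_out):
    """Processing time of the developed decoder relative to a baseline."""
    return Fraction(CYCLES["proposed"][carry_out], CYCLES[baseline][carry_out])


def average_ratio(baseline, carry_fraction):
    """Ratio of total times when a given fraction of symbols has carry-out."""
    p = Fraction(carry_fraction)

    def total(name):
        return (1 - p) * CYCLES[name][0] + p * CYCLES[name][1]

    return total("proposed") / total(baseline)
```

Under these assumptions, `ratio("sun_lei", 0)` is 1/2 and `ratio("modified", 0)` is 2/3, while `ratio("sun_lei", 1)` is 3/5 and `ratio("modified", 1)` is 3/4. The `average_ratio` helper shows how the overall ratio would depend on how often a carry-out occurs in a given bitstream.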
V. CONCLUSION
A new fast VLC decoder was developed in which the processes in the feedback path are performed in parallel. The architecture of the developed decoder is based on plane separation, so that the decoder performs two shift processes and the decision process concurrently. The processing times of Sun and Lei's decoder, their modified decoder, and the developed decoder were analyzed by describing each functional entity. The developed decoder reduces the required total processing time by about 30% relative to Sun and Lei's decoder and their modified decoder for sample images in MPEG-2 video sequences when all decoders are implemented in an FPLD.
