I. Introduction
The latest international video coding standard, H.264, was developed by the Video Coding Experts Group of ITU-T and the Moving Picture Experts Group of ISO/IEC. It uses context-adaptive variable length coding (CAVLC) as an entropy coding method to encode the quantized transform coefficients [1], [2]. The CAVLC decoder decodes the syntax elements sequentially in five steps from the compressed bitstream and parameters, and then forms the residual block from the decoded syntax elements. The syntax elements of the five steps are "total_coeff" for the number of non-zero coefficients, "T1" for each trailing one, "level" for the value of each remaining non-zero coefficient, "total_zero" for the total number of zeros before the last non-zero coefficient, and "run_before" for each run of zeros. Because each step decodes variable-length codewords, the next step cannot start until the current step has finished. CAVLC therefore spends time between decoding steps computing the bitstream for the next step (the valid bits). This increases the processing time and makes a real-time CAVLC decoder difficult to implement. In addition, the CAVLC decoder accounts for one of the largest shares of the computational complexity of the decoding process [3]. Therefore, a high-speed CAVLC decoder needs to be developed.
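As an illustration of how the residual block follows from the five syntax elements, the sketch below rebuilds a zigzag-ordered coefficient list from already-decoded values. It is a minimal behavioral sketch, not the hardware described in this paper: the function name `rebuild_coefficients`, its parameters, and the example values are hypothetical, and the level values are assumed to be already decoded to signed integers. total_coeff is implied by the number of trailing ones plus the number of levels.

```python
def rebuild_coefficients(t1_signs, levels, total_zeros, run_before, size=16):
    """Rebuild a zigzag-ordered coefficient list from decoded CAVLC syntax elements.

    t1_signs and levels are given in decoding order (highest frequency first).
    run_before holds only the runs that are actually present in the bitstream.
    """
    values = list(t1_signs) + list(levels)   # trailing ones first, then levels
    total_coeff = len(values)
    coeffs = [0] * size
    pos = total_coeff + total_zeros - 1      # index of the last non-zero coefficient
    zeros_left = total_zeros
    runs = iter(run_before)
    for k, value in enumerate(values):
        coeffs[pos] = value
        # A run is read only while zeros remain and this is not the final
        # coefficient; the final coefficient takes the remaining zeros implicitly.
        if k < total_coeff - 1 and zeros_left > 0:
            run = next(runs)
        else:
            run = 0
        zeros_left -= run
        pos -= run + 1
    return coeffs


# Illustrative values only (not taken from the paper's test patterns).
example = rebuild_coefficients([1, -1, -1], [1, 3], 3, [1, 0, 0, 1])
print(example[:8])  # [0, 3, 0, 1, -1, -1, 0, 1]
```

Here total_coeff is 5 and T1 is 3, so two level values remain to be decoded after the trailing ones.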
To implement a high-speed CAVLC decoder, several hardware architectures have been proposed [4]-[6]. However, because the decoding blocks of the individual steps operate sequentially, these architectures incur latency for calculating the valid bits between decoding blocks, which makes a high-speed CAVLC decoder difficult to implement.
To solve this problem, this paper proposes two techniques: a new lookup table that skips the level decoding step when total_coeff is equal to the number of trailing ones (T1), and a barrel shifting architecture that calculates the valid bits efficiently. Applying the proposed techniques significantly reduces the total processing time.
II. Proposed CAVLC Decoding Architecture
[...] two signals: Sign_T1 to decode trailing ones and Skip_mode to skip either all of the other decoding blocks or only the level decoding block. Table 1 shows an example of the proposed lookup tables for the Total_coeff and T1 blocks for the case of 0≤nC<2, where nC is calculated from the total_coeff values of the previously decoded neighboring blocks. There are three cases depending on the total_coeff and T1 values. If total_coeff is zero, there is no Sign_T1, and Skip_mode is set to 1 to skip the other decoding blocks. If total_coeff is equal to T1, Sign_T1 is decoded depending on the bitstream and T1, and Skip_mode is set to 1 to skip the level block. If total_coeff is larger than T1, Sign_T1 is decoded depending on the bitstream and T1, and Skip_mode is set to 0 because no blocks are skipped. After total_coeff is decoded, Sign_T1 (x') is decoded to ±1 depending on the bit 'x'.
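The three cases above can be sketched in software as follows. This is a minimal sketch of the decision logic only, not the lookup table of Table 1: the function name `lookup_signals` and its argument layout are hypothetical, and the mapping of sign bit 0 to +1 and 1 to -1 follows the H.264 trailing-ones sign convention rather than anything stated in this section.

```python
def lookup_signals(total_coeff, t1, sign_bits):
    """Return (Sign_T1 values, Skip_mode) for one block.

    sign_bits holds the bits that follow the total_coeff/T1 codeword,
    one per trailing one (assumed layout for this sketch).
    """
    if total_coeff == 0:
        # Case 1: empty block, no Sign_T1, skip the other decoding blocks.
        return [], 1
    # Assumed H.264 convention: sign bit 0 -> +1, sign bit 1 -> -1.
    sign_t1 = [1 if bit == 0 else -1 for bit in sign_bits[:t1]]
    if total_coeff == t1:
        # Case 2: every non-zero coefficient is a trailing one, skip the level block.
        return sign_t1, 1
    # Case 3: levels remain to be decoded, nothing is skipped.
    return sign_t1, 0


print(lookup_signals(0, 0, []))          # ([], 1)
print(lookup_signals(2, 2, [0, 1]))      # ([1, -1], 1)
print(lookup_signals(5, 3, [0, 1, 1]))   # ([1, -1, -1], 0)
```

Note that Skip_mode is 1 in both skipping cases, so a consumer of these signals would also need total_coeff to tell them apart.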
Proposed Barrel Shifting
In this section, we explain the proposed architecture and the barrel shifting process for calculating valid bits. Figure 1 shows the proposed architecture, in which a Bitstream_shifter and a codeword length multiplexer (CL_MUX) are added to the previous architecture [5]. The Bitstream_shifter transfers the bitstream to each block, and the CL_MUX selects the codeword length by which the Bitstream_shifter is shifted. The added Bitstream_shifter and CL_MUX obtain the valid bits quickly by shifting the Bitstream_shifter directly, instead of calculating them with the accumulator, controller, and barrel shifter.
Here we describe the decoding flow for CAVLC. First, each block performs a decoding process using the input bitstream and then outputs the syntax elements and the codeword length. After the decoding process is finished, the syntax elements are used to form 4×4 or 2×2 residual blocks, and the codeword lengths are used to calculate the valid bits. The process of calculating valid bits is shown in Fig. 2. After the decoding of each step is finished, if the accumulated codeword length is less than 64 bits, the left path is followed. The CL_MUX selects the codeword length of the current decoding block, and the Bitstream_shifter is shifted by the selected value. The Bitstream_shifter is thus aligned and supplies the input bitstream of the next decoding block. However, if the accumulated codeword length exceeds 64 bits, the right path, which is similar to the previous process, is used. The accumulator is updated using the previous codeword length, the current codeword length, and the length of the Bitstream_shifter, which is 64. The barrel shifter is aligned with the new bitstream in the input buffer (R1, R2) and is then shifted by the value in the accumulator. The aligned bitstream is transferred to the next decoding block through the Bitstream_shifter. This process of obtaining valid bits is repeated until the decoding process is finished.
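The two paths can be illustrated with a small behavioral model. This is only a sketch under stated assumptions, not the authors' RTL and not cycle-accurate: the class name `ValidBitTracker`, the bit-string representation, and the `valid_len` bookkeeping are hypothetical, the accumulator update on the right path is assumed to be the previous value plus the current codeword length minus 64, and the right path is taken whenever the accumulated length reaches 64. The model tracks alignment only; it does not check whether enough valid bits remain in the shifter for the next codeword.

```python
SHIFTER_BITS = 64  # length of Bitstream_shifter given in the text


class ValidBitTracker:
    """Behavioral model of the left/right paths for obtaining valid bits."""

    def __init__(self, bits):
        # Zero padding so that two 64-bit words (R1, R2) can always be read.
        self.bits = bits + "0" * (3 * SHIFTER_BITS)
        self.word = 0                      # index of the 64-bit word held in R1
        self.acc = 0                       # accumulated length relative to R1
        self.shifter = self.bits[:SHIFTER_BITS]
        self.valid_len = SHIFTER_BITS      # valid bits left in the shifter
        self.left_count = 0
        self.right_count = 0

    def position(self):
        """Absolute number of bits consumed so far."""
        return self.word * SHIFTER_BITS + self.acc

    def consume(self, codeword_len):
        """Advance by one codeword (at most 64 bits) and return the shifter."""
        if self.acc + codeword_len < SHIFTER_BITS:
            # Left path: shift Bitstream_shifter by the selected codeword length.
            self.shifter = self.shifter[codeword_len:] + "0" * codeword_len
            self.acc += codeword_len
            self.valid_len -= codeword_len
            self.left_count += 1
        else:
            # Right path: update the accumulator, load new words into R1/R2,
            # and barrel-shift the pair by the accumulator value.
            self.acc = self.acc + codeword_len - SHIFTER_BITS
            self.word += 1
            start = self.word * SHIFTER_BITS
            r1_r2 = self.bits[start:start + 2 * SHIFTER_BITS]
            self.shifter = r1_r2[self.acc:self.acc + SHIFTER_BITS]
            self.valid_len = SHIFTER_BITS
            self.right_count += 1
        return self.shifter


# Illustrative bit pattern and codeword lengths (not from the paper).
stream = "".join("1" if (i * 7) % 5 < 2 else "0" for i in range(300))
tracker = ValidBitTracker(stream)
for length in [5, 12, 28, 1, 20, 9, 16, 3, 30]:
    tracker.consume(length)
print(tracker.left_count, tracker.right_count)  # 8 1
```

In this run only one of nine codewords triggers the slower right path, which is the behavior the 64-bit shifter is meant to encourage.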
Because the right path takes more steps to calculate the valid bits than the left path, the left path should be taken as often as possible to reduce the computation time. Since the average codeword length per 4×4 block is less than 64 bits, a 64-bit Bitstream_shifter is used instead of a 32-bit one. Figure 3 shows the codeword length 
III. Performance Analysis Result
In this section, we compare the performance of the proposed architecture with that of the previous architectures [4], [5]. The proposed CAVLC decoder was designed in Verilog HDL and tested on a Xilinx Virtex-4 LX160 FPGA. The test pattern was generated by JM9.6 using QCIF-size images. The design was successfully verified at 10 MHz, and the average number of cycles per macroblock was 126. Therefore, the proposed CAVLC decoder can decode QCIF-size images at a clock frequency below 1 MHz. Table 2 compares the performance of the proposed decoder with that of the previous architectures. The proposed CAVLC decoder can decode 1920×1088 video at 30 fps with a 30.8 MHz clock, which reduces the processing time by 59%, 33%, and 8% compared with [4], [5], and [6], respectively. The proposed architecture was synthesized with a 0.18 µm CMOS standard cell library. The hardware complexity comparisons are shown in the last column of Table 2. Because the proposed CAVLC decoder lowers the operating clock frequency without increasing hardware complexity, it can save power.
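The clock figures quoted above follow from the average of 126 cycles per macroblock. The sketch below reproduces the arithmetic; the function name `required_clock_mhz` is hypothetical, 16×16 macroblocks are assumed, and the QCIF case assumes a 176×144 frame at 30 fps, a frame rate this section does not state.

```python
def required_clock_mhz(width, height, fps, cycles_per_mb):
    """Clock frequency (MHz) needed to decode every macroblock in real time."""
    mbs_per_frame = (width // 16) * (height // 16)  # 16x16 macroblocks
    return mbs_per_frame * fps * cycles_per_mb / 1e6


# 1920x1088 at 30 fps: 8160 macroblocks per frame
print(round(required_clock_mhz(1920, 1088, 30, 126), 1))  # 30.8

# QCIF (176x144) at an assumed 30 fps: 99 macroblocks per frame
print(round(required_clock_mhz(176, 144, 30, 126), 2))    # 0.37
```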
IV. Conclusion
In this paper, a new hardware architecture for a high-speed H.264 CAVLC decoder is proposed. Previous architectures require several cycles to calculate the valid bits between blocks, so the processing time of CAVLC grows with the number of iterations needed to obtain the valid bits. To achieve a high-speed CAVLC decoder, we combined the Total_coeff and T1 blocks into one step and reduced the time to obtain the valid bits by shifting the Bitstream_shifter directly, without calculation in the controller. With these two methods, the processing time required to decode a macroblock was reduced by 8% to 59% compared with previous architectures. The implementation results show that the proposed architecture can decode 1920×1088 video at 30 fps with 13.2k logic gates at a 30.8 MHz clock.
