A high throughput parallel decoding method is developed for context-based adaptive variable length codes. In this paper, several new design ideas are devised and implemented for scalable parallel processing, a reduction in area, and a reduction in power requirements. First, simplified logical operations instead of memory lookups are used for parallel processing. Second, the codes are grouped based on their lengths for efficient logical operation. Third, up to M bits of the input stream can be analyzed simultaneously. For comparison, we designed a logical-operation-based parallel decoder for M=8 and a conventional parallel decoder. High-speed parallel decoding becomes possible with our method. In addition, for similar decoding rates (1.57 codes/cycle for M=8), our new approach uses 46% less chip area than the conventional method.
I. Introduction
The increasing use of multimedia services and mobile terminals has resulted in requirements for higher coding efficiency. To meet this need, the video coding standard H.264/AVC [1] was established by the ITU-T/ISO/IEC Joint Video Team. One of the techniques to increase efficiency is variable length coding (VLC). VLC achieves code compression by assigning short codewords to input symbols of high probability and longer codewords to those of low probability. H.264/MPEG-4 AVC includes two advanced entropy coding techniques, context-based adaptive variable length coding (CAVLC) and context-based adaptive binary arithmetic coding (CABAC). CABAC is more complicated in computation but shows a higher compression rate than CAVLC. The H.264/AVC baseline profile only includes CAVLC [1]. When entropy_coding_mode is set to 0, residual block data is coded using CAVLC, and other variable-length coded units are coded using Exp-Golomb codes. Exp-Golomb codes are variable length codes with a regular construction. CAVLC residual coding consists of 5 syntax types: Coeff_token, Trailing_ones, Level, Total_zeros, and Run_before. Conventionally, CAVLC decoders use lookup tables that frequently require multiple memory accesses until the desired codeword is found. This heavy memory access results in high power consumption and operational delay in multimedia terminals, such as digital multimedia broadcasting (DMB) players, portable media players (PMPs), personal digital assistants (PDAs), and mobile phones with video capabilities [2], [3].
In [3] , techniques were reported to decode Run_before without lookup tables based on the observation that some codewords have systematic patterns. For Coeff_token, the codewords with high frequency are directly decoded by integer arithmetic operations. In [2] , memory accesses were further reduced by using arithmetic operations. However, this method still depends on a large number of table lookups.
In this paper, a logical-operation-based decoding scheme is described and used for all syntax types. Logical operation is simpler to perform than arithmetic operation and therefore consumes less power. The rest of this paper is organized as follows. In section II, typical designs using memory lookups, arithmetic operations, and logical operations are comparatively explained. In section III, parallel CAVLC decoding is described. In section IV, experimental results are shown. Finally, conclusions are summarized in section V.
II. Decoder Design Methods
In this section, two typical design methods and our method are described: memory lookups [4] , arithmetic decoding [3] , and logical operation techniques. Logical-operation-based decoding is the new method we have developed.
The example shown in Fig. 1 is used to explain the features of the three methods. In the figure, 4×4 DCT coefficients and the codes of Coeff_token, Trailing_ones, Level, Total_zeros, and Run_before are shown [5]. Zigzag scanning produces the sequence 0, -3, 0, 1, -1, -1, 0, 1. There are 5 nonzero coefficients, 3 Trailing_ones, and 3 Total_zeros. Run_before is repeated until Zero_left becomes 0. The transmitted bitstream is shown at the bottom of Fig. 1.
Memory Lookup
In memory lookup methods, a codeword is used as an address for the lookup. For the example shown in Fig. 1, Run_before values are found by referring to lookup tables as shown in Fig. 2. However, this method is difficult to parallelize.
Arithmetic Decoding
CAVLC decoding using arithmetic operations was developed to reduce memory accesses and thus reduce power consumption. The codes are grouped using the correlation of VLC codes, and arithmetic operations can replace the table lookup process [2] . This is an improved software decoding method. As an example, the Run_before codes in Fig. 1 are decoded by the following arithmetic operations. For simplicity, grouping and state checking are not shown.
The weakness of this arithmetic method is that arithmetic operations are only able to replace memory lookups, which makes parallelization difficult.
Logical-Operation-Based Decoding
For efficient hardware implementation, we use logical operations instead of memory lookup tables or arithmetic operations. As an example, the Run_before codes in Fig. 1 are decoded by using the simple logical operations shown in Fig. 3 . The logical operations are simple (fast) and thus consume less power.
Multiple-Symbol Parallel Variable Length Decoding Technique
Due to the sequential characteristics of variable length coding, parallelization is difficult. In [6], a multiple-symbol parallel variable length decoding method was reported for MPEG-2. All possible codewords starting from every bit were searched. The first matched code was selected, and then the codeword starting from the next bit was selected. This selection step can be repeated to decode the N-bit input. In other words, all possible codewords are found in parallel, and the correct codewords are selected. We implemented this method for H.264 for comparisons of results.
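The two stages of the multiple-symbol method can be illustrated with a minimal Python sketch. It is not the implementation of [6]: the prefix code `TOY_CODE`, the function names, and the window size `n` are hypothetical and chosen only to show detection at every bit offset followed by chained selection.

```python
# Hypothetical toy prefix code, for illustration only
# (not an H.264 or MPEG-2 table).
TOY_CODE = {"1": "a", "01": "b", "001": "c", "000": "d"}


def detect_at(window, start):
    """Codeword detector: return (symbol, length) for the codeword that
    starts at bit offset `start`, or None if no complete codeword fits."""
    for length in range(1, len(window) - start + 1):
        symbol = TOY_CODE.get(window[start:start + length])
        if symbol is not None:
            return symbol, length
    return None


def multi_symbol_decode(bits, n=8):
    """Decode as many codewords as fit in the first n bits of `bits`."""
    window = bits[:n]
    # Stage 1: one detector per bit offset. In hardware these run in parallel;
    # here they are simply evaluated one after another.
    candidates = [detect_at(window, i) for i in range(len(window))]
    # Stage 2: select the codeword at offset 0, then the one that starts
    # right after it, and so on.
    symbols, pos = [], 0
    while pos < len(window) and candidates[pos] is not None:
        symbol, length = candidates[pos]
        symbols.append(symbol)
        pos += length
    return symbols, pos


print(multi_symbol_decode("10100100"))
```

Most of the detector outputs are discarded in the second stage, which is one way to see why the number of codeword detectors affects the area of this kind of design.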
III. Parallel Decoder Design
We now explain our new parallel decoding scheme based on logical operations. CAVLC decoding is sequentially performed in 5 steps in order to decode 5 syntax elements. Coeff_token and Total_zeros appear only once per macroblock, while Trailing_ones can appear up to three times. However, the code length is not more than 3 bits, so parallel processing is not necessary. Several iterations can be repeated in Level and Run_before. Therefore, parallel processing is performed only for Level and Run_before. In parallel processing, at least one codeword is decoded every cycle, and all codewords whose sum of lengths is less than or equal to M are decoded in one cycle. When M is large, many codewords can be decoded in a cycle at the cost of a larger area. We designed the parallel CAVLC decoder with M=8.
Let $C_1, C_2, C_3, \ldots, C_i, \ldots$ be the input bit stream and $l_i$ be the code length of $C_i$.

Case 1. $l_1 \ge M$. Only $C_1$ is decoded by a normal decoder.

Case 2. $l_1 < M$. We find the maximum integer $k$ such that

$$\sum_{i=1}^{k} l_i \le M.$$

Then $k$ codewords are decoded in a cycle.

When M = 8, since [...]
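The rule for how many codewords are decoded per cycle can be sketched in a few lines of Python. The function name `codewords_in_cycle` and the list-of-lengths input are assumptions made for illustration; the decoder itself derives these lengths from the bit stream.

```python
def codewords_in_cycle(lengths, M=8):
    """Return k, the number of codewords decoded in one cycle.

    `lengths` holds the code lengths l_1, l_2, ... of the upcoming codewords.
    """
    if lengths[0] >= M:
        # Case 1: the first codeword alone fills or exceeds the window.
        return 1
    # Case 2: take the largest k with l_1 + ... + l_k <= M.
    total, k = 0, 0
    for length in lengths:
        if total + length > M:
            break
        total += length
        k += 1
    return k


print(codewords_in_cycle([3, 2, 2, 4]))  # lengths of '100', '10', '00', then a 4-bit code
```

With M = 8, the lengths 3, 2, 2 sum to 7, so three codewords are taken and the following 4-bit codeword waits for the next cycle.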
Proposed Parallel Decoding of CAVLC
CAVLC decoding is processed by using the five steps of Coeff_token, Trailing_ones, Level, Total_zeros, and Run_before. A step can be started when the previous step is completed. This sequential nature makes parallel processing among the steps difficult. The steps Coeff_token and Total_zeros occur only once during the decoding of each macroblock, while the remaining steps (Trailing_ones, Level, and Run_before) can occur several times.
The decoding of the data stream within a step is also not straightforward because decoding of a given code is dependent on the decoding of the preceding codes. However, our proposed parallel algorithm can decode in parallel by considering the dependencies. All the possible combinations of codes are considered for correct parallel decoding.
In our decoder, the five steps and tables are identified by the State_select variable as shown in Table 1 . When the most significant bit of State_select is 1, parallel decoding can then be performed. Otherwise, sequential decoding is performed. Now, we explain the proposed parallel decoding method in detail. Let us consider when M = 8. In this case, the codewords within the 8 bits are decoded.
There are 256 kinds of data streams when M = 8, from 00000000 to 11111111. Because a significant portion of these are not used as codewords, the eight-bit input data stream has many "don't care" cases, which helps logic optimization.
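The role of the most significant bit of State_select can be sketched as follows. The dictionary lists only the codes that appear in Table 1 as given in the text; the names `STATE_SELECT` and `uses_parallel_decoder` are hypothetical.

```python
# State_select codes as listed in Table 1 (4-bit values).
STATE_SELECT = {
    "Num_VLC0": 0b0000,
    "Num_VLC1": 0b0001,
    "Num_VLC2": 0b0010,
    "Num_VLC_DC": 0b0011,
    "Trailing_ones": 0b0100,
    "Level_VLC0": 0b1000,
    "Level_VLC1": 0b1001,
    "Level_VLC2": 0b1010,
    "Level_VLC3": 0b1011,
    "Level_VLC4": 0b1100,
    "Level_VLC5": 0b1101,
    "run_before": 0b1111,
}


def uses_parallel_decoder(state_select):
    """True when the most significant of the 4 bits is 1."""
    return (state_select >> 3) & 1 == 1


for name, code in STATE_SELECT.items():
    mode = "parallel" if uses_parallel_decoder(code) else "sequential"
    print(f"{name:14s} {code:04b} {mode}")
```

With this encoding a single bit test is enough to route the input to the parallel or the serial logic operator.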
A. Coeff_token
Coeff_token is the first step in macroblock decoding and is executed only once. Total_coeff and Trailing_ones are decoded in Coeff_token. This step uses four tables: Num_VLC0, Num_VLC1, Num_VLC2, and Num_VLC_DC. Let $N_u$ be the number of nonzero coefficients of the upper block (of the current block being decoded), and let $N_l$ be the number of nonzero coefficients of the left block. Then, $N = (N_u + N_l)/2$. Based on the value of N, one of the tables is selected using Table 2. Parallelization is not necessary since this step is executed only once.

Table 1. State_select for determining steps and tables.

Steps | Tables | State_select
Step 1 (Coeff_token) | Num_VLC0 | 0000
Step 1 (Coeff_token) | Num_VLC1 | 0001
Step 1 (Coeff_token) | Num_VLC2 | 0010
Step 1 (Coeff_token) | Num_VLC_DC | 0011
Step 2 (Trailing_ones) | | 0100
Step 3 (Level) | Level_VLC0 | 1000
Step 3 (Level) | Level_VLC1 | 1001
Step 3 (Level) | Level_VLC2 | 1010
Step 3 (Level) | Level_VLC3 | 1011
Step 3 (Level) | Level_VLC4 | 1100
Step 3 (Level) | Level_VLC5 | 1101
Step 5 (Run_before) | run_before | 1111
B. Trailing_ones
The Trailing_ones value found in the previous step represents the number of trailing ones (coefficients equal to 1 or -1) in the macroblock. Because each trailing one is coded by a single bit, all of them can be decoded at once.
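Because each trailing one costs exactly one bit, this step reduces to reading a fixed number of sign bits. The sketch below assumes the usual H.264 convention that a 0 bit means +1 and a 1 bit means -1; that mapping and the function name are assumptions, since the text states only that one bit is used per trailing one.

```python
def decode_trailing_ones(bits, count):
    """Read `count` sign bits and return the trailing-one values.

    Assumed mapping: '0' -> +1, '1' -> -1.
    """
    return [1 if bit == "0" else -1 for bit in bits[:count]]


print(decode_trailing_ones("011", 3))
```

Since at most three trailing ones occur, the whole step fits in 3 bits and needs no parallel logic.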
C. Level
The Level step uses 7 lookup tables, namely, Level_VLC0, ..., Level_VLC6, as shown in Table 3. At the beginning of the Level step, Level_VLC0 is usually accessed. One exceptional [...]

The decoding steps can be explained by using the example shown in Fig. 1. Because Total_coeff = 3, the initial table to access is Level_VLC0. The absolute value of the decoded Level(1) coefficient is 1 and thus is greater than the threshold 0. Therefore, the Level_VLC1 table is to be used for the next decoding. Level(0) = 0010, and the absolute value of the decoded Level(0) is 3. Level_VLC1 is to be used next. This decoding process is also executed in parallel by efficient logical operations that consider all the possible cases. Figure 5 shows a block diagram of the Level step process when M = 8. Since all the possible 256 cases are considered in the proposed parallel logic operator shown in Fig. 5 [...]

The proposed parallel logic operation is synthesized to produce the above outputs when the input is 0010011d (d means don't care) in this case.

Case 2. State_select = Level_VLC1. Two codes, 0010 and 011, are decoded in parallel, and we get 3 and -2 using table Level_VLC1. Thus, the results are [...]
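The Level_VLC1 example (codes 0010 and 011 giving 3 and -2) can be reproduced with a simplified software model. The sketch assumes that table Level_VLCn corresponds to the H.264 suffix length n, ignores escape codes and the adjustment applied when fewer than three trailing ones are present, and uses the table-update thresholds of the standard. The function name and the `max_codes` parameter are hypothetical. The hardware evaluates all cases at once in logic, whereas this model walks through the window one codeword at a time.

```python
def decode_levels_window(bits, suffix_length, max_codes, M=8):
    """Decode up to `max_codes` Level codewords from the first M bits.

    Returns (levels, bits_consumed, next_suffix_length).
    Simplified: no escape codes, no trailing-ones adjustment.
    """
    window = bits[:M]
    levels, pos = [], 0
    while len(levels) < max_codes:
        one = window.find("1", pos)           # end of the zero-run prefix
        end = one + 1 + suffix_length         # first bit after this codeword
        if one < 0 or end > len(window):
            break                             # codeword does not fit in the window
        prefix = one - pos
        suffix_bits = window[one + 1:end]
        suffix = int(suffix_bits, 2) if suffix_bits else 0
        level_code = (prefix << suffix_length) + suffix
        if level_code % 2 == 0:
            level = (level_code + 2) >> 1     # even codes are positive levels
        else:
            level = -((level_code + 1) >> 1)  # odd codes are negative levels
        levels.append(level)
        pos = end
        # Table update (assumed to follow the standard's thresholds).
        if suffix_length == 0:
            suffix_length = 1
        if abs(level) > (3 << (suffix_length - 1)) and suffix_length < 6:
            suffix_length += 1
    return levels, pos, suffix_length


print(decode_levels_window("00100110", suffix_length=1, max_codes=2))
```

For the input 0010011d, the model consumes 7 bits and stays in Level_VLC1, because neither magnitude exceeds the threshold 3.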
D. Total_zeros
Total_zeros represents the number of zero coefficients and is decoded only once. Total_zeros and Trailing_ones are decoded by the serial logic operator shown in Fig. 5.
E. Run_before
Decoding in the Run_before step is repeated until Zero_left becomes 0. The initial value of Zero_left is the Total_zeros found in the previous step. As in the decoding in the Level step, the decoding condition for the next code is determined by the current decoded codeword in the Run_before step. We also parallelized the decoding in the Run_before step by considering all the possible cases (256 cases when M = 8). Table 5 is used to update Zero_left after decoding each codeword. When a codeword is decoded, the appropriate Run_before value in Table 5 is found, and the value is subtracted from the current Zero_left.

For example, let the input data stream be 10010000 and let Zero_left be 7. Then, the first Run_before code is '100' and the Run_before value is 3 from Table 5. Now, Zero_left is updated to 4 (=7-3). The second code is '10' and the Run_before value is 1 from Table 5. Then, Zero_left is updated to 3 (=4-1). The third code is '00' and the Run_before value is 3 from the table. Since Zero_left becomes 0 (=3-3), the Run_before step is completed. In our parallel approach, these three codewords ('100', '10', and '00') are decoded simultaneously, since all the possible combinations are considered in the proposed parallel decoding step.

Consider the example in Fig. 1. The Run_before(4) to Run_before(1) codewords are '10', '1', '1', and '01' (or '101101'). In our method, these four codewords are decoded at once in one cycle, while the sequential decoding method [2] takes four cycles to decode them. Since many short codewords appear in the Run_before step when compared to those in the Level step, significant speedup is possible by parallelizing the decoding in the Run_before step.
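The Zero_left bookkeeping in the two examples above can be checked with a small software model. The code table below is transcribed from the H.264 run_before table as an assumption; Table 5 of this paper is not reproduced here, so the entries should be verified against the standard. The names `RUN_BEFORE` and `decode_run_before_window`, and the `max_codes` parameter, are hypothetical. The model steps through the M-bit window sequentially and returns what the parallel logic would deliver in a single cycle.

```python
# Assumed code table: RUN_BEFORE[z][code] = run, where z = min(Zero_left, 7).
RUN_BEFORE = {
    1: {"1": 0, "0": 1},
    2: {"1": 0, "01": 1, "00": 2},
    3: {"11": 0, "10": 1, "01": 2, "00": 3},
    4: {"11": 0, "10": 1, "01": 2, "001": 3, "000": 4},
    5: {"11": 0, "10": 1, "011": 2, "010": 3, "001": 4, "000": 5},
    6: {"11": 0, "000": 1, "001": 2, "011": 3, "010": 4, "101": 5, "100": 6},
    7: {"111": 0, "110": 1, "101": 2, "100": 3, "011": 4, "010": 5, "001": 6},
}
# For Zero_left > 6, runs 7..14 use a run of zeros followed by a single 1.
RUN_BEFORE[7].update({"0" * (n - 4) + "1": n for n in range(7, 15)})


def decode_run_before_window(bits, zero_left, max_codes=None, M=8):
    """Decode all Run_before codewords that fit in the first M bits.

    Returns (runs, bits_consumed, remaining_zero_left). An empty `runs`
    means the first codeword does not fit and a normal decoder is needed.
    """
    window = bits[:M]
    runs, pos = [], 0
    while zero_left > 0 and (max_codes is None or len(runs) < max_codes):
        table = RUN_BEFORE[min(zero_left, 7)]
        for length in range(1, len(window) - pos + 1):
            run = table.get(window[pos:pos + length])
            if run is not None:
                break
        else:
            break  # no complete codeword is left in the window
        runs.append(run)
        pos += length
        zero_left -= run  # update Zero_left before selecting the next column
    return runs, pos, zero_left


print(decode_run_before_window("10010000", zero_left=7))
print(decode_run_before_window("101101", zero_left=3, max_codes=4))
```

The second call models the Fig. 1 example; `max_codes=4` stands in for the limit imposed by the number of nonzero coefficients, which the text does not spell out for this step.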
Further Optimization by Using Prefix Precomputation
In memory-lookup-based decoding methods, memory partition techniques are widely used. These decode the prefix parts separately to reduce the total memory size [2], [3]. However, we use logic circuits instead of memory. Therefore, if the logic synthesis tool were ideal in logic optimization, precomputation would be unnecessary. In practice, logic synthesis tools are not ideal, and the quality of the results depends on the input description. In other words, if the input description capitalizes on the structure of the logical operations, better solutions can be obtained from a logic synthesizer.
Inspired by memory partitioning, we have developed prefix precomputation techniques to reduce the decoder area. Figure 7 shows an example codeword which is 12 bits long and whose prefix is 8 bits long. If we precompute this prefix as 0011, then the input to the logic operator can be reduced from 12 bits to 8 bits, as shown in Fig. 8 . Figure 8(a) shows the original logic circuit, and Fig. 8(b) shows the circuit optimized by using prefix precomputation. When we synthesize the circuits by using a well-known commercial logic synthesizer, the original circuit takes 33 instances (gates). However, the circuit optimized by using prefix precomputation takes only 23 instances (gates). Furthermore, the optimized circuit uses smaller gates due to the simpler circuit structure. Table 6 shows the precomputation table we implemented for prefix values from 00001 to 0000000000000001. Using this precomputation, the circuit area can be further reduced by 11% when compared to the original circuit.
IV. Experimental Results
CAVLC is decoded in 5 steps (states). We use logical operations instead of memory lookup tables or arithmetic operations. The proposed parallel decoding is used for the Level and Run_before states. Since Run_before statistically contains many short codes, parallel processing is more effective for Run_before than for Level.
We also designed a parallel CAVLC decoder based on multiple-symbol parallel decoding [6] to compare its performance and area with those of the proposed decoder. The input stream was obtained by using JM 10.2 with QP = 24 from the Foreman video sequence. Tables 7 and 8 show performance and area comparisons, respectively. Both decoders target the baseline profile and were designed with the Synopsys Design Analyzer using the Hynix 0.25 µm library. The decoders were synthesized for a 50 MHz clock rate. In the tables, CD indicates the number of codeword detectors used.
In the current implementation, M = 8 and 8 bits of input stream are analyzed simultaneously. At least one codeword is decoded in a cycle and multiple codewords can be decoded if the sum of their code lengths is not greater than M. For M=8, an average of 1.57 codewords are decoded in a cycle for the Foreman sequence. The results for the Mobile sequence are very similar. The performance (throughput) of our decoder is similar to that of the multiple-symbol parallel decoder with 6 codeword detectors (CD #6). However, our decoder uses 40% less area when compared to the multiple-symbol parallel decoder [6] . The decoder described in [4] can decode 0.05 or 0.04 codes/cycle and shows significantly lower decoding rates than those of our proposed decoder and that proposed in [6] . Table 8 shows a comparison in terms of area for the methods proposed in [4] and [6] , our original method, and our improved method. In [4] , gate counts excluding buffers are shown, and the area is not shown. Therefore, area cannot be directly compared with the method in [4] . When compared to our original method, our improved method using prefix precomputation requires 11% less area. When compared to the multi-symbol parallel decoding method in [6] , our improved decoder uses 46% less chip area while maintaining similar performance (CD #6).
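As a rough consistency check on the area figures, the two reported reductions can be composed. This assumes that the 40% saving refers to the original design without prefix precomputation and that both percentages are measured on the same area metric; neither assumption is stated explicitly in the text.

```latex
\[
(1 - 0.40)\,(1 - 0.11) = 0.60 \times 0.89 = 0.534,
\qquad
1 - 0.534 = 0.466 \approx 46\%.
\]
```

Under these assumptions, the improved decoder occupies about 53% of the area of the multiple-symbol parallel decoder, which agrees with the reported 46% reduction.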
V. Conclusion
For conventional CAVLC decoding methods, such as memory lookup and arithmetic methods, parallel decoding is difficult due to the sequential nature of the decoding process. In this paper, a new logical-operation-based CAVLC decoding scheme was proposed as well as a parallel decoding architecture to efficiently process the Level and Run_before steps. Compared with the multiple-symbol parallel decoder in [6], our decoder uses 46% less area while achieving similar throughput.
