This paper presents a novel low-cost, high-performance CAVLC decoder for H.264/AVC. The proposed decoder generates the lengths of coeff_token and total_zeros symbols with simple arithmetic operations, so it can be implemented with reduced look-up tables. We also propose a multi-symbol run_before decoder with enhanced throughput, which can decode more than 2.5 symbols per cycle when run_before symbols remain to be decoded. The hardware cost is about 12 K gates when synthesized at 125 MHz.
key words: CAVLC decoder, VLSI, H.264/AVC
Introduction
Low-cost, high-performance multimedia codecs are needed because high-quality multimedia data is used in a wide range of mobile devices. To meet this need, H.264/AVC was developed by the Video Coding Experts Group of ITU-T and the Moving Picture Experts Group of ISO/IEC. Several new features, such as quarter-pixel precision motion estimation, various intra prediction modes, an integer transform, an adaptive in-loop filter, and enhanced entropy coding, were adopted for higher coding efficiency. As a result, H.264/AVC achieves an improved compression rate. However, the increased complexity of the H.264/AVC codec raises a cost-effectiveness problem in its development [13]. Hardware implementation of the H.264/AVC codec is therefore inevitable.
Context-based Adaptive Variable Length Coding (CAVLC), an entropy coding method of H.264/AVC, is used to encode and decode zig-zag scanned 4 × 4 or 2 × 2 residual data. Because CAVLC consists of variable-length symbols, the next decoding step cannot start until the current one has finished, so the decoding steps are processed sequentially. A CAVLC decoder must therefore be designed carefully for real-time, high-quality mobile application systems. This paper presents a low-cost, high-throughput CAVLC decoder architecture that exploits the features of CAVLC. The proposed decoder features efficient decoding methods for the Variable Length Code Tables (VLCTs), high-throughput multi-symbol run_before decoding, and a novel flush unit that renews the bit-stream registers without delay. The rest of the paper is organized as follows. The CAVLC decoding flow and previous works are described in detail in Sect. 2. The proposed CAVLC decoder architecture is presented in Sect. 3. The verification method and implementation results are given in Sect. 4. Finally, conclusions are drawn in Sect. 5.
CAVLC Decoding Flow and Previous Works
Zig-zag scanning, run-length coding, and CAVLC are adopted to improve the coding efficiency of residual data compression in H.264/AVC. CAVLC decoding consists of five sub-decoding steps, which are shown in Fig. 1.
coeff_token decoding, the first step of CAVLC decoding, yields the number of non-zero coefficients (Tc) and the number of trailing ones (T1s) in the reconstructed residual block, as depicted in Fig. 1. These values determine how many times the following sub-decoding steps are executed. In the next step, the signs of the trailing ones (T1s_sign) are decoded. The trailing ones are the last coefficients in the zig-zag scanned block data whose absolute value is 1. Each sign is decoded from one following bit, in reverse order, so T1s_sign decoding is performed T1s times. Reverse order means that decoding proceeds from the last coefficient in the zig-zag scanned block data toward the first.
Level decoding is then carried out to decode the non-zero coefficients other than the trailing ones in the zig-zag scanned residual data. Level symbols are decoded in reverse order, and the step is repeated Tc − T1s times; when Tc is equal to 0 or to T1s, level decoding is skipped. In the example of Fig. 1, the absolute value of the first decoded level is incremented by 1. This is a conditional exception that shortens the bit-stream, and it is explained in sub-clause 3.2. The next step is total_zeros decoding, which yields the number of zeros before the last non-zero coefficient in the zig-zag scanned residual data. Two different tables are used for total_zeros decoding: one for 4 × 4 blocks and one for 2 × 2 chroma DC blocks. total_zeros decoding is not carried out, and total_zeros is set to zero, when Tc is equal to maxNumCoeff; it is also skipped when Tc is zero. maxNumCoeff is sixteen, fifteen, or four depending on the type of residual block. In the run_before decoding step, the number of zeros between adjacent coefficients is decoded in reverse order. The run_before decoder uses VLCTs that are partitioned by zeroLeft. zeroLeft is initialized with total_zeros and is decreased by each decoded run_before. run_before decoding continues until zeroLeft becomes zero or no run_before symbol remains, that is, it is performed at most Tc − 1 times. Finally, the coefficients obtained in level decoding and the run_before values are merged to reconstruct the residual data.
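The final merging step can be illustrated with a short software model. The following Python sketch is only an illustration under stated assumptions: the function name `reconstruct_block` and its argument layout are hypothetical and do not come from the proposed hardware, and the levels and run_before values are assumed to be already decoded and supplied in decoding (reverse zig-zag) order.

```python
def reconstruct_block(levels, total_zeros, run_before, max_num_coeff=16):
    """Merge decoded levels and runs into a zig-zag ordered coefficient list.

    levels      : non-zero coefficients in decoding order (reverse zig-zag),
                  trailing ones first
    total_zeros : zeros located before the last non-zero coefficient
    run_before  : decoded run_before values (at most Tc - 1), same order
    """
    tc = len(levels)
    coeffs = [0] * max_num_coeff
    if tc == 0:
        return coeffs

    zero_left = total_zeros
    runs = []
    for i in range(tc - 1):
        # A run_before symbol is only present while zeroLeft is non-zero.
        run = run_before[i] if zero_left > 0 and i < len(run_before) else 0
        runs.append(run)
        zero_left -= run
    # Remaining zeros sit in front of the lowest-frequency coefficient.
    runs.append(zero_left)

    pos = -1
    for i in range(tc - 1, -1, -1):
        pos += runs[i] + 1
        coeffs[pos] = levels[i]
    return coeffs
```

For instance, with levels `[1, -1, -1, 1, 3]`, total_zeros of 3, and run_before values `[1, 0, 0, 1]`, the model places the coefficients at zig-zag positions 1, 3, 4, 5, and 7.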
CAVLC symbols other than the trailing ones' signs (T1s_sign) are encoded with an Exp-Golomb code, which consists of leading zeros, a '1', and a suffix (info). It is therefore important in CAVLC decoding to find the number of leading zeros (leading_zeros) quickly. In ref. [1], Di proposed an efficient leading zeros detector that has been widely adopted in the literature. Look-up table (LUT) and memory architectures that exploit the number of leading zeros have been proposed for coeff_token, total_zeros, and run_before decoding in refs. [3]-[10]. In the early work [6], a sequential symbol matching process runs from shorter symbols to longer symbols. This scheme is not suitable for high-performance real-time applications because decoding a long symbol requires a long matching time. In refs. [3]-[5], Moon and Yu proposed VLCT memory access schemes that decode a symbol within a bounded number of cycles. These architectures have drawbacks when the VLCTs are implemented with a memory. The processing time varies with the symbol length, and the memory must store the length of each symbol, so additional storage area is required. Moreover, when a memory is used for the VLCTs, the decoding results are available only in the next cycle, so the skip condition for the next decoding step cannot be evaluated within the current decoding process.
To improve on memory-based VLCT decoders, various VLCT architectures using LUTs have been proposed in refs. [2], [6], [8]. CAVLC decoding methods using LUTs require a couple of sequential look-up table accesses. They therefore have a long critical path, and each LUT element holds symbol length information.
The total_zeros decoder can be designed with a method similar to that used in coeff_token decoding, because total_zeros symbols have a syntax similar to that of coeff_token symbols. In ref. [9], Moon proposed a total_zeros decoding method with simple address generation for memory access, in which some tables are replaced by arithmetic decoding to reduce hardware (H/W). The total_zeros decoder in Moon's work generates the symbol length with simple arithmetic operations and therefore needs less memory than other designs.
The run_before decoder has small VLCTs compared with the coeff_token and total_zeros VLCTs. Table removal schemes using arithmetic operations have been proposed, and Moon proposed a fully arithmetic decoding method for the run_before decoder in ref. [11]. Yu and Lee proposed multi-symbol run_before decoders that decode a couple of run_before symbols at once in refs. [8], [15]. Yu's multi-symbol run_before decoder uses a large memory of 86 elements, each holding two contiguous symbols. It increases throughput but occupies a larger area. Lee adopted Moon's run_before decoder and proposed a probability-based symbol length prediction scheme for multi-symbol run_before decoding, achieving a 2-fold increase in throughput. A multi-symbol run_before decoder generates several level indexes from the run_before values to update the output array.
To improve on previous work, we propose a low-cost LUT that does not contain symbol length information. The proposed LUT is accessed with an address generated from the number of leading zeros and the bit-stream. We also propose a high-throughput multi-symbol run_before decoder and a novel flush unit.
Proposed CAVLC Decoder
The block diagram of the proposed CAVLC decoder is shown in Fig. 2. The proposed decoder shares a single leading zeros detector, whereas previous works [6]-[8] placed separate detectors in the level, run_before, and coeff_token sub-decoders. Sharing reduces H/W cost and makes the number of leading zeros available to every sub-decoder in each decoding process. The info generator produces the suffix bit-stream by shifting the current bit-stream by leading_zeros + 1. With the shared leading zeros detector and the info generator, the proposed CAVLC decoder has an architecture suited to detecting Exp-Golomb codes, which consist of leading zeros, a '1', and a suffix (info).
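The behavior of the shared leading zeros detector and the info generator can be sketched in software as follows. This is a minimal behavioral model, not the circuit of Fig. 2: the 16-bit window width and the function names `leading_zeros` and `info_bits` are assumptions made for illustration.

```python
def leading_zeros(window, width=16):
    """Count the zeros ahead of the first '1' in a width-bit window (MSB first)."""
    for n in range(width):
        if (window >> (width - 1 - n)) & 1:
            return n
    return width


def info_bits(window, width=16):
    """Return (leading_zeros, info).

    info is the window shifted left by leading_zeros + 1, so the suffix
    that follows the '1' marker ends up left-aligned in the window.
    """
    lz = leading_zeros(window, width)
    mask = (1 << width) - 1
    return lz, (window << (lz + 1)) & mask
```

For the window `0001 0110 0000 0000`, the model reports three leading zeros and an info field that starts with the bits `011`.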
We also propose a novel flush unit that renews the bit-stream registers without additional cycles. The proposed flush unit calculates the bit-stream length consumed so far within the existing decoding process. In previous works [1], [3], the flush unit consists of an accumulator, a shifter, and two 32-bit registers. It cannot generate the bit-stream request signal (bit_stream_req) within the current symbol decoding process, so if the consumed bit-stream length for a residual block exceeds thirty-two, an additional cycle is required to renew the bit-stream registers. The proposed flush unit, in contrast, has an additional adder that generates the bit-stream request signal from the current symbol length (symbol_len) and the bit-stream length consumed up to the previous symbol. If the consumed bit-stream length exceeds thirty-two, the request signal is generated in the current decoding process, so the following decoding step proceeds with a refreshed bit-stream.
The coeff_token & T1s_sign decoder and the total_zeros decoder are designed as combinational logic, but the level and run_before decoders must store their current state values because their decoding processes are self-dependent. In Fig. 2, the shaded blocks are the modified blocks, which are explained in the following sub-clauses.
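The idea of raising the request in the same step as the symbol that crosses the word boundary can be modeled as below. This is a hedged behavioral sketch, not the proposed circuit: the class name `FlushUnit`, the strict comparison against 32, and the assumption that one symbol never exceeds 32 bits are all illustrative choices.

```python
class FlushUnit:
    """Behavioral model of a flush unit with same-cycle request generation."""

    WORD = 32  # width of one bit-stream register (assumed)

    def __init__(self):
        self.consumed = 0  # bits consumed up to the previous symbol

    def step(self, symbol_len):
        """Account for one decoded symbol; return True if a new word is requested."""
        # The extra adder: previously consumed length plus current symbol length.
        total = self.consumed + symbol_len
        bit_stream_req = total > self.WORD
        # When a word is requested, the window advances by one word.
        self.consumed = total - self.WORD if bit_stream_req else total
        return bit_stream_req
```

In this model, a 20-bit symbol followed by a 15-bit symbol raises the request during the second symbol, leaving three consumed bits in the refreshed window.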
Proposed coeff_token & T1s_sign Decoder
The proposed coeff_token & T1s_sign decoder depicted in Fig. 3 uses four decoding steps to decode the total coefficients, the trailing ones, and the signs of the trailing ones. The first step (suffix_len decoding) and the second step (addr_gen) generate the address for VLCT access. In the third step, the symbol length is decoded from leading_zeros, the suffix length, and the decoded LUT elements. Finally, the signs of the trailing ones are decoded in the fourth step.
In the first step, the suffix length of the coeff_token symbol is decoded with logical operations. The operations that calculate the suffix length (suffix_len) are given in Eqs. (1)∼(3).
In the second step, the address for LUT access is decoded from the number of leading zeros, the suffix bit-stream (info), and the suffix length obtained in the previous step. The address decoding operations for VLCT0, VLCT1, VLCT2, and chroma DC are given in Eqs. (4)∼(7), respectively. After address decoding, the address is adjusted depending on the suffix length decoded in step 1. If suffix_len is three, the MSB of the suffix bit-stream (info[0]) is inverted and added to the address value to form the final address. Otherwise, the address value obtained from Eqs. (4)∼(7) is used as the final address.
Here '<<' denotes the left-shift operation. The proposed address decoder used for coeff_token VLCT decoding is shown in Fig. 4.
In this paper, the four VLCTs are stored in four look-up tables. Each row holds four elements, each consisting of Tc and T1s. Among the elements read at the decoded address, the final coeff_token (Tc and T1s) is selected with the suffix bit-stream (info), which serves as a sub-address to choose one of the four elements in a row. The sub-address (sub_addr) is taken from the suffix bit-stream depending on the value of suffix_len. If suffix_len is three, the second and third bits (info[1:2]) of the suffix bit-stream produced by the info generator are chosen as sub_addr; otherwise, the first and second bits (info[0:1]) are selected. If the valid suffix bit-stream is shorter than two bits, the adjacent element is duplicated so that the correct element is decoded regardless of the invalid suffix bits. If the suffix length is three, eight elements are stored in two consecutive rows of a look-up table. This regularity makes coeff_token decoding straightforward. Because of the duplicated elements, the proposed LUTs shown in Tables 1∼4 are not used with full efficiency, but the look-up table size is reduced by about 30 % compared with ref. [8] because the symbol length of each element is not stored. In the tables, the first number of each element is the total coefficient count and the second is the number of trailing ones.
In the third step, the symbol length is decoded. An access to the proposed LUTs with addr returns four elements, which are used to select the number of total coefficients and the number of trailing ones. The symbol length is obtained as

$$\mathit{symbol\_len} = \mathit{leading\_zeros} + 1 + \mathit{suffix\_len}$$
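The sub-address selection and the symbol length calculation described above can be sketched as follows. The sketch assumes that info is held left-aligned in a 16-bit window with info[0] as the MSB; the function names and the window width are hypothetical and are not taken from the proposed design.

```python
def sub_addr(info, suffix_len, width=16):
    """Select the 2-bit column index within a LUT row.

    info[1:2] is used when suffix_len is three, info[0:1] otherwise.
    info is assumed to be left-aligned, with info[0] as the MSB.
    """
    skip = 1 if suffix_len == 3 else 0
    return (info >> (width - 2 - skip)) & 0b11


def symbol_len(leading_zeros, suffix_len):
    """Symbol length: leading zeros, the '1' marker, and the suffix."""
    return leading_zeros + 1 + suffix_len
```

As an example, the six-bit code `000101` has three leading zeros and a two-bit suffix, so the model returns a symbol length of 6.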
With the proposed coeff_token decoding flow, Tc, T1s, and the symbol length can be obtained. However, there is one irregular symbol (nC = −1, bit-stream = 000 0000 ...), which is marked by the shaded elements in Table 4. To obtain correct results, this exception is handled by additional logic that modifies the leading_zeros value used in the symbol_len calculation.
Finally, T1s_sign decoding transfers the signs of the trailing ones to the level register file. The T1s_sign bits start at position suffix_len − compare0 − compare1 in the suffix bit-stream (info). The following T1s bits are used for trailing ones sign decoding and are parsed and stored in the level register file.
Unlike the other VLC decoding, Fixed Length Code (FLC) decoding can be defined with an arithmetic function. The symbols of the FLCT (for 8 ≤ nC) have a fixed length of six bits. FLC decoding is defined as Eq. (9).
Here bs denotes the current valid bit-stream generated from the 64-bit shifter, and the numbers in square brackets are the bit positions in the bit-stream used for the Tc and T1s calculation. In the following equations, bs likewise denotes the valid bit-stream parsed from the flush unit.
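An arithmetic FLC decode of this kind can be sketched as below. The sketch follows the 6-bit coeff_token table of the H.264/AVC standard for 8 ≤ nC as commonly documented (upper four bits give Tc − 1, lower two bits give T1s, and the code 0000 11 stands for an empty block); it may differ in form from Eq. (9), and the function name is hypothetical.

```python
def decode_flc_coeff_token(bs6):
    """Decode a 6-bit fixed-length coeff_token (used when nC >= 8).

    bs6 holds the six symbol bits as an integer, first bit as the MSB.
    Returns (Tc, T1s).
    """
    if bs6 == 0b000011:
        # Reserved pattern: no non-zero coefficients in the block.
        return 0, 0
    tc = (bs6 >> 2) + 1   # upper four bits encode Tc - 1
    t1s = bs6 & 0b11      # lower two bits encode T1s
    return tc, t1s
```

The symbol length is always six in this mode, so no length information needs to be looked up.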
Level Decoder
Level symbols are decoded not with VLCTs but with an arithmetic decoding procedure. To decode level symbols, the maximum length of a level symbol must be analyzed precisely. The length of a level symbol depends on the supported profile. For the baseline, main, and extended profiles, the prefix of a level symbol is at most fifteen, and the length of the suffix is prefix − 3 or less. The maximum length of a level symbol is therefore twenty-eight bits. In the other profiles, the length of the prefix is at most 11 + bit depth, where the bit depth is between eight and fourteen. The level decoder is designed to decode symbols of up to twenty-eight bits because the proposed CAVLC decoder supports up to the main profile. The level decoding flow is described in Table 5, where '∼' denotes the bit-wise NOT operation.
There is a conditional exception in the level decoding procedure, as mentioned in Sect. 2. It is applied in H.264/AVC encoding to improve the compression rate. In CAVLC encoding, the absolute value of the first non-trailing-ones level is reduced by one when the number of trailing ones is less than three: the level is incremented by one if it is negative and decremented by one if it is positive, so that it moves closer to zero. In the decoder, this exception is implemented as a condition that checks level_cnt and T1s in the level decoding process. If level_cnt is zero and T1s is less than three, levelCode is incremented by 2. As a result, the absolute value of the last non-trailing-ones level in zig-zag order, which is the first level to be decoded, is incremented by one.
In the level decoding flow, the suffixLength determined in the previous level decoding process is used in the current one. Level decoding is therefore a self-dependent process and is implemented with sequential logic.
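The self-dependence on suffixLength and the conditional exception can be illustrated with a software model of one level decoding step. The sketch below follows the level parsing rules of the H.264/AVC standard as commonly documented, not the contents of Table 5; the function names and argument layout are hypothetical, and the prefix (number of leading zeros) and suffix value are assumed to be already extracted from the bit-stream.

```python
def level_suffix_size(level_prefix, suffix_length):
    """Number of suffix bits that follow the prefix of a level symbol."""
    if level_prefix == 14 and suffix_length == 0:
        return 4
    if level_prefix >= 15:
        return level_prefix - 3
    return suffix_length


def decode_level(level_prefix, level_suffix, suffix_length, level_cnt, t1s):
    """Decode one level; return (level, suffix_length for the next level)."""
    level_code = (min(15, level_prefix) << suffix_length) + level_suffix
    if level_prefix >= 15 and suffix_length == 0:
        level_code += 15
    if level_prefix >= 16:
        level_code += (1 << (level_prefix - 3)) - 4096
    # Conditional exception: first decoded level with fewer than three
    # trailing ones cannot be +/-1, so its code was shortened by two.
    if level_cnt == 0 and t1s < 3:
        level_code += 2
    # Even codes map to positive levels, odd codes to negative levels.
    if level_code % 2 == 0:
        level = (level_code + 2) >> 1
    else:
        level = (-level_code - 1) >> 1
    # suffixLength adapts to the magnitude of the level just decoded.
    if suffix_length == 0:
        suffix_length = 1
    if abs(level) > (3 << (suffix_length - 1)) and suffix_length < 6:
        suffix_length += 1
    return level, suffix_length
```

With T1s below three, the shortest level code (prefix 0, no suffix) decodes to +2 for the first level instead of +1, which is the increment described above.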
Proposed total_zeros Decoder
The proposed total_zeros decoder uses LUTs designed in the same way as those of the coeff_token decoder. The suffix length of a total_zeros symbol is less than two, so the address adjustment used in the second step of coeff_token decoding is not required. The symbol length is decoded with the same method as in coeff_token decoding. However, total_zeros decoding requires an additional decoding process because a number of zero-sequence symbols cannot be decoded by the proposed symbol length decoding method. We therefore calculate the maximum length of the zero sequence before address decoding with Eq. (10), proposed in ref. [9].
In Eq. (10) 
Proposed run_before Decoder
Most run_before symbols are shorter than three bits, and the run_before decoder has a smaller H/W size than the other sub-decoders. Multi-symbol run_before decoders based on this feature of run_before symbols have been studied in the literature [8], [15], but their H/W size increases greatly relative to the throughput gain. Yu proposed separate run_before tables for a multi-symbol run_before decoder in ref. [15]. That decoder decodes two run_before symbols when zeroLeft is less than six, using a table that holds the possible combinations of two consecutive run_before symbols. It has enhanced throughput, but the table size increases exponentially. In ref. [11], Moon proposed a run_before decoder based on fully arithmetic decoding, and Lee then improved Moon's work for efficient H/W implementation. Lee also proposed a multi-symbol run_before decoder based on a statistical analysis of the relation between run_before symbol length and zeroLeft in ref. [8]. That decoder predicts the lengths of the current and next run_before symbols from the current zeroLeft. It can decode three run_before symbols in a cycle when the lengths of the decoded symbols match the predictions, but its prediction success ratio is not high. The proposed run_before decoder reduces H/W size by increasing the regularity of the run_before decoding operation. We remove three adders and three 2-input MUXes with 5 additional gates. run_before decoding is divided into four cases, zeroLeft = (1 and 2), zeroLeft = (3, 4, and 5), zeroLeft = 6, and zeroLeft > 6, listed in Eq. (13), (14), (15)
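For reference, the dependence of run_before decoding on zeroLeft can be modeled in software as below. This sketch is a table-based reference model built from the run_before code table of the H.264/AVC standard as commonly documented; it is not the arithmetic formulation of the proposed decoder, and the names `RUN_BEFORE_TABLES` and `decode_run_before` are hypothetical. The bit-stream is passed as a string of '0' and '1' characters for readability, and zeroLeft is assumed to be at least one.

```python
# Code tables for zeroLeft = 1..6: bit pattern -> run_before value.
RUN_BEFORE_TABLES = {
    1: {"1": 0, "0": 1},
    2: {"1": 0, "01": 1, "00": 2},
    3: {"11": 0, "10": 1, "01": 2, "00": 3},
    4: {"11": 0, "10": 1, "01": 2, "001": 3, "000": 4},
    5: {"11": 0, "10": 1, "011": 2, "010": 3, "001": 4, "000": 5},
    6: {"11": 0, "000": 1, "001": 2, "011": 3, "010": 4, "101": 5, "100": 6},
}


def decode_run_before(bits, zero_left):
    """Decode one run_before symbol; return (run_before, symbol length)."""
    if zero_left > 6:
        if bits[:3] != "000":
            # Three-bit codes 111..001 map to run_before 0..6.
            return 7 - int(bits[:3], 2), 3
        # Longer codes: n leading zeros followed by '1' give run_before n + 4.
        lz = len(bits) - len(bits.lstrip("0"))
        return lz + 4, lz + 1
    table = RUN_BEFORE_TABLES[zero_left]
    for length in (1, 2, 3):
        code = bits[:length]
        if code in table:
            return table[code], length
    raise ValueError("no run_before code matches")
```

Decoding the stream `10 1 1 01` with zeroLeft starting at 3 yields the run_before values 1, 0, 0, and 1, with zeroLeft decreasing from 3 to 2 after the first symbol.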
The proposed multi-symbol run_before decoder shown in Fig. 6 uses the proposed run_before decoder depicted in Fig. 5. The proposed CAVLC decoder supplies the number of leading zeros to the multi-symbol run_before decoder. The number of leading zeros and zeroLeft are used to decode the length of the 1st run_before symbol before the 1st run_before result is generated. As a result, two run_before symbols can be decoded correctly whenever run_before symbols remain to be decoded. Each run_before decoder in the multi-symbol run_before decoder is executed when the previous zeroLeft is larger than zero and the number of run_before executions is less than Tc − 1. The 3rd run_before decoder produces a correct result if the symbol length generated in the 2nd run_before decoder is identical to the prediction in Table 6.
The conditions described in the previous paragraph are checked in a run_before controller. The controller generates a 3-bit run_en signal that enables writing the level coefficients stored in the level register file, using the level indexes generated by the run_before decoders and the level index register. The level index register stores the level index last written in the previous run_before decoding, from which the level indexes of the following level coefficients are derived. The level indexes are used as addresses to store level coefficients in the output registers.
The controller also generates the selection signals for the 4 MUXes positioned on the right side of the run_before_d blocks. If the result generated in a run_before_d block does not match the prediction, or if no symbol remains to be decoded, the run_before and symbol_len of that block are ignored and the dummy value '0' is selected instead. This yields correct level indexes and a correct total symbol length, which is the sum of the symbol_len values generated in the run_before decoder blocks.
Finally, the multi-symbol run_before decoder generates the total length of the decoded run_before symbols (t_symbol_len) and selects the valid zeroLeft according to the number of decoded run_before symbols. We also propose a simple symbol length decoding operation for the 2nd and 3rd run_before decoders in Eq. (17).
Experimental Results
The test sequences used for the experimental results are provided by ITU [16]. The input and output data of the CAVLC decoder are generated with JM 16.0 for functional verification. Using public test sequences avoids misjudgments caused by subtle differences between test sequences.
Performance Evaluation of Proposed Multi-Symbol run_before Decoder
run_before symbol length pre-decoding is categorized into twelve cases. It is used to pre-decode the first run_before symbol length from zeroLeft and leading_zeros. The second run_before symbol length is examined with eight general-purpose test sequences provided by ITU, for the situations in which second run_before symbol decoding is performed and run_before symbols remain to be decoded in the third run_before decoder. The eight test sequences, which have various quantization parameters (Qp) and levels, are used to avoid locally biased prediction results. In Fig. 7, the numbers on the left and right of the underscore in the x label denote the case in Table 6 and the second run_before symbol length, respectively. The maximum occurrence rate is eight because the number of executed third run_before symbol decodings is normalized to one in each test sequence. The second run_before symbol length prediction has a success ratio above 50 % in all cases except case 12. In particular, in cases 1, 3, and 5 the prediction success ratio is 100 %.
Fig. 7 Occurrence rates of the second run_before symbol length.
We compared the performance of the proposed multi-symbol run_before decoder with Oh's and Lee's works. We decoded eight test sequences and compared the total cycles consumed in run_before decoding. The results are shown in Table 7. run_before symbols account for a small proportion of total CAVLC decoding in low-quality test sequences such as BA3 SVA C and BA2 SVA F. In these cases, the double-symbol run_before decoder using run_before symbol length pre-decoding is comparable with Lee's multi-symbol run_before decoder, which has three run_before decoding paths. When more level decoding and run_before decoding is required than in the low-quality test sequences, Lee's multi-symbol run_before decoder consumes fewer decoding cycles than the double-symbol run_before decoder, because Lee's work uses 3-path run_before decoding and the statistics of run_before symbol length. The proposed multi-symbol run_before decoder, however, reduces processing cycles by about 7∼11 % compared with Lee's work, owing to run_before symbol length pre-decoding and highly accurate run_before symbol length prediction.
Table 7 Processing cycle comparison of various run_before decoders.
Performance Evaluation of Proposed CAVLC Decoder
The proposed CAVLC decoder requires far fewer cycles for CAVLC decoding than Chang's and Alle's works, owing to the coeff_token & T1s_sign decoding, the high-throughput multi-symbol run_before decoder, the novel flush unit, and look-up table based symbol decoding. In addition, the throughput is increased by about 4∼9 % compared with Lee's work by using run_before symbol length pre-decoding and highly accurate run_before symbol length prediction. For the throughput comparison, we calculate the average number of cycles per macroblock, which is conventionally used in CAVLC throughput comparisons. The throughput comparison is shown in Table 8, where '-' denotes missing data.
Implementation Results
We modeled the proposed algorithm in Matlab, and the input and output data used in the software simulations were generated with JM reference software ver. 16.0. After software Table 9 . We also designed a hybrid total_zeros decoder that combines extended arithmetic decoding with the proposed look-up table. Owing to the extended simple arithmetic decoding, the look-up table area is reduced by 10 % compared with ref. [9]. We also optimized the run_before decoder. As a result, the proposed CAVLC decoder can be implemented with 23 % less H/W than ref. [8].
Conclusion
CAVLC decoding plays an important role in the high coding efficiency of H.264/AVC. However, the decoding flow must be executed sequentially because of the variable-length nature of CAVLC symbols, which is a problem for high-quality, real-time video decoding. To overcome this drawback of CAVLC decoding, we proposed a low-cost, high-throughput CAVLC decoder.
To reduce H/W, we proposed a simple symbol length generation method for the coeff_token, total_zeros, and run_before decoders. In the total_zeros decoder, we also extended the arithmetic decoding to obtain highly efficient hybrid decoding and a smaller LUT. In previous works, each sub-decoder has its own leading zeros detector to detect the leading zeros of the Exp-Golomb code. In contrast, we designed the CAVLC decoder with a shared leading zeros detector to reduce H/W size and to provide the sub-decoders with additional information.
In addition, the proposed CAVLC decoder has increased throughput owing to the proposed multi-symbol run_before decoder and a novel flush unit that requires no additional cycles for bit-stream buffer renewal. As a result, the proposed CAVLC decoder can process high-quality video sequences effectively. Its cost-efficient design also makes it applicable to portable media applications.
