Abstract-This paper presents a VLC decoder that decodes block coefficient data for MPEG-2 and CAVLC for H.264. To make the VLC decoder programmable, a memory-based architecture with improved memory efficiency is proposed. The group-based look-up table (LUT) algorithm is extended to multi-table merging (MTM), which further exploits the redundancy among groups. With the multi-table merging algorithm, all coding tables are integrated into memory more efficiently. Because memory access can consume considerable power, a low-power scheme is proposed to reduce memory access. A distributed cache is adopted to save power and to improve decoding throughput. Simulation results show that the cache with replacement reduces memory accesses by about 60% to 95%.
I. INTRODUCTION
Variable length coding (VLC) is an important technique for achieving compression in current video coding standards. The ITU-T/ISO/IEC Joint Video Team established the newest video coding standard, known as H.264/AVC [1]. This standard achieves a much higher compression ratio than previous standards. H.264/AVC has two entropy coding types: CAVLC, used in the baseline profile, and CABAC, used in the main profile. Meanwhile, MPEG-2 from the Moving Picture Experts Group (MPEG) [2] is widely used in video applications. A video decoder therefore needs to support both standards.
The H.264/AVC baseline profile uses CAVLC, which is decoded by table look-up. CAVLC has five syntax elements: Coeff_Token, sign of TrailingOnes, Level, Total_Zeros and Run_Before. The sign of TrailingOnes is decoded by reading at most 3 bits. Level can be decoded by detecting the first 1 and then reading a certain number of bits from the bitstream. However, Coeff_Token, Total_Zeros and Run_Before require frequent memory access because they are decoded by table look-up. The coefficient data in MPEG-2 consist of run-level pair (RLP) symbols and are also decoded by table look-up. A memory-based architecture is therefore used so that different table contents can be loaded into memory for MPEG-2 or H.264. Nevertheless, memory access consumes power and thus becomes an overhead for handheld or portable devices. Consequently, reducing memory access is a key low-power design issue.
Several research efforts to reduce memory access have been reported. In [3], some short codewords are decoded by arithmetic operations to reduce memory access, but the sequential search leads to low throughput. In [4], almost all codewords are decoded by arithmetic operations, while a few codewords are decoded by table look-up with binary search (TLBS) to reduce memory access. However, that decoding method is not implemented in hardware, and its arithmetic operations and conditions are complicated. HLLT (Hierarchical Logic for Look-up Table) and PCCF (Partial Combinational Component Freezing) were proposed in [5] to improve speed and to reduce power consumption, respectively. Finally, the scheme proposed in [6] decodes short codewords with arithmetic operations, while other codewords are decoded conventionally.
Although these methods reduce memory access, their decoders cannot support multi-standard video decoding. To reduce the memory space while maintaining programmability for different applications, multi-table merging (MTM) is used to combine all tables into memory [7]. Furthermore, a group partitioning scheme for short VLC codewords is proposed to improve memory efficiency, and a cache with Least Recently Used (LRU) replacement is designed to reduce the power consumption of the proposed VLC decoder.
II. MULTI-TABLE MERGING ALGORITHM
From the CAVLC decoding process and the group-based VLD algorithm [8], it can be seen that the storage space increases if all tables are grouped and their information is stored in memory separately. Therefore, an algorithm is needed to merge these tables into a single memory. The algorithm is explained as follows:
(1) Generate group information for multi-
From the result of the above steps, several values of PCLC_mincodes that belong to different tables and different groups are the same; these codeword groups can therefore be merged into one merged group. The result is shown in Figure 2. There are four groups, and the items that differ within each group, such as base_addr and CL, must also be stored. Thus, only 4 items need to be stored, fewer than the 13 items in Figure 1. There are 17 + 16 + 11 + 7 + 8 = 59 items after step 1. After merging, there are only 23 groups, far fewer than the sum of the group counts of the individual tables. Because codeword lengths range from 1 to 16 bits, with the shortest codeword being 1 bit, we store (length-1), i.e. 0~15, to save memory space. After this shift, the smallest (length-1) in all groups is defined as MTM_CL-1 and stored in the group information memory. The difference between each larger (length-1) and (MTM_CL-1), defined as CL_diff, is stored in the table information memory. Memory space is saved further because the redundancy among the lengths in an MTM group is exploited. The table information and group information are shown in Figure 3.
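The length encoding described above can be illustrated with a small sketch. It assumes that MTM_CL-1 is taken per merged group; the function names `encode_group_lengths` and `decode_group_length` and the example lengths are hypothetical and not taken from the paper. Only the definitions of (length-1), MTM_CL-1 and CL_diff follow the text.

```python
def encode_group_lengths(codeword_lengths):
    """Split the codeword lengths of one merged group into a shared base
    (MTM_CL-1) and one difference per member (CL_diff).

    codeword_lengths: lengths in bits (1..16) of the groups merged together.
    Returns (mtm_cl_minus_1, cl_diffs).
    """
    if not codeword_lengths:
        raise ValueError("a merged group needs at least one codeword length")
    if any(not 1 <= n <= 16 for n in codeword_lengths):
        raise ValueError("codeword lengths must lie in 1..16")
    # Store (length-1) so that every value fits in the range 0..15.
    shifted = [n - 1 for n in codeword_lengths]
    # The smallest (length-1) is kept once, in the group information memory.
    mtm_cl_minus_1 = min(shifted)
    # Each member keeps only its difference, in the table information memory.
    cl_diffs = [s - mtm_cl_minus_1 for s in shifted]
    return mtm_cl_minus_1, cl_diffs


def decode_group_length(mtm_cl_minus_1, cl_diff):
    """Recover the original codeword length from the two stored fields."""
    return mtm_cl_minus_1 + cl_diff + 1


# Hypothetical merged group whose members have lengths 6, 8, 7 and 6 bits.
base, diffs = encode_group_lengths([6, 8, 7, 6])
```

The differences are typically much smaller than the lengths themselves, which is where the saving in the table information memory would come from.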
When the MTM algorithm is applied, the symbol addresses are given different offsets so that symbols can be accessed correctly in the memory. Take Coeff_Token as an example: because each of VLC0~3 has 62 symbols, the symbol address offset is 0 for VLC0 and 64, 128 and 192 for VLC1, VLC2 and VLC3, respectively.
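The offset scheme for the Coeff_Token example can be sketched as follows. The stride of 64 and the count of 62 symbols per table come from the text; the function name `symbol_address` and the idea that the symbol index is simply added to the offset are assumptions made for illustration.

```python
SYMBOLS_PER_TABLE = 62  # symbols in each of VLC0..VLC3 (from the text)
TABLE_STRIDE = 64       # offset step between consecutive tables (from the text)


def symbol_address(table_index, symbol_index):
    """Return the address of a symbol in the merged symbol memory.

    table_index: 0..3 for VLC0..VLC3.
    symbol_index: position of the symbol inside its own table (assumed layout).
    """
    if not 0 <= table_index <= 3:
        raise ValueError("table_index must be 0..3")
    if not 0 <= symbol_index < SYMBOLS_PER_TABLE:
        raise ValueError("symbol_index must be 0..61")
    return table_index * TABLE_STRIDE + symbol_index


offsets = [symbol_address(t, 0) for t in range(4)]
```

Because 64 is a power of two, the same address could be formed in hardware by concatenating a 2-bit table index with a 6-bit symbol index, at the cost of two unused entries per table.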
For simplicity, the figures above show only the Coeff_Token tables (VLC0~4). With the MTM algorithm, all tables can be further merged into one memory to achieve programmability. This design differs from the VLC decoders proposed in [3]-[6], which focus on the Coeff_Token tables to achieve memory-efficient decoding and use hardwired circuits to decode Level, Total_Zeros and Run_Before. This paper proposes a symbol memory allocation in which one memory is used and offers almost enough space for the symbols of TB14 and TB15 in MPEG-2 and of the Coeff_Token, Total_Zeros and Run_Before tables in H.264. The allocation is shown in Figure 4. For TB14 and TB15 in MPEG-2, the run and level pair can be aligned in one entry; for H.264, a mask extracts the symbols for two-symbol tables (Coeff_Token tables) and one-symbol tables (Total_Zeros and Run_Before tables). For H.264, there are 248 + 14 + 135 + 9 + 42 = 448 symbols, so a 512-entry memory would otherwise be needed, with 512 - 448 = 64 entries unused. With this allocation method, however, a 256-entry memory can be used instead of a 512-entry memory, which improves memory efficiency and also shortens the critical path of the hardware implementation.
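As a quick check of the symbol count above, the sketch below sums the five terms and rounds up to the next power-of-two memory size. The per-table labels are an assumption: only 62 x 4 = 248 for VLC0~3 is stated explicitly in the paper, and the remaining labels are a plausible reading of the order in which the tables are named. The helper `next_power_of_two` is hypothetical.

```python
# Symbol counts for the H.264 tables; labels other than the first are assumed.
H264_SYMBOL_COUNTS = {
    "Coeff_Token VLC0~3": 62 * 4,
    "Coeff_Token (nC=-1)": 14,
    "Total_Zeros": 135,
    "Total_Zeros (chroma 2x2)": 9,
    "Run_Before": 42,
}


def next_power_of_two(n):
    """Smallest power of two that is greater than or equal to n (n >= 1)."""
    size = 1
    while size < n:
        size *= 2
    return size


total_symbols = sum(H264_SYMBOL_COUNTS.values())
entries_needed = next_power_of_two(total_symbols)  # one symbol per entry
unused_entries = entries_needed - total_symbols
```

This covers only the one-symbol-per-entry case; how the proposed allocation fits the symbols into 256 entries depends on the layout in Figure 4 and is not modeled here.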
III. LOW-POWER SCHEME
After all tables are merged, the symbol memory is large, and so are its access time and power consumption. Memory access must therefore be reduced to lower power consumption. In [9], several LUTs implemented with ROM are separated according to the group probability distribution, and one cache implemented with register files is used to reduce memory access. However, the separated LUTs lead to unequal throughput: if the cache misses and LUT1 also misses, the decoder keeps searching until LUTk hits. Taking throughput into account, we adopt a two-level memory hierarchy. An individual register file is used as the cache for each corresponding table. The cache is looked up first, and only on a miss are the table memory, group memory and symbol memory accessed. Initially, the cache update scheme is simple: the codeword and the corresponding symbol are recorded in the cache until the cache is full, and the cache is not updated once all registers have been written. The format of one entry in the register file is shown in Figure 5. The field widths are m=3, n=6, x=4 and y=7 for the Coeff_Token (nC>=0) tables; m=0, n=3, x=4 and y=7 for the Coeff_Token (nC=-1) table; m=4, n=4, x=4 and y=4 for the Total_Zeros and Run_Before tables; and m=2, n=1, x=2 and y=2 for Total_Zeros (chroma 2x2).
Figure 5. Format of the register file
Figure 6 shows the simulation results for 300 frames, consisting of the first 100 frames of Akiyo, the first 100 frames of Foreman and the last 100 frames of Foreman, for three cases: QP = 20, QP = 24 and QP = 28.
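The initial update scheme (fill the cache until it is full, then stop updating) can be sketched as below. This is an illustration under stated assumptions, not the authors' hardware: the class name `FillOnceCache`, the callback `slow_lookup` (standing in for the table, group and symbol memory accesses) and the representation of codewords as bit strings are all hypothetical.

```python
class FillOnceCache:
    """Records (codeword -> symbol) pairs until full, then is never updated."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = {}   # codeword bit string -> decoded symbol
        self.hits = 0
        self.lookups = 0

    def decode(self, codeword, slow_lookup):
        """Return the symbol, using the cache first and memory on a miss."""
        self.lookups += 1
        if codeword in self.entries:
            self.hits += 1
            return self.entries[codeword]
        # Miss: fall back to the (hypothetical) memory-based decoding path.
        symbol = slow_lookup(codeword)
        # Record the pair only while free registers remain.
        if len(self.entries) < self.capacity:
            self.entries[codeword] = symbol
        return symbol

    def hit_rate(self):
        return self.hits / self.lookups if self.lookups else 0.0


# Toy table and codeword stream, invented for the example.
toy_table = {"1": "A", "01": "B", "001": "C"}
cache = FillOnceCache(capacity=2)
decoded = [cache.decode(cw, toy_table.__getitem__)
           for cw in ["1", "01", "001", "1", "001", "01"]]
```

In this toy run the third codeword never enters the cache because both registers are already written, which is the limitation that motivates a replacement policy.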
The x-axis is the frame number and the y-axis is the cache hit rate. The figure shows that the hit rate is lower at lower QP; that is, at lower QP the decoder accesses memory more frequently. The average hit rate is 54.8% for QP = 20, 64.2% for QP = 24 and 76.9% for QP = 28. Because the total number of codewords in VLC0~3 is 62 x 4 = 248, we choose a register file size of 64 entries to store the codeword length, codeword value, table index and symbols. TABLE I shows the size of the register file for all tables. To reduce the cache size, the Least Recently Used (LRU) replacement method is used. LRU replacement is implemented with a counter for each cache entry that represents how recently the entry was used. If the entry is used (cache hit), its counter is incremented; otherwise, the counter is decremented. If the cache is full and a miss occurs, the entry with the minimal count, i.e. the least recently used entry, is replaced by the new content. As a result of using LRU replacement, memory access, and hence power consumption, can be reduced greatly.
IV. ARCHITECTURE
As shown in Figure 7, the input shifter provides 2 bytes of bitstream to the VLC decoder input, and the symbol address is calculated. Finally, the symbol memory is accessed to decode the symbol.
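The counter-based replacement described in Section III can be sketched as follows. Several details are assumptions, since the paper does not specify them: on every lookup the hit entry is incremented and all other entries are decremented, counters are unbounded integers (hardware counters would presumably saturate), a new entry starts at 0, and ties are broken in favor of the oldest entry. The names `CounterCache` and `slow_lookup` are hypothetical.

```python
class CounterCache:
    """Cache whose victim is the entry with the smallest usage counter."""

    def __init__(self, capacity):
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self.capacity = capacity
        self.symbols = {}    # codeword bit string -> decoded symbol
        self.counters = {}   # codeword bit string -> usage counter

    def _update_counters(self, used):
        # Increment the entry that was used; decrement every other entry.
        for codeword in self.counters:
            self.counters[codeword] += 1 if codeword == used else -1

    def decode(self, codeword, slow_lookup):
        """Return (symbol, hit_flag)."""
        if codeword in self.symbols:
            self._update_counters(codeword)
            return self.symbols[codeword], True
        # Miss: decode through the (hypothetical) memory path.
        symbol = slow_lookup(codeword)
        self._update_counters(None)  # no entry was used on a miss
        if len(self.symbols) >= self.capacity:
            # Replace the entry whose count is minimal.
            victim = min(self.counters, key=self.counters.get)
            del self.symbols[victim]
            del self.counters[victim]
        self.symbols[codeword] = symbol
        self.counters[codeword] = 0
        return symbol, False


# Toy table and codeword stream, invented for the example.
toy_table = {"1": "A", "01": "B", "001": "C"}
lru_cache = CounterCache(capacity=2)
hit_flags = [lru_cache.decode(cw, toy_table.__getitem__)[1]
             for cw in ["1", "1", "1", "01", "001", "1"]]
```

In this toy stream the frequently used codeword survives the replacement while the codeword seen only once is evicted, which is the behavior the counter scheme is meant to provide.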
Taking efficiency into account, the 30 coding tables in CAVLC are separated into two categories. One memory is used for 16 of the coding tables and another for the other 14 coding tables to reduce unused locations.
The core of the VLC decoder was implemented in Verilog HDL, and the memory was generated with the Artisan TSMC 0.13-um memory compiler. The design was synthesized with the TSMC 0.13-um cell library. The gate count is 40067 (excluding the cache) at 100 MHz.
V. CONCLUSION
With the MTM algorithm, all symbols in the coding tables of CAVLC and MPEG-2 (TB14 and TB15) can be put into one 256-entry memory. With the two-level memory hierarchy, memory accesses are reduced by at least 50~60% and at most 80~95%. In addition, the VLC decoder is programmable, so it can be used in multimode video decoding applications.
