Abstract-This paper presents a low-power variable length decoder that exploits the statistics of successive codewords. The decoder employs small look-up tables working as fixed caches to reduce the number of activations of the variable length code detector, where considerable power is consumed. Power simulation results obtained with PowerMill show that energy consumption is reduced by 35% on average compared to the previous low-power scheme.
I. INTRODUCTION
VARIABLE length coding, which maps input source data onto codewords of variable length, is an efficient method to minimize average code length [1]. Compression is achieved by assigning short codewords to input symbols of high probability and long codewords to those of low probability. Variable length coding has been successfully used to relax the bit-rate and storage requirements of many multimedia compression systems such as MPEG and H.263. For example, a variable length code (VLC) is employed in MPEG-2 along with the discrete cosine transform (DCT), resulting in very good compression efficiency.
The most important objective in early research on variable length decoders (VLDs) was to achieve high throughput. There have been many studies addressing high-performance VLDs [2]-[6], which can be classified into two groups: tree-based and parallel decoding approaches. The tree-based approach decodes input symbols bit-serially and is adopted by a preliminary VLD [5]. Although some improvements make it possible to decode more than one bit per cycle [6], the approach is not suitable for high-performance applications such as MPEG-2 and HDTV, because processing at a high clock rate becomes inevitable. As opposed to the tree-based approach, the parallel decoding approach can decode one codeword per cycle regardless of its length. As an example, Lei and Sun proposed a VLD that consists of two major blocks, a VLC detector and a look-up table (LUT) [3], [4].
Since early studies focused only on high-throughput VLDs, low-power VLDs have not received much attention. This trend is rapidly changing as the target of multimedia systems moves toward portable applications. These systems demand low-power operation and thus require low-power functional units. Although the VLD proposed by Lei and Sun is good for achieving high throughput, it is not optimized for low-power applications. Therefore, there have been considerable efforts to reduce power consumption, which can be classified into two categories. The first is to reduce the power of LUTs, based on the fact reported in [4] that LUTs consume considerable power. A number of schemes such as prefix predecoding [7] and table partitioning [8] have been presented and have reduced the power of LUTs significantly. The second is to reduce the power of the VLC detector, for which several schemes such as VLC detector sizing [8] and barrel shifter optimization [9] have been proposed. All of these approaches assume that a codeword is independent of the others, and do not consider the relation among codewords.
In this paper, we propose a new low-power VLD that considers the characteristics of successive codewords. The organization of this paper is as follows. Observations on MPEG-2 source bit-streams and the parallel decoding architecture are presented in Section II. The proposed low-power VLD scheme is described in Section III. The implementation of the proposed low-power VLD is described in Section IV. Finally, conclusions are made in Section V.
II. OBSERVATIONS
Fig. 1(a) shows the parallel decoding VLD architecture, which is composed of a VLC detector and a LUT. The VLC detector is further decomposed into a barrel shifter and an accumulator. Since the longest codeword is assumed to be 16 bits long, the output size of the barrel shifter is set to 16 bits. To determine the shift amount of the barrel shifter, the accumulator adds the length of the previous codeword to indicate the location of the next codeword. If the shift amount exceeds 15 (the largest number with four binary digits), the two D flip-flops (FFs) update the input data under the control of the carry output of the accumulator, which is equivalent to a 16-bit shift. The LUT contains approximately 100 entries, each of which represents a decoded codeword and a code length. Each entry is arranged by treating a codeword as an address. For this reason, the LUT is usually implemented by using PLA instead of ROM to achieve good area efficiency. As a result, the parallel decoding VLD can decode one codeword in a cycle.
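The accumulator behavior of the parallel decoding VLD in Fig. 1(a) can be illustrated with a small behavioral sketch. This is only a software model under stated assumptions, not the hardware itself: the function name `align_codewords` and its return format are hypothetical, and the model assumes a 16-bit window with a 4-bit shift amount, where a carry out of the accumulator signals that the input flip-flops must be updated.

```python
def align_codewords(code_lengths, window_bits=16):
    """Behavioral sketch of the accumulator that drives the barrel shifter.

    code_lengths: lengths (in bits) of successive decoded codewords.
    Returns a list of (shift_amount, reload) pairs, one per codeword, where
    reload is True when the accumulated shift exceeds window_bits - 1 and the
    input registers have to be updated (equivalent to a window_bits shift).
    """
    offset = 0  # current shift amount, kept in log2(window_bits) bits
    steps = []
    for length in code_lengths:
        total = offset + length
        carry = total >= window_bits   # carry output of the accumulator
        offset = total % window_bits   # only the low-order bits are kept
        steps.append((offset, carry))
    return steps


if __name__ == "__main__":
    # Hypothetical code lengths, chosen only to trigger one reload.
    for shift, reload in align_codewords([2, 3, 12, 5]):
        print(shift, reload)
```

In this model the third codeword pushes the accumulated shift past 15, so the carry is raised and the shift amount wraps around.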
Fig. 1. (a) Parallel decoding VLD architecture [3] and (b) M-way partitioned LUT architecture [8].

In the parallel decoding VLD, the average energy consumed for decoding a codeword can be modeled by the following equation [8]: (1) where denotes the average of , is the probability that codeword occurs, is the energy required to decode codeword , is the total number of codewords in the LUT, and is the energy consumption of the VLC detector whose size is . As there is one large LUT that is switched every cycle, is little related to ( , then ). Although the large LUT is a key to achieving high throughput, it is not suitable for achieving low power. Due to this fact, the follow-up studies have focused on reducing LUT power consumption. For example, the latest low-power VLD scheme [8] uses M-way LUT partitioning to exploit the codeword occurrence probabilities, as shown in Fig. 1(b). A single LUT is partitioned into a number of nonuniformly sized LUTs by considering the energy consumption and the occurrence frequency. In this case, depends on the size of the table containing codeword . Low power is achieved by making with high small. The size of the VLC detector is changed from 16 to 8 bits, because the codewords in the most frequently accessed LUT are shorter than 8 bits. On the other hand, the throughput is lowered to 0.6 codewords per cycle. For this scheme, the average energy consumption per codeword is modeled as follows [8]: (2) where is the probability that LUT is hit, is the number of LUTs, is the energy consumption of LUT when there is a hit, is the energy required for a miss, is the energy consumed by the circuit additionally required for LUTs, is the sum of the occurrence probabilities of codewords whose lengths are greater than and less than or equal to , and is the smallest integer not less than the maximum code length divided by . In (2), is valid only if the VLC detector is independent of LUTs as in [10]. As the VLC detector is related to LUTs in [8], in (2) underestimates the energy consumption. Let us assume a codeword whose length is less than . If the codeword is not in LUT1, it takes two cycles to decode.
Although the accumulator does not have to operate during the first wait cycle, it consumes some amount of energy. The input change of the adder (a 3-bit adder in this case) leads to energy consumption. As the adder is as complex as LUT1, it consumes as much energy as LUT1. Note that the energy consumption during the wait cycle, is smaller than , because the accumulator output remains unchanged and there are no switching activities in the barrel shifter. Considering this claim, is modified as follows: (3) where . It is assumed that LUT1 and LUT2 have all codewords whose lengths are shorter than or equal to , i.e., , and LUT3 and LUT4 have the codewords whose lengths are longer than and shorter than or equal to , i.e., . For example, in the first term in (3) represents that it takes one (wait cycle) and one to decode a codeword in LUT2 and in the second term in (3) represents that it takes one (wait cycle) and two to decode a codeword in LUT3. As is the largest of all, the first term in (3) greatly influences the overall VLC detector energy consumption. To achieve low energy consumption, therefore, we have to make smaller and equal to 1, which becomes more effective if is larger than .

To compare with , a number of power simulations are performed for various configurations. As and represent energy consumption during one cycle, they are closely related to the average power consumption ( , ). Based on the M-way partitioned LUT architecture shown in Fig. 1(b), six different LUT1s whose input size ranges from 2 to 7 bits are combined with an 8-bit VLC detector, eight different LUT1s (2 to 10 bits) with a 12-bit VLC detector, and twelve different LUT1s (2 to 15 bits) with a 16-bit VLC detector. The VLC used in the experiment is that of the MPEG-2 DCT AC coefficients, which occupy more than 80% of the whole VLC bit-stream. The VLC tables are constructed from Table B-14 in the MPEG-2 standard [11] while omitting some codewords of fixed size such as the DC coefficient and common escape codes. Note that these codewords are not considered in the previous work [8] either. In this experiment, the codewords that can be decoded in LUT1 are considered. The VLD is described in Verilog, synthesized using Synopsys Design Compiler with a 0.35-μm standard cell library [12], converted into SPICE netlists, and then simulated using PowerMill at the typical operating condition (3.3 V, 25 °C). The power consumption results are collected using the hierarchical power simulation supported by PowerMill, as shown in Table I.
The results show that the average power consumption of LUT1, , increases as the input bit-width increases, while that of the VLC detector, , varies only a little (within 10% of the median) and is not monotonically proportional to the input size. For example, let us consider a VLD with a 16-bit VLC detector and a 15-bit LUT1. From the results in Table I, we expect that the power is reduced by approximately 60 μW/MHz if the M-way partitioned LUT scheme results in a 5-bit LUT1, and by approximately 120 μW/MHz if the reduced VLC detector scheme [8] results in an 8-bit VLC detector.
As the VLC detector consumes more power than LUT1, power optimization focused on the VLC detector achieves good results. For a VLD that has an 8-bit VLC detector, further power reduction can be achieved by optimizing the VLC detector power consumption, not the LUT1 power consumption. Since LUT1 optimized by the low-power schemes [7], [8] has an input bit-width of 4 or 5 bits, it consumes approximately one fifth of the VLC detector power. Consequently, a new low-power scheme must be developed to activate the VLC detector as rarely as possible. The ratio of to is plotted in Fig. 2 for varying LUT1 input bit-widths and VLC detector sizes, in which is always smaller than , . For a low-power VLD that has an 8-bit VLC detector and a 4-bit wide LUT1, the power consumption ratio is about five, which indicates that further optimization of the VLC detector can lead to considerable power reduction.
In order to develop a new low-power VLD scheme, we first examined the average occurrence probability of each codeword contained in [11, Table B-14]. The reference implementation of the MPEG-2 standard [11], called MPEG-2 TM 5, was modified to store all the codewords encountered while running the program into a file. Using the MPEG-2 video conformance bit-streams, we generated 20 files, each of which was approximately 100 kB. Based on these files, the average occurrence probability was investigated, as shown in Fig. 3. As short codewords are closely related to the efficiency of low-power schemes, their probabilities have to be determined precisely. The average occurrence probabilities of short codewords such as "10" and "11s" are as high as 0.15, which is similar to the previous results [8], [10].
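The occurrence-probability measurement described above amounts to counting codewords in a logged stream. A minimal sketch follows, assuming the logged codewords are available as a list of strings; the function name `codeword_probabilities` and the sample stream are hypothetical and are not taken from the modified reference software.

```python
from collections import Counter


def codeword_probabilities(codewords):
    """Estimate the occurrence probability of each codeword in a logged stream.

    codewords: iterable of codeword labels (e.g. "10", "11s") in stream order.
    Returns a dict mapping each codeword to its relative frequency.
    """
    counts = Counter(codewords)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {cw: n / total for cw, n in counts.items()}


if __name__ == "__main__":
    # Hypothetical log; real inputs would be the files of logged codewords.
    stream = ["10", "11s", "10", "011s"]
    print(codeword_probabilities(stream))
```

Averaging the per-file results over all logged files would give figures comparable to an average occurrence probability.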
The occurrence probabilities of the short codewords are listed in Table II. Approximately 70% of the codewords are shorter than 7 bits, and the most frequently occurring codewords are "10," "11s," and "011s." Regardless of the size of a VLC detector, these codewords activate the VLC detector and LUT1 as other codewords do. As the power dissipation of the VLC detector is larger than that of LUT1, considerable power saving can be achieved if two short codewords are decoded by activating the VLC detector only once. It is equivalent to lowering in the numerator of the first term in (3). To validate this idea, the statistics on two successive short codewords were examined using the same files used in the previous experiment. Table III shows the result. The short codeword (index) indicates a preceding codeword, and the occurrence probabilities of successive short codewords are listed according to the codeword index. The result indicates that 75% of short codewords are followed by another short codeword whose length is shorter than 7 bits. In addition, 90% of successive codewords are within 8 bits. The probability that two short codewords are located in one VLC detector window is high enough to validate the idea.
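The successive-codeword statistics can be gathered with a single pass over the code lengths. The sketch below is one possible way to do this, not the procedure used for Table III: the name `successive_short_stats` is hypothetical, and it assumes that both fractions are taken relative to the number of short codewords that have a successor, which the text does not state explicitly.

```python
def successive_short_stats(lengths, short_limit=7, window_bits=8):
    """Statistics on pairs of successive codewords.

    lengths: code lengths (in bits) in stream order.
    A codeword is treated as short if its length is below short_limit.
    Returns (p_short_follows, p_pair_fits), both relative to the number of
    short codewords that have a successor (an assumption of this sketch):
      p_short_follows: fraction followed by another short codeword
      p_pair_fits:     fraction whose pair fits within window_bits bits
    """
    short_first = 0
    short_then_short = 0
    pair_fits = 0
    for first, second in zip(lengths, lengths[1:]):
        if first < short_limit:
            short_first += 1
            if second < short_limit:
                short_then_short += 1
            if first + second <= window_bits:
                pair_fits += 1
    if short_first == 0:
        return 0.0, 0.0
    return short_then_short / short_first, pair_fits / short_first


if __name__ == "__main__":
    # Hypothetical code lengths, for illustration only.
    print(successive_short_stats([2, 3, 9, 2, 4, 6]))
```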
Since the basic principle of variable length coding is to assign short codewords to frequent input symbols, the observation result seems to be straightforward. The output of a VLC detector is wide enough to hold two short codewords, even if an 8-bit VLC detector is considered. Therefore, it can be used to reduce the number of VLC detector activations by decoding two successive short codewords per VLC detector activation. This is the major difference from the previous low-power schemes, in which the VLC detector is always activated for each codeword.
III. PROPOSED SCHEME
As described in Section II, a low-power VLD can be achieved by reducing the number of VLC detector activations based on the successive short codeword statistics. For this purpose, an additional LUT is introduced. The LUT, called Cache1, is located between LUT1 and LUT2 as shown in Fig. 4, and is accessed after LUT1. During the next cycle, the output of the VLC detector is shifted by the length of the short codeword and then supplied to Cache1. This shift can be implemented by using a latch that points to a different part of the VLC detector output. If a codeword is hit in Cache1, the power that would be consumed in the VLC detector is saved, because the VLC detector is not activated. Additional caches may be employed for the purpose of further power optimization.
The proposed scheme works as follows. The codeword aligned in the VLC detector is decoded in LUT1. The most frequent codewords such as "10," "11s," and "011s" are located in LUT1, where s is the sign bit (0 or 1). Once a target symbol is found in LUT1, a new codeword is then searched in Cache1 without invoking the VLC detector to align the VLC stream. In the next cycle, the input latch of Cache1 is clocked to latch the output of the VLC detector. If a codeword is hit in Cache1, the energy required to activate the VLC detector is saved. Otherwise, the energy and the cycle time needed to access Cache1 are wasted. However, the power consumption of Cache1 is much less than that of the VLC detector, and the probability that a codeword is hit in Cache1 is as high as 0.8 even for a small cache containing only 8 short codewords. Therefore, we can reduce the average power of the VLC detector. In addition, letting the next codeword go directly to LUT2, instead of LUT1, can compensate for the cycle penalty caused by a cache miss. This can be achieved by making Cache1 contain all the short codewords of LUT1. If a codeword is not in a Cache1 that satisfies this property, it is guaranteed that the codeword is not in LUT1. As we can skip the access to LUT1, it is possible to save power without sacrificing performance. The proposed VLD architecture is briefly presented in Fig. 5 alongside the conventional one [8].
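The decoding flow above can be mimicked in software to see how Cache1 hits reduce VLC detector activations. The following is a simplified behavioral sketch under several assumptions that go beyond the text: the name `count_detector_activations` is hypothetical, codewords are represented by labels rather than bits, a Cache1 miss is assumed to cost one detector activation before the codeword is resolved in LUT2, and only one cache is modeled.

```python
def count_detector_activations(codewords, lut1, cache1):
    """Count VLC detector activations for a cache-equipped decoding flow.

    codewords: codeword labels in stream order.
    lut1:      set of codewords held in LUT1.
    cache1:    set of codewords held in Cache1 (assumed to include all of
               the short codewords of LUT1).
    Without a cache, every codeword would cost one activation.
    """
    activations = 0
    try_cache = False  # True right after a LUT1 hit
    for cw in codewords:
        if try_cache:
            try_cache = False
            if cw in cache1:
                # Cache1 hit: decoded from the latched detector output,
                # so the detector is not activated for this codeword.
                continue
            # Cache1 miss: the detector aligns the stream and the codeword
            # goes directly to LUT2, skipping LUT1.
            activations += 1
        else:
            # Normal path: the detector aligns the stream, LUT1 is searched.
            activations += 1
            try_cache = cw in lut1
    return activations


if __name__ == "__main__":
    short = {"10", "11s", "011s"}
    # "LONG" stands for any codeword outside LUT1 and Cache1.
    stream = ["10", "11s", "10", "LONG", "011s", "10"]
    print(count_detector_activations(stream, short, short), "of", len(stream))
```

For the sample stream, two of the six codewords are resolved in Cache1, so the detector is activated four times instead of six.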
In the proposed scheme, LUT1 and Cache1 are accessed sequentially. However, we can imagine other configurations, because the output of the VLC detector is wide enough to cover two short codewords, as mentioned before. One possible configuration is to enlarge LUT1 to generate two short codewords, and another is to access LUT1 and Cache1 in parallel. Although these configurations are more useful for increasing throughput, the former results in a larger LUT, and the latter activates Cache1 every cycle, resulting in more energy consumption than the proposed scheme in our power simulation. Another disadvantage of these configurations is that two symbols are decoded at a time. This makes the following hardware stages complex, as they have to process two symbols concurrently. The average energy consumption per codeword in a cache-equipped VLD is modeled by the following equation: (4) where indicates that caches are employed. We assume that one cache (Cache1) is employed and for the sake of simplicity. From this, three cases are considered separately.
Case 1) Cache1 ⊂ LUT1. This is the case that all the entries in Cache1 are included in LUT1, but the two are not identical, as illustrated in Fig. 6(a). The average energy consumption to decode a codeword in LUT1 becomes , where denotes the probability that a short codeword enables the cache in the next cycle, is the probability that Cache1 is hit, and is the energy consumption of Cache1 for a miss. Since Cache1 is smaller than LUT1, the term, , is introduced to account for the case that some codewords in LUT1 lead to a miss in Cache1. Then, the energy consumption of the accompanying VLC detector is . In a similar fashion, the average energy consumptions in Cache1 and LUT2 are represented as and , and the corresponding VLC detector energy consumptions are and , respectively, where is the energy consumption of Cache1 for a hit. Then, and are formulated as in (5) and (6), shown at the bottom of the page, where we assume . Let be , , and we assume , and . Then, we have , which indicates that the average energy consumption of the LUTs is reduced in the cache-equipped VLD. Applying the same assumption to the numerator term in (6), we have a simplified numerator, . If , the energy is reduced for the VLC detector as well.
Case 2) Cache1 = LUT1. This is the case that Cache1 and LUT1 are the same. As shown in Fig. 6(b), the three LUTs can be examined separately. Then, and are formulated as in (7) and (8), shown at the bottom of the next page. The average energy consumption of the LUTs remains unchanged after Cache1 is incorporated into the VLD, and can be reduced by .
Case 3) Cache1 ⊃ LUT1. This is the case that all the entries in LUT1 are included in Cache1, and some entries in LUT2 are also included in Cache1, as shown in Fig. 6(c). The average energy consumptions to decode a codeword in LUT1 and Cache1 become and , and the corresponding VLC detector energy consumptions are and , respectively. The average energy consumption in LUT2 is represented as , where the first term is for the case that LUT2 is hit after LUT1 is missed, and the second term is for the case that a codeword not in Cache1 is hit in LUT2. The corresponding VLC detector energy consumption is . Then we have a simplified form, . Finally, and are formulated as in (9) and (10), shown at the bottom of the page. Based on assumptions similar to those of Case 1, i.e., , , for (9), from which we can conclude that is close to . Applying the same assumption, the numerator in (10) can be rewritten as . Since and , is reduced by incorporating Cache1.

In the same manner, the throughput of a cache-equipped VLD, , is modeled in (11), shown at the bottom of the page, where is the throughput of the VLD in which no caches are adopted, and is the operating frequency of the VLC detector. Observe that for , for , and for .
As mentioned in Section II, another low-power approach is to make equal to 1, which means the adder in the accumulator does not consume any energy for the wait cycle. As the adder and the VLC detector do not have to be activated to access Cache1 and LUT2, we can modify the accumulator as shown in Fig. 7 to prevent unintended adder activations. Note that the accumulator in Fig. 7(a) has a D FF at the output side of the adder, and thus the adder is activated whenever the input changes. On the other hand, the modified accumulator in Fig. 7(b) employs two D FFs at the input side of the adder. Due to this, unintended adder activations are prevented. The modified accumulator leads to some changes in the remaining part of the VLD. As the carry of the adder is generated after the length and the shift are latched, it is no longer possible to update the D FFs in the accumulator and the two D FFs in front of the barrel shifter at the same clock edge. For this reason, the two length values are updated at different edges as shown in Fig. 7. If a new codeword shift is required, the enable signal is asserted using the current status of the controller and the table look-up results. Consequently, the average VLC detector energy consumption per codeword can be calculated by the following equation: (12) where implies and
IV. EXPERIMENTAL RESULTS
To validate the proposed low-power scheme in practical designs, 24 VLDs were implemented for various configurations. The target VLC table, MPEG-2 DCT AC coefficient Table B-14, was partitioned into a number of LUTs and BLKs. The partitioning scheme employed in the implementation is different from the fine-grain partitioning scheme [8]. Since a VLC detector consumes much more power than a small LUT, the maximal number of codewords permitted by a given input bit-width is used to construct each LUT. Each VLD, described in structure-level Verilog, was synthesized with a 0.35-μm cell library [12] using Synopsys Design Compiler. The LUTs, Caches, and BLKs were implemented using random logic rather than PLA. After the synthesis, the gate-level netlist was converted into a transistor-level SPICE file. Finally, power simulation was conducted using Epic PowerMill at the typical operating condition. The VLC streams obtained in Section II were also used as input stimuli for the power simulation.
The output size of the VLC detector greatly influences the overall throughput and power consumption of the VLD. A small VLC detector is desirable for low power, and a large one for high performance. Therefore, we implemented 8-, 12-, and 16-bit VLC detectors. For each VLC detector, the LUTs were optimized in terms of energy by repeating a number of power simulations.
The VLD architecture based on the 8-bit VLC detector is shown in Fig. 8, which has two separate caches (Cache1 and Cache2) to further reduce power consumption. If the size of a short codeword found in LUT1 is 2 bits, Cache1 is accessed in the next cycle. Cache2 is accessed for a 3-bit codeword. The other blocks such as LUT2, BLK1, and BLK2 have the same function as those of the VLD structure presented in [8]. Fig. 9 shows the power consumption of the 8-bit VLD employing the modified table partitioning algorithm and caches. Compared to the VLD without caches, power is reduced by up to 27.4% at the cost of negligible area overhead. The power dissipation of the VLC detector is also plotted in Fig. 9(b). The caches and the modified accumulator configuration are effective in reducing overall power. From the energy consumption result plotted in Fig. 9(d), LUT1 is determined to have eight entries.

Fig. 10 shows the low-power VLD architecture for the 12-bit VLC detector. As in the case of the 8-bit VLC detector, table partitioning and cache insertion were applied together. As the codeword window is enlarged, the chance to insert caches is also increased. Therefore, four caches are employed in this architecture. This VLD operates in almost the same way as the one with the 8-bit VLC detector. The simulation results are presented in Fig. 11. The power reduction obtained by the caches is up to 18.9%. The VLD with the 12-bit VLC detector consumes at least twice as much power as the one with the 8-bit VLC detector.
The low-power VLD architecture based on the 16-bit VLC detector is shown in Fig. 12, where five caches are used. In this case, we can apply the proposed cache scheme to all codewords. Provided that long codewords are decomposed into prefixes and remaining codewords, the latter are also as short as those found in LUT1. The simulation result is presented in Fig. 13. Power reduction is plotted in Fig. 13(a) and (b). A power saving of 9.3% is obtained from the proposed cache insertion method. The power reduction in the VLC detector is as high as 22.8%. The power is reduced mainly because the 16-bit VLC detector is activated fewer times. The reduced number of VLC detector activations greatly affects the overall power dissipation.
The power consumption of the proposed VLD is compared to that of the previous VLD proposed in [8], which has been known as the best low-power architecture. To make the comparison fair, we redesigned the previous one with the same standard cell library [12] that was used for the design of the proposed VLD. The simulation results are compared in Table IV. Besides the higher throughput, approximately 30% power reduction is achieved by employing the proposed cache scheme. Due to the two caches added, 5% area overhead is observed.
V. CONCLUSION
In this paper, we have described a new low-power VLD scheme to reduce the power dissipation of a VLC detector, where a lot of power is consumed. By exploiting the relation between two successive codewords, the number of VLC detector activations is reduced. This idea was implemented by employing small LUTs working as fixed caches. The proposed scheme was applied to three differently sized VLC detectors (8, 12, and 16 bits). For each VLC detector, the overall power consumption was significantly reduced at the expense of a little circuit overhead. Intensive simulation results show that the proposed cache-equipped VLD consumes 35% less energy on average than the state-of-the-art low-power VLD [8] without sacrificing throughput.
