This paper proposes a new architecture for efficient variable-length decoding (VLD) of entropy-coded data for multimedia applications on general-purpose processors. It improves on earlier proposals for low-complexity performance-enhancing hardware structures that exploit prefix/suffix properties of variable-length codes for common multimedia formats [1] . The enhanced architecture is compared to the previous architectures in terms of complexity and operating speed for FPGA implementation, and also in terms of area requirements, power consumption, and operating speed for a 0.18-µm ASIC fabrication process. Simulation results are reported for a pipelined processor with caches executing MPEG-4 software where VLD performance is doubled by incorporating the proposed architecture.
INTRODUCTION
The ubiquity of multimedia data is leading to the inclusion of performance-enhancing hardware support for encoding and decoding such data on general-purpose and embedded processors. Although instructions for bit-level parallelism can improve performance for many aspects of multimedia decoding, the variable-length decoding (VLD) portion has inherent serial characteristics. VLD on specialized hardware that is specific to a particular multimedia format has seen significant improvements since early work in this area [2] . This specialized approach cannot easily be extended to general-purpose processors that need the capability to decode multiple different formats. VLD on general-purpose processors has only seen modest gains through certain architectural improvements, even though it may account for up to 30% of the decoding time in a given application [3] . Recent work has augmented an existing media processor with programmable logic for custom VLD acceleration [4] , but there are inherent chip-area penalties with this approach. Instead, we have previously proposed the incorporation of flexible instruction extensions for VLD acceleration with modest implementation complexity that are applicable to general-purpose processors [1] .
Multimedia data is typically compressed using lossy transform coding followed by lossless entropy coding [5] . The latter commonly uses modified Huffman coding with fixed or dynamic codeword tables. The first column of Table 1 gives sample codewords for chroma block patterns in an MPEG-4 video [5] . The codewords in such a table can be grouped by their prefix of the leading number of zeros (LNZ) to enable efficient variable-length decoding. Once a codeword is classified into its LNZ group, the remaining codeword suffix can be an index into a table that contains the decoded information [1, 3, 6] . Software-based MPEG decoders have used this property for symbol-by-symbol decoding, rather than slower bit-by-bit decoding [3, 6] . Although this property has been used in hardware accelerators for a specific multimedia codec [2] , flexible and efficient mechanisms using the same property to support different multimedia codecs in general-purpose processors are lacking.
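The prefix/suffix split described above can be illustrated with a minimal sketch. The function name `split_codeword` and the sample codewords are hypothetical and are not taken from Table 1; a codeword is represented as a bit string purely for readability.

```python
def split_codeword(bits: str) -> tuple[int, str]:
    """Return (LNZ count, suffix) for a codeword given as a bit string.

    The prefix is the run of leading zeros plus the terminating '1';
    the suffix is whatever follows that '1'.
    """
    first_one = bits.find("1")
    if first_one < 0:
        raise ValueError("codeword contains no '1' bit")
    return first_one, bits[first_one + 1:]


# Hypothetical codewords used only for illustration.
for cw in ("1", "011", "0010", "00011"):
    lnz, suffix = split_codeword(cw)
    print(f"{cw}: LNZ = {lnz}, suffix = '{suffix}'")
```

Codewords that share an LNZ count fall into the same group, and the suffix bits then select an entry within that group's table.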
We have recently proposed two low-complexity hardware architectures that exploit codeword prefix/suffix characteristics in order to enhance VLD performance in general-purpose processors [1] . Our memory-based (MB) decoding architecture uses a small intermediate memory in the processor with a number of entries equal to the number of distinct groups in a codeword table having the same LNZ count. Each entry provides the suffix length for the group and a base address in the main memory where the group-related codeword table is located. Our Single Fixed-Length Suffix (SFLS) architecture, on the other hand, does not use a small memory in the processor and hence has reduced hardware complexity. Instead, it uses the maximum suffix length across an entire codeword table for indexing all group tables in memory. As a consequence, it requires larger group tables than the memory-based architecture.
This paper proposes an enhanced Multiple Fixed-Length Suffix (MFLS) architecture for incorporation into general-purpose processors. It exploits prefix/suffix properties of variable-length entropy-coded data, similar to our previous architectures [1], but seeks to balance hardware complexity and memory requirements. The remainder of this paper describes the proposed architecture, presents FPGA and ASIC synthesis results, and summarizes performance results from instruction-level simulation.
THE MULTIPLE FIXED-LENGTH SUFFIX ARCHITECTURE
The Multiple Fixed-Length Suffix (MFLS) architecture consists of combinational logic and an associated group control register. It would be implemented in a general-purpose processor and used by a special instruction to support variable-length decoding. The combinational logic performs the LNZ count, group selection, bit-shift, and arithmetic operations that generate the table index for variable-length decoding. The latter three operations, in particular, depend on the contents of the group control register that is configured by standard control register access instructions prior to executing VLD code. The maximum number of groups supported by a hardware implementation of the MFLS architecture is fixed. Up to that maximum number, the actual grouping of codewords into subsets with sequential LNZ counts, as explained below, is at the discretion of the multimedia programmer.
Definition of Groups
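Let L − 1 represent the maximum LNZ count for a collection of variable-length codewords. In our previous proposals [1], the memory-based (MB) architecture would require L codeword tables, one per LNZ count, whereas the SFLS architecture indexes every table with the single maximum suffix length of the collection. The MFLS architecture instead partitions the codewords into M groups of sequential LNZ counts. Group m, for m = 0, ..., M − 1, has a minimum LNZ value $\mathit{min}_m$ and a maximum suffix length $S_m$, and it covers the LNZ values from $\mathit{min}_m$ to $\mathit{min}_{m+1} - 1$, with $\mathit{min}_M = L$. From these definitions, the address offset in a group table for a codeword with LNZ value of $\ell$ and suffix value $s$ (taken as an $S_m$-bit field) that is contained in group m is given by

$$(\ell - \mathit{min}_m) \cdot 2^{S_m} + s.$$

The total number of memory entries to store all of the codewords is

$$\sum_{m=0}^{M-1} (\mathit{min}_{m+1} - \mathit{min}_m) \cdot 2^{S_m}.$$

The application of the above definitions is illustrated for the variable-length codewords given in Table 1. Inspection of the codewords suggests that three groups are appropriate. Codewords with an LNZ count ranging from 0 to 2 are assigned to group 0, whose minimum LNZ value is $\mathit{min}_0 = 0$. Codewords with an LNZ count ranging from 3 to 6 are assigned to group 1, whose minimum LNZ value is $\mathit{min}_1 = 3$. Finally, codewords with an LNZ count of 7 or more are assigned to group 2, whose minimum LNZ value is $\mathit{min}_2 = 7$. The maximum suffix lengths for the three groups are $S_0 = 1$, $S_1 = 2$, and $S_2 = 1$. The group offset values are provided in Table 1.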
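As a rough illustration of the grouping, the sketch below computes group sizes, group offsets, and a table index for the three-group example. It assumes that each LNZ value in group m occupies 2^S_m consecutive entries and that the group tables are packed back to back. The value `MAX_LNZ = 8` is a hypothetical choice, since the text only states that group 2 holds LNZ counts of 7 or more, so the printed offsets are illustrative and are not claimed to reproduce Table 1.

```python
GROUP_MIN = [0, 3, 7]   # minimum LNZ value of each group (from the text)
GROUP_S = [1, 2, 1]     # maximum suffix length of each group (from the text)
MAX_LNZ = 8             # assumption: largest LNZ count in the table


def group_sizes(mins, suffix_lens, max_lnz):
    """Entries per group: (number of LNZ values in the group) * 2**S_m."""
    bounds = mins[1:] + [max_lnz + 1]
    return [(hi - lo) << s for lo, hi, s in zip(mins, bounds, suffix_lens)]


def group_offsets(sizes):
    """Starting index of each group when tables are packed contiguously."""
    offsets, total = [], 0
    for n in sizes:
        offsets.append(total)
        total += n
    return offsets, total


def table_index(lnz, suffix, mins, suffix_lens, offsets):
    """Index for a codeword with the given LNZ count and S_m-bit suffix."""
    m = max(i for i, lo in enumerate(mins) if lo <= lnz)  # group selection
    return offsets[m] + ((lnz - mins[m]) << suffix_lens[m]) + suffix


sizes = group_sizes(GROUP_MIN, GROUP_S, MAX_LNZ)
offsets, total = group_offsets(sizes)
print(sizes, offsets, total)
print(table_index(4, 0b10, GROUP_MIN, GROUP_S, offsets))
```

A codeword whose suffix is shorter than S_m would occupy every entry that shares its leading suffix bits, which is why a larger S_m inflates the group table.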
VLD Instruction Format
To utilize the MFLS architecture in a general-purpose processor implementation, a vldecode instruction is proposed for a typical three-operand instruction set architecture, with a destination register rdest, a source register rsrc1, and an immediate field imm.
The instruction uses the left-aligned codeword in register rsrc1 and returns the index in the VLC table identified by the imm field in rdest. The internal group control register that is used by this new instruction consists of group minimum LNZ values $\mathit{min}_m$, right-shift amounts $W - S_m$, where W is the maximum suffix length supported, and offset values corresponding to each group m. The right-shift amount $W - S_m$ aligns the suffix bits to form the index for the codeword table entry. Precomputed offset values must be used in order to reduce storage requirements for a VLC table. Multiple group control registers could support multiple VLC tables, and the imm field in the instruction would select the intended one. Figure 1 provides the details of an MFLS architecture with M = 3. For commonly-used MPEG-4 look-up tables, Section 5 will show how the choice of M = 3 results in total MFLS table storage requirements that are significantly less than the requirements for the SFLS architecture and moderately more than the MB architecture. A larger value of M would further reduce the total MFLS storage requirements, but it would also increase the hardware complexity and the size of the group control register.
3FLS Architecture
The proposed instruction format and the group control register contents are depicted at the top of Figure 1. For each group m as defined by the programmer, the group control register contains the offset, the right-shift amount $W - S_m$, and the minimum LNZ count $\mathit{min}_m$. The exception is group 0, where the offset value and the minimum LNZ count are often zero; these fields are therefore omitted from the register (the right-shift amount is still required, however). Standard control register access instructions can be used to set or obtain the contents of the group control register. The combinational logic in the remainder of Figure 1 includes a block to determine the LNZ count for the input codeword in register rsrc1, and then various blocks that use the LNZ count to select the group, the offset, and the shift amount. These selection blocks use multiplexers with inputs from the group control register.
The logic in Figure 1 uses the bit field from processor register rsrc1 as input to the LNZ count block and the left shifter. The most significant bit from the output of the left shifter is the '1' bit that follows the leading zeros, hence it is ignored. The next 12 most-significant bits from the left shifter are concatenated with the output of the 4-bit subtract unit that computes the LNZ count minus $\mathit{min}_m$. The combined 16 bits are used as the input to the 16-bit right shifter that produces the codeword offset within the group. This codeword offset is then added to the group offset in order to obtain the required index in the table. The architecture in Figure 1 assumes a maximum codeword length of 20 bits. Because the actual maximum codeword lengths are 17 bits for MPEG-2 and 13 bits for MPEG-4, there is additional capacity for any new codes in the future. Furthermore, the maximum suffix length, W, for MPEG-2 and MPEG-4 is 12 bits, and this property is also exploited in the architecture.
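The datapath just described can be approximated by a small bit-level software model. This is only a sketch under stated assumptions: the function name `vldecode_model` is hypothetical, the register rsrc1 is assumed to be 32 bits wide, the 4-bit difference is assumed to occupy the upper four bits of the 16-bit shifter input, and group selection is assumed to pick the last group whose minimum LNZ value does not exceed the LNZ count.

```python
W = 12          # maximum suffix length supported (from the text)
REG_BITS = 32   # assumption: width of processor register rsrc1


def vldecode_model(rsrc1, mins, shifts, offsets):
    """Model of the table-index computation for one left-aligned codeword.

    mins[m]    = minimum LNZ value of group m
    shifts[m]  = right-shift amount W - S_m of group m
    offsets[m] = precomputed offset of group m in the VLC table
    """
    mask = (1 << REG_BITS) - 1
    rsrc1 &= mask
    if rsrc1 == 0:
        raise ValueError("input contains no '1' bit")
    lnz = REG_BITS - rsrc1.bit_length()        # LNZ count block
    if lnz > 15:
        raise ValueError("LNZ count exceeds the 4-bit range")
    m = max(i for i, lo in enumerate(mins) if lo <= lnz)  # group selection
    shifted = (rsrc1 << lnz) & mask            # left shifter; MSB is now '1'
    # Skip the leading '1' and keep the next W bits.
    suffix_bits = (shifted >> (REG_BITS - 1 - W)) & ((1 << W) - 1)
    # Assumption: the 4-bit difference forms the upper bits of the 16-bit value.
    combined = (((lnz - mins[m]) & 0xF) << W) | suffix_bits
    return offsets[m] + (combined >> shifts[m])  # right shifter, then adder


# Group parameters of the three-group example; the offsets are illustrative.
mins, shifts, offsets = [0, 3, 7], [W - 1, W - 2, W - 1], [0, 6, 22]
# Bit pattern 0000111 left-aligned: LNZ = 4, suffix bits '11'.
print(vldecode_model(0b0000111 << 25, mins, shifts, offsets))
```

Bits of the following codeword that trail the current one in rsrc1 are discarded by the right shift, as long as the suffix of the current codeword is no longer than S_m.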
The size of the group control register in an MFLS implementation depends on the number of bits needed for the grouping information. A possible allocation is 12 offset bits, 4 right-shift bits, and 4 bits for the minimum LNZ value for each group. For generality, the proposed vldecode instruction generates only the table index in the destination register rdest. This index value must then be added to the base address of the appropriate table in memory in order to retrieve the decoded information. Automatically performing the final calculation using a base address register could affect the cycle time.
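To make the register sizing concrete, the sketch below packs the grouping information with the allocation mentioned above (12 offset bits, 4 right-shift bits, 4 minimum-LNZ bits per group) and stores only the right-shift field for group 0. The field ordering and the names `pack_gcr` and `unpack_gcr` are assumptions made for illustration; the text does not specify a bit layout.

```python
OFFSET_BITS, SHIFT_BITS, MIN_BITS = 12, 4, 4
FIELD_BITS = OFFSET_BITS + SHIFT_BITS + MIN_BITS  # 20 bits per group


def pack_gcr(shift0, groups):
    """Pack group 0's shift plus (offset, shift, min_lnz) for groups 1..M-1.

    Returns (register value, register width in bits).
    """
    if not 0 <= shift0 < (1 << SHIFT_BITS):
        raise ValueError("shift0 out of range")
    value, width = shift0, SHIFT_BITS
    for offset, shift, min_lnz in groups:
        if not (0 <= offset < (1 << OFFSET_BITS)
                and 0 <= shift < (1 << SHIFT_BITS)
                and 0 <= min_lnz < (1 << MIN_BITS)):
            raise ValueError("field out of range")
        field = (offset << (SHIFT_BITS + MIN_BITS)) | (shift << MIN_BITS) | min_lnz
        value |= field << width
        width += FIELD_BITS
    return value, width


def unpack_gcr(value, num_groups):
    """Inverse of pack_gcr for a register describing num_groups groups."""
    shift0 = value & ((1 << SHIFT_BITS) - 1)
    groups = []
    for g in range(num_groups - 1):
        field = (value >> (SHIFT_BITS + FIELD_BITS * g)) & ((1 << FIELD_BITS) - 1)
        groups.append((field >> (SHIFT_BITS + MIN_BITS),
                       (field >> MIN_BITS) & ((1 << SHIFT_BITS) - 1),
                       field & ((1 << MIN_BITS) - 1)))
    return shift0, groups


value, width = pack_gcr(11, [(6, 10, 3), (22, 11, 7)])
print(width, unpack_gcr(value, 3))
```

With M = 3 this layout needs 4 + 2 × 20 = 44 bits.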
The method provided in Figure 1 can be extended readily to M = 4 or higher. For most cases, the VLC table size is a nonincreasing function of M. Increasing M will, however, add to the hardware complexity of the selection blocks and increase the size of the group control register. The memory-based architecture introduced in our previous work [1] can be thought of as an MFLS implementation with one group for every possible LNZ count, i.e., M = 16. Similarly, the single fixed-length suffix (SFLS) architecture can be thought of as an MFLS implementation with M = 1 and without any group control register.
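The trade-off between M and table size can be explored with a short calculation. The sketch assumes that each group is indexed with the longest suffix found among its LNZ values and that every LNZ value in a group occupies 2^S entries. The per-LNZ suffix lengths in `suffix_len` are hypothetical and do not come from any MPEG table.

```python
def total_entries(suffix_len_by_lnz, group_starts):
    """Total table entries when the LNZ values are split at group_starts."""
    bounds = list(group_starts) + [len(suffix_len_by_lnz)]
    total = 0
    for lo, hi in zip(bounds, bounds[1:]):
        s = max(suffix_len_by_lnz[lo:hi])   # group-wide suffix length
        total += (hi - lo) << s             # (LNZ values in group) * 2**s
    return total


# Hypothetical longest suffix length for each LNZ count 0..8.
suffix_len = [1, 1, 0, 2, 2, 1, 2, 1, 1]

sfls_entries = total_entries(suffix_len, [0])                      # M = 1
mfls3_entries = total_entries(suffix_len, [0, 3, 7])               # M = 3
mb_entries = total_entries(suffix_len, range(len(suffix_len)))     # one group per LNZ
print(sfls_entries, mfls3_entries, mb_entries)
```

For this hypothetical input, the three-group split falls between the single-group and per-LNZ extremes, which mirrors the qualitative relationship among SFLS, MFLS, and MB described above.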
FPGA AND ASIC SYNTHESIS RESULTS
The MFLS architecture with M = 3 was synthesized for FPGA and ASIC implementation, along with the MB and SFLS architectures from our earlier work [1] . Section 2.1 explained the differences between the three approaches, specifically how the MFLS architecture is a compromise between the previous MB and SFLS architectures. The designs were implemented in VHDL and used vendor-supplied shifter and adder components to maximize performance. The 16-bit leading-zero-count block is a hierarchical radix-4 implementation. A 44-bit group control register was used in the MFLS implementation; 4 bits were used for each LNZ and right-shift field, and 12 bits for each offset field. The target FPGA is an Altera Stratix EP1S40 chip with synthesis performed by the Altera Quartus II software. Table 2 summarizes the synthesis results for all three architectures. The MB results exclude the memory, which is implemented in a predefined on-chip RAM block and not in logic elements. The 3FLS architecture has a logic complexity that is between the other two architectures, and it has a frequency of operation that is the same as the MB architecture.
For ASIC synthesis, we used Synopsys tools with 0.18 µm TSMC generic libraries. The design was optimized by using the DesignWare library components for adders with fast carry look-ahead and for shifters. The operating voltage of the design is 1.6 volts. The results are summarized in Table 3 . The area and power results for the 3FLS architecture are better than the results for the MB architecture (the MB results include the area and power for a 320-bit latch-based asynchronous RAM). The operating speeds are comparable, however. The 3FLS architecture would therefore be suited for embedded systems with area/power constraints. For higher performance, even more specialized or advanced library cells could be employed for integration with processors that would be used for decoding applications. Further optimizations could include pipelining the combinational logic in Figure 1 if its propagation delay exceeds the critical path through the execute (EX) stage of the processor with which the logic is integrated.
PERFORMANCE RESULTS FROM SIMULATION
Performance results were obtained by simulating the execution of software for MPEG-4 variable-length decoding on a model of a general-purpose pipelined processor with and without the decoding support of the MFLS architecture. The simulations were performed with tools provided by Tensilica, Inc., for their configurable Xtensa processor core [7] . Using the Tensilica Instruction Extension (TIE) language, the proposed vldecode instruction from Section 2 was added to the instruction set, along with other instructions to set or obtain the contents of the group control register with M = 3. The simulations were initially configured for a processor with 32-kbyte, 4-way-associative data and instruction caches using 32-byte blocks. The processor pipeline and the memory hierarchy were modeled, with a data cache miss penalty of 14 cycles and an instruction cache miss penalty of 13 cycles.
The simulated MPEG-4 code is the MoMuSys MPEG-4 reference software in the C language [5] . The variable-length decoding portion of the reference software was modified to use our new instruction; the Tensilica tools allow the use of special C functions to represent the operation of a new instruction, from which the compiler generates executable machine code with the new instruction. The simulated performance results are reported only for computations re- Table 4 .
The results from decoding 100 frames (352×288, 30 fps) in three commonly-used MPEG-4 test video sequences are provided in Table 5 . Use of the MFLS architecture in the processor that executes the modified software led to a doubling of performance over the original code on the base processor. This result was achieved with just one group control register.
For further insight, additional simulations were conducted with different cache configurations. Table 6 summarizes results for the Coastguard video sequence. With a smaller cache size, the performance of the modified software using the MFLS architecture is less sensitive to the cache associativity than the original software on the base processor.
MEMORY EFFICIENCY ANALYSIS
We also analyzed the codeword table sizes for the reference MPEG-4 decoder software that was used in the simulation experiments in order to characterize the memory required for VLC tables using different methods. The results are provided in Table 7. The first column lists the tables used in MPEG-4 video coding. Table B-12 is the motion vector table. Tables B-16 and B-17 are the intra- and inter-transform coefficient tables, respectively. These three tables are the most frequently accessed tables in the decoder software. The second column lists the number of codewords in the original tables. The last three columns indicate the number of entries in the modified tables for the three VLD architectures being compared in this section. The total memory requirements for all tables under each method are also provided (assuming 4 bytes per entry). The 3FLS architecture requires less than half of the storage of the SFLS architecture.
CONCLUSION
The multiple fixed-length suffix (MFLS) architecture proposed in this paper provides a generic programmable solution for accelerating variable-length decoding in general-purpose processors for various multimedia formats including MPEG-1/2/4, H.261/3/4, and JPEG. The hardware extension for MFLS consists of a modest amount of combinational logic and a group control register. The parameter M for MFLS defines the maximum number of codeword groups that a hardware implementation will support; larger values of M imply a larger group control register and increased hardware complexity. Compared to our earlier architecture proposals, the MFLS architecture results in lower memory requirements than the single fixed-length suffix (SFLS) architecture and reduced hardware complexity with respect to the memory-based (MB) architecture, as reflected in synthesis results for FPGA and ASIC implementation. Simulation results using MPEG-4 reference software demonstrate that the MFLS architecture with M = 3 can double performance in the computations related to the intra- and inter-transform coefficient tables.
ACKNOWLEDGMENTS

