Abstract. The paper presents a case study on augmenting a TriMedia/CPU64 processor with a Reconfigurable (FPGA-based) Functional Unit (RFU). We first propose an extension of the TriMedia/CPU64 architecture, which consists of an RFU and its associated instructions. Then, we address the computation of the 8×8 IDCT on such an extended TriMedia, and propose a scheme to implement an 8-point IDCT operation on the RFU. Further, we address the decoding of Variable Length Codes and describe the FPGA implementation of a Variable Length Decoder (VLD) computing facility. When mapped on an ACEX EP1K100 FPGA from Altera, our 8-point IDCT exhibits a latency of 16 and a recovery of 2 TriMedia cycles, and occupies 42% of the FPGA's logic array blocks. The proposed VLD exhibits a latency of 7 TriMedia cycles when mapped on the same FPGA, and utilizes 6 of its embedded array blocks. By using the 8-point IDCT computing facility, an 8×8 IDCT including all overheads can be computed with a throughput of 1/32 IDCT/cycle. Also, with the proposed VLD computing facility, a single DCT coefficient can be decoded in 11 cycles including all overheads. Simulation results indicate that by configuring each of the 8-point IDCT and VLD computing facilities on a different FPGA context, and by activating the contexts as needed, the augmented TriMedia can perform MPEG macroblock parsing followed by pel reconstruction with an improvement of 20-25% over the standard TriMedia.
Introduction
A common issue addressed by computer architects is the range of performance improvements that may be achieved by augmenting a general purpose processor with a reconfigurable core. The basic idea of such an approach is to exploit both the capability of the general purpose processor to achieve medium performance for a large class of applications, and the flexibility of the FPGA to implement application-specific computations. Thus far, FPGA-augmented processors have predominantly assumed a simple general purpose core [1] [2] [3] [4]. Considering the class of VLIW machines, two general research questions may be raised:
-What are the influences of reconfigurable arrays on the performance of commercially available VLIW processors?
-What are the architectural changes needed for incorporating the reconfigurable array into the processor core?
In an attempt to answer these questions, we present a case study on augmenting a TriMedia/CPU64 processor with a Reconfigurable (FPGA-based) Functional Unit (RFU). With such an RFU, the user is given the freedom to define and use any computing facility subject to the FPGA size and TriMedia/CPU64 organization. In order to evaluate the potential performance of the augmented TriMedia/CPU64, we chose a significant chunk of MPEG decoding as benchmark. In particular, since the video data accounts for more than 80% of the whole MPEG bit stream [5], we considered the parsing of Variable-Length (VL) coded data at the macroblock layer followed by a pel reconstruction procedure as benchmark. That is, all the data elements corresponding to slice and higher layers are considered as being constants for our experiment.
We decided to provide hardware support for two functions of the selected benchmark: 8-point (1-D) Inverse Discrete Cosine Transform (IDCT) and Variable-Length Decoder (VLD). By developing VHDL code and mapping it with Altera tools, we evaluated the performance of these FPGA-based functions. Further, an MPEG-compliant program has been written in C, and then compiled, scheduled and finally simulated with the TriMedia tool-chain. For a typical MPEG string with 10% intra-coded, 70% B-coded, and 20% P-coded macroblocks, we found that the augmented TriMedia/CPU64 can perform macroblock parsing followed by pel reconstruction with an improvement of 20-25% over the standard TriMedia. Given that TriMedia/CPU64 is a 5 issue-slot VLIW processor with 64-bit datapaths and a very rich multimedia instruction set, such an improvement within the target media processing domain indicates that the hybrid TriMedia/CPU64 + FPGA is a feasible approach.
The paper is organized as follows. For background purposes, we briefly present several issues concerning MPEG and the FPGA architecture in Section 2. Section 3 describes the architectural extension of TriMedia/CPU64. Implementation issues related to 1-D IDCT and VLD computing facilities and their corresponding instructions are discussed in Sections 4 and 5. The 8×8 IDCT and entropy decoder implementations are then described in Sections 6 and 7. The execution scenario of the chosen benchmark on both standard and extended TriMedia, and experimental results are presented in Section 8. Section 9 completes the paper with some conclusions and closing remarks.
Background
Data compression is the reduction of redundancy in data representation, carried out for decreasing data storage requirements and data communication costs. A typical video codec system is presented in Figure 1 [6, 5]. The lossy source coder performs filtering, transformation (such as DCT, subband decomposition, or differential pulse-code modulation), quantization, etc. The output of the source coder still exhibits various kinds of statistical dependencies. The (lossless) entropy coder exploits the statistical properties of data and removes the remaining redundancy after the lossy coding.
Fig. 1. The block diagram of a generic video codec, adapted from [6, 5].
In MPEG, the DCT-Quantization pair is used as a lossy coding technique. The DCT algorithm processes the video data in blocks of 8×8, decomposing each block into a weighted sum of 64 spatial frequencies. At the output of DCT, the data is also organized in 8×8 blocks of coefficients, each coefficient representing the contribution of a spatial frequency for the video block being analyzed. Since the human eye cannot readily perceive high spatial frequency activity, a quantization step is carried out.
The goal is to force as many DCT coefficients as possible to zero, especially those corresponding to high spatial frequencies, within the boundaries of the prescribed video quality. Then, a zig-zag operation transforms the matrix into a vector in which the coefficients are ordered from the lowest frequencies (upper-left hand corner of the 8×8 block) to the higher ones (lower-right hand corner of the matrix). Usually, this vector exhibits large numbers of consecutive zeros. The subsequent compression step is carried out by the entropy coder which consists of two major parts: Run-Length Coder (RLC) and Variable-Length Coder (VLC). The RLC represents consecutive zeros by their run lengths. Since not each and every zero is coded, the number of samples is reduced. The RLC output data are composite words, also referred to as source symbols, which describe pairs of zero-run lengths and quantized DCT coefficient values. When all the remaining coefficients in a vector are zero, they are all coded by the special symbol end-of-block. Variable length coding, also known as Huffman coding, is a mapping process between source symbols and variable length codewords. The variable length coder assigns shorter codewords to frequently occurring source symbols, and vice versa, so that the average bit rate is reduced. In order to achieve maximum compression, the coded data is sent as a continuous stream of bits with no guard bits separating two consecutive symbols. As a result, the decoding procedure must recognize the code length as well as the symbol itself. Subsequently, we will focus on MPEG decoding, i.e., on the inverse operation of MPEG coding. Further, we will briefly present the theoretical background of Inverse Discrete Cosine Transform (IDCT), entropy decoding, as well as some issues related to the MPEG standard.
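The zig-zag scan and run-length coding steps described above can be illustrated with a minimal Python sketch. The function names and the `EOB` marker are hypothetical, and the sketch treats all coefficients alike (it does not model the separate handling of the DC coefficient in MPEG).

```python
EOB = "EOB"  # hypothetical marker standing for the end-of-block symbol


def zigzag_order(n=8):
    """Return the (row, col) positions of an n x n block in zig-zag scan order."""
    order = []
    for s in range(2 * n - 1):  # s = row + col identifies one anti-diagonal
        diag = [(r, s - r) for r in range(n) if 0 <= s - r < n]
        # Even diagonals are walked bottom-left to top-right, odd ones the other way.
        order.extend(reversed(diag) if s % 2 == 0 else diag)
    return order


def run_length_encode(vector):
    """Turn a zig-zag ordered coefficient vector into (run, level) pairs plus EOB."""
    last_nonzero = max((i for i, v in enumerate(vector) if v != 0), default=-1)
    symbols, run = [], 0
    for v in vector[: last_nonzero + 1]:
        if v == 0:
            run += 1          # count zeros preceding the next nonzero level
        else:
            symbols.append((run, v))
            run = 0
    symbols.append(EOB)       # all remaining coefficients are zero
    return symbols


# Usage: a sparse quantized block with three nonzero coefficients.
block = [[0] * 8 for _ in range(8)]
block[0][0], block[0][1], block[2][0] = 12, 5, -3
vector = [block[r][c] for r, c in zigzag_order()]
symbols = run_length_encode(vector)
```

For this block the three nonzero values occupy scan positions 0, 1, and 3, so only one zero has to be represented by a run length.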
Inverse Discrete Cosine Transform
The transformation for an N-point 1-D IDCT is defined by [7]:
where $X_u$ are the inputs, $x_i$ are the outputs, and $K_u = 1/\sqrt{2}$ for $u = 0$, otherwise $K_u = 1$. For MPEG, a 2-D IDCT processes an 8×8 matrix [5]:
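For reference, the usual textbook forms of the two transforms can be written as below. This is a sketch assuming the common orthonormal scaling, which agrees with the scale factor $K_u$ defined above; the index names $i$, $j$, and $v$ are assumptions.

```latex
% N-point 1-D IDCT (orthonormal scaling assumed)
x_i = \sqrt{\frac{2}{N}} \sum_{u=0}^{N-1} K_u \, X_u
      \cos\frac{(2i+1)\,u\,\pi}{2N}, \qquad i = 0, \ldots, N-1

% 8x8 2-D IDCT, i.e. the separable product of two 8-point 1-D IDCTs
x_{i,j} = \frac{1}{4} \sum_{u=0}^{7} \sum_{v=0}^{7} K_u \, K_v \, X_{u,v}
          \cos\frac{(2i+1)\,u\,\pi}{16} \cos\frac{(2j+1)\,v\,\pi}{16},
          \qquad i, j = 0, \ldots, 7
```

With $N = 8$ the 1-D scale factor is $\sqrt{2/8} = 1/2$, so the 2-D factor of $1/4$ is simply its square.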
One strategy to compute the 2-D IDCT is the standard row-column separation. The 2-D transform is performed by applying the 1-D transform to each row (horizontal IDCTs) and subsequently to each column (vertical IDCTs) of the data matrix. This strategy can be combined with different 1-D IDCT algorithms to further reduce the computational complexity. One of the most efficient 1-D IDCT algorithms has been proposed by Loeffler [8]. A slightly different version of the Loeffler algorithm, in which the $\sqrt{2}$ factors are moved around, has been proposed by van Eijndhoven and Sijstermans [9]. In our experiment, we will use this modified algorithm (see Figure 2). A square block depicts a rotation which transforms a pair $[I_0, I_1]$ into $[O_0, O_1]$. The symbol of a rotator and the associated equations are presented in Figure 4. Although an implementation of such a rotator with three multiplications and three additions is possible [8], we use the direct implementation of the rotator with four multiplications and two additions, since it shortens the critical path and improves numerical accuracy. Therefore, multiplications by constants
Entropy Decoder
In MPEG, the entropy decoder consists of a Variable-Length Decoder (VLD) followed by a Run-Length Decoder (RLD). The input to the VLD is the incoming encoded bit stream, and the output is the decoded symbols. Since the code length of the symbol is variable, the input and output bit rates of a VLD cannot both be kept constant. Three different decoder types are possible [6]: constant input rate, constant output rate, and variable input-output rate.
The constant-input-rate VLD decodes a fixed number of bits and produces a variable number of symbols per unit time. An example of such a decoder, which decodes one bit per cycle, is described in [11]. The decoder employs a binary tree search technique in which a token is propagated in a reverse Huffman tree constructed from the original codes. Although some improvements of the tree-based method make it possible to decode more than one bit per cycle [12], the tree-based approaches are not suitable for high performance applications such as high-definition television, because high clock rate processing is needed.
A constant-output-rate VLD decodes one codeword (symbol) per cycle regardless of its length [13]. Generally speaking, a constant-output-rate VLD contains a look-up table which receives the variable-length code itself as the address. The decoded symbol (run-level pair or end-of-block) and the codeword length are generated in response to that address. Since the longest codeword excluding Escape has 17 bits, the LUT size could reach 131072 ($2^{17}$) words for a direct mapping of all possible codewords.
A variable-input-output-rate VLD is a mixture of the first two VLDs. It is implemented as a repeated table look-up, each step decoding a variable size chunk of bits. If a valid code is encountered, a run/level pair or an end-of-block is generated. If a miss is detected, a chunk size for the next look-up is generated. In this way, the short (most probable) codewords are preferentially decoded. A variable-input-output-rate VLD exhibits an acceptable decoding throughput, while the size of the look-up table is reasonably small.
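The repeated table look-up can be illustrated with a minimal Python sketch. The five-symbol prefix code, the 2-bit chunk size, and the table layout are assumptions chosen for brevity; they are not MPEG tables.

```python
# Toy prefix code (assumed for illustration): A=0, B=10, C=110, D=1110, E=1111.
# Each table is indexed by the next 2 bits of the stream. An entry is either
# ("hit", symbol, bits_consumed) or ("miss", next_table, bits_consumed).
LEVEL1 = {
    "00": ("hit", "A", 1), "01": ("hit", "A", 1),
    "10": ("hit", "B", 2),
    "11": ("miss", "LEVEL2", 2),  # long code: continue in the second-level table
}
LEVEL2 = {
    "00": ("hit", "C", 1), "01": ("hit", "C", 1),
    "10": ("hit", "D", 2), "11": ("hit", "E", 2),
}
TABLES = {"LEVEL1": LEVEL1, "LEVEL2": LEVEL2}


def decode(bits):
    """Decode a bit string; return the symbols and the number of look-ups used."""
    symbols, lookups, pos = [], 0, 0
    while pos < len(bits):
        table = "LEVEL1"
        while True:
            chunk = bits[pos:pos + 2].ljust(2, "0")  # pad at the end of the stream
            kind, value, used = TABLES[table][chunk]
            lookups += 1
            pos += used
            if kind == "hit":
                symbols.append(value)
                break
            table = value  # miss: repeat the look-up in the next table
    return symbols, lookups


# Usage: A, B, C, E concatenated with no separators between codewords.
decoded, n_lookups = decode("0" + "10" + "110" + "1111")
```

Short codewords complete in a single look-up, while the longer ones need a second one, which is why the decoding time per symbol is not constant.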
The run-length decoder passes the VLC-decoded codewords through if they are not run-length codes, otherwise it outputs the specified number of zeros.
Macroblock parsing and pel reconstruction
The macroblock parsing process reads the VL coded data string from which all the headers corresponding to slice and higher layers have been removed, and outputs various symbols: decoding parameters at the macroblock layer (macroblock address increment, macroblock type, coded block pattern, and quantizer scale), motion values, and composite symbols (run/level pairs and end-of-block). The decoding of the Variable-Length Codes (VLC) is performed according to a set of VLC tables defined by the MPEG standard. The motion values are used by a motion compensation process which is not considered here. However, since these values are decoded during the macroblock parsing, the overhead associated with the decoding of the motion values will be taken into consideration in the subsequent experiment.
Following the macroblock parsing, a pel reconstruction process recreates 8×8 matrices of pels. The pel reconstruction module is depicted in Figure 5. Its functionality is as follows. First, 8×8 matrices of DCT quantized coefficients are recreated by a Matrix Reconstruction module. Second, an inverse quantization (InvQ) is performed. An 8×8 quantization table and a multiplicative quantization factor (quantizer scale) are used in the InvQ process. Third, a DC prediction unit reconstructs the DC coefficient in intra-coded macroblocks. Finally, an IDCT is performed. In connection with Figure 5 and the subsequent experiment, we would like to mention that the VLC decoder and IDCT will benefit from reconfigurable hardware support. We conclude this section with a review of the architecture of the FPGA we used as an experimental reconfigurable core.
The FPGA architecture.
Field-Programmable Gate Arrays (FPGA) [14] are devices which can be configured in the field by the end user. In a general view, an FPGA is composed of two constituents: Raw Hardware and Configuration Memory. The function performed by the raw hardware is defined by the information stored into the configuration memory. Generally speaking, a multiple-context FPGA [15] is an FPGA having the configuration memory replicated in order to contain several configurations for the raw hardware. That is, a multiple-context FPGA contains an on-chip cache of raw hardware configurations, which are referred to as contexts. Such a cache allows a context switch to occur on the order of nanoseconds [16] . However, loading a new configuration from off-chip is still limited by low off-chip bandwidth.
In the sequel, we will assume that the architecture of the raw hardware is identical to that of an ACEX 1K device from Altera [17]. Our choice could allow future single-chip integration, since both ACEX 1K FPGAs and TriMedia are manufactured in the same TSMC technological process. Briefly, an ACEX 1K device contains an array of Logic Cells, each including a 4-input Look-Up Table (LUT), a relatively small number of Embedded Array Blocks (EABs), each EAB being actually a RAM block with 8 inputs and 16 outputs, and an interconnection network. In order to give a general view, we mention that the logic capacity of the ACEX 1K family ranges from 576 logic cells and 3 EABs for the EP1K10 device to 4992 logic cells and 12 EABs for the EP1K100 device. The maximum operating frequency for synchronous designs mapped on an ACEX 1K FPGA is 180 MHz. More details regarding the architecture and operating modes of ACEX 1K devices, as well as data sheet parameters, can be found in [17].
An architectural extension for TriMedia/CPU64
TriMedia/CPU64 is a 64-bit 5 issue-slot VLIW core [18], launching a long instruction every clock cycle. It has a uniform 64-bit wordsize through all functional units, the register file, load/store units, on-chip highway and external memory. Each of the five operations in a single instruction can (in principle) read two register arguments and write one register result. The architecture supports subword parallelism and is optimized with respect to media processing. With the exception of floating point divide and square root, all functional units have a recovery of 1, while their latency ranges from 1 to 4. The TriMedia/CPU64 VLIW core also supports multi-slot operations, or super-operations. Such a super-operation occupies two neighboring slots in the VLIW instruction, and maps to a double-width functional unit. This way, operations with more than 2 arguments and/or more than one result are possible.
We first propose augmenting the TriMedia/CPU64 processor with a Reconfigurable Functional Unit (RFU) which consists mainly of a multiple-context FPGA core. A hardwired Configuration Unit which manages the reconfiguration of the raw hardware is associated with the reconfigurable functional unit, as depicted in Figure 6. The reconfigurable functional unit is embedded into TriMedia as any other hardwired functional unit is, i.e., it receives instructions from the instruction decoder, reads its input arguments from the register file, and writes the computed values back to it. In this way, only minimal modifications of the basic architecture are required.
In order to use the RFU, a kernel of new instructions is needed. This kernel constitutes the extension of the TriMedia/CPU64 instruction set architecture we propose. It includes the following instructions: SET CONTEXT, ACTIVATE CONTEXT, and EXECUTE.
Loading context information into the RFU configuration memory is performed under the command of a SET CONTEXT instruction. The ACTIVATE CONTEXT instruction controls the swapping of the active configuration with one of the idle on-chip configurations. The operations performed by the computing resources configured on the raw hardware are launched by EXECUTE instructions. In this way, the execution of an RFU-mapped operation requires three basic stages: set, activate, and execute [19].
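The three stages (set, activate, execute) can be mimicked by a small behavioral model in Python. This is only a sketch: the class `ReconfigurableUnit` and its method names are hypothetical, each context is modeled as a plain Python callable, and no timing or reconfiguration cost is represented.

```python
class ReconfigurableUnit:
    """Toy behavioral model of a multiple-context reconfigurable functional unit."""

    def __init__(self, num_contexts):
        self.contexts = [None] * num_contexts  # on-chip configuration memory
        self.active = 0                        # index of the active context

    def set_context(self, index, facility):
        """Set stage: load a configuration (here, a callable) into a context."""
        self.contexts[index] = facility

    def activate_context(self, index):
        """Activate stage: swap the active configuration with an idle one."""
        self.active = index

    def execute(self, *args):
        """Execute stage: run whatever is currently active, without any check."""
        return self.contexts[self.active](*args)


# Usage: two made-up facilities, both with two inputs and two outputs.
rfu = ReconfigurableUnit(num_contexts=2)
rfu.set_context(0, lambda a, b: (a + b, a - b))
rfu.set_context(1, lambda a, b: (a * b, 0))
rfu.activate_context(1)
first = rfu.execute(3, 4)
rfu.activate_context(0)
second = rfu.execute(3, 4)
```

The same `execute` call returns different results depending on the active context, so keeping track of which context is active is left to the caller.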
The user is given a number of EXECUTE instructions, each corresponding to a different operation pattern (number of issue slots, number of input and output arguments, etc.). It is the responsibility of the user to choose the appropriate EXECUTE instruction corresponding to the pattern of the operation to be executed. At the source code level, this may be done by setting up an alias, as described subsequently. Since the EXECUTE instructions are executed on the RFU without checking the active configuration, it is also the responsibility of the user to manage the active and idle configurations. Since the semantics of an operation performed by a computing facility, its latency, recovery, and slot assignment are all user-definable, the source code of the application should contain information to augment the Machine Description File [20]. Assuming, for example, a user-defined VLD instruction, a way to specify such information is to annotate the source code as follows:
    .alias    VLD EXEC 3  ; specifies the alias EXECUTE 3
                          ; (super-op with two inputs and outputs)
    .latency  VLD 7       ; specifies the VLD latency
    .recovery VLD 7       ; specifies the VLD recovery
    .slot     VLD 1+2     ; specifies the slot assignment
                          ; of the VLD instruction
In a similar way, the user can define as many RFU-related instructions as needed. The next section will present the syntax and semantics of the 1-D IDCT and VLD instructions, as well as implementation issues of the corresponding computing facilities.
1-D IDCT instruction and computing facility
Since the standard TriMedia provides good support for transposition and matrix storage, we expect to get little benefit if we configure the entire 2-D IDCT into the FPGA. Our goal is to balance the cost of storing the intermediate 2-D IDCT results into an FPGA-resident transpose matrix memory against obtaining free slots in TriMedia. Consequently, only a super-operation computing the 1-D IDCT of eight 16-bit values packed in two 64-bit registers is considered. The syntax of such an operation is:
1-D IDCT Rx, Ry → Rz, Rw
where the registers Rx and Ry specify the inputs, and Rz and Rw, the outputs. All registers Rx, Ry, Rz, and Rw encompass the common format presented in Table 1 . All the operations required to compute 1-D IDCT are implemented using 16-bit fixed-point arithmetic. Since an implementation of the rotator with four multiplications is preferred [10] , the computation of 1-D IDCT requires ½ multiplications. As all the multiplications are to be performed in parallel, an efficient implementation of each multiplication is of crucial importance. For all multiplications, the multiplicand is a 16-bit signed integer represented in 2's complement notation, while the multiplier is a positive integer constant of 15 bits or less. As claimed in [21] , these word lengths in connection with fixed-point arithmetic are sufficient to fulfill the IEEE numerical accuracy for IDCT in MPEG applications [22] .
A general multiplication scheme for which both multiplicand and multiplier operands are unknown at implementation time exhibits the largest flexibility at the expense of higher latency and larger area. If one of the operands is known at implementation time, the flexibility of the general scheme becomes useless, and a customized implementation of the scheme will lead to improved latency and area. A scheme which is optimized for one known operand is referred to as multiplication-by-constant. Since such a scheme is more appropriate for our application, we will use it subsequently.
To implement the multiplication-by-constant scheme, we built a partial product matrix, where only the rows corresponding to a '1' in the multiplier operand are filled in.
Then, reduction schemes which fit into a pipeline stage running at 100 MHz are sought.
It should be emphasized that a reduction algorithm which is optimum on a certain FPGA family may not be optimum for a different family.
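The partial product matrix of a multiplication-by-constant can be illustrated in software. The following Python sketch is behavioral only: the function names are hypothetical, it says nothing about how the rows are reduced in hardware, and the constant in the usage example is arbitrary.

```python
def partial_product_rows(multiplier_constant):
    """Shift amounts of the rows kept in the partial product matrix.

    Only bit positions holding a '1' in the constant contribute a row.
    """
    return [bit for bit in range(multiplier_constant.bit_length())
            if (multiplier_constant >> bit) & 1]


def multiply_by_constant(multiplicand, multiplier_constant):
    """Sum the shifted copies of the (signed) multiplicand, one per kept row."""
    return sum(multiplicand << shift
               for shift in partial_product_rows(multiplier_constant))


# Usage: a signed 16-bit multiplicand and an arbitrary positive 15-bit constant.
rows = partial_product_rows(0b1011)
product = multiply_by_constant(-1234, 0x5A82)
```

A constant with few '1' bits leaves few rows to reduce, which is why the cost of each multiplier depends on the particular constant.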
In connection with the partial product matrix, reduction modules which can run at 100 MHz when mapped on an ACEX 1K are presented in Figure 7. All the designs are synchronous, i.e., both inputs and outputs are registered. The estimations have been obtained by compiling VHDL source codes with Leonardo Spectrum™ from Exemplar, followed by a place and route procedure performed by MAX+PLUS II™ from Altera.
The 100 MHz reduction modules are summarized below:
-Horizontal reductions of three or four 16-bit lines to one line (Fig. 7-a).
-Horizontal reduction of only two 30-bit lines to one line (Fig. 7-b).
-Vertical reductions of three or four 7-bit columns to one line (Fig. 7-c).
-Vertical reductions of six 5- or 6-bit columns to one line (Fig. 7-d).
We do not go into details about the implementations of the multipliers and we refer the reader to [10]. We still mention the latency of each multiplier:
The sketch of the 1-D IDCT pipeline is depicted in Figure 8 
VLD instruction and computing facility
As mentioned in Section 3, computing resources which can perform rather complex operations are worth implementing on the RFU. Also, as with all hardwired computing resources, the latency of an RFU-configured computing resource should be known at compile time. Therefore, we will subsequently consider a VLD instruction which returns a DCT symbol (run/level pair or end-of-block) per execution. That is, a constant-output-rate VLD is to be employed. With such a decoder, no benefit from preferentially decoding the short (most probable) codewords can be achieved. A super-operation pattern with two input (Rx, Ry) and two output (Rz, Rw) registers is assigned to the variable-length decoder:
VLD Rx, Ry → Rz, Rw
The Rx register specifies the decoding parameters which identify the type of the symbol to be decoded: AC/DC, luminance/chrominance, intra/non-intra, as well as whether the string is an MPEG-1 or MPEG-2 one, or whether the decoding table is B14 or B15 [5]. The second register, Ry, contains 64 bits of the VL compressed data. The decoded symbol and its code length will be stored into registers Rz and Rw, respectively. Since the VLD does not know the start of the next variable-length codeword until the current codeword is decoded, a new VLD operation can be launched only after the previous one has completed. Consequently, a recovery lower than the latency gives no advantage, and such an implementation should not be sought. The formats of the registers Rx, Ry, Rz, Rw are shown in Tables 2, 3, 4, and 5.
Generally speaking, a constant-output-rate VLD computes the codeword length by looking up the 17 leading bits of the incoming bit stream in a look-up table. The decoder then sends the code length and the leading bits to other feed-forward circuitry for further decoding and immediately shifts the input by a number of bits equal to the code length, to prepare the next decoding cycle. In cases where the number of codewords is large, there are some bits that are common to the long VLCs, called prefixes. By exploiting these common prefixes, the size of the LUT can be reduced because the prefixes are no longer redundant in the LUT [23, 24]. The basic idea of prefix precoding is to group the VLCs by their common prefixes, and to provide LUTs, one for each group, which can decode codewords only in the corresponding group.
Since a single EAB of an ACEX 1K device can implement a look-up table of 8 inputs, we partitioned the VLC table according to this FPGA architectural characteristic, as presented in Table 6. In order to reduce the latency, the implementation of the VLD computes in advance: the run and level for each and every group are decoded in parallel, as if the valid symbol belonged to that group. In parallel, the code length of the symbol along with some selection signals are determined. Then, the selection of the proper run and level pair is carried out. The implementation is presented in Figure 9.
Regarding groups 1, 2, and 3, one, six, and nine leading bits are shifted out from the original VLC string, respectively. The three resulting strings are each sent to a different EAB, and three run/level pairs are generated as if the shifted-out leading bits had been those mentioned in the column Bypassed header. By means of combinatorial circuits, the same procedure is carried out for group 0, end-of-block, and escape.
Each leading bit sequence which defines a VLC class is decoded by a multiple-input gate. Once the class is detected, a multiplexer selects the proper output from the outputs of the EABs, EOB detector, Escape detector, and Group 0 decoding. The code length of the decoded symbol is generated according to the detected class.
By simulation, we found that the FPGA-based VLD operation exhibits a latency of 7 TriMedia cycles. 6 EABs of an ACEX EP1K100 device are used. 
8×8 IDCT
The functionality of the 8×8 IDCT can be implemented in both software and reconfigurable hardware. We will evaluate their performance subsequently.
8×8 IDCT implementation on standard TriMedia
In the current implementation of the 2-D IDCT on the standard TriMedia/CPU64 architecture, all computations are done with 16-bit values, and make intense use of SIMD-style operations. The 8×8 matrix is stored in sixteen 64-bit words, each containing a half row of four 16-bit elements. Therefore, four 16-bit elements can be processed in parallel by a single word-wide operation. Next to that, being a 5-issue slot VLIW processor, TriMedia/CPU64 can execute 5 such operations per clock cycle.
This strategy is used for both the horizontal and vertical IDCTs. First, eight 1-D IDCTs (two SIMD 1-D IDCTs) are computed using the modified 'Loeffler' algorithm [9] .
Then, the transpose of the 8×8 matrix is performed by TRANSPOSE double-slot operations. Such a unit can generate the upper or the lower two words of a transposed 4×4 matrix in one cycle. Therefore, the 8×8 matrix transpose is computed in eight basic operations. Finally, eight 1-D IDCTs (two SIMD 1-D IDCTs) are computed having the results generated by the transposition as inputs. Following the described procedure, a complete 2-D IDCT including all overheads (mostly composed of load and store operations) can be performed in 56 cycles [18].
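The row pass, transpose, row pass structure can be checked numerically with a short floating-point sketch in Python. It assumes the usual orthonormal 1-D IDCT definition and uses a direct cosine sum rather than the modified Loeffler algorithm; the function names are hypothetical, and a final transpose is added only to return the result in natural orientation.

```python
import math


def idct_1d(X):
    """Direct N-point 1-D IDCT (orthonormal scaling assumed), O(N^2) for clarity."""
    n = len(X)
    out = []
    for i in range(n):
        acc = 0.0
        for u in range(n):
            k = 1 / math.sqrt(2) if u == 0 else 1.0
            acc += k * X[u] * math.cos((2 * i + 1) * u * math.pi / (2 * n))
        out.append(math.sqrt(2 / n) * acc)
    return out


def transpose(m):
    """Transpose a square matrix given as a list of rows."""
    return [list(col) for col in zip(*m)]


def idct_2d(block):
    """2-D IDCT by row-column separation: rows, transpose, rows, transpose."""
    rows = [idct_1d(r) for r in block]             # first pass on the rows
    cols = [idct_1d(r) for r in transpose(rows)]   # second pass on the transposed data
    return transpose(cols)                         # back to natural orientation


# Usage: a block holding only a DC coefficient reconstructs to a flat block.
dc_block = [[0.0] * 8 for _ in range(8)]
dc_block[0][0] = 8.0
flat = idct_2d(dc_block)
```

Because both passes apply the same 1-D routine to rows, the only data movement between them is the transposition, which is the part handled by the transpose operations in the text.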
8×8 IDCT implementation on extended TriMedia
A number of 2 × 16 = 32 registers are needed for this interleaved processing pattern.
The code was manually scheduled. We found that the computational performance of 2-D IDCT exhibited a throughput of 1/32 IDCT/cycle and a latency of ¾ cycles [10].
Entropy decoder
The functionality of the entropy decoder can be implemented in both software and reconfigurable hardware. We will evaluate their performance subsequently. 
Entropy decoder implementation on standard TriMedia
The implementation of the entropy decoder in the standard TriMedia is a modified version of that proposed in [25]. The VLD has variable input-output rate, being implemented as a repeated table-lookup. Each lookup decodes a chunk of bits (8 bits at the first level lookup), and determines if a valid code was encountered. In case of a valid decode, a run-level pair is generated, or an escape or end-of-block flag is set. If a miss is detected, an offset into the VLC table and a chunk-size for a second-level lookup is generated. This process of signaling an incomplete decode and generating a new offset may be repeated three times. When a valid symbol has been encountered, it is stored into the 8×8 matrix at the location defined by the run value. After compiling the C code and scheduling procedure, we evaluated that a table lookup takes 21 cycles. Consequently, the entropy decoding of a single DCT coefficient can take between 21 and 63 cycles. The size of all lookup tables is 10 KB.
Entropy decoder implementation on extended TriMedia
The entropy decoder in the extended TriMedia benefits from reconfigurable hardware support. By employing software pipelining techniques, useful computations related to run-length decoding may be performed in the delay slots of the VLD operation. That is, the empty 8×8 matrix is successively filled in with level values at the positions specified by run values. In this way, a symbol is processed completely in one (fixed latency) iteration. By simulation, we evaluated that a single DCT coefficient can be decoded in 11 cycles including all overheads.
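The filling of the coefficient matrix from run/level symbols can be sketched as follows. This is a minimal Python illustration with hypothetical names; the block is kept as a flat 64-entry vector in scan order (the mapping of scan positions to 8×8 matrix positions is a separate table look-up), and the symbols are assumed to have been decoded already.

```python
def fill_block(symbols, size=64):
    """Place each level at the scan position obtained by skipping `run` zeros."""
    coeffs = [0] * size        # the initially empty block, in scan order
    pos = 0
    for sym in symbols:
        if sym == "EOB":       # end-of-block: the remaining entries stay zero
            break
        run, level = sym
        pos += run             # skip the zero-valued coefficients
        coeffs[pos] = level
        pos += 1
    return coeffs


# Usage: three run/level pairs followed by end-of-block.
coeffs = fill_block([(0, 12), (0, 5), (1, -3), "EOB"])
```

Each symbol needs one position update and one store regardless of the codeword length, which fits the fixed-latency iteration described above.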
Experimental results
In order to determine the potential impact on performance provided by the multiple-context reconfigurable core, we will consider a benchmark which consists of a macroblock parsing followed by pel reconstruction procedures. Therefore, we operate at MPEG slice level, i.e., the data elements on slice and above layers are assumed to be constant. The computing scenario is presented in Figure 10. First, a variable-length decoding of a macroblock (header and DCT coefficients extraction) is performed. Then, the 8×8 matrices are recreated, and inverse quantization, followed by DC coefficient prediction for intra-coded macroblocks, is carried out. After all macroblocks in a slice have been decoded, a burst of 2-D IDCTs is launched in order to reconstruct the initial pels. During computation, the 1-D IDCT and VLD computing resources are activated by an ACTIVATE CONTEXT instruction, as needed.
All the contexts of the RFU are to be configured at application load time, i.e., a number of SET CONTEXT instructions are scheduled at the top of the program code. A sample of the code using the instructions of the architectural extension is presented subsequently. As can be observed, the VLD and IDCT exhibit the same execution pattern: two inputs and two outputs.
Therefore, our experiment includes two approaches: pure software and FPGA-based. As mentioned, a DCT coefficient is decoded in 21-63 cycles, and a 2-D IDCT can be computed in 56 cycles in the pure software approach. In the FPGA-based approach, a DCT coefficient is decoded in 11 cycles, and the 2-D IDCT is carried out with a throughput of 1/32 IDCT/cycle. Based on the published work in the field of multiple-context FPGAs [16], we make a conservative assumption and consider that the context switching penalty is 10 cycles.
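The cycle figures quoted above can be combined in a rough back-of-the-envelope model. The Python sketch below is an assumption-laden illustration with hypothetical names: it covers only the entropy decoding and IDCT terms, treats the RFU-based IDCT at its steady-state throughput, and ignores header parsing, matrix reconstruction, inverse quantization, and DC prediction, so it does not reproduce the overall improvement figures reported in the paper.

```python
SW_CYCLES_PER_LOOKUP = 21        # software VLD: one table look-up
SW_CYCLES_PER_IDCT = 56          # software 2-D IDCT
RFU_CYCLES_PER_COEFFICIENT = 11  # RFU-based VLD, all overheads included
RFU_CYCLES_PER_IDCT = 32         # steady state at 1/32 IDCT/cycle
CONTEXT_SWITCH_CYCLES = 10       # assumed context switching penalty


def software_entropy_cycles(lookups_per_coefficient):
    """Software entropy decoding: 21 cycles per look-up, 1 to 3 look-ups each."""
    return sum(SW_CYCLES_PER_LOOKUP * n for n in lookups_per_coefficient)


def rfu_entropy_cycles(num_coefficients):
    """RFU-based entropy decoding: a fixed cost per coefficient."""
    return RFU_CYCLES_PER_COEFFICIENT * num_coefficients


def rfu_idct_burst_cycles(num_blocks):
    """One context switch followed by a burst of 2-D IDCTs at steady state."""
    return CONTEXT_SWITCH_CYCLES + RFU_CYCLES_PER_IDCT * num_blocks


# Usage: five coefficients, two of which need a second look-up in software
# (an illustrative mix, not a measured distribution).
sw_cycles = software_entropy_cycles([1, 1, 1, 2, 2])
rfu_cycles = rfu_entropy_cycles(5)
```

The fixed per-coefficient cost on the RFU means the gap to the software decoder grows with the share of coefficients that miss at the first look-up.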
Pel reconstruction performance evaluation
An MPEG-compliant program has been written in C, compiled and scheduled with TriMedia development tools. The performance evaluation has been done assuming that, despite the large lookup tables which are stored in memory, the standard TriMedia/CPU64 never incurs a cache miss. In other words, we compare an 'ideal-cache' standard TriMedia with a multiple-context FPGA-augmented TriMedia.
Subsequently, we present the results according to two scenarios: worst-case and average-case. In both cases we assumed that an average of 5 coefficients per block are decoded. In the worst-case scenario, we assumed that all DCT coefficients produce a hit on the first level lookup when the pure software implementation is used. In the same worst-case scenario, we also assumed that the overhead introduced by parsing the macroblock headers has the largest value (for example, the quantization value is assumed to be updated every macroblock). Since the worst-case scenario corresponds to long variable-length codes, it is statistically not relevant. Therefore, we evaluated the performance in an average-case scenario. In such a scenario, we assumed that two of five DCT coefficients produce a miss at the first lookup. Also, we weighted the overhead introduced by parsing the macroblock header with the transmission probability of different decoding parameters of the macroblock layer. The results are presented in Table 7. The numbers indicate the improvements we get for the number of cycles.
Table 7. Performance improvement of multiple-context FPGA-augmented TriMedia/CPU64 over 'ideal-cache' (standard) TriMedia/CPU64 for a macroblock parsing followed by pel reconstruction application.
Worst-case scenario    Average-case scenario
1?%                    2?%
Finally, we proceeded to a global evaluation of the performance improvement. For an MPEG string with 10% intra-coded, 70% B-coded, and 20% P-coded macroblocks, the improvement for the augmented TriMedia is 20-25% in the average-case scenario.
Conclusions
We have proposed an architectural extension for TriMedia/CPU64 which encompasses a multiple-context FPGA-based reconfigurable functional unit and the associated instructions. On the augmented TriMedia/CPU64, we estimated a performance improvement of 20-25% over a standard TriMedia/CPU64 for a macroblock parsing followed by a pel reconstruction application, at the expense of three new instructions: SET CONTEXT, ACTIVATE CONTEXT, and EXECUTE. As future work, we intend to consider motion compensation and to evaluate the performance improvement for a complete MPEG decoder.
Several considerations about the latency of an RFU-configured computing resource are worth mentioning. Due to realization constraints, the RFU is likely to be located far away from the Register File (RF) in the floorplan of the TriMedia/CPU64. The immediate effect is that there will be large delays in transferring data between the RFU and the RF, and the RFU will not benefit from the bypassing capabilities of the RF [18]. Consequently, read and write-back cycles have to be provided explicitly. In such circumstances, the minimum latency of an RFU-based computing resource includes at least 1 cycle for reading the input arguments from the register file, the absolute minimum combinatorial delay Δ_FPGA on the FPGA, and 1 cycle for writing the results back to the register file. Assuming that the FPGA clock frequency is half of the TriMedia clock frequency [10], a single FPGA cycle spans 2 TriMedia cycles, so the absolute minimum RFU latency is 1 + 2 + 1 = 4 TriMedia cycles. Since an RFU call is quite expensive, the number of RFU calls should be minimized, i.e., computing resources which can perform complex operations have to be configured on the RFU.
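The latency budget described above can be written down as a small calculation. This is a minimal sketch under the stated assumptions (one read cycle, one write-back cycle, FPGA clocked at half the TriMedia frequency); the function name and its parameters are hypothetical and serve only as illustration.

```python
def min_rfu_latency(fpga_cycles=1, clock_ratio=2, read_cycles=1, write_cycles=1):
    """Latency in TriMedia cycles of an RFU operation.

    fpga_cycles  -- FPGA clock cycles spent in the configured logic (at least 1)
    clock_ratio  -- TriMedia cycles per FPGA cycle (2 when the FPGA runs at half speed)
    read_cycles  -- cycles to read the arguments from the register file
    write_cycles -- cycles to write the results back to the register file
    """
    if fpga_cycles < 1:
        raise ValueError("the FPGA needs at least one cycle")
    return read_cycles + fpga_cycles * clock_ratio + write_cycles
```

With the default arguments the function returns the absolute minimum of 4 TriMedia cycles; every additional FPGA cycle adds `clock_ratio` TriMedia cycles on top of that.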
Constraints and freedoms in configuring a VLD computing resource on the FPGA:
-The latency of such a computing resource should be known at compile time. Therefore, no benefit can be gained from preferentially decoding the short (highly probable) codewords.
-The latency of such a computing resource should be as small as possible, as this is the only way to speed up the decoding process. Pipelining is of no use here (see the articles on feedback systems, which do not lend themselves to pipelining).
-There are 12 EABs (256 × 16 words) on an EP1K100. Therefore, the prefix methodology and, consequently, the partitioning of the VLC tables should be chosen according to this FPGA architectural characteristic.
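To make the table-partitioning constraint concrete, the following sketch splits a variable-length code into lookup tables of 256 entries each, i.e., tables addressed by 8 bits. It is only an illustration under several assumptions: the two-level scheme, the function names, and the entry format are hypothetical, the codewords are limited to 16 bits, and whether one entry actually fits the word width of an EAB is not checked. It is not the partitioning used by the proposed VLD.

```python
INDEX_BITS = 8  # assumption: 2**8 = 256 entries per table, one table per memory block


def build_two_level(code):
    """Split a prefix code into 256-entry tables.

    code -- dict mapping codewords (strings of '0'/'1', at most 16 bits) to symbols.
    Returns (first, second): `first` is the first-level table, `second` maps an
    8-bit prefix to the second-level table shared by all long codes with that prefix.
    """
    first = [None] * (1 << INDEX_BITS)
    second = {}
    for word, sym in code.items():
        if len(word) <= INDEX_BITS:
            table, key = first, word
        else:
            prefix, key = word[:INDEX_BITS], word[INDEX_BITS:]
            table = second.setdefault(prefix, [None] * (1 << INDEX_BITS))
            first[int(prefix, 2)] = ("sub", prefix)
        # Fill every address whose leading bits equal the (remaining) codeword.
        pad = INDEX_BITS - len(key)
        base = int(key, 2) << pad
        for tail in range(1 << pad):
            table[base + tail] = ("leaf", len(word), sym)
    return first, second


def lookup(bits, first, second):
    """Decode the codeword at the front of `bits`; returns (length, symbol, table_reads)."""
    padded = bits.ljust(2 * INDEX_BITS, "0")
    entry = first[int(padded[:INDEX_BITS], 2)]
    reads = 1
    if entry is not None and entry[0] == "sub":
        entry = second[entry[1]][int(padded[INDEX_BITS:2 * INDEX_BITS], 2)]
        reads = 2
    if entry is None:
        raise ValueError("no codeword matches the leading bits")
    return entry[1], entry[2], reads
```

Short codewords are resolved with a single table read, while long codewords that share an 8-bit prefix share one second-level table, which is how common prefixes keep the total table size small.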
Generally speaking, a constant-output-rate VLD computes the codeword length by comparing the leading bits of the incoming bit stream against a small table. The decoder then sends the code length and the leading bits to other feed-forward circuitry for further decoding, and immediately shifts the input by a number of bits equal to the code length, in order to move to the new leading bits of the input bit stream for decoding the next codeword.
The critical path within the system is always the feedback path, because the other feed-forward paths can be pipelined. That is, the processing speed is limited by the feedback computation time: the time for comparing and selecting the codeword length plus the time for shifting the input [Lin & Messerschmitt, Part II, p. 198]. The latency of computing the feedback value sets the decoding cycle time, and is thus inversely proportional to the decoding rate.
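The decoding loop described above can be sketched in software as follows. The code table is a hypothetical toy prefix code, not an MPEG table, and the function names are assumptions made for illustration. The point is the structure: one table lookup on the leading bits yields the code length, and the shift by that length is the feedback step that must complete before the next codeword can be examined.

```python
# Hypothetical toy prefix code (NOT an MPEG table): codeword -> symbol.
TOY_CODE = {"0": "A", "10": "B", "110": "C", "111": "D"}


def build_length_table(code):
    """Map every possible window of leading bits to (codeword length, symbol)."""
    max_len = max(len(word) for word in code)
    table = {}
    for word, sym in code.items():
        pad = max_len - len(word)
        for tail in range(1 << pad):
            suffix = format(tail, "0{}b".format(pad)) if pad else ""
            table[word + suffix] = (len(word), sym)
    return max_len, table


def decode(bits, code=TOY_CODE):
    """Decode one codeword per iteration from a string of '0'/'1' characters."""
    max_len, table = build_length_table(code)
    symbols = []
    pos = 0
    while pos < len(bits):
        # Compare the leading bits against the table (zero-padded at the end of the stream).
        window = bits[pos:pos + max_len].ljust(max_len, "0")
        length, sym = table[window]
        if pos + length > len(bits):
            raise ValueError("bit stream ends inside a codeword")
        symbols.append(sym)   # feed-forward part: could be pipelined in hardware
        pos += length         # feedback part: shift the input by the code length
    return symbols
```

Because the next window depends on `pos`, which is known only after the current lookup, the loop body cannot be overlapped across iterations; this is the software counterpart of the feedback path that limits the decoding rate.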
The performance metric is throughput, i.e., the net decoder information rate. This rate equals the number of bits or codewords decoded per cycle multiplied by the clock rate. There is a trade-off between these two terms: the more bits or codewords we try to decode in one cycle, the more complicated the PLA (i.e., the look-up table) becomes and the slower the clock rate. Note, however, that the TriMedia clock frequency is fixed: it is a constraint, i.e., an input datum to the design of an MPEG decoder.
In cases where the number of codewords in the table is large, some bits are common to the long VLCs; we call these common bits a prefix. By exploiting these common prefixes, the size of the LUT can be reduced. A number of schemes such as prefix precoding [Choi Lee
For FPGA-2002: advanced computation of the next code length. The selection of the proper result is performed simultaneously with the selection of the proper run and length of the current word. Also, the run and length of the previous codeword are computed in parallel.
