H.264/AVC (Advanced Video Coding) is the latest standard for video coding. It assumes a scalar forward quantizer at the encoder, which can be implemented directly in integer arithmetic. This paper presents an efficient architecture for computing the forward quantization of H.264/AVC. It uses a modification of the quantization operation, which reduces the number of arithmetic operations, and a truncated Booth multiplier based on an adaptive statistical approach, which reduces the hardware. The C code of the JM reference software has been rewritten to analyze the effect of the new algorithm and of the truncated Booth multiplier. Simulations over popular test sequences used in video standardization show the validity of this approach. The results show that, at low QP, the PSNR improves by between 0.31 dB and 0.81 dB, with a slight increase in bit rate of around 0.8%. Finally, an architecture suitable for VLSI implementation is presented, which reduces area by 26%, power by 32% and critical path delay by 21% in comparison with the classical implementation. It also reduces area and increases speed in comparison with the architectures presented in the references.
INTRODUCTION
The forward quantization of a transform coefficient $W_{ij}$ into the quantized level $Z_{ij}$ can be written as

$$|Z_{ij}| = (|W_{ij}| \cdot MF_{ij} + F) \gg qbits, \qquad \mathrm{sign}(Z_{ij}) = \mathrm{sign}(W_{ij}) \tag{1}$$

where $MF_{ij}$ is the multiplication factor, made up of 6x3 arrays of 14-bit positive integers, qbits = 15 + floor(QP/6), ">>" indicates a binary shift right, and F is a positive number which can be expressed as F = f << qbits, f being typically in the range 0 to 0.5. Eq. (1) is similar to that used for encoding the 16x16 Intra prediction mode and the 4x4 chroma components. In this case, $MF_{ij}$ is replaced by $MF_{0,0}$, F by 2F and qbits by qbits+1. Eq. (1) has been implemented in the JM reference software, which is available online [3]. In the JM reference software, f is assigned one of two values: 1/3 for Intra blocks and 1/6 for Inter blocks. The forward quantization is not specified in the H.264 standard, which gives developers some flexibility in choosing a quantizer design [4]. However, some hardware implementations, such as those proposed in [5, 6], apply the quantization expression of Eq. (1) directly, without any kind of optimization, as shown in Fig. 1. In this case, the ABS module computes the absolute value of $W_{ij}$, the multiply-add unit calculates the term $(|W_{ij}| \cdot MF_{ij} + F)$, and the final modules perform the right shift and assign to $Z_{ij}$ the sign of $W_{ij}$.

This paper presents a more efficient quantizer architecture for computing the forward quantization of H.264. In this architecture, the ABS and SIGN modules are not necessary, and an adaptive truncated Booth multiplier is used to reduce the hardware.
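The direct datapath of Fig. 1 (absolute value, multiply-add, shift, sign restore) can be sketched in a few lines of Python. This is an illustrative model only: the multiplication factors are the values commonly tabulated for H.264, the helper names `mf_for` and `quantize_reference` are hypothetical, and the integer rounding of F = f << qbits is an assumption, since the text does not state how the fraction is rounded.

```python
# Multiplication factors MF, one row per QP % 6 and one column per position class:
# column 0: i and j both even; column 1: i and j both odd; column 2: all other positions.
MF_TABLE = [
    (13107, 5243, 8066),
    (11916, 4660, 7490),
    (10082, 4194, 6554),
    (9362, 3647, 5825),
    (8192, 3355, 5243),
    (7282, 2893, 4559),
]


def mf_for(qp, i, j):
    """Look up MF_ij for a coefficient at position (i, j) of a 4x4 block."""
    if i % 2 == 0 and j % 2 == 0:
        column = 0
    elif i % 2 == 1 and j % 2 == 1:
        column = 1
    else:
        column = 2
    return MF_TABLE[qp % 6][column]


def quantize_reference(w, i, j, qp, intra=True, dc=False):
    """Direct form of Eq. (1): ABS -> multiply-add -> right shift -> sign restore."""
    qbits = 15 + qp // 6
    # Assumption: F = f << qbits is rounded down to an integer (f = 1/3 or 1/6).
    f_scaled = (1 << qbits) // (3 if intra else 6)
    mf = mf_for(qp, i, j)
    if dc:
        # Variant described in the text: MF_00 instead of MF_ij, 2F and qbits + 1.
        mf = mf_for(qp, 0, 0)
        f_scaled *= 2
        qbits += 1
    z_abs = (abs(w) * mf + f_scaled) >> qbits
    return -z_abs if w < 0 else z_abs


print(quantize_reference(100, 0, 0, qp=0))    # 40
print(quantize_reference(-100, 0, 0, qp=0))   # -40
```

The explicit `abs` and the final sign restore are exactly the two steps that the modified operation of the next section sets out to remove.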
MODIFIED QUANTIZATION OPERATION
In Eq. (1), the modulus $|W_{ij}|$ is necessary because the operation ">> qbits" is meant as an integer division with truncation of the result toward zero, whereas an arithmetic right shift of a negative two's-complement number rounds toward minus infinity, which causes errors for $W_{ij} < 0$. For example, the integer -3 in a 4-bit two's-complement representation is 1101. The operation -3 >> 2 should be 0, but 1101 >> 2 gives 1111 (-1). To resolve this error, 1 << n must be added to the negative number, n being the number of right shifts. Thus, (1101 + (1 << 2)) >> 2 is 0. Note that this does not work properly when the n least significant bits are all zero, that is, when the number is a negative multiple of $2^n$. For example, if n = 2 and the number is -4, then (1100 + (1 << 2)) >> 2 is 0 when it should be -1 (1111).
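The behaviour described above is easy to reproduce, because Python's `>>` on negative integers behaves like an arithmetic shift. The following minimal sketch uses two hypothetical helpers, `truncate_shift` (the truncation toward zero that Eq. (1) relies on) and `corrected_shift` (the shift with 1 << n added to negative inputs), and lists the inputs for which the correction fails.

```python
def truncate_shift(x, n):
    """Divide by 2**n with truncation toward zero."""
    return -((-x) >> n) if x < 0 else x >> n


def corrected_shift(x, n):
    """Arithmetic shift with 1 << n added to negative inputs."""
    return (x + (1 << n)) >> n if x < 0 else x >> n


print(-3 >> 2)                  # plain arithmetic shift: -1
print(corrected_shift(-3, 2))   # 0
print(corrected_shift(-4, 2))   # 0, whereas truncate_shift(-4, 2) is -1

# Negative inputs for which the correction disagrees with truncation toward zero.
failing = [x for x in range(-16, 0) if corrected_shift(x, 2) != truncate_shift(x, 2)]
print(failing)                  # [-16, -12, -8, -4]
```

The failing inputs are exactly the negative multiples of 2**n, one in every 2**n negative values.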
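This operation allows $|W_{ij}|$ to be eliminated from Eq. (1) by assigning to F the same sign as $W_{ij}$. To do this, a term 1 << qbits must be added to F when $W_{ij} < 0$, resulting in

$$Z_{ij} = \begin{cases} (W_{ij} \cdot MF_{ij} + F) \gg qbits, & W_{ij} \ge 0 \\ (W_{ij} \cdot MF_{ij} - F + (1 \ll qbits)) \gg qbits, & W_{ij} < 0 \end{cases} \tag{2}$$

Then, Eq. (1) can be directly implemented as follows:

$$Z_{ij} = (W_{ij} \cdot MF_{ij} + F^{*}) \gg qbits \tag{3}$$

where Table I shows the values of f and the definition of F* for the different options. F* can be readily generated from the number 1/3 and shift operations. Note that f is always positive, being 1/3 or 1/6, and no additions are necessary.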
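The role of the term 1 << qbits can be checked with a short derivation. It is a sketch under the assumption that ">>" denotes an arithmetic right shift, that is, a floor division by $2^{qbits}$. The auxiliary symbol X and the abbreviation q = qbits are introduced here only for illustration.

```latex
% Assumption: a >> q = floor(a / 2^q) (arithmetic shift), with q = qbits.
% X is an auxiliary symbol, not part of the original notation.
\begin{aligned}
X &= W_{ij}\,MF_{ij} - F, \qquad W_{ij} < 0 \\
Z_{ij} &= -\left\lfloor \frac{|W_{ij}|\,MF_{ij} + F}{2^{q}} \right\rfloor
        = -\left\lfloor \frac{-X}{2^{q}} \right\rfloor
        = \left\lceil \frac{X}{2^{q}} \right\rceil \\
\left\lfloor \frac{X + 2^{q}}{2^{q}} \right\rfloor
       &= \left\lfloor \frac{X}{2^{q}} \right\rfloor + 1
        = \left\lceil \frac{X}{2^{q}} \right\rceil
        \qquad \text{if } 2^{q} \nmid X
\end{aligned}
```

Under this reading, the offset for negative coefficients is (1 << qbits) - F, which is positive because f < 1, and the shifted result differs from Eq. (1) by one only when X is an exact multiple of $2^{qbits}$.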
The operation of Eq. (3) provides erroneous results when the qbits least significant bits of the sum $W_{ij} \cdot MF_{ij} + F^{*}$ are all zero, which means that the sum before shifting is negative and a multiple of $2^{qbits}$. The probability of this event is $2^{-qbits}$ and, in the worst case (qbits = 15), its value is $2^{-15} \approx 3.052 \times 10^{-5}$. Simulations with real sequences have shown that this error has an insignificant effect on the quantization process and, therefore, the proposed method is valid.
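The size of this error can also be checked exhaustively for a single multiplication factor. The sketch below compares the direct form of Eq. (1) with the sign-free form over all 16-bit coefficients. The function names are hypothetical, the definition of F* for negative inputs ((1 << qbits) - F) and the integer rounding of F are assumptions, and MF = 13107 is used only as an illustrative value.

```python
def quantize_reference(w, mf, qbits, f_scaled):
    """Eq. (1): quantize the magnitude, then restore the sign."""
    z_abs = (abs(w) * mf + f_scaled) >> qbits
    return -z_abs if w < 0 else z_abs


def quantize_modified(w, mf, qbits, f_scaled):
    """Sign-free form: the offset F* depends on the sign of w (assumed definition)."""
    f_star = f_scaled if w >= 0 else (1 << qbits) - f_scaled
    return (w * mf + f_star) >> qbits  # Python's >> floors, like an arithmetic shift


def find_mismatches(mf, qbits, f_scaled, coefficients):
    """Return the coefficients for which the two forms disagree."""
    return [w for w in coefficients
            if quantize_reference(w, mf, qbits, f_scaled)
            != quantize_modified(w, mf, qbits, f_scaled)]


QBITS = 15                      # worst case discussed in the text (QP < 6)
F_SCALED = (1 << QBITS) // 3    # integer stand-in for f = 1/3 (assumption)
MF = 13107                      # illustrative multiplication factor
ALL_16BIT = range(-(1 << 15), 1 << 15)

mismatches = find_mismatches(MF, QBITS, F_SCALED, ALL_16BIT)
print(f"{len(mismatches)} mismatch(es) out of {len(ALL_16BIT)} inputs: {mismatches}")
```

Because this MF is odd, exactly one residue of W modulo $2^{15}$ makes the pre-shift sum a multiple of $2^{15}$, so a single negative input out of 32768 disagrees, in line with the $2^{-qbits}$ estimate. For that input the modified result is one larger than the reference.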
Truncated Booth multiplier
In Eq. (3), the arithmetic operation $W_{ij} \cdot MF_{ij} + F^{*}$ can be implemented in a single multiplier, and the shift operation ">> qbits" is a truncation equivalent to eliminating the qbits least significant bits of the multiplier output. Both operations can be efficiently implemented in a truncated multiplier. We focus on the modified Booth algorithm, which is the most popular approach for implementing fast multipliers using parallel encoding. Fig. 2 shows a simple representation of an 8x8 truncated radix-4 Booth multiplier. Each dot is a placeholder for a single bit obtained by a partial product generation circuit, and S is the sign conversion bit. Each element takes a value in {0, 1} depending on the result of the partial product selector. In this case, the multiplier output has been truncated by 7 bits. Thus, the partial products are divided into a main product (MP) and a truncated product (TP). The contribution of the TP to the MP is made through the sum of all carry signals generated from the TP, which is expressed as:
where $K_i$ is the sum of all dots in column i. This contribution is relatively low in comparison with the MP. Therefore, part of the TP circuitry can be eliminated in order to reduce the area and increase the speed of the multiplier. However, this introduces an error in the resulting product. To reduce this error, several refinements of the Booth multiplier have been proposed [7-10]. Simulations with these approaches have shown that the adaptive statistical analysis presented in [7] gives the best results. This approach allows the following approximation to be derived:
For example, Eq. (5) can be approximated for j=4 as:
In this case, the low-error Booth multiplier implementation requires only the columns $K_1$, $K_2$ and $K_3$, with the dots of $K_4$ being added to $K_3$.
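The split into a main product and a truncated product can be illustrated with an arithmetic-level model. The sketch below is not a gate-level description and does not implement the adaptive compensation of [7]: it only shows that the exact truncated result equals the MP plus the carry coming from the TP, and that dropping the TP entirely loses at most one unit per partial product row minus one. The function names and the row-wise representation are assumptions made for illustration.

```python
def booth_radix4_digits(m, nbits):
    """Radix-4 Booth digits (each in {-2, -1, 0, 1, 2}) of an nbits-wide signed m."""
    assert nbits % 2 == 0 and -(1 << (nbits - 1)) <= m < (1 << (nbits - 1))
    bits = [0] + [(m >> i) & 1 for i in range(nbits)]   # bits[0] is the implicit b(-1)
    return [bits[2 * k] + bits[2 * k + 1] - 2 * bits[2 * k + 2]
            for k in range(nbits // 2)]


def split_rows(w, m, nbits, t):
    """Split every partial product row into its part above column t (MP) and below (TP)."""
    mask = (1 << t) - 1
    rows = [(d * w) << (2 * k) for k, d in enumerate(booth_radix4_digits(m, nbits))]
    mp = sum(row >> t for row in rows)     # main product, in units of 2**t
    tp = sum(row & mask for row in rows)   # truncated product, always non-negative
    return mp, tp, len(rows)


def truncated_product(w, m, nbits=8, t=7, keep_carry=True):
    """(w * m) >> t, either exact (MP plus TP carry) or with the TP simply dropped."""
    mp, tp, _ = split_rows(w, m, nbits, t)
    return mp + (tp >> t) if keep_carry else mp


# 8x8 multiplier truncated by 7 bits, as in the example of the text.
w, m = -93, 57
print(truncated_product(w, m), (w * m) >> 7)       # identical values
print(truncated_product(w, m, keep_carry=False))   # MP only: never larger than the exact value
```

The low-error schemes cited above sit between these two extremes: they keep a few TP columns and estimate the rest of the carry instead of computing it.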
Simulation results
The C code of the JM reference software has been rewritten to analyze the effect of Eqs. (3) and (4) and of the truncated Booth multiplier for different values of k. Table II shows the simulation results in terms of peak signal-to-noise ratio (PSNR, in dB) and bit rate (BR, in kbit/s) for different sequences. Most of these sequences are popular test sequences used in video standardization. This analysis has been made for QP = 0, which corresponds to the maximum bit rate. Clearly, the best PSNR results are obtained for k = 5. In this case, the PSNR improves by between +0.31 dB (Tempete sequence) and +0.81 dB (Highway sequence). No definitive explanation for this improvement in PSNR has been found. The only justification for these results is related to the adaptive error-compensation method used in the multiplier, which is based on a statistical approach to the partial product bits of adjacent columns. However, this improvement in PSNR comes with a slight increase in bit rate of about 0.8%.

The parameter qbits grows linearly with floor(QP/6). For higher QP, the error introduced by the truncated Booth multiplier is drastically reduced as a consequence of the shifting operation in Eq. (1). Fig. 3 shows the rate-distortion curves for different sequences generated for j = 5 and for j = 15 (no truncated multiplication). Only a very slight difference is detected at low QP; the rest of the curves match perfectly.

QUANTIZER ARCHITECTURE
Fig. 5 shows an efficient quantizer architecture for H.264 based on a truncated 8-row Booth multiplier. The Booth algorithm is a common approach to the VLSI design of high-speed multipliers because it halves the number of additions in the multiplication. The modified Booth algorithm proposed by MacSorley [16] is perhaps the most widely used in hardware implementations because it is fast and requires less area, and its regular structure facilitates efficient implementation in VLSI. In Fig. 5 [...] 13106], which can be described in 15-bit width.

The output multiplexer-based shifter allows an additional right shift from 0 to a maximum of 8, according to the value of QP/6 and the 16x16 luma/chroma mode. In this scheme, the input data $W_{ij}$ is 16 bits wide, $MF_{ij}$ is 14 bits wide ($MF_{ij} > 0$), F* is 14 bits wide (F* > 0), and the output data $P_{ij}$ is 15 bits wide, arranged for qbits = 15. A multiplexer-based shifter is used to generate F* from the term 1/3 << qbits.
A detailed scheme of the truncated Booth multiplier used in the quantizer is shown in Fig. 6. It is composed of Booth encoders operating on groups of three bits, partial-product selection circuits (labeled SEL), a carry-save structure based on full adders (FA) and half adders (HA), and a final adder. In the Booth encoders, $MF_{ij}$ is partitioned into overlapping groups of three bits, and each group is converted into a signed digit from the set {±2, ±1, 0}, specified by three signals $\{M, 2M, S\}_k$. These signals select a single partial product $D_{km}$ from $W_k$ ($W_k = \{W_{ij}\}_k$ in Fig. 6).
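The encoding and selection steps can be modelled behaviourally as follows. The signal names M, 2M and S come from the text, but the Boolean expressions used here to derive them are one common choice and are an assumption, as are the function names and the 16-bit operand width (which yields the eight rows mentioned for Fig. 5).

```python
def booth_encode(m, nbits=16):
    """Return one (M, 2M, S) triple per overlapping 3-bit group of the multiplier m."""
    assert nbits % 2 == 0 and -(1 << (nbits - 1)) <= m < (1 << (nbits - 1))

    def bit(i):
        return 0 if i < 0 else (m >> i) & 1   # b(-1) is defined as 0

    signals = []
    for k in range(nbits // 2):
        low, mid, high = bit(2 * k - 1), bit(2 * k), bit(2 * k + 1)
        one = mid ^ low                        # M: digit magnitude is 1
        two = (high ^ mid) & (1 - one)         # 2M: digit magnitude is 2
        neg = high & (1 - (mid & low))         # S: digit is negative (group 111 encodes 0)
        signals.append((one, two, neg))
    return signals


def select_partial_product(w, one, two, neg):
    """SEL stage: pick 0, +/-w or +/-2w according to the encoder signals."""
    magnitude = w * one + 2 * w * two
    return -magnitude if neg else magnitude


def booth_multiply(w, m, nbits=16):
    """Sum the selected partial products, each weighted by 4**k."""
    return sum(select_partial_product(w, *signal) << (2 * k)
               for k, signal in enumerate(booth_encode(m, nbits)))


print(len(booth_encode(13107)))                    # 8 partial product rows
print(booth_multiply(-1234, 13107), -1234 * 13107) # identical values
```

In hardware, the negation is usually realised as a bitwise inversion plus the sign conversion bit S added in the least significant column of the row, rather than as the arithmetic negation used in this model.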
Sign extension
The addition of partial products in this Booth multiplier must be done with sign-bit extension, because it is a signed multiplication. The sign extension used here is derived from the expressions developed in [11, 12], which leads to a reduction in area. A formulation which reduces the number of full adders involved in sign extension is presented. For the sake of clarity, Fig. 6 only depicts the sign extension of each partial product for the particular multiplier scheme of Fig. 4. Here, $S_{i,16}$ (i = 0, 1, 2, ..., 6) represents the sign bit of each partial product and $D_{i,j}$ the data bits of the partial product. The problem is shown graphically in Fig. 6, where the sign must be extended over 7 rows in order to be propagated. The sign SIG of these rows can be written as the result of the following operation:

$$\mathrm{SIG} = \sum_{i=0}^{6} S_{i,16} \sum_{k=16+2i}^{30} 2^{k}$$

SIG can therefore be written as:
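A standard way to simplify this kind of sum, which may differ in detail from the formulation derived from [11, 12], replaces every run of extended sign bits by a single inverted sign bit plus one precomputed constant. The sketch below checks numerically that both forms agree modulo $2^{31}$ for every combination of the seven sign bits. The 31-bit width, the sign columns 16 + 2i and the function names are assumptions chosen to match the description above.

```python
from itertools import product

WIDTH = 31                                         # columns 0..30 of the product
ROWS = 7                                           # rows whose sign must be extended
SIGN_COLUMNS = [16 + 2 * i for i in range(ROWS)]   # column of S(i,16) in row i


def sig_extended(signs):
    """Explicit sign extension: each sign bit is replicated up to column 30."""
    return sum(s * sum(1 << k for k in range(column, WIDTH))
               for s, column in zip(signs, SIGN_COLUMNS))


def sig_compact(signs):
    """One inverted sign bit per row plus a constant that can be precomputed."""
    constant = -sum(1 << column for column in SIGN_COLUMNS)
    return sum((1 - s) << column for s, column in zip(signs, SIGN_COLUMNS)) + constant


def equal_mod_width(a, b):
    """Compare two values as WIDTH-bit words."""
    return (a - b) % (1 << WIDTH) == 0


all_equal = all(equal_mod_width(sig_extended(signs), sig_compact(signs))
                for signs in product((0, 1), repeat=ROWS))
extended_dots = sum(WIDTH - column for column in SIGN_COLUMNS)
print(all_equal, extended_dots, ROWS)
```

Under these assumptions the explicit extension places 63 sign dots in the array, while the compact form needs only 7 dots and a constant, which is where the saving in full adders comes from.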
Implementation and Comparisons
The architecture presented in Fig. 5 has been described in Verilog so that it is easily transferable to a range of silicon fabrication technologies. Moreover, it has been exhaustively verified by comparing the results with test patterns generated using C and MATLAB code. For the purpose of this research, the architecture has been synthesized with Synopsys Design Compiler using an AMS 0.35 µm standard cell library (3.3 V). The implementation shown in Fig. 1 has also been synthesized using the same technology. Layouts of both implementations are shown in Figs. 9(a) and 9(b). Synthesis results are shown in Table IV. Clearly, the proposed scheme of Fig. 5 eliminates the need to compute $|W_{ij}|$ and the subsequent sign conversion, and the arithmetic operation is performed by a compact truncated Booth multiplier. As a result, it reduces area by 26%, power by 32% and critical path delay by 21%.

Figure 9. Layout of a) the proposed quantizer (area ≈ 420 × 500 µm²) and b) the quantizer of Fig. 1 (area = 530 × 530 µm²).

For comparative analysis, [...]; only these references describing quantizer implementations have been found. Here, two quantizer architectures are proposed and their tradeoffs analyzed: 1) one optimized for area, which is based on a 4-stage pipelined architecture, and 2) one optimized for speed, which is a purely combinational circuit. The latter architecture is designed to process 16 input data in parallel every cycle. Our proposed architecture reduces the number of cells by 19% in comparison with the area-optimized scheme and is slightly faster than the speed-optimized scheme.

CONCLUSION
A modification of the quantization process and the use of a truncated multiplier have been proposed to efficiently implement the forward quantization of H.264 in a form suitable for VLSI. The proposed architecture achieves a significant reduction in hardware and power, and an increase in speed, by combining a new algorithm for computing Eq. (4) with a compact truncated Booth multiplier. Moreover, some hardware implementations for transform and quantization require several quantizers operating in parallel: 4 in [13], 8 in [14] and 16 in [15]. In these schemes, efficient quantizer architectures are necessary and the proposed quantizer is highly suitable.
