Introduction
H.264/AVC (Advanced Video Coding) is the latest standard for video coding established by the Joint Video Team of ITU-T VCEG and ISO/IEC MPEG (Wiegand et al., 2003) (Sühring, 2010) (Links, 2010). This standard has many innovations, such as hybrid prediction/transform coding of intra frames and integer transforms (Richardson, 2004). Fig. 1 presents a simplified block diagram of the H.264/AVC encoder with the following main blocks: motion estimation (ME), motion compensation (MC), intra prediction, forward transform (FT), forward quantization (FQ), inverse quantization or re-scaling (IQ), inverse transform (IT), entropy coding and de-blocking filter, among others. Initially, most of the work done on H.264 was oriented toward its software implementation. However, in recent years the contributions to the hardware implementation of H.264 have increased greatly, enabling the implementation of fast architectures for real-time video applications (Lin et al., 2008) (Finchelstein et al., 2009). The initial version of H.264/AVC used a transform hierarchy based on three transforms computed in integer arithmetic, two of size 4×4 and one of size 2×2. In July 2004, the first amendment to the H.264 standard was presented, named Fidelity Range Extensions (FRExt) (JVT, 2004), in which a new set of tools was specified to increase the high-fidelity video encoding efficiency, focusing on professional applications and high-definition videos. One [...] and/or floating arithmetic based on post- and pre-scaling matrices. Section 4 describes the proposed architecture for implementing the configurable process of transform and quantization for an 8×8 luma block capable of operating with different bit-depths (8 bits up to 14 bits).
This section includes a description of the main modules: 1D configurable forward and inverse transform, 8×8 transpose register and the optimized arithmetic circuit needed to perform the computation of bit-depth-dependent quantization and rescaling in a unified structure. A review of the state-of-the-art of the previous implementations and references is also included. However, most hardware implementations only operate in 8 bits and further bit-depths have not been taken into account. Section 5 shows the characteristics and the performance of the proposed processor as well as comparisons with other published and related implementations. These comparisons are made in terms of area, speed and power.
8×8 Transform in the H.264/AVC
The FRExt amendment to H.264 proposes a scheme based on an 8×8 integer approximation of the DCT, added to the existing 4×4 transform in order to improve high-definition video compression (Gordon & Wiegand, 2004). This transform provides excellent compression performance in high-resolution video streams with a level of complexity only slightly higher than that of the 4×4 transform. Although not all of its coefficients are powers of 2, it can be implemented using only additions and shifts, with no multiplications. Moreover, it uses integer arithmetic, which eliminates mismatch between the encoder and the decoder. The forward 8×8 integer transform is applied to each block of the residual luminance component (x) of the input video stream as follows

\[
X = T \, x \, T^{T} \tag{1}
\]

where T is a matrix of dimension 8×8 which represents the transform kernel, defined as

\[
T = \frac{1}{8}
\begin{bmatrix}
8 & 8 & 8 & 8 & 8 & 8 & 8 & 8 \\
12 & 10 & 6 & 3 & -3 & -6 & -10 & -12 \\
8 & 4 & -4 & -8 & -8 & -4 & 4 & 8 \\
10 & -3 & -12 & -6 & 6 & 12 & 3 & -10 \\
8 & -8 & -8 & 8 & 8 & -8 & -8 & 8 \\
6 & -12 & 3 & 10 & -10 & -3 & 12 & -6 \\
4 & -8 & 8 & -4 & -4 & 8 & -8 & 4 \\
3 & -6 & 10 & -12 & 12 & -10 & 6 & -3
\end{bmatrix} \tag{2}
\]
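As a minimal sketch of the forward transform defined above, the following Python fragment evaluates X = T·x·T^T in exact rational arithmetic and checks that a row pass followed by a column pass gives the same result. The helper names (`matmul`, `forward_2d`, `forward_2d_separable`) are illustrative and do not come from the JM reference software.

```python
from fractions import Fraction

# Integer kernel 8*T; rows follow the transform kernel T given in the text.
T8 = [
    [8, 8, 8, 8, 8, 8, 8, 8],
    [12, 10, 6, 3, -3, -6, -10, -12],
    [8, 4, -4, -8, -8, -4, 4, 8],
    [10, -3, -12, -6, 6, 12, 3, -10],
    [8, -8, -8, 8, 8, -8, -8, 8],
    [6, -12, 3, 10, -10, -3, 12, -6],
    [4, -8, 8, -4, -4, 8, -8, 4],
    [3, -6, 10, -12, 12, -10, 6, -3],
]
T = [[Fraction(v, 8) for v in row] for row in T8]


def matmul(a, b):
    """Plain matrix product of two matrices stored as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def transpose(m):
    return [list(col) for col in zip(*m)]


def forward_2d(x):
    """Direct evaluation of X = T . x . T^T."""
    return matmul(matmul(T, x), transpose(T))


def forward_2d_separable(x):
    """Row pass p = x . T^T followed by column pass X = T . p."""
    p = matmul(x, transpose(T))
    return matmul(T, p)


flat = [[1] * 8 for _ in range(8)]                    # constant residual block
ramp = [[i - j for j in range(8)] for i in range(8)]  # arbitrary test block
gram = matmul(T, transpose(T))                        # T . T^T
```

The matrix `gram` is diagonal but its diagonal entries are not all equal, so the rows of T are orthogonal with different norms; in integer transforms of this kind the differing norms are typically absorbed by position-dependent quantization factors.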
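In the JM reference software (Sühring, 2010), the separability of this 8×8 transform is used to implement equation (1) as a 1D horizontal transform (Eq. (3)) followed by a 1D vertical transform (Eq. (4)) according to the following equations

\[
p = x \, T^{T} = x \, T_1^{T} \, T_2^{T} \, T_3^{T} \tag{3}
\]

\[
X = T \, p = T_3 \, T_2 \, T_1 \, p \tag{4}
\]

Equations (3) and (4) are obtained from the decomposition of T as a sparse matrix product of the matrices $T_1$, $T_2$ and $T_3$, defined as

\[
T_1 =
\begin{bmatrix}
1 & 0 & 0 & 0 & 0 & 0 & 0 & 1 \\
0 & 1 & 0 & 0 & 0 & 0 & 1 & 0 \\
0 & 0 & 1 & 0 & 0 & 1 & 0 & 0 \\
0 & 0 & 0 & 1 & 1 & 0 & 0 & 0 \\
1 & 0 & 0 & 0 & 0 & 0 & 0 & -1 \\
0 & 1 & 0 & 0 & 0 & 0 & -1 & 0 \\
0 & 0 & 1 & 0 & 0 & -1 & 0 & 0 \\
0 & 0 & 0 & 1 & -1 & 0 & 0 & 0
\end{bmatrix} \tag{5}
\]

\[
T_2 =
\begin{bmatrix}
1 & 0 & 0 & 1 & 0 & 0 & 0 & 0 \\
0 & 1 & 1 & 0 & 0 & 0 & 0 & 0 \\
1 & 0 & 0 & -1 & 0 & 0 & 0 & 0 \\
0 & 1 & -1 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 3/2 & 1 & 1 & 0 \\
0 & 0 & 0 & 0 & 1 & 0 & -3/2 & -1 \\
0 & 0 & 0 & 0 & 1 & -3/2 & 0 & 1 \\
0 & 0 & 0 & 0 & 0 & 1 & -1 & 3/2
\end{bmatrix} \tag{6}
\]

\[
T_3 =
\begin{bmatrix}
1 & 1 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 1 & 0 & 0 & 1/4 \\
0 & 0 & 1 & 1/2 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 1 & 1/4 & 0 \\
1 & -1 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & -1/4 & 1 & 0 \\
0 & 0 & 1/2 & -1 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 1/4 & 0 & 0 & -1
\end{bmatrix} \tag{7}
\]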
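The sparse factorization can be checked numerically. The sketch below assumes the three stage matrices implied by the standard butterfly (sums and differences in stage 1, weights 3/2 in stage 2, weights 1/2 and 1/4 in stage 3) and verifies that their product T3·T2·T1 reproduces the kernel T exactly; the variable names are illustrative.

```python
from fractions import Fraction as F

h, q, t = F(1, 2), F(1, 4), F(3, 2)  # the three non-trivial weights

T1 = [
    [1, 0, 0, 0, 0, 0, 0, 1],
    [0, 1, 0, 0, 0, 0, 1, 0],
    [0, 0, 1, 0, 0, 1, 0, 0],
    [0, 0, 0, 1, 1, 0, 0, 0],
    [1, 0, 0, 0, 0, 0, 0, -1],
    [0, 1, 0, 0, 0, 0, -1, 0],
    [0, 0, 1, 0, 0, -1, 0, 0],
    [0, 0, 0, 1, -1, 0, 0, 0],
]
T2 = [
    [1, 0, 0, 1, 0, 0, 0, 0],
    [0, 1, 1, 0, 0, 0, 0, 0],
    [1, 0, 0, -1, 0, 0, 0, 0],
    [0, 1, -1, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, t, 1, 1, 0],
    [0, 0, 0, 0, 1, 0, -t, -1],
    [0, 0, 0, 0, 1, -t, 0, 1],
    [0, 0, 0, 0, 0, 1, -1, t],
]
T3 = [
    [1, 1, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 1, 0, 0, q],
    [0, 0, 1, h, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 1, q, 0],
    [1, -1, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, -q, 1, 0],
    [0, 0, h, -1, 0, 0, 0, 0],
    [0, 0, 0, 0, q, 0, 0, -1],
]
T = [[F(v, 8) for v in row] for row in [
    [8, 8, 8, 8, 8, 8, 8, 8],
    [12, 10, 6, 3, -3, -6, -10, -12],
    [8, 4, -4, -8, -8, -4, 4, 8],
    [10, -3, -12, -6, 6, 12, 3, -10],
    [8, -8, -8, 8, 8, -8, -8, 8],
    [6, -12, 3, 10, -10, -3, 12, -6],
    [4, -8, 8, -4, -4, 8, -8, 4],
    [3, -6, 10, -12, 12, -10, 6, -3],
]]


def matmul(a, b):
    """Plain matrix product of two matrices stored as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


# Stage 1 is applied first, so it is the right-most factor.
product = matmul(T3, matmul(T2, T1))
```

Each stage matrix has at most four non-zero entries per row, which is what allows a butterfly implementation with only additions and shifts.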
Table 1, which is directly extracted from the JM reference software, shows the expressions used to compute the 1D transforms involved in equations (3) and (4). In this table, IF denotes the vector of input values (IF represents either each row of x in (3) or each column of p in (4)), OF denotes the transformed output vector (OF represents either each row of p in (3) or each column of X in (4)), and a and b are internal variables. In a 3-stage butterfly, stage 1 implements the operations involved in $T_1$, stage 2 implements $T_2$ and stage 3 implements $T_3$. The multiplications by the coefficients 1/2, 1/4 and 3/2=1+1/2 are implemented by means of shift-right (>>) operations, which cause truncation errors that propagate through the datapath. To avoid mismatch between the encoder and decoder, any implementation of the 1D transform must therefore comply exactly with the arithmetic described in Table 1.

The inverse 8×8 integer transform of an 8×8 block of coefficients (Z) is defined through the equation

\[
z = T^{T} \, Z \, T \tag{8}
\]
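The three-stage forward butterfly can be sketched in Python as follows. This is an illustration consistent with the stage matrices T1, T2 and T3 and with the shift-based arithmetic described above, not a copy of Table 1; the function name `forward_1d` is hypothetical, and Python's `>>` on negative integers behaves as an arithmetic (flooring) shift, like two's-complement hardware.

```python
def forward_1d(x):
    """1D forward 8-point integer transform using only adds, subtracts and shifts."""
    # Stage 1 (T1): sums and differences of mirrored samples.
    a = [x[0] + x[7], x[1] + x[6], x[2] + x[5], x[3] + x[4],
         x[0] - x[7], x[1] - x[6], x[2] - x[5], x[3] - x[4]]
    # Stage 2 (T2): the factor 3/2 is computed as v + (v >> 1).
    b = [a[0] + a[3],
         a[1] + a[2],
         a[0] - a[3],
         a[1] - a[2],
         a[5] + a[6] + (a[4] + (a[4] >> 1)),
         a[4] - a[7] - (a[6] + (a[6] >> 1)),
         a[4] + a[7] - (a[5] + (a[5] >> 1)),
         a[5] - a[6] + (a[7] + (a[7] >> 1))]
    # Stage 3 (T3): the factors 1/2 and 1/4 are right shifts by 1 and 2.
    return [b[0] + b[1],
            b[4] + (b[7] >> 2),
            b[2] + (b[3] >> 1),
            b[5] + (b[6] >> 2),
            b[0] - b[1],
            b[6] - (b[5] >> 2),
            (b[2] >> 1) - b[3],
            (b[4] >> 2) - b[7]]


# An input of 8 at position 0 returns the first column of 8*T without truncation.
column0 = forward_1d([8, 0, 0, 0, 0, 0, 0, 0])
# An input of 1 at position 0 shows the truncation introduced by the shifts.
truncated = forward_1d([1, 0, 0, 0, 0, 0, 0, 0])
```

The second call illustrates why encoder and decoder must use exactly the same arithmetic: the exact result would contain the fractions 3/2, 5/4, 3/4, 1/2 and 3/8, which the shifts truncate.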
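As for the forward transform, the 8×8 inverse transform can be computed as the concatenation of a 1D horizontal inverse transform (Eq. (9)) and a 1D vertical inverse transform (Eq. (10)) through the decomposition of T as a sparse matrix product of the matrices $G_1$, $G_2$ and $G_3$, giving

\[
q = Z \, T = Z \, G_1 \, G_2 \, G_3 \tag{9}
\]

\[
z = T^{T} \, q = G_3^{T} \, G_2^{T} \, G_1^{T} \, q \tag{10}
\]

The $G_1$, $G_2$ and $G_3$ matrices are defined as

\[
G_1 =
\begin{bmatrix}
1 & 0 & 0 & 0 & 1 & 0 & 0 & 0 \\
0 & 0 & 0 & 1 & 0 & -1 & 0 & 3/2 \\
0 & 0 & 1/2 & 0 & 0 & 0 & 1 & 0 \\
0 & -1 & 0 & -3/2 & 0 & 0 & 0 & 1 \\
1 & 0 & 0 & 0 & -1 & 0 & 0 & 0 \\
0 & 1 & 0 & 0 & 0 & 3/2 & 0 & 1 \\
0 & 0 & -1 & 0 & 0 & 0 & 1/2 & 0 \\
0 & -3/2 & 0 & 1 & 0 & 1 & 0 & 0
\end{bmatrix} \tag{11}
\]

\[
G_2 =
\begin{bmatrix}
1 & 0 & 0 & 0 & 0 & 0 & 1 & 0 \\
0 & 1 & 0 & 0 & 0 & 0 & 0 & -1/4 \\
0 & 0 & 1 & 0 & -1 & 0 & 0 & 0 \\
0 & 0 & 0 & 1 & 0 & 1/4 & 0 & 0 \\
0 & 0 & 1 & 0 & 1 & 0 & 0 & 0 \\
0 & 0 & 0 & 1/4 & 0 & -1 & 0 & 0 \\
1 & 0 & 0 & 0 & 0 & 0 & -1 & 0 \\
0 & 1/4 & 0 & 0 & 0 & 0 & 0 & 1
\end{bmatrix} \tag{12}
\]

\[
G_3 =
\begin{bmatrix}
1 & 0 & 0 & 0 & 0 & 0 & 0 & 1 \\
0 & 0 & 0 & 1 & -1 & 0 & 0 & 0 \\
0 & 1 & 0 & 0 & 0 & 0 & 1 & 0 \\
0 & 0 & 1 & 0 & 0 & -1 & 0 & 0 \\
0 & 0 & 1 & 0 & 0 & 1 & 0 & 0 \\
0 & 1 & 0 & 0 & 0 & 0 & -1 & 0 \\
0 & 0 & 0 & 1 & 1 & 0 & 0 & 0 \\
1 & 0 & 0 & 0 & 0 & 0 & 0 & -1
\end{bmatrix} \tag{13}
\]

Recent Advances on Video Coding, www.intechopen.com

Table 2 shows the expressions for computing these 1D transforms used in the JM reference software. In a similar way to the forward 1D transform, a 3-stage butterfly structure is used where stage 1 implements the operations specified in $G_1$, stage 2 those in $G_2$ and stage 3 those in $G_3$. Here, II denotes the vector of input values (II represents either each row of Z in equation (9) or each column of q in (10)), OI denotes the transformed output vector (OI represents either each row of q in equation (9) or each column of z in (10)), and ia and ib are internal variables.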
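The inverse butterfly can be sketched in the same way. The fragment below follows the 8-point inverse butterfly commonly given for H.264 (assumed here to correspond to Table 2, which is not reproduced in this text); `inverse_1d` is a hypothetical name, and `ia` and `ib` play the role of the internal variables mentioned above.

```python
def inverse_1d(v):
    """1D inverse 8-point integer transform (adds, subtracts and shifts only)."""
    # Stage 1: even inputs feed ia[0], ia[2], ia[4], ia[6]; odd inputs the rest.
    ia = [v[0] + v[4],
          -v[3] + v[5] - v[7] - (v[7] >> 1),
          (v[2] >> 1) - v[6],
          v[1] + v[7] - v[3] - (v[3] >> 1),
          v[0] - v[4],
          -v[1] + v[7] + v[5] + (v[5] >> 1),
          v[2] + (v[6] >> 1),
          v[3] + v[5] + v[1] + (v[1] >> 1)]
    # Stage 2: the odd part applies the 1/4 weights as right shifts by 2.
    ib = [ia[0] + ia[6],
          ia[1] + (ia[7] >> 2),
          ia[4] + ia[2],
          ia[3] + (ia[5] >> 2),
          ia[4] - ia[2],
          (ia[3] >> 2) - ia[5],
          ia[0] - ia[6],
          ia[7] - (ia[1] >> 2)]
    # Stage 3: final sums and differences of even and odd parts.
    return [ib[0] + ib[7], ib[2] + ib[5], ib[4] + ib[3], ib[6] + ib[1],
            ib[6] - ib[1], ib[4] - ib[3], ib[2] - ib[5], ib[0] - ib[7]]


# A single coefficient of 8 at index k returns row k of the integer kernel 8*T.
row0 = inverse_1d([8, 0, 0, 0, 0, 0, 0, 0])
row1 = inverse_1d([0, 8, 0, 0, 0, 0, 0, 0])
```

Feeding a unit impulse at coefficient index k therefore reads out the k-th basis function, which is a convenient sanity check for a hardware test bench.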
Quantization and rescaling in the H.264/AVC
The forward quantization process in H.264/AVC FRExt is applied to the transformed coefficients (X) computed in equations (3) and (4) according to the following equations

\[
Y_{i,j} = \operatorname{sign}(X_{i,j}) \cdot \left( \left( |X_{i,j}| \cdot QF_{i,j} + lev\_off \right) >> qbits \right) \tag{14}
\]

where

\[
qbits = \lfloor QP_{sc} / 6 \rfloor + 16 \tag{15}
\]

In this equation, $QP_{sc}$ is the scaled quantization parameter defined as

\[
QP_{sc} = QP + 6 \, (bd - 8) \tag{16}
\]

QP takes an integer value (from 0 to 51) and determines the coarseness of the quantization process, enabling the encoder to control the trade-off between bit rate and quality. The parameter bd represents the bit-depth of the video content, 8 ≤ bd ≤ 14. Many professional applications, such as studio and HD applications, require support for higher bit depths. In H.264/AVC, 7 of 11 profiles support bit depths above 8 bits, starting from High10, which supports 10 bits. High 444 Predictive and some related profiles support up to 14 bits. As can be seen in equation (16), $QP_{sc}$ depends on the quantization parameter QP as well as on bd; note that $QP_{sc}$=QP for bd=8 bits. This means that $QP_{sc}$ ranges from 0 to 51 when bd=8 and from 36 to 87 when bd=14. The approximation factor lev_off used in equation (14) is defined in equation (17) and depends on qbits and on the prediction mode (intra or inter).
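The dependence of the scaled quantization parameter and of qbits on the bit depth can be sketched as follows. The formulas QP_sc = QP + 6·(bd − 8) and qbits = ⌊QP_sc/6⌋ + 16 are assumptions chosen because they reproduce the ranges stated in the text (QP_sc = QP at bd = 8, and 36 to 87 at bd = 14); the function names are illustrative.

```python
def scaled_qp(qp, bd):
    """Scaled quantization parameter for a bit depth bd between 8 and 14."""
    if not 0 <= qp <= 51:
        raise ValueError("QP must lie in 0..51")
    if not 8 <= bd <= 14:
        raise ValueError("bit depth must lie in 8..14")
    return qp + 6 * (bd - 8)


def quant_shift(qp_sc):
    """Right shift (qbits) applied at the end of the forward quantization."""
    return qp_sc // 6 + 16


# Range of QP_sc and qbits for every supported bit depth.
ranges = {bd: (scaled_qp(0, bd), scaled_qp(51, bd),
               quant_shift(scaled_qp(0, bd)), quant_shift(scaled_qp(51, bd)))
          for bd in range(8, 15)}
```

Each additional bit of depth adds 6 to QP_sc and therefore exactly one bit to qbits, which keeps the effective quantization step proportional to the sample range.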
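The inverse quantization or rescaling "re-scales" the quantized transform coefficients (Y) computed in (14). The rescaling process, which is different from that used in the 4×4 transform (Malvar et al., 2006), is defined by equation (21), directly extracted from the JM reference software, in terms of the 8×8 rescaling matrix

\[
QI =
\begin{bmatrix}
ki_0 & ki_1 & ki_2 & ki_1 & ki_0 & ki_1 & ki_2 & ki_1 \\
ki_1 & ki_3 & ki_4 & ki_3 & ki_1 & ki_3 & ki_4 & ki_3 \\
ki_2 & ki_4 & ki_5 & ki_4 & ki_2 & ki_4 & ki_5 & ki_4 \\
ki_1 & ki_3 & ki_4 & ki_3 & ki_1 & ki_3 & ki_4 & ki_3 \\
ki_0 & ki_1 & ki_2 & ki_1 & ki_0 & ki_1 & ki_2 & ki_1 \\
ki_1 & ki_3 & ki_4 & ki_3 & ki_1 & ki_3 & ki_4 & ki_3 \\
ki_2 & ki_4 & ki_5 & ki_4 & ki_2 & ki_4 & ki_5 & ki_4 \\
ki_1 & ki_3 & ki_4 & ki_3 & ki_1 & ki_3 & ki_4 & ki_3
\end{bmatrix} \tag{22}
\]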
whose elements are obtained by evaluating the expression
Here, MI is the rescaling factor matrix specified as 
Variable bit-depth processor for the 8×8 transform and quantization
Fig. 2 shows the block diagram of the proposed variable bit-depth processor for real-time implementation of the complete 8×8 transform and quantization coding process in H.264/AVC. This processor includes the following main modules: configurable forward and inverse 1D integer transform, bit-depth-dependent quantization and rescaling module, and transpose register memory. This architecture, which fulfils the requirements of H.264/AVC FRExt, has been conceived to operate with different bit-depths (bd), from 8 bits up to 14 bits, with the aim of achieving high performance with reduced hardware complexity. In order to provide an efficient processor, hardware solutions have been developed for the different circuit modules. The 8×8 forward and inverse transforms are calculated using the separability property, which simplifies the architecture to a single configurable 1D forward (FT)/inverse (IT) transform processor and a transpose register array. Forward quantization (FQ) and rescaling (IQ) operations are computed in the same circuit for the different bit-depth requirements. Here, new expressions are proposed that allow efficient hardware implementation by avoiding the sign conversion and minimizing the arithmetic operations involved. Furthermore, an exhaustive analysis of the dynamic range of the datapath was performed to fix the optimum bus widths, with the aim of reducing the size of the circuit while avoiding overflow. Finally, the critical paths of the various computing units have been carefully analyzed and balanced using a pipeline scheme in order to maximize the operating frequency without introducing excessive latency. This circuit processes 8 input data in parallel, starting by reading the residual luminance component (x) row by row until the entire 8×8 input block is read. The forward 1D transform module generates the intermediate coefficients p, which are stored in the transpose register row-wise.
After 8 clock cycles, these coefficients are read column-wise and processed again in the 1D transform module. Then, the resulting X coefficients are quantized column by column in parallel in the quantization and rescaling module and stored in the transpose register column-wise. On finishing this operation, the quantized coefficients (Y) are rescaled row by row and the results (Z) are sent to the inverse 1D transform, whose output data (q) are stored in the transpose register row-wise. Finally, the coefficients q are fetched from the transpose register column-wise to be processed in the inverse 1D transform to obtain the recovered residual luminance (z).
Forward and Inverse 8×8 transform
The 8×8 transform proposed in FRExt for addition to the JVT specification of H.264/AVC is based on the fact that at SD resolutions and above, the use of block sizes smaller than 8×8 is limited. One of the first papers related to this matter (Amer et al., 2005) described a pipelined FPGA implementation of a simplified 8×8 transform and quantization. Another FPGA implementation, based on an algebraic integer quantization approach to computing the 8×8 transform, was presented in (Wahid et al., 2006). (Silva et al., 2007) proposed a high-throughput architecture of the forward 8×8 transform to encode high-definition videos in real time with a latency of 5 clock cycles to process the 1D transform. This architecture was synthesized in an FPGA with a minimum period of 8.13 ns and in a TSMC 0.35 µm CMOS standard cell technology leading to a period of 8.05 ns. Recently, (Park & Ogunfunmi, 2009) presented a reduced and parallel FPGA implementation of an 8×8 integer transform, quantization and scaling for H.264. Here, each pixel is processed one by one on a simplified pipelined architecture without multiplication.
In the adaptive block-size transform of FRExt, different kinds of transforms are required: 8×8 forward/inverse transform, 4×4 forward/inverse transform, 4×4 forward/inverse Hadamard transform and 2×2 forward/inverse Hadamard transform. In order to reduce hardware, diverse configurable data-path architectures that support all of these transforms in a unified scheme have been proposed. Examples of this kind of architecture include: the multi-transform processor where the quantization is performed at the pace demanded by the entropy coder in (Bruguera & Osorio, 2006), the low hardware cost architecture suitable for VLSI implementations in (Fan, 2006), the reduced hardware and high latency architecture in (Chao et al., 2007), the high-performance architecture for high-definition applications in (Ma et al., 2007), the IP design to be implemented on an ASIP-controlled SoC platform in (Ngo et al., 2008), the high-performance, low-power unified transform architecture in (Choi et al., 2008), the highly parallel joint circuit architecture in [...], and the fast, high-throughput and cost-effective implementation in (Hwangbo & Kyung, 2010).
Initially, the specifications of H.264 adopted a 4×4 integer approximation, but significant compression performance gains have been reported for larger transforms at High-Definition (HD) resolutions. Thus, a new 8×8 integer transform was proposed in the Fidelity Range Extensions (FRExt) to be added to the previously existing specifications, which were verified in SD resolutions. In fact, the use of block sizes of 8×8 and bigger is dominant.
Following this assumption, we propose an architecture for computing the 8×8 forward/inverse transform based on a configurable high-throughput 1D processor, conceived to implement the arithmetic operations described in Table 1 and Table 2 with two objectives. First, to avoid mismatches between the encoder and decoder, the operations cannot be implemented in any way other than that specified in these tables, which are directly extracted from the JM reference software. Second, these equations share compatible arithmetic, which leads to hardware reduction if a configurable data-path is used. To comply with these prerequisites, the arithmetic operations presented in Tables 1 and 2 can be implemented in terms of a three-processor architecture that fulfils the requirements of H.264. These processors, as shown in Fig. 3, are named I/O, even and odd. The operation mode, forward (FT) or inverse (IT), is set by multiplexers which select the inputs and modify the inner arithmetic operations of each processor. The schematic at the bottom left in Fig. 3 represents the equivalent scheme for computing the forward 1D transform. In this configuration, the eight elements of IF are input to the I/O processor and its outputs run in parallel into the even and odd processors to generate the output OF. In the first 1D transform, the input IF takes each row of x and generates each row of p at the output OF according to equation (3), and in the second one, each column of p is processed to generate each column of X according to equation (4). In contrast, the schematic at the bottom right shows the equivalent scheme for the inverse 1D transform. The input data II are connected to the even and odd processors while the output data OI are generated in the I/O processor.
In this configuration, the first inverse 1D transform processes each row of Z, generating each row of q at the output OI according to equation (9), and in the second one, q is read column by column, generating each column of z according to equation (10). Fig. 4 shows the data-path of the I/O, even and odd processors. The I/O processor implements the arithmetic operations involved in $T_1$ (Stage 1 in Table 1) and in $G_3$ (Stage 3 in Table 2). It is made up exclusively of adders and subtractors whose inputs are arranged according to the operation mode: forward or inverse. The operations of $T_2$, $G_2$, $T_3$ and $G_1$, in turn, are split between two processors (even and odd) aiming for maximum compatibility. As a result, the arithmetic of the even processor varies depending on the operation mode as
This means that this processor is configurable by means of multiplexers used to modify the data path according to the operation mode. In a similar way, the odd processor implements the following equations
The entire circuit that computes the 1D transform takes a total of 32 additions/subtractions and 10 right-shifts, the latter being built by means of data-bus wiring (no additional hardware is necessary). To prevent overflow in the computation of the transform, we consider the biggest bit-depth of 14 bits for each luminance sample; this means an unsigned integer number from 0 to 16383. However, this processor operates on the residual luminance, whose value lies within ±16383, so 15 bits are necessary for its representation. If k represents the input bus width, then k=15 bits for the first forward 1D transform and k=18 for the second one. The intermediate data a0 to a7 must be of k+1 bits, b0 to b3 of k+2, b4 to b7 of k+3 and, finally, the output data of k+3. The range of the coefficients is ±16383·8=±131064 (18 bits) for the first 1D transform, and ±131064·8=±1048512 (21 bits) for the second one. However, the quantization and scaling process increases the data-path by 1 bit, giving input data of 22 bits before calculating the inverse 8×8 transform; this bit width is what limits the data-path of the whole transform module to prevent overflow. This means that all arithmetic in the forward and inverse 1D transform module is performed in 22 bits and the latency is 2 clock cycles.
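The bus widths quoted above can be reproduced with a short calculation. The sketch assumes two's-complement signed buses and a worst-case gain of 8 per 1D pass, as stated in the text; `signed_bits` is a hypothetical helper.

```python
def signed_bits(max_abs):
    """Two's-complement width needed to hold every value in [-max_abs, +max_abs]."""
    return max_abs.bit_length() + 1


MAX_BIT_DEPTH = 14
GAIN_PER_PASS = 8                                 # worst-case gain of one 1D transform

residual_max = (1 << MAX_BIT_DEPTH) - 1           # residual lies within +/-16383
first_pass_max = residual_max * GAIN_PER_PASS     # after the row transform
second_pass_max = first_pass_max * GAIN_PER_PASS  # after the column transform

widths = [signed_bits(residual_max),
          signed_bits(first_pass_max),
          signed_bits(second_pass_max)]
```

The three widths correspond to the input of the first pass, the input of the second pass and the output of the forward 2D transform, before the extra bit added by quantization and scaling.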
Transpose register array
The transpose memory stores 8×8 data and allows simultaneous read and write operations while performing matrix transposition. To achieve this, the 8 input data are read out of the buffer column-wise if the previous intermediate data were written into the buffer row-wise, and vice versa. A transpose buffer based on D-type flip-flops (DFF) (Zhang & Meng, 2009) has been chosen as it is more suitable for pipeline architectures than other proposed architectures based on RAM memories. Indeed, solutions based on a single RAM (Do & Le, 2010) lead to high latency, while those based on duplication of the RAMs (one for processing columns and the other for rows) have a high area cost (Ruiz & Michell, 1998), and those based on banks of SRAMs have a high cost in area (Bojnordi et al., 2006) or in alignment modules. Fig. 5 shows the schematic of an 8×8 transpose register array with 22-bit elements, whose basic cell is a DFF and a multiplexer. The DFFs of the array are interconnected via 2:1 multiplexers forming 8 shift-registers of length 8 either in the horizontal direction (columns) or in the vertical direction (rows). A selection signal controls the direction of shift in the registers. The loading and shifting mode in the buffer alternates each time a new block of input data is processed: the even (odd) 8×8 block is stored by columns (rows) in the buffer. As a result, the transpose buffer has a parallel input/output structure and the data are transposed on the fly, supporting a continuous data flow with the smallest possible size and minimal latency (8 clock cycles).
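The behaviour of such a register array can be modelled with a small software sketch. The class below is a behavioural illustration only (not the circuit of Fig. 5): it assumes that 8 words enter and 8 words leave on every clock, and that the shift direction toggles after each block; the names `TransposeRegister` and `stream` are hypothetical.

```python
class TransposeRegister:
    """Behavioural model of an n x n register array with a switchable shift direction."""

    def __init__(self, n=8):
        self.n = n
        self.cells = [[0] * n for _ in range(n)]
        self.vertical = True  # True: rows shift upwards; False: columns shift left

    def clock(self, data_in):
        """Shift in n words and return the n words that leave the array."""
        if self.vertical:
            out = list(self.cells[0])                        # top row leaves
            self.cells = self.cells[1:] + [list(data_in)]    # new row enters at the bottom
        else:
            out = [row[0] for row in self.cells]             # left column leaves
            self.cells = [row[1:] + [data_in[i]]             # new column enters at the right
                          for i, row in enumerate(self.cells)]
        return out

    def toggle(self):
        self.vertical = not self.vertical


def stream(blocks, n=8):
    """Feed blocks row by row without stalls; return each block transposed."""
    reg = TransposeRegister(n)
    flush = [[0] * n for _ in range(n)]       # dummy block that pushes the last one out
    outputs = []
    for block in blocks + [flush]:
        outputs.append([reg.clock(row) for row in block])
        reg.toggle()                          # alternate direction on every block
    return outputs[1:]                        # outputs[k + 1] is block k, transposed


A = [[8 * i + j for j in range(8)] for i in range(8)]
B = [[100 + 8 * i + j for j in range(8)] for i in range(8)]
transposed = stream([A, B])
```

While block B is being written, block A is read out transposed, so the model exhibits the 8-cycle latency and the uninterrupted data flow described above with a single 8×8 array.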
Quantization and rescaling
H.264 assumes a scalar quantizer that avoids division and/or floating point arithmetic. Most of the proposed quantization and rescaling hardware solutions attempt to directly implement the expressions defined in the standard, but only a few facilitate its implementation. Moreover, all of them work with an 8-bit bit-depth and higher bit-depths are not considered. (Amer et al., 2005) presented a simple forward quantizer FPGA design to be run on a Digital Signal Processor. (Wahid et al., 2006) proposed an Algebraic Integer Quantization to reduce the complexity of the quantization and rescaling parameters required for H.264. The architecture described by (Bruguera & Osorio, 2006) is based on a prediction scheme that allows parallel quantization by detecting zero coefficients to facilitate the entropy encoding. In (Chungan et al., 2007), the multiplier and RAM/ROM were removed by using a 16 parallel shift-adder scheme. An inverse quantizer based on a 6-stage pipelined dual-issue VLIW-SIMD architecture was proposed in (Lee, J.J. et al., 2008). (Pastuszak, 2008) presented an FPGA architecture capable of processing up to 32 coefficients per clock cycle. (Lee & Cho, 2008) proposed a scheme to be applied in several video compression standards such as JPEG, MPEG-1/2/4, H.264 and VC-1, where only one multiplier is used to minimize circuit size. A simplification of the quantization process that reduces overhead logic by removing absolute values leads to a decrease of around 20% in power consumption (Owaida et al., 2009). Another simplification consists of replacing the multiplier with adders and shifters to reduce hardware (Park & Ogunfunmi, 2009). An inverse quantization that adopts three kinds of inverse quantizers based on prediction modes and coefficients used in an H.264/AVC decoder was presented in (Chao et al., 2009). (Husemann et al., 2010) proposed a four forward parallel quantizer architecture implemented on a commercial FPGA board.
We propose a single circuit to compute the forward quantization and rescaling for different bit-depth requirements. Both procedures involve multiplication, addition and shifting operations, and a configurable architecture enables the same module to perform all the specific operations in order to save hardware. The forward quantization (FQ) operates, cycle by cycle, on the coefficients of each column of the forward 8×8 transform (X), and the quantized coefficients (Y) are generated according to equation (14). In this equation, the modulus operation is necessary because the arithmetic operation ">>qbits" applied to a two's-complement number rounds toward minus infinity rather than toward zero, which causes errors for $X_{i,j}<0$. For example, the integer −3 in a 4-bit two's-complement representation is 1101. The operation −3>>2 should be 0, but 1101>>2 gives 1111, that is, −1. To resolve this error, (1<<n)−1 must be added to the negative number, where n is the number of right shifts. Thus, (1101+(1<<2)−1)>>2 is 0. Applying this procedure, the absolute value of $X_{i,j}$ can be eliminated from equation (14) by assigning lev_off the same sign as $X_{i,j}$. To do this, a term (1<<qbits)−1 must be added when $X_{i,j}<0$. Then, equation (14) can be rewritten as

\[
Y_{i,j} = \left( X_{i,j} \cdot QF_{i,j} + lev \right) >> qbits \tag{28}
\]

where

\[
lev =
\begin{cases}
lev(+) = lev\_off, & X_{i,j} \ge 0 \\
lev(-) = (1 << qbits) - 1 - lev\_off, & X_{i,j} < 0
\end{cases} \tag{29}
\]

Therefore, neither the computation of $|X_{i,j}|$ nor a subsequent sign conversion is necessary in equation (28), which leads to a more efficient hardware implementation than a direct implementation of equation (14). The design that implements equation (28) must be able to manage up to a 14-bit depth, that is, bd=14. In this case, equation (16) shows that $QP_{sc}$ varies from 36 to 87 as QP does from 0 to 51, and qbits from 22 to 30 according to equation (15). From equations (17) and (29), lev(+) for intra mode varies from 1396736 to 357564416, lev(−) for intra mode from 2797567 to 716177407, lev(+) for inter mode from 700416 to 179306496 and lev(−) for inter mode from 3493887 to 894435327. These bounds fix the bit width of lev to 30 bits. Table 3 gives the definition of lev according to the sign of $X_{i,j}$ and whether intra is 0 or 1, which can be easily implemented using basic logic and shift operations.
Table 3. Definition of lev.
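The equivalence between the quantizer with absolute value and the sign-free form can be checked exhaustively over a range of inputs. In the sketch below, the offsets 682 (intra) and 342 (inter), shifted left by qbits − 11, are an assumption inferred from the bounds quoted in the text, and the multiplication factor 12000 is an arbitrary placeholder rather than a value from the standard; all function names are illustrative.

```python
def lev_off(qbits, intra):
    """Rounding offset; the constants are inferred from the bounds quoted in the text."""
    return (682 if intra else 342) << (qbits - 11)


def quant_reference(x, qf, qbits, intra):
    """Quantizer with absolute value and sign restoration."""
    mag = (abs(x) * qf + lev_off(qbits, intra)) >> qbits
    return -mag if x < 0 else mag


def quant_sign_free(x, qf, qbits, intra):
    """Sign-free quantizer: the offset is selected by the sign of x."""
    off = lev_off(qbits, intra)
    lev = off if x >= 0 else (1 << qbits) - 1 - off
    return (x * qf + lev) >> qbits      # Python's >> floors, like an arithmetic shift


QF_PLACEHOLDER = 12000                  # hypothetical multiplication factor
mismatches = [x for x in range(-5000, 5001)
              for intra in (True, False)
              if quant_reference(x, QF_PLACEHOLDER, 22, intra)
              != quant_sign_free(x, QF_PLACEHOLDER, 22, intra)]
```

The list `mismatches` stays empty because, for a negative product, adding (1 << qbits) − 1 before the flooring shift turns rounding toward minus infinity into rounding toward zero.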
The inverse quantization (IQ) or rescaling specified in (21) can be simplified if this equation is rewritten as follows

\[
Z_{i,j} = \left( Y_{i,j} \cdot QI_{i,j} + 2 \right) >> 2 \tag{30}
\]

Equations (28) and (30) are hardware compatible as they share the same basic arithmetic operations. Fig. 6.a shows the block diagram of the quantizer and rescaling module, which is capable of processing 8 coefficients in parallel. It is composed of a control circuit and an 8-way data-path based on a configurable arithmetic unit. The control circuit generates the intermediate parameters needed for the forward quantization or rescaling mode, all of which are obtained from the scaled compression factor ($QP_{sc}$), the intra value (intra), the operation mode (FQ/IQ) and the operation synchronization (init). These parameters are: lev(+) and lev(−), {$k_n$, $k_o$, $k_p$}, qbits and qpper, defined as

\[
qpper = \lfloor QP_{sc} / 6 \rfloor \tag{31}
\]

The three coefficients {$k_n$, $k_o$, $k_p$} represent either the quantization multiplication factors $kf_m$ of $QF_{i,j}$ specified in equations (18), (19) and (20) or the rescaling multiplication factors $ki_m$ of $QI_{i,j}$ defined in equations (22), (23) and (24). The indexes {n,o,p} take one of these possible sets of values: {0, 1, 2}, {1, 3, 4} or {2, 4, 5}. Only three coefficients need to be generated for the 8 arithmetic units because each row or column of the matrix QF in (18) or the matrix QI in (22) is composed of three different coefficients. All coefficients are read from a look-up table depending on the operation mode and the value of $QP_{sc}$. Each arithmetic unit computes equations (28) and (30). The multiplier has a high area cost and delay, so some papers (Michael & Hsu, 2008) have proposed replacing it with a reduced number of shifts and additions by modifying the QF factors to be more suitable for hardware optimization. However, they introduce an error between the quantization and the inverse quantization, which leads to a reduction of the rate-distortion performance. In order to avoid mismatch between encoder and decoder, our approach implements the whole multiplier, with a pipeline strategy to increase its speed. After an exhaustive analysis, a 4-stage pipelined Wallace-tree multiplier proved to be the best solution to balance the critical path of the multiplier with the critical path of the rest of the circuit. In the FQ mode, first the inputs $X_{i,j}$ and $QF_{i,j}$ are multiplied. A multiplexer selects the factor lev(+) or lev(−) to be added to the output of the multiplier depending on the sign of $X_{i,j}$. Here, a delay of 4 clock cycles is introduced in the sign($X_{i,j}$) signal to compensate for the delay in the multiplier. At the output of the adder, a shift-right by qbits (>>qbits) is performed to obtain the quantized coefficient $Y_{i,j}$. In the IQ mode, the inputs $Y_{i,j}$ and $QI_{i,j}$ are multiplied.
A constant 2 is added to the result and the last >>2 operation generates the scaled coefficients Z i,j .
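The sharing of one multiply-add-shift datapath between both modes can be illustrated with a behavioural sketch of a single lane. It assumes only what the text states: in FQ mode the offset is lev and the shift is qbits, and in IQ mode the offset is 2 and the shift is 2. The function name and the numeric factors in the examples are placeholders, not values from the standard, and the 4-stage multiplier pipeline is not modelled.

```python
def arithmetic_unit(mode, data, k, qbits=0, lev=0):
    """One lane of the shared datapath: (data * k + offset) >> shift."""
    if mode == "FQ":            # forward quantization: offset lev, shift qbits
        offset, shift = lev, qbits
    elif mode == "IQ":          # rescaling: constant offset 2, shift 2
        offset, shift = 2, 2
    else:
        raise ValueError("mode must be 'FQ' or 'IQ'")
    return (data * k + offset) >> shift


# FQ examples with a placeholder factor of 12000, qbits = 22 and the intra
# offsets quoted in the text for qbits = 22 (lev(+) = 1396736, lev(-) = 2797567).
y_pos = arithmetic_unit("FQ", 300, 12000, qbits=22, lev=1396736)
y_neg = arithmetic_unit("FQ", -300, 12000, qbits=22, lev=2797567)
# IQ example with a placeholder factor of 3.
z = arithmetic_unit("IQ", 5, 3)
```

Only the offset and the shift amount change between modes, which is why a pair of multiplexers is enough to reuse the multiplier and adder for both operations.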
ASIC implementation and comparisons
A prototype of the proposed bit-depth processor has been designed and verified at different abstraction levels. Fig. 7 presents the simulation environment used to verify the functional behavior of the proposed architecture by comparing the data processed with those provided by the JM reference software (Sühring, 2010) for different data blocks of input residual luminance. The comparisons performed between the simulation and the reference software show no differences between them. Initially, the processor was designed using the CoWare® Signal Processing Worksystem (SPW), editing the block diagram with the elements of the Hardware Design System (HDS) library. The first test bench was made by simulating the design with Simulation Program Builder-Interpreted (SPB-I). The Verilog-RTL description was automatically generated by the Verilog RTL Link from the HDS library. A new comparison was performed at this abstraction level to guarantee the correctness of the generated code. Finally, this Verilog description was synthesized using the Synopsys Design Compiler under HCMOS9 STMicroelectronics 130 nm standard cell technology. The resulting circuit contains 26.5k cells with an area of 625700 µm² and the estimated maximum operating frequency is 330 MHz. After the logic synthesis, the PrimePower™ tool was applied to estimate the power consumption, giving 120 mW at 330 MHz ($V_{DD}$=1.2 V). The data throughput is 2640 Mpixels per second. This provides enough processing capacity for 1080HD (1920×1088@30fps) real-time video streams. With the proposed architecture, each 8×8 block of input data is processed with a latency of 44 clock cycles according to the time scheduling described in Fig. 8. BUSA indicates the output of the transform module, BUSB the output of the quantization and scaling module, and IN and OUT are the input and output of the transpose register (TR); all these signals are depicted in Fig. 2.
On inputting luma (x), it takes 3 clock cycles to generate the coefficients (p), and the output coefficients (X) are obtained from the 13th clock cycle. These coefficients go to the quantization module, and the "quantized" coefficients (Y), which are generated from the 18th clock cycle, are stored in the transpose register. In the rescaling process, the data Y are read in transposed order to compute the "rescaled" coefficients Z from the 31st clock cycle. On processing these coefficients in the 1D transform module, the intermediate data q are obtained in the 34th clock cycle. Finally, the recovered residual luminance (z) is ready to be processed from the 44th clock cycle and the next luma block can be input in the 49th clock cycle. For comparison purposes, Table 4 shows the characteristics and the performance of previously published ASIC implementations, although some of them only implement parts of the H.264/AVC transform coding process. In (Fan, 2006), a cost-effective architecture for fast 1-D 4×4 and 8×8 forward/inverse transforms was derived through the Kronecker and direct sum operations. The configurable architecture presented in [...] supports the six kinds of 4×4 transforms required in the adaptive block-size transform of H.264 in order to reuse the data-path more efficiently; in this architecture, one 8×8 transform can be finished within 16 clock cycles. Based on this reusability property, another unified 4×4 and 8×8 transform architecture is proposed in (Choi et al., 2008). To increase its throughput, 4 units operate in parallel and only 5 clock cycles are needed to perform an 8×8 transform. Its low power consumption is due to the quite low operating speed of the circuit (27 MHz). A pipelined 8×8 2D forward transform architecture capable of consuming and producing one sample per clock cycle is proposed in (Silva et al., 2007). It uses two 1-D transform processors and a transpose RAM with a latency of 144 clock cycles.
A high-throughput and cost-effective implementation of six different integer transforms is proposed in (Hwangbo & Kyung, 2010). This implementation maximizes the shared hardware and is able to process 64 input pixels in a two-stage pipelined architecture to compute the direct 8×8 transform or two 4×4 transforms in parallel. Another flexible architecture is presented in (Chao et al., 2007), which is suitable for an H.264 high profile decoder capable of processing a macroblock in 95 clock cycles with the 8×8 inverse transform or only 54 clock cycles without it. The architecture described in (Lee & Cho, 2008) [...] and quantization for a unified standard video CODEC (JPEG, MPEG-1/2/4, H.264 and VC-1). A high-throughput architecture which integrates forward transform, quantization, scaling, inverse transform and sample reconstruction is presented in (Pastuszak, 2008). It uses a reconfigurable 4×4 and 8×8 transform architecture and is able to process 32 samples/coefficients per clock cycle. The 8×8 transform is performed in only 2 clock cycles by processing a whole block of 64 input samples through a scheme based on eight 1-D transforms operating in parallel. The quantization and rescaling operate on 32 coefficients in each clock cycle. Although this architecture has low latency, its area cost is 10 times that of other proposed designs. In a similar way to [...], a single data-path for implementing the 4×4 and 8×8 forward and inverse transforms as well as the Hadamard transform is presented in (Bruguera et al., 2006). However, the quantization and rescaling are computed using only one multiplier each and they are performed at the pace demanded by the entropy coder. In a previous work (Michell et al., 2011), we described a parallel architecture capable of processing 8×8 blocks without interruption with the bit-depth fixed to 8 bits. Its latency of 38 clock cycles is achieved by implementing each module used in the transform coding in a pipeline scheme.
The processor presented here uses a configurable architecture based on the reuse of different variable bit-depth modules to reduce hardware and power, all with a latency of 44 clock cycles. It has been designed to achieve the maximum throughput at the highest possible speed. To achieve these goals, the pipeline stages have been balanced during synthesis to keep the critical path within the equivalent of 2 adders, independently of the technology used. Other challenges were the hardware-efficient modifications in the quantization and rescaling module to reduce the arithmetic complexity, combined with balanced pipelined multipliers, since the multiplier is the most complex arithmetic component, in order to attain the high performance figures. According to the results shown in Table 4, our design is the fastest. Its high throughput is only surpassed by that in (Hwangbo & Kyung, 2010), which processes 16 and 32 input samples in comparison with 8 in our design, but that scheme has a large area cost despite the fact that it only implements the direct transform without quantization and rescaling. The design proposed in (Bruguera et al., 2006) has fewer gates than ours, but its quite low speed (67 MHz) reduces the throughput to 266 Mpixels/s. By observing the differences in the speed and throughput achieved by our processor, we can conclude that these differences cannot be attributed only to the technology used, but are a consequence of the hardware modifications introduced in our design.
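As a reading aid, the clock-cycle milestones quoted earlier for one 8×8 luma block (p at cycle 3, X at 13, Y at 18, Z at 31, q at 34, z at 44, next block at 49) can be tabulated to see how the 44-cycle latency is distributed. The following Python sketch only restates those numbers; the names MILESTONES and stage_gaps are hypothetical, and reading the difference between two consecutive milestones as the delay of the stage between them is an assumption, not something stated in the text.

```python
# Cycle at which each signal first becomes available, as quoted in the text.
# The list name and layout are illustrative only.
MILESTONES = [
    ("p", 3),            # output of the first 1D forward pass
    ("X", 13),           # transform coefficients
    ("Y", 18),           # quantized coefficients
    ("Z", 31),           # rescaled coefficients
    ("q", 34),           # intermediate data of the inverse transform
    ("z", 44),           # recovered residual luminance
    ("next block", 49),  # cycle at which the next luma block can be input
]


def stage_gaps(milestones):
    """Return (from_signal, to_signal, cycles) for each consecutive pair."""
    return [
        (a, b, cycle_b - cycle_a)
        for (a, cycle_a), (b, cycle_b) in zip(milestones, milestones[1:])
    ]


if __name__ == "__main__":
    for src, dst, cycles in stage_gaps(MILESTONES):
        print(f"{src:>2} -> {dst:<10} {cycles:2d} cycles")
```

Under this reading, the gaps from p to X and from q to z are both 10 cycles, and the largest single gap (13 cycles) lies between the quantized coefficients Y and the rescaled coefficients Z, where the data pass through the transpose register.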
Conclusions
In July 2004, a new amendment called Fidelity Range Extensions (FRExt) was added to H.264/AVC as a standardization initiative motivated by the rapidly growing demands of professional applications and high-definition video. Improvements present in FRExt include a new 8×8 integer transform, a variety of chroma sub-sampling formats and a greater colour bit-depth ranging from 8 bits up to 14 bits. Increasing the bit-depth improves coding accuracy and reduces noise and artifacts. Indeed, bit-depth scalability is potentially useful because, in a foreseeable future where different bit-depths will coexist in the market, it provides multiple representations of the same visual content at different bit-depths. This chapter presents a variable bit-depth processor with a pipelined architecture for real-time implementation of the complete 8×8 transform and quantization coding process in H.264/AVC. This architecture has been conceived with the aim of achieving a high operating frequency and high throughput without increasing the hardware complexity. Initially, the mathematical expressions of the 8×8 transform and quantization used in the H.264/AVC standard are presented to facilitate the reader's understanding of the subject; here, special emphasis is given to describing the effect of the bit-depth on the quantization and rescaling formulas. A review of the state of the art of previous implementations and references is also included. However, most hardware implementations only operate at 8 bits, and higher bit-depths have not been taken into account. In order to achieve an efficient implementation of the processor, hardware solutions have been developed for the different circuit modules. A configurable forward and inverse 1D processor and a transpose register array enable an efficient hardware computation of the 8×8 transform.
Forward quantization and rescaling operations are computed in the same circuit for different bit-depth requirements, and new expressions are included that enable an efficient hardware implementation by minimizing the arithmetic operations involved. Finally, the critical paths of the distinct computing units have been carefully analyzed and balanced using a pipeline scheme in order to maximize the operating frequency without introducing excessive latency. A prototype of the proposed architecture has been synthesized in a 130 nm HCMOS technology process, achieving a maximum speed of 330 MHz. The throughput of 2640 Mpixels/s allows real-time video streams of 1080HD (1920×1088@30fps) to be processed.
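The throughput figure can be checked with simple arithmetic. The sketch below assumes that the processor accepts 8 samples per clock cycle, which is the figure given in the comparison with other designs, and it counts only luma samples when estimating the rate needed for a 1920×1088@30fps stream; both function names are hypothetical and introduced only for illustration.

```python
def throughput_msamples(clock_mhz, samples_per_cycle):
    """Peak throughput in Msamples/s for a given clock and datapath width."""
    return clock_mhz * samples_per_cycle


def luma_rate_msamples(width, height, fps):
    """Luma sample rate of a video stream in Msamples/s (chroma not counted)."""
    return width * height * fps / 1e6


if __name__ == "__main__":
    # 330 MHz clock from the synthesis result; 8 samples per cycle is assumed.
    peak = throughput_msamples(330, 8)
    needed = luma_rate_msamples(1920, 1088, 30)
    print(f"peak throughput : {peak} Msamples/s")
    print(f"1080HD luma rate: {needed:.2f} Msamples/s")
    print(f"headroom factor : {peak / needed:.1f}x")
```

With these assumptions the peak rate is 330 × 8 = 2640 Msamples/s, which matches the reported throughput and is well above the roughly 62.7 Msamples/s of luma in a 1920×1088@30fps stream.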
