Variable Bit-Depth Processor for 8×8 Transform and Quantization Coding in H.264/AVC by Gustavo A. Ruiz & Juan A. Michell
15 
Variable Bit-Depth Processor for 8×8 Transform 
and Quantization Coding in H.264/AVC 
Gustavo A. Ruiz and Juan A. Michell 
Department of Electronics and Computers, University of Cantabria 
Spain 
1. Introduction 
H.264/AVC (Advanced Video Coding) is the latest standard for video coding established 
by the Joint Video Team ITU-T VCEG and ISO/IEC MPEG (Wiegand et al., 2003)  
(Sühring, 2010) (Links, 2010). This standard has many innovations, such as hybrid 
prediction/transform coding of intra frames and integer transforms (Richardson, 2004). Fig. 
1 presents a simplified block diagram of the H.264/AVC encoder with the following main 
blocks: motion estimation (ME), motion compensation (MC), intra prediction, forward 
transform (FT), forward quantization (FQ), inverse quantization or re-scaling (IQ), inverse 
transform (IT), entropy coding and de-blocking filter, among others. Initially, most of the 
work done on H.264 was oriented toward its software implementation. However, in recent 
years the contributions to the hardware implementation of H.264 have increased greatly, 
enabling the implementation of fast architectures for real-time video applications (Lin et al., 
2008) (Finchelstein et al., 2009) (Liu et al., 2009).  
 
Fig. 1. Diagram of the H.264/AVC encoder. 
The initial version of H.264/AVC used a transform hierarchy based on three transforms that 
are computed in integer arithmetic, two of size 4×4 and one of 2x2. In July 2004, the first 
amendment to the H.264 standard was presented, named Fidelity Range Extensions (FRExt) 
(JVT, 2004), in which a new set of tools was specified to increase the high-fidelity video 
encoding efficiency, focusing on professional applications and high-definition videos. One 
Recent Advances on Video Coding 
 
of the most significant differences between FRExt and non-FRExt H.264 coding 
is the use of an 8×8 integer transform (Gordon, 2004), an integer approximation 
of the 8×8 2-D Discrete Cosine Transform (DCT), in addition to the original 4×4 and 2×2 
transforms. The H.264 FRExt enables high quality video by supporting varied chroma sub-
sampling formats 4:2:0, 4:2:2 and 4:4:4 with greater color bit-depth ranging from 8-bit up 
to 14-bit and resolution ranging from QCIF (176x144) to Full HD (1920x1080), both in 
progressive and interlaced scanning. Several H.264/AVC profiles encode pixels 
with a bit depth greater than 8 bits: High 10 (8 bits up to 10 bits), High 4:2:2 (8 
bits up to 10 bits), High 4:4:4 Predictive (8 bits up to 14 bits), High 10 Intra (8 
bits up to 10 bits), High 4:2:2 Intra (8 bits up to 10 bits), High 4:4:4 Intra (8 bits 
up to 14 bits) and CAVLC 4:4:4 Intra (8 bits up to 14 bits). Increasing bit depth 
provides improved accuracy in the compression scheme as well as in motion compensation, 
in intra prediction and in-loop filtering (Gish, 2002) (Gish, 2003) (Lavier, 2009). Indeed, 
extensive experimentation proves that the coding efficiency with the largest bit-depth is 
higher on videos that contain shallow textures and low noise, and perceivable gains exist in 
the reduction of three kinds of artifacts: contouring, banding and mosquito noise. Currently, 
bit-depth is especially focused on video quality (Sims et al., 2005). The coding efficiency can 
be improved by increasing the internal bit depth in relation to the external bit depth used in 
the video codec (Chujoh & Noda, 2007a, 2007b). Moreover, bit-depth scalability is 
potentially useful considering that for the foreseeable future, conventional 8-bit and high-bit 
digital imaging systems will exist simultaneously in the market, providing multiple 
representations of different bit-depths for the same visual content (Chujoh & Noda, 2006) 
(Gao & Wu, 2006) (Gao et al., 2010). Other bit-depth-related work includes a bit-depth 
transform that exploits the characteristics of high bit-depth images to maximize the encoding 
efficiency (Ito et al., 2010), a bit-depth expansion method that removes the 
contouring effects in smooth regions when mapping low bit-depth color images to a higher 
bit depth (Chen et al., 2009), and three bit-depth scalable coding architectures compatible 
with H.264 (Chiang et al., 2009). 
This chapter presents a variable bit-depth processor with pipeline architecture for real-time 
implementation of the complete process for the 8×8 transform and quantization coding in 
the H.264/AVC. The processor manages different bit-depths – 8 bits up to 14 bits – and 
quantization parameters (QP) fulfilling the requirements of H.264/AVC. Hardware 
solutions to reduce its complexity, combined with an efficient implementation, provide a 
high-speed, high-throughput circuit at a low cost in area. A prototype of the processor, 
which has been synthesized in a 130nm HCMOS technology, uses 26.5k gates and achieves a 
maximum speed of 330 MHz with a throughput of 2640 Mpixels/s; this throughput is 
enough to reach a processing capacity for 1080HD (1920×1088@30fps) real-time video 
streams. 
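The throughput figure can be checked with a few lines of arithmetic. The sketch below assumes that one 8-sample row or column enters the datapath per clock cycle (the constant PIXELS_PER_CYCLE is an assumption of this example) and compares the result with the luma sample rate of a 1920×1088 stream at 30 frames per second. It does not account for the several passes that each block makes through the 1D transform, so it is only a rough headroom estimate.

```python
CLOCK_MHZ = 330        # maximum clock frequency reported for the prototype
PIXELS_PER_CYCLE = 8   # assumption: one 8-sample row or column per clock cycle

# Peak throughput in Mpixels/s.
throughput_mpixels = CLOCK_MHZ * PIXELS_PER_CYCLE

# Luma samples per second for 1920x1088 at 30 frames/s, in Mpixels/s.
luma_rate_mpixels = 1920 * 1088 * 30 / 1e6

print(throughput_mpixels)            # 2640
print(round(luma_rate_mpixels, 2))   # 62.67
```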
The remainder of this chapter is organized as follows. Sections 2 and 3 describe the 8×8 
transform and quantization in H.264/AVC, providing the necessary mathematical 
background with special emphasis on describing the effect of the bit-depth in quantization 
and rescaling expressions. The 8×8 transform provides excellent compression performance 
in high-resolution video streams with a level of complexity only slightly higher than the 4×4 
transform. Its implementation can also be done in terms of additions and shifts and no 
multiplications are necessary, despite the fact that the coefficients are not powers of 2 in all 
cases. Quantization and rescaling enable the encoder to control the trade-off between bit-
rate and quality. H.264 assumes a bit-depth-dependent scalar quantizer without division 
and/or floating arithmetic based on post and pre-scaling matrices. Section 4 describes the 
proposed architecture for implementing the configurable process of transform and 
quantization for an 8×8 luma block capable of operating with different bit-depths (8 bits up 
to 14 bits). This section includes a description of the main modules: 1D configurable forward 
and inverse transform, 8×8 transpose register and the optimized arithmetic circuit needed to 
perform the computation of bit-depth-dependent quantization and rescaling in a unified 
structure. A review of the state-of-the-art of the previous implementations and references is 
also included. However, most hardware implementations only operate in 8 bits and further 
bit-depths have not been taken into account. Section 5 shows the characteristics and the 
performance of the proposed processor as well as comparisons with other published and 
related implementations. These comparisons are made in terms of area, speed and power. 
2. 8×8 Transform in the H.264/AVC 
The FRExt amendment to H.264 proposes a scheme based on an 8×8 integer approximation 
of DCT transform to be added to the existing 4×4 transform in order to improve high-
definition video compression (Gordon & Wiegand, 2004). This transform provides excellent 
compression performance in high-resolution video streams with a level of complexity only 
slightly higher than the 4×4 transform. Even though the coefficients are not powers of 2 in all 
cases, it can be implemented using only additions and shifts, and no multiplications are 
necessary. Moreover, it uses integer arithmetic, which eliminates the mismatch issues 
between the encoder and the decoder. 
The forward 8×8 integer transform is applied to each block in the residual luminance 
component (x) of the input video stream as follows 
$$X = T\,x\,T^{t} \tag{1}$$
where T is a matrix of dimension 8×8 which represents the transform kernel defined as 
 
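$$
T = \frac{1}{8}
\begin{bmatrix}
8 & 8 & 8 & 8 & 8 & 8 & 8 & 8 \\
12 & 10 & 6 & 3 & -3 & -6 & -10 & -12 \\
8 & 4 & -4 & -8 & -8 & -4 & 4 & 8 \\
10 & -3 & -12 & -6 & 6 & 12 & 3 & -10 \\
8 & -8 & -8 & 8 & 8 & -8 & -8 & 8 \\
6 & -12 & 3 & 10 & -10 & -3 & 12 & -6 \\
4 & -8 & 8 & -4 & -4 & 8 & -8 & 4 \\
3 & -6 & 10 & -12 & 12 & -10 & 6 & -3
\end{bmatrix}
\tag{2}
$$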
In the JM reference software (Sühring, 2010), the property of separability of this 8×8 
transform is used to implement equation (1) in a separable way as a 1D horizontal (Eq. (3)) 
transform followed by a 1D vertical (Eq. (4)) transform according to the following equations 
$$p = x\,T_1^{t}\,T_2^{t}\,T_3^{t} \tag{3}$$

$$X^{t} = p^{t}\,T_1^{t}\,T_2^{t}\,T_3^{t} \tag{4}$$
Equations (3) and (4) are obtained from the decomposition of T as a sparse matrix product 
of matrices T1, T2 and T3 defined as 
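$$
T_1 =
\begin{bmatrix}
1 & 0 & 0 & 0 & 0 & 0 & 0 & 1 \\
0 & 1 & 0 & 0 & 0 & 0 & 1 & 0 \\
0 & 0 & 1 & 0 & 0 & 1 & 0 & 0 \\
0 & 0 & 0 & 1 & 1 & 0 & 0 & 0 \\
1 & 0 & 0 & 0 & 0 & 0 & 0 & -1 \\
0 & 1 & 0 & 0 & 0 & 0 & -1 & 0 \\
0 & 0 & 1 & 0 & 0 & -1 & 0 & 0 \\
0 & 0 & 0 & 1 & -1 & 0 & 0 & 0
\end{bmatrix}
\tag{5}
$$

$$
T_2 =
\begin{bmatrix}
1 & 0 & 0 & 1 & 0 & 0 & 0 & 0 \\
0 & 1 & 1 & 0 & 0 & 0 & 0 & 0 \\
1 & 0 & 0 & -1 & 0 & 0 & 0 & 0 \\
0 & 1 & -1 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 3/2 & 1 & 1 & 0 \\
0 & 0 & 0 & 0 & 1 & 0 & -3/2 & -1 \\
0 & 0 & 0 & 0 & 1 & -3/2 & 0 & 1 \\
0 & 0 & 0 & 0 & 0 & 1 & -1 & 3/2
\end{bmatrix}
\tag{6}
$$

$$
T_3 =
\begin{bmatrix}
1 & 1 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 1 & 0 & 0 & 1/4 \\
0 & 0 & 1 & 1/2 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 1 & 1/4 & 0 \\
1 & -1 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & -1/4 & 1 & 0 \\
0 & 0 & 1/2 & -1 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 1/4 & 0 & 0 & -1
\end{bmatrix}
\tag{7}
$$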
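The decomposition can be checked numerically. The following sketch multiplies the three sparse matrices in exact rational arithmetic and compares the product with the kernel of equation (2). It assumes that the kernel carries an overall factor of 1/8 and that the stages compose as T = T3·T2·T1, which is what equations (3) and (4) imply; the helper name matmul is introduced only for this example.

```python
from fractions import Fraction as F

# Kernel of equation (2), including the assumed 1/8 factor.
T = [[F(v, 8) for v in row] for row in [
    [8, 8, 8, 8, 8, 8, 8, 8],
    [12, 10, 6, 3, -3, -6, -10, -12],
    [8, 4, -4, -8, -8, -4, 4, 8],
    [10, -3, -12, -6, 6, 12, 3, -10],
    [8, -8, -8, 8, 8, -8, -8, 8],
    [6, -12, 3, 10, -10, -3, 12, -6],
    [4, -8, 8, -4, -4, 8, -8, 4],
    [3, -6, 10, -12, 12, -10, 6, -3]]]

h, q, t = F(1, 2), F(1, 4), F(3, 2)

T1 = [[1, 0, 0, 0, 0, 0, 0, 1],
      [0, 1, 0, 0, 0, 0, 1, 0],
      [0, 0, 1, 0, 0, 1, 0, 0],
      [0, 0, 0, 1, 1, 0, 0, 0],
      [1, 0, 0, 0, 0, 0, 0, -1],
      [0, 1, 0, 0, 0, 0, -1, 0],
      [0, 0, 1, 0, 0, -1, 0, 0],
      [0, 0, 0, 1, -1, 0, 0, 0]]

T2 = [[1, 0, 0, 1, 0, 0, 0, 0],
      [0, 1, 1, 0, 0, 0, 0, 0],
      [1, 0, 0, -1, 0, 0, 0, 0],
      [0, 1, -1, 0, 0, 0, 0, 0],
      [0, 0, 0, 0, t, 1, 1, 0],
      [0, 0, 0, 0, 1, 0, -t, -1],
      [0, 0, 0, 0, 1, -t, 0, 1],
      [0, 0, 0, 0, 0, 1, -1, t]]

T3 = [[1, 1, 0, 0, 0, 0, 0, 0],
      [0, 0, 0, 0, 1, 0, 0, q],
      [0, 0, 1, h, 0, 0, 0, 0],
      [0, 0, 0, 0, 0, 1, q, 0],
      [1, -1, 0, 0, 0, 0, 0, 0],
      [0, 0, 0, 0, 0, -q, 1, 0],
      [0, 0, h, -1, 0, 0, 0, 0],
      [0, 0, 0, 0, q, 0, 0, -1]]


def matmul(a, b):
    """Plain 8x8 matrix product (hypothetical helper for this check)."""
    return [[sum(a[i][k] * b[k][j] for k in range(8)) for j in range(8)]
            for i in range(8)]


product = matmul(T3, matmul(T2, T1))
print(product == T)  # True
```

The product is exact only in rational arithmetic; the hardware and the JM software replace the fractional factors by right shifts, which introduces the truncation discussed next.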
Table 1, which is directly extracted from the JM reference software, shows the expressions 
used to compute the 1D transforms involved in equations (3) and (4). In this Table, IF 
denotes the vector of input values (IF represents either each row of x in equation (3) or each 
column of p in (4)), OF denotes the transformed output vector (OF represents either each 
row of p in equation (3) or each column of X in (4)), and a and b are internal variables. In a 
3-stage butterfly, stage 1 implements the operations involved in T1, stage 2 implements T2 
and stage 3 implements T3. The multiplications by the coefficients 1/2, 1/4 and 3/2=1+1/2 
are implemented by means of shift-right (>>) operations, whose truncation errors 
propagate through the datapath. To avoid mismatch between the encoder and 
decoder, the implementation of the 1D transform must reproduce the operations specified in the 
standard. As a result, any implementation of this transform must comply with the 
arithmetic described in Table 1; no other alternative is possible. 
Stage 1 – T1 | Stage 2 – T2 | Stage 3 – T3
a0 = IF0 + IF7 | b0 = a0 + a3 | OF0 = b0 + b1
a1 = IF1 + IF6 | b1 = a1 + a2 | OF1 = b4 + (b7>>2)
a2 = IF2 + IF5 | b2 = a0 – a3 | OF2 = b2 + (b3>>1)
a3 = IF3 + IF4 | b3 = a1 – a2 | OF3 = b5 + (b6>>2)
a4 = IF0 – IF7 | b4 = a5 + a6 + ((a4>>1) + a4) | OF4 = b0 – b1
a5 = IF1 – IF6 | b5 = a4 – a7 – ((a6>>1) + a6) | OF5 = b6 – (b5>>2)
a6 = IF2 – IF5 | b6 = a4 + a7 – ((a5>>1) + a5) | OF6 = (b2>>1) – b3
a7 = IF3 – IF4 | b7 = a5 – a6 + ((a7>>1) + a7) | OF7 = –b7 + (b4>>2)
Table 1. Forward 1D transform algorithm extracted from the JM software reference. 
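The three stages of Table 1 can be sketched in software as follows. This is a minimal illustration rather than the JM code itself: the function name forward_1d is chosen for this example, the signs follow the matrices T1, T2 and T3 of equations (5) to (7), and Python's >> on negative integers is an arithmetic (floor) shift, which is assumed to match the shift used in the reference arithmetic.

```python
def forward_1d(x):
    """Forward 1D 8-point transform following the three stages of Table 1."""
    # Stage 1 (T1): sums and differences of mirrored inputs.
    a = [x[0] + x[7], x[1] + x[6], x[2] + x[5], x[3] + x[4],
         x[0] - x[7], x[1] - x[6], x[2] - x[5], x[3] - x[4]]
    # Stage 2 (T2): the factor 3/2 is computed as v + (v >> 1).
    b = [a[0] + a[3],
         a[1] + a[2],
         a[0] - a[3],
         a[1] - a[2],
         a[5] + a[6] + ((a[4] >> 1) + a[4]),
         a[4] - a[7] - ((a[6] >> 1) + a[6]),
         a[4] + a[7] - ((a[5] >> 1) + a[5]),
         a[5] - a[6] + ((a[7] >> 1) + a[7])]
    # Stage 3 (T3): the factors 1/2 and 1/4 are right shifts by 1 and 2.
    return [b[0] + b[1],
            b[4] + (b[7] >> 2),
            b[2] + (b[3] >> 1),
            b[5] + (b[6] >> 2),
            b[0] - b[1],
            b[6] - (b[5] >> 2),
            (b[2] >> 1) - b[3],
            -b[7] + (b[4] >> 2)]


# A constant input only excites the DC output.
print(forward_1d([1] * 8))                    # [8, 0, 0, 0, 0, 0, 0, 0]
# An impulse of 8 at position 0 reproduces the first column of 8*T.
print(forward_1d([8, 0, 0, 0, 0, 0, 0, 0]))   # [8, 12, 8, 10, 8, 6, 4, 3]
```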
The inverse 8×8 integer transform of a block of coefficients of size 8×8 (Z) is defined through 
the equation 
$$z = T^{t}\,Z\,T \tag{8}$$
As with the forward transform, the 8×8 inverse transform can be computed as the 
concatenation of a 1D horizontal inverse transform (Eq. (9)) and a 1D vertical inverse 
transform (Eq. (10)) through the decomposition of T as a sparse matrix product of matrices 
G1, G2 and G3 giving 
$$q = Z\,G_1\,G_2\,G_3 \tag{9}$$

$$z^{t} = q^{t}\,G_1\,G_2\,G_3 \tag{10}$$
The G1, G2 and G3 matrices are defined as 
 
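$$
G_1 =
\begin{bmatrix}
1 & 0 & 0 & 0 & 1 & 0 & 0 & 0 \\
0 & 0 & 0 & 1 & 0 & -1 & 0 & 3/2 \\
0 & 0 & 1/2 & 0 & 0 & 0 & 1 & 0 \\
0 & -1 & 0 & -3/2 & 0 & 0 & 0 & 1 \\
1 & 0 & 0 & 0 & -1 & 0 & 0 & 0 \\
0 & 1 & 0 & 0 & 0 & 3/2 & 0 & 1 \\
0 & 0 & -1 & 0 & 0 & 0 & 1/2 & 0 \\
0 & -3/2 & 0 & 1 & 0 & 1 & 0 & 0
\end{bmatrix}
\tag{11}
$$

$$
G_2 =
\begin{bmatrix}
1 & 0 & 0 & 0 & 0 & 0 & 1 & 0 \\
0 & 1 & 0 & 0 & 0 & 0 & 0 & -1/4 \\
0 & 0 & 1 & 0 & -1 & 0 & 0 & 0 \\
0 & 0 & 0 & 1 & 0 & 1/4 & 0 & 0 \\
0 & 0 & 1 & 0 & 1 & 0 & 0 & 0 \\
0 & 0 & 0 & 1/4 & 0 & -1 & 0 & 0 \\
1 & 0 & 0 & 0 & 0 & 0 & -1 & 0 \\
0 & 1/4 & 0 & 0 & 0 & 0 & 0 & 1
\end{bmatrix}
\tag{12}
$$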
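$$
G_3 =
\begin{bmatrix}
1 & 0 & 0 & 0 & 0 & 0 & 0 & 1 \\
0 & 0 & 0 & 1 & -1 & 0 & 0 & 0 \\
0 & 1 & 0 & 0 & 0 & 0 & 1 & 0 \\
0 & 0 & 1 & 0 & 0 & -1 & 0 & 0 \\
0 & 0 & 1 & 0 & 0 & 1 & 0 & 0 \\
0 & 1 & 0 & 0 & 0 & 0 & -1 & 0 \\
0 & 0 & 0 & 1 & 1 & 0 & 0 & 0 \\
1 & 0 & 0 & 0 & 0 & 0 & 0 & -1
\end{bmatrix}
\tag{13}
$$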
Table 2 shows the expressions for computing these 1D transforms used in the JM reference 
software. In a similar way to the forward 1D transform, a 3-stage butterfly structure is used 
where stage 1 implements the operations specified in G1, stage 2 in G2 and stage 3 in G3. 
Here, II denotes the vector of input values (II represents either each row of Z in equation (9) 
or each column of q in (10)), OI denotes the transformed output vector (OI represents either 
each row of q in equation (9) or each column of z in (10)), and ia and ib are internal variables. 
 
Stage 1 – G1 | Stage 2 – G2 | Stage 3 – G3
ia0 = II0 + II4 | ib0 = ia0 + ia6 | OI0 = ib0 + ib7
ia1 = –II3 + II5 – II7 – (II7>>1) | ib1 = ia1 + (ia7>>2) | OI1 = ib2 + ib5
ia2 = (II2>>1) – II6 | ib2 = ia4 + ia2 | OI2 = ib4 + ib3
ia3 = II1 + II7 – II3 – (II3>>1) | ib3 = ia3 + (ia5>>2) | OI3 = ib6 + ib1
ia4 = II0 – II4 | ib4 = ia4 – ia2 | OI4 = ib6 – ib1
ia5 = –II1 + II7 + II5 + (II5>>1) | ib5 = (ia3>>2) – ia5 | OI5 = ib4 – ib3
ia6 = II2 + (II6>>1) | ib6 = ia0 – ia6 | OI6 = ib2 – ib5
ia7 = II3 + II5 + II1 + (II1>>1) | ib7 = –(ia1>>2) + ia7 | OI7 = ib0 – ib7
Table 2. Inverse 1D transform algorithm extracted from the JM software reference. 
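A software sketch of the inverse butterfly of Table 2 is given below. The function name inverse_1d is chosen for this example. The II3 term of ia1 is taken with a negative sign, which is what column 1 of G1 in equation (11) prescribes, and Python's floor-based right shift on negative integers is assumed to match the reference arithmetic.

```python
def inverse_1d(c):
    """Inverse 1D 8-point transform following the three stages of Table 2."""
    # Stage 1 (G1).
    ia = [c[0] + c[4],
          -c[3] + c[5] - c[7] - (c[7] >> 1),
          (c[2] >> 1) - c[6],
          c[1] + c[7] - c[3] - (c[3] >> 1),
          c[0] - c[4],
          -c[1] + c[7] + c[5] + (c[5] >> 1),
          c[2] + (c[6] >> 1),
          c[3] + c[5] + c[1] + (c[1] >> 1)]
    # Stage 2 (G2).
    ib = [ia[0] + ia[6],
          ia[1] + (ia[7] >> 2),
          ia[4] + ia[2],
          ia[3] + (ia[5] >> 2),
          ia[4] - ia[2],
          (ia[3] >> 2) - ia[5],
          ia[0] - ia[6],
          -(ia[1] >> 2) + ia[7]]
    # Stage 3 (G3): final sums and differences.
    return [ib[0] + ib[7], ib[2] + ib[5], ib[4] + ib[3], ib[6] + ib[1],
            ib[6] - ib[1], ib[4] - ib[3], ib[2] - ib[5], ib[0] - ib[7]]


# A DC coefficient spreads evenly over the eight outputs.
print(inverse_1d([8, 0, 0, 0, 0, 0, 0, 0]))   # [8, 8, 8, 8, 8, 8, 8, 8]
# A coefficient of 8 at position 1 reproduces the second row of 8*T.
print(inverse_1d([0, 8, 0, 0, 0, 0, 0, 0]))   # [12, 10, 6, 3, -3, -6, -10, -12]
```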
3. Quantization and rescaling in the H.264/AVC 
The forward quantization process in H.264/AVC FRExt is performed for the transformed 
coefficients (X) computed in equations (3) and (4) according to the following equations 
 
 
   
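$$
\begin{aligned}
|Y_{i,j}| &= \left( QF_{i,j} \cdot |X_{i,j}| + \text{lev\_off} \right) \gg \text{qbits} \\
\operatorname{sign}(Y_{i,j}) &= \operatorname{sign}(X_{i,j})
\end{aligned}
\tag{14}
$$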
where 
$$\text{qbits} = \lfloor QP_{sc}/6 \rfloor + 16 \tag{15}$$
In this equation, QPsc is the scaled quantization parameter defined as 
$$QP_{sc} = QP + 6\,(bd - 8) \tag{16}$$
QP takes an integer value (from 0 to 51) and determines the level of coarseness of the 
quantization process enabling the encoder to control the trade-off between bit rate and 
quality. The parameter bd represents the bit depth of the video content, 8 ≤ bd ≤ 14. Many 
professional applications, such as studio and HD applications, require support for higher 
bit depths. In H.264/AVC, 7 of the 11 profiles support bit depths above 8 bits, 
starting with High 10, which supports up to 10 bits; High 4:4:4 Predictive and some 
related profiles support up to 14 bits. As can be seen in equation (16), QPsc depends on the 
quantization parameter QP as well as on bd; note that QPsc=QP for bd=8 bits. This means that QPsc 
takes a value from 0 to 51 when bd=8 and from 36 to 87 when bd=14. 
The approximation factor, lev_off, used in equation (14) is defined as 
$$\text{lev\_off} = \left( 682 \cdot \text{intra} + 342 \cdot (1 - \text{intra}) \right) \ll (\text{qbits} - 11), \quad \text{intra} \in \{0, 1\} \tag{17}$$
where intra=1 is used for intra coefficient quantization and intra=0 for inter coefficient 
quantization. 
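As a worked example of equations (15) to (17), the sketch below derives the scaled quantization parameter, the shift amount and the rounding offset for a given QP, bit depth and prediction mode. The function name quant_params is introduced only for this example, and the division by 6 in equation (15) is assumed to be an integer (floor) division.

```python
def quant_params(qp, bd, intra):
    """Return (QPsc, qbits, lev_off) for a given QP, bit depth and mode."""
    qp_sc = qp + 6 * (bd - 8)        # equation (16)
    qbits = qp_sc // 6 + 16          # equation (15), floor division assumed
    base = 682 if intra else 342     # equation (17): intra or inter offset
    lev_off = base << (qbits - 11)
    return qp_sc, qbits, lev_off


print(quant_params(28, 8, True))     # (28, 20, 349184)
print(quant_params(28, 10, False))   # (40, 22, 700416)
```

Since qbits is at least 16, the shift amount qbits-11 is never negative, and each extra bit of depth raises qbits by one.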
The forward quantization matrix, QF, is  
 
$$
QF =
\begin{bmatrix}
kf_0 & kf_1 & kf_2 & kf_1 & kf_0 & kf_1 & kf_2 & kf_1 \\
kf_1 & kf_3 & kf_4 & kf_3 & kf_1 & kf_3 & kf_4 & kf_3 \\
kf_2 & kf_4 & kf_5 & kf_4 & kf_2 & kf_4 & kf_5 & kf_4 \\
kf_1 & kf_3 & kf_4 & kf_3 & kf_1 & kf_3 & kf_4 & kf_3 \\
kf_0 & kf_1 & kf_2 & kf_1 & kf_0 & kf_1 & kf_2 & kf_1 \\
kf_1 & kf_3 & kf_4 & kf_3 & kf_1 & kf_3 & kf_4 & kf_3 \\
kf_2 & kf_4 & kf_5 & kf_4 & kf_2 & kf_4 & kf_5 & kf_4 \\
kf_1 & kf_3 & kf_4 & kf_3 & kf_1 & kf_3 & kf_4 & kf_3
\end{bmatrix}
\tag{18}
$$
whose elements are obtained by evaluating the expression 
$$kf_m = MF\left( \operatorname{mod}(QP_{sc}, 6),\, m \right), \quad m \in \{0, 1, 2, 3, 4, 5\} \tag{19}$$
In this equation, MF is the multiplication factor matrix of dimension 6×6, and the term 
mod(QPsc, 6) and m denote the row and column indices respectively. MF is specified as 
 
 
$$
MF =
\begin{bmatrix}
13107 & 12222 & 16777 & 11428 & 15481 & 20972 \\
11916 & 11058 & 14980 & 10826 & 14290 & 19174 \\
10082 & 9675 & 12710 & 8943 & 11985 & 15978 \\
9362 & 8931 & 11984 & 8228 & 11259 & 14913 \\
8192 & 7740 & 10486 & 7346 & 9777 & 13159 \\
7282 & 6830 & 9118 & 6428 & 8640 & 11570
\end{bmatrix}
\tag{20}
$$
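The forward quantizer of equation (14) can be sketched as below. The example builds QF from a single row of MF using the index pattern of equation (18), which repeats with period 4 in both directions, and then quantizes one coefficient. The names PATTERN, build_qf and quantize are hypothetical, and the absolute-value form of equation (14) (quantize the magnitude, then restore the sign) is assumed.

```python
# Index pattern of equation (18); it repeats every 4 rows and columns.
PATTERN = [[0, 1, 2, 1],
           [1, 3, 4, 3],
           [2, 4, 5, 4],
           [1, 3, 4, 3]]

# Row of MF selected when mod(QPsc, 6) = 1.
MF_ROW_1 = [11916, 11058, 14980, 10826, 14290, 19174]


def build_qf(mf_row):
    """Expand one 6-entry row of MF into the 8x8 matrix QF."""
    return [[mf_row[PATTERN[i % 4][j % 4]] for j in range(8)]
            for i in range(8)]


def quantize(x, qf, qbits, lev_off):
    """|Y| = (QF * |X| + lev_off) >> qbits, with sign(Y) = sign(X)."""
    mag = (abs(x) * qf + lev_off) >> qbits
    return -mag if x < 0 else mag


# QP = 25 at 8 bits: QPsc = 25, qbits = 25 // 6 + 16 = 20,
# intra offset lev_off = 682 << (20 - 11) = 349184.
qf = build_qf(MF_ROW_1)
print(quantize(1000, qf[0][0], 20, 349184))    # 11
print(quantize(-1000, qf[0][0], 20, 349184))   # -11
```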
 
The inverse quantization or rescaling “re-scales” the quantized transform coefficients (Y) 
computed in (14). The rescaling process, which is different from that used in the 
4×4 transform (Malvar et al., 2006), is defined by the following equation, directly extracted 
from the JM reference software, as 
$$Z_{i,j} = \left( \left( \left( \left( QI_{i,j} \ll 4 \right) \cdot Y_{i,j} \right) \ll \lfloor QP_{sc}/6 \rfloor \right) + \left( 1 \ll 5 \right) \right) \gg 6 \tag{21}$$
where QI is the rescaling matrix defined as 
 
$$
QI =
\begin{bmatrix}
ki_0 & ki_1 & ki_2 & ki_1 & ki_0 & ki_1 & ki_2 & ki_1 \\
ki_1 & ki_3 & ki_4 & ki_3 & ki_1 & ki_3 & ki_4 & ki_3 \\
ki_2 & ki_4 & ki_5 & ki_4 & ki_2 & ki_4 & ki_5 & ki_4 \\
ki_1 & ki_3 & ki_4 & ki_3 & ki_1 & ki_3 & ki_4 & ki_3 \\
ki_0 & ki_1 & ki_2 & ki_1 & ki_0 & ki_1 & ki_2 & ki_1 \\
ki_1 & ki_3 & ki_4 & ki_3 & ki_1 & ki_3 & ki_4 & ki_3 \\
ki_2 & ki_4 & ki_5 & ki_4 & ki_2 & ki_4 & ki_5 & ki_4 \\
ki_1 & ki_3 & ki_4 & ki_3 & ki_1 & ki_3 & ki_4 & ki_3
\end{bmatrix}
\tag{22}
$$
whose elements are obtained by evaluating the expression 
$$ki_m = MI\left( \operatorname{mod}(QP_{sc}, 6),\, m \right), \quad m \in \{0, 1, 2, 3, 4, 5\} \tag{23}$$
Here, MI is the rescaling factor matrix specified as 
 
$$
MI =
\begin{bmatrix}
20 & 19 & 25 & 18 & 24 & 32 \\
22 & 21 & 28 & 19 & 26 & 35 \\
26 & 24 & 33 & 23 & 31 & 42 \\
28 & 26 & 35 & 25 & 33 & 45 \\
32 & 30 & 40 & 28 & 38 & 51 \\
36 & 34 & 46 & 32 & 43 & 58
\end{bmatrix}
\tag{24}
$$
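The rescaling of equation (21) can be illustrated with the short sketch below. The function name rescale is hypothetical, the division of QPsc by 6 is assumed to be an integer (floor) division, and Python's right shift on negative values is a floor shift, which may differ from a sign-magnitude hardware realization.

```python
MI = [[20, 19, 25, 18, 24, 32],
      [22, 21, 28, 19, 26, 35],
      [26, 24, 33, 23, 31, 42],
      [28, 26, 35, 25, 33, 45],
      [32, 30, 40, 28, 38, 51],
      [36, 34, 46, 32, 43, 58]]


def rescale(y, qi, qp_sc):
    """Z = ((((QI << 4) * Y) << (QPsc // 6)) + (1 << 5)) >> 6."""
    scaled = ((qi << 4) * y) << (qp_sc // 6)
    return (scaled + (1 << 5)) >> 6


# QPsc = 25 selects row mod(25, 6) = 1 of MI; ki0 = 22 applies at (0, 0).
qi = MI[25 % 6][0]
print(rescale(11, qi, 25))    # 968
print(rescale(-11, qi, 25))   # -968
```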
4. Variable bit-depth processor for the 8×8 transform and quantization 
Fig. 2 shows the block diagram of the proposed variable bit-depth processor for real-time 
implementation of the complete process for the 8×8 transform and quantization coding in 
the H.264/AVC. This processor includes the following main modules: configurable forward 
and inverse 1D integer transform, bit-depth dependent quantization and rescaling module, 
and transpose register memory. This architecture, which fulfils the requirements of 
H.264/AVC FRExt, has been conceived to operate with different bit depths (bd), from 8 bits up to 
14 bits, with the aim of achieving high performance with a reduced-complexity hardware 
implementation. In order to provide an efficient processor, hardware solutions have been 
developed for the different circuit modules. The 8×8 forward and inverse transforms are 
calculated using the separability property simplifying its architecture to a single 
configurable 1D forward (FT)/inverse (IT) transform processor and a transpose register 
array. Forward quantization (FQ) and rescaling (IQ) operations are computed in the same 
circuit for the different bit-depth requirements. Here, new expressions are proposed 
allowing efficient hardware implementation by avoiding the sign conversion and 
minimizing the arithmetic operations involved. Furthermore, an exhaustive analysis of the 
dynamic range of the datapath was performed to fix the optimum bus widths with the aim 
of reducing the size of the circuit while avoiding overflow. Finally, the critical paths of the 
various computing units have been carefully analyzed and balanced using a pipeline scheme 
in order to maximize the operation frequency without introducing an excessive latency. 
Fig. 2. Block diagram of the variable bit-depth processor. 
This circuit processes 8 input data in parallel, starting by reading the residual luminance 
component (x) row by row until the entire 8×8 input block is read. The forward 1D 
transform module generates the intermediate coefficients p to be stored in the transpose 
register row-wise. After 8 clock cycles, these coefficients are read column-wise and 
processed again in the 1D transform module. Then, the resulting X coefficients are 
quantized column by column in parallel in the quantization and rescaling module and 
stored in the transpose register column-wise. On finishing this operation, the quantized 
coefficients (Y) are rescaled row by row and the results (Z) are sent to inverse 1D transform 
whose output data (q) are stored in the transpose register row-wise. Finally, the coefficients 
q are fetched from the transpose register column-wise to be processed in the inverse 1D 
transform to obtain the recovered residual luminance (z). 
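The forward half of this data flow can be modeled in software as a row pass, a transposition and a column pass. The sketch below is a behavioral illustration only, not a description of the circuit timing: forward_1d repeats the arithmetic of Table 1, and the helper names transpose and forward_2d are introduced for this example.

```python
def forward_1d(x):
    """Forward 1D 8-point transform (arithmetic of Table 1)."""
    a = [x[0] + x[7], x[1] + x[6], x[2] + x[5], x[3] + x[4],
         x[0] - x[7], x[1] - x[6], x[2] - x[5], x[3] - x[4]]
    b = [a[0] + a[3], a[1] + a[2], a[0] - a[3], a[1] - a[2],
         a[5] + a[6] + ((a[4] >> 1) + a[4]),
         a[4] - a[7] - ((a[6] >> 1) + a[6]),
         a[4] + a[7] - ((a[5] >> 1) + a[5]),
         a[5] - a[6] + ((a[7] >> 1) + a[7])]
    return [b[0] + b[1], b[4] + (b[7] >> 2), b[2] + (b[3] >> 1),
            b[5] + (b[6] >> 2), b[0] - b[1], b[6] - (b[5] >> 2),
            (b[2] >> 1) - b[3], -b[7] + (b[4] >> 2)]


def transpose(m):
    """Software stand-in for the 8x8 transpose register."""
    return [list(col) for col in zip(*m)]


def forward_2d(x):
    """Row pass, transposition, column pass: x -> p -> X."""
    p = [forward_1d(row) for row in x]              # rows of x -> rows of p
    p_columns = transpose(p)                        # read p column-wise
    x_columns = [forward_1d(col) for col in p_columns]  # columns of X
    return transpose(x_columns)


# A flat residual block concentrates all its energy in the DC coefficient.
X = forward_2d([[1] * 8 for _ in range(8)])
print(X[0][0])   # 64
```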
4.1 Forward and Inverse 8×8 transform 
The 8×8 transform proposed in FRExt for addition to the JVT specification in the 
H.264/AVC is based on the fact that at SD resolutions and above, the use of block sizes 
smaller than 8×8 is limited. One of the first papers related to this matter (Amer et al., 2005) 
described an FPGA pipelined implementation of a simplified 8×8 transform and quantization. 
Another FPGA implementation, based on an algebraic integer quantization approach to computing 
the 8×8 transform, was presented in (Wahid et al., 2006). (Silva et al., 2007) proposed 
a high-throughput architecture for the forward 8×8 transform to encode high-definition videos 
in real time, with a latency of 5 clock cycles to process the 1D transform. This architecture was 
synthesized in FPGA with a minimum period of 8.13ns and in a TSMC 0.35µm CMOS 
standard cell technology leading to a period of 8.05ns. Recently, (Park & Ogunfunmi, 2009) 
presented a reduced and parallel FPGA implementation of an 8×8 integer transform, 
quantization and scaling for H.264. Here, each pixel is processed one by one on a simplified 
pipelined architecture without multiplication. 
In the adaptive block-size transform of the FRExt, different kinds of transforms are required: 
8×8 forward/inverse transform, 4×4 forward/inverse transform, 4×4 forward/inverse 
Hadamard transform and 2×2 forward/inverse Hadamard transform. In order to reduce 
hardware, diverse configurable data-path architectures to support all of these transforms in 
a unified scheme have been proposed. Examples of this kind of architecture include 
the multi-transform processor where the quantization is performed at the pace demanded 
by the entropy coder in (Bruguera & Osorio, 2006), the low hardware cost suitable for VLSI 
implementations in (Fan, 2006), the reduced hardware and high latency in (Chao et al., 
2007), the high-performance architecture for high-definition applications in (Ma et al., 
2007), the IP design to be implemented on an ASIP-controlled SoC platform in (Ngo et al., 
2008), the high-performance, low-power unified transform architecture in (Choi et al., 2008), 
the highly parallel joint circuit architecture in (Li et al., 2008), and the fast, high-throughput 
and cost-effective implementation in (Hwangbo & Kyung, 2010). 
 
Fig. 3. Block diagram of the forward/inverse transform. The equivalent scheme is also 
shown for the forward transform (bottom-left) and inverse transform (bottom-right). 
Initially, the H.264 specification adopted a 4×4 integer transform, but significant 
compression performance gains have been reported for larger transforms at 
High-Definition (HD) resolutions. Thus, a new 8×8 integer transform was proposed in the 
Fidelity Range Extensions (FRExt) to be added to the previously existing specifications, 
which had been verified at SD resolutions. In fact, the use of block sizes of 8×8 and bigger is 
dominant. Following this assumption, we propose an architecture for computing the 8×8 
forward/inverse transform based on a configurable high-throughput 1D processor, which 
has been conceived to implement the arithmetic operations described in Table 1 and Table 2 
with two considerations in mind. First, to avoid mismatches between the encoder and decoder, 
there is no possible alternative in the implementation of the operations other than those 
specified in these tables, which are directly extracted from the JM reference software. 
Second, these equations share compatible arithmetic, which leads to hardware reduction if a 
configurable data-path is used. To comply with these prerequisites, the arithmetic operations 
presented in Tables 1 and 2 are implemented in terms of a three-processor architecture 
that fulfils the requirements of H.264. These processors, as shown in Fig. 3, are named 
I/O, even and odd. The operation mode, forward (FT) or inverse (IT), is set by 
multiplexers which select the inputs and modify the inner arithmetic operations of each 
processor. The schematic at the bottom left in Fig. 3 represents the equivalent scheme for 
computing the forward 1D transform. In this configuration, the eight elements of IF are 
input to the I/O processor and its outputs run in parallel into the even and odd 
processors to generate the output OF. In the first 1D transform, the input IF takes each row 
of x and generates each row of p at the output OF according to equation (3), and in the 
second one, each column of p is processed to generate each column of X according to 
equation (4). In contrast, the schematic at the bottom right shows the equivalent scheme for 
the inverse 1D transform. The input data II are connected to the even and odd processors 
while the output data OI are generated in the I/O processor. In this configuration, the first 
inverse 1D transform processes each row of Z, generating each row of q at the output OI 
according to equation (9), and in the second one q is read column by column, generating each 
column of z according to equation (10).  
Fig. 4 shows the data-path of the processors I/O, even and odd. The I/O processor 
implements the arithmetic operations involved in T1 (Stage 1 in Table 1) and in G3 (Stage 3 
in Table 2). It is exclusively made up of adders and subtractors where the inputs are 
properly arranged depending on the operation mode: forward or inverse. Nonetheless, the 
operations of T2, G2, T3 and G1 are split up into two processors (even and odd) aiming for 
the maximum compatibility. As a result, the arithmetic of the even processor varies 
depending on the operation mode as 
 
 
 
 
 
 
$$
\begin{array}{ll}
\text{Forward even processor} & \text{Inverse even processor} \\
OF_0 = (a_0 + a_3) + (a_1 + a_2) & ib_0 = (II_0 + II_4) + (II_2 + (II_6 \gg 1)) \\
OF_2 = (a_0 - a_3) + ((a_1 - a_2) \gg 1) & ib_2 = (II_0 - II_4) + ((II_2 \gg 1) - II_6) \\
OF_4 = (a_0 + a_3) - (a_1 + a_2) & ib_4 = (II_0 - II_4) - ((II_2 \gg 1) - II_6) \\
OF_6 = ((a_0 - a_3) \gg 1) - (a_1 - a_2) & ib_6 = (II_0 + II_4) - (II_2 + (II_6 \gg 1))
\end{array}
\tag{25}
$$
 
This means that this processor is configurable by means of multiplexers used to modify the 
data path according to the operation mode. In a similar way, the odd processor implements 
the following equations 
 
    
    
    
    
$$
\text{Forward odd processor:} \quad
\begin{array}{ll}
b_4 = a_5 + a_6 + (a_4 + (a_4 \gg 1)); & OF_1 = b_4 + (b_7 \gg 2) \\
b_5 = a_4 - a_7 - (a_6 + (a_6 \gg 1)); & OF_3 = b_5 + (b_6 \gg 2) \\
b_6 = a_4 + a_7 - (a_5 + (a_5 \gg 1)); & OF_5 = b_6 - (b_5 \gg 2) \\
b_7 = a_5 - a_6 + (a_7 + (a_7 \gg 1)); & OF_7 = -b_7 + (b_4 \gg 2)
\end{array}
\tag{26}
$$
 
    
    
    
    
$$
\text{Inverse odd processor:} \quad
\begin{array}{ll}
ia_1 = II_5 - II_3 - II_7 - (II_7 \gg 1); & ib_1 = ia_1 + (ia_7 \gg 2) \\
ia_3 = II_1 + II_7 - II_3 - (II_3 \gg 1); & ib_3 = ia_3 + (ia_5 \gg 2) \\
ia_5 = II_7 - II_1 + II_5 + (II_5 \gg 1); & ib_5 = -ia_5 + (ia_3 \gg 2) \\
ia_7 = II_3 + II_5 + II_1 + (II_1 \gg 1); & ib_7 = ia_7 - (ia_1 \gg 2)
\end{array}
\tag{27}
$$
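The way a single data-path can serve both modes is easiest to see for the even processor of equation (25). The following behavioral sketch shares the first sum and difference between the two modes and switches only the remaining operations with a mode flag. The function name even_processor and the argument ordering are assumptions of this example, not the multiplexer structure of the actual circuit.

```python
def even_processor(i0, i1, i2, i3, inverse):
    """Even processor of equation (25).

    Forward mode: inputs are (a0, a3, a1, a2), outputs (OF0, OF2, OF4, OF6).
    Inverse mode: inputs are (II0, II4, II2, II6), outputs (ib0, ib2, ib4, ib6).
    """
    s = i0 + i1          # shared adder: a0 + a3 or II0 + II4
    d = i0 - i1          # shared subtractor: a0 - a3 or II0 - II4
    if not inverse:
        u = i2 + i3      # a1 + a2
        v = i2 - i3      # a1 - a2
        return s + u, d + (v >> 1), s - u, (d >> 1) - v
    u = i2 + (i3 >> 1)   # II2 + (II6 >> 1)
    v = (i2 >> 1) - i3   # (II2 >> 1) - II6
    return s + u, d + v, d - v, s - u


print(even_processor(4, 2, 6, 2, inverse=False))   # (14, 4, -2, -3)
print(even_processor(0, 0, 8, 4, inverse=True))    # (10, 0, 0, -10)
```

In both modes the processor needs the same number of adders; only the position of the right shifts changes, which is what makes a multiplexer-based configuration inexpensive.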
 
Fig. 4. Schematic of the processors shown in Fig. 3. 
The entire circuit to work out the 1D transform takes a total of 32 additions/subtractions 
and 10 right-shifts that are built by means of data-bus wiring (no additional hardware is 
necessary). To prevent overflow in the computing of the transform, we consider the biggest 
bit-depth of 14 bits for each luminance sample; this means an unsigned integer number from 
0 to 16383. However, this processor operates with the residual luminance, whose value lies in the range 
±16383, 15 bits being necessary for its representation. If k represents the input bus width, 
then k=15 bits for the first forward 1D transform and k=18 for the second one. The 
intermediate data a0 to a7 must be of k+1 bits, b0 to b3 of k+2, b4 to b7 of k+3, and, finally, the output 
data of k+3. The range of the coefficients is ±16383·8=±131064 (18 bits) for the first 1D 
transform, and ±131064·8=±1048512 (21 bits) for the second one. However, the quantization 
and scaling process increases the data-path by 1 bit, giving input data of 22 bits before 
calculating the inverse 8×8 transform; this bit width is what limits the data-path of the 
whole transform module to prevent overflow. This means that all arithmetic in the forward 
and inverse 1D transform module is performed in 22 bits and the latency is 2 clock cycles. 
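The bus widths quoted above follow from the worst-case magnitudes. The short sketch below recomputes them, taking a gain of 8 per 1D pass as stated in the text; the helper name signed_bits is introduced for this example.

```python
def signed_bits(max_abs):
    """Smallest two's-complement width that holds [-max_abs, +max_abs]."""
    return max_abs.bit_length() + 1


bd = 14
residual_max = (1 << bd) - 1           # 16383
first_pass_max = residual_max * 8      # 131064
second_pass_max = first_pass_max * 8   # 1048512

print(signed_bits(residual_max))       # 15
print(signed_bits(first_pass_max))     # 18
print(signed_bits(second_pass_max))    # 21
```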
4.2 Transpose register array 
The transpose memory stores 8×8 data and allows simultaneous read and write operations 
while doing matrix transposition. To achieve this, the 8 input data are read out of the buffer 
column-wise if the previous intermediate data were written into the buffer row-wise, and 
vice versa. The transpose buffer based on D-type flip-flops (DFF) (Zhang & Meng, 2009) has 
been chosen as it is more suitable for pipeline architectures, unlike other proposed 
architectures based on RAM memories. Indeed, solutions based on a single RAM (Do & Le, 
2010) lead to high latency, while those based on duplication of the RAMs (one for processing 
columns and the other for rows) have a high area cost (Ruiz & Michell, 1998), and those 
based on banks of SRAMs have a high cost in area (Bojnordi et al., 2006) or in alignment 
modules (Li et al., 2008). 
 
[Figure 5: 8×8 array of registers with eight parallel inputs (inp0 to inp7) and eight parallel outputs (out0 to out7).]
Fig. 5. 8×8 transpose register array. 
Fig. 5 shows the schematic of an 8×8 transpose register array with 22-bit elements, whose 
basic cell is a DFF and a multiplexer. The DFFs of the array are interconnected via 2:1 
multiplexers forming 8 shift-registers of length 8 either in the horizontal direction (columns) 
or in the vertical direction (rows). A selection signal controls the direction of shift in the 
registers. The loading and shifting mode in the buffer alternates each time a new block of 
input data is processed: the even (odd) 8×8 block is stored by columns (rows) in the buffer. 
As a result, the transpose buffer has a parallel input/output structure and the data are 
transposed on the fly supporting a continuous data flow with the smallest possible size and 
minimal latency (8 clock cycles). 
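The behaviour of such a buffer can be illustrated with a small cycle-level model in Python. This is a behavioural sketch under stated assumptions, not the circuit of Fig. 5: the class name TransposeRegisterArray, the method clock and the choice of which edge of the array receives the input are all hypothetical. Each call to clock models one clock cycle in which 8 values are shifted in and 8 values are shifted out, and the shift direction toggles after every 8 cycles.

```python
class TransposeRegisterArray:
    """Behavioural model of an n x n register array with switchable shift direction."""

    def __init__(self, n: int = 8):
        self.n = n
        self.regs = [[0] * n for _ in range(n)]
        self.horizontal = True  # current shift direction
        self.count = 0          # cycles spent on the current block

    def clock(self, vec):
        """One clock cycle: read n values from one edge, shift, load n new values."""
        n = self.n
        if self.horizontal:
            out = [self.regs[r][0] for r in range(n)]          # leftmost column leaves
            for r in range(n):
                self.regs[r] = self.regs[r][1:] + [vec[r]]     # shift left, load right
        else:
            out = list(self.regs[0])                           # top row leaves
            self.regs = self.regs[1:] + [list(vec)]            # shift up, load bottom
        self.count += 1
        if self.count == n:                                    # block done: swap direction
            self.count = 0
            self.horizontal = not self.horizontal
        return out


def transpose_stream(blocks):
    """Feed blocks back to back and collect the transposed blocks that come out."""
    tra = TransposeRegisterArray()
    n = tra.n
    stream = [row for blk in blocks for row in blk] + [[0] * n] * n  # flush cycles
    outs = [tra.clock(v) for v in stream]
    return [outs[n * (i + 1): n * (i + 2)] for i in range(len(blocks))]


A = [[8 * r + c for c in range(8)] for r in range(8)]
B = [[100 + 8 * r + c for c in range(8)] for r in range(8)]
result = transpose_stream([A, B])
```

In this model the first transposed vector of a block leaves the array 8 cycles after its first input vector entered, and a new block can be loaded while the previous one is being read out, which is the continuous data flow described above.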
4.3 Quantization and rescaling 
H.264 assumes a scalar quantizer avoiding division and/or floating point arithmetic. Most 
of the proposed quantization and rescaling hardware solutions attempt to directly 
implement the expressions defined in the standard, but only a few facilitate its 
implementation. Moreover, all of them work in 8-bit bit-depth and further bits are not 
considered. (Amer et al., 2005) presented a simple forward quantizer FPGA design to be run 
on a Digital Signal Processor. (Wahid et al., 2006) proposed an Algebraic Integer 
Quantization to reduce the complexity of the quantization and rescaling parameters 
required for the H.264. The architecture described by (Bruguera and Osorio, 2006) is based 
on a prediction scheme that allows parallel quantization by detecting zero coefficients to 
facilitate the entropy encoding. In (Chungan et al., 2007), the multiplier and RAM/ROM 
were removed by using a scheme of 16 parallel shift-adders. An inverse quantizer based on a 
6-stage pipelined dual-issue VLIW-SIMD architecture was proposed in (Lee, J.J. et al., 2008). 
(Pastuszak, 2008) presented an architecture in an FPGA capable of processing up to 32 
coefficients per clock cycle. (Lee & Cho, 2008) proposed a scheme to be applied in several 
video compression standards such as JPEG, MPEG-1/2/4, H.264 and VC-1 where only one 
multiplier is used to minimize circuit size. A simplification of the quantization process to 
reduce overhead logic by removing absolute values leads to a decrease of around 20% in 
power consumption (Owaida et al., 2009). Another simplification consists of replacing the 
multiplier with adders and shifters to reduce hardware (Park & Ogunfunmi, 2009). An 
inverse quantization that adopts three kinds of inverse quantizers based on prediction 
modes and coefficients used in an H.264/AVC decoder was presented in (Chao et al., 2009). 
(Husemann et al., 2010) proposed a four forward parallel quantizer architecture 
implemented in a commercial FPGA board.  
We propose a single circuit to compute the forward quantization and rescaling for different 
bit-depth requirements. In both procedures, multiplication, addition and shifting operations 
are involved and a configurable architecture enables the same module to perform all the 
specific operations in order to save hardware. The forward quantization (FQ) operates, cycle 
by cycle, on the coefficients of each column of the forward 8×8 transform (X) and the 
quantized coefficients (Y) are generated according to what is established in equation (14). In 
this equation, the modulus (absolute value) operation is necessary because the arithmetic 
operation ">>qbits" is meant to perform an integer division with truncation of the result 
toward zero, whereas an arithmetic right shift of a two's-complement number rounds toward 
minus infinity, which causes errors for Xi,j<0. For example, the integer −3 in a 4-bit 
two's-complement representation is 1101. The operation −3>>2 should be 0, but 1101>>2 gives 
1111, that is, −1. To resolve this error, (1<<n)−1 must be added to the negative number, where 
n is the number of right shifts. Thus, (1101+(1<<2)−1)>>2 is 0. Applying this procedure, the 
absolute value |Xi,j| can be eliminated from equation (14) by assigning lev_off the same sign 
as Xi,j. To do this, a term (1<<qbits)−1 must be added when Xi,j<0. Then, equation (14) can be 
directly implemented as follows 
$$Y_{i,j} = \left(QF_{i,j} \cdot X_{i,j} + lev\right) \gg qbits \qquad (28)$$
where 
$$lev = \begin{cases} lev\_off(+) = lev\_off, & \text{for } X_{i,j} \ge 0 \\ lev\_off(-) = (1 \ll qbits) - 1 - lev\_off, & \text{for } X_{i,j} < 0 \end{cases} \qquad (29)$$
Therefore, |Xi,j| and a subsequent sign conversion are no longer necessary in equation (28), 
which leads to a more efficient hardware implementation than that directly derived from 
equation (14). The design to implement equation (28) must be able to manage up to 14-bit 
depth, that is bd=14. In this case, equation (16) shows that QPsc varies from 36 to 87 as QP 
does from 0 to 51, and qbits from 22 to 30 according to equation (15). From equations (17) 
and (29), lev_off(+) for intra mode varies from 1396736 to 357564416, lev_off(−) for intra 
mode from 2797567 to 716177407, lev_off(+) for inter mode from 700416 to 179306496 and 
lev_off(−) for inter mode from 3493887 to 894435327. These bounds fix the bit width of lev to 
30 bits. Table 3 depicts the definition of lev according to the sign of Xi,j and whether intra is 0 
or 1, which can be easily implemented by using basic logic and shift operations. 
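The equivalence between the sign-magnitude form and the sign-free form of equation (28) can be checked numerically. The Python sketch below rests on several assumptions that are not taken verbatim from the text: equation (14) is assumed to have the usual form |Y| = (|X|·QF + lev_off) >> qbits with the sign of X restored afterwards, qbits is assumed to be 16 + QPsc/6 (which reproduces the stated range 22 to 30), lev_off follows the expressions of Table 3, and the multiplication factors are random placeholders rather than the QF values of the standard. All function names are hypothetical.

```python
import random


def lev_off(qp_sc: int, intra: bool) -> int:
    # Offsets as listed in Table 3: 682 (intra) or 342 (inter), shifted by 5 + QPsc/6.
    return (682 if intra else 342) << (5 + qp_sc // 6)


def qbits_of(qp_sc: int) -> int:
    # Assumption: consistent with qbits going from 22 to 30 for QPsc from 36 to 87.
    return 16 + qp_sc // 6


def fq_reference(x: int, qf: int, qp_sc: int, intra: bool) -> int:
    """Sign-magnitude form: quantize |x|, then restore the sign."""
    mag = (abs(x) * qf + lev_off(qp_sc, intra)) >> qbits_of(qp_sc)
    return -mag if x < 0 else mag


def fq_sign_free(x: int, qf: int, qp_sc: int, intra: bool) -> int:
    """Equations (28) and (29): no absolute value, lev chosen by the sign of x."""
    qbits = qbits_of(qp_sc)
    off = lev_off(qp_sc, intra)
    lev = off if x >= 0 else (1 << qbits) - 1 - off
    return (qf * x + lev) >> qbits  # Python's >> is an arithmetic (floor) shift


def check_equivalence(trials: int = 20000, seed: int = 0) -> bool:
    rng = random.Random(seed)
    for _ in range(trials):
        x = rng.randint(-(1 << 21), (1 << 21) - 1)   # 22-bit signed coefficient
        qf = rng.randint(0, (1 << 15) - 1)           # placeholder factor
        qp_sc = rng.randint(36, 87)
        intra = rng.random() < 0.5
        if fq_sign_free(x, qf, qp_sc, intra) != fq_reference(x, qf, qp_sc, intra):
            return False
    return True


print((-3) >> 2, (-3 + (1 << 2) - 1) >> 2)  # the 4-bit example from the text
print(check_equivalence())
```

The first printed pair reproduces the 4-bit example (the uncorrected shift versus the corrected one), and the random check compares both forms over the whole QPsc range for bd=14.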
 
| intra | Xi,j | lev | Binary representation (30 bits: sign extension, 10 bits, 6+QPSC/6 bits) |
|---|---|---|---|
| 1 | ≥0 | 682<<(5+QPSC/6) | 0...0 0101010101 0...0 |
| 1 | <0 | −682<<(5+QPSC/6)+(1<<qbits)−1 | 0...0 1010101010 1...1 |
| 0 | ≥0 | 342<<(5+QPSC/6) | 0...0 0010101011 0...0 |
| 0 | <0 | −342<<(5+QPSC/6)+(1<<qbits)−1 | 0...0 1101010100 1...1 |

Table 3. Definition of lev. 
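The bounds and bit patterns of lev can be reproduced with a short Python sketch. It assumes qbits = 16 + QPsc/6 (chosen because it matches the stated range of 22 to 30) and uses the expressions of Table 3; the function names lev and lev_word are hypothetical.

```python
def lev(qp_sc: int, intra: bool, negative: bool) -> int:
    """lev as defined in Table 3 for a given QPsc, mode and sign of the coefficient."""
    qpper = qp_sc // 6
    qbits = 16 + qpper                      # assumption, see lead-in
    off = (682 if intra else 342) << (5 + qpper)
    return (1 << qbits) - 1 - off if negative else off


def lev_word(qp_sc: int, intra: bool, negative: bool) -> str:
    """30-bit binary string of lev, most significant bit first."""
    return format(lev(qp_sc, intra, negative), "030b")


# Extremes of QPsc for bd=14 (QP from 0 to 51).
for intra in (True, False):
    for negative in (False, True):
        lo, hi = lev(36, intra, negative), lev(87, intra, negative)
        print(intra, negative, lo, hi, hi.bit_length())

# Bit pattern for QPsc=36: 8 sign-extension bits, 10-bit field, 12 low bits.
print(lev_word(36, True, False))
print(lev_word(36, True, True))
```

For negative coefficients the low field is all ones and the 10-bit field is the bitwise complement of the positive case, which is why the selection can be built from simple logic and shifts.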
The inverse quantization (IQ) or rescaling specified in (21) can be simplified if this equation 
is rewritten as follows 
$$Z_{i,j} = \left(\left(\left(QI_{i,j} \cdot Y_{i,j}\right) \ll \lfloor QP_{sc}/6 \rfloor\right) + 2\right) \gg 2 \qquad (30)$$
Equations (28) and (30) are hardware compatible as they share the same basic arithmetic 
operations. Fig. 6.a shows the block diagram of the quantizer and rescaling module that is 
capable of processing 8 coefficients in parallel. It is composed of a control circuit and an 8-
way data-path based on a configurable arithmetic unit. The control circuit generates the 
intermediate parameters needed for the forward quantization or rescaling mode, all of these 
are obtained from the scaled compression factor (QPsc), the intra value (intra), the operation 
mode (FQ/IQ) and the operation synchronization (init). These parameters are: lev(+) and 
lev(−), {kn, ko, kp}, qbits and qpper, the latter defined as 
$$qpper = \lfloor QP_{sc}/6 \rfloor \qquad (31)$$
The three coefficients {kn, ko, kp} represent either the quantization multiplication factors 
kfm∈QFi,j specified in equations (18), (19) and (20) or the rescaling multiplication factors 
kim∈QIi,j defined in equations (22), (23) and (24). The indexes {n,o,p} take one of these 
possible sets of values: {0, 1, 2}, {1, 3, 4} or {2, 4, 5}. Only three coefficients need to be generated for 
the 8 arithmetic units because each row or column of the matrix QF in (18) or the matrix QI 
in (22) is composed of three different coefficients. All coefficients are read in a look-up table 
depending on the operation mode and the value of QPsc. 
 
[Figure 6: a) a CONTROL block with inputs QPSC, intra, init and FQ/IQ generates lev(+), lev(−), qbits, qpper and the multiplication coefficients for eight ARITHMETIC UNIT blocks, each with a 22-bit input (Xn,j or Yi,n) and a 22-bit output (Yn,j or Zi,n); b) each arithmetic unit contains a 4-stage pipeline multiplier, a <<qpper shifter, an adder fed by lev(+), lev(−) or the constant 2, and a final >> shifter by 2 or qbits, with multiplexers controlled by FQ/IQ and by the delayed sign of Xi,j.]
Fig. 6. Configurable forward quantizer and scaling module: a) Block diagram, and b) 
Schematic of the arithmetic unit. 
Fig. 6.b shows a more detailed description of the configurable arithmetic unit. The main 
arithmetic elements are a multiplier and an adder, and multiplexers and additional logic are 
used to configure the implementation of equations (28) and (30). The multiplier has a high 
area cost and delay, so some papers (Michael & Hsu, 2008) (Zhang et al., 2009) have 
proposed replacing it with a reduced number of shifts and additions by modifying the QF 
factors to be more suitable for hardware optimization. However, they introduce an error 
between the quantization and the inverse quantization which leads to a reduction of the 
rate-distortion performance. In order to avoid mismatching between encoder and decoder, 
in our approach an implementation of the whole multiplier is selected, with a pipeline 
strategy to increase its speed. After an exhaustive analysis, a Wallace-tree 4-stage pipeline 
multiplier was demonstrated to be the optimal solution to balance the critical path of the 
multiplier with the critical path of the rest of the circuit. In the FQ mode, first the inputs 
Xi,j and QFi,j are multiplied. A multiplexer selects the factor lev(+) or lev(−) to be added to 
the output of the multiplier depending on the sign of Xi,j. Here, a delay of 4 clock cycles is 
introduced in the sign(Xi,j) signal to compensate for the delay in the multiplier. At the output 
of the adder, a right shift (>>) by qbits is performed to obtain the quantized coefficient 
Yi,j. In the IQ mode, the inputs Yi,j and QIi,j are multiplied and the product is shifted left by 
qpper. A constant 2 is added to the result and the last >>2 operation generates the scaled 
coefficients Zi,j. 
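The way one data-path serves both modes can be sketched in Python as a purely functional model. It ignores the pipeline registers and the 22-bit and 30-bit bus limits, and it assumes qbits = 16 + QPsc/6 together with the lev definition of Table 3. The names control and arithmetic_unit, the mode constants and the factor values used in the example calls are hypothetical placeholders, not values from the standard.

```python
FQ, IQ = 0, 1  # operation modes of the configurable unit


def control(qp_sc: int, intra: bool) -> dict:
    """Parameters produced once per block by the control circuit (simplified)."""
    qpper = qp_sc // 6
    qbits = 16 + qpper                                # assumption, see lead-in
    off = (682 if intra else 342) << (5 + qpper)
    return {
        "qpper": qpper,
        "qbits": qbits,
        "lev_pos": off,                               # lev(+)
        "lev_neg": (1 << qbits) - 1 - off,            # lev(-)
    }


def arithmetic_unit(mode: int, data: int, k: int, p: dict) -> int:
    """One multiplier and one adder shared by equations (28) and (30)."""
    product = data * k
    if mode == IQ:
        shifted = product << p["qpper"]               # pre-shift only in IQ mode
        addend, shift = 2, 2
    else:
        shifted = product
        addend = p["lev_pos"] if data >= 0 else p["lev_neg"]
        shift = p["qbits"]
    return (shifted + addend) >> shift                # arithmetic right shift


params = control(qp_sc=36, intra=True)
y_pos = arithmetic_unit(FQ, 100000, 1000, params)
y_neg = arithmetic_unit(FQ, -100000, 1000, params)
z_pos = arithmetic_unit(IQ, 24, 10, params)
z_neg = arithmetic_unit(IQ, -24, 10, params)
print(y_pos, y_neg, z_pos, z_neg)
```

Only the addend, the optional pre-shift and the final shift amount change between the two modes, which is what the multiplexers of the arithmetic unit select.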
5. ASIC implementation and comparisons 
A prototype of the proposed bit-depth processor has been designed and verified using 
different abstraction levels. Fig. 7 presents the simulation environment used to verify the 
functional behavior of the proposed architecture by comparing the data processed with 
those provided by the JM reference software (Sühring, 2010) for different data blocks of 
input residual luminance. The results of the diverse comparisons performed between the 
simulation and the reference software indicate that there are no differences between them. 
Initially, the processor was designed using the CoWare® Signal Processing Worksystem 
(SPW), editing the block diagram with the elements of the Hardware Design System (HDS) 
library. The first test bench was made by simulating the design with Simulation Program 
Builder-Interpreted (SPB-I). The code description in Verilog-RTL was automatically 
generated by the Verilog RTL Link from the HDS library. A new comparison was performed 
at this abstraction level to guarantee the correct description of the generated code. Finally, 
this Verilog description was synthesized using the Synopsys Design Compiler under 
HCMOS9 STMicroelectronics 130nm standard cell technology. The resulting circuit contains 
26.5k cells with an area of 625700 µm² and the estimated maximum operating frequency is 
330 MHz. After the logic synthesis, the PrimePower™ tool was applied to estimate the 
power consumption, giving 120mW@330MHz (VDD=1.2V). The data throughput is 2640 
Mpixels per second. This provides enough processing capacity for 1080HD 
(1920×1088@30fps) real-time video streams. 
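The throughput figure follows from the clock frequency and the 8 samples handled per cycle, as the short Python calculation below illustrates. The comparison counts luminance samples only, which is an assumption made here for simplicity; the variable names are hypothetical.

```python
clock_mhz = 330            # estimated maximum operating frequency
samples_per_cycle = 8      # width of the parallel data-path

throughput_mpixels = clock_mhz * samples_per_cycle

# Luminance sample rate of a 1920x1088 stream at 30 frames per second.
luma_rate_mpixels = 1920 * 1088 * 30 / 1e6

print(throughput_mpixels)
print(luma_rate_mpixels)
print(throughput_mpixels / luma_rate_mpixels)  # headroom factor
```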
With the proposed architecture, each 8×8 block input data is processed with a latency of 44 
clock cycles according to the time scheduling described in Fig. 8. BUSA indicates the output 
of the transform module, BUSB the output of quantization and scaling module, and IN and 
OUT are the input and output of the transpose register (TR); all these signals are depicted in 
Fig. 2. On inputting luma (x), it takes 3 clock cycles to generate the coefficients (p) and the 
output coefficients (X) are obtained from the 13th clock cycle. These coefficients go to the 
quantization module and the “quantized” coefficients (Y), which are generated from the 18th 
clock cycle, are stored in the transpose register. In the rescaling process, the data Y are read 
in transpose order to compute the “rescaled” coefficients Z from the 31st clock cycle. On 
processing these coefficients in the 1D transform module, the intermediate data q are 
obtained in the 34th clock cycle. Finally, the recovered residual luminance (z) is ready to be 
processed from the 44th clock cycle and the next luma block can be input in the 49th clock 
cycle.  
For comparison purposes, Table 4 shows the characteristics and the performances of 
previously published ASIC implementations, although some of them only implement parts 
 
[Figure 7: the test stimuli feed both the JM software and the data flow design of the bit-depth processor (CoWare SPW, Verilog RTL, synthesis with standard cells); the data processed by both paths are compared in the test bench.]
Fig. 7. Block diagram for functional verification of the proposed bit-depth processor. 
 
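[Figure 8: timing diagram over clock cycles 0 to 55 showing the luma input (x), BUSA (p, X, q, z), BUSB (Y, Z) and the IN and OUT signals of the transpose register (TR) during the forward 8×8 transform, FQ, IQ and inverse 8×8 transform, followed by the input of the next block.]
Fig. 8. Time scheduling. 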
of the H.264/AVC transform coding process. In (Fan, 2006), a cost effective architecture for 
fast (1-D) 4×4 and 8×8 forward/inverse transform was derived through the Kronecker and 
direct sum operations. The configurable architecture presented in (Li et al., 2008) supports 
the six kinds of 4×4 transforms required in the adaptive block-size transform of H.264 in 
order to more efficiently reuse the data-path; in this architecture, one 8×8 transform can be 
finished within 16 clock cycles. Based on this reusability property, another unified 4×4 and 
8×8 transform architecture is proposed in (Choi et al., 2008). To increase its throughput, 4 
units operate in parallel and only 5 clock cycles are needed to perform an 8×8 transform. 
Its low power consumption is due to the quite low operating speed of the circuit (27MHz). A 
pipelined 8×8 2D forward transform architecture capable of consuming and producing one 
sample per clock cycle is proposed in (Silva et al., 2007). It uses two 1-D transform 
processors and a transpose RAM, with a latency of 144 clock cycles. The high-throughput and 
cost-effective implementation of six different integer transforms is proposed in (Hwangbo & 
Kyung, 2010). This implementation maximizes the shared hardware and it is able to process 
64 input pixels in a two-stage pipelined architecture to compute the direct 8×8 transform or 
two 4×4 transforms in parallel. Another flexible architecture is presented in (Chao et al., 
2007), which is suitable for an H.264 high profile decoder capable of processing a macroblock 
in 95 clock cycles with the 8×8 inverse transform or only 54 clock cycles without it. The 
architecture described in (Lee & Cho, 2008) performs the forward 4×4 and 8×8 transform 
 
| Ref. | Transform type | Transform size | FQ / IQ | bd | Techn. (µm) | Area (gates) | Speed (MHz) | Throughput (Mpixel/s) | Power |
|---|---|---|---|---|---|---|---|---|---|
| (Fan, 2006) | FWD, INV (1-D) | 4, 8 | no / no | 8 | TSMC 0.18 | 6.5k | 125 | 1000 | 2.5mW@62.5MHz |
| (Li et al., 2008) | FWD, INV | 4×4, 8×8 | no / no | 8 | UMC 0.18 | 13.6k + RAM | 200 | 800 | N/A |
| (Choi et al., 2008) | FWD | 4×4, 8×8 | no / no | 8 | AMS 0.35 | 27k | 27 | 346 | 9.78mW@27MHz |
| (Silva et al., 2007) | FWD | 8×8 | no / no | 8 | TSMC 0.35 | 33.9k | 125 | 124 | N/A |
| (Chao et al., 2007) | INV | 4×4, 8×8 | no / no | 8 | TSMC 0.18 | 18.5k | 125 | 860 | N/A |
| (Huang et al., 2008) | FWD, INV | 4×4, 8×8 | no / no | 8 | UMC 0.18 | 39.8k (NAND2) | 200 | 400 | 38.7mW@50MHz |
| (Hwangbo & Kyung, 2010) | FWD, INV | 4×4, 8×8 | no / no | 8 | UMC 0.18 | 63.6k | 200 | 3200 (4×4), 6400 (8×8) | 86.9mW@200MHz |
| (Lee & Cho, 2008) | FWD | 4×4, 8×8 | yes / no | 8 | 0.18 | 36.6k + RAM | 103 | 412 | N/A |
| (Pastuszak, 2008) | FWD, INV | 4×4, 8×8 | yes / yes | 8 | 0.35 | 229k | 79 | 2528 | N/A |
| | | | | | 0.18 | 320k | 76 | 2432 | N/A |
| (Bruguera et al., 2006) | FWD, INV | 4×4, 8×8 | yes / yes | 8 | AMS 0.35 | 23.8k | 67 | 266 | N/A |
| (Michell et al., 2011) | FWD, INV | 8×8 | yes | 8 | STM 0.13 | 29.3k | 330 | 2640 | 147mW@330MHz |
| Ours | FWD, INV | 8×8 | yes / yes | 8 to 14 | STM 0.13 | 26.5k | 330 | 2640 | 120mW@330MHz |

Table 4. Comparison with other architectures for ASIC implementation. 
and quantization for unified standard video CODEC (JPEG, MPEG-1/2/4, H.264 and VC-1). 
A high-throughput architecture which integrates forward transform, quantization, scaling, 
inverse transform and the sample reconstruction is presented in (Pastuszak, 2008). It uses 
reconfigurable 4×4 and 8×8 transform architecture and is able to process 32 
samples/coefficients per clock cycle. The 8×8 transform is performed in only 2 clock cycles 
by processing a whole block of 64 input samples through a scheme based on eight 1-D 
transforms operating in parallel. The quantization and rescaling operate on 32 coefficients in 
each clock cycle. Although this architecture has low latency, the cost in area is 10 times more 
than in other proposed designs. In a similar way to (Li et al., 2008), a single data-path for 
implementing 4×4 and 8×8 forward and inverse transform as well as Hadamard transform 
is presented in (Bruguera et al., 2006). However, the quantization and rescaling are 
computed using only one multiplier each and they are performed at the pace demanded by 
the entropy coder. 
In a previous work (Michell et al., 2011), we described a parallel architecture capable of 
processing 8×8 blocks without interruption with a bit-depth fixed to 8 bits. Its latency of 38 
clock cycles is achieved by implementing each module used in the transform coding in a 
pipeline scheme. By contrast, the processor presented here uses a configurable architecture 
based on the reuse of different variable bit-depth modules to reduce hardware and power, 
all of this with a latency of 44 clock cycles. It has been designed attempting to achieve the 
maximum throughput at the highest possible speed. To achieve these goals, the pipeline 
stages have been balanced during the synthesis to maintain the critical path equivalent to 2 
adders as a limit, independently of the technology used. Other challenges were the 
hardware-efficient modifications in the quantization and rescaling module to reduce the 
arithmetic complexity, combined with balanced pipelined multipliers (the multiplier being 
the most complex arithmetic component), to attain the high performance parameters. 
According to the results shown in Table 4, our design is the fastest. Its high throughput is 
only surpassed 
by that in (Hwangbo & Kyung, 2010), which processes 16 and 32 input samples in 
comparison with 8 in our design, but that scheme has a large area cost despite the fact that it 
only implements the direct transform without quantization and rescaling. The design 
proposed in (Bruguera et al., 2006) has fewer gates than ours but the quite low speed 
(67MHz) reduces the throughput to 266Mpixels/s. By observing the differences in the speed 
and throughput achieved by our processor, we can conclude that these differences cannot 
be attributed only to the technology used, but are also a consequence of the hardware 
modifications introduced in our design.  
6. Conclusions 
In July 2004, a new amendment called Fidelity Range Extensions (FRExt) was added to the 
H.264/AVC as a standardization initiative motivated by the rapidly growing demands 
focusing on professional applications and high-definition videos. Improvements present in 
FRExt include a new 8×8 integer transform, a variety of chroma sub-sampling formats and 
a greater colour bit-depth ranging from 8-bit up to 14-bit. Increasing bit depth provides 
improved accuracy in the coding efficiency with a reduction of noise and artifacts. Indeed, 
bit-depth scalability is potentially useful as, in a foreseeable future where different bit-
depths will simultaneously coexist in the market, it provides multiple representations of 
different bit-depths for the same visual content. 
This chapter presents a variable bit-depth processor with pipeline architecture for real-time 
implementation of the complete process for the 8×8 transform and quantization coding in 
the H.264/AVC. This architecture has been conceived with the aim of achieving a high 
operation frequency and high throughput without increasing the hardware complexity. 
Initially, the mathematical expressions of the 8×8 transform and quantization used in the 
standard H.264/AVC are presented to facilitate the readers’ understanding of this matter. A 
review of the state-of-the-art of the previous implementations and references is also included; 
here, special emphasis is given to describing the effect of the bit-depth in quantization and 
rescaling formulas. However, most hardware implementations only operate in 8 bits and 
further bit-depths have not been taken into account. In order to achieve an efficient 
implementation of the processor, hardware solutions have been developed for the different 
circuit modules. A configurable forward and inverse 1D processor and a transpose register 
array enable an efficient hardware computation of the 8×8 transform. Forward quantization 
and rescaling operations are computed in the same circuit for different bit-depth 
requirements and new expressions are included enabling efficient hardware implementation 
by minimizing the arithmetic operations involved. Finally, the critical paths of the distinct 
computing units have been carefully analyzed and balanced using a pipeline scheme in 
order to maximize the operation frequency without introducing an excessive latency. A 
prototype with the proposed architecture has been synthesized in a 130nm HCMOS 
technology process which achieves a maximum speed of 330 MHz. The throughput of 2640 
Mpixels/s allows real-time video streams of 1080HD (1920×1088@30fps) to be processed. 
7. Acknowledgment 
We wish to acknowledge the financial help of the Spanish Ministry of Education and Science 
through TEC2006-12438/TCM received to support this work. 
8. References 
Amer, W.; Badawy, G. & Jullien, G. (2005). A high-performance hardware implementation 
of the H.264 simplified 8×8 transformation and quantization. IEEE International 
Conference on Acoustics, Speech, and Signal Processing, Vol.2, pp. II-1137 - II-1140, 
(March 2005), doi: 10.1109/ICASSP.2005.1415610, ISBN: 0-7803-8874-7 
Bojnordi, M.N.; Sedaghati-Mokhtari, N.; Fatemi, O. & Hashemi, M.R. (2006). An efficient 
self-transposing memory structure for 32-bit video processors. IEEE Asia Pacific 
Conference on Circuits and Systems (APCCAS), pp. 1438–1441, doi: 
10.1109/APCCAS.2006.342472, ISBN: 1-4244-0387-1 
Bruguera, J.D. & Osorio, R.R. (2006). A unified architecture for H.264 multiple block-size 
DCT with fast and low cost quantization. Proceedings of the 9th EUROMICRO 
Conference on Digital System Design, pp. 407-414, (October 2006), doi: 
10.1109/DSD.2006.18, ISBN: 0-7695-2609-8 
Chao, T.C.; Tsai, H.H.; Lin, Y.H.; Yang, J.F. & Liu, B.D. (2007). A novel design for computing 
of all transforms in H.264/AVC decoders. IEEE International Conference on 
Multimedia and Expo, pp. 1914-1917, (July 2007), doi: 10.1109/ICME.2007.4285050, 
ISBN: 1-4244-1016-9 
Chao, Y.C.; Wei, S.T.; Liu, B.D. & Yang, J.F. (2009). Combined CAVLC decoder, inverse 
quantizer, and transform kernel in compact H.264/AVC decoder. IEEE 
Transactions on Circuits and Systems for Video Technology, Vol.19, No.1, pp. 53-
62, (January 2009), doi: 10.1109/TCSVT.2008.2009251, ISSN: 1051-8215 
Cheng, C.H.; Au, O.C.; Liu, C.H. & Yip, K.Y. (2009). IEEE International Symposium on Circuits 
and Systems (ISCAS 2009), pp. 944-947, doi: 10.1109/ISCAS.2009.5117913, ISBN: 978-
1-4244-3827-3 
Chiang, J.C. & Kuo, W. T. (2009). Bit-depth scalable video coding using inter-layer 
prediction from high bit-depth layer. IEEE International Conference on Acoustics, 
Speech and Signal Processing (ICASSP 2009), pp. 649-652, doi: 
10.1109/ICASSP.2009.4959667, ISBN: 978-1-4244-2353-8 
Choi, W.; Park, J. & Lee, S. (2008). A high-performance & low-power unified 4×4 / 8×8 
transform architecture for the H.264/AVC Codec. 23rd International Conference 
Image and Vision Computing, pp. 1-6, (November 2008), doi: 
10.1109/IVCNZ.2008.4762099, ISBN: 9781424437801 
Chujoh, T. & Noda, R. (2007a). Internal bit depth increase for coding efficiency. Joint Video 
Team, Doc. VCEG-AE13.doc. Available from 
 http://wftp3.itu.int/av-arch/video-site/0701_Mar/VCEG-AE13.zip 
Chujoh, T. & Noda, R. (2007b). Internal bit depth increase except frame memory. Joint Video 
Team, Doc. VCEG-AF07.doc. Available from 
 http://wftp3.itu.int/av-arch/video-site/0704_San/VCEG-AF07.zip 
Chungan, P.; Dunshan, Y.; Xixin, C. & Shimin, S. (2007). A new high throughput VLSI 
architecture for H.264 transform and quantization. 7th International Conference on 
ASIC (ASICON ’07), pp.950-953, (October 2007), doi: 10.1109/ICASIC.2007.4415789, 
ISBN: 978-1-4244-1132-0 
Do, T.T.T. & Le, T.M. (2010). High throughput area-efficient SoC-based forward/inverse 
integer transforms for H.264/AVC. IEEE International Symposium on Circuits and 
Systems (ISCAS), pp. 4113–4116, doi: 10.1109/ISCAS.2010.5537614, ISBN: 978-1-
4244-5308-5 
Fan, C.P. (2006). Cost-effective hardware sharing architectures of fast 8×8 and 4×4 integer 
transforms for H.264/AVC. IEEE Asia Pacific Conference on Circuits and Systems 
(APCCAS), pp. 776–779, (December 2006), doi: 10.1109/APCCAS.2006.342136, 
ISBN: 1-4244-0387-1 
Finchelstein, D.F.; Sze, V. & Chandrakasan, A.P. (2009). Multicore Processing and Efficient 
On-Chip Caching for H.264 and Future Video Decoders. IEEE Transactions on 
Circuits and Systems for Video Technology, Vol.19, No. 11, pp. 1704-1713, doi: 
10.1109/TCSVT.2009.2031459, ISSN: 1051-8215 
Gao, Y. & Wu, Y. (2006). Applications and requirements for color bit depth scalability. Joint 
Video Team, Doc. JVT-U049.doc. Available from 
 http://wftp3.itu.int/av-arch/jvt-site/2006_10_Hangzhou/JVT-U049.zip 
Gao, Y.; Wu, Y. & Chen, Y. (2009). H.264/Advanced Video Coding (AVC) backward-
compatible bit-depth scalable coding. IEEE Transactions on Circuits and Systems for 
Video Technology, Vol.19, No.4, (April 2009), pp. 500-510, doi: 
10.1109/TCSVT.2009.2014018, ISSN: 1051-8215 
Gish, W. (2002). 10-bit and 12-bit sample depth. Joint Video Team, Doc. JVT-E048r2.doc. 
Available from 
 http://wftp3.itu.int/av-arch/jvt-site/2002_10_Geneva/JVT-E048r2.doc 
Gish, W. (2003). Extended sample depth: Implementation and characterization. Joint Video 
Team, Doc. JVT-H016.doc. Available from 
 http://wftp3.itu.int/av-arch/jvt-site/2003_05_Geneva/JVT-H016.doc 
Gordon, S.; Marpe, D. & Wiegand, T. (2004). Simplified use of 8×8 transforms. Joint Video 
Team, Doc. JVT-K028.doc. Available from 
 http://wftp3.itu.int/av-arch/jvt-site/2004_03_Munich/JVT-K028.doc 
JVT Joint Video Team of ITU-T and ISO/IEC (2004). Draft text of H.264/AVC fidelity range 
extensions amendment. Joint Video Team, Doc. JVT-L047d9wcm.doc. Available from 
 http://wftp3.itu.int/av-arch/jvt-site/2004_07_Redmond/JVT-L047d9wcm.zip 
Huang, C.Y.; Chen, L.F. & Lai, Y.K. (2008). A high-speed 2-D transform architecture with 
unique kernel for multi-standard video applications. IEEE International Symposium 
on Circuits and Systems, pp. 21-24, (May 2008), doi: 10.1109/ISCAS.2008.4541344, 
ISBN: 978-2-84813-1 
Husemann, R.; Majolo, M.; Guimaraes, V.; Susin, A.; Roesler, V. & Lima, J.V. (2010). 
Hardware integrated quantization solution for improvement of computational 
H.264 encoder module. IEEE/IFIP VLSI System on Chip Conference (VLSI-SoC), pp. 
316-321, doi: 10.1109/VLSISOC.2010.5642680, ISBN: 978-1-4244-6469-2 
Hwangbo, W. & Kyung, C.M. (2010). A multitransform architecture for H.264/AVC high-
profile coders. IEEE Transactions on Multimedia, Vol.12, No.3, pp. 157-167, (April 
2010), doi: 10.1109/TMM.2010.2041099, ISSN: 1520-9210 
Ito, T.; Bandoh, Y.; Seishi, T. & Jozawa, H. (2010). A coding method for high bit-depth 
images based on optimized bit-depth transform. IEEE International Conference on 
Image Processing (ICIP), pp. 3141-3144, doi: 10.1109/ICIP.2010.5653459, ISBN: 978-1-
4244-7994-8 
Lee, J.J.; Park, S. & Eum, N.W. (2008). Design of application specific processor for H.264 
inverse transform and quantization. International SoC Design Conference (ISOCC '08), 
pp. II-57 - II-60, (November 2008), doi: 10.1109/SOCDC.2008.4815683, ISBN: 978-1-
4244-2598-3 
Lavier, P. (2009). Using 10-bit AVC/H.264 encoding with 4:2:2 for broadcast contribution. 
Ateme company. Confidential report. Available from 
 http://extranet.ateme.com/download.php?file=1114 
Lee, S. & Cho, K. (2008). Design of high-performance transform and quantization circuit for 
unified video CODEC. IEEE Asia Pacific Conference on Circuits and Systems, pp. 1450-
1453, (November 2008), doi: 10.1109/APCCAS.2008.4746304, ISBN: 0230019544 
Lee, Y.; Hong, K. & Kim, S. (2010). An adaptive image bit-depth scaling method for image 
displays. IEEE Transactions on Consumer Electronics, Vol.56, No.1, (March 2010), pp. 
141-146, doi: 10.1109/ICCE.2010.5418895, ISSN: 0098-3063 
Li, Y.; He, Y. & Mei, S. (2008). A highly parallel joint VLSI architecture for transforms in 
H.264/AVC. Journal of Signal Processing Systems, Vol.50, No.1, (January 2008), pp. 
19–32, doi: 10.1007/s11265-007-0111-4, ISSN: 1939-8115 
Lin, Y.K.; Li, D.W.; Lin, C.C.; Kuo, T.Y.; Wu, S.J.; Tai, W.C.; Chang, W.C. & Chang, T.S. 
(2008). A 242mW 10mm2 1080p H.264/AVC High-Profile Encoder Chip. IEEE 
International Solid-State Circuits Conference (ISSCC 2008), pp. 314-316, doi: 
10.1109/ISSCC.2008.4523183, ISBN: 978-1-4244-2010-0 
(Links, 2010). Interesting webpage including links to further resources on H.264 and video 
compression. Available from http://www.vcodex.com/links.html 
Liu, Z.; Song, Y.; Shao, M.; Li, S.; Li, L.; Ishiwata, S.; Nakagawa, M.; Goto, S. & Ikenaga, T. 
(2009). HDTV 1080p H.264/AVC encoder chip design and performance analysis. 
IEEE Journal of Solid-State Circuits, Vol.44, No.2, pp. 594-608, (February 2009), doi: 
10.1109/JSSC.2008.2010797, ISSN: 0018-9200 
Ma, Y.; Song, Y.; Ikenaga, T. & Goto, S. (2007). A high throughput multiple transform 
architecture for H.264/AVC fidelity range extensions. Journal of Semiconductor 
Technology and Science, Vol.7, No.4, pp. 247-253, (December 2007), ISSN: 1598-1657 
Malvar, H.S.; Hallapuro, A.; Karczewicz, M. & Kerofsky, L. (2003). Low-complexity 
transform and quantization in H.264/AVC. IEEE Transactions on Circuits and 
Systems for Video Technology, Vol.13, No.7, (July 2003), pp. 598-603, doi: 
10.1109/TCSVT.2003.814964, ISSN: 1051-8215 
Marpe, D.; Wiegand, T. & Gordon, S. (2005). H.264/MPEG4-AVC fidelity range extensions: 
Tools, profiles, performance, and application areas. IEEE Int. Conf. Image Processing, 
pp. 593-596, (Sept. 2005), doi: 10.1109/ICIP.2005.1529820, ISBN: 0-7803-9134-9 
Michael, M.N. & Hsu, K.W. (2008). A low-power design of quantization for H.264 video 
coding standard. IEEE International SOC Conference, pp. 201-204, (September 
2008), doi: 10.1109/SOCC.2008.4641511, ISBN: 978-1-4244-2596-9 
Michell, J.A.; Solana, J.M. & Ruiz, G.A. (2011). A high-throughput ASIC processor for 
8×8 transform coding in H.264/AVC. Signal Processing: Image Communication, (in 
press), doi: 10.1016/j.image.2011.01.001, ISSN: 0923-5965 
Ngo, N.T.; Do, T.T.T.; Le, T.M.; Kadam, Y.S. & Bermak, A. (2008). ASIP-controlled inverse 
integer transform for H.264/AVC compression. IEEE/IFIP International Symposium 
on Rapid System Prototyping, pp. 158-164, (June 2008), doi: 10.1109/RSP.2008.34, 
ISBN: 978-0-7695-3180-9 
Owaida, M.; Koziri, M.; Katsavounidis, I. & Stamoulis, G. (2009). A high performance and 
low power hardware architecture for the transform & quantization stages in H.264. 
IEEE International Conference on Multimedia and Expo (ICME 2009), pp. 1102-1105, 
doi: 10.1109/ICME.2009.5202691, ISBN: 978-1-4244-4291-1 
Park, J.S. & Ogunfunmi, T. (2009). A new hardware implementation of the H.264 8×8 
transform and quantization. IEEE International Conference on Acoustics, Speech and 
Signal Processing (ICASSP), pp. 585-588, doi: 10.1109/ICASSP.2009.4959651, ISBN: 
978-1-4244-2354-5, ISSN: 1520-6149 
Pastuszak, G. (2008). Transforms and quantization in the high-throughput H.264/AVC 
encoder based on advanced mode selection. IEEE Computer Society Annual 
Symposium on VLSI, pp. 203-208, (April 2008), doi: 10.1109/ISVLSI.2008.13, 
ISBN: 0-7695-2533-4 
Richardson, I.E.G. (2004). H.264 and MPEG-4 Video Compression. John Wiley & Sons (Ed), 
ISBN: 0-470-84837-5 
Ruiz, G.A. & Michell, J.A. (1998). Memory Efficient Programmable Processor Chip for 
Inverse Haar Transform. IEEE Transactions on Signal Processing, Vol.46, No.1, 
(January 1998), pp. 263-268, doi: 10.1109/78.651233, ISSN: 1053-587X 
Silva, T.L.; Diniz, C.M.; Vortmann, J.A.; Agostini, L.V.; Susin, A.A. & Bampi, S. (2007). A 
pipelined 8×8 2-D forward DCT hardware architecture for H.264/AVC high profile 
encoder. Proceedings of the 2nd Pacific Conference on Advances in Image and Video 
Technology, pp. 5-15, doi: 10.1007/978-3-540-77129-6_5, ISBN: 3-540-77128-X, 
978-3-540-77128-9 
Sims, S.R.F.; Mills, J.A. & Topiwala, P.N. (2005). Evaluation of video compression for 8-bit 
and 12-bit IR data with H.264 fidelity range extensions. Proc. SPIE the International 
Society for Optical Engineering, Vol.5807, pp. 329-340, doi: 10.1117/12.603853, ISBN: 
9780819457929 
Sühring, K. (2010). H.264/AVC Software Coordination. Fraunhofer Institute for 
Telecommunications, Heinrich Hertz Institute, Image Processing Research 
Department, Berlin, Germany. Available from http://iphome.hhi.de/suehring/tml 
Wahid, K.; Dimitrov, V. & Jullien, G. (2006). New Encoding of 8×8 DCT to make H.264 
lossless. IEEE Asia Pacific Conference on Circuits and Systems (APCCAS), pp. 780-783, 
doi: 10.1109/APCCAS.2006.342137, ISBN: 0470847549 
Wiegand, T.; Sullivan, G.J.; Bjontegaard, G. & Luthra, A. (2003). Overview of the 
H.264/AVC video coding standard. IEEE Transactions on Circuits and Systems for 
Video Technology, Vol.13, No.7, (July 2003), pp. 560-576, doi: 
10.1109/TCSVT.2003.815165, ISSN: 1051-8215 
Zhang, Q. & Meng, N. (2009). A low area pipelined 2-D DCT architecture for JPEG encoder. 
IEEE International Midwest Symposium on Circuits and Systems (MWSCAS), (August 
2009), pp. 747-750, doi: 10.1109/MWSCAS.2009.5235989, ISSN: 1548-3746 
Zhang, Y.; Jiang, G. & Yu, M. (2009). Low-complexity quantization for H.264/AVC. Journal 
of Real-Time Image Processing, Vol.4, No.1, pp. 3-12, doi: 10.1007/s11554-008-0098-5, 
ISSN: 1861-8200 
Recent Advances on Video Coding
Edited by Dr. Javier Del Ser Lorente
ISBN 978-953-307-181-7
Hard cover, 398 pages
Publisher InTech
Published online 24 June 2011
Published in print edition June, 2011
This book is intended for practitioners and researchers from industry and academia
interested in challenging paradigms of multimedia video coding, with an emphasis on recent technical
developments, cross-disciplinary tools and implementations. Given its instructional purpose, the book also
gives an overview of recently published video coding standards such as H.264/AVC and SVC from a simulation
standpoint. Novel rate control schemes and cross-disciplinary tools for the optimization of diverse aspects
related to video coding are also addressed in detail, along with implementation architectures specially tailored
for video processing and encoding. The book concludes by presenting new advances in semantic video coding.
In summary, this book serves as a technically sound starting point for early-stage researchers and developers
willing to join leading-edge research on video coding, processing and multimedia transmission.
How to reference
In order to correctly reference this scholarly work, feel free to copy and paste the following:
Gustavo A. Ruiz and Juan A. Michell (2011). Variable Bit-Depth Processor for 8×8 Transform and Quantization
Coding in H.264/AVC, Recent Advances on Video Coding, Dr. Javier Del Ser Lorente (Ed.),
ISBN: 978-953-307-181-7, InTech, Available from:
http://www.intechopen.com/books/recent-advances-on-video-coding/variable-bit-depth-processor-for-8-8-transform-and-quantization-coding-in-h-264-avc
© 2011 The Author(s). Licensee IntechOpen. This chapter is distributed
under the terms of the Creative Commons Attribution-NonCommercial-
ShareAlike-3.0 License, which permits use, distribution and reproduction for
non-commercial purposes, provided the original is properly cited and
derivative works building on this content are distributed under the same
license.
