An efficient hardware implementation for deblocking filter of AVS decoder  by You-wen, Huang
Procedia Environmental Sciences 11 (2011) 505 – 510
doi:10.1016/j.proenv.2011.12.080
Available online at www.sciencedirect.com
 
An efficient hardware implementation for deblocking filter of 
AVS decoder  
Huang You-wen 
College of Information and Engineering  
Jiangxi University of Science and Technology, Ganzhou 341000, China 
 jasonhuang1982@gmail.com 
 
 
Abstract 
This paper proposes an optimized hardware implementation of the deblocking filter for an AVS decoder, 
following the algorithm defined in the AVS audio video coding standard. To reduce the SDRAM bandwidth 
requirement, internal RAM caches the reference data used by the deblocking filter. The storage structure of 
the reference data is arranged so that writing data to SDRAM and filtering a macroblock can proceed at the 
same time, which accelerates the filtering process. The design is implemented on an Altera Stratix II FPGA. 
Synthesis and simulation results indicate that the hardware cost is low and that the design meets the 
demands of real-time video decoding. 
 
 
Keywords: AVS; Hardware Implementation; Deblocking Filter; Video Decoding 
1.Introduction 
AVS is China's second-generation audio and video coding standard [1]. The AVS standard uses complex 
coding algorithms, and its coding efficiency is comparable to that of H.264 [2]. 
Most current video coding standards, including AVS, use a block-based coding framework and lossy 
compression. As a result, block distortion is introduced into the decoded pictures. To reduce block 
distortion, a deblocking filter is applied to each decoded macroblock.  
In a hardware design, both the SDRAM bandwidth and the time available for the deblocking filter are 
limited. This paper proposes an optimized hardware implementation of the deblocking filter for an AVS 
decoder. Internal RAM caches the reference data to reduce the SDRAM bandwidth requirement. In 
addition, a storage structure for the reference data is proposed that lets the output operation and the 
filtering operation run at the same time, which significantly accelerates the filtering process.  
1878-0296 © 2011 Published by Elsevier Ltd.
 Selection and/or peer-review under responsibility of the Intelligent Information Technology Application  Research Association.
Open access under CC BY-NC-ND license.
 
 
2.Deblocking filter algorithm overview 
The vertical and horizontal edges of a block that the deblocking filter processes are shown in Fig. 1. 
p2, p1 and p0 are three pixels of one block, and q0, q1 and q2 are three pixels of the neighboring block. 
The six pixels lie in the same row or in the same column of the picture. 
[Figure 1: pixels p2, p1, p0 | q0, q1, q2 arranged in a row across a vertical edge and in a column across a horizontal edge]
Figure 1. Horizontal and vertical edge 
In the AVS standard, several conditions determine whether an 8×8 block edge is filtered. The standard 
defines three levels of deblocking strength, and the algorithm applied to an edge depends on its level. 
The algorithms for luma blocks are as follows: 
• Level 0: The values of the six pixels of the neighboring blocks shown in Fig. 1 do not change. 
The edge will not be filtered. 
• Level 1: Define ap = Abs(p2 - p0), aq = Abs(q2 - q0).  
If ap < β && Abs(p0 - q0) < ((α >> 2) + 2), change the values of p0 and p1 according to (1) and (2). 
(α and β are thresholds of the block boundary.) 
P0 = (p1 + 2*p0 + q0 + 2) >> 2                    (1) 
P1 = (2*p1 + p0 + q0 + 2) >> 2                    (2) 
P0 and P1 are the new filtered values. 
If aq < β && Abs(p0 - q0) < ((α >> 2) + 2), change the values of q0 and q1 according to (3) and (4).  
Q0 = (q1 + 2*q0 + p0 + 2) >> 2                    (3) 
Q1 = (2*q1 + q0 + p0 + 2) >> 2                    (4) 
Q0 and Q1 are the new filtered values. 
If ap >= β || Abs(p0 - q0) >= ((α >> 2) + 2), keep the value of p1 and change the value of p0 
according to (5). 
P0 = (2*p1 + p0 + q0 + 2) >> 2                    (5) 
If aq >= β || Abs(p0 - q0) >= ((α >> 2) + 2), keep the value of q1 and change the value of q0 
according to (6). 
Q0 = (2*q1 + q0 + p0 + 2) >> 2                    (6) 
• Level 2: Change the values of p0 and q0 according to (7) and (8). 
P0 = Clip1(p0 + delta)                            (7) 
Q0 = Clip1(q0 - delta)                            (8) 
delta is defined by (9). 
delta = Clip3(-C, C, (((q0 - p0)*3 + (p1 - q1) + 4) >> 3))                  (9) 
If ap < β, change the values of p1 and q1 according to (10) and (11). 
P1 = Clip1(p1 + Clip3(-C, C, (((P0 - p1)*3 + (p2 - Q0) + 4) >> 3)))         (10) 
Q1 = Clip1(q1 + Clip3(-C, C, (((q1 - Q0)*3 + (P0 - q2) + 4) >> 3)))         (11) 
The algorithm for chroma blocks is similar to that for luma blocks, except that p1 and q1 are not 
changed. 
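The luma rules above can be sketched in software as follows. This is a minimal illustration, not the hardware design of this paper: the function names are hypothetical, Clip1 is assumed to clamp to the 8-bit range [0, 255], Clip3(a, b, x) is assumed to clamp x to [a, b], and the clipping bound C is simply passed in as a parameter. Equations (10) and (11) are transcribed exactly as printed above.

```python
def clip3(lo, hi, x):
    """Clamp x to the inclusive range [lo, hi] (assumed meaning of Clip3)."""
    return max(lo, min(hi, x))


def clip1(x):
    """Clamp x to [0, 255] (assumed meaning of Clip1 for 8-bit samples)."""
    return clip3(0, 255, x)


def filter_luma_line(level, p, q, alpha, beta, c):
    """Filter one line of six samples across an edge.

    p = (p0, p1, p2) and q = (q0, q1, q2), nearest sample to the edge first.
    Returns ((P0, P1), (Q0, Q1)); p2 and q2 are never modified.
    """
    p0, p1, p2 = p
    q0, q1, q2 = q
    if level == 0:
        return (p0, p1), (q0, q1)          # edge is not filtered

    ap, aq = abs(p2 - p0), abs(q2 - q0)

    if level == 1:
        small_step = abs(p0 - q0) < ((alpha >> 2) + 2)
        if ap < beta and small_step:
            new_p0 = (p1 + 2 * p0 + q0 + 2) >> 2      # (1)
            new_p1 = (2 * p1 + p0 + q0 + 2) >> 2      # (2)
        else:
            new_p0 = (2 * p1 + p0 + q0 + 2) >> 2      # (5)
            new_p1 = p1
        if aq < beta and small_step:
            new_q0 = (q1 + 2 * q0 + p0 + 2) >> 2      # (3)
            new_q1 = (2 * q1 + q0 + p0 + 2) >> 2      # (4)
        else:
            new_q0 = (2 * q1 + q0 + p0 + 2) >> 2      # (6)
            new_q1 = q1
        return (new_p0, new_p1), (new_q0, new_q1)

    # Level 2
    delta = clip3(-c, c, ((q0 - p0) * 3 + (p1 - q1) + 4) >> 3)   # (9)
    new_p0 = clip1(p0 + delta)                                   # (7)
    new_q0 = clip1(q0 - delta)                                   # (8)
    new_p1, new_q1 = p1, q1
    if ap < beta:
        new_p1 = clip1(p1 + clip3(-c, c, ((new_p0 - p1) * 3 + (p2 - new_q0) + 4) >> 3))  # (10)
        new_q1 = clip1(q1 + clip3(-c, c, ((q1 - new_q0) * 3 + (new_p0 - q2) + 4) >> 3))  # (11)
    return (new_p0, new_p1), (new_q0, new_q1)


if __name__ == "__main__":
    # A small step of 4 grey levels across the edge, with example thresholds.
    print(filter_luma_line(1, (100, 100, 100), (104, 104, 104), alpha=20, beta=4, c=2))
    print(filter_luma_line(2, (100, 100, 100), (104, 104, 104), alpha=20, beta=4, c=2))
```

For a chroma line, the same sketch would apply with the p1 and q1 outputs discarded, since the text states that p1 and q1 do not change for chroma.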
3.Proposed hardware architecture 
The deblocking filter greatly improves image quality but also increases the complexity of the decoder. 
Many hardware implementations have been proposed for the H.264 standard [3,4], but because the 
algorithms differ, they cannot be used directly for AVS. 
The deblocking filter needs not only the reconstructed data of the macroblock currently being decoded, 
but also the data of the top and left neighboring macroblocks. Depending on how this data is accessed, 
there are two hardware design schemes.  
The first option, shown in Fig. 2, stores the reconstructed data in external SDRAM. The deblocking filter 
fetches data directly from the SDRAM, processes it, and stores the filtered data back to the SDRAM. 
This approach offers convenient data access and simple address decoding, but it requires frequent reads 
and writes on the SDRAM. For SD and HD video streams, the external SDRAM can hardly supply the 
required bandwidth. 
The second option, shown in Fig. 3, caches the reconstructed data in on-chip memory [5]. 
The filter completes the filtering operation using the data of the upper and left neighboring macroblocks 
and the reconstructed data of the current macroblock, all stored in on-chip RAM.  
The second approach requires additional on-chip cache RAM and therefore more hardware resources than 
the first, but it significantly reduces the number of SDRAM accesses. Because SDRAM bandwidth is tight 
when handling SD or HD video streams, this paper adopts the second option for the hardware 
implementation. 
       
Figure 2. Data storing scheme 1 
Figure 3. Data storing scheme 2 
Let N be the current macroblock to be filtered. The positions of the data used for filtering are shown in 
Fig. 4. 
[Figure 4: luma, chroma Cb and chroma Cr panels showing blocks m0~m5 of macroblock M, blocks n0~n5 of macroblock N, the top-neighbor lines 2L~5L (labeled 2Lm~5Lm and 2Ln~5Ln), and the filtering edges e0~e11]
Figure 4. Data position for filtering 
n0~n3 are the four luma blocks of a macroblock, n4 is the chroma Cb block, and n5 is the chroma Cr block. 
Macroblock M is the left-adjacent macroblock, and m0~m5 are its sub-blocks. 2L and 3L hold the last three 
lines of luma data from sub-blocks 2 and 3 of the top-adjacent macroblock. 4L and 5L hold the last three 
lines of chroma data from sub-blocks 4 and 5 of the top-adjacent macroblock. 
Eight edges between luma blocks need to be filtered: e0~e3 are the four vertical edges, and e4~e7 are 
the four horizontal edges. Each chroma block also has two edges that need to be filtered: e8 and e9 
are the vertical edges, and e10 and e11 are the horizontal edges. (Because the filtering strength can 
differ, every chroma boundary is divided into two parts of four pixels each.) 
To accelerate the filtering process, the data described above is stored in RAM1, an on-chip RAM. 
The data storage structure proposed in this paper is shown in Fig. 5. 
 
Figure 5. Data storage for RAM1 
Let N be the current macroblock to be filtered. Filtering N uses blocks 2L, 3L, 4L and 5L from the 
macroblock above, so once the upper macroblock has been filtered, the corresponding data must be 
cached in on-chip RAM. m1, m3, m4 and m5 are blocks from the left macroblock. These four blocks are 
used in filtering, so they also need to be cached. To make SDRAM writes more efficient, m0 and m2 
from the left macroblock are cached as well. In addition, the values of n1 and n3 produced by filtering 
will change again while the next macroblock is processed, so they too must be cached in the RAM. 
The on-chip RAM is 64 bits wide, so each storage unit holds the data of 8 pixels. Because 2L, 3L, 4L 
and 5L each contain three lines of data, 12 units are required for each macroblock. For standard-definition 
video with a resolution of 720 × 576, every macroblock row consists of 45 macroblocks, so a total of 540 
storage units are needed. In addition, 64 units are needed to cache n1, n3, m1 and m3. To save the time 
required for data movement, this paper uses a ping-pong structure to store m1 and n1, and m3 and n3. 
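The RAM1 sizing arithmetic above can be reproduced with a short calculation. This is only a sketch of the counting argument: the function and constant names are hypothetical, and the 64 units for n1, n3, m1 and m3 are taken from the text as a given figure rather than derived.

```python
UNIT_BITS = 64            # width of one RAM1 storage unit
SAMPLE_BITS = 8           # one luma or chroma sample
BLOCK_WIDTH = 8           # samples per line of an 8x8 block
MB_WIDTH = 16             # luma samples per macroblock line
TOP_BLOCKS = 4            # 2L, 3L, 4L and 5L
LINES_KEPT = 3            # last three lines of each top-neighbor block


def top_units_per_macroblock():
    """Storage units needed for the top-neighbor lines of one macroblock."""
    samples_per_unit = UNIT_BITS // SAMPLE_BITS          # 8 pixels per unit
    units_per_line = BLOCK_WIDTH // samples_per_unit     # 1 unit per block line
    return TOP_BLOCKS * LINES_KEPT * units_per_line


def ram1_units(picture_width, side_units=64):
    """Total RAM1 units: top-neighbor lines for one macroblock row plus the
    units stated in the text for n1, n3, m1 and m3 (side_units)."""
    macroblocks_per_row = picture_width // MB_WIDTH
    return macroblocks_per_row * top_units_per_macroblock() + side_units


if __name__ == "__main__":
    units = ram1_units(720)
    print(top_units_per_macroblock(), units, units * UNIT_BITS)
```

For a 720-pixel-wide picture this gives 12 units per macroblock, 540 + 64 = 604 units in total, or 38656 bits.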
RAM2 caches the intermediate results of the macroblock currently being processed; its storage 
structure is shown in Fig. 6. RAM2 is a dual-port RAM with a 64-bit data bus, so read and write 
operations can be performed simultaneously for faster data processing. When processing luma blocks, 
RAM2 stores the intermediate results for n0 and n2; when processing chroma blocks, it stores the 
intermediate results for n4 and n5. 
 
Figure 6. Data storage for RAM2 
The hardware architecture of the AVS deblocking filter designed in this paper is shown in Fig. 7.  
P_transpose and Q_transpose are two modules used for data transposition. Each module consists of 3*8 
units, and each unit is an 8-bit register that stores the luma or chroma data of one sample. For vertical 
edge filtering, the data of 8 pixels moves from top to bottom into the transpose module, and three of them 
take part in the linear filtering. For horizontal edge filtering, three lines of 8 samples first move from top 
to bottom, and the movement then switches to the horizontal direction. Each time, three samples leave the 
transpose module to take part in the linear filtering, and the results are moved back into the transpose 
module circularly. After the 8 filtering operations complete, the data movement switches back to the 
vertical direction, and the results will 
be moved from top to bottom and written to the on-chip RAM.  
 
Figure 7.  Hardware Architecture 
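The role of the transpose modules can be illustrated with a small software model of which samples reach the filter in each direction. This sketch only models the selection of samples, not the register shifting or timing of the hardware; the function names are hypothetical, and it assumes each 64-bit word holds one 8-pixel line with index 0 as the leftmost pixel.

```python
def vertical_edge_samples(p_line, q_line):
    """One 8-pixel line from the block left of the edge (p_line) and one from
    the block right of it (q_line).

    Returns ((p2, p1, p0), (q0, q1, q2)): the three samples on each side that
    are closest to the vertical edge.
    """
    assert len(p_line) == 8 and len(q_line) == 8
    return tuple(p_line[5:8]), tuple(q_line[0:3])


def horizontal_edge_samples(p_lines, q_lines):
    """Three 8-pixel lines from each side of a horizontal edge.

    p_lines: last three lines of the upper block, top to bottom (p2, p1, p0).
    q_lines: first three lines of the lower block, top to bottom (q0, q1, q2).
    Yields ((p2, p1, p0), (q0, q1, q2)) for each of the 8 columns, which is
    the column-wise read-out that a transposition provides.
    """
    assert len(p_lines) == 3 and len(q_lines) == 3
    for col in range(8):
        p = tuple(line[col] for line in p_lines)
        q = tuple(line[col] for line in q_lines)
        yield p, q


if __name__ == "__main__":
    upper = [[row * 10 + col for col in range(8)] for row in range(3)]
    lower = [[100 + row * 10 + col for col in range(8)] for row in range(3)]
    for p, q in horizontal_edge_samples(upper, lower):
        print(p, q)
```

In the vertical case the three samples per side are already adjacent within a stored line, while in the horizontal case they come from three different lines, which is why only three lines of eight samples need to be held per side.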
During the filtering process, blocks m1, m3, n0, n2, n1, 2Ln and 3Ln are the data sources of P_transpose, 
and blocks n0, n2, n1 and n3 are the data sources of Q_transpose. The P_transpose module therefore has 
two input interfaces, for RAM1 and RAM2, and the Q_transpose module has three input interfaces, for 
RAM1, RAM2 and the rebuilding RAM. mux1 and mux2 in Fig. 7 perform this data selection.  
After the filtering operation completes, mux5 and mux6 direct the output data to either RAM1 or RAM2. 
In the data updating period, the data in RAM2 must be transferred to RAM1. To reduce the number of 
data selectors, this data first enters the transpose module through mux1 and is then output to RAM1. 
The cost of this approach is three extra clock cycles for the write operation.  
4.Simulation and synthesis results 
The hardware implementation of the AVS deblocking filter was designed and simulated in ModelSim. 
The output generated from the test vectors was compared with that of the reference software rm52j, and 
the results showed that the design implements the algorithm described in the AVS standard.  
The design was synthesized and implemented on an EP2S60F672C5ES device, an Altera FPGA, with the 
clock frequency limited to 100 MHz. The hardware resources used by the design are shown in Table I. 
TABLE I. SYNTHESIS RESULT 
ALUT      Registers      Memory 
2057      959            39680 bits 
5.Conclusions 
This paper uses on-chip RAM to cache the intermediate data of the AVS deblocking filter, which reduces 
the SDRAM bandwidth requirement. Zhong-Hua Huang's design [6] uses 8*8*8 bits of registers to 
transpose a block of data, and every 8*8 block in a macroblock is transposed twice. The design in this 
paper uses two register arrays of 3*8*8 bits, and only the data involved in filtering enters the transpose 
arrays. Moreover, the filtering of two blocks can be done in parallel. As a result, the design both saves 
registers and speeds up processing. The design in [6] also needs post-processing for three different data 
regions to output the filtered data. With the data storage scheme described above, the design in this paper 
performs the filtering operation and the data output operation concurrently. It completes the filtering of a 
macroblock in 236 clock cycles, which meets the requirement of real-time decoding of high-resolution 
video streams. 
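The real-time claim can be checked with a rough cycle budget. This is a back-of-the-envelope sketch under stated assumptions: the frame rates (25 and 30 frames per second) and the 1920 × 1080 picture size are illustrative values not given in this paper, the function name is hypothetical, and only the 236 filtering cycles per macroblock and the 100 MHz clock come from the text.

```python
CLOCK_HZ = 100_000_000        # clock frequency reported in Section 4
CYCLES_PER_MACROBLOCK = 236   # filtering cost reported in the conclusions


def filter_cycles_per_second(width, height, fps):
    """Clock cycles per second spent on deblocking for a given video format.

    Picture dimensions are rounded up to whole 16x16 macroblocks.
    """
    macroblocks = ((width + 15) // 16) * ((height + 15) // 16)
    return macroblocks * fps * CYCLES_PER_MACROBLOCK


if __name__ == "__main__":
    # Frame rates below are assumptions chosen for illustration.
    for name, (w, h, fps) in {"SD": (720, 576, 25), "HD": (1920, 1080, 30)}.items():
        cycles = filter_cycles_per_second(w, h, fps)
        print(name, cycles, f"{cycles / CLOCK_HZ:.1%} of the clock budget")
```

Under these assumptions the filter needs 9,558,000 cycles per second for SD and 57,772,800 for HD, both below the 100,000,000 cycles available per second.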
References 
[1] GB/T 20090.2-2006, "Information technology, advanced voice and video coding, part II: video." 
[2] Rao K.R., Do Nyeon Kim, "Current video coding standards: H.264/AVC, Dirac, AVS China and VC-1," IEEE 
Southeastern Symposium on System Theory, 2010, pp. 1-8. 
[3] Loukil H., Ben Atitallah A., Masmoudi N., "Hardware architecture for H.264/AVC deblocking filter algorithm," 
Proceedings of the International Multi-Conference on Systems, Signals and Devices, 2009, pp. 1-6. 
[4] Vijay S., Chakrabarti C., Karam L.J., "Parallel deblocking filter for H.264 AVC/SVC," IEEE Workshop on Signal 
Processing Systems, 2010, pp. 116-121. 
[5] Qingming Yi, Chao Zhang, "The adaptive loop filter design based on AVS standard," International Colloquium on 
Computing, Communication, Control, and Management, 2009, pp. 409-413. 
[6] Huang Zhonghua, Zhi Cheng, "Design and Implementation of AVS Loop Filter Based on FPGA," Computer Engineering, 
Vol. 33, No. 6, 2007, pp. 1581-1583. 
