An Efficient Hardware Architecture for Full-Search Variable Block Size Motion Estimation in H.264/AVC by Seung-man Pyen et al.
An Efficient Hardware Architecture for Full-Search 
Variable Block Size Motion Estimation in H.264/AVC 
Seung-Man Pyen1, Kyeong-Yuk Min1, Jong-Wha Chong1, Member, IEEE       
Alberto L. Sangiovanni-Vincentelli2, Fellow, IEEE 
 
1Dept. of Electronic Engineering, Hanyang University, Seoul, Korea 
2Dept. of Electrical Engineering and Computer Sciences   
College of Engineering UC Berkeley, CA. USA 
E-mail: himani815@hotmail.com 
Abstract. In this paper, we propose a high-speed hardware architecture for the 
implementation of full-search variable block size motion estimation (VBSME) 
suitable for high-quality video compression. In high-quality video with a large 
frame size and search range, memory bandwidth is mainly responsible for 
throughput limitations and power consumption in VBSME. The proposed 
architecture reduces the memory bandwidth by adopting a "meander"-like scan, 
which gives highly overlapped data in the search area, and by using on-chip 
memory to reuse the overlapped data. About 94% of the previous candidate 
block is reused for the current one, and about 23% of the memory access cycles 
are saved in a search range of [-16, +15]. The architecture has been prototyped 
in Verilog HDL, simulated with ModelSim and synthesized with Synopsys Design 
Compiler using the Samsung 0.18um standard cell library. Simulation results 
show that, at a clock frequency of 51MHz, the architecture can process 
720x576 pictures at 30fps in real time with a search range of [-16, +15]. 
1   Introduction 
A video sequence usually contains a significant amount of temporal redundancy. The 
block matching algorithm (BMA), based on motion estimation and compensation, 
is widely used in many video coding standards such as H.26x, MPEG-1, -2, and -4 to 
remove temporal redundancy. The fixed block size block matching algorithm 
(FBSBMA) divides the current frame into macroblocks and searches for 
the best matched block within a search range in a reference frame. With FBSBMA, 
if a macroblock contains two objects moving in different directions, the coding 
performance for that macroblock is worse than when both objects move in the same 
direction. To compensate for this weakness of FBSBMA, the variable block size block 
matching algorithm (VBSBMA) is adopted in the advanced video coding standards. 
In H.264/advanced video coding (AVC), VBSBMA partitions a 
macroblock into 7 kinds of blocks, namely 4x4, 4x8, 8x4, 8x8, 8x16, 16x8 and 
16x16, as shown in Fig. 1. 
 
 
Fig. 1. The various block sizes in H.264/AVC 
In this way, the coding performance is improved, because the BMA can 
segment a macroblock into 7 different block sizes. Although VBSBMA 
achieves higher coding performance than FBSBMA, it requires a high 
computational effort, since 41 motion vectors of 7 different sizes must be computed 
for each macroblock. Therefore, many efficient hardware architectures such as 
the systolic array [8], the 1-D processing element (PE) array [6] and the 2-D PE 
array [4] [7] [10] have been proposed for implementing VBSBMA. The 1-D PE array 
is a simple structure that is easier to control and needs fewer gates than a 2-D 
PE array, but it normally computes the sum of absolute differences (SAD) for only 
one row or column of the macroblock at a time. The 2-D PE array, on the other 
hand, is a more complex structure that is more difficult to control and needs more 
gates than a 1-D PE array, but it can evaluate a complete search point at a time. 
The 2-D PE array is therefore a suitable structure for high-quality video. In 
high-quality video with a large frame size and search range, memory bandwidth is 
a major bottleneck in the motion estimation architecture. To reduce memory 
bandwidth, the motion estimation architecture has to access each pixel just once 
and reuse the overlapped data. The proposed architecture reduces the memory 
bandwidth by adopting a "meander"-like scan, so that each pixel is accessed just 
once, and by using on-chip memory to reuse the overlapped data. It performs 
full-search VBSBMA and produces all 41 motion vectors of the macroblock. 
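
The figure of 41 motion vectors can be checked by counting how many blocks of each of the seven sizes tile one 16x16 macroblock. The following is only a worked count, written with sizes as width x height (an assumed convention that does not affect the total):

```latex
\[
\underbrace{1}_{16\times16}
+ \underbrace{2}_{16\times8}
+ \underbrace{2}_{8\times16}
+ \underbrace{4}_{8\times8}
+ \underbrace{8}_{8\times4}
+ \underbrace{8}_{4\times8}
+ \underbrace{16}_{4\times4}
= 41
\]
```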
The rest of this paper is organized as follows. Section 2 describes the proposed 
architecture in detail. The experimental results and the conclusion are given in 
Sections 3 and 4, respectively. 
2   The Proposed Architecture 
Fig. 2 shows the block diagram of the proposed architecture, which consists of an 
on-chip memory, an ME control unit, a 16x16 PE array, a SAD adder-tree and a 
comparator unit. 
 
Fig. 2. Schematic overview of the proposed architecture 
To reuse the overlapped data, the pixel data of the search area are stored in the 
search area static random access memory (SRAM), and the pixel data of the current 
macroblock are stored in the current block SRAM. The ME control unit 
generates the address signals for the search area SRAM and the current block 
SRAM, and the control signals for the other blocks. The 16x16 PE array 
computes the sixteen 4x4 SADs of the 16x16 candidate block. The adder-tree 
computes the SADs of all 41 subblocks from the SADs of the 4x4 
subblocks. The comparator unit finds the best matched motion vector (MV), that 
is, the one with the minimum SAD. 
2.1   The Data scheduling for Data-Reuse 
In the full-search BMA, each current macroblock is compared with all the 
candidate blocks in the search area in order to find the best matched block, so the 
full-search BMA requires a high memory bandwidth. In order to reduce the memory 
bandwidth, we adopt the "meander"-like scan (Fig. 3) of the search area, which 
yields highly overlapped data between consecutive candidate blocks. 
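
As a software reference for what the full-search BMA computes, the following minimal sketch evaluates the SAD at every search point for a single 16x16 block. It is not the proposed hardware: the function names `sad` and `full_search`, the list-of-rows image layout, and the tie-breaking rule are assumptions made for illustration.

```python
def sad(cur, ref, ox, oy, n=16):
    """Sum of absolute differences between the n x n block `cur` and the
    n x n window of `ref` whose top-left corner is at column ox, row oy."""
    return sum(abs(cur[y][x] - ref[oy + y][ox + x])
               for y in range(n) for x in range(n))


def full_search(cur, ref, lo=-16, hi=15, n=16):
    """Return (best_sad, (dx, dy)) over every displacement in [lo, hi]^2.

    `ref` is the search area stored so that displacement (lo, lo) maps to
    index (0, 0); it needs at least n + (hi - lo) rows and columns.
    Ties keep the first candidate found (an assumed rule).
    """
    best = None
    for dy in range(lo, hi + 1):
        for dx in range(lo, hi + 1):
            s = sad(cur, ref, dx - lo, dy - lo, n)
            if best is None or s < best[0]:
                best = (s, (dx, dy))
    return best
```

Every one of the 32x32 candidate windows is read in full here, which is the memory traffic that the data reuse described in this section is meant to avoid.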
 
 
Fig. 3. The “meander”-like scan of the search area 
When the search point moves in the horizontal direction, consecutive candidate 
blocks overlap heavily: only 16x1 8-bit pixels of the current candidate block 
differ from the previous candidate block. In a raster-order scan, all 16x16 pixels 
of the candidate block change when the search point moves to the next line, so the 
16x16 PE array needs 15 more cycles than with the "meander"-like scan to load the 
pixel data of the candidate block. To avoid these 15 cycles, each PE would need a 
search data buffer to store the pixel data of the candidate block [14]. 
The proposed architecture adopts the "meander"-like scan of the search area, which 
allows simple control and an efficient memory size. The "meander"-like scan has 
three data-flow directions. In the odd lines of the search area, the data flow 
moves to the right. In the even lines of the search area, the data flow moves to 
the left. When the computations for a row of the search area are finished, the 
data flow moves down. 
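
The scan order described above can be sketched in software as follows. This is an illustrative model only: the names `meander_scan` and `reused_fraction` are hypothetical, and the first line of the search area is taken to be an odd line scanned to the right.

```python
def meander_scan(lo=-16, hi=15):
    """Yield search points (dx, dy) line by line, reversing the horizontal
    direction on every other line so consecutive points are always adjacent."""
    xs = list(range(lo, hi + 1))
    for i, dy in enumerate(range(lo, hi + 1)):
        row = xs if i % 2 == 0 else xs[::-1]  # first line runs left to right
        for dx in row:
            yield dx, dy


def reused_fraction(n=16):
    """Share of an n x n candidate block that is kept after a one-pixel move:
    one line of n pixels is new, the remaining n*n - n pixels are reused."""
    return (n * n - n) / (n * n)
```

For a 16x16 block, `reused_fraction()` gives 240/256 = 0.9375, which agrees with the reuse of about 94% stated in the abstract.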
2.2   The PE Architecture 
Fig.4 shows the architecture of the PE. 
 
Fig. 4. Processing element 
The PE consists of a current pixel register (CPR), a reference pixel register (RPR), 
a 3-input multiplexer, and a computing unit. The CPR stores the pixel data of the 
current macroblock, and the RPR stores the pixel data of the candidate block. The 
3-input multiplexer in the PE selects the direction of the data flow. The computing 
unit computes the absolute difference between the CPR and the RPR at every cycle. 
2.3   The Process Unit (PU) Architecture 
Fig.5 shows the architecture of the PU. 
 
Fig.5. Processing unit 
The PU consists of a 4x4 PE array and 5 adder-trees. The sixteen absolute 
differences output by the 4x4 PE array are added by the 5 adder-trees to obtain 
one 4x4 SAD. 
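
A behavioural sketch of one PU is given below. The function names are hypothetical, and the grouping of the additions (one partial sum per row of four PEs, followed by one final sum) is an assumption; the text only states that 5 adder-trees are used.

```python
def pe_abs_diff(cpr, rpr):
    """Computing unit of one PE: |current pixel - reference pixel|."""
    return abs(cpr - rpr)


def pu_sad_4x4(cur4x4, ref4x4):
    """One PU: sixteen PEs reduced to a single 4x4 SAD.

    Assumed grouping: four row sums (one per row of PEs) and one final sum.
    """
    row_sums = [sum(pe_abs_diff(c, r) for c, r in zip(cur_row, ref_row))
                for cur_row, ref_row in zip(cur4x4, ref4x4)]
    return sum(row_sums)
```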
 
2.4   The Whole Architecture 
The proposed architecture is illustrated in Fig.6. 
 
 
Fig. 6. The whole architecture 
The architecture is composed of 16 processing units (PUs), a SAD adder-tree and a 
comparator unit. Two SRAM modules are used to store the search area and the 
current macroblock. Each SRAM module is composed of 16 SRAMs in order to 
manipulate address generation. In the SAD adder-tree (Fig.7), the SADs of the sixteen 
4x4 subblocks are used to obtain the SADs of all 41 subblocks of 7 different sizes. 
 
Fig. 7. The SAD adder-tree 
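
The way the larger SADs follow from the sixteen 4x4 SADs can be sketched as below. This is a software illustration, not the netlist of Fig. 7: the function name `sad_adder_tree`, the width x height naming of block sizes, and the ordering of the results are assumptions.

```python
def sad_adder_tree(s):
    """Combine sixteen 4x4 SADs into the SADs of all 41 subblocks.

    s[r][c] is the SAD of the 4x4 subblock in row r, column c (0..3).
    Block sizes are named width x height (assumed convention).
    """
    # 8x4: two horizontally adjacent 4x4 blocks, index = 2*r + c//2
    e = [s[r][c] + s[r][c + 1] for r in range(4) for c in (0, 2)]
    # 4x8: two vertically adjacent 4x4 blocks
    f = [s[r][c] + s[r + 1][c] for r in (0, 2) for c in range(4)]
    # 8x8: two vertically adjacent 8x4 blocks, ordered TL, TR, BL, BR
    q = [e[2 * r + k] + e[2 * (r + 1) + k] for r in (0, 2) for k in (0, 1)]
    return {
        "4x4": [s[r][c] for r in range(4) for c in range(4)],
        "8x4": e,
        "4x8": f,
        "8x8": q,
        "8x16": [q[0] + q[2], q[1] + q[3]],   # left half, right half
        "16x8": [q[0] + q[1], q[2] + q[3]],   # top half, bottom half
        "16x16": [sum(q)],
    }
```

Each larger SAD is built from results already computed for smaller blocks, so no pixel difference is added more than once per block size.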
The comparator unit, which is composed of 41 comparing elements (Fig.8), finds the 
minimum distortions and the corresponding motion vectors. 
 
 
Fig. 8. The Comparing element 
Each comparing element consists of a comparator and two registers. Register R1 
stores the minimum SAD for comparison and register R2 stores the corresponding 
motion vector. 
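
The behaviour of one comparing element can be modelled as follows. The class name and the `update` method are hypothetical, and the tie-breaking rule (the earlier candidate is kept) is an assumption, since the text does not state one.

```python
class ComparingElement:
    """R1 holds the smallest SAD seen so far, R2 the motion vector that gave it."""

    def __init__(self):
        self.r1 = None  # minimum SAD; None until the first candidate arrives
        self.r2 = None  # motion vector belonging to that minimum

    def update(self, sad, mv):
        # Strict '<' keeps the earliest candidate on ties (assumed rule).
        if self.r1 is None or sad < self.r1:
            self.r1, self.r2 = sad, mv
```

In hardware, R1 would more likely be reset to the largest representable value than to an empty state; `None` is used here only to keep the sketch short. One such element would be instantiated for each of the 41 subblocks.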
Table.1 shows the data sequence in the PE array for the current macroblock and the 
search area. C(x,y) is the pixel data in the current macroblock and R(x,y) is the pixel 
data in the search area. 
 
Table 1. Data-flow Schedule 
CLK   1st column                              2nd column                              --  16th column
      PE0         PE1         --  PE15        PE16        PE17        --  PE31        --  PE240       PE241       --  PE255
0                                                                                     -   C(0,0)
                                                                                          R(-16,-16)
--    --          --          --  --          --          --          --  --          --  --          --          --  --
15    C(0,0)      C(0,1)      --  C(0,15)     C(1,0)      C(1,1)      --  C(1,15)     --  C(15,0)     C(15,1)     --  C(15,15)
      R(-16,-16)  R(-16,-15)  --  R(-16,-1)   R(-15,-16)  R(-15,-15)  --  R(-15,-1)   --  R(-1,-16)   R(-1,-15)   --  R(-1,-1)
16    C(0,0)      C(0,1)      --  C(0,15)     C(1,0)      C(1,1)      --  C(1,15)     --  C(15,0)     C(15,1)     --  C(15,15)
      R(-15,-16)  R(-15,-15)  --  R(-15,-1)   R(-14,-16)  R(-14,-15)  --  R(-14,-1)   --  R(0,-16)    R(0,-15)    --  R(0,-1)
17    C(0,0)      C(0,1)      --  C(0,15)     C(1,0)      C(1,1)      --  C(1,15)     --  C(15,0)     C(15,1)     --  C(15,15)
      R(-14,-16)  R(-14,-15)  --  R(-14,-1)   R(-13,-16)  R(-13,-15)  --  R(-13,-1)   --  R(1,-16)    R(1,-15)    --  R(1,-1)
--    --          --          --  --          --          --          --  --          --  --          --          --  --
46    C(0,0)      C(0,1)      --  C(0,15)     C(1,0)      C(1,1)      --  C(1,15)     --  C(15,0)     C(15,1)     --  C(15,15)
      R(15,-16)   R(15,-15)   --  R(15,-1)    R(16,-16)   R(16,-15)   --  R(16,-1)    --  R(30,-16)   R(30,-15)   --  R(30,-1)
47    C(0,0)      C(0,1)      --  C(0,15)     C(1,0)      C(1,1)      --  C(1,15)     --  C(15,0)     C(15,1)     --  C(15,15)
      R(15,-15)   R(15,-14)   --  R(15,0)     R(16,-15)   R(16,-14)   --  R(16,0)     --  R(30,-15)   R(30,-14)   --  R(30,0)
48    C(0,0)      C(0,1)      --  C(0,15)     C(1,0)      C(1,1)      --  C(1,15)     --  C(15,0)     C(15,1)     --  C(15,15)
      R(14,-15)   R(14,-14)   --  R(14,0)     R(15,-15)   R(15,-14)   --  R(15,0)     --  R(29,-15)   R(29,-14)   --  R(29,0)
--    --          --          --  --          --          --          --  --          --  --          --          --  --
78    C(0,0)      C(0,1)      --  C(0,15)     C(1,0)      C(1,1)      --  C(1,15)     --  C(15,0)     C(15,1)     --  C(15,15)
      R(-16,-15)  R(-16,-14)  --  R(-16,0)    R(-15,-15)  R(-15,-14)  --  R(-15,0)    --  R(-1,-15)   R(-1,-14)   --  R(-1,0)
79    C(0,0)      C(0,1)      --  C(0,15)     C(1,0)      C(1,1)      --  C(1,15)     --  C(15,0)     C(15,1)     --  C(15,15)
      R(-16,-14)  R(-16,-13)  --  R(-16,1)    R(-15,-14)  R(-15,-13)  --  R(-15,1)    --  R(-1,-14)   R(-1,-13)   --  R(-1,1)
80    C(0,0)      C(0,1)      --  C(0,15)     C(1,0)      C(1,1)      --  C(1,15)     --  C(15,0)     C(15,1)     --  C(15,15)
      R(-15,-14)  R(-15,-13)  --  R(-15,1)    R(-14,-14)  R(-14,-13)  --  R(-14,1)    --  R(0,-14)    R(0,-13)    --  R(0,1)
--    --          --          --  --          --          --          --  --          --  --          --          --  --
1039  C(0,0)      C(0,1)      --  C(0,15)     C(1,0)      C(1,1)      --  C(1,15)     --  C(15,0)     C(15,1)     --  C(15,15)
      R(-16,15)   R(-16,16)   --  R(-16,30)   R(-15,15)   R(-15,16)   --  R(-15,30)   --  R(-1,15)    R(-1,16)    --  R(-1,30)
 
At the beginning of VBSBMA, the pixel data of the current macroblock are read from 
the current block SRAM into the 16x16 PE array, 16x1 8-bit pixels per cycle, during 
the first 16 cycles. In the same cycles, and in every cycle thereafter, 16x1 8-bit 
pixels of the candidate block are read from the search area SRAM into the 16x16 PE 
array. Once the pixel data of the current macroblock and of the first candidate 
block have been placed during these 16 cycles, the sixteen 4x4 SADs of a candidate 
block are computed in every cycle. The calculation of the 41 SADs in the adder-tree 
needs 1 more cycle after the sixteen 4x4 SADs are computed. Therefore, for the 
search range of [-16, +15], the calculation of all motion vectors needs 
16+1+32x32 = 1041 clock cycles. 
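
The cycle count can be related to the real-time target with a short calculation. The sketch below assumes that macroblocks are processed back to back with no further overhead between them, which the text does not state; the function names are hypothetical.

```python
def cycles_per_macroblock(lo=-16, hi=15, load=16, tree_latency=1):
    """Cycles to obtain all motion vectors of one macroblock."""
    points = (hi - lo + 1) ** 2  # 32 x 32 = 1024 search points
    return load + tree_latency + points


def required_clock_hz(width=720, height=576, fps=30, mb=16):
    """Clock rate needed if every macroblock takes cycles_per_macroblock()
    cycles and nothing else costs time (assumption)."""
    macroblocks = (width // mb) * (height // mb)  # 45 x 36 = 1620 per frame
    return macroblocks * fps * cycles_per_macroblock()
```

Under these assumptions the requirement is 1620 x 30 x 1041 = 50,592,600 cycles per second, that is about 50.6MHz, which is consistent with the 51MHz clock frequency reported for 720x576 pictures at 30fps.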
3   The Simulation and Synthesis Result 
The architecture has been prototyped in Verilog HDL, simulated with ModelSim and 
synthesized with Synopsys Design Compiler using the Samsung 0.18um standard cell 
library. The total gate count is about 128k. The 16x16 PE array requires 90.7k 
gates, and the remaining gates are spent on the comparator unit that finds the 
minimum SAD, the adder-tree that obtains the SADs of all 41 subblocks, and the 
control unit. For the search range of [-32, +31], a total of 58kb of on-chip 
memory is required to store the search area and the current macroblock. Table 2 
shows the performance of the proposed architecture. 
 
Table 2. The Performance of the proposed architecture 
Algorithm        Full search
Number of PE     256
Searching range  32x32, 64x64
Gate count       128k
On-chip memory   58k bits
Process          Samsung 0.18um standard cell library
Block size       4x4, 4x8, 8x4, 8x8, 8x16, 16x8, 16x16
 
Table.3 shows the comparison between the proposed architecture and other full-
search VBSME architectures. 
Table 3. Comparison of four hardware architectures for VBSME 
Algorithm        [6]            [7]              [5]            Proposed
Number of PE     16             16x16            16x16          16x16
Searching range  32x32, 16x16   64x64            64x64, 32x32   64x64, 32x32
Gate count       108k           -                154k           128k
On-chip memory   -              96k bits         60k bits       58k bits
Process          0.13um         0.5um            0.28um         0.18um
Block size       7 block sizes  16x16, 8x8, 4x4  7 block sizes  7 block sizes
 
Table 3 shows that the proposed architecture requires fewer gates than the 16x16 
PE array architecture of [5] and less on-chip memory than those of [5] and [7]. 
4   Conclusion 
In this paper, we proposed a high-speed hardware architecture for VBSME suitable 
for high-quality video compression. The architecture consists of a 16x16 PE array, 
an adder-tree and a comparator unit, which find the 41 motion vectors and minimum 
SADs for the 7 different block sizes. The proposed architecture reduces the memory 
bandwidth by adopting a "meander"-like scan, which gives highly overlapped data in 
the search area, and by using on-chip memory to reuse the overlapped data. Compared 
with a raster-order scan of the search area, the "meander"-like scan saves about 
23% of the memory access cycles in a search range of [-16, +15]. The architecture 
allows real-time processing of 720x576 pictures at 30fps with a search range of 
[-16, +15] at a clock frequency of 51MHz. 
References 
1. De Vos, L., Schobinger, M.: VLSI Architecture for a Flexible Block Matching Processor, 
Vol. 5. NO. 5. IEEE Transactions on Circuits and Systems (1995) 
2. Jen-Chieh, Tuan., Tian-Sheuan, Chang., Chein-Wei, Jen.: On the Data Reuse and Memory 
Bandwidth Analysis for Full-Search Block-Matching VLSI Architecture, Vol. 12. IEEE 
Transactions on Circuits and Systems for Video Technology (2002) 
3. Komarek, T., Pirsch, P.: Array Architectures for Block Matching Algorithms, Vol. 36. IEEE 
Transactions on Circuits and Systems (1989) 
4. Yu-Wen, Huang., Tu-Chih, Wang., Bing-Yu, Hsieh., Liang-Gee, Chen.: Hardware 
Architecture Design for Variable Block Size Motion Estimation in MPEG-4 AVC/JVT/ 
ITU-T H.264, Vol. 2. Proc. IEEE International Symposium on Circuits and Systems (2003) 
5. Min-ho, Kim., In-gu, Hwang., Soo-Ik, Chae.: A Fast VLSI Architecture for Full-Search 
Variable Block Size Motion Estimation in MPEG-4 AVC/H.264, Vol. 1. Asia and South 
Pacific (2005) 
6. Yap, S.Y., McCanny, J.V.: A VLSI Architecture for Advanced Video Coding Motion 
Estimation, Proc. IEEE International Conference on Application-Specific Systems, 
Architectures, and Processors (2003) 
7. Kuhn, P.M., Weisgerber, A., Poppenwimmer, R., Stechele, W.: A Flexible VLSI 
Architecture for Variable Block Size Segment Matching with Luminance Correction, IEEE 
International conference on Application-Specific Systems, Architectures, and Processors 
(1997) 
8. Chen, Y.K., Kung, S.Y.: A systolic design methodology with application to full-search 
block-matching architectures, VLSI Signal Processing (1998) 
9. De Vos, L., Stegherr, M.: Parameterizable VLSI Architectures for the Full-Search Block-
Matching Algorithm, Vol. 36. IEEE Transactions on Circuits and Systems (1989) 
10. Rahman, C.A., Badawy, W.: A Quarter Pel Full Search Block Motion Estimation 
Architecture for H.264/AVC, IEEE International Conference on Multimedia and Expo 
(2005) 
 