An improved parallel architecture for MPEG-4 motion estimation in 3G mobile applications by Xu, Donglai et al.
AN IMPROVED PARALLEL ARCHITECTURE FOR MPEG-4 MOTION
ESTIMATION IN 3G MOBILE APPLICATIONS
Donglai Xu, Rui Gao 
SST, University of Teesside 
Middlesbrough, TS1 3BA, UK 
d.xu@tees.ac.uk 
ABSTRACT 
A highly parallel VLSI core architecture for MPEG-4 motion
estimation is proposed in this paper. It has low memory bandwidth
and low clock rate requirements, and is therefore aimed primarily
at 3G mobile applications. Based on a one-dimensional tree
architecture, it employs a dual-register/buffer technique to
reduce the preload and alignment cycles. As an example, the
full-search block-matching algorithm has been mapped onto this
architecture using a 16-PE array that can calculate the motion
vectors of QCIF video sequences in real time at a 1 MHz clock
rate, using 15.5 Mbytes/s of memory bandwidth.
1. INTRODUCTION 
The third generation (3G) wireless system provides a high-speed
mobile platform with Internet Protocol (IP) [1], which allows the
implementation of many types of IP-based internet applications,
such as e-mail, web browsing and image/video transmission. Among
these, real-time video applications represent an important part
of mobile multimedia [2]. However, because video is inherently
data intensive, compression techniques are required to reduce bit
rates. This is achieved largely by exploiting the temporal
redundancy in video streams with techniques such as motion
estimation (ME). Since ME operations can take up to 80% of the
computational burden of a complete video compression procedure,
ME is the most important component in real-time video
applications [3]. Many VLSI architectures for ME have been
proposed. However, most of them target MPEG-1/2 video coding
applications, such as videophone, video conferencing and video
broadcasting, and are not particularly suitable for mobile and
low-power applications [4]. In this paper, a highly parallel,
low-power architecture based on a one-dimensional tree
architecture [5] is presented. It achieves high data utilisation
through parallel pipelining, and a low clock rate through the
dual-register/buffer technique, which reduces idle clock cycles.
The rest of the paper is organised as follows. In section 2, two
typical ME algorithms are briefly described. Section 3 presents
the proposed VLSI architecture in detail, with emphasis on the
key component, the PE array. In section 4, the performance of the
architecture is analysed in terms of the minimum clock rate and
the minimum memory bandwidth requirement. Finally, conclusions
are drawn in section 5.
Hadj Batatia 
IRIT 
ENSEEIHT, Toulouse, 31071, France 
batatia@enseeiht.fr
2. MOTION ESTIMATION ALGORITHMS 
Recently, the MPEG-4 video standard has been introduced to cover
wireless multimedia applications [2]. It adopts block-matching
algorithms with an alpha binary plane to achieve motion
estimation [3, 6]. Figure 1 illustrates the principle of the
block-matching motion estimation technique. First, the video
frames are segmented into N×N non-overlapping rectangular blocks.
Every block within the current frame is matched to the
corresponding blocks within a search area on the previous frame.
A matching criterion, or distortion function, that measures the
similarity between the current block and a candidate block is
calculated. Then, a motion vector pointing to the candidate block
that has the minimum distortion with respect to the current block
is generated to represent the real movement of the objects in the
compressed video stream [3]. Thus, the temporal redundancy within
a video sequence is reduced.
Figure 1. Block-matching
2.1. Full-search block-matching algorithm 
Because of its low distortion and regular data flow, full-search
block matching has been one of the most widely used ME
algorithms. In this algorithm, the current block located at the
pixel (x, y), as shown in figure 1, is matched to every candidate
block within a (2p+N-1)×(2p+N-1) search window, where [-p, p-1]
is the pixel search range. For every candidate block with a
displacement (dx, dy), a sum of absolute differences (SAD) is
calculated, which is given by
$$\mathrm{SAD}(dx, dy) = \sum_{m=x}^{x+N-1} \sum_{n=y}^{y+N-1} \left| I_k(m, n) - I_{k-1}(m+dx, n+dy) \right|$$
where I_k(m, n) and I_{k-1}(m, n) are the intensity values of the
pixels located at position (m, n) in the current and previous
frames, respectively. The SAD for the next candidate block is
calculated in the same way and compared to the existing SAD. The
block giving the smaller SAD is kept as the minimum candidate.
This process continues until all candidate blocks have been
matched and a final minimum SAD is obtained. The motion vector is
the displacement (dx, dy) of the block corresponding to this
minimum [3].
0-7803-7663-3/03/$17.00 ©2003 IEEE
This paper was originally published in the Proceedings of the 2003 IEEE
International Conference on Acoustics, Speech, & Signal Processing,
April 6-10, 2003, Hong Kong (cancelled). Reprinted with permission.
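The full-search procedure of section 2.1 can be sketched in a few lines of Python. This is a minimal software illustration, not the hardware mapping described later. The function names `sad` and `full_search`, the list-of-rows frame layout (`frame[row][column]`) and the rule of skipping candidates that fall outside the previous frame are assumptions made for the example; the source does not discuss frame borders.

```python
def sad(cur, prev, x, y, dx, dy, n):
    """Sum of absolute differences between the n x n current block at
    (x, y) and the candidate block displaced by (dx, dy)."""
    total = 0
    for row in range(y, y + n):
        for col in range(x, x + n):
            total += abs(cur[row][col] - prev[row + dy][col + dx])
    return total


def full_search(cur, prev, x, y, n, p):
    """Return (minimum SAD, (dx, dy)) over the search range [-p, p-1].

    Candidates that would leave the previous frame are skipped
    (an assumption of this sketch).
    """
    height, width = len(prev), len(prev[0])
    best = None
    for dy in range(-p, p):
        for dx in range(-p, p):
            inside = (0 <= x + dx and x + dx + n <= width
                      and 0 <= y + dy and y + dy + n <= height)
            if not inside:
                continue
            s = sad(cur, prev, x, y, dx, dy, n)
            if best is None or s < best[0]:   # keep the smaller SAD
                best = (s, (dx, dy))
    return best
```

For a 16×16 block and p = 8, a call such as `full_search(cur, prev, x, y, 16, 8)` evaluates up to (2p)×(2p) = 256 candidate blocks, which is the workload the architecture in section 3 distributes over the PE array.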
2.2. Object-based motion estimation
The recently finalised MPEG-4 standard emphasises object-based
motion estimation, which estimates the movements of the objects
in a video sequence, rather than blocks [6]. To support motion
estimation for arbitrarily shaped objects, an alpha binary plane
has to be defined. The alpha plane records whether a pixel is
inside the object or not [3]. Thus, the SAD for the object can be
written as
$$\mathrm{SAD}(dx, dy) = \sum_{m=x}^{x+N-1} \sum_{n=y}^{y+N-1} \mathrm{alpha}(m, n) \cdot \left| I_k(m, n) - I_{k-1}(m+dx, n+dy) \right|$$
where alpha(m, n) is the binary value for the pixel (m, n) in the
current block. The value is one when the pixel is inside the
object; otherwise, it is zero, as shown in figure 2.
Figure 2. Alpha binary plane
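The effect of the alpha plane on the SAD can be illustrated with a short Python sketch. The function name `masked_sad` and the representation of blocks as equally sized lists of rows are assumptions of the example; only the rule that pixels with alpha = 0 contribute nothing comes from the text above.

```python
def masked_sad(cur_block, prev_block, alpha):
    """SAD restricted to pixels whose alpha value is 1 (inside the object)."""
    total = 0
    for cur_row, prev_row, alpha_row in zip(cur_block, prev_block, alpha):
        for c, q, a in zip(cur_row, prev_row, alpha_row):
            total += a * abs(c - q)   # a is 0 or 1, so outside pixels add nothing
    return total


cur_block = [[10, 20], [30, 40]]
prev_block = [[12, 20], [30, 50]]
alpha = [[1, 1], [1, 0]]              # bottom-right pixel lies outside the object
print(masked_sad(cur_block, prev_block, alpha))   # prints 2
```

The difference of 10 at the bottom-right pixel is ignored because its alpha value is zero, so only the difference of 2 at the top-left pixel is counted.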
3. PROPOSED ARCHITECTURE
In this section, we describe the main components of the proposed 
architecture and give details of the PE array, which is the most 
computationally intensive part of the system. 
3.1. System overview 
Figure 3 shows the block diagram of the ME architecture, which
includes five components: memory unit, address generator, PE
array, minimum unit and control CPU [3].
The memory unit is divided into two modules. One stores the
current frame data and alpha plane data; the other stores the
previous frame data. The address generator computes the addresses
at which the candidate pixels for block matching are stored. It
also fetches the pixel data from the memory unit and feeds them
into the PE array. The PE array computes the absolute differences
between the previous and current frames and sends the results to
the minimum unit. The SADs of all parallel-processed blocks are
then generated in the minimum unit and compared to find the
minimum, which is stored in the minimum SAD register. Meanwhile,
a minimum flag signal is output to the control CPU, which,
jointly with the address generator, gives the location where a
motion vector is found.
Figure 3. System block diagram
3.2. PE array 
The PE array is the key component of the ME architecture. It
predominantly determines the performance of the system in terms
of memory bandwidth and minimum clock rate for real-time
processing. The PE array is based on the one-dimensional tree
architecture presented in [5]. It uses additional preload cycles
and, to increase the parallelism of the data flow, a group of
parallel-pipelined processing elements, as shown in figure 4.
Figure 4. PE array architecture
In this architecture, motion estimation is carried out in two
stages, preload and matching. In the preload cycles, as shown in
figure 5, the current block data and the alpha plane data are
preloaded into the PE array and stored locally in the appropriate
PEs. Then, in the matching cycles, as illustrated in figure 6,
the previous block data are loaded into the PEs by parallel
pipelining. Before being matched to the preloaded current data,
the previous data must be aligned with the current data. It takes
N_PE clock cycles to align the previous data with the current
data, where N_PE is the number of PEs. Once the SAD calculation
starts, the previous data shift from left to right within the PE
array until they match the corresponding current data already in
the PE array. In every clock cycle, N_PE absolute difference
values are calculated for each of the parallel-processed blocks,
and they are summed by a group of adders in the PE array
(figure 4). The summed result is then sent to the minimum unit,
which calculates the SAD for each of the matching points and
finds the minimum SAD for the motion vector.
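The matching stage described above can be modelled behaviourally as follows. This is only a sketch of the data flow, not of the circuit: the names `adder_tree`, `pe_cycle` and `minimum_unit` are hypothetical, and gating each absolute difference by the locally stored alpha bit is an assumption based on section 2.2.

```python
def adder_tree(values):
    """Sum a list by pairwise addition, level by level, as a binary
    adder tree would; an odd element is passed up unchanged."""
    level = list(values)
    while len(level) > 1:
        nxt = [level[i] + level[i + 1] for i in range(0, len(level) - 1, 2)]
        if len(level) % 2:
            nxt.append(level[-1])
        level = nxt
    return level[0] if level else 0


def pe_cycle(cur_regs, alpha_regs, prev_regs):
    """One matching cycle: every PE forms alpha * |current - previous|
    and the tree adds the N_PE results into one partial sum."""
    diffs = [a * abs(c - q) for c, a, q in zip(cur_regs, alpha_regs, prev_regs)]
    return adder_tree(diffs)


def minimum_unit(partial_sums):
    """Accumulate the partial sums of each candidate into its SAD and
    return (minimum SAD, index of the winning candidate)."""
    sads = [sum(parts) for parts in partial_sums]
    best = min(range(len(sads)), key=sads.__getitem__)
    return sads[best], best
```

Here `pe_cycle` stands for what the PE array produces in a single clock cycle, while `minimum_unit` receives, for every candidate, the list of partial sums gathered over all sub-blocks of the current block.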
Figure 5. Preload cycles
Figure 8. Double register architecture
Figure 6. Matching cycles
Clearly, there should be N_PE processing elements in the PE
array. Here, we assume that N_PE is 16 and, as an example, the
full-search BMA has been chosen to evaluate this architecture. In
this case, N_p candidate blocks can be processed simultaneously,
and the pipelining can be organised as in figure 7.
Figure 7. Pipeline organisation
3.3. Dual register/buffer
The architecture presented in section 3.2 suggests that the
current data and the alpha plane data need to be loaded only once
when processing N_p candidate blocks. Hence, as N_p increases,
the bandwidth required for the current data is sharply reduced.
However, extra clock cycles are needed to preload the current
data and to align the previous data with the current data, and
the PEs are idle during preloading and aligning. For instance,
N_PE/4 cycles (with four preload buses) are needed to preload the
current data and alpha plane data, and N_PE cycles are needed for
data alignment. This can lead to a high clock speed requirement.
To solve this problem, a dual register/buffer structure has been
introduced in the processing elements. As illustrated in
figure 8, each PE has two 8-bit registers for the previous data
and two 9-bit registers for the current and alpha plane data, so
that preloading and matching can be performed simultaneously.
While the PE is matching the data in register Group A, the
following data are preloaded into register Group B. When the
matching of the data in Group A is completed, the PE switches
operational mode to match the data in Group B, while the Group A
registers are preloaded.
3.4. High parallelism
Unlike the other architectures presented in [3, 4, 5],
highly parallel processing can be easily achieved on the proposed
architecture, because a large number of blocks (more than 16) can
be processed simultaneously. Figure 9 illustrates the data flow
organisation of the architecture when processing 32 blocks in
parallel. In this illustration, the pixel search range is [-8, 7]
and the block size is 16×16. When the data of the first row (from
[1,1] to [1,16]) of the current block are in the PE array, as
shown in figure 9 (a), all previous data in the search range need
to be matched, as shown in figure 9 (b). The data required to
match the first sixteen blocks are the first row of the search
range, as shown in figure 9 (c). To achieve highly parallel
processing, we simply load the data for blocks 17 to 32 (i.e.,
the second row of the search range) after completing the matching
of blocks 1 to 16, without changing the current data, as shown in
figure 9 (d). Hence, there is no need to access the memory for
another group of current data, nor for further preloading cycles.
Furthermore, with the dual register/buffer structure, the data in
the second row are loaded into register Group B while the first
row data in register Group A are being matched. This allows the
alignment cycles for the previous data to be skipped.




Figure 9. Paralleled data flow for previous block (search range)
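The alternation between register Group A and Group B is a form of double buffering, which the following Python sketch models at a purely behavioural level. The class `DualRegisterPE`, its methods and the helper `run` are hypothetical names introduced for illustration; cycle counts and bus widths are deliberately left out.

```python
class DualRegisterPE:
    """One PE with two register groups (A and B). Matching reads the
    active group while new data are written into the idle one."""

    def __init__(self):
        self.groups = {"A": None, "B": None}
        self.active = "A"

    @property
    def idle(self):
        return "B" if self.active == "A" else "A"

    def preload(self, data):
        """Write the next data into the idle group; the active group is untouched."""
        self.groups[self.idle] = data

    def switch(self):
        """Swap roles once matching on the active group has finished."""
        self.active = self.idle

    def match_data(self):
        return self.groups[self.active]


def run(pe, stream):
    """Match every item of a non-empty list `stream` in order,
    preloading item i+1 while item i is being matched."""
    matched = []
    pe.groups[pe.active] = stream[0]      # initial preload (start-up cost)
    for nxt in stream[1:] + [None]:
        pe.preload(nxt)                   # overlaps with the matching below
        matched.append(pe.match_data())
        pe.switch()
    return matched
```

Calling `run(DualRegisterPE(), ["row 1", "row 2", "row 3"])` returns the rows in their original order: only the first row pays a visible loading cost, and every later row is already in place when the PE switches groups.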
4. PERFORMANCE ANALYSIS 
The architecture is aimed at the 3G mobile platform, which
currently offers 64 kbits/s of bandwidth to upload and transfer
data. Thus, a minimum compression ratio of 70 is required to
achieve real-time video applications with acceptable visual
quality [4]. If we adopt QCIF as the typical video format in
mobile applications, the uncompressed and compressed video data
per frame will be (176×144×8)/1024 = 198 kbits and 198/70 = 2.829
kbits, respectively. Therefore, the video transmission rate over
the 3G platform should be 64/2.829 ≈ 22.6 frames per second.
Taking into consideration the bandwidth requirements of audio and
protocol overhead, the maximum frame rate will be 20 frames per
second, which determines the minimum clock rate for real-time
processing.
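The frame-rate budget above is simple arithmetic and can be checked with a few lines of Python. The function name `frame_budget` and its parameter names are illustrative; the default values are the figures quoted in the text (QCIF, 8 bits per pixel, a 64 kbits/s channel and a compression ratio of 70).

```python
def frame_budget(width=176, height=144, bits_per_pixel=8,
                 channel_kbits=64, compression=70):
    """Return (raw kbits per frame, compressed kbits per frame,
    frames per second that fit in the channel)."""
    raw_kbits = width * height * bits_per_pixel / 1024
    compressed_kbits = raw_kbits / compression
    frames_per_second = channel_kbits / compressed_kbits
    return raw_kbits, compressed_kbits, frames_per_second


raw, compressed, fps = frame_budget()
print(raw, round(compressed, 3), round(fps, 1))
```

The result leaves a small margin above the 20 frames per second adopted for the rest of the analysis, which is the allowance the text makes for audio and protocol overhead.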
4.1. Minimum clock rate analysis 
To meet the real-time processing condition,
(N_h×N_v)/(N×N)×fps current blocks have to be matched per
second, where N_h×N_v is the frame size (176×144 for QCIF) and
N×N is the block size (16×16 in our case). With a search range
[-p, p-1], there are N_can = 2p×2p candidate blocks for every
current block. Therefore, (N_h×N_v)/(N×N)×fps×(2p)×(2p) pairs
of blocks must be matched every second. The current blocks are
divided into a group of N_PE-pixel sub-blocks, in which the
pixels can be matched simultaneously within the PE array. The
number of clock cycles to match a sub-block with N_p candidate
blocks, which are processed simultaneously, is defined as C_sub.
In addition, the number of clock cycles needed to preload the
current data and alpha plane data is defined as C_pre, and the
numbers of cycles for matching and aligning are defined as
C_match and C_align, respectively. We have
$$C_{sub} = (C_{pre} + C_{align} + C_{match}) \times N_{sub}$$
$$N_{sub} = (N \times N)/N_{PE}; \quad C_{pre} = N_{PE}/N_{pre}; \quad C_{align} = N_{PE} - 1; \quad C_{match} = 2p$$
where N_PE is the number of processing elements, N_pre is the
number of preload buses and N_sub is the number of sub-blocks
within the current block. Moreover, C_sub/N_p is the number of
clock cycles required to match a sub-block. Therefore, the
minimum clock rate required is given by:
$$C_{clk} = \frac{[N_{PE}/N_{pre} + (N_{PE} - 1) + 2p] \times N^2 \times (2p)^2 \times fps \times N_h \times N_v}{N_p \times N^2 \times N_{PE}}$$
For the PE array with the dual register/buffer structure, there
are only N_PE - 1 preload cycles at the beginning of the motion
estimation process. Hence, the minimum clock rate can be
calculated as:
The above formulas suggest that the minimum clock speed required
decreases as N_p increases for both the single and dual register
structures. Moreover, for smaller N_p, the dual-register
architecture requires a much lower clock speed than the
single-register one.
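The clock-rate expression for the single-register array can be evaluated numerically with the sketch below. The function `min_clock_single` follows the C_clk expression of section 4.1 term by term. The function `min_clock_dual_assumed` is not taken from the source: it is a simplifying assumption that drops the per-sub-block preload and alignment cycles and ignores the one-off N_PE - 1 start-up cycles, and it is included only to compare trends. All function and parameter names are illustrative.

```python
def block_pairs_per_second(p, n=16, n_h=176, n_v=144, fps=20):
    """Pairs of (current block, candidate block) to match every second."""
    return (n_h * n_v) / (n * n) * fps * (2 * p) ** 2


def min_clock_single(n_pe, n_pre, n_p, p, n=16, n_h=176, n_v=144, fps=20):
    """Minimum clock rate in Hz for the single-register PE array."""
    c_pre = n_pe / n_pre          # preload cycles
    c_align = n_pe - 1            # alignment cycles
    c_match = 2 * p               # matching cycles
    n_sub = (n * n) / n_pe        # sub-blocks per current block
    c_sub = (c_pre + c_align + c_match) * n_sub
    return c_sub / n_p * block_pairs_per_second(p, n, n_h, n_v, fps)


def min_clock_dual_assumed(n_pe, n_p, p, n=16, n_h=176, n_v=144, fps=20):
    """ASSUMPTION (not from the source): with dual registers only the
    matching cycles remain; start-up cycles are treated as negligible."""
    n_sub = (n * n) / n_pe
    c_sub = (2 * p) * n_sub
    return c_sub / n_p * block_pairs_per_second(p, n, n_h, n_v, fps)


for n_p in (16, 32):
    print(n_p, min_clock_single(16, 4, n_p, 8), min_clock_dual_assumed(16, n_p, 8))
```

The absolute values depend on how N_p and the cycle counts are interpreted, so the sketch is meant for comparing trends (a lower clock rate for larger N_p and for the dual-register structure) rather than for reproducing the figures quoted in the abstract.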
4.2. Minimum memory bandwidth requirement analysis
Power consumption is another important consideration for the
intended mobile applications. For an ME algorithm, memory access
operations, rather than the clock rate, are the predominant
contributor to power consumption [3]. For the parallel-pipelined
architecture, as illustrated in figure 7, the total amount of
data fed to the PE array per second, M_bw, is equal to the
quantity of memory access for every candidate block multiplied by
the number of candidate blocks. Therefore,
$$M_{bw} = \frac{Q_{current\&alpha} + Q_{previous}}{N_p} \times N_{sub} \times N_{can} \times fps \times \frac{N_h \times N_v}{N^2}$$
where Q_current&alpha is the quantity of current and alpha data
memory access for every sub-block, and Q_previous is the quantity
of previous data memory access for N_p candidate sub-blocks.
Then,
$$M_{bw} = \frac{[N_{PE} \times 9 + (N + 2p - 1) \times 8] \times N^2 \times (2p)^2 \times fps \times N_h \times N_v}{N_p \times N^2 \times N_{PE}}$$
The preload bus is 9 bits wide: 8 bits are needed for the current
block data and 1 bit for the alpha plane data. The matching data
bus is 8 bits wide for the previous data. This formula shows that
the memory bandwidth falls sharply as N_p increases, especially
when N_p is in the range of 16 to 32. In addition, the
architectures with single and dual register/buffer structures
have the same quantity of memory access per second. Therefore,
they have the same minimum memory bandwidth requirement.
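The memory bandwidth expression of section 4.2 can be evaluated in the same way as the clock rate. The function name `memory_bandwidth_bits` and its parameter names are illustrative, and the result is expressed in bits per second; the 9-bit preload word and 8-bit previous-data word are the bus widths stated in the text.

```python
def memory_bandwidth_bits(n_pe, n_p, p, n=16, n_h=176, n_v=144, fps=20,
                          preload_bits=9, previous_bits=8):
    """Data fed to the PE array per second, in bits."""
    q_cur_alpha = n_pe * preload_bits               # current + alpha data per sub-block
    q_previous = (n + 2 * p - 1) * previous_bits    # previous data for N_p candidate sub-blocks
    n_sub = (n * n) / n_pe                          # sub-blocks per current block
    n_can = (2 * p) ** 2                            # candidate blocks per current block
    blocks_per_second = fps * (n_h * n_v) / (n * n)
    return (q_cur_alpha + q_previous) / n_p * n_sub * n_can * blocks_per_second


for n_p in (1, 16, 32):
    print(n_p, memory_bandwidth_bits(16, n_p, 8) / 8e6, "Mbytes/s")
```

Because N_p appears only in the denominator, doubling the number of candidate blocks processed in parallel halves the required bandwidth, which is the reduction the text describes. As with the clock-rate sketch, the parameter choices here are assumptions and the printed values are not claimed to match the abstract.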
5. CONCLUSIONS 
This paper presents a parallel VLSI architecture for motion
estimation, aimed at 3G mobile applications. Initial analysis
shows that the architecture requires relatively low memory
bandwidth and clock rate, and is therefore suitable for
low-power, low-cost VLSI design and implementation. Moreover,
owing to the adoption of the dual-register structure, the
architecture significantly speeds up data processing and
therefore provides high throughput. These properties make the
architecture ideal for mobile video applications.
6. REFERENCES 
[1] E.L.H. Zonya, "Evolution in the Technological Revolution:
Preparing for 3G Wireless Technology," National Urban League
Technology Policy Alert, January 2001.
[2] S.N. Fabri, S. Worral, A. Sadka, and A. Kondoz, Real-time
Video Communications over GPRS, University of Surrey, UK.
[3] P. Kuhn, Algorithms, Complexity Analysis and VLSI
Architecture for MPEG-4 Motion Estimation, Kluwer Academic
Publishers, London, 2000.
[4] A.J. Roach and A. Moini, "VLSI Architecture for Motion
Estimation on a Single-chip Video Camera," Visual
Communications and Image Processing 2000, Proceedings of
SPIE, Vol. 4067, 2000.
[5] Y.S. Jehng, L.G. Chen, and T.D. Chiueh, "An Efficient and
Simple VLSI Tree Architecture for Motion Estimation
Algorithms," IEEE Transactions on Signal Processing,
Vol. 41, No. 2, Feb. 1993.
[6] E. Touradj, C. Horne, "MPEG-4 Natural Video Coding - An
Overview," Signal Processing: Image Communication, Vol. 15,
pp. 365-385, 2000.
