Implementation of BMA based motion estimation hardware accelerator in HDL by Jugade, Nachiket
UNLV Retrospective Theses & Dissertations 
1-1-2008 
Implementation of BMA based motion estimation hardware 
accelerator in HDL 
Nachiket Jugade 
University of Nevada, Las Vegas 
Repository Citation 
Jugade, Nachiket, "Implementation of BMA based motion estimation hardware accelerator in HDL" (2008). 
UNLV Retrospective Theses & Dissertations. 2374. 
https://digitalscholarship.unlv.edu/rtds/2374 
IMPLEMENTATION OF BMA BASED MOTION ESTIMATION HARDWARE
ACCELERATOR IN HDL
by
Nachiket Jugade
Bachelor of Science in Electrical Engineering
University of Pune, India
2006
A thesis submitted in partial fulfillment
of the requirements for the
Master of Science Degree in Electrical Engineering
Department of Electrical and Computer Engineering 
Howard R. Hughes College of Engineering
Graduate College 
University of Nevada Las Vegas 
August 2008
UMI Number: 1460531
UMI Microform 1460531
Copyright 2009 by ProQuest LLC.
All rights reserved. This microform edition is protected against
unauthorized copying under Title 17, United States Code.
ProQuest LLC
789 E. Eisenhower Parkway
P.O. Box 1346
Ann Arbor, MI 48106-1346
UNLV Thesis Approval
The Graduate College
University of Nevada, Las Vegas
July 15, 2008
The Thesis prepared by
NACHIKET JUGADE
Entitled
IMPLEMENTATION OF BMA BASED MOTION
ESTIMATION HARDWARE ACCELERATOR USING HDL
is approved in partial fulfillment of the requirements for the degree of
MASTER OF SCIENCE IN ELECTRICAL ENGINEERING
Examination Committee Chair
Dean of the Graduate College
Examination Committee Member
Examination Committee Member
Graduate College Faculty Representative
ABSTRACT
Implementation of BMA Based Motion Estimation Hardware Accelerator in HDL
by
Nachiket Jugade
Dr. Henry Selvaraj, Examination Committee Chair
Professor of Electrical Engineering
University of Nevada, Las Vegas
Motion Estimation in MPEG (Moving Picture Experts Group) video is a temporal
prediction technique. The basic principle of motion estimation is that in most cases,
consecutive video frames will be similar except for changes induced by objects moving
within the frames. Motion Estimation performs a comprehensive 2-dimensional spatial
search for each luminance macroblock (16x16 pixel block). MPEG does not define how
this search should be performed. This is a detail that the system designer can choose to
implement in one of many possible ways. It is well known that a full, exhaustive search
over a wide 2-dimensional area yields the best matching results in most cases, but this
performance comes at an extreme computational cost to the encoder. Some lower cost
encoders might choose to limit the pixel search range, or use other techniques, usually at
some cost to the video quality, which gives rise to a trade-off.
Such algorithms used in image processing are generally computationally expensive.
FPGAs are capable of running graphics algorithms at a speed comparable to dedicated
graphics chips. At the same time they are configurable through high-level hardware
description languages, e.g. Verilog and VHDL. The work presented focuses entirely upon
a Hardware Accelerator capable of performing Motion Estimation based upon the Block
Matching Algorithm. The SAD-based Full Search Motion Estimation, coded in Verilog HDL,
relies upon a 32x32 pixel search area to find the best match for a single 16x16 macroblock.
Keywords: Motion Estimation, MPEG, macroblock, FPGA, SAD, Verilog, VHDL
TABLE OF CONTENTS
ABSTRACT
LIST OF FIGURES
ACKNOWLEDGEMENTS
CHAPTER 1 INTRODUCTION
1.1 Introduction to motion estimation theory
1.2 Block matching algorithm theory
1.2.1 Introduction
1.2.2 Full search block matching algorithm
1.2.3 2D logarithmic search
1.2.4 Three step search
1.2.5 Parallel Hierarchical One-Dimensional Search (PHODS)
CHAPTER 2 BACKGROUND OF MPEG
2.1 Background and Overview
2.1.1 Color conversion
2.1.2 Motion estimator
2.1.3 Motion compensator
2.1.4 Discrete cosine transformation
2.1.5 Quantization
2.1.6 Huffman Coding
CHAPTER 3 IMPLEMENTATION OF MOTION ESTIMATION HARDWARE ACCELERATOR
3.1 Introduction
3.2 Block diagram description
3.2.1 Reference frame storage
3.2.2 Current frame storage
3.2.3 Reference frame control
3.2.4 Current Frame Control
3.2.5 SAD module
3.2.6 State Machine
3.2.7 Pipelining approach
CHAPTER 4 RESULTS
CHAPTER 5 CONCLUSION AND FUTURE RECOMMENDATIONS
BIBLIOGRAPHY
LIST OF FIGURES
Figure 1.1 Illustrations of two consecutive frames
Figure 1.2 Illustration of Block Matching Algorithm (BMA)
Figure 1.3 2D Logarithmic search
Figure 1.4 Three step search
Figure 2.1 Video encoder block diagram
Figure 2.2 Illustration of Huffman coding
Figure 3.1 Motion estimation block diagram
Figure 3.2 Sliding motion of the reference macroblock within the search area
Figure 3.3 Sliding window architecture
Figure 3.4 Architecture to find the lower among A and B
Figure 3.5 16x1 SAD architecture
Figure 3.6 16x16 SAD architecture
Figure 3.7 State Machine
Figure 3.8 Logic with combinational logic delay
Figure 3.9 Logic with reduced combinational logic delay
Figure 4.1 (a) and (b) Illustrations of current macroblock and reference search area respectively
Figure 4.2 Simulation results for motion estimation without pipelining
Figure 4.3 Simulation results for motion estimation with pipelining
ACKNOWLEDGEMENTS
I take this opportunity to acknowledge the people who have helped me during the
course of my thesis work. My special thanks go to my advisor, Dr. Henry Selvaraj, who has
helped me and guided me in the right direction throughout the course of this thesis. His
invaluable support and encouragement have made this possible. I would also like to thank
my parents and my sister, who are my greatest support system and encouraged me to work
hard during these 2 years. Without them, it would have been not merely difficult
but impossible. I would also like to thank my committee members, Dr. Venki, Dr.
Regentova and Dr. Gewali, for their timely support and encouragement. Last but not
least, I would like to thank my B348 lab mates, Ram, Anin, Vikram and Ashwini, with
whom I enjoyed the lighter moments during my time in this lab. I cannot imagine these 2
years without such wonderful fellow lab mates. Thank you everyone!
CHAPTER 1
INTRODUCTION
1.1 Introduction to motion estimation theory
Image compression, whether for still pictures or motion pictures (e.g., video), plays
an important role in Internet and multimedia applications, digital appliances such as
HDTV, and handheld devices such as digital cameras and mobile phones. Compression
allows one to represent images and video with a much smaller amount of data and
negligible quality loss. The reduction in data decreases storage requirements (important
for embedded devices) and provides higher effective transmission rates (important for
Internet-enabled devices). Unfortunately, implementing a compression scheme can be
especially difficult. For performance reasons, implementations are typically not portable,
as they are tuned to specific architectures. And while image and video compression is
needed on embedded systems, desktop PCs, and high-end servers, implementing it
separately for every likely architecture is not cost effective. Furthermore, compression
standards are continuously evolving, and thus compression programs must be easy to
modify and update.
In the last few years there has been a growing trend to design very complex and
efficient processing systems by integrating already developed and dedicated cores which
implement, in a particularly efficient way, certain specific and critical parts of the main
system. Such a design approach can be used either to obtain very complex
and autonomous processing architectures, or to implement specific and dedicated
processing structures that will be integrated with other larger-scale processing modules,
in the form of co-processors, to alleviate the computational burden. As a consequence, a
significant number of quite different processing modules have been proposed and made
available, providing easy integration with the target processing systems and a
substantial reduction of the design effort. To attain this objective, these processing cores
have to follow strict design methodologies, in order to provide an easy and efficient
implementation in a broad range of target implementation technologies (e.g., FPGA,
ASIC, etc.) [2].
Recently, we are witnessing a new trend that is again quickly reshaping embedded
processor design. Instead of implementing the time-critical tasks in ASICs, these tasks
are implemented in field-programmable gate array (FPGA) structures or comparable
technologies [6, 7]. FPGAs offer advantages such as:
■ Increased flexibility: The functionality of the embedded processor can be quickly
changed without requiring another roll-out of the embedded processor itself, and
design faults can be quickly rectified. It also allows for quick adaptation to new
(possibly unforeseen) developments.
■ Sufficient performance: The performance of FPGAs has increased tremendously
and is quickly approaching that of ASICs [2]. This seems to be mainly due to the
faster adoption of new technological advancements by FPGAs than by ASICs.
■ Faster design times: Faster design times are achieved by re-using intellectual
property (IP) cores or by slightly modifying them. More importantly, high-level
design languages (such as Verilog, VHDL, etc.) can be used in the design process,
thereby speeding it up significantly.
A field-programmable gate array is a semiconductor device containing programmable
logic components called "logic blocks", and programmable interconnects. Logic blocks
can be programmed to perform the function of basic logic gates such as AND and XOR,
or more complex combinational functions such as decoders or mathematical functions. In
most FPGAs, the logic blocks also include memory elements, which may be simple
flip-flops or more complete blocks of memory. A hierarchy of programmable interconnects
allows logic blocks to be interconnected as needed by the system designer, somewhat like
a one-chip programmable breadboard. Logic blocks and interconnects can be
programmed by the customer or designer, after the FPGA is manufactured, to implement
any logical function, hence the name "field-programmable".
FPGAs are usually slower than their application-specific integrated circuit (ASIC)
counterparts, cannot handle as complex a design, and draw more power (for any given
semiconductor process). But their advantages include a shorter time to market, the ability
to be re-programmed in the field to fix bugs, and lower non-recurring engineering costs.
Vendors can sell cheaper, less flexible versions of their FPGAs which cannot be modified
after the design is committed. The designs are developed on regular FPGAs and then
migrated into a fixed version that more closely resembles an ASIC. To configure
("program") an FPGA, we specify how we want the chip to work with a logic circuit
diagram or with source code in a hardware description language (HDL). The HDL form can
be easier to work with when handling large structures, because they can be specified
numerically rather than drawn piece by piece. On the other hand, schematic entry can
allow a tighter specification of the intended circuit. VHDL and Verilog HDL are popular
for this purpose. SystemC is also in use and is popular with embedded systems designers.
To go from schematic or HDL source files to an actual configuration, the source files are
fed to a software suite from the FPGA vendor that, through several steps, produces a
configuration file. This file is then transferred to the FPGA via a serial or USB (JTAG)
interface, or to an external memory device such as an EEPROM.
The literature survey revealed that many companies have done extensive research on this
part of video encoding. However, HDL code for a motion estimation core is not easily
available on the World Wide Web, so the code for this project was written from scratch
after a comprehensive literature survey. For simulation purposes a smaller test image is
used. Because the architecture is generic, it can be extended to larger images, though
testing them would need more complex debugging equipment and techniques.
Digital video compression entails the use of many coding techniques with the
ultimate goal of reducing the size of the digital representation of a video sequence. The
same techniques used to compress digital pictures, e.g., in the JPEG picture standard, can
be applied to single video frames. Such techniques exploit the fact that colors in
photographic images tend to change only gradually when traversed from one side to
another. In the video coding case, the fact that subsequent video frames do not differ
much can be similarly exploited to increase compression efficiency. All coding
techniques fall into two main categories, namely lossy and lossless
techniques. Lossy coding techniques remove pel information that the human eye is
unable to perceive, using techniques such as the discrete cosine transform and
quantization. The removed information in most cases cannot be exactly
regained; it can usually only be approximated. Lossless coding techniques, on the other
hand, do not remove any information. Instead, they exploit redundancies, i.e.,
similarities, between pels found within and between video frames, which results in the
representation of pel information using fewer bits. One lossless coding technique is
predictive coding, which predicts the current pel(s) from reference pel(s) and then stores
the difference(s) between the prediction and the current pel(s). Assuming redundancy
between pels, the differences are usually small and can be coded using fewer bits than
the original pels. Predictive coding can use pels from the same video frame as
reference pels (intra-coding) or pels from other video frames (inter-coding). Inter-frame
predictive coding can contribute to the overall compression efficiency, because
consecutive video frames are usually similar, i.e., they do not differ much. In this sense,
the reference pels can be found in a reference frame at the same position as the
current pels in the current (to be coded) frame. This approach can also be used to capture
scene changes by choosing the reference frame from the near future of the current (to be
encoded) frame instead of from its past. However, such a straightforward approach has one
major drawback. Objects in a video scene tend to move around, resulting in poor
compression performance of the straightforward inter-frame predictive coding method,
because pels located at the same location in consecutive frames are now quite different.
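The drawback described above can be illustrated with a small numerical sketch. It is not part of the implementation in this work: the 4x4 frames and the helper names residual and energy are hypothetical and chosen only for illustration. The residual between co-located pels is zero for a static scene, but it becomes large as soon as the object moves by a single pel.

```python
def residual(current, reference):
    """Difference between each current pel and the co-located reference pel."""
    return [[c - r for c, r in zip(c_row, r_row)]
            for c_row, r_row in zip(current, reference)]


def energy(block):
    """Sum of absolute residual values; small values are cheap to code."""
    return sum(abs(v) for row in block for v in row)


# Hypothetical reference frame: a bright 2x2 object on a dark background.
reference = [[0,   0,   0, 0],
             [0, 200, 200, 0],
             [0, 200, 200, 0],
             [0,   0,   0, 0]]

# Case 1: nothing moves, so the current frame equals the reference frame.
static = [row[:] for row in reference]

# Case 2: the same object has moved one pel to the right.
moved = [[0, 0,   0,   0],
         [0, 0, 200, 200],
         [0, 0, 200, 200],
         [0, 0,   0,   0]]

static_energy = energy(residual(static, reference))
moved_energy = energy(residual(moved, reference))
print(static_energy, moved_energy)
```

In the second case a motion vector of one pel would bring the residual back to zero, which is the motivation for motion estimation.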
In video coding, similarities between video frames can be exploited to achieve higher
compression ratios. However, moving objects within a video scene diminish the
compression efficiency of the straightforward approach that only considers pels located at
the same position in the video frames. In order to achieve higher compression efficiency,
motion estimation was introduced in an attempt to accurately capture such movements. It
is performed for every macroblock, i.e., an array of 16x16 pels, in the frame to be
encoded by finding its ‘best’ match in a reference frame. The most commonly used metric is
the “Mean Absolute Differences” (MAD), which adds up the absolute differences
between corresponding elements in the macroblocks. The MAD operation is very
time-consuming due to the complex nature of the absolute operation and the subsequent
multitude of additions. In [3], a parallel hardware implementation was proposed to speed
up the MAD computation process.
The Motion Estimator is one such module deserving particular attention in the scope of
digital video coding. This block enjoys its own independence, as it is not constrained by
any video coding protocols and its functionality is solely based upon the designer's
creativity, need and application. In fact, although this block is often regarded as one of
the most important operations in video coding to exploit temporal redundancies in
sequences of images, it often represents most of the computation cost of these systems
[1]. As a consequence, real-time Motion Estimation is usually only achievable by
adopting specialized VLSI structures to implement this processing task. Motion
Estimation in MPEG video is a temporal prediction technique. The basic principle of
motion estimation is that in most cases, consecutive video frames will be similar except
for changes induced by objects moving within the frames. In the trivial case of zero
motion between frames (and no other differences caused by noise, etc.), it is easy for the
encoder to efficiently predict the current frame as a duplicate of the prediction frame. In
such a case, the only information necessary to transmit to the decoder is the
syntactic overhead necessary to reconstruct the picture from the original reference frame.
When there is motion in the images, the situation is not as simple.
Motion estimation techniques form the core of video compression and video
processing applications. Motion estimation extracts motion information from the video
sequence. The motion is typically represented using a motion vector (x, y). The motion
vector indicates the displacement of a pixel or a pixel block from the current location due
to motion. Motion information is used in video compression to find the best matching block
in the reference frame and so calculate a low-energy residue, and in scan rate conversion
to generate temporally interpolated frames. It is also used in applications such as motion
compensated de-interlacing, video stabilization, motion tracking, etc. A variety of motion
estimation techniques is available. There are pel-recursive techniques, which derive a
motion vector for each pixel. There is the phase plane correlation technique, which
generates motion vectors via correlation between the current frame and the reference
frame. The computational complexity of a motion estimation technique is determined by
three factors: the search algorithm, the cost function (evaluation function) and the search
range parameter.
We can reduce the complexity of a motion estimation algorithm by
reducing the complexity of the applied search algorithm and/or the complexity of the
selected cost function. A full search algorithm evaluates the cost at every position in the
search window, whereas a more efficient, less complex search algorithm reduces the search
space. Intuitively, one might expect that the ideal processor for reducing temporal
redundancy is one that tracks every pixel from frame to frame. This is computationally
intensive, and such methods do not provide reliable tracking due to the presence of noise in
frames. Instead of tracking individual pixels from frame to frame, video coding standards
only allow tracking information for 16x16 pixel regions, commonly referred to as
macroblocks [1]. The macroblock dimension of 16x16 is chosen because it provides a
good compromise between efficient temporal redundancy reduction and
moderate computational requirements.
Let the two consecutive frames in Fig. 1.1 be denoted as frame (t-1) and frame (t). In
the first stage, we segment frame (t) into non-overlapping 16x16 pixel regions
(macroblocks), and for each 16x16 block we determine a corresponding 16x16 pixel
region in frame (t-1).
Fig. 1.1 Illustration of two consecutive frames
Using the corresponding 16x16 pixel region from frame (t-1), the temporal redundancy
reduction processor generates a representation for frame (t) that contains only the changes
between the two frames. If the two frames have a high degree of temporal redundancy,
then the difference frame will have a large number of pixels with values near zero
[1]. For example, in Fig. 1.1 there is a high degree of temporal redundancy, as evidenced
by the similarity of features in both frames. On the other hand, if frame (t) were
completely different from frame (t-1), then the temporal redundancy reduction processor
may fail to find corresponding regions between the two frames. The other techniques will be
discussed in detail in the later part of this chapter. The most popular technique is the Block
Matching Algorithm, and it is the one used by the implementation described here.
The implementation is based upon the motion estimation accelerator
module proposed in [3, 5]. The extension is implemented with special hardware for alignment
of reference frames and the required control circuitry. A 32x32 pixel search area of the
reference frame is used as standard for each current frame macroblock. The implementation
differs from [3, 5] in its pipelining approach, which considerably reduces the total
computation time for finding the best match.
1.2 Block matching algorithm theory
1.2.1 Introduction
The Block Matching Algorithm (BMA) is the most popular motion estimation algorithm.
It calculates a motion vector for an entire block of pixels instead
of for individual pixels. The same motion vector is applicable to all the pixels in the block.
This reduces the computational requirement and also results in a more accurate motion vector,
since objects are typically a cluster of pixels. The Block Matching Algorithm is illustrated
in Fig. 1.2 [14]. The current frame is divided into pixel blocks and motion estimation is
performed independently for each pixel block. Motion estimation is done by identifying a
pixel block from the reference frame that best matches the current block, whose motion is
being estimated. The reference pixel block is generated by displacement from the current
block’s location in the reference frame. The displacement is provided by the Motion
Vector (MV). The MV is a pair (x, y) of horizontal and vertical displacement
values.
[Figure: the reference frame, containing the search region and the reference block, and the current frame, containing the current block, linked by the motion vector MV1(x, y)]
Fig. 1.2 Illustration of Block Matching Algorithm (BMA)
In video coding terminology, the match is performed between rectangular
regions; the measure used for the match is referred to as a block matching criterion, and
search techniques that find the motion vectors yielding the smallest Mean Absolute
Difference (MAD) are referred to as Block Matching Algorithms. Various criteria are
available for block matching. We restrict ourselves to MAD. Let the pixels of the
macroblock in the current frame be denoted as C(x+k, y+l) and the pixels in the reference
frame be denoted as R(x+i+k, y+j+l). The cost function becomes
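Mean Absolute Difference (MAD):
$$\mathrm{MAD}(i,j) = \frac{1}{MN}\sum_{k=0}^{M-1}\sum_{l=0}^{N-1}\left|\,C(x+k,\;y+l) - R(x+i+k,\;y+j+l)\,\right| \qquad \text{(Eq 1.1)}$$
Sum of Absolute Differences (SAD):
$$\mathrm{SAD}(i,j) = \sum_{k=0}^{M-1}\sum_{l=0}^{N-1}\left|\,C(x+k,\;y+l) - R(x+i+k,\;y+j+l)\,\right| \qquad \text{(Eq 1.2)}$$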
In video coding standards, N = M = 16. The best matching block is the block R(x+i,
y+j) for which MAD(i, j) is minimized. Thus, the coordinates (i, j) for which MAD is
minimized define the motion vector. MAD is obtained by dividing SAD by the
product MN, i.e. 256. In hardware this corresponds to shifting the value right by 8 bit
positions, since 2^8 = 256. MAD provides a fairly good match at a low computational
requirement. Hence it is widely used for block matching. Various other criteria are
also available, such as cross-correlation and maximum matching pixel count. The
reference pixel blocks are generated only from a region known as the search area (or
search region). The search region defines the boundary for the motion vectors and limits
the number of blocks to evaluate. The height and width of the search region depend on the
motion in the video sequence. The available computing power also determines the search
range. A bigger search region requires more computation due to the increase in the number
of evaluated candidates. Typically the search region is kept wider (i.e. its width is
greater than its height), since many video sequences exhibit panning motion. The search
region can also be changed adaptively depending upon the detected motion.
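The relation between SAD and MAD can be sketched in a few lines of Python. This is an illustrative software model only, not the Verilog implementation described later; the function names and the constant-valued test blocks are hypothetical. Note that the right shift truncates any fractional part, as an integer shift in hardware would.

```python
def sad(current, reference):
    """Sum of absolute differences over two equally sized blocks (Eq 1.2)."""
    return sum(abs(c - r)
               for c_row, r_row in zip(current, reference)
               for c, r in zip(c_row, r_row))


def mad_16x16(current, reference):
    """MAD for a 16x16 block (Eq 1.1): divide SAD by 256 via a shift by 8."""
    return sad(current, reference) >> 8


# Hypothetical 16x16 blocks whose pels all differ by exactly 3.
N = 16
current = [[10] * N for _ in range(N)]
reference = [[13] * N for _ in range(N)]

sad_value = sad(current, reference)        # 256 pels, each contributing 3
mad_value = mad_16x16(current, reference)
print(sad_value, mad_value)
```

Because MAD is SAD scaled by a constant, both criteria select the same best match, which is why a SAD-only datapath is sufficient when only the position of the minimum is needed.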
1.2.2 Full search block matching algorithm
Among all the BMAs, the Full-Search Block Matching Algorithm (FSBMA) is the most
popular. FSBMA evaluates every possible pixel block in the search region. Hence, it can
generate the best block matching motion vector. This type of BMA can give the least
possible residue for video compression. However, the required computations are prohibitively
high due to the large number of candidates to evaluate. For typical values for broadcast
TV (I = 720, J = 480 and F = 30), motion estimation based on the full-search algorithm
requires 29.89 GOPS (Giga operations per second) for a search area of 32x32 pixels [1].
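The 29.89 GOPS figure can be reproduced with a short calculation. The formula below is not given in the text above and is an assumption: it supposes a search range of p = 15 pels in each direction, hence (2p+1)^2 = 961 candidate positions, and three operations per pel (subtract, absolute value, accumulate). The function name is hypothetical.

```python
def full_search_ops_per_second(width, height, fps, p, ops_per_pel=3):
    """Estimated operation count of full search for one second of video.

    Assumption: every pel of every frame is compared once per candidate
    position, and each comparison costs ops_per_pel operations.
    """
    candidates = (2 * p + 1) ** 2          # positions in a [-p, p] window
    return width * height * fps * candidates * ops_per_pel


ops = full_search_ops_per_second(720, 480, 30, p=15)
gops = round(ops / 1e9, 2)
print(ops, gops)
```

Under these assumptions the result is 29,890,944,000 operations per second, i.e. about 29.89 GOPS, in agreement with the value quoted from [1].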
The FSBMA is usually used in hardware implementations of Motion Estimation
because of its simplicity, regularity, and optimum result. The most commonly used
metric to determine the best match for FSBMA in hardware is the Mean Absolute
Difference. The main goal is to compute the minimum MAD among all the candidate
blocks. To do this, a search iteration is performed for each candidate block. The MAD adds
up the absolute differences between corresponding elements in the candidate and
reference block. The MAD cost function is described in Equation (1.1).
Field Programmable Gate Arrays support a high number of processor elements (PEs)
operating in parallel. This property can be used to process, at the same time, all SAD
operations for an MPEG macroblock in a search area. In this way, real-time Motion
Estimation for a video encoder can be achieved [15].
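As a software reference for the full search just described, the following sketch evaluates the SAD of a 16x16 macroblock at every position inside a 32x32 search area, which gives 17 x 17 = 289 candidates. It is a sequential model written for clarity, not the parallel hardware architecture of this work; the function names and the synthetic test pattern are hypothetical.

```python
def sad(block_a, block_b):
    """Sum of absolute differences between two equally sized blocks."""
    return sum(abs(a - b)
               for row_a, row_b in zip(block_a, block_b)
               for a, b in zip(row_a, row_b))


def full_search(current_block, search_area):
    """Evaluate every candidate position and return (best_sad, dx, dy).

    dx and dy are the column and row offsets of the best candidate's
    top-left corner inside the (square) search area.
    """
    n = len(current_block)                 # macroblock size, 16 here
    s = len(search_area)                   # search area size, 32 here
    best = None
    for dy in range(s - n + 1):            # 17 vertical positions
        for dx in range(s - n + 1):        # 17 horizontal positions
            candidate = [row[dx:dx + n] for row in search_area[dy:dy + n]]
            cost = sad(current_block, candidate)
            if best is None or cost < best[0]:
                best = (cost, dx, dy)
    return best


# Hypothetical test data: a deterministic 32x32 pattern of 8-bit values.
search_area = [[(7 * x + 13 * y + x * y) % 256 for x in range(32)]
               for y in range(32)]
# Cut the current macroblock out of the search area at offset (5, 9).
current_block = [row[5:5 + 16] for row in search_area[9:9 + 16]]

result = full_search(current_block, search_area)
print(result)
```

Since the macroblock was cut out of the search area at column offset 5 and row offset 9, the search recovers exactly that offset with a SAD of zero. In hardware, the inner SAD computations are the part that can be distributed over parallel processor elements.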
1.2.3 2D logarithmie search
2D logarithmic search is very similar to binary search and tests only a limited number of 
candidates. In the first step, the [-p, p] search rectangle is divided into two areas: one inside a 
[-p/2, p/2] rectangle (at integer pixel locations) and one outside it. Furthermore, instead of 
searching the whole area, the block matching criterion is computed at nine locations: at 
(0, 0) and at the eight major points on the perimeter of the area. That is, if the distance 
between these points is $d_1$, we compute the criterion at (0, 0), (0, $d_1$), (0, $-d_1$), ($-d_1$, 0), 
($d_1$, 0), ($d_1$, $d_1$), ($d_1$, $-d_1$), ($-d_1$, $d_1$) and ($-d_1$, $-d_1$). The distance $d_1$ is given by 
$d_1 = 2^{k-1}$, where $k = \lceil \log_2 p \rceil$. For example, for p = 7, k = 3 and $d_1$ = 4 pixels. Using 
the best match location as the starting point, we then look for the best match among the eight 
perimeter points at distance $d_2$, which is $d_1/2$. We continue this process until the k-th search, 
where the eight perimeter search locations are spaced by one point. After these eight locations 
have been examined, we determine the location that yields the smallest criterion.
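The nine-point procedure described above can be sketched as follows. This is only an illustrative Python sketch under stated assumptions: `cost(di, dj)` is a hypothetical callback that returns the block matching criterion for a displacement, p is assumed to be at least 2, and candidates outside [-p, p] are discarded.

```python
import math


def first_step(p):
    """Initial spacing d1 = 2**(k-1) with k = ceil(log2(p)); assumes p >= 2."""
    k = math.ceil(math.log2(p))
    return 2 ** (k - 1)


def log_search(cost, p):
    """Nine-point search whose spacing is halved after every step.
    `cost` is a hypothetical callback: cost(di, dj) -> matching criterion."""
    d = first_step(p)
    ci, cj = 0, 0
    while d >= 1:
        # Centre plus the eight perimeter points at distance d.
        candidates = [(ci + a * d, cj + b * d)
                      for a in (-1, 0, 1) for b in (-1, 0, 1)]
        # Keep the search inside the [-p, p] window.
        candidates = [(i, j) for i, j in candidates
                      if abs(i) <= p and abs(j) <= p]
        ci, cj = min(candidates, key=lambda v: cost(*v))
        d //= 2
    return ci, cj


# Toy criterion with a single minimum at displacement (5, -3).
toy_cost = lambda di, dj: abs(di - 5) + abs(dj + 3)
vector = log_search(toy_cost, p=7)
```

For p = 7 the spacings are 4, 2 and 1 pixels, and for this well-behaved toy criterion the search ends at (5, -3). A criterion with several local minima could lead the same procedure to a different, suboptimal vector.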
As shown in Fig. 1.3, during the first iteration a total of five candidates are tested. 
The candidates are centered around the current block location in a diamond shape. The 
step size for the first iteration is set equal to half the search range. For the second iteration, 
the centre of the diamond is shifted to the best matching candidate. The step size is 
reduced by half only if the best candidate happens to be the centre of the diamond. If the 
best candidate is not the diamond centre, the same step size is used for the second iteration. 
In this case, some of the diamond candidates have already been evaluated during the first 
iteration. Hence, there is no need to repeat the block matching calculation for these candidates 
during the second iteration; the results from the first iteration can be reused. The 
process continues until the step size becomes equal to one pixel. For this iteration all eight 
surrounding candidates are evaluated. The best matching candidate from this iteration is 
selected for the current block. The number of evaluated candidates is variable for the 2D 
logarithmic search. However, the worst-case and best-case numbers of candidates can be 
calculated. For I = 720, J = 480, F = 30 and a search area of 32x32, logarithmic search requires 
about one GOPS. The complexity of logarithmic search is only 3.3 percent of the complexity of 
full search [1].
Fig. 1.3 2D Logarithmic search (legend: first, second, third, and final iteration candidates)
1.2.4 Three step search
In a three-step search (TSS) algorithm, the first iteration evaluates nine candidates as 
shown in Fig. 1.4. The candidates are centered around the current block's position. The 
step size for the first iteration is typically set to half the search range. During the next 
iteration, the search centre is shifted to the best matching candidate from the first 
iteration, and the step size is reduced by half. The same process continues until the step 
size becomes equal to one pixel. This is the last iteration of the three-step search 
algorithm. The best matching candidate from this iteration is selected as the final 
candidate. The motion vector corresponding to this candidate is selected for the current 
block. The number of candidates evaluated during a three-step search is much smaller than 
in the full search algorithm. The number of evaluated candidates is fixed and depends on 
the step size set during the first iteration. For example, the computational complexity 
associated with 25 search locations is 777.6 MOPS (million operations per second) [1].
Fig. 1.4 Three step search (legend: first, second, and final iteration candidates)
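The three-step search can be sketched in Python as shown here. The sketch is illustrative only: `cost(di, dj)` is a hypothetical callback returning the matching criterion, and the function also counts how many candidates it evaluates so that the fixed total can be checked.

```python
def three_step_search(cost, first_step):
    """Three-step search starting at (0, 0).
    `cost` is a hypothetical callback: cost(di, dj) -> matching criterion.
    Returns the final displacement and the number of evaluated candidates."""
    ci, cj = 0, 0
    best = cost(ci, cj)
    evaluated = 1
    step = first_step
    while step >= 1:
        best_pos = (ci, cj)
        for a in (-1, 0, 1):
            for b in (-1, 0, 1):
                if a == 0 and b == 0:
                    continue  # the centre was evaluated in the previous step
                cand = (ci + a * step, cj + b * step)
                c = cost(*cand)
                evaluated += 1
                if c < best:
                    best, best_pos = c, cand
        ci, cj = best_pos  # shift the search centre to the best candidate
        step //= 2         # halve the step size
    return (ci, cj), evaluated


# Toy criterion with a single minimum at displacement (5, -3).
toy_cost = lambda di, dj: abs(di - 5) + abs(dj + 3)
vector, count = three_step_search(toy_cost, first_step=4)
```

With a first step of 4 pixels the steps are 4, 2 and 1, and the sketch evaluates 9 + 8 + 8 = 25 candidates regardless of the data, which agrees with the 25 search locations quoted above.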
1.2.5 Parallel Hierarchical One-Dimensional Search (PHODS)
Unlike the logarithmic search, the search in this strategy is done independently 
along the two dimensions. The search algorithm is as follows:
1. For a [-p, p] search region, let $S = 2^{\lfloor \log_2 p \rfloor}$ and set the origin of the search space at 
search location (0,0). Denote the origin as (di, dj).
2. In parallel, compute the
a. i-axis local minimum: Among the three locations (di-S,dj), (di,dj), 
(di+S,dj), find the location that yields the smallest MAD. Set di to the i 
coordinate of this location.
b. j-axis local minimum: Among the three locations (di,dj-S), (di,dj), 
(di,dj+S), find the location that yields the smallest MAD. Set dj to the j 
coordinate of this location.
3. Set S = S/2 (rounded down).
4. Repeat step 2 until S = 0.
The final (di,dj) is the motion vector that yields the best match 
for the macroblock in the current picture. For the case of p = 7, we need to examine 13 
search locations, which for frames at 720 x 480 resolution and 30 frames/s corresponds to 
404.35 MOPS. Parallel Hierarchical One-Dimensional Search has two distinct 
advantages over TSS: (1) the MAD calculations are parallelizable, and (2) it has regular 
data flow, since the search locations are always along the i-axis and the j-axis.
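The PHODS steps listed above can be sketched in Python as follows. This is a sequential software sketch, so the two axis searches that would run in parallel in hardware are simply evaluated one after the other from the same centre. The callback `cost(di, dj)` is hypothetical, and the initial step is assumed to be the largest power of two not exceeding p.

```python
def phods(cost, p):
    """Parallel Hierarchical One-Dimensional Search (sequential sketch).
    `cost` is a hypothetical callback: cost(di, dj) -> MAD of a candidate."""
    s = 1 << (p.bit_length() - 1)  # largest power of two <= p (assumed start)
    di, dj = 0, 0
    while s > 0:
        # Both axis searches start from the same centre (di, dj).
        i_candidates = [(di - s, dj), (di, dj), (di + s, dj)]
        j_candidates = [(di, dj - s), (di, dj), (di, dj + s)]
        new_di = min(i_candidates, key=lambda v: cost(*v))[0]
        new_dj = min(j_candidates, key=lambda v: cost(*v))[1]
        di, dj = new_di, new_dj
        s //= 2  # 4, 2, 1, 0 for p = 7
    return di, dj


# Toy criterion with a single minimum at displacement (5, -3).
toy_cost = lambda di, dj: abs(di - 5) + abs(dj + 3)
vector = phods(toy_cost, p=7)
```

For p = 7 the step sizes are 4, 2 and 1, so the largest reachable displacement along each axis is 7 pixels.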
Both the logarithmic and the PHODS methods belong to the class of fast algorithms that 
reduce motion estimation complexity by reducing the number of search locations that are 
used in determining the minimum MAD. For p = 7, compared to the full search method, the 
complexity is reduced from 6.99 GOPS to 404.35 MOPS [1]. Fast algorithms that work 
in a reduced search space assume that MAD(i,j) increases monotonically as the search location 
moves away from the best matched location. Such algorithms perform as well as the full-search 
method if this assumption holds; however, in practice the assumption often fails, 
since not all the search locations are visited and the search for a global minimum may get 
trapped in a local minimum. Moreover, it is easy to parallelize full-search architectures, 
whereas logarithmic algorithms require complex control mechanisms.
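The operation counts quoted in this chapter can be cross-checked with a short calculation. The sketch below assumes three operations per pixel for every search location (a subtraction, an absolute value and an accumulation). This assumption is not stated explicitly in the text; it is inferred because it reproduces the quoted figures. The location counts for full search (225 for p = 7 and 961 = 31 x 31 for the 32x32 search area) are likewise inferred.

```python
def ops_per_second(locations, width=720, height=480, fps=30, ops_per_pixel=3):
    """Operations per second when every pixel of every frame is compared at
    `locations` search positions. ops_per_pixel=3 is an assumption
    (subtract, absolute value, accumulate)."""
    return locations * ops_per_pixel * width * height * fps


tss = ops_per_second(25)                       # three-step search
phods_ops = ops_per_second(13)                 # PHODS, p = 7
full_p7 = ops_per_second((2 * 7 + 1) ** 2)     # full search, p = 7
full_32 = ops_per_second(31 * 31)              # full search, 32x32 search area

print(f"TSS:            {tss / 1e6:.1f} MOPS")
print(f"PHODS:          {phods_ops / 1e6:.2f} MOPS")
print(f"Full (p = 7):   {full_p7 / 1e9:.4f} GOPS")
print(f"Full (32x32):   {full_32 / 1e9:.2f} GOPS")
```

Under these assumptions the calculation gives 777.6 MOPS, 404.35 MOPS and 29.89 GOPS, and about 6.998 GOPS for full search with p = 7, which appears in the text truncated to 6.99 GOPS.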
CHAPTER 2 
BACKGROUND OF MPEG
2.1 Background and overview
Video pictures in today's digital era pose a problem of compression. Uncompressed 
digital video takes up enormous amounts of storage. If you were to record 
digital video to a CD without compression, it could only hold about five minutes, and 
that's without any sound. MPEG standards reduce the amount of data needed to represent 
video while retaining very high picture quality. The Moving Picture 
Experts Group, commonly referred to as simply MPEG, is a working group of the 
International Organization for Standardization (ISO)/International Electrotechnical 
Commission (IEC) charged with the development of video and audio encoding standards. 
Its first meeting was in May of 1988 in Ottawa, Canada. As of late, MPEG has grown to 
include approximately 350 members per meeting from various industries, universities, 
and research institutions. MPEG's official designation is ISO/IEC JTC1/SC29 WG11. 
ISO/IEC JTC1 is Joint Technical Committee 1 of the ISO and the IEC. It deals with all 
matters of information technology. MPEG has standardized the following compression 
formats and ancillary standards:
• MPEG-1: Initial video and audio compression standard. Later used as the standard 
for Video CD, and includes the popular Layer 3 (MP3) audio compression format.
• MPEG-2: Transport, video and audio standards for broadcast-quality television. 
Used for over-the-air digital television (ATSC, DVB and ISDB), digital satellite 
TV services like Dish Network, digital cable television signals, SVCD, and, with 
slight modifications, as the .VOB (Video OBject) files that carry the images on 
DVDs.
• MPEG-3: Originally designed for HDTV, but abandoned when it was realized 
that MPEG-2 (with extensions) was sufficient for HDTV.
• MPEG-4: Expands MPEG-1 to support video/audio "objects", 3D content, low 
bitrate encoding and support for Digital Rights Management.
In addition, the following standards, while not sequential advances to the video 
encoding standard as with MPEG-1 through MPEG-4, are referred to by similar notation:
• MPEG-7: A multimedia content description standard.
• MPEG-21: MPEG describes this standard as a multimedia framework.
MPEG compresses high-data-rate imagery with only a slight effect on picture quality, one 
that is not noticeable to the human eye. The illusion of movement in TV and cinema pictures is 
actually created by showing a sequence of still pictures in quick succession, each picture 
changing a small amount from the one before. We cannot detect the individual pictures; 
our brain 'smoothes' the action out. A dumb analogue TV picture sends every part of 
every picture, but digital MPEG video is much smarter. It looks at two pictures and 
works out how much of the picture is the same in both. Because pictures don't change 
much from one to the next, there is quite a lot of repetition. The parts that are repeated 
don't need to be saved or sent, because they already exist in a previous picture. These 
parts can be thrown out. Digital video also contains components our eyes can't see, so 
these can be thrown out as well. MPEG-2 is a popular coding and decoding standard for 
digital video data. MPEG-2 encoding uses both lossy compression and lossless 
compression. Lossy compression permanently eliminates information from a video based 
on a human perception model. Humans are much better at discerning changes in color 
intensity (luminance information) than changes in color (chrominance information). 
Humans are also much more sensitive to low frequency image components, such as a 
blue sky, than to high frequency image components, such as a plaid shirt. Details which 
humans are likely to miss can be thrown away without affecting the perceived video 
quality. Lossless compression eliminates redundant information while allowing for its 
later reconstruction. Similarities between adjacent video pictures are encoded using 
motion prediction, and all data is Huffman compressed. The amount of lossy and lossless 
compression depends on the video data. Common compression ratios range from 10:1 to 
100:1. Certain sections of video are more complicated than other sections. A scene with 
lots of action and fine detail is much more difficult to encode properly than a slow 
moving scene with large areas of the same color or texture in the picture. MPEG deals 
with this by concentrating its efforts and data use on the complicated parts. This means 
that the video is encoded in the best possible way. In MPEG, video is represented as a 
sequence of pictures, and each picture is treated as a two-dimensional array of pixels 
(pels). The color of each pel consists of three components: Y (luminance), Cb and Cr 
(two chrominance components).
The process can be explained in short as follows. The encoder operates on a 
sequence of pictures. Each picture is made up of pixels arranged in 16x16 arrays known 
as macroblocks. Macroblocks consist of a 2x2 array of blocks (each of which contains an 
8x8 array of pixels). There is a separate series of macroblocks for each color channel, and 
the macroblocks for a given channel are sometimes downsampled to a 2x1 or 1x1 block 
matrix. The compression in MPEG is achieved largely via motion estimation, which 
detects and eliminates similarities between macroblocks across pictures. Specifically, the 
motion estimator calculates a motion vector that represents the horizontal and vertical 
displacement of a given macroblock (i.e., the one being encoded) from a matching 
macroblock-sized area in a reference picture. The matching macroblock is removed 
(subtracted) from the current picture on a pixel by pixel basis, and a motion vector is 
associated with the macroblock describing its displacement relative to the reference 
picture. The result is a residual predictive-coded (P) picture. It represents the difference 
between the current picture and the reference picture. Reference pictures encoded without 
the use of motion prediction are intra-coded (I) pictures. In addition to forward motion 
prediction, it is possible to encode new pictures using motion estimation from both 
previous and subsequent pictures. Such pictures are bidirectionally predictive-coded (B) 
pictures, and they exploit a greater amount of temporal locality. Each of the I, P, and B 
pictures then undergoes a 2-dimensional discrete cosine transform (DCT) which separates 
the picture into parts with varying visual importance. The input to the DCT is one block. 
The output of the DCT is an 8x8 matrix of frequency coefficients. The upper left corner 
of the matrix represents low frequencies, whereas the lower right corner represents higher 
frequencies. The latter are often small and can be neglected without sacrificing human 
visual perception. The DCT coefficients are quantized to reduce the number of bits 
needed to represent them. Following quantization, many coefficients are effectively 
reduced to zero. The DCT matrix is then run-length encoded by emitting each non-zero 
coefficient, followed by the number of zeros that precede it, along with the number of 
bits needed to represent the coefficient, and its value. The run-length encoder scans the 
DCT matrix in a zig-zag order to consolidate the zeros in the matrix. Finally, the output 
of the run-length encoder, motion vector data, and other information (e.g., type of 
picture), are Huffman coded to further reduce the average number of bits per data item. 
The compressed stream is sent to the output device.
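The zig-zag scan and run-length step can be illustrated with a short Python sketch. It is a simplification under stated assumptions: the function names are hypothetical, each non-zero coefficient is emitted as a (zero run, value) pair, trailing zeros are dropped, and the bit-length field and end-of-block marker used by real coders are omitted.

```python
def zigzag_order(n=8):
    """Visit an n x n matrix along anti-diagonals, alternating direction,
    starting at the low-frequency (upper left) corner."""
    return sorted(((r, c) for r in range(n) for c in range(n)),
                  key=lambda rc: (rc[0] + rc[1],
                                  rc[0] if (rc[0] + rc[1]) % 2 else rc[1]))


def run_length(block):
    """Return (zero_run, value) pairs for the non-zero coefficients,
    where zero_run counts the zeros that precede the coefficient."""
    pairs, run = [], 0
    for r, c in zigzag_order(len(block)):
        value = block[r][c]
        if value == 0:
            run += 1
        else:
            pairs.append((run, value))
            run = 0
    return pairs  # trailing zeros are dropped in this sketch


# Quantized 8x8 block with only three non-zero low-frequency coefficients.
block = [[0] * 8 for _ in range(8)]
block[0][0], block[0][1], block[2][0] = 12, 5, -3
pairs = run_length(block)
```

Because the scan visits (0,0), (0,1), (1,0), (2,0) and so on, the 64 coefficients of this example collapse to three pairs, with a single zero counted before the coefficient at (2,0).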
In order to achieve a high compression ratio, we must use hybrid coding techniques to 
reduce both spatial redundancy and temporal redundancy. In MPEG coding, there are 
two kinds of blocks: the 16x16 (pels) macroblock and the 8x8 (pels) basic block. The 
basic block is used when the DCT is performed and the macroblock is used for motion 
estimation. The encoding of a video stream is done in several steps. Each of the steps 
depicted in Fig. 2.1 is explained below. In Fig. 2.1, DCT stands for 
Discrete Cosine Transform, Q for Quantization, IDCT for Inverse Discrete Cosine 
Transform, and IQ for Inverse Quantization. The FRAMES block stores two frames at a time, 
i.e. the currently encoded frame and a previously encoded frame. These two frames are used 
to estimate the motion that occurred between two consecutive frames of the video sequence.
Fig. 2.1 Video encoder block diagram (blocks: input video, color conversion, subtractor, DCT, IDCT, motion estimator, motion compensator, Huffman/run-length coder; signals: quantized DCT coefficients, motion vectors, video out)
2.1.1 Color conversion
In this step the input color-space is transformed into the YCbCr color-space. 
Furthermore, the chrominance components are subsampled by a factor of two in both the 
horizontal and vertical directions. Thus, a 16x16 block from the video signal results in four 8x8 
luminance blocks, one 8x8 Cb block, and one 8x8 Cr block. These 8x8 blocks are used 
by the DCT. The 16x16 luminance block is used by the motion estimation.
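The block counts given above can be reproduced with a small Python sketch. It assumes that the planes are already in YCbCr form and that chrominance is subsampled by averaging each 2x2 neighbourhood, which is one common choice and not necessarily the one used by a particular encoder. The function names are hypothetical.

```python
def subsample_2x2(plane):
    """Halve a plane in both directions by averaging each 2x2 neighbourhood
    (rounded to the nearest integer). Plane dimensions must be even."""
    h, w = len(plane), len(plane[0])
    return [[(plane[r][c] + plane[r][c + 1]
              + plane[r + 1][c] + plane[r + 1][c + 1] + 2) // 4
             for c in range(0, w, 2)]
            for r in range(0, h, 2)]


def split_blocks(plane, n=8):
    """Cut a plane into n x n blocks, in row-major order."""
    return [[row[c:c + n] for row in plane[r:r + n]]
            for r in range(0, len(plane), n)
            for c in range(0, len(plane[0]), n)]


# One 16x16 macroblock with made-up sample values.
y = [[(r * 16 + c) % 256 for c in range(16)] for r in range(16)]
cb = [[100] * 16 for _ in range(16)]
cr = [[150] * 16 for _ in range(16)]

y_blocks = split_blocks(y)                    # four 8x8 luminance blocks
cb_blocks = split_blocks(subsample_2x2(cb))   # one 8x8 Cb block
cr_blocks = split_blocks(subsample_2x2(cr))   # one 8x8 Cr block
```

The six resulting 8x8 blocks are what the DCT stage receives, while the untouched 16x16 luminance array `y` is what the motion estimator works on.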
2.1.2 Motion estimator
Motion estimation is the process of determining motion vectors that describe the 
transformation from one 2D image to another, usually from adjacent frames in a video 
sequence. It is an ill-posed problem, as the motion is in three dimensions but the images 
are a projection of the 3D scene onto a 2D plane. The motion vectors may relate to the 
whole image (global motion estimation) or to specific parts, such as rectangular blocks, 
arbitrarily shaped patches or even individual pixels. The motion vectors may be represented by a 
translational model or by many other models that can approximate the motion of a real video 
camera, such as rotation and translation in all three dimensions and zoom. In motion 
estimation an exact 1:1 correspondence of pixel positions is not a requirement. Applying 
the motion vectors to an image to synthesize the transformation to the next image is 
called motion compensation. The combination of motion estimation and motion 
compensation is a key part of video compression as used by MPEG-1, MPEG-2 and MPEG-4 as 
well as many other video codecs.
2.1.3 Motion compensator
One method used by various video formats to reduce file size is motion 
compensation. For many frames of a movie, the only difference between one frame and 
another is the result of either the camera moving or an object in the frame moving. In 
reference to a video file, this means much of the information that represents one frame 
will be the same as the information used in the next frame. Motion compensation takes 
advantage of this to provide a way to create frames of a movie from a reference frame. 
For example, in principle, if a movie is shot at 24 frames per second, motion 
compensation would allow the movie file to store the full information for every fourth 
frame. The only information stored for the frames in between would be the information 
needed to transform the previous frame into the next frame. If a frame of information is 
one MB in size, then uncompressed, one second of this film would be 24 MB in size. 
Using motion compensation, the file size for one second of the film could be reduced to a 
little over 6 MB. More formally, in video compression, motion compensation is a 
technique for describing a picture in terms of the transformation of a reference picture to 
the current picture. The reference picture may be previous in time or even from the 
future. When images can be accurately synthesized from previously transmitted/stored 
images, the compression efficiency can be improved. In MPEG, images are predicted 
from previous frames (P frames) or bidirectionally from previous and future frames (B 
frames). B frames are not so popular because the image sequence must be 
transmitted/stored out of order so that the future frame is available to generate the B 
frames. After predicting frames using motion compensation, the coder finds the error 
(residual), which is then compressed using the DCT and transmitted.
In block motion compensation (BMC), the frames are partitioned into blocks of pixels 
(e.g. macroblocks of 16x16 pixels in MPEG). Each block is predicted from a block of 
equal size in the reference frame. The blocks are not transformed in any way apart from 
being shifted to the position of the predicted block. This shift is represented by a motion 
vector. To exploit the redundancy between neighboring block vectors (e.g. for a single 
moving object covered by multiple blocks), it is common to encode only the difference 
between the current and previous motion vector in the bit-stream. The result of this 
differencing process is mathematically equivalent to global motion compensation capable 
of panning. Further down the encoding pipeline, an entropy coder will take advantage of 
the resulting statistical distribution of the motion vectors around the zero vector to reduce 
the output size. It is possible to shift a block by a non-integer number of pixels, which is 
called sub-pixel precision. The in-between pixels are generated by interpolating 
neighboring pixels. Commonly, half-pixel or quarter-pixel precision is used. The 
computational expense of sub-pixel precision is much higher due to the extra processing 
required for interpolation and, on the encoder side, a much greater number of potential 
source blocks to be evaluated.
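A minimal Python sketch of integer-pixel block motion compensation is given here for illustration. The function names are hypothetical, motion vectors are (row, column) displacements, and sub-pixel interpolation is not covered.

```python
def predict_block(ref, top, left, mv, n):
    """Fetch the n x n source block that the motion vector points to.
    (top, left) is the block position in the current frame and
    mv = (di, dj) is its displacement into the reference frame."""
    di, dj = mv
    return [row[left + dj:left + dj + n]
            for row in ref[top + di:top + di + n]]


def residual(cur_block, pred_block):
    """Pixel-by-pixel difference that is passed on to the DCT."""
    return [[c - p for c, p in zip(cur_row, pred_row)]
            for cur_row, pred_row in zip(cur_block, pred_block)]


def mv_differences(vectors):
    """Encode each motion vector as the difference from the previous one."""
    prev, out = (0, 0), []
    for v in vectors:
        out.append((v[0] - prev[0], v[1] - prev[1]))
        prev = v
    return out


# Toy 8x8 reference frame; the current 4x4 block at (2, 2) equals the
# reference content at (3, 1) plus a brightness change of +1.
ref = [[(r * 8 + c) * 3 for c in range(8)] for r in range(8)]
cur = [[v + 1 for v in row[1:5]] for row in ref[3:7]]

pred = predict_block(ref, top=2, left=2, mv=(1, -1), n=4)
res = residual(cur, pred)
diffs = mv_differences([(1, -1), (1, -1), (2, -1)])
```

Here the residual is a block of ones, which is far cheaper to code than the original pixel values, and the repeated motion vector of a neighboring block becomes the zero difference (0, 0).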
The main disadvantage of block motion compensation is that it introduces 
discontinuities at the block borders (blocking artifacts). These artifacts appear in the form 
of sharp horizontal and vertical edges which are easily spotted by the human eye and 
produce ringing effects (large coefficients in high frequency sub-bands) in the Fourier-related 
transform used for transform coding of the residual frames.
Block motion compensation divides up the current frame into non-overlapping 
blocks, and the motion compensation vector tells where those blocks come from (a 
common misconception is that the previous frame is divided up into non-overlapping 
blocks, and the motion compensation vectors tell where those blocks move to). The 
source blocks typically overlap in the source frame. Some video compression algorithms 
assemble the current frame out of pieces of several different previously-transmitted 
frames. Frames can also be predicted from future frames. The future frames then need to 
be encoded before the predicted frames and thus, the encoding order does not necessarily 
match the real frame order. Such frames are usually predicted from two directions, i.e. 
from the I- or P-frames that immediately precede or follow the predicted frame. These 
bidirectionally predicted frames are called B-frames. A coding scheme could, for 
instance, be IBBPBBPBBPBB.
2.1.4 Discrete Cosine Transform (DCT) and Inverse Discrete Cosine Transform (IDCT)
DCT is a lossy compression scheme where an N x N image block is transformed from 
the spatial domain to the DCT domain. DCT decomposes the signal into spatial frequency 
components called DCT coefficients [23]. The lower frequency DCT coefficients appear 
toward the upper left-hand corner of the DCT matrix, and the higher frequency 
coefficients are in the lower right-hand corner of the DCT matrix. The Human Visual 
System (HVS) is less sensitive to errors in high frequency coefficients than it is to errors in 
lower frequency coefficients. Because of this, the higher frequency components can be more 
coarsely quantized, as done by the quantization matrix. Each value in the quantization 
matrix is pre-scaled by multiplying by a single value, known as the quantizer scale code. 
This value can range from one to 112 and is modifiable on a macroblock basis. 
Quantization is accomplished by dividing each DCT coefficient by an integer scale factor 
and rounding the result. This sets the higher frequency coefficients (in the lower right 
corner), which are less significant to the compressed picture, to zero by quantizing in larger 
steps. The low frequency coefficients (in the upper left corner) are more significant to 
the compressed picture and are quantized in smaller steps. The goal of quantization is to 
force as many of the DCT coefficients to zero, or near zero, as possible within the 
boundaries of the prescribed bit-rate and video quality parameters. Thus, since 
quantization throws away some information, it is a lossy compression scheme.
The data compressed at the transmitter needs to be decompressed at the receiver. 
IDCT is used to decompress DCT compressed data in the decoder. DCT and IDCT are 
two of the most computation intensive functions in compression. Therefore, a fast and 
optimized DCT/IDCT implementation is essential in improving the performance of the 
video coder and decoder.
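To make the transform concrete, a direct (unoptimized) two-dimensional DCT can be written in Python as shown here. The sketch uses the orthonormal DCT-II definition, which is an assumption about the scaling convention, and evaluates the double sum literally. Practical coders use fast factorizations, which is exactly why an optimized implementation matters.

```python
import math


def dct_2d(block):
    """Direct N x N two-dimensional DCT-II with orthonormal scaling."""
    n = len(block)

    def scale(k):
        return math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)

    out = [[0.0] * n for _ in range(n)]
    for u in range(n):
        for v in range(n):
            total = 0.0
            for i in range(n):
                for j in range(n):
                    total += (block[i][j]
                              * math.cos((2 * i + 1) * u * math.pi / (2 * n))
                              * math.cos((2 * j + 1) * v * math.pi / (2 * n)))
            out[u][v] = scale(u) * scale(v) * total
    return out


# A perfectly flat 8x8 block has no spatial variation at all.
flat = [[10] * 8 for _ in range(8)]
coeffs = dct_2d(flat)
```

For the flat block all of the energy ends up in the upper left (DC) coefficient, which equals 8 x 10 = 80 with this scaling, and every other coefficient is zero up to rounding error.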
2.1.5 Quantization and Inverse Quantization
Quantization is done to achieve better compression. Quantization reduces the number 
of bits needed to store information by reducing the size of the integers representing the 
information in the scene. These are details that the human visual system ignores. This 
step represents one key segment in the multi-compression process. A reduction in the 
number of bits reduces the storage capacity needed, improves bandwidth, and lowers 
implementation costs. Quantization is the process of selectively discarding visual 
information without a significant loss in the visual effect. Quantization reduces the 
number of bits needed to store an integer value by reducing the precision of the integer. 
Each discrete cosine transform (DCT) component is divided by a separate quantization 
coefficient, and rounded to the nearest integer. The larger the quantization coefficient 
(i.e., coefficient weighting), the smaller the resulting answer and the fewer the bits needed to 
express the DCT component. In the reverse process, the fractional bits that were rounded away 
are recovered as zeros, constituting a precision loss from the original number. 
Quantization can be considered as input data binning where the number of bins is less 
than the number of possible input values. The number of bins is decided by the 
quantization factor Q. If the input data range is from one to 60, and if Q is 5, then 60/5 is 
12 bins (1 to 5, 6 to 10, and so on). An input data range of 60 values is thus reduced to 
12 possible bins [24]. The DCT-coded coefficients are fed 
into the quantizer. The quantized coefficients are taken through an inverse quantizer to 
recover an approximation of the original DCT coefficients. Since quantizing is a lossy process 
where certain DCT coefficients are thrown away, the inverse quantization will not give back all 
of the original 64 DCT coefficients. The non-recovered coefficients have the least visual effect 
on the picture.
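The divide-and-round operation and its inverse can be illustrated with a few lines of Python. The coefficient and quantization values below are made up for the example, and the 2x2 matrices merely stand in for the 8x8 case.

```python
def quantize(coeffs, qmatrix):
    """Divide each coefficient by its quantization value and round.
    Note: Python's round() rounds exact halves to the nearest even integer."""
    return [[round(c / q) for c, q in zip(c_row, q_row)]
            for c_row, q_row in zip(coeffs, qmatrix)]


def dequantize(levels, qmatrix):
    """Inverse quantization: multiply each level by its quantization value."""
    return [[level * q for level, q in zip(l_row, q_row)]
            for l_row, q_row in zip(levels, qmatrix)]


# Made-up values: larger quantization steps toward the high frequencies.
coeffs = [[80.0, 23.0], [-7.0, 3.0]]
qmatrix = [[8, 10], [10, 16]]

levels = quantize(coeffs, qmatrix)
restored = dequantize(levels, qmatrix)

# Binning example from the text: inputs 1..60 with Q = 5.
bins = {(value - 1) // 5 for value in range(1, 61)}
```

The coefficient 3.0 is quantized to level 0 and cannot be recovered, while 23.0 comes back as 20, which shows the precision loss described above. The binning example yields exactly 12 distinct bins.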
2.1.6 Huffman Coding
Frequently occurring symbols are assigned short code words, whereas rarely occurring 
symbols are assigned long code words. The resulting code string can be uniquely decoded 
to get the original output of the run-length encoder. The code assignment procedure 
developed by Huffman is used to get the optimum code word assignment for a set of 
input symbols. The procedure for Huffman coding involves the pairing of symbols. The 
input symbols are written out in order of decreasing probability. The symbol with the 
highest probability is written at the top, and the one with the lowest probability is written last. 
The two lowest probabilities are then paired and added. A new probability list is then formed 
with one entry being the previously added pair. The two least probable entries in the new list 
are then paired. This process is continued until the list consists of only one probability value. 
The values "0" and "1" are arbitrarily assigned to each element in each of the lists. Fig. 2.2 
shows the following symbols listed with their probabilities of occurrence: A is 30%, B 
is 25%, C is 20%, D is 15%, and E is 10% [25].
Fig. 2.2 Illustration of Huffman coding (symbols A = 30, B = 25, C = 20, D = 15 and E = 10 are combined into F = 25, G = 45 and H = 55; resulting codes: A = 00, B = 01, C = 11, D = 100, E = 101)
Steps in Huffman coding
1. Adding the two least probable symbols gives 25%. The new symbol is F.
2. Adding the two least probable symbols gives 45%. The new symbol is G.
3. Adding the two least probable symbols gives 55%. The new symbol is H.
4. Write "0" and "1" on each branch of the summation arrows. These binary values 
are called branch binaries.
5. For each letter in each column, copy the binary numbers from the column on the 
right, starting from the rightmost column (i.e., in column three, G gets the value 
"1" from the G in column four). For summation branches, append the binary from 
the right-hand side column to the left of each branch binary. For A and B in 
column three, append "0" from H in column four to the left of the branch binaries. 
This makes A "00" and B "01".
Completing step 5 gives the binary values for each letter: A is "00", B is "01", C is 
"11", D is "100", and E is "101". The input with the highest probability is represented by 
a code word of length two, whereas the input with the lowest probability is represented by a 
code word of length three.
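The pairing procedure can also be sketched in Python using a priority queue. This is an illustrative sketch and not the column-based construction of Fig. 2.2: the function name is hypothetical, and since the assignment of "0" and "1" to branches is arbitrary, the bit patterns it produces may differ from those in the figure while the code word lengths are the same.

```python
import heapq
import itertools


def huffman_codes(probabilities):
    """Build a Huffman code by repeatedly merging the two least probable
    entries. Returns a dict mapping each symbol to its bit string."""
    tie_breaker = itertools.count()  # keeps heap entries comparable
    heap = [(p, next(tie_breaker), {symbol: ""})
            for symbol, p in probabilities.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        p0, _, codes0 = heapq.heappop(heap)
        p1, _, codes1 = heapq.heappop(heap)
        # Prefix one branch with "0" and the other with "1".
        merged = {s: "0" + code for s, code in codes0.items()}
        merged.update({s: "1" + code for s, code in codes1.items()})
        heapq.heappush(heap, (p0 + p1, next(tie_breaker), merged))
    return heap[0][2]


# Probabilities (in percent) from the example in the text.
codes = huffman_codes({"A": 30, "B": 25, "C": 20, "D": 15, "E": 10})
lengths = {symbol: len(code) for symbol, code in codes.items()}
average_bits = sum(p * lengths[s] for s, p in
                   {"A": 30, "B": 25, "C": 20, "D": 15, "E": 10}.items()) / 100
```

For these probabilities the sketch assigns two-bit code words to A, B and C and three-bit code words to D and E, for an average of 2.25 bits per symbol, compared with three bits for a fixed-length code over five symbols.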
CHAPTER 3
IMPLEMENTATION OF MOTION ESTIMATION HARDWARE ACCELERATOR
3.1 Introduction
The main purpose of this project was to build a lab prototype of a motion estimation 
hardware accelerator which can be easily attached to a general purpose RISC processor. 
A variety of implementations have been described by research teams all around the world, 
but none provide any modules or code for the implementation and testing of a motion 
estimator module in hardware. This was the very reason which prompted me to write 
Verilog HDL code for the motion estimator hardware accelerator. The HDL approach 
facilitates reconfigurability and modifiability. The entire code has been written in Verilog 
HDL. During the course of the project a number of problems were faced and tackled. 
The control circuitry had to be redone a couple of times to optimize control and 
output. Various algorithms were studied, and the performance descriptions provided in [1] 
gave a clear idea about the advantages and disadvantages of each algorithm. The literature 
survey gave a fair idea of the recent research being carried out in this field [16, 17, 18, 
19, 20, and 21]. Though the logarithmic and other more sophisticated estimation algorithms 
assist in achieving faster computations, the complex control associated with them and the 
probability of error led to the choice of the Full Search Block Matching Algorithm 
as the candidate for implementation. Advantages such as parallelizable structures and 
ease of implementation, described in [1, 2, 15, and 18], make the Full Search Block Matching 
Algorithm an ideal algorithm to implement when it comes to FPGA-based 
systems. These very features have been fully utilized in the implementation, which is an 
extension of the proposed SAD motion estimation architecture explained in [4, 5].
The hardware accelerator consists of the SAD module, Current frame control, Reference 
frame control and the frame storages. The output is the set of motion vectors, which describe 
where in the 2-dimensional search area the best match is found. The SAD module forms the core 
of the whole system. The block diagram in Fig. 3.1 gives an idea of the different 
components, and the next section explains the working of each module in detail.
3.2 Block diagram description
Fig. 3.1 Motion estimation block diagram (blocks: current frame storage, current frame control, SAD, state machine, reference frame control, reference frame storage; output: MVs)
3.2.1 Reference frame storage
The current macroblocks are 16x16 blocks which contain the current frame 
information and have to be compared with the reference macroblocks which are already 
stored. For this project, MATLAB is used to segregate the macroblocks, but I will 
explain the hardware approach here to give an idea. The Frame Grabber board [22] which we 
have has a Video RAM installed on it which stores frames of moving pictures and has a 
huge FIFO structure to store each pixel one by one, as explained in [22]. The FIFO 
read pointer initially points to the location where the luminance (Y) information of the 
first pixel of the current frame is stored. The chrominance components (Cb and Cr) of 
the pixel are stored in the 1st and 2nd locations. The Frame Grabber can be programmed to 
make the read pointer go to the user-desired location, which facilitates the segregation of the 
32x32 search area of the reference frame as well as the 16x16 macroblock of the current 
frame. Moreover, for the motion estimation module only luminance information is used to 
compute the motion vectors, so the Frame Grabber can be programmed to read and 
send just the luminance pixels of the required 256 bytes. The increment pointer makes it even 
easier to hop from pixel 16 to pixel 257 to achieve the sliding macroblock effect. For my 
implementation I have made use of the IrfanView software to first segregate the individual 
frames from the movie [8]. Further, I have written MATLAB code to segregate each 
individual current frame into 16x16 macroblocks, which can be directly fed into Block 
RAMs in the FPGA, and also to segregate the 32x32 search area of the reference frame. 
This is just for functional simulation purposes. Once IrfanView splits the movie into individual 
frames, the MATLAB code segregates the individual macroblocks and 
search areas into specific text files which are read by the HDL testbench. The 
testbench operates in a sequential fashion, so this takes up time during simulation. In 
real-time operation the Video RAM buffer has to be used and data has to be read from it. 
For this an on-chip or off-chip SDRAM can be used. Normally, all the FPGA vendors 
sell SDRAM controller HDL code; since such a controller is very complex, it is out of the 
scope of this project.
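The segregation step can be illustrated with a Python sketch (Python is used here only for illustration; the project itself used MATLAB for this step). The function names are hypothetical, and the sketch assumes that the 32x32 search area is centered on the 16x16 macroblock and is shifted inward at the frame borders. The border handling of the original code is not described in the text.

```python
def macroblocks(frame, n=16):
    """Return (top, left, block) for every n x n macroblock of a luminance
    frame whose dimensions are multiples of n."""
    return [(top, left, [row[left:left + n] for row in frame[top:top + n]])
            for top in range(0, len(frame), n)
            for left in range(0, len(frame[0]), n)]


def search_area(ref, top, left, n=16, size=32):
    """Return (r0, c0, area): the size x size window of the reference frame
    around the macroblock at (top, left). The window is assumed to be
    centered on the block and clamped to the frame (an assumption)."""
    margin = (size - n) // 2
    r0 = min(max(top - margin, 0), len(ref) - size)
    c0 = min(max(left - margin, 0), len(ref[0]) - size)
    return r0, c0, [row[c0:c0 + size] for row in ref[r0:r0 + size]]


# Made-up 48x64 luminance frames with a distinct value per position.
current = [[(r * 64 + c) % 256 for c in range(64)] for r in range(48)]
reference = [[(r * 64 + c + 1) % 256 for c in range(64)] for r in range(48)]

blocks = macroblocks(current)
r0, c0, area = search_area(reference, top=16, left=16)
```

Each macroblock holds 256 luminance bytes and each search area 1024 bytes, which are the quantities that would be written to the text files read by the testbench.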
3.2.2 Current frame storage
Current frame storage is done in a similar way to the reference frame storage. 
The only difference is that segregation has to be done for 16x16 macroblocks. It requires less 
memory and is faster, as each time only 256 bytes of luminance pixels have to be read. As 
explained earlier, the testbench method helps in testing the functional simulation but is not 
helpful for speed-up due to its sequential operation. This can be avoided by concurrent 
operation, which can be achieved with an on-chip or off-chip memory. If two 
different RAM modules are used with the read cycles properly synchronized by an FSM, 
then while one port is being used to process a macroblock, the other port can be used to 
read the second 16x16 macroblock and store it in the second Current Frame FIFO 
explained in the next section. A similar scheme can be applied to the Reference Frame Sliding 
Window Controller explained in later sections. This will help in eliminating wait 
times for the SAD module in reading the required data.
To simulate this, on-chip memory is used, which in the case of FPGAs is the Block RAMs
(BRAMs). The BRAMs can be instantiated with the CORE Generator feature built into
Xilinx ISE. CORE Generator is a graphical interactive design tool that enables us to
create high-level modules such as memory elements, math functions, and communications
and I/O interface cores. We can customize and pre-optimize the modules to take advantage
of the inherent architectural features of the Xilinx FPGA architectures, such as Fast
Carry Logic, SRL16s, and distributed and block RAM [9].
The BRAM is instantiated by supplying the text file produced by MATLAB as a
Coefficients (.coe) file [10]. The BRAM is initialized with the values in the .coe
file. The coefficients file must follow certain syntactical rules; otherwise the CORE
Generator outputs an error message. After the BRAM is instantiated, a my_ram.mif
(Memory Initialization Format) file is generated which contains the values we fed in as
the Coefficients file. Only COE files may be used as inputs to cores for the purpose of
specifying initialization values for memory cores and for specifying coefficient
values. MIF files can only be generated as output files for use in HDL behavioral
simulations. They cannot be used to specify initial values when a core is generated.
MIF files will always be written out for memory cores (in binary format only), based on
values specified in any input COE files (or default values, as the case may be).
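As an illustration of the syntactical rules mentioned above, the sketch below builds the
text of a COE file for a memory core from a list of luminance bytes. It relies on the two
keywords used by memory COE files, `memory_initialization_radix` and
`memory_initialization_vector`. The function name `make_coe` and the one-value-per-line
layout are choices made here for illustration and are not taken from this project.

```python
def make_coe(values, radix=16):
    """Return the text of a memory COE file holding 8-bit values."""
    formats = {2: "{:08b}", 10: "{:d}", 16: "{:02X}"}
    if radix not in formats:
        raise ValueError("radix must be 2, 10 or 16")
    if any(not 0 <= v <= 255 for v in values):
        raise ValueError("values must fit in one byte")
    # Entries are separated by commas; the vector is terminated by a semicolon.
    body = ",\n".join(formats[radix].format(v) for v in values)
    return (
        f"memory_initialization_radix={radix};\n"
        f"memory_initialization_vector=\n{body};\n"
    )


coe_text = make_coe([0, 255, 16])
```

A missing terminating semicolon or a value that does not match the declared radix is the
kind of mistake that would be expected to trigger the error message mentioned above.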
3.2.3 Reference frame control
This module is a sliding window controller which sweeps across the 32x32 search
area, i.e. 1024 pixels. Fig. 3.2 gives a clear picture of the sliding window movement.
The macroblock at each position shown in the figure is latched and fed to the SAD module
for further computation. Because the sliding window covers the entire search area, the
probability of finding the best match increases.
[Figure: four panels, each showing the macroblock (MB) at a different position inside the
search area of the reference frame]
Fig. 3.2 Sliding motion of the reference macroblock within the search area
The hardware for this is shown in Fig. 3.3. The FIFOs and D flip-flops are connected in
such a way that a new set of 16x16, i.e. 256, values is available at every clock cycle.
These values are fed into the SAD module all at once. Each value in the FIFOs and the D
flip-flops is one byte long, accounting only for the luminance information of the pixels
in the image. Filling this structure takes up most of the critical time. The desired
speed-up can be achieved by using two memories and two such structures, properly
synchronized: while one structure is active, the other can start acquiring the next set
of macroblock values, and a control circuit for this can be set up easily.
[Figure: 16 rows, each a chain of D flip-flops (DFF) followed by a 16-deep FIFO; the DFFs
are numbered 1 ... 15, 16 in the first row, 33 ... 47, 48 in the second row, and so on up
to 481 ... 495, 496 in the last row; the data enters at INPUT]
Fig. 3.3 Sliding window architecture
There is an irregularity involved in the architecture shown in Fig. 3.3. Some of the
window contents are redundant, namely when the window reaches the rightmost part of
the search area. During this time interval, the window holds values which do not define
any actual macroblock in the search area. The values form an irregular macroblock made
up of values from the rightmost part of one row and the leftmost part of the next row of
the search area. This is handled by a control signal, during which the SAD module does
not capture any values. So only the valid values which define a proper macroblock in the
search area are captured and used for finding the best match. This avoids any incorrect
results. The structure captures only the 289 valid candidate macroblocks for
computation. The module thus selects all the macroblocks in the search area without any
complex control circuitry or coding techniques. The SAD module takes almost 38 clock
cycles, as will be discussed in the next section. That sets the frequency of the
reference control sliding window operation, which we call the reference clk. This
frequency can be increased by pipelining, an approach which has also been adopted here
and which will be explained in later sections. By breaking down the combinational logic
and inserting registers, the critical path can be shortened enough to give roughly a
3 times increase in the operating frequency of the Reference Frame controller sliding
window module. With minor extra overhead, faster operation can be achieved.
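The behaviour of the sliding window and its valid-window control can be modelled in
software. The following Python sketch is a behavioural model under stated assumptions,
not the Verilog implemented for this project: the DFF/FIFO chain of Fig. 3.3 is flattened
into a single 496-element delay line (15 full search-area rows plus 16 pixels), pixels are
assumed to arrive in raster order, and the names `sliding_windows`, `SEARCH` and `MB` are
hypothetical.

```python
from collections import deque

SEARCH = 32   # search area is SEARCH x SEARCH pixels
MB = 16       # macroblock is MB x MB pixels


def sliding_windows(search_area):
    """Yield (row, col, window) for every valid 16x16 candidate position."""
    depth = (MB - 1) * SEARCH + MB            # 496 storage elements
    line = deque(maxlen=depth)                # models the DFF/FIFO chain
    pixels = (p for row in search_area for p in row)
    for idx, pixel in enumerate(pixels):
        line.append(pixel)                    # one pixel enters per reference clock
        r, c = divmod(idx, SEARCH)            # position of the newest pixel
        # Models the control signal: windows that wrap around a row end are skipped.
        valid = r >= MB - 1 and c >= MB - 1
        if valid:
            window = [[line[i * SEARCH + j] for j in range(MB)] for i in range(MB)]
            yield r - (MB - 1), c - (MB - 1), window


# Each pixel value equals its raster index, which makes the taps easy to check.
area = [[r * SEARCH + c for c in range(SEARCH)] for r in range(SEARCH)]
candidates = list(sliding_windows(area))
```

Feeding all 1024 pixels of a 32x32 search area yields 17 x 17 = 289 windows, which
matches the number of valid candidate macroblocks stated above.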
3.2.4 Current Frame Control
This module is an 8-byte Shift Register (SR) which shifts in the 256 luminance values of
the current macroblock. As soon as all 256 values have been clocked into the shift
register, they are latched into the 256-in, 256-out structure, which then feeds them
concurrently into the input of the SAD block. The synchronization is a bit complex, but
not as complex as in the case of logarithmic algorithms. In this case, the coefficient
values stored in the current macroblock text files are read in the testbench. The text
file emulates a RAM which outputs the values of consecutive RAM locations one by one. So
each luminance value from the current macroblock text file is written into the shift
register, and all 256 values are latched at one time. The latched contents are held
until the whole search area has been swept and the final motion vectors are obtained. If
a second such structure is used, with the same memory feeding SR#2 through a multiplexer
while SR#1 is held for the SAD operation, the wait for SR#2 to fill up can be avoided.
While SR#1 is busy finding the best match, the memory can fill SR#2, which is then ready
with the next macroblock for the second SAD operation. For now, only the operation of a
single structure is simulated.
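The shift-then-latch behaviour of a single structure can be sketched as a small software
model. This is an illustrative Python model only; the class name `ShiftLatch` and its
method names are hypothetical, and a single instance is modelled, as in the simulation
described above.

```python
class ShiftLatch:
    """Behavioural model: shift in `size` bytes, then latch them all at once."""

    def __init__(self, size=256):
        self.size = size
        self.shift = []        # values clocked in so far for the next macroblock
        self.latched = None    # values currently presented to the SAD block

    def clock(self, value):
        """Shift in one luminance byte; return True on the cycle the latch updates."""
        self.shift.append(value & 0xFF)
        if len(self.shift) < self.size:
            return False
        self.latched = tuple(self.shift)   # all values are latched in one step
        self.shift = []                    # ready to receive the next macroblock
        return True


sr = ShiftLatch()
updates = [sr.clock(v % 256) for v in range(300)]
```

The latched values stay unchanged while further values are shifted in, which is the
property the SAD block relies on during the sweep of the search area.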
3.2.5 SAD module
This section describes the SAD operation and the possible parallel implementations
as proposed and implemented in [4, 5]. Though the theoretical documentation was
available, no Verilog code was available. The Sum of Absolute Differences considers
all data units Ai and Bi to be unsigned 8-bit numbers. The general algorithm for
computing the Sum of Absolute Differences of two blocks is depicted in Equation (1). This
section first describes the 16x1 SAD operation and then explains the extension to 16x16
SAD. A direct approach to computing the SAD consists of the following steps:
• Compute (Ai - Bi) for all 16x16 pixels in the two blocks A and B.
• Determine which (Ai - Bi) are negative and produce (Bi - Ai) in that case as the
absolute value; otherwise produce (Ai - Bi).
• Accumulate all 16x16 absolute values.
By determining the smaller of the two operands and subtracting it from a constant, it
becomes possible to eliminate the absolute value operations. This subtraction is a
trivial operation if the constant is chosen correctly. The smaller of the two operands
is determined by inverting one of the operands and computing the carry-out which would
arise from the addition of both operands. The smaller operand is inverted, which means
that its value changes to (2^8 - 1 - X) = (255 - X). Both the inverted smaller value and
the larger value are passed to the adder tree, which corrects for this constant. The
above two steps can be carried out in parallel for 16 pels. The result is 32 8-bit
values, to which the following steps are applied. A correction term is added to account
for the (2^8 - 1)'s introduced by inverting the smaller values. If the number of pels on
which the unit operates is a power of 2, the correction term is equal to that number, as
the sum of the 2^8 terms adds up to one "simple eliminatable bit". If the number of pels
the unit operates on is not a power of two, we also have to account for the additional
per pel. The rows passed to the adder tree, together with the correction term, make up 33
rows, which are reduced to 2 rows using a Wallace tree carry-save adder scheme as
proposed in [11, 12, 13]. In the final step, a full summation of the two remaining rows
is performed. The total sum of all constants, which has to be discarded, is the carry-out
of this addition.
To summarize, the first step is performed by computing [A' + B], where A' stands for
inverted A. If no carry is generated, B is not greater than A and thus B should be
inverted. Otherwise, A should be inverted. Besides passing the operands to an adder
tree, an additional correction term must be added to counter the effects of using
inverted values. The adder tree reduces the addition terms to two terms, which are then
passed to an adder. For precise mathematical details of the approach, we refer to
[3, 4].
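The role of the correction term can be checked with a short derivation. This is a worked
restatement of the argument above for 16 pels, written with max and min for readability;
it is not reproduced from [3, 4].

```latex
\begin{align*}
\max(A_i, B_i) + \bigl(2^8 - 1 - \min(A_i, B_i)\bigr) &= |A_i - B_i| + 2^8 - 1 \\
\sum_{i=1}^{16} \Bigl[\max(A_i, B_i) + \bigl(2^8 - 1 - \min(A_i, B_i)\bigr)\Bigr] + 16
  &= \mathrm{SAD} + 16 \cdot 2^8 = \mathrm{SAD} + 2^{12}
\end{align*}
```

Since the SAD of 16 pels is at most 16 x 255 = 4080, which is less than 2^12, the 2^12
term only sets the bit above the 12-bit result, and that bit can be discarded.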
In the previous section, the significance of motion estimation in video coding was
mentioned. An important metric used in motion estimation is the sum of absolute
differences (SAD). The absolute difference operation can be implemented in several
ways: serially, per column in parallel, per row in parallel, and fully in parallel. The
implementation described in [5] focuses on the SAD16 operation, which performs the SAD
on one row of a macroblock (16x1). All the input values are 8-bit unsigned binary
numbers. By iteration or parallel execution of the SAD16 operation, the complete SAD
operation for the 16x16 macroblock can be performed. First, the steps necessary to
perform the 16x1 SAD operation are described in more detail:
• Determine the smaller of the two operands: As suggested in [3, 4], it is only
necessary to determine whether (A' + B) produces a carry or not.
• Invert the smaller operand: If no carry was produced then B must be inverted;
otherwise, A must be inverted. This is done using an EXOR operation.
• Pass both operands to an adder tree: After inverting either A or B, the operands
must be passed to an adder tree. Thus, the values (A', B) or (A, B') are passed
further.
• Add a correction term to the adder tree: An additional correction term must also
be added to the adder tree, which is 16 in this case, i.e. adding 1 for each of the 16
blocks.
• Reduce the 33 addition terms to 2: All 33 addition terms must be reduced to 2
terms before the final addition can be applied. This can be done using an 8-stage
carry-save adder tree using 243 carry-save adders.
• Add the remaining two terms using an adder: The final two addition terms are
added using an 8-bit carry-lookahead adder for the most significant bits. The result
is a 13-bit unsigned binary number. However, as stated in [4, 5], the most
significant bit of this result can be disregarded, resulting in a final 12-bit unsigned
binary number.
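The steps listed above can be followed in a small software model. The Python sketch
below is an arithmetic illustration only, assuming 16 unsigned bytes per row; the function
name `sad16` is hypothetical, and a plain sum stands in for the carry-save adder tree and
the final adder, so no hardware structure or timing is modelled.

```python
def sad16(a_row, b_row):
    """Row SAD via the invert-the-smaller-operand scheme; inputs are 16 bytes each."""
    if len(a_row) != 16 or len(b_row) != 16:
        raise ValueError("each row must hold 16 values")
    terms = []
    for a, b in zip(a_row, b_row):
        carry = ((a ^ 0xFF) + b) >> 8      # carry-out of A' + B; 1 only when B > A
        if carry:
            terms += [a ^ 0xFF, b]         # A is smaller: pass (A', B)
        else:
            terms += [a, b ^ 0xFF]         # B is not greater than A: pass (A, B')
    terms.append(16)                       # correction term, giving 33 terms in all
    total = sum(terms)                     # stands in for the reduction tree and adder
    return total & 0xFFF                   # 13-bit result with its top bit disregarded


a_row = list(range(0, 160, 10))            # 0, 10, ..., 150
b_row = [75] * 16
row_sad = sad16(a_row, b_row)
```

The value returned equals the directly computed sum of absolute differences for the
same rows, which is a convenient check when comparing against an HDL simulation.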
In Fig. 3.4, the first three steps are depicted. Whether the addition (A' + B)
generates a carry is determined without actually calculating the sum. Instead, only
those parts of a carry-lookahead adder that calculate the carry are used. The resulting
carry and inverted carry are fed to two EXORs that will invert the correct term.
[Figure: block labels: Invert, Carry generator, A out, B out]
Fig. 3.4 Architecture to find the lower among A and B
The inversion of either the As or the Bs for all 16 absolute operations can be carried
out in parallel, and the results can be fed to an adder tree at the same time [4]. Fig.
3.5 depicts the complete SAD16 operation that has been implemented in [5]. Besides the
parallel execution of the first three steps, the figure also depicts the addition of
the correction term of 16, the 33 to 2 reduction tree, and the final 2 to 1 reduction.
The implementation is synchronous and fully pipelineable.
[Figure: 16 A/B input pairs feeding a 33 -> 2 reduction followed by a 2 -> 1 reduction]
Fig. 3.5 16x1 SAD architecture
The 16x16 SAD operation shown in Fig. 3.6 is the implementation carried out in
Verilog for this project. The results have been compared with the implementation of [5].
As in [5], for parallel operation of the 16x16 SAD, there is only one additional 32 to 2
reduction tree (see Fig. 3.6) compared to the SAD 16x1 unit depicted in Fig. 3.5.
This reduction tree is of similar complexity to the 33 to 2 one. For the SAD module of
[16] the output is obtained in 27 clock cycles. The first 33 to 2 module requires 8 clock
cycles, the second 32 to 2 module takes another 8 clock cycles, and the final 2 to 1
reduction tree takes another 9 clock cycles.
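As a reference for checking the 16x16 result, the block SAD can be written as the sum of
sixteen row SADs. The Python sketch below is a functional reference model only; the
function name `sad16x16` is hypothetical, the row SADs are computed directly with `abs`,
and the reduction trees and their clock-cycle latencies are not modelled.

```python
def sad16x16(block_a, block_b):
    """SAD of two 16x16 blocks given as 16 rows of 16 unsigned bytes."""
    row_sads = [
        sum(abs(a - b) for a, b in zip(row_a, row_b))   # one 16x1 SAD per row
        for row_a, row_b in zip(block_a, block_b)
    ]
    total = sum(row_sads)            # combining the row results into one value
    # Worst case is 256 * 255 = 65280, so the result always fits in 16 bits.
    if total >= 1 << 16:
        raise ValueError("inputs are not 8-bit values")
    return total


black = [[0] * 16 for _ in range(16)]
white = [[255] * 16 for _ in range(16)]
worst_case = sad16x16(black, white)
```

Such a model can be run on the same macroblock and search-area text files as the
testbench to cross-check the simulated SAD values.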
[Figure: a column of 16x1 SAD units, each with 16 A/B input pairs and a 33 -> 2 reduction,
feeding a shared 32 -> 2 reduction followed by a 2 -> 1 reduction]
Fig. 3.6 16x16 SAD architecture
The 16x16 SAD operation implemented here forms the critical path of the whole
design, and it determines the rate at which the reference frame controller feeds pels to
the SAD module. To extend the above module and obtain the motion vectors as the final
output, a comparator module, a motion vector decoder module and some safety margin are
added. The final output in this case is obtained after 38 clock cycles. The reference
control sliding window circuitry is therefore clocked at f/38 for proper operation.
With the pipelining approach this frequency is increased by approximately 3 times:
the combinational logic is broken down into modules operating at a higher frequency, so
the final frequency at which the Reference sliding window controller operates is f/14.
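The comparator and motion vector decoder mentioned above can be illustrated with a short
software model. This Python sketch is not the implemented Verilog: the function name
`best_match` is hypothetical, and both the tie-breaking rule (the first candidate with the
lowest SAD wins) and the motion vector convention (offset from the centre candidate
position (8, 8) of the 17 x 17 grid) are assumptions made for illustration.

```python
def best_match(candidate_sads):
    """Pick the lowest SAD from ((row, col), sad) pairs.

    Returns (sad, (mv_y, mv_x)), or None if no candidates were given.
    """
    best = None
    for (row, col), sad in candidate_sads:
        if best is None or sad < best[0]:        # strict '<' keeps the first of equal SADs
            best = (sad, (row - 8, col - 8))     # assumed: offset from the centre position
    return best


result = best_match([((0, 0), 50), ((8, 8), 10), ((16, 16), 10)])
```

In this example the candidate at the centre position wins, giving a zero motion vector.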
3.2.6 State machine
The state machine controls the address provided to the BRAMs and the data input to
the motion estimation module. It is a fairly simple state machine which uses the
one-hot state encoding approach. Fig. 3.7 shows the state machine which controls the
current frame.
There are two state machines running concurrently. One controls the reference frame
and the other controls the current frame. The states are as follows.
State 1 = START: This state initializes the state machine.
State 2 = ADDR_INIT: The addresses of the BRAMs are initialized to all 0s.
State 3 = EN_RAM: The BRAMs are enabled and data is read.
State 4 = STAY: In this state, another straightforward internal state machine (explained
in the section on reference frame control) is activated, which controls the flow of
valid MBs with the help of the 'SAD control' signal.
State 5 = NEW_ADDR: In this state, after the SAD operation over the whole reference search
area is completed, a new start address is fed and operation begins again from the EN_RAM
state.
The same state machine is used for the current frame, except that the BRAM with the
current MBs is enabled for 256 addresses only, and the BRAM with the reference frame
search area for 1008 addresses, to compute all valid macroblocks.
[Figure: state diagram; transition labels: rst, !rst, !addr_rdy, addr_rdy, addr < 1008,
addr == 1008, !sad_end, sad_end]
Fig. 3.7 State Machine
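The state sequence can be sketched as a next-state function. The Python model below is a
reading of the state descriptions and the transition labels of Fig. 3.7, not the Verilog
written for this project. The mapping of each label to a particular transition, the
one-hot bit assignment, and the names `next_state` and `last_addr` are assumptions.

```python
# One-hot codes: exactly one bit is set per state (bit positions are assumed).
START, ADDR_INIT, EN_RAM, STAY, NEW_ADDR = (1 << i for i in range(5))


def next_state(state, rst, addr_rdy, addr, sad_end, last_addr=1008):
    """Return the next state for the given inputs (assumed transition mapping)."""
    if rst:
        return START                                       # reset returns to START
    if state == START:
        return ADDR_INIT
    if state == ADDR_INIT:
        return EN_RAM if addr_rdy else ADDR_INIT           # wait for the address
    if state == EN_RAM:
        return STAY if addr == last_addr else EN_RAM       # read until the last address
    if state == STAY:
        return NEW_ADDR if sad_end else STAY               # wait for the SAD sweep to end
    if state == NEW_ADDR:
        return EN_RAM if addr_rdy else NEW_ADDR            # restart from EN_RAM
    raise ValueError("not a valid one-hot state")


# Walk once around the loop with inputs chosen to advance on every step.
trace = [START]
for addr in (0, 0, 1008, 1008, 0):
    trace.append(next_state(trace[-1], False, True, addr, True))
```

For the current frame the same function would be used with `last_addr=256`, following
the address counts given above.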
3.2.7 Pipelining approach to increase the frequency of operation
The pipelining approach is explained as follows with an example.
• Consider combinational logic between two registers as shown in Fig. 3.8 below.
[Figure: COMBINATIONAL LOGIC block with a delay of 40ns, clocked by CLK]
Fig. 3.8 Logic with combinational logic delay
The frequency of operation depends on the combinational logic path
delay, the setup time and the clock-to-output delay of the flip-flops. Let us
consider only the combinational logic path delay for the time being. If the delay is
40ns, then our clock frequency becomes 25 MHz. Now consider the pipelining
approach.
• As shown in Fig. 3.9, the combinational logic can be broken down into blocks
with smaller delays.
[Figure: two combinational logic blocks with delays of 22ns and 18ns separated by a
register (REG), clocked by CLK]
Fig. 3.9 Logic with reduced combinational logic delay
Now we can insert an intermediate register after each smaller delay block and
increase the frequency of operation. With pipelining, the above 40ns combinational
logic delay is divided into two combinational logic blocks with delays of 22ns and 18ns,
as shown in Fig. 3.9. The larger of the two delays, 22ns, sets the clock period, so 45 MHz
becomes our maximum operating frequency. Thus the frequency increases 1.8 times, which
gives a considerable speedup in the whole operation.
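The arithmetic of this example can be reproduced in a few lines. The Python sketch below
follows the simplification made above, in which only the combinational delays are
considered and setup and clock-to-output times are ignored; the function name `fmax_mhz`
is hypothetical.

```python
def fmax_mhz(stage_delays_ns):
    """Maximum clock frequency in MHz when the slowest stage sets the clock period."""
    if not stage_delays_ns or min(stage_delays_ns) <= 0:
        raise ValueError("need at least one positive stage delay")
    return 1000.0 / max(stage_delays_ns)   # period in ns -> frequency in MHz


before = fmax_mhz([40])        # single 40ns block
after = fmax_mhz([22, 18])     # split into 22ns and 18ns stages
speedup = after / before
```

The figures match the text: the unpipelined block gives 25 MHz, the 22ns stage limits the
pipelined version to about 45 MHz, and the ratio is about 1.8.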
CHAPTER 4 
RESULTS
The architecture uses a sliding window controller which sweeps the reference image
to find the best macroblock match. There are 289 valid candidate blocks to compare with
the current macroblock. A control signal determines which candidate blocks are valid
and for which ones the SAD value should be registered. For test purposes an image of
80x32 pixels is used. However, the architecture is generic, so the module is compatible
with any image size provided the macroblocks and search area blocks are segregated and
stored in the memory. The 80x32 test image was selected for ease of debugging and
because it required less simulator memory, since the outputs involve a lot of values at
each instant of time. The time taken to produce the output grows as the image grows
larger. This module is best suited to images with smaller pixel dimensions. For example,
Fig. 4.1a shows a 16x16 macroblock and Fig. 4.1b shows the 32x32 search area in which the
current macroblock will be searched (images enlarged from normal).
[Figure: (a) current macroblock, (b) reference search area]
Fig. 4.1 (a) and (b) Illustrations of current macroblock and reference search area
respectively
Fig. 4.2 shows the simulation results for the non-pipelined version of the operation. If
the main clock is 50 MHz, the derived clock of our reference frame circuit is the main
clock divided by 38, which is 1.3 MHz. According to the simulation, the total time taken
to compute the best match for one macroblock is 778.64 us. The final best match motion
vectors for a full search of one current macroblock are thus available at a rate of
1.3 kHz. So the computation of the best match for one macroblock takes almost 1000
reference clock cycles. This time is considerably reduced in the pipelined version.
FPGAs which can operate at very high frequencies such as 130 MHz or 200 MHz also help
speed up the operation.
[Figure: simulation waveform, time axis from 100 to 900; signals include the current and
reference inputs, clk, refclk, end_sim, fifo_en, mem, refmem1 to refmem5, i, best_sad and
the motion vector outputs]
Fig. 4.2 Simulation results for motion estimation without pipelining
Fig. 4.3 shows the pipelined approach for the same test image. Here, for a main clock of
50 MHz, the derived clock is the main clock divided by 14, i.e. the reference clock is
3.5 MHz. The output is thus available after 284.48 us, i.e. at a rate of 3.5 kHz. Thus,
for the roughly 1000 reference clock cycles per MB, the time taken to get the final motion
vectors for one macroblock is reduced considerably. The time is reduced by 491.52 us per
current macroblock best match computation.
[Figure: simulation waveform, time axis from 100 to 700; signals include the current and
reference inputs, clk, refclk, end_sim, fifo_en, mem, refmem1 to refmem5, i, best_sad and
the motion vector outputs]
Fig. 4.3 Simulation results for motion estimation with pipelining
From the above results, we arrive at the following equation, which tells us how
many frames per second are supported by this architecture:
$$X = \frac{f_{max}}{(\text{ref clock cycles/MB}) \times N \times \text{Total MBs}} \qquad \text{(Eq. 4.1)}$$
where
X = frames per second which can be supported for the given fmax
fmax = maximum clock frequency
N = clock division factor; fmax divided by N gives the reference clock frequency at
which the reference sliding window controller operates.
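The frame rate relation can be evaluated directly. The Python sketch below divides fmax
by the product of the reference clock cycles per MB, N and the total number of MBs, as
the definitions above imply. The function name `frames_per_second` is hypothetical, and
the figure of 1000 reference clock cycles per MB is the rounded value quoted in the
simulation results, used here as an assumption.

```python
def frames_per_second(fmax_hz, n_div, cycles_per_mb, total_mbs):
    """Frames per second supported for a given clock, divider and frame size."""
    return fmax_hz / (cycles_per_mb * n_div * total_mbs)


# CIF frame (396 macroblocks) on a 50 MHz clock, about 1000 reference cycles per MB.
cif_non_pipelined = frames_per_second(50e6, 38, 1000, 396)
cif_pipelined = frames_per_second(50e6, 14, 1000, 396)
```

With these inputs the relation gives roughly 3 fps without pipelining and roughly 9 fps
with pipelining for CIF at 50 MHz.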
Based on the above equation, some projected results are tabulated below.
These results target the Common Intermediate Format (CIF). CIF is a format used to
standardize the horizontal and vertical resolutions in pixels of YCbCr sequences in video
signals, commonly used in video teleconferencing systems. It was first proposed in the
H.261 standard. CIF was designed to be easy to convert to the PAL or NTSC standards.
QCIF means "Quarter CIF". To have one fourth of the area, as "quarter" implies, the
height and width of the frame are halved. Other terms in use are SQCIF (Sub Quarter CIF),
4CIF (4x CIF) and 16CIF (16x CIF). SIF (Source Input Format) is practically identical to
CIF, but taken from MPEG-1 rather than ITU standards. The SIF resolution used here is
352 x 240. Projected results for some of these formats are tabulated in Table 4.1.
Table 4.1 Results describing the frames per second supported by the architecture

Format  Video       No. of  FPGA family/Clock     Supports fps
        Resolution  MBs     (fmax)                Non-Pipelined  Pipelined
SQCIF   128x96      48      Spartan-3/50MHz       25 fps         60 fps
                            Spartan-3 DSP/130MHz  60 fps         60 fps
                            Virtex-4/225MHz       60 fps         60 fps
QCIF    176x144     99      Spartan-3/50MHz       15 fps         30 fps
                            Spartan-3 DSP/130MHz  30 fps         60 fps
                            Virtex-4/225MHz       60 fps         60 fps
CIF     352x288     396     Spartan-3/50MHz       3 fps          9 fps
                            Spartan-3 DSP/130MHz  8 fps          25 fps
                            Virtex-4/225MHz       15 fps         30 fps
SIF     352x240     330     Spartan-3/50MHz       4 fps          10 fps
                            Spartan-3 DSP/130MHz  10 fps         25 fps
                            Virtex-4/225MHz       15 fps         50 fps
CHAPTER 5
CONCLUSION AND FUTURE RECOMMENDATIONS
Motion estimation in MPEG video is a temporal prediction technique. The basic
principle of motion estimation is that in most cases, consecutive video frames will be
similar except for changes induced by objects moving within the frames. Motion
estimation performs a comprehensive 2-dimensional spatial search for each luminance
macroblock. MPEG does not define how this search should be performed. This is a detail
that the system designer can choose to implement in one of many possible ways. The
motion estimation hardware accelerator based on a Full Search Block Matching
Algorithm has been implemented in Verilog HDL. The state machines for reference SAD
control and reference BRAM control can be merged for simplicity of coding. In the
described implementation, the code was written hierarchically, which is why the state
machines are split and presented in that fashion. The reconfigurable nature of FPGAs
makes it easier to implement, modify and re-test the core. Testing this core with a
general-purpose RISC processor such as Xilinx's MicroBlaze would make it easier to
segregate the macroblocks and would help in achieving the projected timelines. The core
can be instantiated in the MicroBlaze, and pixels can be fetched at a faster rate using
the Fast Simplex Link interface. The generic nature of the module defines a 32x32 search
area for each current macroblock, and hence any size of image can be used for testing
purposes.
BIBLIOGRAPHY
1. V. Bhaskaran and K. Konstantinides. Image and Video Compression Standards:
Algorithms and Architectures. Kluwer Acad. Publish., 2nd edition, June 1997.
2. Tiago Dias, Nuno Carlos André Sebastião, Nuno Roma, Paulo Flores, Leonel Sousa,
Programmable IP core for motion estimation: comparison of FPGA and ASIC based
implementations. In IV Jornadas sobre Sistemas Reconfiguraveis - REC2008, pages
109-116, February 2008.
3. S. Vassiliadis, E.A. Hakkennes, J.S.S.M. Wong, G.G. Pechanek, "The Sum-Absolute-
Difference Motion Estimation Accelerator," euromicro, p. 20559, 24th
EUROMICRO Conference Volume 2 (EUROMICRO'98), 1998.
4. S. Vassiliadis, E. Hakkennes, S. Wong, and G. Pechanek. The Sum-Absolute-
Difference Motion Estimation Accelerator. In Proceedings of the 24th Euromicro
Conference, 2000.
5. Wong, S.; Vassiliadis, S.; Cotofana, S., "A sum of absolute differences
implementation in FPGA hardware," Euromicro Conference, 2002. Proceedings. 28th,
pp. 183-188, 2002.
6. J. Hauser and J. Wawrzynek. Garp: A MIPS Processor with a Reconfigurable
Coprocessor. In Proceedings of the IEEE Symposium of Field-Programmable Custom
Computing Machines, pages 24-33, April 1997.
7. D. Cronquist, P. Franklin, C. Fisher, M. Figueroa, and C. Ebeling. Architecture
Design of Reconfigurable Pipelined Datapaths. In Proceedings of the 20th
Anniversary Conference on Advanced Research in VLSI, pages 23-40, March 1999.
8. http://www.irfanview.net/
9. http://download.xilinx.com/direct/ise9_tutorials/ise9tut.pdf
10. http://www.xilinx.com/support/documentation/application_notes/xapp463.pdf
11. C. Wallace. A suggestion for parallel multipliers. IEEE Trans. Electron. Comput.,
EC-13:14-17, 1964.
12. L. Dadda. Some schemes for parallel multipliers. Alta Frequenza, 34:349-356, May
1965.
13. Vassiliadis, S.; Hoekstra, J.; Chiu, H.-T., "Array multiplication scheme using (p, 2)
counters and pre-addition," Electronics Letters, vol. 31, no. 8, pp. 619-620, 13 Apr
1995.
14. "Motion estimation techniques in video processing". Electronic Engineering Times
India, August 2007.
15. Joaquin Olivares; Ignacio Benavides; Javier Hormigo; Julio Villalba; Emilio Zapata,
"Fast Full-Search Block Matching Algorithm Motion Estimation Alternatives in
FPGA," Field Programmable Logic and Applications, 2006.
16. R. Srinivasan and K.R. Rao, "Predictive Coding based on Efficient Motion
Estimation," IEEE Trans. Commun., vol. COM-33, no. 8, 1985, pp. 888-896.
17. S. Kappagantula and K.R. Rao, "Motion Compensated Interframe Image Prediction,"
IEEE Trans. Commun., vol. COM-33, no. 9, 1985, pp. 1011-1015.
18. L.G. Chen, W.T. Chen, Y.S. Jehng, and T.D. Chiueh, "An Efficient Parallel Motion
Estimation Algorithm for Digital Image Processing," IEEE Trans. Circuits Syst.
Video Technol., vol. 1, no. 4, 1991, pp. 378-385.
19. M.J. Chen, L.G. Chen, and T.D. Chiueh, "One-dimensional Full Search Motion
Estimation Algorithm for Video Coding," IEEE Trans. Circuits Syst. Video Technol.,
vol. 4, no. 5, 1994, pp. 504-509.
20. M. Brünig and W. Niehsen, "Fast Full-search Block Matching," IEEE Trans. Circuits
Syst. Video Technol., vol. 11, no. 2, 2001, pp. 241-247.
21. R. Li, B. Zeng, and M.L. Liou, "A New Three-step Search Algorithm for Block
Motion Estimation," IEEE Trans. Circuits Syst. Video Technol., vol. 4, no. 4, pp.
438-442, Aug. 1994.
22. http://www.digitalcreationlabs.com/support.htm
23. http://www.xilinx.com/support/documentation/application_notes/xapp610.pdf
24. http://www.xilinx.com/support/documentation/application_notes/xapp615.pdf
25. http://www.xilinx.com/support/documentation/application_notes/xapp616.pdf
VITA
Graduate College
University of Nevada Las Vegas
Nachiket Jugade
Address:
1555 E Rochelle Ave Apt 147
Las Vegas, NV 89119
Degree:
• Bachelor of Engineering, Electronics and Telecom Engineering, 2006
University of Pune, India
Professional Experience:
• Teaching Assistant, University of Nevada Las Vegas, Aug 2006-May 2008
• Engineering Intern, Aldec Inc. Las Vegas, Jun 2007-Aug 2007
Special Honors and Awards:
• Magna Cum Laude, Member of National Scholars' Honor Society
• Bally Technologies' Graduate Scholarship Recipient, 2007-2008
• Member of Tau Beta Pi, The Engineering Honor Society
• Vice President, Indian Student Association (ISA) at UNLV, 2007-2008
Thesis Title:
Implementation of BMA Based Motion Estimation Hardware Accelerator in HDL
Thesis Examination Committee:
Chairperson, Dr. Henry Selvaraj, Ph.D.
Committee Member, Dr. Emma Regentova, Ph.D.
Committee Member, Dr. Muthukumar Venkatesan, Ph.D.
Graduate College Representative, Dr. Laxmi Gewali, Ph.D.
