Software and hardware techniques for accelerating MPEG2 motion estimation by Shahrukh Agha (3372908)

SOFTWARE AND HARDWARE 
TECHNIQUES FOR 
ACCELERATING MPEG2 MOTION 
ESTIMATION 
by 
Shahrukh Agha 
October 2006 
Acknowledgements 
To my uncle Dr. S. S. Durrani, my mother Mrs Zahida Durrani, my grandmother
Mrs Mahboob Sultan and my family, who have always encouraged me and supported
me financially, and to my supervisors Dr Vincent M. Dwyer and Dr Vassilios
Chouliaras, who have always been very kind and helpful.
CONTENTS 
PREFACE 
Abstract 
Chapter 1 Multimedia Transmission and Storage 
1.1 Transmission and Storage of Video Images 
1.2 MPEG Standardization 
1.3 Structure of Thesis 
1.4 An Introduction to Thesis 
Chapter 2 MPEG1 & MPEG2
2.1 Introduction
2.1.1 MPEG1
2.1.2 MPEG2
2.2 MPEG2 Video Encoding 
2.2.1 Algorithmic Representation of MPEG2 Encoder 
2.3 Temporal Prediction 
2.3.1 Motion Estimation and Motion Compensation 
2.3.2 Motion Vectors 
2.4 Picture Types 
2.5 Coding Interlaced Video 
2.6 Scalable Extensions 
2.7 Other Enhancements in MPEG2 
2.8 Summary 
Chapter 3 Motion Estimation Algorithm 
3.1 Introduction 
3.2 ME Algorithms 
3.3 Limiting the Search Path within a Reduced Search Area 
3.4 Fast Methods 
3.5 Summary and Conclusions 
Chapter 4 Vector Datapath for Real-Time MPEG2 Encoding 
4.1 Introduction 
4.2 Configurable and Reconfigurable Architectures 
4.3 Data Level Parallelization 
4.4 Vector ISA and Programmers Model 
4.5 DLP Microarchitecture 
4.6 Vector Datapath Hard Macro 
4.7 Thread Level Parallelism 
4.8 Parallel Simulation Methodology 
4.9 Results and Discussion 
4.10 Conclusions 
Chapter 5 Theoretical Analysis of DLP and TLP 
5.1 Theoretical Analysis 
5.2 Conclusions 
Chapter 6 Reduced Bit Full Search Block Matching 
6.1 Introduction
6.2 Architecture I
6.3 Corrected-RBSAD Algorithms and Possible Hardware Realizations
6.4 The Corrected-RBSAD Algorithm
6.5 The General Form of the Hardware
6.6 Reduced-bit VLSI SAD Engine
6.7 Generalised-SAD Architecture
6.8 Results
6.9 VLSI Implementation
6.10 Conclusions
Chapter 7 Summary 
7.1 Summary 
APPENDIX I 
List Of Publications 
References 
Abstract 
The aim of this thesis is to accelerate the process of motion estimation (ME) 
for the implementation of real time, portable video encoding. To this end a 
number of different techniques have been considered and these have been 
investigated in detail. Data Level Parallelism (DLP) is exploited first, through 
the use of vector instruction extensions using configurable/re-configurable 
processors to form a fast System On Chip (SoC) video encoder capable of 
embedding both full search and fast ME methods. 
Further parallelism is then exploited in the form of Thread Level 
Parallelism (TLP), introduced into the ME process through the use of multiple 
processors incorporated onto a single SoC. A theoretical explanation of the
results, obtained with these methodologies, is then developed for algorithmic 
optimisations. 
This is followed with the investigation of an efficient, orthogonal 
technique based on the use of a reduced number of bits (RBSAD) for the 
purposes of image comparison. This technique, which provides savings of 
both power and time, is investigated along with a number of criteria for its 
improvement to full resolution. Finally a VLSI layout of a low-power ME 
engine, capable of using this technique, is presented. 
The combination of DLP, TLP and RBSAD is found to reduce the clock 
frequency requirement by around an order of magnitude. 
1. Multimedia Transmission and Storage 
1.1 Transmission and Storage of Video Images 
As the world continues to shrink in size, the need to transmit good quality 
video (motion) images from all over the globe, and the variety of purposes for 
which these images are useful, grows accordingly. Currently such purposes 
include satellite TV, internet games, video-on-demand, video-phones, and, 
more recently, Apple's new video iPod, the newer portable media centres, 
telemedicine, and a host of others. 
In our increasingly mobile society portability is 'king'. With the advent 
of easy-to-transport, high-tech products such as cell phones, laptops, personal 
data assistants (PDAs) and portable MP3 players, many of the tasks 
previously restricted to the home desktop personal computer (PC) can now 
also be performed on the move. With portable (mobile) media centres [1], it is 
possible to store, and later access, nearly all of our digital entertainment files 
on a single, lightweight unit about the size of a paperback novel. These units 
are able to handle recorded television programs, movies, home videos and 
music as well as digital photographs. 
Historically this began with cable TV (CATV), which made it possible
to select TV programmes from a large number of channels. This was followed
by the video rental business which, in combination with a video recorder, 
provided customers the opportunity to select movies when they wanted. This 
service became known as video-on-demand. Nowadays, however, Video-on-
Demand (VoD) includes a much wider range of services and opportunities, 
enabling television viewers to select a video program and have it sent to them 
over a network. The program(s) might be stored in the customer's set-top 
box's huge hard drive and the end-user (customer) could then watch it from 
that hard drive (this is most often the case with satellite TV.) Alternatively a 
customer could watch the video directly from the network's head-end (which 
is the network's operating and storage facilities) or equivalent (as is most 
often the case with cable TV). 
Today's technologies allow telecommunication network operators to 
offer a host of other services, such as home shopping, computer games, and 
movies on demand. These services need to be competitively priced, for 
example, to compete with the video rental business, but offer the advantage 
that customers do not need to travel for the services. These possibilities have
been reached on the back of the rapidly developing telecommunications and
electronics industries. One example of this progress is that the capacity of a
hard disk has doubled almost every year at near-constant cost for more than a
decade [2-4].
With much of the multimedia spectrum, such as TV news, or sporting 
and live music events, video clips have to be transmitted and then played in 
more or less real time. Sending video information in real time requires it to be 
sent at a frame rate of around 25-30 frames per second, if a reasonably smooth 
video clip is to be provided [5-13]. With tens of Mbits of information held in
each frame, this represents a very large amount of data to be sent through
transmission channels, all of which have limited bandwidth. Consequently
there is a clear need for some degree of compression of any video data
prior to transmission.
High quality video compression is particularly critical for DVD data.
For example, the storage requirement for every second of uncompressed
CCIR-601 (Consultative Committee for International Radio) serial digital
video, e.g. 720H x 576V x 25Hz with 4:2:2 sampling [5], is approximately
20 Mbytes. This means that a two hour long film, in uncompressed CCIR-601
format, would require approximately 144 Gigabytes (GB) of storage, and
that is before accounting for any accompanying audio. With DVDs currently
capable of storing a maximum of 4.7 Gigabytes of data, compression ratios
of approximately 40:1 are required to fit the video for a feature film,
along with the audio and sub-titles, onto a single sided 4.7 GB DVD disc.
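The storage figures above can be checked with a few lines of arithmetic. The sketch below assumes a 720 x 576 picture at 25 frames per second and an average of 2 bytes per pixel for 8-bit 4:2:2 sampling; the function name and variable names are illustrative only. The text rounds the data rate to 20 Mbytes per second, which gives 144 GB for two hours, whereas the unrounded rate gives a slightly larger figure.

```python
# Worked arithmetic for the uncompressed storage figures quoted in the text.
# Assumption: 8-bit 4:2:2 sampling averages 2 bytes per pixel (one luminance
# sample per pixel plus two chrominance samples shared by each pixel pair).

def uncompressed_bytes_per_second(width, height, frame_rate, bytes_per_pixel=2):
    """Raw video data rate in bytes per second."""
    return width * height * bytes_per_pixel * frame_rate

rate = uncompressed_bytes_per_second(720, 576, 25)  # bytes per second
film_bytes = rate * 2 * 60 * 60                     # two-hour film
dvd_bytes = 4.7e9                                   # single sided DVD capacity

print(f"data rate  : {rate / 1e6:.1f} Mbytes/s")
print(f"2 h film   : {film_bytes / 1e9:.1f} GB")
print(f"video-only ratio to fill the whole disc: {film_bytes / dvd_bytes:.1f}:1")
```

The final ratio is for video alone occupying the entire disc, so it comes out lower than the approximately 40:1 quoted above, which presumably also leaves room for the audio and sub-titles.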
It is common to use prediction techniques to compress video data. For
example, in video sequences a square region of the current frame pixel
matrix s(i,j,k) is sought in the reference frame s(i,j,k-1) (a frame either
succeeding the current frame or preceding it) in an attempt to find a region
which is similar to it. The area searched for such a match is determined by
a parameter p (the half search width), such that (2p+1)^2 is the number of
comparisons made. This process is called prediction. Such a technique, known
as 'motion estimation (ME) and compensation', is used in video coding to
achieve significant compression [5-9]. The main operation of ME relies on a
distance metric, typically the Sum of Absolute Differences (SAD), eqn. (1.1),
which assesses how different various trial regions of the reference frame are
from the region of interest in the current frame. Assuming that each region
is a 16 x 16 block of the current frame pixel matrix, the SAD between current
and reference frame blocks is
$$\mathrm{SAD}(m,n) = \sum_{i,j=1}^{16} \left| s(i,j,k) - s(i+m,\,j+n,\,k-1) \right| \qquad (1.1)$$
which consists of basically three operations: SUB (subtraction), ADD
(addition) and ABS (absolute value). From eqn. (1.1) we can derive the number
of operations per second (Op) used in the calculation of the distance metric
for the full-search motion estimation algorithm [10]:
$$\mathrm{Op} = 3 \cdot 2p \cdot 2p \cdot N_h \cdot N_v \cdot f_r \qquad (1.2)$$
with the horizontal/vertical image size being N_h and N_v in pel (pixel), and
f_r representing the frame rate (frames per second, fps). The memory bandwidth
(only to access every pel for the distance criterion calculation) is given by
$$\mathrm{Mem} = 2 \cdot 2p \cdot 2p \cdot N_h \cdot N_v \cdot f_r \qquad (1.3)$$
For a typical real-time video application using Common Intermediate Format
(CIF), with p=16, f_r=30, N_h=352 and N_v=288 [5-6], [11-13], the above
equations result in a computational load of 9.34 billion integer arithmetic
operations (with 8-bit data) per second, and a memory bandwidth of 6.23
billion 8-bit accesses per second. These numbers do not take into account
either the number of implementation dependent operations, or the memory
accesses for address calculation, result comparison, coding decision and 
control. 
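As a quick check of eqns. (1.2) and (1.3), the following sketch evaluates both expressions for the CIF parameters given above. The function name `full_search_load` is purely illustrative and is not taken from the thesis.

```python
def full_search_load(p, n_h, n_v, f_r):
    """Evaluate eqns. (1.2) and (1.3) for full-search motion estimation.

    p   : half search width in pels
    n_h : horizontal image size in pels
    n_v : vertical image size in pels
    f_r : frame rate in frames per second
    Returns (operations per second, 8-bit memory accesses per second).
    """
    candidates = (2 * p) * (2 * p)          # search positions per pel, as in the text
    pels_per_second = n_h * n_v * f_r
    ops = 3 * candidates * pels_per_second  # SUB, ABS and ADD for each comparison
    mem = 2 * candidates * pels_per_second  # one current and one reference pel read
    return ops, mem

ops, mem = full_search_load(p=16, n_h=352, n_v=288, f_r=30)
print(f"Op  = {ops / 1e9:.2f} billion operations per second")
print(f"Mem = {mem / 1e9:.2f} billion accesses per second")
```

Because both expressions scale with the square of p, halving the search range cuts the load by a factor of four, which is one reason why reduced search areas are attractive.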
Nevertheless it is clear that, as devices become increasingly portable and
mobile, battery life, and hence device power consumption, has become a
crucial issue when performing this computation on the move. The power
consumption of the CMOS memory section of the SoC system [10] is given as:
$$\mathrm{Power}_{memory} = \mathrm{Power}_{static} + \left( f(MB_r) + f(MB_w) \right) \cdot C_o \cdot V_{dd}^2 \qquad (1.4)$$
in which Power_static is the power consumption of the memory without any
data access (standby mode), C_o is the capacitive load, V_dd is the supply
voltage, and f(MB_r) and f(MB_w) are the frequencies associated with the
memory read bandwidth and the memory write bandwidth. The above equation
shows that power consumption contains a term which is proportional to the
number of memory accesses. This means that a large number of memory accesses
can lead to a power requirement which is beyond the power delivering capacity
of today's portable media devices.
It is clear from the above discussion that there are two issues which are 
both vital to the continued advancement in the area of mobile multimedia 
platforms. One is the problem of device power consumption; specially
designed hardware is required to make realizable, low power, battery-driven,
and hence portable, multimedia devices. The other issue is video image
compression; reducing the amount of inessential information is vital for 
efficient transmission of reasonable quality, as well as for storage. The 
exhaustive algorithms used to achieve the compression, although able to 
provide good quality, generally require intense calculations and consequently 
consume too much power for portable devices to be realized. As a result 
alternative methods have been proposed, their aim being to achieve 
compression, maintaining reasonable video quality, but without the burden of 
excessive power consumption [10]. These issues are also the subjects of the 
current thesis. It is, on the one hand, concerned with reducing power 
consumption for the development of low power hand-held video devices and,
on the other, concerned with reducing the complexity of compression 
methods used by such devices to make them usable in real time. 
All methods of data compression involve some form of pre-processing
step, the most common of which is motion estimation/motion
compensation [8]. Video sequences are divided into frames and as stated 
earlier, for reasonable picture quality, these frames must be transmitted and 
re-played at around 30 frames per second [5-13]. Each frame may be divided 
into square blocks, 16 pixels x 16 pixels, forming what is referred to as a 
macroblock (MB). Macroblocks in a current frame, a frame awaiting 
transmission, are matched to blocks in a reference frame, perhaps a frame that 
has already been transmitted and is stored at the end user. If a perfect match
can be made then only the relative displacement (the motion vector) between
the positions of the two macroblocks needs to be sent, compressing the data
from 16 x 16 bytes to perhaps 1 - 2 bytes. Generally there will be no
perfect match for any MB, but there will be a 'best one' (according to some
closeness-of-fit metric), hence an error matrix must also be transmitted.
However by removing from transmission those pixels that do not change
value (colour) between frames it is possible to compress the video data. A full
search of all frames for a best match of all MBs is impossible in real time
even for today's fastest processors. However, processing time can be reduced
by a number of strategies: these include limiting the area searched for the
best match, limiting the search strategy in some manner, limiting the accuracy
of the calculations or of the data, or some combination of these.
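The macroblock matching described above can be sketched as an exhaustive (full) search using the SAD of eqn. (1.1) as the closeness-of-fit metric. This is a minimal illustration in plain Python, not the encoder implementation used in this work; the function names, the frame representation (lists of rows of 8-bit values) and the synthetic test frames are all assumptions made for the example.

```python
import random

def sad(cur, ref, bx, by, dx, dy, n=16):
    """SAD between the n x n block of `cur` at (bx, by) and the block of
    `ref` displaced by the candidate motion vector (dx, dy)."""
    total = 0
    for j in range(n):
        for i in range(n):
            total += abs(cur[by + j][bx + i] - ref[by + dy + j][bx + dx + i])
    return total

def full_search(cur, ref, bx, by, p, n=16):
    """Try every displacement in [-p, p] x [-p, p] that stays inside `ref`
    and return (best_sad, (dx, dy)) for the best match."""
    height, width = len(ref), len(ref[0])
    best = None
    for dy in range(-p, p + 1):
        for dx in range(-p, p + 1):
            x, y = bx + dx, by + dy
            if x < 0 or y < 0 or x + n > width or y + n > height:
                continue  # candidate block would fall outside the reference frame
            cost = sad(cur, ref, bx, by, dx, dy, n)
            if best is None or cost < best[0]:
                best = (cost, (dx, dy))
    return best

# Synthetic example: the current frame is the reference frame shifted by
# 3 pels horizontally and 2 pels vertically (with wrap-around at the edges).
rng = random.Random(0)
H = W = 48
ref = [[rng.randrange(256) for _ in range(W)] for _ in range(H)]
cur = [[ref[(y + 2) % H][(x + 3) % W] for x in range(W)] for y in range(H)]
print(full_search(cur, ref, bx=16, by=16, p=4))
```

For this synthetic shift the search should report a perfect match (zero SAD) at displacement (3, 2), so only the motion vector would need to be sent. With real video the minimum SAD is rarely zero, which is why an error matrix accompanies the vector.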
There are broadly two types of compression methods. One is lossless
compression, such as Huffman coding (which assigns fewer bits to those data
patterns that occur most frequently and vice versa) [9], Arithmetic coding,
the Lempel-Ziv-Welch (LZW) algorithms, and many others. For most types of
data, lossless compression techniques can only reduce the space needed by
about 50% [9],[14],[15]. The other method is lossy compression. Since lossless
compression is not enough to achieve our target of real time implementation,
the compression techniques with which we shall concern ourselves here will
all be lossy. There are different stages through which the video content has to 
be passed in order to compress it [8], [9]. Of these stages, the subsampling
of images prior to encoding, and quantization, are responsible for some loss
of the information. Quantization involves dividing the video content at
different frequencies by different integer constants and then rounding the 
result to the nearest integer [16]. This rounded-off result never gets fully 
reconstructed and hence makes the process lossy. High frequency 
components are divided by larger values as the human eye becomes less 
sensitive to contrast at higher frequencies and hence does not notice the loss, 
provided the bitrate is not too low. As a result, the object of a motion 
estimation method is to ensure that the lossy effects of quantization are 
minimized as much as possible and a large number of algorithms exist for this 
purpose [5-9]. However, in whatever way motion estimation/compensation 
is performed, it must comply with the current standards. 
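The lossy nature of quantization described above can be seen in a small sketch. The coefficient values and step sizes below are invented for illustration, and the scheme is deliberately simplified: the quantizer actually specified by MPEG2 involves quantization matrices and a scale factor that are not reproduced here.

```python
def quantize(coefficients, steps):
    """Divide each coefficient by its step size and round to the nearest
    integer (halves are rounded away from zero)."""
    levels = []
    for c, q in zip(coefficients, steps):
        magnitude = int(abs(c) / q + 0.5)
        levels.append(magnitude if c >= 0 else -magnitude)
    return levels

def dequantize(levels, steps):
    """Decoder side: multiply each level by its step size."""
    return [level * q for level, q in zip(levels, steps)]

# Hypothetical coefficients ordered from low to high frequency, with larger
# step sizes for the higher frequencies, as described in the text.
coefficients = [618, -75, 31, 9]
steps = [8, 16, 24, 40]

levels = quantize(coefficients, steps)
reconstructed = dequantize(levels, steps)
errors = [c - r for c, r in zip(coefficients, reconstructed)]

print("levels        :", levels)
print("reconstructed :", reconstructed)
print("errors        :", errors)
```

The reconstructed values differ from the originals because the rounding cannot be undone, and the smallest high-frequency coefficient is lost entirely. The better the motion-compensated prediction, the smaller the residual that has to pass through this stage.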
1.2 MPEG Standardization 
Recent progress in digital technology has made the widespread use of
compressed digital video signals practical. Standardisation has been very
important in the development of common compression methods, to be used 
with potential new services and products. Standardisation allows new 
services to interoperate with each other and also encourages the investment 
needed in integrated electronics to make the technology cheaper. One drive 
for standardisation comes from MPEG, the "Moving Picture Experts Group" 
[17], which works under the joint direction of the International
Organization for Standardization (ISO) and the International
Electrotechnical Commission (IEC). The group works on standards for the
coding (the lossy compression) of moving pictures and their associated audio.
It is a three-part standard defining
audio and video compression coding methods together with a multiplexing 
system for interleaving the audio and video data allowing them to be played 
back together [17]. MPEG has approached the growing need for multimedia 
standards in a step-by-step manner, with a number of different phases. Three 
of these phases have been designated MPEG1, MPEG2 and MPEG4. There 
was another phase called MPEG3 [18] which was intended as an extension of
MPEG2 to cater for High Definition Television (HDTV) [5], [9] but was 
eventually merged into MPEG2. 
The MPEG standards involve the use of lossy compression techniques. 
The essence of MPEG (MPEG1 and MPEG2) is its syntax: these are the 
compact tokens that make up the data bitstream. MPEG's semantics then tell 
the decoder how to convert the tokens back, thus re-assembling the original 
stream of samples. The semantics are a collection of rules; a code which 
responds to combinations of bitstream elements. MPEG2 video, which is a part 
of MPEG2 standard, specifies the syntax and semantics of an encoded video 
bitstream [6]. This includes the parameters (bit-rates, picture sizes and 
resolutions, etc.) that may be applied by the video coder, and explains how 
the bitstream should be decoded in order to reconstruct the original picture. 
What MPEG2 does not define, however, is how the decoder and encoder 
should be implemented, only that they should be compliant with the MPEG2 
bitstream. This leaves designers free to develop the best encoding and 
decoding methods possible, whilst still retaining compatibility. 
The range of possibilities within the MPEG2 standard is so wide that 
not all features of the standard are used for every application. Algorithmic 
groupings called 'profiles' and parameter sets called 'levels' (defined below),
developed for a variety of applications, have been integrated into the full 
standard [5-9], [11-13]. There are certain circumstances when the decoder 
needs to decode and display a bitstream with limited attributes; e.g. a user 
may want to view a low resolution bitstream at low bitrate. A decoder 
implemented to decode this bitstream only will generally be smaller in area,
with a reduced power requirement, than a decoder which implements all the
features of a standard. Similarly a bitstream consisting of all the features of a
standard will require more bandwidth compared to a bitstream offering more
limited features. Hence to implement all the features of the standard in all 
decoders is unnecessarily complex and wasteful of bandwidth, so a small 
number of subsets of the full standard, the profiles and levels, have also been 
defined. 
A profile is a subset of the algorithmic tools and a level identifies a set 
of constraints on parameter values (such as picture size and bit rate) [5-9], [11-
13]. A decoder which supports a particular profile, and a particular level, is 
only required to support the corresponding subset of the full standard and set 
of parameter constraints. The standardisation only defines the bitstream 
syntax and the decoding process. Generally, this means that any decoders 
which conform to a particular specification (of bitstream) should produce 
near identical output pictures. However, decoders may differ in the way that 
they respond to errors introduced in the transmission channel. For example, 
an advanced decoder might attempt to conceal faults in the decoded picture if 
it detects errors in the bitstream. For a coder to conform to a specification, it 
only has to produce a valid bitstream. This condition alone has no bearing on 
the picture quality through the codec, and there is likely to be a variation in 
coding performance between different encoder designs. For example, the 
coding performance may vary depending on the quality of the motion-vector 
measurement, the techniques for controlling the bit rate, the methods used to 
choose between the different prediction modes, the degree of picture 
preprocessing and the way in which the quantizer is adapted according to the 
picture content. 
The MPEG video standard allows MPEG compatible equipment to 
inter-operate, because the bitstreams are standardised. However, the way the 
actual encoding process is implemented to generate the bitstream is up to the 
encoder designer. Therefore, not all equipment will necessarily produce the 
same quality video (at a given bit rate). There will be a range of products 
available, at different price levels, which the consumer can choose from to suit 
their own application. Standardization avoids the development of too much 
proprietary software, with all the attendant syntactic problems, and allows 
users to concentrate on things other than video formats and image 
compression. However many of the details, such as the best method of data 
compression, are still an open question and the subject of a good deal of 
research effort. 
The aim of this effort is to experiment with those parts of the MPEG2
encoder which are not restricted by the MPEG standard, i.e. are not
standardized, such as motion estimation algorithms. Fast and power-efficient
algorithms are implemented both in software and hardware, ensuring all the
time that a valid MPEG bitstream, compliant with the rules of the standard,
is produced.
The main thrust of recent contributions is to reduce complexity and to speed
up ME algorithms while attempting to match the full search accuracy; these
are also the aims of this thesis. Some of the techniques used in the past
year or so, while the work for this thesis was carried out, have been:
to reduce complexity [19-22] (by using block sums and block variances,
multiple references, variable search window sizes etc.); to take advantage of
spatial and temporal correlations between frames [23], [21] (to speed up the
motion estimation process while keeping the quality good and power consumption
low); to hybridize current techniques [24] (to create algorithms suitable for
all kinds of motion: small, moderate, fast etc.); to simplify the measure of
match between blocks under comparison [19] (by using block sum and block
variance); and to speed up the matching calculation by parallelization [25-30]
(i.e. executing the ME search for multiple divided blocks/subblocks
concurrently by using a multiprocessor System on Chip (SoC) and Single
Instruction Multiple Data (SIMD) instructions).
1.3 Structure of the Thesis 
The work in this thesis concentrates on two main areas in this regard. First, 
we look at a software solution which reduces pre-processing by exploiting 
parallelism in existing Motion Estimation algorithms, through the 
introduction of vector commands and multithreaded processors. The second 
area attempts to increase battery life and clock speeds by using a full search 
method, but with a coarser grained metric which allows corrections for 
promising candidates. 
The structure of the thesis is as follows. Chapter 2 details the MPEG 
standard for video transmission, to which any coding solutions must 
conform. Chapter 3 discusses the standard Full Search algorithm together 
with existing methods of reducing processing time through the use of the so 
called 'fast' algorithms, which limit the search strategy to a smaller number of 
promising points. The downside of this is that data reuse techniques cannot be
employed efficiently, and the control flow becomes complicated [10]. Chapter 4
considers a hardware design (an accelerator) for ME engines and also 
presents possible mechanisms for exploiting data-level and thread-level 
parallelism in the MPEG2 encoder. 
Chapter 5 discusses detailed theoretical complexity of motion 
estimation methods. 
Data reuse (data already fetched from memory can be saved in delay-
line storage elements) is possible for algorithms with high regularity such as 
the full search motion estimation by making use of delay lines and by moving 
data from one processing element to the next processing element in so called 
systolic architectures. These architectures either parallelize candidate motion 
vector positions or SAD metric calculations [10]. This offers the advantage of 
reducing memory bandwidth. In contrast for fast motion estimation 
algorithms, with less regularity, this data reuse cannot be employed 
efficiently. 
Chapter 6 discusses a possible hardware solution to this by using a coarse
grained metric which uses a less accurate representation of luminance pixel
values (4 bits rather than 8). Naturally this will reduce the Peak Signal to
Noise Ratio (PSNR), which provides a good characterisation of picture quality
[31], [10]. To circumvent this we provide a means of correcting the
coarse-grained metric to full resolution. Chapter 6 concludes with a discussion of a layout
for this low power architecture. Finally Chapter 7 gathers together the 
conclusions from this work. 
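The idea of a coarse-grained metric with correction can be illustrated with a short sketch. It assumes that the reduced-bit SAD keeps the 4 most significant bits of each 8-bit luminance value, and it uses a simple two-stage scheme in which only the most promising candidates are re-evaluated at full resolution. Both choices, and all function names, are assumptions made for this example; they are not the Corrected-RBSAD algorithms developed in Chapter 6.

```python
def sad(block_a, block_b):
    """Full resolution (8-bit) sum of absolute differences."""
    return sum(abs(a - b) for a, b in zip(block_a, block_b))

def rbsad(block_a, block_b, bits=4):
    """Reduced-bit SAD: keep only the `bits` most significant bits of each
    8-bit pixel before taking differences (an assumed truncation scheme)."""
    shift = 8 - bits
    return sum(abs((a >> shift) - (b >> shift)) for a, b in zip(block_a, block_b))

def two_stage_best(current, candidates, keep=2, bits=4):
    """Rank all candidates with the cheap RBSAD, then recompute the full SAD
    for the `keep` most promising ones. Returns (index, full_sad)."""
    ranked = sorted(range(len(candidates)),
                    key=lambda k: rbsad(current, candidates[k], bits))
    refined = [(sad(current, candidates[k]), k) for k in ranked[:keep]]
    best_sad, best_index = min(refined)
    return best_index, best_sad

# Tiny flattened "blocks" with hypothetical pixel values.
current = [200, 100, 50, 17]
candidates = [
    [190, 120, 50, 30],
    [205, 98, 60, 20],
    [10, 10, 10, 10],
]

print("full SAD :", [sad(current, c) for c in candidates])
print("RBSAD    :", [rbsad(current, c) for c in candidates])
print("best     :", two_stage_best(current, candidates, keep=2))
```

In this example the coarse metric preserves the ranking of the three candidates even though its absolute values are much smaller. When two candidates differ only in their low-order bits the coarse metric cannot separate them, which is the situation a correction step is meant to address.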
1.4 Summary of the work presented 
The work presented in this thesis concerns the implementation of a real time,
portable, low power video encoder/communicator. It starts with a study of the
MPEG1 and MPEG2 standards with an investigation into the significance of 
Motion Estimation (ME) in the MPEG2 encoder. This is followed by an 
investigation into the Full Search algorithm. So called Fast ME algorithms are 
also presented before comparing the behaviour (in terms of algorithmic 
complexity) and the quality (PSNR) of these algorithms with that of the Full 
Search algorithm. Although the fast methods appear to alleviate the 
computational burden of Full Search ME algorithm, the data irregularity 
associated with their hardware implementations and the requirement of real 
time video processing suggests a further investigation involving the 
techniques of Data Level Parallelism (DLP) and Thread Level Parallelism 
(TLP) for both Full Search and fast ME algorithms. In order to exploit the
parallelism inherent in the inner loop of the ME process of the MPEG2 encoder,
the SimpleScalar Instruction Set Architecture (SS ISA), a 32-bit RISC with a
64-bit operation code [32], is extended to accommodate the new vectorized
(SIMD, Single Instruction Multiple Data) instructions, inserted under the
no-operation implementation (NOP_IMPL) code space. Since the Full Search
algorithm has an advantage of high data regularity and picture quality 
(PSNR), it is considered along with the fast ME algorithms. 
Algorithmic simulations are carried out for both scalar and vectorized
binaries in x86 mode (gcc) as well as in SS mode (sslittle-na-sstrix-gcc).
To validate the equivalence of the bitstreams generated by the non-vectorized
MPEG2 TM5 binary and the vectorized binary in x86 mode, they are compared
using the 'cmp' utility of the Linux OS. This was done for both the Full
Search and each of the Fast ME algorithms on a number of different video
sequences and search ranges to obtain the Dynamic Instruction Count (DIC), a
metric which acts as an algorithmic complexity measure, for three vector
register lengths (i.e. a datapath of 4, 8 and 16 bytes). Employment of DLP in
the Full Search algorithm shows a significant complexity reduction, whereas for
fast ME methods wide data parallelism is less significant due to the
subsampling.
After completing this software procedure, a hardware implementation of the
DLP, i.e. a vector datapath/coprocessor, is obtained. For this purpose a RISC
based configurable LEON2 (SPARC-V8 compliant) processor is extended
through a vector coprocessor tightly coupled to the RISC CPU, with the vector
datapath instructions implemented in VHDL. The extended LEON2
processor was then simulated and synthesized with the help of the ESD group,
Loughborough University, and layouts were generated corresponding to the
three different configurations of the vector datapath, after a lengthy iterative
procedure of floorplanning and power planning.
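To give a feel for why the vector register length matters, the sketch below emulates a SAD computed over a 16 x 16 block in chunks of 4, 8 or 16 bytes and counts how many chunk-wide steps are needed. It is an idealised model: each chunk stands in for one hypothetical vector SAD-accumulate instruction, and loads, loop control and other overheads that contribute to a real Dynamic Instruction Count are ignored. The function name and test data are assumptions.

```python
def block_sad_vectorised(cur_block, ref_block, vector_length):
    """SAD of two blocks (lists of rows), processing `vector_length` bytes per
    step. Returns (sad, steps), where `steps` counts the chunk-wide operations."""
    total, steps = 0, 0
    for cur_row, ref_row in zip(cur_block, ref_block):
        for start in range(0, len(cur_row), vector_length):
            lanes = zip(cur_row[start:start + vector_length],
                        ref_row[start:start + vector_length])
            total += sum(abs(c - r) for c, r in lanes)  # one emulated vector step
            steps += 1
    return total, steps

# Deterministic 16 x 16 test blocks with arbitrary 8-bit values.
cur_block = [[(3 * i + 5 * j) % 256 for i in range(16)] for j in range(16)]
ref_block = [[(7 * i + 2 * j + 1) % 256 for i in range(16)] for j in range(16)]

for vl in (1, 4, 8, 16):
    sad_value, steps = block_sad_vectorised(cur_block, ref_block, vl)
    print(f"vector length {vl:2d} bytes: SAD = {sad_value}, steps = {steps}")
```

The SAD value is the same in every case; only the number of steps changes, falling from 256 for the scalar case to 16 when a whole row fits in one vector register. Fast ME methods that subsample the block leave fewer contiguous pels per row, which is consistent with the observation above that wide datapaths benefit them less.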
A further technique, TLP, is then considered to reduce the computational
burden further. For this purpose the SS ISA space is again extended, this
time with multithreaded instructions (extended context registers), to
form a Multithreaded Instruction Set Simulator (MTISS). The aim here is to
parallelize the outer loop of the ME process by allocating each processor
context to a portion of the ME process, so that the ME process is run by
multiple processors for both Full Search and fast ME algorithms. Simulations
were then carried out for both MT parallelized and unparallelized Full Search
and Fast ME algorithms for a number of video sequences and search ranges.
The relative DIC values thus obtained showed that the use of multiprocessors
to run the ME process represents an efficient solution for both Full
Search and Fast ME methods. The hardware implementation of these MT
processors in a SoC video encoder architecture was done by the ESD group:
multiple instances of the configurable and extensible LEON2 processor
(SPARC-V8 compliant, 5-stage pipeline) are created to form 2-way and 4-way
configurable multiprocessor Systems on Chip (SoC), augmented with data-parallel
coprocessors, corresponding to 2 and 4 multiprocessor contexts
respectively.
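The allocation of portions of the ME process to processor contexts can be sketched as a partition of the macroblocks of a frame into contiguous groups, one per context. The example below is only a software illustration using Python threads; the function names are hypothetical, the `estimate` argument stands in for whichever ME algorithm is used, and nothing here models the MTISS or the LEON2 hardware.

```python
from concurrent.futures import ThreadPoolExecutor

def partition(items, n_contexts):
    """Split `items` into `n_contexts` contiguous chunks of near-equal size."""
    size, extra = divmod(len(items), n_contexts)
    chunks, start = [], 0
    for k in range(n_contexts):
        end = start + size + (1 if k < extra else 0)
        chunks.append(items[start:end])
        start = end
    return chunks

def run_motion_estimation(macroblocks, n_contexts, estimate):
    """Run `estimate` on every macroblock, one chunk per context, and return
    the results in the original macroblock order."""
    chunks = partition(macroblocks, n_contexts)
    with ThreadPoolExecutor(max_workers=n_contexts) as pool:
        per_context = pool.map(lambda chunk: [estimate(mb) for mb in chunk], chunks)
        return [result for chunk_result in per_context for result in chunk_result]

# A CIF frame (352 x 288) contains 22 x 18 = 396 macroblocks of 16 x 16 pels.
macroblocks = list(range((352 // 16) * (288 // 16)))
for contexts in (2, 4):
    sizes = [len(chunk) for chunk in partition(macroblocks, contexts)]
    print(f"{contexts} contexts -> macroblocks per context: {sizes}")
```

With 2 and 4 contexts the CIF macroblocks divide evenly, so each context receives the same amount of work if every macroblock costs the same. With fast ME algorithms the cost per macroblock varies with the picture content, so an even split by count does not guarantee an even split by time.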
The effect of DLP and TLP is then analysed in terms of a simple
theoretical complexity model for the purpose of optimisation. For this
purpose an efficient complexity-area product metric (ICAP) is proposed to
predict the outcome of intensive simulations without running them. The
ICAP acts as a measure of optimisation between area, speed and power. An
alternative TLP design is also proposed for the Full Search algorithm, which
involves converting Full Search from a spiral to a raster scan, and then
inserting TLP inside the Full Search algorithm. Although the DIC obtained from
this alternative TLP is higher than that obtained from spiral Full Search, due
to coding overhead, it may lead to a more efficient MT SoC design due to
reduced thread context.
Another efficient orthogonal technique, in which the accuracy of the SAD
metric is reduced by reducing the number of bits (RBSAD), is then
investigated. Some associated drawbacks are considered, and solutions in the
form of a number of variations, including spatio-temporal algorithms, are
presented along with their hardware architecture implementations. Finally
hardware implementations of ME engines are proposed, one corresponding
to Full Resolution SAD and one to RBSAD. The hardware (RTL) simulations
were carried out using ModelSim, synthesis used Synopsys Design Compiler
with the UMC 0.13 µm 8-Cu technology design library, and floorplanning and
layouts were generated using Cadence Virtuoso/Encounter. Two VLSI layouts
were generated, one for the full resolution SAD ME engine and one for the
RBSAD ME engine.
2. MPEG1 & MPEG2
2.1 Introduction 
In the previous chapter, the main issues associated with the process of
motion estimation, namely power, speed and area, were discussed. A brief
introduction to the MPEG standard was given, along with the significance of
standardization in terms of hardware and bandwidth. In this chapter we shall
go into greater detail on the process of motion estimation and its effects,
and give more insight into the features provided by the MPEG1 and MPEG2
standards and their significance.
2.1.1 MPEG1 
The acronym MPEG stands for Moving Picture Experts Group. It is a 
standard method of transmitting digital video and sound in a compressed 
format using a smaller bandwidth than traditional analog methods [5-9], [11-
13]. MPEG1 was made as a generic standard [6], [7], [33] for the purpose of 
multimedia compression. By generic here is meant that the standard is 
independent of any particular application and therefore comprises mainly a 
toolbox. It is left up to the user to decide, which tools to select for a particular 
application. MPEG1 standardizes coding syntax [6], [7], [8] which includes 
Motion Estimation (section 1.1), motion compensated prediction, Discrete 
Cosine Transformation (DCT), quantization and VLC (variable length coding) 
but does not specify any algorithm (such as a particular motion estimation 
algorithm) to create a bitstream. Similarly a decoding procedure for decoding 
of the bitstream is standardized [7]. 
Additional features of MPEG1 include support for Random Access to 
the video sequence [6-8], which is achieved by embedding independent access 
points in the form of I frames (a type of frame which is encoded without 
reference to any other frame) into the bitstream, together with some kind of 
interactivity, such as fast forward/reverse search, which comprises a 
scanning of the compressed bitstream and the display of selected frames to 
generate the fast forward or reverse searching. 
Although MPEG1 supports a variety of picture sizes and aspect ratios, 
and a wide range of bit rates, it was basically intended for storage of CIF 
(common intermediate format) video and its associated audio at about 1.5 
Mbps (Mega bits per second) on various digital storage media such as 
CD-ROM etc. [8]. The encoded data rate is targeted at 1.5 Mb/s as this is a 
reasonable transfer rate of a double-speed CD-ROM player (rate includes 
audio and video). VHS-quality playback is expected from this level of 
compression [34-36]. 
MPEG1 encodes only a certain type of image sequence known as 
progressive video. Video cameras capture images by scanning their elements 
(the pixels) line by line in a raster fashion. If all the lines are scanned in one 
pass then the images are called progressive frames. If the lines are scanned in 
two passes, first scanning odd numbered lines and then even numbered lines, 
then the images are called interlaced frames, which are composed of two fields 
separated from each other by half a frame time (1/60 sec for a 60 Hz 
system), one field containing odd numbered lines and the other containing 
even numbered lines. 
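As a minimal sketch of how an interlaced frame relates to its two fields, the following Python fragment splits a frame, held as a list of rows, into the field of odd numbered lines and the field of even numbered lines (lines counted from 1). The function name split_fields and the list-of-rows representation are assumptions made purely for illustration.

```python
def split_fields(frame):
    """Split a frame (a list of rows) into its two fields.

    Lines are numbered from 1, so the odd numbered lines are rows 0, 2, 4, ...
    """
    odd_lines = frame[0::2]   # lines 1, 3, 5, ...
    even_lines = frame[1::2]  # lines 2, 4, 6, ...
    return odd_lines, even_lines


# A toy 6-line frame in which every pixel holds its (1-based) line number.
frame = [[line] * 4 for line in range(1, 7)]
odd_field, even_field = split_fields(frame)
```

Each field has half the vertical resolution of the frame, which is why field-based processing sees less vertical detail than frame-based processing.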
In order to reach this target bitrate of 1.5 Mbps, the input is usually 
first converted into the MPEG standard input format (SIF). MPEG1 SIF 
includes luminance picture dimensions of 352 pixels x 240 lines and a colour 
space (Y (Luminance), Cb (Chroma blue), Cr (Chroma red)) according to 
CCIR (Consultative Committee for International Radio) Recommendation 601 
[8], [5]. CCIR 601 video (720x576x25 for Phase Alternating Line (PAL) and 
720x480x30 for National Television System Committee (NTSC)) can be converted 
to the lower resolution standard interchange format (SIF), with 360 pixels x 240 
lines for NTSC and 360 x 288 for PAL. For video CDs the number of pixel 
columns is reduced further to 352 (since 352 is a multiple of 16, the macroblock 
size). CCIR 601 is a standard for representing uncompressed digital video. 
Resolutions of the luma and chroma components are 8 bits/pixel. The 
reduced sensitivity of the human eye to higher frequencies allows the colour 
components Cr and Cb to be further subsampled by a factor of 2, horizontally 
and vertically, providing a decimated 176x120 pixel image. Pre-process 
filtering such as low pass filtering is also applied to the original input images 
before encoding, in order to remove noise from them. This enhances the 
efficiency of encoding and improves the quality of the decoded images [5-9]. 
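The chroma decimation can be illustrated with a short Python sketch. It halves a chroma plane in both directions by taking the rounded mean of each 2x2 block; this averaging filter and the name subsample_2x2 are assumptions for illustration, since the text does not state which decimation filter is used.

```python
def subsample_2x2(plane):
    """Halve a chroma plane horizontally and vertically.

    Each output sample is the rounded mean of a 2x2 block of input samples
    (an assumed, very simple decimation filter).
    """
    height, width = len(plane), len(plane[0])
    return [
        [
            (plane[y][x] + plane[y][x + 1]
             + plane[y + 1][x] + plane[y + 1][x + 1] + 2) // 4
            for x in range(0, width - 1, 2)
        ]
        for y in range(0, height - 1, 2)
    ]


# Small worked example.
chroma = [
    [10, 20, 30, 40],
    [10, 20, 30, 40],
    [50, 60, 70, 80],
    [50, 60, 70, 80],
]
small = subsample_2x2(chroma)

# A flat 352 x 240 chroma plane decimates to 176 x 120.
sif_chroma = subsample_2x2([[128] * 352 for _ in range(240)])
```

Applied to a 352 x 240 plane, the result has 176 columns and 120 lines, a quarter of the original number of chroma samples.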
For MPEG1, the 'constrained parameters' level provides the following 
set of specific restrictions [6], [36] that are reasonable for the kind of 
moderate-resolution, multimedia applications in the 1-3 Mbps range for 
which MPEG1 was optimised. 
o Horizontal size ≤ 720 pixels 
o Vertical size ≤ 576 pixels 
o Total number of macroblocks/picture ≤ 396 
o Total number of macroblocks/second ≤ 396x25 or 330x30 
o Picture rate ≤ 30 pictures/second 
o Bit rate ≤ 1.86 Mbps 
o Maximum decoder buffer size: 376,832 bits 
Whilst the first two constraints would appear to allow some fairly large image 
sizes the other constraints are more restrictive. 
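To see why the macroblock constraints dominate, the limits listed above can be checked numerically. The sketch below is only an illustration: the function names are hypothetical, the limits are copied from the list above, and the bit rate and buffer size constraints are deliberately left out.

```python
MB_SIZE = 16  # macroblock size in luma pixels


def macroblocks(width, height):
    """Number of 16x16 macroblocks needed to cover a picture (rounding up)."""
    cols = (width + MB_SIZE - 1) // MB_SIZE
    rows = (height + MB_SIZE - 1) // MB_SIZE
    return cols * rows


def within_constrained_parameters(width, height, picture_rate):
    """Check the size and rate limits quoted in the text (bit rate not checked)."""
    mbs = macroblocks(width, height)
    return (
        width <= 720
        and height <= 576
        and mbs <= 396
        and mbs * picture_rate <= 396 * 25  # equal to 330 * 30 = 9900
        and picture_rate <= 30
    )


sif_pal = within_constrained_parameters(352, 288, 25)    # 22 x 18 = 396 macroblocks
sif_ntsc = within_constrained_parameters(352, 240, 30)   # 22 x 15 = 330 macroblocks
full_pal = within_constrained_parameters(720, 576, 25)   # 45 x 36 = 1620 macroblocks
```

Both SIF sizes sit exactly at the macroblock limits, whereas a full 720 x 576 picture passes the first two constraints but needs 1620 macroblocks and so fails the third.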
2.1.2 MPEG2 
The target bitrate of 1.5 Mbps for MPEG1 made its quality unacceptable for 
most entertainment applications. As a result the MPEG group issued a second 
standard, MPEG2, which acts as a superset of MPEG1 to serve a wide range of 
applications at various bitrates and resolutions [5], [6], [8]. 
Like MPEG1, it is intended to be a generic standard with additional 
features (over and above MPEG1) in its syntax. These features include, 
o Efficient encoding of interlaced, field-based pictures. The advantages 
  of such coding include adaptivity of MPEG2 to combine fields when 
  motion is slow and encode fields separately when motion is fast. 
  MPEG2 does this on a block by block basis [6], [7]. 
o The ability to create a scalable bitstream for the purpose of encoding 
  pictures at different resolutions, quality levels and frame rates etc., 
  according to the user needs [5], [8], [11-13]. 
o Improved quantization and improved coding within a wider range of 
  options, such as an increase in the quantization range of DCT coefficients 
  (AC) and the introduction of alternate scan [5], [8], [9]. 
The MPEG1 and MPEG2 standards are divided into several parts [6]. In both 
cases, Part 1, 'Systems', describes how the various streams (video, audio, or 
generic data) are to be multiplexed and synchronized. Parts 2 and 3, 'Video' 
and 'Audio', define the video and audio compression decoders respectively. 
The final part, Part 4 'Conformance', defines a set of tests designed to aid in 
establishing that a particular given implementation (bitstream) conforms to 
the MPEG standard. 
2.2 MPEG2 Video Encoding 
The MPEG2 video encoding process is composed of the following main parts 
[7], [37], [8]. 
o Motion Estimation and Compensation - A means of exploiting 
  redundancy between video pictures in order to achieve compression. 
o Discrete Cosine Transformation (DCT) - A decomposition of spatial 
  blocks of image data in order to exploit statistical and perceptual 
  spatial redundancy. The application of the DCT to a data block 
  transforms it into the frequency domain. High frequency components 
  can be quantized (see below) using fewer bits without sacrificing much 
  quality, as at higher frequencies contrast is less perceptible to the 
  human eye, and those frequencies which cannot be detected by the 
  human eye can be removed. The combination of DCT, 
  quantization, run length coding (RLC) and variable length coding 
  (VLC) leads to further compression. 
o Quantization - A selective reduction in the precision with which 
  information is transmitted in order to reduce bit-rates, while 
  minimizing the loss of perceptual quality. 
o Variable-length coding - A means of exploiting statistical redundancy 
  in the symbol sequence resulting from quantization as well as in 
  various types of side information (motion vectors, etc.). 
These basic building blocks described above provide the bulk of the 
compression efficiency achieved by the MPEG video standard algorithm. 
2.2.1 Algorithmic Representation of the MPEG2 Encoder 
Figure 2.1 [6] depicts the encoding and decoding algorithms [6], [7]. The 
incoming video sequence (Figure 2.1a) is pre-processed (initially subsampled 
and then filtered), then motion estimation is used to help form an effective 
predictor of the current picture from previously transmitted pictures. The 
motion vectors obtained from motion estimation are sent, if they are to be 
used. The predictor for each block is subtracted from it, and the resulting 
prediction residual undergoes a DCT. The DCT coefficients are then 
quantized, and the quantized coefficients are variable-length coded for 
transmission. The quantized coefficients also undergo reconstruction and an 
inverse DCT, and are then combined with the predictor, just as they will be in 
the decoder, before forming a reference picture for future motion estimation and 
prediction. This process ensures that both the transmitter and receiver use the 
same set of reference images. 
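The encoder loop described above can be sketched for a single 8x8 block as follows. This is a simplified illustration under stated assumptions, not the MPEG2 reference procedure: it uses a naive orthonormal DCT, a single uniform quantizer step instead of MPEG2 quantization matrices, and omits variable-length coding. All function names are hypothetical.

```python
import math

N = 8  # block size


def _scale(u):
    return math.sqrt(1 / N) if u == 0 else math.sqrt(2 / N)


def _cos(x, u):
    return math.cos((2 * x + 1) * u * math.pi / (2 * N))


def dct_2d(block):
    """Naive orthonormal 8x8 forward DCT (slow, for illustration only)."""
    return [[_scale(u) * _scale(v) * sum(block[x][y] * _cos(x, u) * _cos(y, v)
                                         for x in range(N) for y in range(N))
             for v in range(N)] for u in range(N)]


def idct_2d(coeffs):
    """Inverse of dct_2d."""
    return [[sum(_scale(u) * _scale(v) * coeffs[u][v] * _cos(x, u) * _cos(y, v)
                 for u in range(N) for v in range(N))
             for y in range(N)] for x in range(N)]


def encode_block(current, predictor, step):
    """Return the quantized levels and the locally reconstructed block."""
    residual = [[current[i][j] - predictor[i][j] for j in range(N)] for i in range(N)]
    levels = [[round(c / step) for c in row] for row in dct_2d(residual)]
    # Local decoder: the same reconstruction the receiver will perform.
    dequantized = [[level * step for level in row] for row in levels]
    recon_residual = idct_2d(dequantized)
    recon = [[predictor[i][j] + recon_residual[i][j] for j in range(N)] for i in range(N)]
    return levels, recon


# Toy example: the current block is uniformly 8 grey levels brighter than its predictor.
predictor = [[100] * N for _ in range(N)]
current = [[108] * N for _ in range(N)]
levels, recon = encode_block(current, predictor, step=16)
```

In this toy case the whole residual is carried by the DC coefficient, so a single non-zero level is sent and the reconstruction matches the current block. Because the encoder builds its reference from recon rather than from the original, it stays in step with the decoder.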
Figure 2.1a The MPEG2 Encoder, ref [6]. (Block diagram with the blocks Input 
Video, Preprocessing, Motion Estimation, Predictor, Reference Frames, DCT, 
Quantization, Inverse Quantization, Inverse DCT and Variable Length Coding.) 
Typically, MPEG2 expects chroma subsampling to be consistent with CCIR-
601 prescribed horizontal subsampling [6], [5]. CCIR-601 defines the chroma 
subsampling termed 4:2:2 as shown in Figure 2.2a. The notation 4:2:2 implies 
for every four luma samples there are two co-sited chroma samples 
horizontally. MPEG2 further divides the vertical resolution of chroma 
samples (for more compression). This implies the chroma sub-sampling 
pattern shown in Figure 2.2b[6], termed 4:2:0 sampling. 
Figure 2.1b The MPEG2 Decoder, ref [6]. (Block diagram with the blocks MPEG2 
Bitstream, Variable Length Decoding, Inverse Quantization, Inverse DCT, 
Predictor, Reference Frames and Postprocessing.) 
The phrase 4:2:0 means that the sampling rate of the colour difference signals 
(Cb, Cr), which is 6.75 MHz, is half of the sampling rate of the luma signal (Y) 
of 13.5 MHz. 13.5 MHz is the standard frequency for digitizing analogue video 
systems and is used by NTSC, PAL and Sequential Colour with Memory 
(SECAM). The PAL and SECAM [9] systems are especially well suited to this 
kind of data reduction as 13.5 MHz = 3 x 4.5 MHz, where 4.5 MHz is the 
colour sub-carrier frequency of the PAL (Phase Alternating Line) system [5]. 
Figure 2.2 (a) The subsampling pattern defined by CCIR-601. (b) The subsampling pattern of 
MPEG2 input video, ref [6]. (In the original figure, luma and chroma sample 
positions are marked with different symbols.) 
In the chroma subsampling convention, the first digit corresponds to 
luma (Y), followed by Cb and Cr. However, 4:2:0 does not mean that there is no 
Cr information. Instead it means that for each four samples of luma (Y) there are 
two co-sited chroma samples both horizontally and vertically. 
For PAL DV (Digital Video) [6], by contrast, it means that for each four 
samples of luma (Y) there are two samples of Cb in one line. Then, in the next 
line, for each four samples of luma (Y) there are two samples of Cr, and so on, 
i.e. 4:2:0 in one line, then 4:0:2, then 4:2:0 again, and so on. This constitutes a 
2:1 horizontal down-sampling and 2:1 vertical down-sampling. 
The decoder (Figure 2.1b) decodes the variable-length codes, performs 
a reconstruction of the DCT coefficients and an inverse DCT, forms the predictor 
from previously reconstructed pictures (using variable length decoded motion 
vectors), and performs the summations for the generation of the current 
reconstructed picture (which may itself serve to predict future received 
pictures). It then (postprocessing) interpolates and filters the resulting 
pictures for display. The interpolation of chroma samples at the output of the 
decoder (Figure 2.1b) is necessary to compensate for the reduced colour 
resolution (chroma subsampling) [8]. 
2.3 Temporal Prediction 
Video encoders that use simple picture differencing techniques, such as 
Differential Pulse Code Modulation (DPCM), to predict the current picture 
from a previous one can achieve substantial compression if the input video 
sequence consists of mostly static frames, i.e. a large part is static background 
with little movement in the foreground [6]. The current picture is then 
reconstructed as a summation of its prediction, from the reference picture, 
and the reconstructed residual error. 
This process of picture differencing can be made more effective by 
noting that many changes in the current picture are often simply caused by 
the motion of objects. Another common occurrence is motion of the entire 
picture caused perhaps by the panning of a camera. Such observations 
motivate the inclusion of a motion model in the image prediction to further 
reduce the amount of spatial information that must be encoded as a 
prediction residual. Motion estimation/ compensation, then, is the application 
of a motion model within the prediction method. However this motion 
estimation/compensation may be less effective for sequences involving large 
motion (Figure 2.3) or abrupt discontinuities in motion like scene cuts. It 
should be noted that there are indeed several types of change, in the video 
scene, that are not well described by the particular motion model chosen and 
these may all reduce the efficiency of motion compensation. For instance, a 
motion model based on translation will be less effective in handling rotations 
or zooms. 
One way to handle this kind of situation, for predictive pictures, could 
be to extend the techniques already used by the MPEG2 motion model [17], 
[7] by selecting between intra and non-intra macroblocks (an intra frame is 
simply one for which the current frame is also the predictor). This could be 
done, say, if the variance (a number that indicates how much, on average, 
each of the values in the distribution (pixels or pixel differences) deviates 
from the mean (or centre) of the distribution) of the residual error (the error 
between the current macroblock in the current frame, and the predicted 
macroblock) is greater than the variance of the current macroblock. In this 
case the current macroblock would be coded as intra (which indicates that motion 
estimation/compensation is not effective for it due to large error or large 
motion), and otherwise as non-intra. 
The more macroblocks are coded as intra, the stronger the indication 
that ME has been less effective, which in turn suggests large 
motion and a smaller compression ratio. A mechanism can be developed by 
setting a threshold (at say 50%) and comparing the proportion of 
macroblocks (belonging to the current frame) coded as intra with the threshold. 
This comparison then gives some idea of the relative amount of large or small 
motion. Such a method can be useful for creating hybridized algorithms, such 
as a hybridization of the 'three step search' and 'four step search' algorithms 
(these are subsampled fast ME algorithms), and for algorithms which involve 
spatial and temporal correlations. 
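The variance test and the frame-level threshold described above can be sketched as follows. This is a hedged illustration only: macroblocks are represented as flat lists of pixel values, and the names choose_mode and mostly_large_motion, as well as the 50% default, are assumptions based on the description in the text.

```python
def variance(values):
    """Population variance: mean squared deviation from the mean."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)


def choose_mode(current, predicted):
    """Code a macroblock as intra if the residual varies more than the block itself."""
    residual = [c - p for c, p in zip(current, predicted)]
    return "intra" if variance(residual) > variance(current) else "inter"


def mostly_large_motion(modes, threshold=0.5):
    """Frame-level indicator: is the fraction of intra macroblocks above the threshold?"""
    return modes.count("intra") / len(modes) > threshold


current = [10, 20, 30, 40]
good_prediction = [11, 19, 31, 39]   # residual variance 1, block variance 125
poor_prediction = [40, 10, 40, 10]   # residual variance 500

mode_good = choose_mode(current, good_prediction)
mode_poor = choose_mode(current, poor_prediction)
frame_flag = mostly_large_motion(["intra", "intra", "inter"])
```

A hybrid search could, for instance, use such a frame-level flag to decide which fast ME algorithm to apply to the next frame, although the choice of algorithm per outcome is not specified here.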
2.3.1 Motion Estimation (ME) and Compensation 
The motion model used by MPEG is blockwise translation [6]. For simplicity, 
the blocks are chosen to be a single fixed size of 16 pixels x 16 pixels in the 
luma component. As discussed above, for typical 4:2:0 chroma subsampling, 
the chroma components are subsampled by a factor of two in both 
dimensions relative to the luma component; consequently each chroma 
component includes an 8 x 8 block of pixels corresponding to the same spatial 
region as the 16 x 16 block from the luma picture. In the case of 4:2:0 color 
subsampling, the 16 x 16 region from the luma, i.e. four 8x8 luma components, 
and the two corresponding 8 x 8 regions from the chroma components are 
collectively known as a macroblock [6], [7]. For 4:4:4 color space encoding, i.e. 
no chroma subsampling, the macroblock has twelve 8x8 blocks (four luma 
components and eight color components) [6], [7]. Similarly for 4:2:2 the 
macroblock consists of eight 8x8 blocks (four luma and four chroma) [6-9]. 
Motion estimation/compensation consists of determining, for a 
macroblock from the current picture, the spatial offset in the reference picture 
at which a good prediction of the current macroblock can be found. This 
offset is termed a motion vector. The mechanics of motion compensation 
involve the use of this vector to extract the predicting block from the reference 
picture, subtract it from the current macroblock, and pass the difference on 
for further compression [6], [38]. The vector chosen for a macroblock is 
applied directly to determine the luma predictor, but is scaled by half in both 
dimensions (corresponding to their subsampled size) before being applied to 
find the chroma predictors [6], [38]. 
Motion estimation consists of finding the best vector (best match) to be 
used in the prediction of the current macroblock. This is typically the most 
computationally expensive activity [8] in an MPEG encoder, but can have a 
substantial impact on compression efficiency [8]. The most straightforward 
approach to motion estimation is to evaluate each individual motion vector 
from an allowed range of possible vectors and select the best one. A good 
criterion for which is best would be the number of bits consumed in coding 
the block using that specified offset [6]. In practice, the criterion is often 
simplified to measuring the Sum of Absolute Difference (SAD) (eqn. (2.1)) 
between the current macroblock and the macroblock at the specified offset 
(often evaluated between the luma components only), and the search itself is 
often structured in some manner to reduce the total number of comparisons 
to be made. 
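The exhaustive approach described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: frames are lists of rows of luma values, candidates are visited in raster order, candidates falling outside the reference frame are skipped, and the names sad and full_search are hypothetical.

```python
def sad(cur, ref, bx, by, dx, dy, n):
    """Sum of absolute differences between the n x n block at (bx, by) in the
    current frame and the block displaced by (dx, dy) in the reference frame."""
    return sum(
        abs(cur[by + j][bx + i] - ref[by + dy + j][bx + dx + i])
        for j in range(n)
        for i in range(n)
    )


def full_search(cur, ref, bx, by, n=16, search=7):
    """Evaluate every offset in [-search, search]^2 and keep the lowest SAD."""
    height, width = len(ref), len(ref[0])
    best_mv, best_cost = None, None
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            x, y = bx + dx, by + dy
            if x < 0 or y < 0 or x + n > width or y + n > height:
                continue  # candidate falls outside the reference frame
            cost = sad(cur, ref, bx, by, dx, dy, n)
            if best_cost is None or cost < best_cost:
                best_mv, best_cost = (dx, dy), cost
    return best_mv, best_cost


# Toy data (not real video): a 12x12 reference frame with distinct pixel values,
# and a current frame whose content is the reference displaced by (+2, +1).
ref = [[16 * y + x for x in range(12)] for y in range(12)]
cur = [[ref[min(y + 1, 11)][min(x + 2, 11)] for x in range(12)] for y in range(12)]
mv, cost = full_search(cur, ref, bx=4, by=4, n=4, search=2)
```

With a search range of ±p the loop evaluates up to (2p + 1)^2 candidates per macroblock, each costing n^2 absolute differences, which is what makes full search so expensive and motivates the structured searches mentioned above.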
Figure 2.3 (a) A frame at time instance N-1 used for prediction of the content in the frame at 
instance N. (b) The frame to be coded at time instance N. (c) Prediction error image obtained 
without using motion compensation - all motion vectors are assumed to be zero. (d) 
Prediction error image to be coded if motion compensated prediction is employed. (e) 
Prediction error image obtained by enhanced ME between the frames in (a) and (b). 
The advantage of coding video using the motion compensated prediction of 
frame N from the previous frame N-1 is illustrated in Figures 2.3 a - e for a 
typical test sequence. Note that the images shown in Figure 2.3 (c - e) are 
generated using MATLAB [39] and an MPEG2 decoder [40] to reconstruct 
images from the MPEG2 bitstream. 
Figure 2.3b depicts a frame ('rotating city' [41]) at time instance N to be 
coded and Figure 2.3a the previous frame at instance N-1, whose 
reconstructed form is stored at both the MPEG encoder and decoder. The block 
motion vectors are estimated by the encoder motion estimation procedure 
and provide a prediction of the translatory motion displacement of each 
macroblock in frame N with reference to frame N-1. Figure 2.3c depicts the 
frame difference signal (frame N - frame N-1) which is obtained if no motion 
compensated prediction is used in the coding process; all motion vectors 
are thus assumed to be zero. This corresponds to a PSNR value (Peak Signal to 
Noise Ratio between the original frames) of 14 dB. PSNR is defined as 
$\mathrm{PSNR} = 20 \log_{10}(255 / \mathrm{RMS})$, where RMS is the Root Mean Square Error 
between the current frame and its approximation reconstructed from 
reference frame MBs. 
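The PSNR definition above translates directly into code. The following is a small sketch for 8-bit frames stored as lists of rows; the function name psnr and the treatment of identical frames (returning infinity) are assumptions for illustration.

```python
import math


def psnr(original, approximation):
    """PSNR in dB between two equally sized 8-bit frames (lists of rows)."""
    diffs = [
        o - a
        for row_o, row_a in zip(original, approximation)
        for o, a in zip(row_o, row_a)
    ]
    mse = sum(d * d for d in diffs) / len(diffs)
    if mse == 0:
        return math.inf  # identical frames: RMS error is zero
    rms = math.sqrt(mse)
    return 20 * math.log10(255 / rms)


# A uniform error of 10 grey levels gives RMS = 10, i.e. about 28.1 dB.
original = [[100, 100], [100, 100]]
approximation = [[110, 110], [110, 110]]
value = psnr(original, approximation)
```

Because of the logarithm, halving the RMS error raises the PSNR by about 6 dB.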
Figure 2.4 Half pixel motion estimation, where white circles belong to integer pixel locations 
and black circles belong to half pixel locations. 
By contrast Figure 2.3d depicts the motion compensated frame difference 
signal when the motion vectors obtained are used for prediction and 
corresponds to a PSNR value (between original and reconstructed) of 38 dB. 
Note that this PSNR value (38 dB) is taken from the output 'statistics' file 
generated by the MPEG2 encoder [40]. It is apparent that the residual signal 
to be coded is reduced using motion compensation when compared to the 
pure frame difference coding of Figure 2.3c. 
The effectiveness of motion estimation/compensation can be enhanced 
by allowing the search for an effective prediction region in the reference 
picture to include not only positions at integral offsets but also fractional pixel 
offsets [6], [7], [10]. For a fractional pixel offset, the predicting macroblock is 
constructed by linearly interpolating pixel values relative to the nearest actual 
pixels. An example of a (1/2, 1/2) pixel offset prediction block is shown in 
Figure 2.4. The dark circle locations have pixel values x calculated as x = (a + 
b + c + d)/4, where a, b, c and d are the values of the closest pixels in the original 
reference picture. Figure 2.3e shows the prediction error image obtained by 
the enhanced ME between the frames in Figures 2.3a and 2.3b and 
corresponds to a PSNR value (between original and reconstructed) of 38.5 dB. 
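The interpolation x = (a + b + c + d)/4 can be sketched as below. Positions are given in half-pixel units so that the same routine covers integer, horizontal half, vertical half and (1/2, 1/2) offsets. The function name half_pel and the rounding to the nearest integer are assumptions; the text only gives the unrounded average.

```python
def half_pel(ref, x2, y2):
    """Reference pixel value at position (x2 / 2, y2 / 2).

    Coordinates are in half-pixel units. The value is the rounded mean of the
    four nearest integer pixels; at integer positions these four coincide.
    """
    x0, y0 = x2 // 2, y2 // 2
    x1 = x0 + (x2 % 2)  # neighbour to the right, only for a half-pel x offset
    y1 = y0 + (y2 % 2)  # neighbour below, only for a half-pel y offset
    total = ref[y0][x0] + ref[y0][x1] + ref[y1][x0] + ref[y1][x1]
    return (total + 2) // 4


ref = [
    [10, 20],
    [30, 40],
]
centre = half_pel(ref, 1, 1)      # (10 + 20 + 30 + 40) / 4
integer = half_pel(ref, 0, 0)     # original pixel
horizontal = half_pel(ref, 1, 0)  # midway between 10 and 20
vertical = half_pel(ref, 0, 1)    # midway between 10 and 30
```

A half-pixel refinement would typically evaluate the SAD at the eight half-pel positions surrounding the best integer vector, using values interpolated in this way.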
2.3.2 Motion Vectors 
The motion vector represents the displacement of the best block, i.e. the one 
with the lowest value of the distance criterion SAD (sum of absolute 
difference) [6], which is defined as 
$$\mathrm{SAD}(m, n) = \sum_{i,j=1}^{16} \left| S(i, j, k) - S(i+m, j+n, k-1) \right| \qquad (2.1)$$
In the above equation, SAD(m,n) is the distance value found between the 
current macroblock (in frame k) and the candidate macroblock (in the frame 
k-1) in the reference frame at position (m,n). So the motion vector can be 
defined as, 
$$MV = (MV_x, MV_y) = \arg\min_{(m,n)} \mathrm{SAD}(m, n) \qquad (2.2)$$
where MVx and MVy are horizontal and vertical components of the motion 
vector. 
In the case of a motion-compensated macroblock, these motion vectors are 
transmitted. Motion vectors can themselves be coded using the technique of 
prediction [10]: a set of rules allows construction of a predictor for the 
current motion vector from previously transmitted motion vectors, and only the 
difference is actually transmitted. One method of non-linear MV prediction, 
based on the median calculation of neighbouring blocks, has been used to 
efficiently code the motion vector. The predicted motion vector, P = (Px, Py), 
is denoted as: 
$$P_x = \mathrm{median}(MV(1)_x, MV(2)_x, MV(3)_x) \qquad (2.3)$$
$$P_y = \mathrm{median}(MV(1)_y, MV(2)_y, MV(3)_y) \qquad (2.4)$$
where MV(1) is the previous motion vector in the reference frame, MV(3) is the 
motion vector above (spatially) the current motion vector MV, and MV(2) is 
left (spatially) of MV(3). In MPEG2 the predicted motion vector P is subtracted 
from the actual motion vector MV and the resulting difference, the motion vector 
difference (MVD), is coded using a variable length code. 
$$MVD = MV - P \qquad (2.5)$$
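Equations (2.3) to (2.5) amount to a component-wise median followed by a subtraction, as the following sketch shows. Motion vectors are represented as (x, y) tuples; the function names and the example values are assumptions for illustration.

```python
def median3(a, b, c):
    """Median of three numbers."""
    return sorted((a, b, c))[1]


def predict_mv(mv1, mv2, mv3):
    """Component-wise median predictor P = (Px, Py) of three neighbouring vectors."""
    return (
        median3(mv1[0], mv2[0], mv3[0]),
        median3(mv1[1], mv2[1], mv3[1]),
    )


def mv_difference(mv, predictor):
    """MVD = MV - P, the quantity that is variable-length coded."""
    return (mv[0] - predictor[0], mv[1] - predictor[1])


# Hypothetical neighbouring vectors MV(1), MV(2), MV(3) and current vector MV.
p = predict_mv((2, -1), (4, 0), (3, 5))
mvd = mv_difference((5, 1), p)
```

Note that the median is taken separately for each component, so P need not equal any one of the three neighbouring vectors. When neighbouring motion is coherent the MVD components are small and therefore cheap to code.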
2.4 Picture Types 
In MPEG2, three 'picture types' are defined. The picture type defines which 
prediction modes may be used to code each block [6-8]. 
o Intra pictures (I-pictures) are coded without reference to other frames. 
  Moderate compression is achieved by reducing spatial redundancy 
  through DCT, but no temporal redundancy is exploited. Such frames 
  can be used periodically to provide access points in the bitstream 
  (Random Access) where decoding can begin. 
o Predictive pictures (P-pictures) can use the previous I- or P-picture for 
  motion compensation and may be used as a reference for further 
  prediction. Each block in a P-picture can either be predicted or 
  intra-coded. By reducing spatial and temporal redundancy, P-pictures offer 
  increased compression compared to I-pictures. 
o Bidirectionally predictive pictures (B-pictures) can use both the 
  previous and the next I- or P-pictures (so called anchor pictures) for 
  motion-compensation, and offer the highest degree of compression. 
  Each block in a B-picture can be forward, backward or bidirectionally 
  predicted or even intra-coded. A B-picture provides the most 
  compression (Figure 2.5) since it uses past and future pictures as a 
  reference; however, the computation time is the largest. Because they are 
  not used as anchors, B-pictures are allocated the fewest bits compared to 
  I- and P-type pictures, with I-pictures having the highest allocation 
  since they act as the first reference in the group of pictures (GOP) 
  (Figure 2.6). A typical ratio of bit allocations among I-, P- and 
  B-pictures is 5:3:1 respectively. In order to ensure a minimum amount of 
  decoder scratch memory to decode B-pictures (regardless of the 
  spacing between anchor pictures), and the minimum decoding time, 
  the coder reorders the pictures from natural 'display order' to a 
  'bitstream' order so that the B-picture can be decoded at the same time 
  it arrives at the decoder, as shown in Figure 2.6 [6]. This naturally 
  introduces a reordering delay dependent on the number of consecutive 
  B-pictures [6], [8]. 
There are also three types of prediction. 
o Forward prediction: Prediction of a frame or macroblock obtained 
  from any of the nearest previous anchor pictures (I- and P-). 
o Backward prediction: Prediction of a frame or macroblock obtained 
  from the nearest anchor pictures (I- and P-) succeeding it. 
o Bidirectional prediction: Prediction of a frame or macroblock obtained 
  by averaging its forward and backward predictions. 
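The averaging used for bidirectional prediction can be sketched as follows. Blocks are lists of rows; the function name and the rounding of half values upwards are assumptions made for illustration.

```python
def bidirectional_prediction(forward, backward):
    """Average the forward and backward predictions pixel by pixel.

    Half values are rounded up (an assumed rounding rule).
    """
    return [
        [(f + b + 1) // 2 for f, b in zip(f_row, b_row)]
        for f_row, b_row in zip(forward, backward)
    ]


forward = [[10, 20], [30, 40]]    # prediction from the previous anchor picture
backward = [[20, 21], [30, 50]]   # prediction from the next anchor picture
prediction = bidirectional_prediction(forward, backward)
```

An encoder would typically compare the forward, backward and averaged predictions and keep whichever leaves the smallest residual.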
Figure 2.5 An example of bidirectional prediction. (The diagram shows the current 
macroblock in the B frame, with forward motion compensation from the I frame 
and backward motion compensation from the P frame.) 
Using bidirectionally predicted blocks allows, for example, effective 
prediction of uncovered background, i.e. areas of the current picture that were 
not visible in the past but are visible in the future [6-8]. This situation is 
shown in Figure 2.5, where an object behind the car is hidden in the I frame. It 
partially appears as the car moves forward in the next B frame and the object 
appears fully when the car moves further in the P frame. Hence the current 
macroblock which covers the object in the B-frame has no match in the 
preceding frame (I) but can find a best match in the succeeding frame (P). 
Figure 2.6 Pictures in Encode and Transmit order and Decoder memory usage. The pictures 
inside the block represent a typical Group of Pictures (GOP). (Rows of the original 
figure: Input Order, Encoder and Transmit Order, Decoder Memory 1, Decoder 
Memory 2, Decoder Scratch Memory, Display Order.) 
Due to improved prediction, less residual error results, leading to improved 
compression and quality. These gains come at the expense of increased 
complexity, more decoder memory (to store reference frames prior to the 
arrival of the B frame) and reordering delay. 
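The reordering from display order to bitstream order can be sketched as below: each anchor picture is moved ahead of the B-pictures that precede it in display order, since those B-pictures need it as a backward reference. The string labels and the function name are hypothetical, and the sketch ignores GOP boundaries.

```python
def to_bitstream_order(display_order):
    """Reorder pictures so that every B-picture follows both of its anchors."""
    bitstream, pending_b = [], []
    for picture in display_order:
        if picture.startswith("B"):
            pending_b.append(picture)    # must wait for the next anchor
        else:
            bitstream.append(picture)    # I- or P-picture is sent first
            bitstream.extend(pending_b)  # then the B-pictures that depend on it
            pending_b = []
    bitstream.extend(pending_b)          # trailing B-pictures with no later anchor
    return bitstream


display = ["I0", "B1", "B2", "P3", "B4", "B5", "P6"]
bitstream = to_bitstream_order(display)
```

In this example P3 is transmitted two picture periods before it is displayed, which is the source of the reordering delay and of the extra decoder memory mentioned above.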
2.5 Coding Interlaced Video 
MPEG2 encodes both progressive pictures (Frames) and interlaced pictures 
(Fields) [8], [5]. As discussed in section 2.1, interlaced frames are composed of 
fields separated by half the frame time. 
MPEG2 has an adaptive mechanism to switch between field DCT and 
frame DCT. Frame DCT means that the macroblock belongs to either the 
progressive or interlaced frame whereas field DCT means that the macroblock 
belongs to a field picture only. In the case of progressive input sequences, only 
frame DCT is used, whereas for interlaced input sequences both frame and 
field DCT can be used. 
When the input sequence is interlaced, MPEG2 has two options, either 
to encode the sequence field by field, i.e. each field picture is encoded 
independently, or to concatenate the fields into one frame and then encode 
the result like a composite frame [8], [40]. For an interlaced frame, ME is done, 
first, using the 16x16 macroblock belonging to the composite (interlaced) 
frame and a best match is found. Then the same macroblock is divided into 
two 16x8 macroblocks, one belonging to the odd numbered lines (top field) of 
the current composite frame and the other belonging to the even lines (bottom 
field). ME is done using each 16x8 macroblock in the top and bottom fields of 
the anchor frames (depending on the picture type) and at the end all the best 
matches obtained from composite frame as well as fields are compared and 
the best of all is selected [40], [6], [7]. 
For field pictures, ME is accomplished using the 16x8 macroblock 
belonging to the current field and finding its best match in the top and bottom 
fields of the anchor pictures. There is another mode of ME available for field 
pictures called 16x8 MC (motion compensation), in which the upper half of the 
16x8 macroblock is searched in the top and bottom fields of the anchor 
picture(s). Similarly the lower half is searched in the top and bottom fields of 
the anchor picture(s). Although this can increase the complexity because of 
the extra ME involved compared to progressive frame ME, it does give better 
quality [5-9], [11-13]. 
Field based ME is better for dealing with fast motion because it allows 
the two views of an object to be displaced from each other in the two fields 
due to its motion [6], [8], [5]. This can lead to a smaller prediction error 
than if such fields are combined in a composite frame (see above) for a frame 
based ME. On the other hand, field based ME is less effective in exploiting 
vertical redundancy due to the reduced vertical resolution in fields. 
2.6 Scalable Extensions 
Scalability refers to the ability to decode only a certain part of the bit-stream, 
to obtain video at the desired resolution, so that decoders with different 
complexities can decode and display video at different spatio-temporal 
resolutions from the same bitstream [8], [7]. This reduces the cost of 
implementing expensive decoders for those applications which may not 
require all the features (resolution, quality, etc.) allowed within the bitstream. 
For applications such as HDTV, MPEG2 allows the broadcaster to broadcast 
both the HDTV (1280x720x50 (progressive)) and the normal resolution (CCIR-601, 
720x576x25 (PAL)) signals. The set-top box would display the 
appropriate signal depending upon the receiver's television. A set-top box is a 
device that enables a television set to become a user interface to the internet 
and also enables a television set to receive and decode digital television (DTV) 
broadcasts. 
The minimum decodable subset of the bitstream is called the base layer [8]. 
All other layers are enhancement layers, which improve on the resolution of 
the base layer video. There are several forms of scalability: 
o Spatial (pixel resolution) scalability provides the ability to decode video at 
  different spatial resolutions without first decoding the entire frame and 
  decimating it. 
o SNR (Signal to Noise Ratio) scalability offers decodability using different 
  quantizer step sizes for the DCT coefficients. 
o Temporal scalability refers to decodability at different frame rates 
  without first decoding each frame. 
Any combination of the above leads to some form of hybrid scalability. 
MPEG2 is designed to address a much larger number of potential applications 
and, in order to be cost-effective in each of these, it has a much larger set of 
compliance points to aid hardware implementation. These points are indexed 
by levels and profiles. 
Levels are the definitions, for the MPEG2 standard, of physical 
parameters such as bit rates, picture sizes and resolutions. The so called Low 
level sets constraints on the image size of approximately half CCIR-601 both 
vertically and horizontally, namely 352H x 288V (consumer tape equivalent). 
Main level corresponds to CCIR-601 sized images (720H x 576V x 25Hz or 
720H x 480V x 29.97Hz (Studio TV)). High 1440 level corresponds to HDTV 
(1280x720 (progressive) up to 1440H x 1152V (Consumer HDTV)) and High 
level is for wide-screen HDTV resolutions up to 1920 pixels by 1152 lines [5]. 
Similarly a profile is a 'defined subset of the syntax of the 
specification' [5-9], [11-13]. In other words, a profile imposes some bounds on 
the full syntax and defines which tools or functionalities may be used to 
produce a bitstream. Examples are the addition of another quantization and 
inverse quantization stage in the MPEG2 encoder, or the generation of an 
enhancement layer in addition to the base layer, so that the SNR (Signal to 
Noise Ratio) scalability described above can be used. The Simple profile relates 
to an encoder/decoder pair which only uses unidirectional motion estimation and 
prediction (no B pictures). This saves on coder and decoder memory 
requirements but results in a higher bitrate for a given overall quality. Main 
profile is designed for broadcast television applications. It supports B pictures 
and is the most widely used profile. Using B pictures increases the picture 
quality (described above) but adds about 120 ms to the coding delay to allow 
for the picture reordering [5-9]. Both simple and main profiles are 
non-scalable because they do not support any of the scalability techniques 
described above. Similarly profiles which support scalability are called 
scalable profiles. 
The combination of a profile and a level produces an architecture 
which defines the ability of a decoder to handle a particular bitstream. For 
example main profile at main level (MP@ML) is a very important combination 
due to its uses in broadcast applications in Europe [5]. Main profile at main 
level uses MPEG2 coding of interlaced pictures with resolutions of 720x480,
and a 30 Hz frame-rate ('NTSC' pictures), or 720x576 at a 25 Hz frame-rate
('PAL' pictures). 
2.7 Other Enhancements in MPEG2 
In addition to the zigzag scan (a pattern to arrange quantized coefficients in 
the form of an array starting from the DC Coefficient and ending at the cluster 
of zero coefficients resulting from the quantization of high frequency DCT
coefficients), an alternate scan is introduced in MPEG2 which is apparently 
better for interlace coding [37], [8], [6], [7]. Figure 2.7 [8] shows an example of 
the alternate scanning pattern. 
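As an illustration of the conventional zigzag scan described above, the following minimal sketch generates the visiting order for an 8x8 block. It is not taken from the standard or from the reference model: the function name `zigzag_order` and the (row, column) indexing are assumptions made for this example, and the alternate scan table, which is defined in the standard, is not reproduced.

```python
def zigzag_order(n=8):
    """Return the (row, col) positions of an n x n block in zigzag order.

    Positions are visited one anti-diagonal (row + col = constant) at a time,
    starting at the DC coefficient (0, 0). Odd diagonals are walked with the
    row increasing, even diagonals with the column increasing.
    """
    positions = [(r, c) for r in range(n) for c in range(n)]
    return sorted(
        positions,
        key=lambda rc: (rc[0] + rc[1],
                        rc[0] if (rc[0] + rc[1]) % 2 else rc[1]),
    )


def zigzag_scan(block):
    """Flatten a square block (list of rows) into a 1-D list in zigzag order."""
    return [block[r][c] for r, c in zigzag_order(len(block))]


order = zigzag_order()
print(order[:6])   # [(0, 0), (0, 1), (1, 0), (2, 0), (1, 1), (0, 2)]
```

Because the high frequency coefficients sit towards the bottom right of the block, the scan places them at the end of the array, where the quantized zeros cluster.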
AC coefficients (DCT) are quantized in the range [-2048, 2047], as
opposed to [-256, 255] in MPEG1 [8]. The DC coefficient is given 11 bits full 
resolution. 
In MPEG1, integer values ranging from 1 to 31 are used for the 
quantization scale factor, mquant. This factor is used for the fine tuning of the 
quantization steps (bitrate control) at macroblock level to achieve better 
compression ratios. It is determined by the macroblock type (intra, nonintra, 
motion compensated, non motion compensated), picture type and rate control 
Figure 2.7 An alternate scan for interlace coding
algorithm which takes into account the output buffer fullness and channel 
traffic [6]. It also prevents the need to transmit an entire new quantization 
matrix along with the macroblock. MPEG2 allows for an optional set of 31 
values for 'mquant' that include real numbers ranging from 0.5 to 56. 
2.8 Summary 
In this chapter a brief description of the syntax of MPEG1 and MPEG2 has 
been presented along with their comparison. Section 2.3 describes the 
significance of the process of motion estimation/motion compensation in 
achieving significant compression. 
3. Motion Estimation Algorithm 
3.1 Introduction 
In section 2.3 of the previous chapter, it was discussed that the MPEG2 video 
compression algorithm employs two basic techniques: block-based motion 
estimation/motion compensation [6] for the reduction of the temporal 
redundancy, and transform domain (DCT) coding for the reduction of spatial
redundancy. The motion estimation/compensation technique is applied both 
in the forward (causal) and backward (non-causal) directions [6], [7]. The 
remaining signal (prediction error) is coded using the transform-based 
technique. The motion predictors, known as motion vectors, are transmitted 
together with the spatial information. 
Motion estimation (ME) has proven to be effective in exploiting the 
temporal redundancy of video sequences and is therefore a central part of the 
ISO/IEC MPEG1, MPEG2, and MPEG4 and the CCITT H.261 / ITU-T H.263
video compression standards. These compression schemes are based on a 
block-based hybrid coding concept, which was extended within the MPEG4 
[10], [6] standardization effort to support arbitrarily shaped video objects. 
Motion estimation algorithms have attracted much attention in research and 
industry, for a number of reasons: 
1. Motion estimation is the computationally most demanding algorithm 
of a video encoder (about 60-80% of the total computational time) [42], [43] 
which consequently limits the performance of the encoder in terms of 
encoding speed. 
2. The motion estimation algorithm has a high impact on the visual 
performance of an encoder for a given bit rate [6], [7], (section 2.3). 
3. Finally, the method to extract motion vectors from the video material is 
not standardized, and is open to optimization [40]. 
Motion estimation is also used for applications other than video 
encoding such as image stabilization [44] (a method of obtaining sharper 
pictures by counteracting camera shake), computer vision [45-46], motion 
segmentation [47-49] and video analysis. However, the requirements for 
these applications differ significantly from those of video encoding 
algorithms as the motion vectors obtained have to reflect the real motion 
within the image sequence, otherwise the algorithms will not show the 
desired results. In video coding the situation is different. Motion vectors are 
used to compensate the motion within the video sequence and only the 
remaining signal (prediction error) has to be encoded and transmitted. 
Therefore, the motion vectors have to be selected in order to minimize the 
prediction error and to minimize the number of bits required to code it. 
Within this thesis, the investigation of motion estimation algorithms is 
focussed on video coding. However, similar concepts are applicable to these 
other areas as well. 
Block-matching algorithms assume that all pixels within a block have 
the same motion [10], [8]. Block-matching algorithms result in proper 
behaviour provided that the following prerequisites are met:
o Object displacement is constant within an 8x8 or 16x16 block of pels
(pixels) [50]. 
o Pel illumination between successive frames is spatially and temporally 
uniform [51]. 
o Motion is restricted to being translational in nature. 
o The matching distortion (SAD) (eqn. (2.1)) increases monotonically 
from the minimum [52]. 
The last condition implies that the first derivative does not change sign, i.e.
the SAD does not decrease as the displaced candidate block moves away
from the position of the exact minimum distortion. This is especially
important for most of the so-called fast motion estimation algorithms. For 
real-life video sequences most of these conditions will generally not be met, 
but a large number of motion estimation algorithms still give good results. 
Motion estimation algorithms can be usefully classified into time-
domain and frequency domain algorithms [53], [54]. A brief description is as 
follows: 
1. Time Domain Algorithms: The time domain algorithms [53], [54] may be 
further divided into matching algorithms and recursive (gradient based) 
algorithms. 
a) Matching Algorithms: These include block-matching and feature matching 
algorithms. Block matching algorithms [8], described in this thesis, find the 
best match on a block by block matching basis. They are considered here due 
to their suitability for parallelized hardware architectures [10]. Feature 
matching [55], [56] algorithms use meta information extracted from current 
block and search area pixels. Meta information is descriptive information 
about the multimedia content, e.g. semantic description, such as which
subjects appear in the video, or information about the colour characteristics of 
a video, etc. 
b) Recursive Algorithms: These include pel recursive algorithms. 
Pel recursive algorithms involve the minimization of a displacement frame
difference (DFD) (eqn. (3.1)), by finding a displacement vector (motion vector)
at each pixel in an iterative manner, using a gradient based optimisation
technique [8], [57].
$DFD_{z,d} = S_{n,z} - S_{n-1,z-d}$ (3.1)
where $S_{n,z}$ is the current frame pixel at location z = (x1, x2) and time instant 'n'
and $S_{n-1,z-d}$ is the pixel in the previous frame, at time instant 'n-1' and
location z-d = (x1-dx1, x2-dx2), where 'd' is the displacement between the two pixels.
According to gradient based optimisation, the straightforward way to 
minimize a function of several unknowns is to calculate its partial derivatives 
with respect to each unknown, set them equal to zero, and solve the resulting 
equations. 
There are several algorithms based on the gradient based optimisation 
[8]. In order to easily realize them, they are implemented using numerical 
methods. The simplest numerical optimisation method uses the 'Steepest-descent
method'. The Steepest descent method has many different algorithmic
forms. One of those forms is the Netravali-Robbins algorithm [57] which finds,
iteratively, an estimate of the displacement vector, which minimizes the
square of the DFD (eqn. (3.1)) at each pixel. The iterative procedure to find the
displacement vector is given by
$d^{i+1} = d^{i} - \epsilon \, DFD_{z,d^{i}} \, \nabla_z S_{n-1,z-d^{i}}$ (3.2)
where $d^{i+1}$ is the updated displacement vector found from the previous
estimate $d^{i}$, $\nabla_z S$ is the spatial gradient, with respect to z, of the pixel
intensities S, and $\epsilon$ is the step size. An appropriate step size is critical for the
convergence of the algorithm [8], because if $\epsilon$ is too small, the moves per
iteration can be very small, and the algorithm will take too long to converge.
In eqn. (3.2) the displacement vector is updated in the direction of the
negative gradient (i.e., getting closer to the minimum). The aim of each
iteration is to reduce the DFD, such that,
$DFD_{z,d^{i+1}} < DFD_{z,d^{i}}$.
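The update of eqn. (3.2) can be sketched in one dimension as follows. This is only an illustration under simplifying assumptions that are not part of the thesis: the two frames are modelled as continuous functions (so that no sub-pixel interpolation is needed), the spatial gradient is approximated by a central difference, and the names `pel_recursive_1d`, `eps` and `h` are hypothetical.

```python
import math


def pel_recursive_1d(s_cur, s_prev, z, eps, iterations=100, d0=0.0, h=1e-3):
    """Estimate the displacement d at position z by steepest descent.

    s_cur and s_prev are callables giving the intensity of the current and
    previous frame at a (real-valued) position; eps is the step size.
    """
    d = d0
    for _ in range(iterations):
        dfd = s_cur(z) - s_prev(z - d)                            # eqn. (3.1)
        # Central-difference estimate of the gradient of S_{n-1} at z - d.
        grad = (s_prev(z - d + h) - s_prev(z - d - h)) / (2 * h)
        d = d - eps * dfd * grad                                  # eqn. (3.2)
    return d


# Assumed test signal: the current frame is the previous one shifted by 2.
true_shift = 2.0
s_prev = lambda x: math.sin(0.2 * x)
s_cur = lambda x: s_prev(x - true_shift)

estimate = pel_recursive_1d(s_cur, s_prev, z=0.0, eps=10.0)
print(round(estimate, 3))
```

With this signal a much smaller step size (for example eps=0.1) still moves towards the true displacement but needs far more iterations, which is the convergence issue noted above.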
2. Frequency Domain Algorithms: Frequency domain algorithms [58], [53], 
[33] include phase correlation algorithms and algorithms using, for example,
the DCT or the wavelet transformation (a mathematical transform, similar to
the more commonly known Fourier transform, in which the wavelet
coefficients for a certain function contain both frequency and time domain
information). Phase correlation algorithms [58], [54] are based on the Discrete
Fourier Transform (DFT) phase difference measurements between frames. They
are based on the principle that a relative shift (of pixels) in the spatial domain
results in a linear phase term in the Fourier domain [8]. The phase-correlation 
methods estimate the relative shift between two image blocks by means of a 
normalized cross-correlation function [8]. 
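The principle can be written out as a short worked derivation. The notation below ($f_k$ for the block in frame k, $F_k$ for its Fourier transform, $(d_x, d_y)$ for the shift) is introduced here for illustration only and the continuous transform is used for brevity; it is the standard Fourier shift theorem rather than a formulation taken from [8].

```latex
% A pure translation between two frames ...
f_k(x, y) = f_{k-1}(x - d_x,\; y - d_y)
\;\Longrightarrow\;
F_k(u, v) = F_{k-1}(u, v)\, e^{-j 2\pi (u d_x + v d_y)}

% ... leaves only a linear phase term in the normalized cross-power spectrum
C(u, v) = \frac{F_k(u, v)\, F_{k-1}^{*}(u, v)}
               {\left| F_k(u, v)\, F_{k-1}^{*}(u, v) \right|}
        = e^{-j 2\pi (u d_x + v d_y)}

% ... whose inverse transform is an impulse located at the displacement
c(x, y) = \mathcal{F}^{-1}\{ C(u, v) \} = \delta(x - d_x,\; y - d_y)
```

The position of the peak of c(x, y) therefore gives the estimate of the relative shift between the two blocks.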
In this thesis we shall be concerned with block-based methods as these 
allow a relatively straightforward exploitation of their parallelism [10]. 
3.2 ME Algorithms 
We turn our attention now to ME algorithms and begin with the Full Search 
[10], [8], [7] which compares the current macroblock to be coded with all
macroblocks within a reference frame to obtain the best match. As indicated 
in the previous chapter, the computational burden of such an approach is far 
too high for mobile devices so a number of simplifications have been used. 
These involve: 
(i) limiting the search range to a defined search area [8], [7] which 
recognizes the fact that most objects move fairly slowly between frames and 
so the motion vectors are typically small. 
(ii) limiting the scope of the search within this reduced search area to a 
smaller number of promising candidate positions. These promising positions 
are chosen assuming a monotonic SAD distribution. 
(iii) limiting the accuracy of the SAD calculation, possibly as a first sweep. 
The first of these is a standard method used in MPEG2 hardware 
architectures. The search method is regular and the hardware control is 
straightforward, leading to simple hardware designs [10]. The second, the 
group of so called 'fast' algorithms, produces a motion vector much more 
quickly but at the expense of more complex hardware architectures, and 
somewhat worse picture quality (PSNR). Run in software, the complexity of 
the fast methods per candidate block is significantly higher than that of a full 
search of a given search area, due to the overhead in calculating the new most 
promising candidate and rereading reference data. However as there are 
many more macroblocks in a full search, the full search complexity (defined
here in terms of a Dynamic Instruction Count (DIC)) is still much higher.
Quite which of the fast methods is best for a given size of search area is an 
open question which we investigate in Chapters 4 and 5. The final search 
type, which uses a reduced accuracy in the SAD calculation (eqn (2.1)) as a 
first pass, is the subject of Chapter 6. In order to obtain DIC complexity
values the various search methods are defined in this chapter. 
The Full Search is an exhaustive search: it is a motion estimation
algorithm which evaluates the SAD metric at each candidate motion vector
(CMV) position in a given search area to find the best match. It is clear from
equations (1.2) and (1.3) in chapter 1 that a full search algorithm, due to this
exhaustive nature, requires billions of arithmetic operations per second and
billions of 8-bit memory accesses per second for a real time encoding process.
These are the main difficulties which have driven the attempts at speeding up
the motion estimation process through techniques implemented both in
software and hardware. These techniques include Data Level Parallelism (DLP),
which involves concurrent use of data; Thread Level Parallelism (TLP), which
involves concurrent use of both calling routines and data; the Reduced Bit Sum
of Absolute Differences (RBSAD), which involves reducing the bits required by
the SAD metric (eqn. (2.1)); and fast ME algorithms, which involve subsampling
of the search area.
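The exhaustive nature of the full search can be seen in the following minimal sketch. It is not the implementation used in this thesis: frames are assumed to be plain lists of rows of integer pixel values, candidates falling outside the reference frame are simply skipped, and the names `sad` and `full_search` are chosen for illustration.

```python
def sad(cur, ref, x, y, m, n, size):
    """Sum of absolute differences between the block of `cur` at (x, y) and
    the block of `ref` displaced by the candidate motion vector (m, n)."""
    return sum(
        abs(cur[y + j][x + i] - ref[y + n + j][x + m + i])
        for j in range(size)
        for i in range(size)
    )


def full_search(cur, ref, x, y, p, size=16):
    """Evaluate every candidate in the (2p+1) x (2p+1) search area and
    return (best_motion_vector, best_sad)."""
    height, width = len(ref), len(ref[0])
    best_mv, best_sad = None, float("inf")
    for n in range(-p, p + 1):
        for m in range(-p, p + 1):
            # Skip candidates that would read outside the reference frame.
            if not (0 <= x + m <= width - size and 0 <= y + n <= height - size):
                continue
            d = sad(cur, ref, x, y, m, n, size)
            if d < best_sad:
                best_mv, best_sad = (m, n), d
    return best_mv, best_sad
```

For a search range of p = 7 this costs up to 225 SAD evaluations per macroblock, each of 256 absolute differences, which is the burden the fast methods try to reduce.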
Fast motion estimation algorithms have both merits and demerits. 
They reduce the (DIC) complexity by reducing the number of candidate
motion vectors but can be trapped in local minima, since they are based on
the monotonic function assumption, according to which the error increases
monotonically as the candidate moves away from the global minimum.
However, this assumption does not always hold, especially
when the search area is large, as the possibility of one or more local minima
increases with area. A local minimum is a candidate block location where all
the neighbouring blocks have a greater distortion but the candidate block is 
not the best of the entire search area. Thus the block is only the best within its 
own locality and a better block is available elsewhere in the search area. A 
second difficulty arises in the hardware implementation; fast methods give 
rise to data irregularities [10] due to which the vital technique of data reuse 
cannot be employed very efficiently. This can lead to higher power 
consumption. 
In this chapter we shall describe the majority of the important block based fast
(sub-sampled (reduced search area)) motion estimation algorithms. We start
by considering the various alternatives for reducing the computational 
burden listed above. 
3.3 Limiting the search path within a reduced Search Area 
This group comprises a number of motion estimation algorithms collectively 
known as fast methods, some of which have been employed in the motion 
estimation part of the MPEG2 video encoder. 
The formula used to calculate the criterion for comparison (SAD) is 
given by 
$SAD(m, n) = \sum_{i=1}^{16} \sum_{j=1}^{16} | S(i, j, k) - S(i+m, j+n, k-1) |$ (2.1)
(repeated here for convenience), where 'SAD' is the Sum of Absolute 
Differences, the distortion metric between the current macroblock in the 
frame 'k' and the candidate macroblock at (m,n) in the search area in the 
previous frame, 'k-1'. The Mean Square Error (MSE) distortion metric can also
be used, in addition to other metrics, and it gives superior results [10] because
it can be interpreted as the Euclidean distance between two macroblocks, which is
closer to human visual perception [10], but the requirement of one
multiplication per pixel difference makes it more computationally intensive
than the SAD metric. The complexity of SAD can be further reduced by the
use of summation truncation (an early termination mechanism) in which the 
computation of the SAD value is halted if a currently accumulated partial 
SAD sum exceeds the minimum SAD value calculated so far. 
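A minimal sketch of the SAD of eqn. (2.1) with summation truncation is given below. The frame layout (lists of rows), the once-per-row check of the partial sum and the name `sad_truncated` are assumptions made for this example, not details of the implementation used in this thesis.

```python
def sad_truncated(cur, ref, x, y, m, n, best_so_far=float("inf"), size=16):
    """SAD between the block of `cur` at (x, y) and the block of `ref`
    displaced by (m, n), abandoned early once it cannot win.

    If the partial sum exceeds `best_so_far` the partial sum is returned;
    it only indicates that this candidate has lost.
    """
    total = 0
    for j in range(size):
        cur_row = cur[y + j]
        ref_row = ref[y + n + j]
        for i in range(size):
            total += abs(cur_row[x + i] - ref_row[x + m + i])
        if total > best_so_far:      # tested once per row of the block
            return total
    return total
```

A caller would pass the minimum SAD found so far as `best_so_far` and keep a candidate only when the returned value is smaller. Testing the partial sum once per row rather than once per pixel is a design choice that trades a slightly later exit for fewer comparisons.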
3.4 Fast Methods - limiting the scope of the search. 
The practical realization of fast and efficient video transmission systems
requires fast video encoders, which in turn require an accelerated ME unit since
it is the most computationally complex part of the video coding system (section
3.1). One way to accelerate the ME process is to reduce the activity of the Full 
Search algorithm by reducing the number of candidate motion vectors (CMV) 
searched. There are different methods which can achieve this task at the 
expense of reduced PSNR. The following sections discuss some of these fast 
ME methods. 
3.4.1 Three Step Search 
Koga et al. introduced one of the first of the fast algorithms known as the 
Three Step Search (TSS) [59]. As its name implies, it includes three steps in its 
searching strategy. As shown in Figure 3.1, in the first step, the eight positions 
marked with a '1' are searched starting from the centre '0' (corresponding to a 
zero motion vector) . 
Figure 3.1 An example of Three Step Search algorithm. Examines 25 points.
These positions, in the first step, are taken at a distance of 4 pixels (the initial 
step size) from each other and thus form a square window of 9x9. The best 
match (the minimum SAD) found in the first step is shown as circled '1' 
(Figure 3.1). In the second step, this position is taken as centre and the step 
size is halved to 2. A further eight positions (marked '2') are searched in a 
window of 5x5. The best match found is shown as the circled '2'. In the third 
step, with this position as centre and step size of 1, eight positions are 
searched around it. The best match found in this step (labeled '3') corresponds 
to the final motion vector, (6, -1) in this example. Note that here motion 
vectors are mentioned relative to the centre '0', but in the MPEG2 reference 
model [40], motion vectors are actually relative to the upper-left corner (0, 0) 
of the frame. 
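The three steps described above can be sketched as follows. To keep the example self-contained, the SAD of eqn. (2.1) is replaced by an arbitrary callable `cost(m, n)`, and a synthetic, monotonic distortion surface is used for the demonstration; both are assumptions of this sketch, as is the name `three_step_search`.

```python
def three_step_search(cost, step=4):
    """Return (motion_vector, positions_examined).

    cost(m, n) is a stand-in for SAD(m, n) of eqn. (2.1).
    """
    seen = {}

    def evaluate(mv):
        if mv not in seen:           # cost each position only once
            seen[mv] = cost(*mv)
        return seen[mv]

    centre = (0, 0)
    while step >= 1:
        best = centre
        for dn in (-step, 0, step):
            for dm in (-step, 0, step):
                cand = (centre[0] + dm, centre[1] + dn)
                if evaluate(cand) < evaluate(best):
                    best = cand
        centre = best                # re-centre on the best match
        step //= 2                   # 4 -> 2 -> 1 -> stop
    return centre, len(seen)


# Synthetic distortion surface with its minimum at (6, -1).
mv, examined = three_step_search(lambda m, n: (m - 6) ** 2 + (n + 1) ** 2)
print(mv, examined)   # (6, -1) 25
```

The count of 25 is the centre plus eight new positions in each of the three steps.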
One problem that occurs with the Three Step Search is that it uses a 
uniformly allocated checking point pattern in the first step, which becomes 
inefficient for estimating small motions.
3.4.2 New Three Step Search 
The New Three Step (NTSS) search algorithm [60] was an attempt to include 
an additional step in the TSS to make it include the centre biased motion (the
general experience that the motion is most likely to be represented by a
motion vector close to zero) in its searching strategy.
As in the TSS, eight positions (marked '1' in the 9x9 window) are
searched in the first step around the centre '0'. Then another eight positions
(marked '1') are searched in a 3x3 window around centre '0'. All the SAD
values (eqn. (2.1)) found at these 17 positions are compared together. If the
minimum SAD (the best match) belongs to the positions of 9x9 window then
the next two steps will be same as the steps two and three of TSS. This means
in this case that a total of 33 positions will be searched. But if the best match 
belongs to inner window of 3x3 then there are two options. If the best match 
is at the centre '0' then the search will be halted with the motion vector as (0, 
0). This means total of 17 positions is searched. But if it belongs to any other 
position (circled '1' in 3x3 window in Figure 3.2) then another 3x3 window 
(labeled '2') is searched around this position. 
The best match thus found (circled '2') in this step corresponds to the final 
motion vector. In Figure 3.2 the motion vector found is (2, -1). In NTSS, the 
number of positions searched varies between 17 and 33 . 
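The decision logic of the NTSS can be sketched as below. As an assumption of this example, `cost(m, n)` stands in for the SAD of eqn. (2.1), ties between the inner and outer windows are resolved in favour of the inner one, and the names `ntss` and `best_around` are hypothetical.

```python
def ntss(cost):
    """Return (motion_vector, positions_examined) for a New Three Step Search."""
    seen = {}

    def evaluate(mv):
        if mv not in seen:
            seen[mv] = cost(*mv)
        return seen[mv]

    def best_around(centre, step):
        """Best of the centre and its eight neighbours at distance `step`."""
        best = centre
        for dn in (-step, 0, step):
            for dm in (-step, 0, step):
                cand = (centre[0] + dm, centre[1] + dn)
                if evaluate(cand) < evaluate(best):
                    best = cand
        return best

    origin = (0, 0)
    outer = best_around(origin, 4)       # 9x9 window, as in the TSS
    inner = best_around(origin, 1)       # additional 3x3 window
    best = outer if evaluate(outer) < evaluate(inner) else inner

    if best == origin:                   # zero motion: halt after 17 positions
        return best, len(seen)
    if max(abs(best[0]), abs(best[1])) == 1:
        best = best_around(best, 1)      # one more 3x3 window, then stop
    else:
        best = best_around(best_around(best, 2), 1)   # TSS steps two and three
    return best, len(seen)


print(ntss(lambda m, n: m * m + n * n))                 # ((0, 0), 17)
print(ntss(lambda m, n: (m - 6) ** 2 + (n + 1) ** 2))   # ((6, -1), 33)
```

The two calls reproduce the extreme cases quoted above: 17 positions for zero motion and 33 when the search falls back to the remaining TSS steps.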
Figure 3.2 An example of NTSS
3.4.3 Four Step Search 
The Four Step Search (FSS) [61] was another attempt to improve the efficiency 
of TSS especially for small motion vectors just as NTSS. The FSS algorithm 
uses a centre-biased search pattern. Figure 3.3 shows a typical example of the 
four steps of the FSS. The notion is that if a best match is found at the center of 
any step, then the overall best match can be found in the eight positions 
around it with unit distance between them and that will be the end of search. 
It starts with nine checking points (marked '1') on a 5x5 window with a step 
size of two pixels between the positions in the first step. 
In the second step the center of the 5x5 search window is then shifted 
to the point with minimum block-distortion measure (circled '1'), but the 
positions searched are less than the total positions (9) of a 5x5 window 
depending upon the location of the best match. In Figure 3.3 the number of 
additional positions in the second step which need to be searched is five 
(labeled '2'). The best match found in step two is shown as a circled '2'. 
Similarly in the third step additional positions (labeled '3') in the 5x5 window 
around this best match are searched . 
Figure 3.3 Example path for convergence of Four Step Search
The best match found in this step is shown as circled '3'. In the fourth and 
final step a 3x3 window will be searched around this best match which will 
give the final motion vector, (7, -3) in this example. The FSS requires
between 2 and 4 steps, and between 17 and 27 search positions.
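A minimal sketch of the FSS control flow is given below, again with a callable `cost(m, n)` standing in for the SAD of eqn. (2.1). The function names and the synthetic distortion surface are assumptions of this example.

```python
def four_step_search(cost):
    """Return (motion_vector, positions_examined) for a Four Step Search."""
    seen = {}

    def evaluate(mv):
        if mv not in seen:               # positions shared between windows
            seen[mv] = cost(*mv)         # are only costed once
        return seen[mv]

    def best_around(centre, step):
        best = centre
        for dn in (-step, 0, step):
            for dm in (-step, 0, step):
                cand = (centre[0] + dm, centre[1] + dn)
                if evaluate(cand) < evaluate(best):
                    best = cand
        return best

    centre = (0, 0)
    for _ in range(3):                   # steps one to three: 5x5 window
        best = best_around(centre, 2)
        if best == centre:               # best match at the centre:
            break                        # jump straight to the final step
        centre = best
    return best_around(centre, 1), len(seen)   # final step: 3x3 window


print(four_step_search(lambda m, n: m * m + n * n))   # ((0, 0), 17)
mv, examined = four_step_search(lambda m, n: (m - 7) ** 2 + (n + 3) ** 2)
print(mv)                                             # (7, -3)
```

Caching the positions already visited is what limits the second and third steps to three or five new positions each.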
3.4.4 Two Dimensional Logarithmic Search Algorithm (TDL)
The Two Dimensional Logarithmic search (TDL) [62], was the first block 
matching algorithm to exploit the quadrant monotonic model to match 
blocks. The multi-stage search is accomplished by successively reducing the 
search area during each stage until the search area is trivially small. The TDL 
has several stages. Since the publication of Jain and Jain's algorithm [62], it 
has been described by various authors. They differ only in the way the step 
size is reduced. 
Figure 3.4 Example path for convergence of Two Dimensional Logarithmic Search
The following implementation is based upon one such idea. Four positions 
(labeled '1') are searched at the corners of a Greek cross (+) around the centre 
'0' (Figure 3.4) with a step size equal to p/2 where 'p' is half the search 
window width. The best match found is shown again as a circled '1'. In the
second step, if the step size is greater than 1, it is halved and four more 
positions in a similar manner to step one are searched (marked '2') around it. 
The best match found in step two is shown as circled '1', the same point as 
found in step one in this example. This procedure continues until the step size 
becomes unity. In this case nine positions (marked '3') are searched around 
the best match found in the previous step. The best match found thus gives 
the final motion vector in this case, point (1, -5) in Figure 3.4. Note that unlike 
the previous methods the TDL is not able to search all of the (-7, 7) x (-7,7) 
search space. 
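The variant of the TDL described above can be sketched as follows. The sketch assumes an initial step size of 4 that is halved after every step, and uses a callable `cost(m, n)` in place of the SAD of eqn. (2.1); the name `tdl_search` and the synthetic distortion surface are illustrative only.

```python
def tdl_search(cost, step=4):
    """Return (motion_vector, positions_examined) for the TDL variant above."""
    seen = {}

    def evaluate(mv):
        if mv not in seen:
            seen[mv] = cost(*mv)
        return seen[mv]

    centre = (0, 0)
    while step > 1:
        best = centre
        # Four positions at the ends of a Greek cross (+) around the centre.
        for dm, dn in ((0, -step), (-step, 0), (step, 0), (0, step)):
            cand = (centre[0] + dm, centre[1] + dn)
            if evaluate(cand) < evaluate(best):
                best = cand
        centre = best
        step //= 2
    # Unit step size: all positions of the 3x3 window are examined.
    best = centre
    for dn in (-1, 0, 1):
        for dm in (-1, 0, 1):
            cand = (centre[0] + dm, centre[1] + dn)
            if evaluate(cand) < evaluate(best):
                best = cand
    return best, len(seen)


print(tdl_search(lambda m, n: (m - 1) ** 2 + (n + 5) ** 2))   # ((1, -5), 17)
```

With these step sizes the largest reachable displacement along a diagonal is 1, which illustrates why the whole search space cannot be covered.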
3.4.5 Orthogonal Search 
The Orthogonal Search Algorithm (OSA) was proposed in reference [63]. It 
proceeds alternately in the horizontal and vertical direction. According to
the algorithm, two equally spaced positions (marked '1' in Figure 3.5) around 
the centre '0' are searched horizontally with a pixel distance of p/2. 
Figure 3.5 Path of convergence for Orthogonal Search
The best match thus found ('0' in Figure 3.5) is taken as a starting point for the 
vertical search. Again two positions (labeled '2') are taken around this best 
match with the same step size. The best match found in this step is shown as a 
circled '2'. If the step size is greater than 1 then it is halved and the best match 
is taken as the starting point for the next horizontal search (positions marked 
as '3'). The same procedure continues (switching between horizontal and 
vertical search) until step size becomes 1. In Figure 3.5, the final motion vector 
found is (2, -6). 
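The alternation between horizontal and vertical stages can be sketched as below. An initial step size of 4 and a callable `cost(m, n)` standing in for the SAD of eqn. (2.1) are assumptions of this example, as is the name `orthogonal_search`.

```python
def orthogonal_search(cost, step=4):
    """Return (motion_vector, positions_examined) for an Orthogonal Search."""
    seen = {}

    def evaluate(mv):
        if mv not in seen:
            seen[mv] = cost(*mv)
        return seen[mv]

    centre = (0, 0)
    while True:
        for axis in ((1, 0), (0, 1)):        # horizontal stage, then vertical
            best = centre
            for sign in (-1, 1):
                cand = (centre[0] + sign * step * axis[0],
                        centre[1] + sign * step * axis[1])
                if evaluate(cand) < evaluate(best):
                    best = cand
            centre = best
        if step == 1:
            return centre, len(seen)
        step //= 2


print(orthogonal_search(lambda m, n: (m - 2) ** 2 + (n + 6) ** 2))   # ((2, -6), 13)
```

Only two new positions are examined per stage, so the three step sizes 4, 2 and 1 cost 13 positions including the centre.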
3.4.6 Large Diamond Search (LDS) [10,64] 
As shown in the Figure 3.6, eight positions (marked '1'), starting from the 
centre '0', which form a large diamond, are searched. In the next iteration the 
same (large diamond shaped) window is searched around the best match 
(circled '1'). This process continues until the best match found (circled '1' 
again) is at the center of the diamond. Then another four points at the corners 
of a Greek cross shaped pattern are searched in a window of 3x3. The best 
match thus found (circled '3') is the final motion vector, (2, -1) (Figure 3.6). 
Figure 3.6 Large Diamond Search strategy
3.4.7 Small Diamond Search 
This algorithm [10] searches four positions (labeled '1'), starting from the
centre '0', at the ends of a cross (+) in a 3x3 window as shown in Figure 3.7. 
Then another 3x3 window around the best match (circled '1') (Figure 3.7) is 
searched until the best match found (circled '3') is at the centre. 
Figure 3.7 Small Diamond Search pattern
This algorithm was an attempt to further reduce the computational 
complexity of the above described large diamond search pattern. 
3.4.8 Limited Diamond Search 
The idea of limited diamond search, given in [10], is to reduce the complexity 
of Small Diamond search which may arise due to large motion sequences. 
According to this algorithm, the Small Diamond search described in section 
3.4.7 should halt after 4 steps whether the best match is found at the centre or 
not. In Figure 3.8 the best match (circled '4') is found after the search halted 
after four steps of Small Diamond search pattern. The final motion vector is 
(3, -1). Note the example described here in Figure 3.8 is different from that 
described in Figure 3.7. 
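The Small Diamond search and its limited form differ only in the stopping rule, which the following sketch makes explicit through an optional `max_steps` argument. The callable `cost(m, n)` stands in for the SAD of eqn. (2.1); the function name, the argument name and the synthetic distortion surfaces are assumptions of this example.

```python
def small_diamond_search(cost, max_steps=None):
    """Return (motion_vector, positions_examined).

    With max_steps=None the search runs until the best match is at the
    centre (Small Diamond search); with max_steps=4 it behaves like the
    Limited Diamond search described above.
    """
    seen = {}

    def evaluate(mv):
        if mv not in seen:
            seen[mv] = cost(*mv)
        return seen[mv]

    centre = (0, 0)
    steps = 0
    while max_steps is None or steps < max_steps:
        best = centre
        for dm, dn in ((0, -1), (-1, 0), (1, 0), (0, 1)):   # ends of a (+)
            cand = (centre[0] + dm, centre[1] + dn)
            if evaluate(cand) < evaluate(best):
                best = cand
        steps += 1
        if best == centre:           # best match at the centre: finished
            break
        centre = best
    return centre, len(seen)


near = lambda m, n: (m - 3) ** 2 + (n + 1) ** 2
far = lambda m, n: (m - 6) ** 2 + (n + 1) ** 2
print(small_diamond_search(near, max_steps=4)[0])   # (3, -1)
print(small_diamond_search(far)[0])                 # (6, -1)
```

For the larger motion, limiting the search to four steps stops it before the minimum is reached, which is the quality penalty paid for the bounded complexity.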
Figure 3.8 An example path of Limited Diamond Search
3.4.9 Conjugate Direction Search 
This algorithm is presented here as a form of "One at a Time Search" [65]. 
According to this algorithm the search is started from the centre position ('0') 
(Figure 3.9), and two equidistant (1 pixel distance) horizontal positions 
(labeled '1') are searched. The next position to be searched depends upon
whether the best match found in the first step is to the right of centre '0' or to 
the left. In this example the best match (circled '1') is found to the left of the 
centre '0'. Then one more position is searched in this direction, i.e. left. If this 
is the best match (circled '2') then another position at a one pixel distance is 
searched. This process continues until the minimum SAD found at a certain 
position does not change. This position is shown as circled '3' in Figure 3.9. 
This point is then taken as the horizontal coordinate of the final motion 
vector. In a similar manner the search proceeds from this point (circled '3') in 
the vertical direction, by searching two positions (labeled '5'), one above and 
one below it. The direction of the search depends upon the best match found. 
The search will continue in the vertical direction until there is no change in 
the minimum SAD. This point is shown by the circled '6' and gives the 
vertical coordinate of the final motion vector. Accordingly the final motion 
vector in this example is (-3, -2). 
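The "One at a Time" form of the search can be sketched as follows. The callable `cost(m, n)` stands in for the SAD of eqn. (2.1), the search range `p` (assumed to be at least 1) bounds the walk, and the names `one_at_a_time_search` and `walk` are hypothetical.

```python
def one_at_a_time_search(cost, p=7):
    """Return the motion vector found by a horizontal then a vertical walk."""
    seen = {}

    def evaluate(mv):
        if mv not in seen:
            seen[mv] = cost(*mv)
        return seen[mv]

    def walk(start, axis):
        """Move from `start` along +/-axis one pixel at a time while the
        cost keeps falling, and return the last improving position."""
        ahead = (start[0] + axis[0], start[1] + axis[1])
        behind = (start[0] - axis[0], start[1] - axis[1])
        if evaluate(ahead) < evaluate(behind):
            sign, best = 1, ahead
        else:
            sign, best = -1, behind
        if evaluate(best) >= evaluate(start):
            return start                     # neither neighbour improves
        while True:
            nxt = (best[0] + sign * axis[0], best[1] + sign * axis[1])
            outside = max(abs(nxt[0]), abs(nxt[1])) > p
            if outside or evaluate(nxt) >= evaluate(best):
                return best
            best = nxt

    horizontal = walk((0, 0), (1, 0))        # fixes the horizontal coordinate
    return walk(horizontal, (0, 1))          # then the vertical coordinate


print(one_at_a_time_search(lambda m, n: (m + 3) ** 2 + (n + 2) ** 2))   # (-3, -2)
```

The number of positions examined grows with the length of the motion vector, unlike the fixed-step methods described earlier.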
Figure 3.9 Conjugate Direction Search Strategy
3.4.10 Cross Search 
A (slightly simplified) version of cross search algorithm [66] is presented here. 
It can be considered as a form of TSS, however at each iteration (step) the 
algorithm searches positions at the end of a cross (X) starting from the centre 
'0' as shown in Figure 3.10. 
The initial step size is taken as p/2. In Figure 3.10 this step size is 3. So
the four positions searched are the ends of a (X) in a 7x7 window with centre 
'0'. Similarly in the next step the step size is halved and the search proceeds in 
the same manner by moving the search window around the best match 
(circled '1') found. This process continues until the step size becomes unity. 
The best match found in this step (circled '3') corresponds to the final motion 
vector, (4, -6).
Figure 3.10 An example of Cross Search
In the actual version of the Cross Search, four more positions are searched at 
the corners of a cross (+) if the best match in the last step belongs to either the 
upper right corner of the cross (X) (searched in the last step with positions 
marked as '3') or lower left. Similarly four more positions will be searched at 
the corners of a cross (X) if the best match in the last step belongs to
the upper left or lower right corner of the cross (X) searched in the last step. 
3.4.11 Parallel Hierarchical One Dimensional Search Algorithm (PHODS)
The PHODS algorithm [67] is very similar to the TSS and TDL motion
estimation algorithms. The difference is that in the PHODS algorithm only
those macroblocks are evaluated, that are on the coordinate axes defining the 
search area. As a result, the coordinates of motion vector are obtained 
separately. Figure 3.11 shows an example of the PHODS algorithm. The first 
step starts by calculating the SAD value at the position (marked '0') in the 
centre of search area. Next, the search step size is defined to be equal to the 
half of the search range (p/2 = 6/2 = 3 in Figure 3.11) and four SAD values
along the x-axis and y-axis direction are evaluated (those positions are 
marked '1' in Figure 3.11 and the intermediate best matches in the 
corresponding steps are circled). After the evaluation, the two best matches 
are chosen: one for the x-axis (circled '1' along x-axis), another for y-axis 
(circled '1' along y-axis).
In the second step the step size is reduced by a factor of two and SAD 
values at four locations are evaluated as before. Two locations around the best 
match found on the y-axis and two locations on the x-axis. All those 
macroblock positions are marked '2' in Figure 3.11 and the corresponding 
best matches are circled. This procedure continues until step size reduces to 
unity. At this point the search is halted and the corresponding best matches 
(circled '3') on the x-axis and y-axis are taken as the horizontal and vertical 
coordinates of the final motion vector respectively. Figure 3.11 shows the final 
motion vector as the point (4, -4). 
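The independent treatment of the two axes can be sketched as below. The callable `cost(m, n)` stands in for the SAD of eqn. (2.1). The text does not say how an odd step size is halved, so rounding up (3, 2, 1) is an assumption of this sketch, as are the names `phods` and `axis_search`.

```python
def phods(cost, p=6):
    """Return the motion vector found by two independent 1-D searches."""

    def axis_search(cost_1d, step):
        pos = 0
        while True:
            # Keep the current position on ties: it is listed first.
            pos = min((pos, pos - step, pos + step), key=cost_1d)
            if step == 1:
                return pos
            step = (step + 1) // 2           # assumed rounding: 3 -> 2 -> 1

    step = max(1, p // 2)
    x = axis_search(lambda m: cost(m, 0), step)   # search along the x-axis
    y = axis_search(lambda n: cost(0, n), step)   # search along the y-axis
    return (x, y)


print(phods(lambda m, n: (m - 4) ** 2 + (n + 4) ** 2))   # (4, -4)
```

Because the two searches never use each other's result they can run in parallel, at the price of never evaluating the combined position (x, y) itself.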
Figure 3.11 An example of Parallel Hierarchical one dimensional search algorithm
3.4.12 Spiral Search (SS) 
The spiral search algorithm [68] comprises the following steps. In the first
step the SAD is evaluated on a grid of four pixels (marked as '1'), as shown
in Figure 3.12, which are at the ends of a Greek cross (+). The dimension of
the cross is taken to be equal to half of the length of the search window (in
Figure 3.12 it is '3'). The SAD is also evaluated at the corners of the search
window (marked as '1'). If the best match found (circled '1') belongs to the
positions at the ends of the Greek cross (+) then a grid of nine pixels
(marked '2') is used with half the step size for the SAD calculation around
this best match. Note that the points marked '2' near the lower right corner
of the search area (Figure 3.12) will only be searched when the best match in
the first step belongs to the corners of the search area.
This process continues until the step size becomes 1. The best match
found (circled '3') corresponds to the final motion vector (3, -3).
Figure 3.12 An example of Spiral Search
If the best match found in step one belongs to the corners of the search
window (circled '1' at the lower right corner of the search area) then three
positions (marked '2') are searched in the next step with a distance of 2 pixels
between them, as shown at the lower right corner of the search area in Figure
3.12. In the next and final step, positions (marked '3') are searched in a 3x3
window around the best match (circled '2'). The number of positions searched
at this step can be nine or fewer, depending upon the location of the best
match in the second step. Figure 3.12 shows a possible search path at the
lower right corner of the search area. The best match in this case is shown
as the circled '3'.
3.4.13 Block-Based Gradient Descent Search (BBGS)
The final fast method, the BBGS algorithm [69], relies on the assumption of
centre-biased motion, as shown in Figure 3.13 below.
Figure 3.13 BBGS path
A square of 3x3 neighbouring integer-pixel search positions (marked '1') is
moved in the direction of the lowest SAD. The search is stopped when the
centre position of the 3x3 square offers the lowest SAD. The authors suggest
that this search procedure should also be stopped after a certain number of
steps (typically after 4 steps).
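The BBGS procedure just described can be sketched in C as below. The sketch is illustrative only: bbgs_search and bbgs_toy_cost are hypothetical names, the cost function stands in for the SAD at a candidate displacement, and clipping candidates to the range [-p, p] is an assumption about how the search window boundary is handled.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for the SAD at displacement (dx, dy);
   its minimum is placed at (2, -3) purely for illustration. */
static int bbgs_toy_cost(int dx, int dy)
{
    return abs(dx - 2) + abs(dy + 3);
}

/* Gradient-descent style search: a 3x3 square is moved towards the lowest
   cost until its centre is the minimum or max_steps squares were examined.
   Returns the cost of the final position. */
static int bbgs_search(int p, int max_steps, int (*cost)(int, int),
                       int *mvx, int *mvy)
{
    int cx = 0, cy = 0;              /* centre of the current 3x3 square */
    int best, steps = 0;

    assert(p >= 1 && max_steps >= 1 && cost != NULL);
    best = cost(0, 0);

    while (steps < max_steps) {
        int bx = cx, by = cy;
        int dx, dy;
        for (dy = -1; dy <= 1; dy++) {
            for (dx = -1; dx <= 1; dx++) {
                int x = cx + dx, y = cy + dy, c;
                if (dx == 0 && dy == 0)
                    continue;                    /* centre already evaluated */
                if (abs(x) > p || abs(y) > p)
                    continue;                    /* outside the search window */
                c = cost(x, y);
                if (c < best) { best = c; bx = x; by = y; }
            }
        }
        steps++;
        if (bx == cx && by == cy)
            break;                   /* centre offers the lowest cost: stop */
        cx = bx;                     /* otherwise move the square */
        cy = by;
    }
    *mvx = cx;
    *mvy = cy;
    return best;
}
```

With the toy cost above and a limit of four steps the square moves from (0, 0) through (1, -1) and (2, -2) to (2, -3), where the centre is the minimum. With a limit of two steps the search is cut short at (2, -2), which illustrates how the step limit trades accuracy for a bounded number of SAD evaluations.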
3.5 Summary and Conclusions 
This section presents some conclusions derived from the performance of the
motion estimation algorithms described above. In order to compare them, two
sequences are used: one containing large motion, the 'Rotating City'
sequence, and one containing small motion, the 'Claire' sequence (Figure
3.14). One way to compare the computational complexity of the algorithms is
to compare their total SAD calculations, i.e. the number of times each
algorithm calls the SAD metric. In the MPEG2 reference model [40], the SAD
metric is implemented in the 'DIST1' function in motion.c [40]. This DIST1
function is shown in Figure 3.15.
Figure 3.14 Frame 11 from Claire Sequence 
Scalar DIST1 (SAD) function
static int dist1(blk1,blk2,lx,hx,hy,h,distlim)
unsigned char *blk1,*blk2;
int lx,hx,hy,h,distlim;
{
  unsigned char *p1,*p1a,*p2;
  int i,j;
  int s,v;

  s = 0;
  p1 = blk1;
  p2 = blk2;

  if (!hx && !hy)
    /* full-pel SAD, unrolled over the 16 pels of a row */
    for (j=0; j<h; j++)
    {
      if ((v = p1[0]  - p2[0])<0)  v = -v; s+= v;
      if ((v = p1[1]  - p2[1])<0)  v = -v; s+= v;
      if ((v = p1[2]  - p2[2])<0)  v = -v; s+= v;
      if ((v = p1[3]  - p2[3])<0)  v = -v; s+= v;
      if ((v = p1[4]  - p2[4])<0)  v = -v; s+= v;
      if ((v = p1[5]  - p2[5])<0)  v = -v; s+= v;
      if ((v = p1[6]  - p2[6])<0)  v = -v; s+= v;
      if ((v = p1[7]  - p2[7])<0)  v = -v; s+= v;
      if ((v = p1[8]  - p2[8])<0)  v = -v; s+= v;
      if ((v = p1[9]  - p2[9])<0)  v = -v; s+= v;
      if ((v = p1[10] - p2[10])<0) v = -v; s+= v;
      if ((v = p1[11] - p2[11])<0) v = -v; s+= v;
      if ((v = p1[12] - p2[12])<0) v = -v; s+= v;
      if ((v = p1[13] - p2[13])<0) v = -v; s+= v;
      if ((v = p1[14] - p2[14])<0) v = -v; s+= v;
      if ((v = p1[15] - p2[15])<0) v = -v; s+= v;

      if (s >= distlim)
        break;

      p1+= lx;
      p2+= lx;
    }
  else if (hx && !hy)
    /* horizontal half-pel interpolation */
    for (j=0; j<h; j++)
    {
      for (i=0; i<16; i++)
      {
        v = ((unsigned int)(p1[i]+p1[i+1]+1)>>1) - p2[i];
        if (v>=0)
          s+= v;
        else
          s-= v;
      }
      p1+= lx;
      p2+= lx;
    }
  else if (!hx && hy)
  {
    /* vertical half-pel interpolation */
    p1a = p1 + lx;
    for (j=0; j<h; j++)
    {
      for (i=0; i<16; i++)
      {
        v = ((unsigned int)(p1[i]+p1a[i]+1)>>1) - p2[i];
        if (v>=0)
          s+= v;
        else
          s-= v;
      }
      p1 = p1a;
      p1a+= lx;
      p2+= lx;
    }
  }
  else /* if (hx && hy) */
  {
    /* horizontal and vertical half-pel interpolation */
    p1a = p1 + lx;
    for (j=0; j<h; j++)
    {
      for (i=0; i<16; i++)
      {
        v = ((unsigned int)(p1[i]+p1[i+1]+p1a[i]+p1a[i+1]+2)>>2) - p2[i];
        if (v>=0)
          s+= v;
        else
          s-= v;
      }
      p1 = p1a;
      p1a+= lx;
      p2+= lx;
    }
  }

  return s;
}
Figure 3.15 Sequential implementation of DIST1 function
The above code (Figure 3.15) has been taken from the MPEG2 TM5 reference code
and computes the sum of absolute differences between two (16*h) blocks,
including half pixel interpolation of block 'blk1'. Here 'blk1' and 'blk2' are
the addresses of the top left pels of each macroblock. 'lx' is the distance
(in bytes) between vertically adjacent pels. 'hx' and 'hy' are flags for
horizontal and/or vertical interpolation. 'h' is the height of the block
(usually 8 for a field or 16). The condition on 'distlim' (the previous
minimum SAD) tells the algorithm to bail out if the currently accumulated
partial SAD sum reaches or exceeds this value.
The 'DIST1' function contains four sequential implementations of the
SAD metric. The first, selected by the condition 'if (!hx && !hy)',
is the usual SAD defined in eqn. (2.1) and is calculated sequentially. Note
that in Figure 3.15 the test on 'distlim' appears only in this first branch.
The other forms (Figure 4.1) of SAD are for half-pixel motion estimation
(section 2.3), which refines the motion estimate and gives a better
prediction.
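The rounding used by the three half-pixel branches can be illustrated with a small worked example in C. The helper names halfpel2, halfpel4 and abs_diff are hypothetical and are introduced here only to isolate the arithmetic; the expressions themselves are those appearing in Figure 3.15.

```c
#include <assert.h>

/* Horizontal or vertical half-pel sample: rounded mean of two neighbouring
   pels, as in the (hx && !hy) and (!hx && hy) branches. */
static int halfpel2(int a, int b)
{
    assert(a >= 0 && a <= 255 && b >= 0 && b <= 255);
    return (int)((unsigned int)(a + b + 1) >> 1);
}

/* Diagonal half-pel sample: rounded mean of the four surrounding pels,
   as in the (hx && hy) branch. */
static int halfpel4(int a, int b, int c, int d)
{
    assert(a >= 0 && b >= 0 && c >= 0 && d >= 0);
    assert(a <= 255 && b <= 255 && c <= 255 && d <= 255);
    return (int)((unsigned int)(a + b + c + d + 2) >> 2);
}

/* Absolute difference between an interpolated reference sample and the
   corresponding pel of the current block. */
static int abs_diff(int ref, int cur)
{
    int v = ref - cur;
    return v >= 0 ? v : -v;
}
```

For reference pels 10 and 13 the two-pel form gives (10 + 13 + 1) >> 1 = 12, and for the four pels 10, 13, 20 and 22 the four-pel form gives (65 + 2) >> 2 = 16. Against a current pel of 20, the first case contributes abs_diff(12, 20) = 8 to the SAD.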
Figure 3.16 Number of times 'DIST1' is called by algorithms in Claire sequence (x-axis: 2p+1)
Figure 3.16 shows the total number of times 'DIST1' is called by each
algorithm in the Claire sequence. As expected the TSS (tss) algorithm requires
more calls to the DIST1 function than NTSS (ntss) and FSS (fss). The Limited
DS (limited_ds), DS (ds) and gradient search algorithms require the fewest
DIST1 calls.
The case above assumed that the SAD metric (eqn. (2.1)) has been
calculated fully, i.e. without any early termination. Since the MPEG2 reference
model [40] implements a SAD early termination mechanism (Figure 3.15) that
terminates the SAD calculation whenever the currently accumulated partial
SAD value exceeds the previously calculated SAD minimum, it is better to
compare algorithms in terms of the total number of SAD evaluations
(iterations inside the SAD function).
Figure 3.17 shows the total number of 'DIST1' evaluations executed by each
algorithm in the Claire sequence. It shows that the early termination
mechanism is less effective for NTSS than for FSS for this sequence.
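The difference between counting DIST1 calls and counting DIST1 evaluations can be made concrete with an instrumented SAD routine. The following is a sketch only: sad16_counted and the two counters are hypothetical names, and the assumption that one "evaluation" corresponds to one processed row of 16 pels is ours, chosen because the early termination test in Figure 3.15 is applied once per row.

```c
#include <assert.h>
#include <stdlib.h>

static long dist1_calls;   /* hypothetical counter: number of SAD calls */
static long dist1_rows;    /* hypothetical counter: rows actually processed */

/* Full-pel SAD over a 16 x h block with row-wise early termination.
   lx is the distance in bytes between vertically adjacent pels. */
static int sad16_counted(const unsigned char *p1, const unsigned char *p2,
                         int lx, int h, int distlim)
{
    int s = 0, i, j;

    assert(lx >= 16 && h > 0);
    dist1_calls++;
    for (j = 0; j < h; j++) {
        for (i = 0; i < 16; i++)
            s += abs(p1[i] - p2[i]);
        dist1_rows++;
        if (s >= distlim)
            break;           /* partial SAD already reaches the current minimum */
        p1 += lx;
        p2 += lx;
    }
    return s;
}
```

For two 16x16 blocks whose pels differ by 10 everywhere, each row adds 160 to the sum. Without an effective limit the call processes all 16 rows, whereas with distlim = 300 it stops after two rows, so both cases count as one call but as 16 and 2 evaluations respectively.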
Figure 3.17 Total number of 'DIST1' evaluations by each algorithm in Claire sequence (x-axis: 2p+1)
Figure 3.18 Total number of 'DIST1' evaluations in Rotating City sequence (x-axis: 2p+1)
Figure 3.18 shows the total number of 'DIST1' evaluations executed for the
'Rotating City' sequence, which contains greater motion. As shown, the
complexity of the DS and Large DS algorithms has increased considerably
compared to the other algorithms, making them less suitable for large motion.
On the other hand the Limited DS has shown reduced computational complexity
for this case as well.
Figure 3.19 Total number of 'DIST1' calls by algorithms including Full search in Rotating City
sequence (x-axis: 2p+1)
Figure 3.19 shows a comparison of 'DIST1' calls by the Full Search (fs) ME
algorithm and 'DIST1' calls by the fast ME algorithms. Figures 3.20 and 3.21
show a comparison of the algorithms in terms of their PSNR.
Figure 3.20 PSNR vs search window size (2p+1) for all algorithms in Claire sequence
Note that the PSNR values in these figures are average values and are
obtained from the output statistics file generated by the MPEG2 encoder.
As expected, the Full Search algorithm gives the best quality level for large
motion sequences, as shown in Figure 3.21. Note, however, that the increase
in PSNR is slight.
Figure 3.21 PSNR for all algorithms in Rotating City sequence (x-axis: 2p+1).
4. Vector Datapath for Real-Time MPEG2 Encoding 
4.1 Introduction 
The previous chapter discussed the definitions of fast motion estimation
algorithms and suggested a possible means of comparing them through the
number of DIST1 evaluations. This notion will now be expanded with the
introduction of the Dynamic Instruction Count (or DIC value). In this
and the following chapter we shall investigate the efficiency of these
algorithms through empirical profiling, which involves running the MPEG2
application both in Linux x86 mode and in SimpleScalar (SS) mode (where
SS is a simulator/profiler which gives complexity statistics for a given
application, MPEG2 here), as well as by a simple theoretical complexity
analysis. Moreover the possibility of exploiting Data Level Parallelism (DLP),
i.e. using Single Instruction Multiple Data (SIMD) commands, and Thread
Level Parallelism (TLP) is also investigated as a means of reducing the
complexity of the motion estimation process.
The MPEG2 codec is based around the discrete cosine transformation
of either the residual data, obtained after performing motion estimation and
compensation when removing redundancy between frames (inter-frame
coding), or the original luminance/chrominance data when removing
redundancy within the same frame (intra-frame coding). These
transformations are followed by quantization, which removes high spatial
frequency components, significantly reducing the required transmission rate
while maintaining good visual quality.
A significant amount of research is currently being conducted into
alternative algorithms such as those based around wavelet transforms [70]
and fractal-based coding algorithms [71]. Nevertheless, the DCT-based
methods are presently much more popular than the other two and form the
basis of all current international standards for digital video coding. These
standards are summarized in Table 4.1.
MPEG4 [51] and H264 [73] achieve even lower bit-rates and higher 
PSNR values than MPEG2 through increasingly sophisticated techniques. 
This directly translates into significantly higher raw computational 
requirements (at least an order of magnitude increase for H264) and increased 
power consumption. This, in conjunction with the rising importance of digital 
video transmission (through the expected phasing out of analog TV in the UK 
over the next few years and the popularity of video-capable embedded 
devices like mobile phones and portable DVD players), has spawned 
significant research and development efforts both in industry and academia 
into a new generation of sophisticated hardware platforms. These platforms 
utilize an increasing number of embedded configurable processors, the most
significant of which are reviewed in the next section.
Table 4.1: DCT-based video coding standards
Std     Year   Body        Rate           Usage
H261    1990   ITU-T       64 Kb/s        ISDN video phone
MPEG1   1993   ISO         1.2 Mb/s       CD-ROM
MPEG2   1994   ISO/ITU-T   4-80 Mb/s      DVD, HDTV
H263    1995   ITU-T       64 Kb/s        PSTN video phone
MPEG4   2000   ISO         24-1024 Kb/s   Many
H264    2003   ISO/ITU-T   <64 Kb/s       Many
4.2 Configurable and Reconfigurable Architectures 
In an attempt to reach near-hardwired performance levels, embedded
processor vendors have produced CPU architectures that can be extended to
closely match the processing and memory requirements of the required
algorithm. This is the domain of (statically) configurable, extensible
processors. It is interesting to note that traditional embedded CPU designers
like ARM and MIPS (ARM and MIPS are RISC (Reduced Instruction Set
Computer) microprocessor architectures) (Table 4.2) have moved in the
direction of pioneering companies of configurable CPUs such as ARC and
Tensilica via closely/loosely coupled coprocessors (ARM, MIPS). In the last
few years, active research in the domain of very-long-instruction-word
(VLIW) and dynamically configurable architectures has led to the
commercialization of more exotic architectures from vendors such as
SiliconHive, Aspex, Elixent and Cradle. A Very Long Instruction Word, or
VLIW, CPU architecture implements a form of instruction level parallelism.
Similar to superscalar architectures, it uses several execution units (e.g. two
multipliers), which enables the CPU to execute several instructions at the
same time (e.g. two multiplications). The main vendors, architectures and
characteristics of both configurable and reconfigurable offerings are
summarized in Table 4.2 [74].
Table 4.2 Configurable/reconfigurable processor vendors and architectures [74]
Vendor             Microarchitecture       Characteristics
ARC [75]           A4 (4 stage pipeline)   Scalar, 16/32-bit (modeless except A4; a
                   A5 (4 stages)           modeless mix of 16/32-bit instructions
                   A600 (5 stages)         makes code faster and parallel when
                   A700 (7 stages)         parallelism is required and compact when
                                           not needed), 32-bit datapath,
                                           configurable, extensible
Tensilica [76]     Xtensa (5 stages)       Scalar, 16/24-bit (modeless), 32-bit
                                           datapath, configurable, extensible
ARM [77]           ARM9 (5 stages)         Scalar, 16/32-bit (mode bit; it selects
                   ARM10 (6 stages)        between 16-bit or 32-bit instructions),
                   ARM11 (8 stages)        32-bit datapath, coprocessor I/F
MIPS [78]          M4K (5 stages)          Scalar, 16/32-bit (mode bit; it selects
                                           between 16 or 32-bit instructions or ISA
                                           (Instruction Set Architecture) mode),
                                           32-bit datapath, coprocessor I/F,
                                           datapath extension technology
SiliconHive [79]   Avispa+                 Instructions up to 768 bits long (ULIW).
                                           Up to 60 instructions per cycle
Aspex [80]         LineDancer              Combines a SIMD parallel processor with
                                           a RISC controller
Elixent [81]       DFA-1000                Consists of an array of 4-bit ALUs
                                           connected using a routing network of
                                           SRAM-based switches.
Cradle [82]        MDSP                    Multiple RISC, DMA and DSP engines
                                           arranged in quads, in a two level
                                           hierarchical bus structure with local
                                           memories.
Because many applications, especially demanding multimedia and
communications applications, do not run fast enough on standard embedded
microprocessors even with the extra performance boost of an embedded DSP
(Digital Signal Processor), engineering design teams tend to hand-code parts
of the design in Verilog or VHDL to achieve the performance they need.
However, custom RTL (Register Transfer Level) logic for complex functions
takes a long time to design and verify. In addition, hand-coded RTL blocks
are often too rigid to change once they are designed, yet changes are often
needed to accommodate developing standards or new product features.
The most popular strategy is to build a system consisting of a number
of highly specialized application specific integrated circuits (ASICs) coupled
with a low cost core processor, such as an ARM [77]. The ASICs are specially
designed hardware accelerators that execute the computationally demanding
portions of the application that would run too slowly if implemented on the
core processor.
Configurable Processors 
Configurable processors allow embedded-system developers to create 
processors specifically tailored to the target algorithms - producing a much 
better fit between processor and algorithm. Designers can add special-
purpose, variable-width registers; specialized execution units; and wide data 
buses to reach an optimum processor configuration for specific algorithms. 
These features allow developers to mould the processor's characteristics to 
the algorithm instead of trying to force a 'round' algorithm into the resources 
available for a 'square' one. Consequently, application developers can more 
rapidly develop systems that meet all performance specifications using 
configurable and extensible processors than by using off-the-shelf, fixed-ISA 
microprocessors and DSPs. 
Configurable processors offer several important advantages over RTL 
design, including programmability after silicon implementation to 
accommodate changing standards and provide maximum flexibility. For 
example, one configurable processor can run multiple audio codecs, whereas 
each codec would have to be implemented in a separate RTL block if 
designed in Verilog or VHDL. 
Configurability of a CPU [83] represents the freedom for designers to 
customize and optimize a processor from a standard menu of options. Cache 
size and features, interrupts, extension instructions, DSP features, timers and 
many other features can be specified. Configurability enables designers to 
add the features they need and to remove features that are not required for 
their application. Performance, size and power tradeoffs can be quickly made 
to define an optimum solution for many applications. The result is generally a 
smaller die area, lower power and a lower production cost than is possible 
with a fixed architecture core. 
Extensible processors, an important superset of configurable 
processors, provide system designers with the additional ability to add 
instructions to the processor that may not have been considered important by 
designers of the original architecture. The addition of highly customized
instructions matched perfectly to a specific application gives configurable
processors the ability to deliver the required performance levels. Configurable
processors are delivered as RTL code that is synthesized into an FPGA (Field
Programmable Gate Array) or SOC (System On Chip) design.
Extendibility [84] is the freedom for designers to add extensions of
their own design to the processor core. For example the ARC processor (ARC
International's RISC based processor) offers the flexibility to extend the
core's instruction set, register set, auxiliary registers and condition code
logic, to create a processor highly tuned for the specific application. By
accelerating inner loops or repetitive software in hardware via custom
extensions, the designer can get more work done per instruction, allowing a
performance unattainable by fixed architecture cores. Alternatively, the
designer may elect to get the same amount of work done with a greatly
reduced clock frequency to conserve power. Performance increases or
frequency reductions of up to 100 times are possible with intelligent use of
the flexible extendibility of such cores.
Examples of possible extensions include:
o SIMD Instructions 
Single Instruction Multiple Data (SIMD) [84] custom instructions allow 
parallel execution of instructions. Typical RISC processors must execute an 
instruction sequence multiple times in series to process a set of data. With 
ARC's extendibility, instructions can be created that will operate on multiple 
operands of 16 or 8 bits in parallel, packed into one 32 bit word. In addition, 
by creating multiple extension registers, instructions can be made 64 bits, 128 
bits or even wider to include many operands for the parallel operation. This is 
obviously attractive in MPEG2 motion estimation where 8-bit data is used. 
In reference [29], 64-bit SIMD instructions are used to implement a
coprocessor of the Intel® XScale® microarchitecture to act as an accelerator
for handheld mobile video applications. Similarly in reference [26], SIMD
instructions of different lengths (4, 8 and 16 bytes) are used to exploit the
parallelism inherent in the inner loop of the ME process of multiple codecs
(MPEG2, MPEG4 and H264) by implementing a coprocessor tightly
coupled with a RISC CPU (SPARC-V8 compliant) [26]. The emphasis there was
on the Full Search ME algorithm. In this thesis SIMD instructions of different
lengths (4, 8 and 16 bytes) are used in the inner loop of the ME process of
the MPEG2 codec to accelerate the process by implementing a coprocessor
attached to a RISC CPU (SPARC-V8 compliant) as above, but the emphasis is
on Full Search and fast ME methods and their performance evaluation in terms
of a theoretical analysis.
o Integrated Coprocessor Instructions
Coprocessors of any size and any pipeline length can be tightly coupled to the 
ARC processor [84] through registers mapped into auxiliary register space, or 
through control data passed to the coprocessor as operand data. The return 
data can be routed to extension core registers to allow the ARC processor to 
continue executing instructions in parallel with the coprocessor. 
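The sub-word parallelism described under SIMD Instructions above, in which four 8-bit operands are packed into one 32-bit word, can be illustrated in plain C. The sketch below is a software model only and does not correspond to any particular ARC or coprocessor instruction; the names pack4, sad4_packed and sad16_row_packed are hypothetical, and the choice of placing pel 0 in the low byte is an assumption.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Pack four 8-bit pels into one 32-bit word (pel 0 in the low byte). */
static uint32_t pack4(const unsigned char *p)
{
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/* Model of a 4-lane SAD operation: the four byte lanes of a and b are
   differenced independently and the absolute values are summed. In hardware
   the four lanes would be processed by a single instruction. */
static unsigned sad4_packed(uint32_t a, uint32_t b)
{
    unsigned s = 0;
    int lane;
    for (lane = 0; lane < 4; lane++) {
        int x = (int)((a >> (8 * lane)) & 0xFFu);
        int y = (int)((b >> (8 * lane)) & 0xFFu);
        s += (unsigned)(x > y ? x - y : y - x);
    }
    return s;
}

/* SAD of one 16-pel row expressed as four packed operations
   instead of sixteen scalar ones. */
static unsigned sad16_row_packed(const unsigned char *p1,
                                 const unsigned char *p2)
{
    unsigned s = 0;
    int k;
    assert(p1 != NULL && p2 != NULL);
    for (k = 0; k < 16; k += 4)
        s += sad4_packed(pack4(p1 + k), pack4(p2 + k));
    return s;
}
```

For a row holding the values 0 to 15 compared against the same values in reverse order, the first packed operation returns 15 + 13 + 11 + 9 = 48 and the whole row sums to 128, the same result as the scalar loop but in a quarter of the operations.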
4.3 Data Level Parallelization (DLP) 
It is evident from eqns. (1.2) and (1.3) that full search motion estimation is
enormously time and power consuming due to its requirement of billions of
integer operations per second and billions of 8-bit memory accesses per
second respectively. Also, because of the data irregularity of fast motion
estimation (ME) algorithms (see Chapter 3), they are often performed in
software. As a result it makes sense to utilise some of the techniques
mentioned above. ME uses 8-bit luminance data, so packing 4 pixel values
into one 32-bit word to form SIMD instructions is very attractive, as is the
use of multiple embedded processors. To begin we consider the scalar SAD
metric (DIST1 function, Figure 3.15, shown again here in Figure 4.1 for
convenience) which, due to its sequential nature, is the most computationally
intensive part of the ME process.
Scalar DIST1 (SAD) function
static int dist1(blk1,blk2,lx,hx,hy,h,distlim)
unsigned char *blk1,*blk2;
int lx,hx,hy,h,distlim;
{
  unsigned char *p1,*p1a,*p2;
  int i,j;
  int s,v;

  s = 0;
  p1 = blk1;
  p2 = blk2;

  if (!hx && !hy)
    /* full-pel SAD, unrolled over the 16 pels of a row */
    for (j=0; j<h; j++)
    {
      if ((v = p1[0]  - p2[0])<0)  v = -v; s+= v;
      if ((v = p1[1]  - p2[1])<0)  v = -v; s+= v;
      if ((v = p1[2]  - p2[2])<0)  v = -v; s+= v;
      if ((v = p1[3]  - p2[3])<0)  v = -v; s+= v;
      if ((v = p1[4]  - p2[4])<0)  v = -v; s+= v;
      if ((v = p1[5]  - p2[5])<0)  v = -v; s+= v;
      if ((v = p1[6]  - p2[6])<0)  v = -v; s+= v;
      if ((v = p1[7]  - p2[7])<0)  v = -v; s+= v;
      if ((v = p1[8]  - p2[8])<0)  v = -v; s+= v;
      if ((v = p1[9]  - p2[9])<0)  v = -v; s+= v;
      if ((v = p1[10] - p2[10])<0) v = -v; s+= v;
      if ((v = p1[11] - p2[11])<0) v = -v; s+= v;
      if ((v = p1[12] - p2[12])<0) v = -v; s+= v;
      if ((v = p1[13] - p2[13])<0) v = -v; s+= v;
      if ((v = p1[14] - p2[14])<0) v = -v; s+= v;
      if ((v = p1[15] - p2[15])<0) v = -v; s+= v;

      if (s >= distlim)
        break;

      p1+= lx;
      p2+= lx;
    }
  else if (hx && !hy)
    /* horizontal half-pel interpolation */
    for (j=0; j<h; j++)
    {
      for (i=0; i<16; i++)
      {
        v = ((unsigned int)(p1[i]+p1[i+1]+1)>>1) - p2[i];
        if (v>=0)
          s+= v;
        else
          s-= v;
      }
      p1+= lx;
      p2+= lx;
    }
  else if (!hx && hy)
  {
    /* vertical half-pel interpolation */
    p1a = p1 + lx;
    for (j=0; j<h; j++)
    {
      for (i=0; i<16; i++)
      {
        v = ((unsigned int)(p1[i]+p1a[i]+1)>>1) - p2[i];
        if (v>=0)
          s+= v;
        else
          s-= v;
      }
      p1 = p1a;
      p1a+= lx;
      p2+= lx;
    }
  }
  else /* if (hx && hy) */
  {
    /* horizontal and vertical half-pel interpolation */
    p1a = p1 + lx;
    for (j=0; j<h; j++)
    {
      for (i=0; i<16; i++)
      {
        v = ((unsigned int)(p1[i]+p1[i+1]+p1a[i]+p1a[i+1]+2)>>2) - p2[i];
        if (v>=0)
          s+= v;
        else
          s-= v;
      }
      p1 = p1a;
      p1a+= lx;
      p2+= lx;
    }
  }

  return s;
}
Figure 4.1 Sequential implementation of DIST1 function
In order to get a numerical estimate of the computational complexity of the
ME process we have profiled the full search [40], [7], [10] and the fourteen
fast ME algorithms (Chapter 3), each with twelve video sequences (fog,
snowfall, snow lane, cup, deadline, office, paris, rotating city, student,
mother and daughter, bowing, tennis). Each sequence consists of 25 frames;
however, the sequences differ in terms of the type of motion and a few
sequences have dimensions different from those defined by CIF.
The MPEG2 TM5 reference software was initially profiled in native mode
(x86) as well as on a simulated processor which implements the SimpleScalar
ISA [85]. This ISA is very similar to the MIPS-II ISA (Table 4.2) (MIPS,
Microprocessor without Interlocked Pipeline Stages, is a RISC processor). It
is an experimental architecture for algorithmic research and optimization and
can best be described as a 32-bit RISC architecture with 64-bit opcodes
(operation codes). The toolset consists of a C compiler, assembler and linker,
and a collection of an architectural and a single microarchitectural simulator.
The profiling results give us the DIC (Dynamic Instruction Count),
which is defined here as the total number of instructions executed during
runtime and gives a measure of the computational complexity of an
algorithm. DIC values might typically be used by a system designer to assess
the performance of an algorithm for the purpose of optimisation. The value of
this metric is that it is a direct measure of performance and does not relate
to any particular CPU implementation. As shown in Figure 4.2 and in Tables
4.3 (algorithmic fast search methods) and 4.4 (Full Search), the major
complexity contributor is the inner loop of the motion estimation function
(DIST1), which computes the error of the current macroblock over an
arbitrary reference macroblock. This function is called for those macroblocks
in the search window (reference frame) determined by the algorithm, and is
itself independent of the search algorithm utilized.
For full search ME in particular, profiling results demonstrate that the
fractional DIST1 instruction count ranges on average from 51% to 72% of the
unmodified MPEG2 reference software complexity for a search window size
of p=6 to p=62 pels respectively, where 'p' is the half search window width
(Table 4.4). At the same time, the fractional instruction count of FDCT
(Forward Discrete Cosine Transform [8], [40]) and the fractional DIC of
FULL_SEARCH (where the DIC of FULL_SEARCH is equal to the whole DIC for
the Full Search part of the algorithm minus the DIC involved in executing
DIST1) vary with the search window range, with the former decreasing and the
latter increasing. REMAINING represents the instructions not covered by
FULL_SEARCH, DIST1 and FDCT.
Figure 4.2 Full search ME fractional DIC distribution
Also in Figure 4.3, general processor requirements for real-time CIF video at
25 fps are depicted. These results were obtained by scaling the architecture-
level results (DIC) obtained above by an average clocks-per-instruction (CPI)
[86] value of 1.5. The dynamic instruction count multiplied by the average
CPI gives the number of clock cycles, which, multiplied by the clock period
of the microarchitecture, translates directly into real time.
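The scaling from DIC to a required clock frequency can be written out as a short calculation. The function below is a sketch under stated assumptions: required_clock_hz is a hypothetical name, the DIC is taken per frame, and the figure of 8 million instructions per frame used in the usage note is an invented illustrative value, not a measured result from this thesis.

```c
#include <assert.h>

/* Clock frequency (in Hz) needed to encode fps frames per second when each
   frame costs dic_per_frame instructions at an average of cpi clock cycles
   per instruction: cycles per frame times frames per second. */
static double required_clock_hz(double dic_per_frame, double cpi, double fps)
{
    assert(dic_per_frame >= 0.0 && cpi > 0.0 && fps > 0.0);
    return dic_per_frame * cpi * fps;
}
```

For example, a hypothetical DIC of 8,000,000 instructions per frame with a CPI of 1.5 at 25 fps gives 8,000,000 x 1.5 x 25 = 300,000,000 cycles per second, i.e. a 300 MHz clock. Since the DIC of full search grows with the search range while CPI and frame rate stay fixed, the required frequency grows in the same proportion.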
Figure 4.3 Full search ME performance requirements for a variety of sequences (x-axis: search range parameter p)
Utilizing the DIC complexity metric, these results are applicable to a wide
range of CPU architectures based on the principles of RISC processing.
Clearly the frequencies obtained are far too high to be realistic for FSME with
battery-powered devices except for very small search ranges.
The fast algorithmic methods on the other hand exhibit near-constant
behaviour over the search range. Table 4.3 shows the average DIC
distribution of the three identified functions for three of the fast ME methods,
averaged over all video sequences. It is clear from Table 4.4 and Table 4.3 that
for both full search and fast methods the inner loop of the ME computation is
the most processing intensive function and the one that would provide the
major performance benefit if it could be accelerated successfully.
Table 4.3: Relative Complexity Distribution Algorithmic ME (fast methods).
Function   DIST1   FDCT    FULL_SEARCH   REMAINING
TSS        43.55   18.09   1.55          36.80
FSS        42.23   18.57   1.04          38.16
HDS        41.54   18.57   1.93          37.95
Table 4.4 Fractional (percentage of Total Complexity) DIC for main ME functions for p in
the range 2 and 62: Full Search Algorithm.
p     DIST1   FDCT   FULL_SEARCH   REMAINING
6     51.0    21.1   3.5           24.5
14    61.1    13.7   9.2           15.9
24    68.3    7.8    14.7          9.1
30    69.8    6.1    17.1          7.1
46    71.7    3.3    21.1          3.9
62    71.9    2.1    23.5          2.4
Vectorized DIST1 (SAD) function (VSAD)
if (!hx && !hy)
  for (j=0; j<h; j++)
  {
#if (VLMAX==4)   /* kernel for 32-bit wide coprocessor configuration */
    vldun(1, p1+0);     // load 4 bytes (4 pixels, 0 to 3) from reference
                        // block (p1) into the vector register file
    vldun(2, p1+4);     // load reference pixels 4 to 7
    vldun(3, p1+8);     // load reference pixels 8 to 11
    vldun(4, p1+12);    // load reference pixels 12 to 15
    vldun(9, p2+0);     // load current block pixels 0 to 3
    vldun(10, p2+4);    // load current block pixels 4 to 7
    vldun(11, p2+8);    // load current block pixels 8 to 11
    vldun(12, p2+12);   // load current block pixels 12 to 15
    vsad(1, 9);         // SAD across pels 0 to 3
    vsad(2, 10);        // SAD across pels 4 to 7
    vsad(3, 11);        // SAD across pels 8 to 11
    vsad(4, 12);        // SAD across pels 12 to 15
#endif
#if (VLMAX==8)   /* kernel for 64-bit wide coprocessor configuration */
    vldun(1, p1+0);     // load reference frame pels 0 to 7
    vldun(2, p1+8);     // load reference frame pels 8 to 15
    vldun(9, p2+0);     // load current frame pels 0 to 7
    vldun(10, p2+8);    // load current frame pels 8 to 15
    vsad(1, 9);         // SAD across pels 0 to 7
    vsad(2, 10);        // SAD across pels 8 to 15
#endif
#if (VLMAX==16)  /* kernel for 128-bit coprocessor configuration */
    vldun(1, p1+0);     // load reference pels 0 to 15
    vldun(9, p2+0);     // load current pels 0 to 15
    vsad(1, 9);         // SAD across pels 0 to 15
#endif
    commit_temp_acc(s); // fetch accumulated value into scalar variable s
    if (s >= distlim)   // SAD early termination
      break;
    p1+= lx;            // go to next row of reference block
    p2+= lx;            // go to next row of current block
  }
Figure 4.4 Vectorized DIST1 function (VSAD)
The DIST1 function was consequently recoded (Figure 4.4) to expose the
data-level parallelism and, in the process, a vector ISA (Instruction Set
Architecture) was identified. An ISA defines instructions, registers, data
memory, the effect of executed instructions on the registers and memory, and
an algorithm for controlling instruction execution.
Figure 4.4 shows the vectorized DIST1 code replacing the inefficient
scalar code of Figure 4.1. It shows the use of the primary vector SAD (VSAD)
opcode (Figure 4.6) along with the post-increment unaligned vector load
opcode (VLDUN) (which supports unaligned register access), for the three
coprocessor configurations, 32-bit, 64-bit and 128-bit. The vectorized encoder
was subsequently compiled for vector lengths (VLMAX) of 32 bits (4 bytes),
64 bits (8 bytes) and 128 bits (16 bytes), using gcc for x86 (the Linux
workstation) and the SimpleScalar gcc compiler (sslittle-na-sstrix-gcc) for the
SimpleScalar ISA (simulator) [87]. Results were obtained both in native
mode (Linux x86), for verification of bitstream correctness, and on the
simulated SimpleScalar processor. The verification process involved
checking that the bitstream generated by the vectorized MPEG2 encoder in
native mode was identical to that generated by the MPEG2 reference software
(scalar MPEG2). This equivalence was verified for all sequences using the
Linux 'cmp' utility.
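The equivalence between the scalar and the vectorized kernels can be illustrated with a minimal sketch in plain C. It is not the thesis code: the function names 'sad_scalar', 'sad_chunked' and 'vsad_ops_per_row' are hypothetical, and each inner chunk of 'vlmax' bytes merely stands in for one vldun/vldun/vsad group of Figure 4.4. Under these assumptions the chunked form returns the same sum as the scalar form for every supported width, which is the property the bitstream comparison relies on.

```c
#include <assert.h>
#include <stdlib.h>

/* Scalar reference: SAD of a 16-pixel-wide, h-row block with a row stride
   of lx bytes, with early termination against distlim after each row. */
static int sad_scalar(const unsigned char *p1, const unsigned char *p2,
                      int lx, int h, int distlim)
{
    int s = 0;
    for (int j = 0; j < h; j++) {
        for (int i = 0; i < 16; i++)
            s += abs((int)p1[i] - (int)p2[i]);
        if (s >= distlim)
            break;
        p1 += lx;
        p2 += lx;
    }
    return s;
}

/* Chunked model (assumption): each row is consumed in 16/vlmax chunks of
   vlmax bytes; one chunk models one pair of vector loads plus one vsad. */
static int sad_chunked(const unsigned char *p1, const unsigned char *p2,
                       int lx, int h, int distlim, int vlmax)
{
    assert(vlmax == 4 || vlmax == 8 || vlmax == 16);
    int acc = 0;
    for (int j = 0; j < h; j++) {
        for (int c = 0; c < 16; c += vlmax)      /* one "vsad" per chunk */
            for (int e = 0; e < vlmax; e++)      /* lanes of that vsad   */
                acc += abs((int)p1[c + e] - (int)p2[c + e]);
        if (acc >= distlim)                      /* same early exit      */
            break;
        p1 += lx;
        p2 += lx;
    }
    return acc;
}

/* Hypothetical helper: number of vsad operations issued per 16-pel row. */
static int vsad_ops_per_row(int vlmax)
{
    assert(vlmax > 0);
    return 16 / vlmax;
}
```

For a 16-pel row the helper gives 4, 2 and 1 vsad operations for VLMAX = 4, 8 and 16, matching the three kernels of Figure 4.4.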
4.3.1 The virtual processor extensions 
On completion of the native mode validation process, the identified 
instruction extensions (VSAD, VLDUN, etc) were inserted in the simulated 
processor opcode space (Figure 4.6). The existing SimpleScalar instruction 
definitions (default sim-profile simulator) were augmented to include the new 
processor state as shown in Figure 4.5 [74]. One way to add new instructions 
into the SimpleScalar ISA is by modifying the SimpleScalar DEF (Instruction 
Definition) file [32], [88]. In this way it is possible to add new instructions or 
change or enhance the semantics of existing instructions. DEF files (ss.def, 
machine.def, vector.def) [32], [87], [41], [89] are stylized C macro files that 
describe how to decode and execute instructions for a particular SimpleScalar 
target. All SimpleScalar simulators rely on DEF files to get information about
decoding and executing instructions; as a result, changes made in a DEF file
will be immediately understood by all SimpleScalar simulators.
The most convenient way to add new instructions to a program 
(SimpleScalar ISA) is to define instruction functions (Figure 4.6) in GNU gcc 
[87]. An instruction function is an inlined GNU gcc function call (see assembly 
syntax in Figure 4.6) that is resolved to a single instruction. By integrating 
these function calls into a source-level algorithm implementation, it is 
possible to create a high-quality compilation utilizing the new instruction. 
#define MP2_VLMAX 16  /* vector length register (maximum) */
#define MP2_VREGS 8   /* number of vector registers */
#define MP2_SREGS 32  /* number of scalar registers */
typedef struct
{
  /* Vector Length Register */
  int VLEN;
  /* Vector register file */
  unsigned char VRF[MP2_VREGS][MP2_VLMAX];
  /* Scalar accumulator */
  int acc;
  /* Scalar register file */
  int SRF[MP2_SREGS];
}
mp2_vstateT;
Figure 4.5 Embedded CPU state for at-speed simulation
Finally, it is possible to alternate between running the unmodified
algorithm, the native-mode (x86) modified binary and the simulated binary
on the simulated processor through the use of a #define statement. The
compiler used was gcc 2.7.3 with full optimizations (CFLAGS= -O3). These
options control various sorts of optimisations, as indicated by the letter 'O'.
Turning on optimization flags makes the compiler attempt to improve the
performance and/or code size at the expense of compilation time and
possibly also the ability to debug the program. The processor state (Figure
4.5) is equivalent to the vector coprocessor data path consisting of a vector
register file, scalar registers, scalar data paths, etc., which are described in
detail later, along with their hardware implementations. Instructions (Figure
4.6) are introduced at the assembly-statement level. The instructions
themselves operate on the virtual machine state, stored in a global variable of
type vstateT. One such instruction extension, which computes the SAD of two
pixel vectors, is shown in Figure 4.6.
#ifdef X86 /* VSAD functionality for native-mode operation */
#define vsad(vrs1,vrs2) \
({\
extern vstateT vstate;\
int index;\
int temp;\
unsigned char p1,p2;\
for (index = 0; index < vstate.VLEN; index += 1)\
{\
p1=(unsigned char) ((vstate.VRF[vrs1][index]) & 0xff);\
p2=(unsigned char) ((vstate.VRF[vrs2][index]) & 0xff);\
temp=(int) p1 - p2;\
if (temp < 0) temp=-temp;\
vstate.temp_acc=vstate.temp_acc+temp;\
p1=(unsigned char) ((vstate.VRF[vrs1][index] >> 8) & 0xff);\
p2=(unsigned char) ((vstate.VRF[vrs2][index] >> 8) & 0xff);\
temp=(int) p1 - p2;\
if (temp < 0) temp=-temp;\
vstate.temp_acc=vstate.temp_acc+temp;\
}\
});
#else /* VSAD functionality for simplescalar mode operation */
#define vsad(rs1,rs2) \
({\
/* gcc inline assembly */\
asm volatile (".word 0x00010000");\
asm volatile (".word \
1 << 29 | /* EXT_OPCODE */\
15 << 25 | /* CATEGORY */\
5 << 20 | /* OPCODE */\
("#rs1" << 10) |\
("#rs2" << 5)");\
});
#endif
Figure 4.6 Native (x86) and simulated (SimpleScalar) VSAD implementation
4.3.2 Description of 'VSAD' 
In the native (x86) mode of the VSAD (vectorized SAD) implementation
(shown in Figure 4.6) [74], [41], 'vstate' is a structure variable (an instance) of
the structure 'vstateT', which represents the programmer's model of a vector
coprocessor coupled to the CPU, and is defined in Figure 4.5 (where
'mp2_vstateT' is the same as 'vstateT'). It includes the vector length register
'VLEN'; the vector register file 'VRF[MP2_VREGS][MP2_VLMAX]', where
'MP2_VREGS' is the number of registers in the vector register file and
'MP2_VLMAX' is the maximum length of a vector register; the scalar
accumulator 'acc', which holds the accumulated value resulting from the SAD
summation operation; and 'SRF[MP2_SREGS]', the scalar register file used for
address calculation for 'VRF' as well as for holding scalar values from the main
RISC CPU.
The vector load operation call 'vldun(..)' shown in Figure 4.4 loads two
bytes of data, i.e. two pixels concatenated into one element of the vector
register. Hence, in order to get the first pixel (byte) value, i.e. the LSB, from
the register element, the following statement is executed,
p1=(unsigned char) ((vstate.VRF[vrs1][index]) & 0xff);\
where 'p1' is the reference pixel, 'index' is the corresponding element of the
vector register and 'vrs1' is the location of the register (the starting address of
the array of pixels) in 'VRF'. The corresponding vector register element
'vstate.VRF[vrs1][index]' is ANDed ('&') with '0xff' to get the first byte (LSB) of
the concatenated pixels (bytes) loaded into VRF[vrs1][index]. Similarly
'p2' gets the value corresponding to the current block pixel. The following
statements execute an absolute difference and summation operation (Figure
4.6).
temp=(int) p1 - p2;\
if (temp < 0) temp=-temp;\
vstate.temp_acc=vstate.temp_acc+temp;\
where 'temp_acc' (which is the same as 'acc' in Figure 4.5) is the accumulator
register of the vector coprocessor (in the programmer's model). To get the
second byte (pixel), i.e. the MSB of the vector register element, the following
statement is executed,
p1=(unsigned char) ((vstate.VRF[vrs1][index] >> 8) & 0xff);\
where again 'p1' holds the reference pixel. The vector register element is
right shifted by 8 bits and ANDed ('&') with '0xff' to get the MSB of the register
element. The same statement is repeated for the current block pixel 'p2',
followed by the absolute difference and summation statements. All these
statements are inside the 'for' loop (Figure 4.6), which iterates 'vstate.VLEN'
times, equal to half of the maximum length of the vector register, i.e.
'MP2_VLMAX/2'.
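The byte extraction described above can be sketched in isolation as follows. This is only an illustration under stated assumptions: the element type is taken to be a 16-bit word holding two pixels (the description implies this, although Figure 4.5 declares the register file as 'unsigned char'), and the names 'pack2' and 'sad_packed' are hypothetical and do not appear in the encoder.

```c
#include <assert.h>
#include <stdint.h>

/* Pack two 8-bit pixels into one 16-bit element: the first pixel occupies
   the least significant byte, the second the most significant byte. */
static uint16_t pack2(uint8_t first, uint8_t second)
{
    return (uint16_t)(first | ((uint16_t)second << 8));
}

/* SAD over n packed elements (2*n pixels), mirroring the loop body of the
   native-mode macro: mask for the low byte, shift then mask for the high. */
static int sad_packed(const uint16_t *ref, const uint16_t *cur, int n)
{
    assert(n >= 0);
    int acc = 0;
    for (int index = 0; index < n; index++) {
        int lo = (int)(ref[index] & 0xff) - (int)(cur[index] & 0xff);
        int hi = (int)((ref[index] >> 8) & 0xff)
               - (int)((cur[index] >> 8) & 0xff);
        acc += (lo < 0) ? -lo : lo;   /* absolute difference, first pixel  */
        acc += (hi < 0) ? -hi : hi;   /* absolute difference, second pixel */
    }
    return acc;
}
```

With two pixels per element, a loop over n elements covers 2n pixels, which is why the loop count is half the register length in bytes.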
In Figure 4.6, the second VSAD macro 'vsad(rs1, rs2)' in SS mode
contains gcc inline assembly syntax to access the VSAD instruction inserted
into the operation code space of the SimpleScalar ISA. The VSAD in SS mode
is the same as the VSAD in x86 (native) mode as shown in Figure 4.6, except
that the VSAD execution in SS mode follows the SS ISA. The assembly syntax in
the macro 'vsad(rs1, rs2)' [41] represents the instruction format of the SS ISA.
The instructions of the SS ISA are 64 bits in length [32], [88]. The register fields
in the instruction are 8 bits, to support extension of the architected registers to
256 integer and floating point registers. There is a 16-bit opcode field that
facilitates fast instruction decoding. There is a 16-bit annotation field which is
useful for synthesizing new instructions without having to change and
recompile the assembler [32], [88].
In Figure 4.6, in the macro 'vsad(rs1, rs2)', the instruction,
asm volatile (".word 0x00010000");\
represents the annotation field combined with the operation code '.word
0x00010000' in the inline assembly function 'asm volatile' passed to the SS
ISA. This annotation value '0x0001' is determined (decoded) in the extended
SS ISA [41]. If the annotation value is decoded as non-zero, all the new
instructions inserted can be decoded [87], [41]. Similarly the next assembly
instruction,
asm volatile (".word \
1 << 29 | /* EXT_OPCODE */\
15 << 25 | /* CATEGORY */\
5 << 20 | /* OPCODE */\
("#rs1" << 10) |\
("#rs2" << 5)");\
shows the remaining 32 bits (bit fields) of the SS ISA instruction, i.e. an
encoded instruction with 'EXT_OPCODE' (external operation code) at bit
position 29 having value '1', 'CATEGORY' at bit position 25 having value '15'
and 'OPCODE' (operation code) at bit position 20 having value '5'. These codes
are decoded together in the extended SS ISA to execute the VSAD instruction
[87], [41]. Similarly '#rs1' and '#rs2' are the source registers at bit positions 10
and 5 respectively. These registers hold the input arguments 'rs1' and 'rs2' of
the macro 'vsad(rs1, rs2)', which are the addresses of the reference and current
macroblock pixels (the starting addresses of the arrays of pixels) respectively.
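As a worked example of the bit-field layout just described, the following sketch assembles the 32-bit word from the quoted shift amounts and values. The function name 'encode_vsad' and the shift constants are hypothetical labels introduced here, and the assumption that each register field is 5 bits wide is inferred only from the spacing of the shifts (10 and 5); it is not stated in the text.

```c
#include <assert.h>
#include <stdint.h>

/* Shift amounts taken from the macro in Figure 4.6. */
enum {
    EXT_OPCODE_SHIFT = 29,
    CATEGORY_SHIFT   = 25,
    OPCODE_SHIFT     = 20,
    RS1_SHIFT        = 10,
    RS2_SHIFT        = 5
};

/* Build the VSAD instruction word for source registers rs1 and rs2. */
static uint32_t encode_vsad(unsigned rs1, unsigned rs2)
{
    assert(rs1 < 32 && rs2 < 32);   /* assumed 5-bit register fields */
    return ((uint32_t)1   << EXT_OPCODE_SHIFT) |   /* EXT_OPCODE = 1 */
           ((uint32_t)15  << CATEGORY_SHIFT)   |   /* CATEGORY = 15  */
           ((uint32_t)5   << OPCODE_SHIFT)     |   /* OPCODE = 5     */
           ((uint32_t)rs1 << RS1_SHIFT)        |
           ((uint32_t)rs2 << RS2_SHIFT);
}
```

Under these assumptions, the call vsad(1, 9) of Figure 4.4 would correspond to the word 0x3E500520: the fixed fields contribute 0x3E500000, rs1 = 1 contributes 0x400 and rs2 = 9 contributes 0x120.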
The VSAD macros (for both x86 and SS mode) of Figure 4.6, and the
vector load operations 'vldun' (a call to which is shown in Figure 4.4), were
implemented and inserted in the SS ISA with the help of the ESD group [41].
After inserting the appropriate vectorized instructions corresponding to the
three vector register lengths of 32 bits (VLMAX=4), 64 bits (VLMAX=8) and
128 bits (VLMAX=16) into the 'DIST1' function (Figure 4.4), three vectorized
binaries corresponding to the three vector lengths were generated using the SS
gcc compiler, i.e. 'sslittle-na-sstrix-gcc' [41]. Simulations were then run for the
three vectorized binaries in both x86 and SS mode, for both Full Search
and Fast ME methods, over a number of video sequences and
search ranges. Simulations for the scalar (non-vectorized) binary were also run
in SS mode for calculating the relative DIC as described above.
As described above, the simulated VSAD implementation in Figure 4.6
shows the standard way of filling the bit-fields of a SimpleScalar ISA
instruction in inline GNU gcc. Using these methods (DLP/SIMD) it is possible
to reduce the DIC for all ME algorithms. Figure 4.7 depicts the variation of the
average DIC of the vectorized full-search ME, across all twelve sequences, with
search range and vector register length. It demonstrates an increasing fractional
algorithmic complexity reduction with increasing search range due to the
introduction of the vector instructions described above.
It is clear that the difference in complexity reduction between the 32-bit
(4 bytes) and 64-bit (8 bytes) configurations, and between the 64-bit and
128-bit (16 bytes) configurations, averaged over all sequences, ranges from
4.4% to 9.9% and from 2.2% to 5% respectively, over the search window range.
This diminishing return demonstrates that a datapath width of 64 bits
(VLMAX=8, the length of the register in bytes) presents a good design
compromise in terms of area-performance.
Similarly exhaustive simulations of the MPEG2 TM5 video encoder
were performed for a number of sub-sampling ME methods, for twelve
different video sequences, each consisting of 25 frames, and for three vector
register file lengths (VLMAX = 4, 8 and 16). Figure 4.8 depicts the average
relative DIC for the Three Step Search. Other sub-sampling ME algorithms
such as the Four Step Search, the Diamond Search, etc. (Chapter 3), follow a
pattern similar to that in Figure 4.8. The data presented in Figure 4.7 and
Figure 4.8 is relative to the unmodified (non-vectorized) Full Search (FS) ME
algorithm as it appears in the TM5 distribution.
It is immediately clear from Figure 4.8 that fast ME substantially
relieves the processing requirements. An interesting observation is that all
vectorized fast ME methods exhibit very little performance improvement
going from an SIMD datapath width of 32 bits to 128 bits. The superiority of
FS is seen in fast moving sequences such as 'Rotating City', as shown in Figure
4.9.
Figure 4.7 Average fractional complexity reduction over all sequences vs. search range vs.
VLMAX for Full Search. Upper curve VLMAX = 4, lower curve VLMAX = 16.
Figure 4.8 Fractional DIC of Three Step Search (TSS) over all sequences vs. search range
(2P+1) vs. VLMAX.
This suggests that, unless the use of FS is mandatory for best PSNR, an SIMD 
datapath of 32 bits appears to be sufficient to capture most of the DLP 
available in fast ME MPEG2 implementations. However it is also clear that if a 
Full search is necessary, DLP methods can have a significant impact on 
performance. 
Figure 4.9 Average PSNR over search range for the 'Rotating City' sequence
4.4 Vector ISA and Programmer's Model
This section discusses the ISA and the programmer's model of the parametric
vector coprocessor. The latter defines a flexible architecture consisting of up to
eight 32-bit scalar registers, used primarily for memory address calculation; a
parametric scalar accumulator, used when executing the vector SAD
instruction; and the vector length register VLEN, which specifies the number
of byte elements of the target vector register that will be affected by the
currently executing vector instruction. Finally, there are up to VR_MAX
parametric-length vector registers making up the vector register file (the
maximum vector length can be set to any value between 4 and 1024 bytes at
elaboration time). These registers hold the luminance data prior to
executing the VSAD instruction, for stall-free operation. The programmer's
model [74] is summarized in Figure 4.10 [74] and the coprocessor ISA in Table
4.5 [74]. Later sections discuss the hardware architectures and
implementations of this vector architecture (coprocessor).
Figure 4.10: Coprocessor Programmer's Model
[Diagram: Vector Register File, VR0 to VRMAX-1, each of VLMAX 8-bit elements;
Vector Length Register VLEN (8 bits); Scalar Register File, SR0 to SRMAX-1,
each 32 bits; Scalar Accumulator SACC (32 bits).]
As is evident from Table 4.5, which shows the instructions of the ISA,
there are six instructions (labeled MV) that transfer scalar data between the
RISC CPU and the coprocessor vector and scalar registers and the scalar
accumulator.
There are also two major vector load/store opcodes (VLD and VST)
which transfer arbitrary byte-aligned vectors between memory and the
vector register file. In addition, the ISA contains three vector-operate
instructions that perform the vector SAD: VSAD and its sub-pel variations,
VAVG2SAD and VAVG4SAD.
Table 4.5 Coprocessor ISA [74]
COMMAND     DESCRIPTION
LDVLEN      Load coprocessor vector length register with immediate value
MVR2CSR     Move RISC GP register to coprocessor scalar register
MVR2CACC    Move RISC GP register to coprocessor accumulator
MVCACC2R    Move coprocessor accumulator to the RISC processor
MVCSR2R     Move coprocessor scalar register to the RISC processor
MVCVR2R     Move coprocessor register element (32-bit) to RISC CPU
MVR2CVR     Move RISC CPU register (32-bit) to coprocessor vector
            register element
ADDREDACC   Add intermediate value to the residual
VLD         Load VLEN 8-bit elements into coprocessor vector
            register from virtual address.
VST         Store VLEN 8-bit elements into virtual address.
VSAD        Compute the absolute value of the difference of two 8-bit
            numbers. Accumulate result in temporary accumulator.
VAVG2SAD    Average two 8-bit numbers and compute the absolute
            value of the difference of the average with a third 8-bit
            number. Accumulate result into temporary accumulator.
VAVG4SAD    Average four 8-bit numbers and compute the absolute
            value of the difference of the average with a fifth 8-bit
            number. Accumulate result into temporary accumulator.
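The per-element behaviour of the three vector-operate instructions of Table 4.5 can be sketched in scalar C as below. This is an illustrative model only: the function names are hypothetical, and the rounding of the averages (adding 1 before halving and 2 before quartering, as is usual for MPEG2 half-pel interpolation) is an assumption, since the table does not specify it.

```c
#include <assert.h>
#include <stdint.h>

static int abs_diff(int a, int b) { return a > b ? a - b : b - a; }

/* One lane of VSAD: |ref - cur|. */
static int vsad_elem(uint8_t ref, uint8_t cur)
{
    return abs_diff(ref, cur);
}

/* One lane of VAVG2SAD: |avg(r0, r1) - cur| (rounding assumed). */
static int vavg2sad_elem(uint8_t r0, uint8_t r1, uint8_t cur)
{
    return abs_diff((r0 + r1 + 1) >> 1, cur);
}

/* One lane of VAVG4SAD: |avg(r0..r3) - cur| (rounding assumed). */
static int vavg4sad_elem(uint8_t r0, uint8_t r1, uint8_t r2, uint8_t r3,
                         uint8_t cur)
{
    return abs_diff((r0 + r1 + r2 + r3 + 2) >> 2, cur);
}

/* VAVG2SAD over vlen lanes, accumulating as the temporary accumulator does. */
static int vavg2sad(const uint8_t *a, const uint8_t *b, const uint8_t *cur,
                    int vlen)
{
    assert(vlen >= 0);
    int acc = 0;
    for (int i = 0; i < vlen; i++)
        acc += vavg2sad_elem(a[i], b[i], cur[i]);
    return acc;
}
```

The operand counts (two, three and five source values per lane) are what drive the register file read bandwidth discussed for the hardware implementation.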
4.5 DLP MICRO ARCHITECTURE (Hardware) 
The vector accelerator (coprocessor) is designed to be attached to a
configurable, extensible SPARC-V8 compliant [32] CPU core available under
the GPL (General Public License). The LEON core [90] is an open source VHDL
implementation of the SPARC core.
SPARC is a CPU instruction set architecture (ISA), derived from the
reduced instruction set computer (RISC) lineage. As an architecture, SPARC
allows for a wide spectrum of chip and system implementations at a variety
of price/performance points for a range of applications, including
scientific/engineering, programming, real-time and commercial approaches.
A SPARC processor logically comprises an integer unit (IU), a floating-point
unit (FPU) and an optional coprocessor (CP), each with its own registers.
The LEON2 VHDL model [90] implements a 32-bit processor
conforming to the SPARC V8 architecture [91]. It is designed for embedded
applications with separate instruction and data caches. The VHDL model is
fully synthesizable with most synthesis tools and can be implemented on both
FPGAs and ASICs, and the LEON processor fits in a Xilinx Virtex XCV300
[70]. The major identified blocks of the present microarchitecture are the
Scalar CPU and the Vector Coprocessor. A detailed schematic of the
microarchitecture is shown in Figure 4.11.
4.5.1 Scalar CPU 
Figure 4.11 [74] shows a detailed view of the microarchitecture of the 
combined processor/coprocessor, indicating the bi-directional 
communication channel across the scalar CPU and the vector accelerator. 
The main CPU is a standard 5-stage RISC pipeline. Instructions are
fetched from the instruction cache and clocked into the instruction register. In
the DECODE stage the latched instruction is decoded. The bypass logic in
DECODE determines whether register file data (CPU scalar registers) or
internally pipelined results are to be clocked into the ALU input registers.
During EXEC, the ALU operation is performed and a virtual address is
calculated. Data cache access takes place during DMEM/EXEC2 and scalar
results also return to the RISC pipeline during this cycle. Finally, results are
clocked into an intermediate register prior to committing to the processor
register file.
Figure 4.11 Processor/Coprocessor Microarchitecture [53]
4.5.2 Vector Coprocessor 
The vector coprocessor block interfaces to the core processor over the
interface described above. It consists of the parametric vector datapath, the
memory pipeline and the control path (Figure 4.11). The vector datapath
comprises the vector register file and bypass logic, a number VLMAX of
datapath elements arranged in parallel and the reduction logic/accumulator.
The datapath is pipelined over three stages. The vector register file (RF
5r/1w) is space critical and an expensive resource in the coprocessor
microarchitecture. It supplies operands to the vector datapath and is required
to provide a maximum bandwidth of five vector reads and one vector write
per cycle in order to execute VAVG4SAD, and three vector reads and one
vector write per cycle for VAVG2SAD, as shown in Table 4.5.
Figure 4.12 Vector Register File (Dual port SRAM based) [74]
[Diagram: one write port (wr_data, wr_addr) and five read address ports
(rd_addr1 to rd_addr5) driving five VRMAX x VLMAX memory blocks that supply
operands opr1 to opr5; address ports are Log2(VRMAX) bits wide and data
ports are VLMAX elements wide.]
There are a number of ways to satisfy these requirements in a typical
ASIC/FPGA flow.
Figure 4.12 [74] shows a simple implementation of the vector register
file which is dual port SRAM based. Alternatively, one could resort to a flop
or latch-based register file with the additional benefit of high speed access
(faster than the RAM-based solution), but at the expense of substantially
increased power consumption. Section 4.6 presents the silicon area and
maximum operating frequency for flop, latch and RAM-based vector register
files for a high performance ASIC technology.
Figure 4.13 Scalar Datapath [74]
For the MPEG2 TM5 workload, a VR_MAX (Figure 4.12) value of 8 is 
sufficient for the stall-free implementation of the vectorized DIST1 function. 
Similarly, a VLMAX of 16 bytes (128 bits) and subsequently, a vector datapath 
consisting of 16 scalar elements (Figure 4.11), is the default parameter for the 
workload. Figure 4.13 shows the scalar datapath [74] microarchitecture. There 
are VLMAX such datapaths in the EXEC stage, each capable of performing 
the SAD, AVG2SAD and AVG4SAD operations (Table 4.5). 
Figure 4.14 Add-Reduction Logic [74]
The reduction logic occupies EXEC2 in Figure 4.11. This is the second
execution stage, during which the vector result produced in the previous
cycle is add-reduced to a single value which is committed to the scalar
accumulator register at the end of the cycle. Figure 4.14 details the
implementation of the add-reduction logic for VLMAX=16 bytes (128 bits),
consisting of ceil(log2(17))+1=6 full-adder stages, each of increasing bit width.
It shows the vector addition of two vector operands (each 16 bytes).
The final major section in the coprocessor microarchitecture is the memory
pipeline, which is responsible for supplying data to the vector register file via
vector load/store operations. This is a critical block in the system
microarchitecture since it implements the datapath to and from the off-chip
SDRAM (see Figure 4.16). The block consists of the vector LSU (load store
unit), the memory controller and the on-chip-bus interface unit. It includes a
parametric number of scalar address/data registers, the address increment
logic, the vector data cache, the vector write buffers and the on-chip-bus (AHB)
controller. The scalar registers hold 32-bit values which are either copied to
individual elements of a vector register or which function as address pointers
for vector load/store operations.
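Returning to the add-reduction stage described above, its arithmetic can be sketched in C as a pairwise (tree) summation of the per-lane results followed by accumulation into the scalar accumulator. The sketch models only the arithmetic; it does not attempt to reproduce the pipeline staging or the adder count of Figure 4.14, and the names 'add_reduce' and 'N_LANES' are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define N_LANES 16   /* VLMAX = 16 lanes, one absolute difference per lane */

/* Tree-reduce the lane results and add them to the accumulator value sacc.
   Each pass halves the number of partial sums, so the operand width needed
   grows by one bit per pass (8-bit inputs give at most a 12-bit sum). */
static uint32_t add_reduce(const uint8_t lanes[N_LANES], uint32_t sacc)
{
    uint32_t work[N_LANES];
    int n = N_LANES;

    assert(lanes != 0);
    for (int i = 0; i < N_LANES; i++)
        work[i] = lanes[i];

    while (n > 1) {
        /* work[i] is written only after work[2i] and work[2i+1] are read */
        for (int i = 0; i < n / 2; i++)
            work[i] = work[2 * i] + work[2 * i + 1];
        n /= 2;
    }
    return sacc + work[0];   /* commit to the scalar accumulator */
}
```

For sixteen lanes each holding the maximum value 255, the reduced sum is 4080, which fits comfortably in the 32-bit accumulator of the programmer's model even after many rows are accumulated.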
Figure 4.15 Scalar registers and update logic [74]
[Diagram: SRMAX x 32-bit register file written either from ROP(32) or from the
address update logic; sel_scalar_reg selects the register presented to the
update logic, which also takes VSTATE.VLEN; output res/VADDR (32 bits).]
The data is written from the main RISC processor using the MVR2CSR
instruction and can be read back into the RISC register file using the
MVCSR2R instruction (see Table 4.5). However, when the scalar registers are
used as address pointers in vector load/store operations, the programmer can
specify post-increment/pre-decrement modes of operation. In this case, the
address register is updated after/before the load/store operation by the number
of bytes transferred, which is always equal to the (dynamic) vector length as
specified in the VLEN register. The address registers' block schematic is
shown in Figure 4.15 and can be implemented either as flops or as a single-port
compiled SRAM block of dimensions SRMAX x 32 bits. As shown in the
schematic, data commits either from the RISC processor (ROP) or from the
address update logic. sel_scalar_reg selects one of the scalar registers, which is
presented to the address update logic.
The virtual address is computed based on the current value of the
VSTATE.VLEN register, the addressing mode (pre/post) and the direction
(inc/dec). Finally, the computed virtual address is passed to the LSU at the
end of the cycle. The vector data cache is proposed [74] to supply the high-
bandwidth vector register file with operands.
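The address update rule above can be sketched as a small C function. The sketch covers only the two modes named in the text (post-increment and pre-decrement) plus a no-update case; the type 'addr_mode_t' and the function 'next_vaddr' are hypothetical names introduced for illustration and are not taken from the VHDL model.

```c
#include <assert.h>
#include <stdint.h>

typedef enum { ADDR_NO_UPDATE, ADDR_POST_INC, ADDR_PRE_DEC } addr_mode_t;

/* Return the virtual address presented to the LSU for one vector
   load/store and update the scalar address register by vlen bytes
   according to the addressing mode. */
static uint32_t next_vaddr(uint32_t *sreg, uint32_t vlen, addr_mode_t mode)
{
    uint32_t vaddr;

    assert(sreg != 0);
    switch (mode) {
    case ADDR_POST_INC:        /* use the address, then advance it */
        vaddr = *sreg;
        *sreg += vlen;
        break;
    case ADDR_PRE_DEC:         /* step back first, then use the address */
        *sreg -= vlen;
        vaddr = *sreg;
        break;
    default:                   /* plain access, register unchanged */
        vaddr = *sreg;
        break;
    }
    return vaddr;
}
```

With VLEN = 8, two consecutive post-increment loads from a register holding 0x1000 would access 0x1000 and 0x1008, leaving 0x1010 in the register, so a row of pixels can be streamed without separate address arithmetic on the RISC CPU.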
The microarchitecture (Figure 4.16) utilizes the industry-standard AHB
(Advanced High Performance Bus) [74] to connect the Scalar CPU, the DMA
engine and the Coprocessor LSU (Load/Store Unit) to a single memory slave
(PC133 SDRAM controller). The AHB datapath is 32 bits wide and clocked at
the same rate as the rest of the system [74]. The final microarchitectural block
of the accelerator is the control pipeline, which decodes the instruction latched
in the main processor instruction register and produces the control signals for
the datapath and the memory pipeline. It also regulates the flow of data
between the scalar CPU and the coprocessor, providing flow control to the
main CPU in the case of control/data dependencies in the coprocessor.
4.6 Vector Datapath Hard Macro 
The VHDL code of the vectorized (extended) LEON2 coprocessor was written,
simulated and synthesized with the help of the ESD group [41]. Three variants
of the vector accelerator microarchitecture have been taken to layout, utilizing
dual-port SRAM, flop and latch-based vector register files.
The scripting process [41], [89] iterated over tens of potential
implementation candidates of the physical cluster and achieved a local
minimum at an aspect ratio of 0.4-0.6 (width:height) and a pre-route
utilization of 85%, for the RAM-based register file configuration.
Subsequently, the aspect ratio was fixed at 0.5 for all the vector register file
configurations. Figures 4.17, 4.18 and 4.19 depict the floorplan (left portion of
figure) and layout (right portion of figure) of the vector datapath cluster for
the SRAM, flop and latch-based vector register file configurations
respectively. These figures illustrate the silicon area for the vector register
file. The scripts collect the area/performance output from the synthesizer in a
final results file which can be viewed in a spreadsheet.
Figure 4.16 High level view of system showing main CPU, Vector coprocessor and off-chip
SDRAM [53]
[Diagram: scalar CPU (core CPU, ICACHE, DCACHE) and coprocessor (datapath,
VDCACHE), each with an AHB master bus controller, connected over the AHB to
the SDRAM controller (AHB slave) and the off-chip SDRAM.]
Figure 4.17: Floorplan and Layout of RAM-based. 
Figure 4.18: Flop-based VRF configuration. 
Figure 4.19: Latch-based VRF configuration. 
The latch-based configuration, in particular, is penalized by the routing of a 
second, double-frequency clock used to generate the write-strobe. The 
physical results are summarized in Table 4.6. 
Table 4.6: Datapath physical macro data
Config  Fmax (MHz)  Util (%)  Std Cells + RAMs  Area (µm²)
SRAM    312         93.02     26777+5           457x914 (417689)
Flops   278         88.28     59084+0           460x938 (431480)
Latch   229.7       91.76     46519+0           488x1003 (489464)
4.7 Thread Level Parallelism (TLP) 
A second method of improving the real-time performance of the
computationally intensive functions that form MPEG2 motion estimation is to
implement the encoder on a multiprocessor (several processor cores) System
on Chip (SoC) architecture. In reference [28] a simulation tool based on
SystemC is implemented for the performance evaluation of a multiprocessor
SoC; it contains models of processors (ARM), memory, the AMBA (Advanced
Microcontroller Bus Architecture) bus, etc. In reference [30] an ARM7 based
multiprocessor SoC is implemented in which a variable number of processors,
from two to ten, is used to reduce the workload of a video encoder. In this
work a multiprocessor SoC is presented to reduce the workload of a video
encoder (MPEG2), based on the LEON2 processor (SPARC-V8 compliant) [90],
[91], augmented with data parallel coprocessors.
In order to exploit both forms of its natural parallelism, in addition to
the DLP described above, the process of motion estimation (ME) is
multithreaded, i.e. the ME function [40], which calls the Full Search function
[40] sequentially for all macroblocks in the current frame, is parallelized such
that (up to) all the macroblocks in a row of the current frame (depending upon
the maximum number of allocated threads) are executed concurrently in
order to further reduce the computational load of ME. The combined effect of
DLP and TLP (which, here, involves concurrent execution of the search of
current macroblocks in a row) on the performance of both Full Search and fast
ME methods has been assessed through experimentation as well as through a
simple theoretical investigation (Chapter 5).
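The row-level parallelization described above can be illustrated with a short sketch of how the macroblocks of one row might be shared among threads. A round-robin assignment is assumed here purely for illustration (the text does not state the scheduling policy), and the function names are hypothetical. The quantity of interest is the load of the busiest thread, since it bounds the number of sequential macroblock searches per row.

```c
#include <assert.h>

/* Thread that processes macroblock column mb_x under the assumed
   round-robin assignment. */
static int owner_thread(int mb_x, int nthreads)
{
    assert(nthreads > 0 && mb_x >= 0);
    return mb_x % nthreads;
}

/* Number of macroblocks of one row handled by thread tid. */
static int mbs_for_thread(int mbs_per_row, int nthreads, int tid)
{
    assert(mbs_per_row >= 0 && nthreads > 0 && tid >= 0 && tid < nthreads);
    return mbs_per_row / nthreads + (tid < mbs_per_row % nthreads ? 1 : 0);
}

/* Sequential search rounds needed per row: the load of the busiest
   thread, which under round-robin is thread 0. */
static int rounds_per_row(int mbs_per_row, int nthreads)
{
    return mbs_for_thread(mbs_per_row, nthreads, 0);
}
```

For a 720-pixel-wide frame (45 macroblocks per row), eight threads would leave six searches on the busiest thread, whereas 45 threads would reduce this to one search per row; beyond that, further threads bring no benefit under this scheme.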
4.8 Parallel simulation methodology 
The ME function of the MPEG2 encoder was implemented with multi-threaded
and data-parallel algorithms. The coding for this implementation was
performed with the help of the ESD group, in the Department of Electronic and
Electrical Engineering, at Loughborough University. The programming model
followed adheres to shared-memory execution semantics and care was taken
to ensure that variables were assigned private (per-thread) storage in order to
avoid asynchronous modification by multiple threads. Subsequently,
the threaded MPEG2 code was compiled and validated on a custom, multi-
context instruction-set simulator (MTISS) based on the SimpleScalar
infrastructure [93]. The simulator was modified by enhancing the semantics of
its existing instructions to permit arbitrary multi-context system modelling
[41], [87]. It models each instruction as executing in a single unit of time,
irrespective of whether it is an arithmetic, logical, branch or load/store
operation. Memory accesses by more than one processor to the same memory
location in the same 'time slot' (cycle) are serialized by the simulator; thus the
tool can be classified as an EREW PRAM (exclusive read, exclusive write
parallel RAM) model. The results collected on the MTISS in the current work
are based on the performance of the main thread (thread 0). This thread is
chosen for comparison purposes as it has the highest computational
requirements; being the parent thread, it is responsible for system I/O,
initiating the execution of other threads, coordinating their completion and
executing all non-threaded (non-parallelizable) code. To understand what
is meant by a thread we must first discuss processes.
A process is the unit of work in a computer system [94], [95]. There are 
two aspects to any process, a static part and a dynamic part. The static part, 
also referred to as a task, involves the resources allocated to the process. This 
includes a certain amount of space in memory, a current working directory, 
sources of input and output such as a keyboard, screen and open files, and 
maybe a connection with another process over a network. The most important 
resource which any process has, though, is a program. In this context a 
program is just a sequence of instructions and a process runs an instance of 
that program. The dynamic part of a process can be described as 'a program 
in action'. When the instructions that make up a program are actually being 
carried out, the CPU works its way through the program in a particular 
pattern. This dynamic part of a process is known as a 'thread of execution' or 
just a thread. A thread has access to all of the resources (the address space of a 
process, etc) assigned to the task. 
The operating system creates a process for the purpose of running a 
program. Every process has at least one thread. On some operating systems it 
is possible for a process to have more than one thread. A thread is a 
semi-process, in that it has its own stack (Figure 4.20) and executes a given 
piece of code [94]. Unlike a real process, the thread normally shares its 
memory with other threads (whereas processes usually have a different memory 
area assigned to each of them). A Thread Group is a set of threads all 
executing inside the same process. They all share the same memory, and thus 
can access the same global variables, the same set of file descriptors, etc. 
These threads execute concurrently, either by time slicing on a single 
processor or, if the system has several processors, genuinely in parallel. 
Threads exist within the 
environment of a process - part of this environment is shared by all threads 
belonging to the same process, whilst the remainder is specific to individual 
threads. 
Figure 4.20 illustrates the various components of the thread 
environment in a process. The shared environment comprises executable 
code, static storage and dynamic storage, whilst the thread-specific context 
comprises stack storage, CPU registers and user-defined keys. The remaining 
elements of the thread environment principally contain data. Data values in 
static storage are declared as global variables [94] and exist for the entire time 
that the program is running. Static storage is shared, so if several threads use 
the same data in static storage they must synchronize their activities to 
prevent data corruption. On the other hand data held in a thread's stack is 
specific to an individual thread - the stack forms part of the thread context 
and is switched whenever a new thread is allocated processor time. 
[Figure 4.20: diagram of a process containing three threads (Thread 1, 
Thread 2, Thread 3), each with its own Stack, Registers and Keys; the Code 
and Static Storage are shared by all threads.] 
Figure 4.20 Environment of a process 
Data in the CPU registers also forms part of the thread-specific context 
and is updated during a context switch. The scheduler may interrupt an 
active thread at any point in its execution so the next program instruction to 
be executed by this thread must be remembered. The relevant address within 
the program code is held in a CPU register. When a multi-threaded program 
starts executing, it has one thread running, which executes the main() 
function of the program. This is already a fully-fledged thread, with its own 
thread ID (identification number). 
The code in Figure 4.21a shows the simplified form of the single 
threaded version of the motion_estimation( .. ) function [40]. 
void motion_estimation(oldorg,neworg,oldref,newref,cur,curref,
                       sxf,syf,sxb,syb,mbi,secondfield,ipflag)
{
  /* loop through all macroblocks of the picture */
  for (j=0; j<height2; j+=16)
    for (i=0; i<width; i+=16)
    {
      if (pict_struct==FRAME_PICTURE)
        frame_ME(oldorg,neworg,oldref,newref,cur,i,j,sxf,syf,sxb,syb,mbi);
      else
        field_ME(oldorg,neworg,oldref,newref,cur,curref,i,j,sxf,syf,sxb,syb,
                 mbi,secondfield,ipflag);
      mbi++;
    }
}
Figure 4.21a Non-threaded ME process of MPEG2 TM5 
In Figure 4.21a, 'oldorg' is the source frame for forward prediction (used 
for P and B frames), 'neworg' is the source frame for backward prediction 
(B frames only), 'oldref' is the reconstructed frame for forward prediction 
(P and B frames), 'newref' is the reconstructed frame for backward prediction 
(B frames only), 'cur' is the current frame (the one for which the prediction 
is formed), 'sxf' and 'syf' define the forward search window (frame 
coordinates), 'sxb' and 'syb' define the backward search window (frame 
coordinates) and 'mbi' is a pointer to the macroblock info structure defined 
in the file 'mpeg2enc.h' (see the MPEG2 TM5 reference code [40]). 
Here 'secondfield' indicates that the second field of a frame should be 
used and 'ipflag' indicates the case in which the first field of a frame is 
of I type and the second field is of P type. The purpose of this function is 
to execute either Full Search, by calling the fullsearch(..) function [40], 
or the fast ME methods [41], through the frame_ME(..) function (Figure 4.21a) 
[40], for each macroblock in the current frame. Assuming progressive ME (see 
Chapter 2), it calls the frame_ME(..) function iteratively, in a sequential 
manner; this in turn calls either the fullsearch(..) function to execute the 
Full Search strategy or a fast ME method to execute the sub-sampled searching 
strategy, until all macroblocks in the current frame have been searched. 
The aim here is to parallelize the inner 'for' loop (Figure 4.21a) by 
assigning a thread to each macroblock of a row. These threads will run in 
parallel with each thread running on a separate CPU core. The work done on 
multithreading by the ESD group prior to this work was to multithread the 
inner 'for' loop in Figure 4.21a by inserting the multithreading macros [41]. 
That multithreading targeted the Full Search algorithm only. In the 
current work the multithreading targets both the Full Search algorithm and 
the Fast ME methods. The main macros used are shown in Figures 4.21b and 
4.21c. 
#define GET_PRIDR(var) \
{ \
  SET_GPR(RD_ADDR, (xcregs.xcregs[context].regs_C.PRID)); \
}
Figure 4.21b A macro used to get the processor context ID 
#define BARRIER \
{ \
  xcregs.rendezvous++; \
  xcregs.xcregs[context].regs_C.PSTATER = XC_BARRIER; \
}
Figure 4.21c A macro used to synchronize the threads/processor contexts 
In Figure 4.21b, the 'GET_PRIDR(var)' macro is used to get the processor ID 
(context). SET_GPR(..) sets a general purpose register, i.e. it assigns its 
right-hand argument to its left-hand argument. In this macro the processor 
ID, i.e. 'xcregs.xcregs[context].regs_C.PRID', is assigned to RD_ADDR (a 
destination register holding the address of the temporary register '$12' 
[32], [88], which contains the processor ID). 'xcregs' is the extended context 
inserted into the instruction space of the instruction-set simulator. It 
contains an array 'xcregs[context]' of register files, each register file 
corresponding to one processor context. 'regs_C' is the control register 
which holds the processor ID 'PRID' and the state. Similarly, in Figure 4.21c, 
the macro 'BARRIER' is used to synchronize all the threads in the application 
code. The 'rendezvous' variable is declared inside the structure 'xcregs' and 
is used to indicate when all the threads have arrived at the 'BARRIER'. The 
state 'PSTATER' of a thread (context) arriving at the 'BARRIER' is assigned 
the value 'XC_BARRIER' to indicate that it has arrived. Once all the threads 
have arrived they are reactivated concurrently and run in the same fashion 
for the next portion of the code. Note that these macros actually contain 
assembly syntax, which is not shown here for simplicity, to call the 
instructions shown inside the macros, which are inserted into the instruction 
space of the simulator. 
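The rendezvous behaviour described above can be sketched in plain C. This is only an illustrative model, not the simulator code: the structure layout, the `XC_RUNNING` state, `NUM_CONTEXTS` and the function names are assumptions introduced here, and the contexts are stepped one after another rather than running concurrently.

```c
#include <assert.h>

#define NUM_CONTEXTS 4   /* hypothetical number of processor contexts */

/* XC_BARRIER appears in Figure 4.21c; XC_RUNNING is an assumed name. */
enum pstate { XC_RUNNING, XC_BARRIER };

struct xc_model {
    int rendezvous;                     /* contexts currently at the barrier */
    enum pstate pstater[NUM_CONTEXTS];  /* per-context state (cf. PSTATER)   */
};

/* Reset the model: nobody is waiting and every context is running. */
void xc_init(struct xc_model *xc)
{
    int c;
    xc->rendezvous = 0;
    for (c = 0; c < NUM_CONTEXTS; c++)
        xc->pstater[c] = XC_RUNNING;
}

/* Context `context` reaches the barrier. Returns 1 if this arrival was
   the last one and all contexts were released, 0 if it has to wait. */
int xc_barrier_arrive(struct xc_model *xc, int context)
{
    int c;
    assert(context >= 0 && context < NUM_CONTEXTS);
    assert(xc->pstater[context] == XC_RUNNING);

    xc->rendezvous++;                   /* as in the BARRIER macro */
    xc->pstater[context] = XC_BARRIER;

    if (xc->rendezvous < NUM_CONTEXTS)
        return 0;                       /* still waiting for other contexts */

    /* Last arrival: reactivate every context and reset the counter. */
    for (c = 0; c < NUM_CONTEXTS; c++)
        xc->pstater[c] = XC_RUNNING;
    xc->rendezvous = 0;
    return 1;
}
```

Calling `xc_barrier_arrive` once per context, in any order, releases all contexts on the final call, which mirrors the "arrive, wait, then continue together" semantics described for the 'BARRIER' macro.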
In the current work the Full Search algorithm was first replaced by the 
Fast ME methods. The code was simulated with the MTISS [41] and the relative 
complexity reduction was calculated. The second attempt is based on the idea of 
multithreading the Full Search algorithm instead of multithreading the inner 
'for' loop shown in Figure 4.21a. The idea is based on the fact that assigning a 
processor context to a candidate motion vector may require less hardware 
compared to the processor context assigned to the current macroblock for the 
whole search area. For this purpose the default Full Search algorithm, which 
is implemented in a spiral fashion, is first converted to raster fashion, 
consisting of two 'for' loops. Then the above macros were inserted to replace 
the inner 'for' loop of the Full Search algorithm. This involved the creation 
of an array to hold the 'SAD' values obtained for all the candidates in a 
row, equal in number to the maximum number of allocated processor contexts. 
An extract of the code is shown in Figure 4.21d. 
dmin_array[context+i*MAX_THREAD] =
  dist1(org+lowx+context+(i*MAX_THREAD)+lx*j,blk,lx,0,0,h,dmin3);
Figure 4.21d An array of SAD values corresponding to each processor context 
Although the output complexity statistics (DIC) were slightly higher than 
those obtained with the default spiral search, owing to the additional code 
involved (in the alternative TLP), the hardware saving due to the reduced 
thread context may compensate for this. 
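The raster-order, per-candidate threading described above can be sketched as follows. This is a sequential emulation under stated assumptions, not the thesis code: `sad16` stands in for the TM5 `dist1` function without half-pel interpolation or early termination, `raster_fullsearch` and its argument list are hypothetical, and the loop over `context` is executed serially where the multi-context machine would run one iteration per processor context.

```c
#include <assert.h>
#include <limits.h>
#include <stdlib.h>

#define MAX_THREAD 4   /* number of processor contexts (assumed value) */

/* Sum of absolute differences between the 16 x h block at `blk` and the
   block at `ref`; both lie in frames with line stride `lx`. */
static int sad16(const unsigned char *ref, const unsigned char *blk,
                 int lx, int h)
{
    int x, y, s = 0;
    for (y = 0; y < h; y++)
        for (x = 0; x < 16; x++)
            s += abs((int)ref[y * lx + x] - (int)blk[y * lx + x]);
    return s;
}

/* Raster-order Full Search over candidate positions x in [lowx, highx]
   and y in [lowy, highy] of the frame `org`. Returns the minimum SAD and
   stores the best position in *best_x and *best_y. */
int raster_fullsearch(const unsigned char *org, const unsigned char *blk,
                      int lx, int lowx, int highx, int lowy, int highy,
                      int h, int *best_x, int *best_y)
{
    int dmin_array[MAX_THREAD];   /* one SAD slot per context */
    int dmin = INT_MAX;
    int j, base, context;

    for (j = lowy; j <= highy; j++) {
        for (base = lowx; base <= highx; base += MAX_THREAD) {
            /* One candidate per context; slots beyond the end of the
               row are padded so that they can never win. */
            for (context = 0; context < MAX_THREAD; context++) {
                int x = base + context;
                dmin_array[context] = (x <= highx)
                    ? sad16(org + x + lx * j, blk, lx, h) : INT_MAX;
            }
            /* Reduction step, performed once all contexts have finished
               (i.e. after the barrier on the real machine). */
            for (context = 0; context < MAX_THREAD; context++) {
                if (dmin_array[context] < dmin) {
                    dmin = dmin_array[context];
                    *best_x = base + context;
                    *best_y = j;
                }
            }
        }
    }
    return dmin;
}
```

The extra array and the separate reduction loop are the kind of additional code that would account for a slightly higher instruction count than a spiral search with early exit.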
Figure 4.22 [94] illustrates how stack space is allocated (in the case of 
multithreading per processor) whenever a function call is made and then 
de-allocated after the function returns. 
[Figure 4.22: three stack snapshots. Before Call: Free Space above Used 
Space. During Call: Free Space, Stack Frame, Used Space. After Call: Free 
Space above Used Space.] 
Figure 4.22 Allocation of Stack Space 
4.9 Results and discussion 
The non-threaded MPEG2 encoder was profiled to identify the distribution of 
the computational load. The results of this per-function profiling for all 
fourteen implemented fast ME algorithms are given in Figure 4.23. As 
expected from our previous analysis, the constituent functions of the fast 
algorithms do not exhibit any dependency on search range. Note also that the 
fractional DIST1 complexity is roughly the same for all sequences. This 
indicates that exploiting data-parallelism within fast ME will not yield 
significant benefits. Subsequently, the performance of both the threaded and 
the vectorized fast ME-based MPEG2 encoder was measured for seven CIF-format 
video sequences, each consisting of 25 frames. The sequences were encoded 
using Full Search as well as the thirteen fast ME algorithms introduced in 
Chapter 3. 
[Figure 4.23: bar chart titled "MPEG-2 Sub-sampling ME Performance 
Profiling" (Bowing sequence, 12 frames, 12 F/GoP, CIF). Horizontal axis: 
sub-sampling algorithm. Bars are broken down by encoder function; the legend 
entries are not fully recoverable from the source.] 
Figure 4.23 Per-function profiling of MPEG-2 for sub-sampling algorithms. 
Figures 4.24 and 4.25 depict the performance improvement of the data-
parallel implementations for the fast ME methods (see also section 4.3, Figure 
4.8), for the single-thread case, compared to the non-vectorized, Full Search 
method, for a search window range of 7 to 81 pels for the Paris Sequence. 
Both figures include the Full Search data-parallel performance results for ease 
of comparison. 
It is again clear that wide data-parallel hardware is not the preferred 
means of accelerating a fast ME-based encoder as is shown by the very close 
spacing of the data points for 4, 8 and 16-byte wide vector lengths [96]. 
[Figure 4.24: plot titled "Data-parallel fast ME performance (Fast ME SET 
1)", Paris sequence, CIF, 25 frames, 12 GoP. Horizontal axis: FS search range 
(pels). Series: TSS, NTSS, FSS and FS, each at 32-bit, 64-bit and 128-bit 
vector widths.] 
Figure 4.24 Data-parallel MPEG-2 performance for fast ME (Set 1) 
[Figure 4.25: plot titled "Data-parallel fast ME performance (Fast ME SET 
2)", Paris sequence, CIF, 25 frames, 12 GoP. Horizontal axis: FS search range 
(pels). Series: DS, HDS, CENB_DS and FS, each at 32-bit, 64-bit and 128-bit 
vector widths.] 
Figure 4.25 Data-parallel MPEG-2 performance for fast ME (Set 2) 
[Figure 4.26: plot. Horizontal axis: number of processors (2 to 32). Series: 
Three step, Four-step, Diamond, Cenb diamond, Orthogonal, Large diamond.] 
Figure 4.26 Thread-parallel MPEG-2 performance for fast ME (Set 1) 
[Figure 4.27: plot. Horizontal axis: number of processors (2 to 32). Series: 
Conjugate, Cross, 2D log, Spiral, New three step, Gradient.] 
Figure 4.27 Thread-parallel MPEG-2 performance for fast ME (Set 2) 
Figures 4.26 and 4.27 depict the reduction in the per-processor dynamic 
instruction count of the threaded encoder utilizing only the fast ME 
algorithms. When encoded using 22 CPU contexts, a reduction in the per-
processor dynamic instruction count of between 50% and 60% is observed, 
compared to the corresponding single-threaded implementations. Note that 
the results in Figures 4.26 and 4.27 correspond to the multithreading of the 
Motion Estimation (ME) loop together with the multithreading of the 
Transformation loop, i.e. the loop surrounding the Forward and Inverse 
Discrete Cosine Transformation (DCT) (see transform.c in [40]). 
The relative DIC corresponding to multithreading of the ME loop alone 
was also calculated. The complexity reduction achieved by multithreading 
the ME loop alone was found to be 10% smaller than that for the combined 
multithreading of both the ME and DCT (forward and inverse) loops. The 
transformation loop involves the application of the DCT, during encoding, to 
the 8x8 luma and chroma blocks (in the case of intra coding) and to the 8x8 
luma and chroma difference blocks (in the case of non-intra coding) inside 
the macroblock. The multithreading of the transformation loop was implemented 
by the ESD group [41] and involved allocating threads/processor contexts to 
the 8x8 luma and chroma component blocks for the concurrent calculation of 
their DCT components. The case of the inverse DCT during decoding is similar. 
For the ME part, the parallelized encoder assigns each macroblock (MB) of a 
row to a separate software thread, so a maximum of 22 CPU contexts (threads) 
is sufficient for the parallel execution of a single CIF frame. Any 
increase in the number of contexts beyond 22 (the number of MBs per row in 
the CIF image) shows no further benefit for the parallel MPEG-2 encoder 
performance. Owing to the nature of SoC implementations, a 22-context 
multiprocessor is unrealistic from silicon area and power perspectives, and 
the results of Figures 4.26 and 4.27 suggest that a low-order multi-processor 
architecture of 2-4 processors, with 4-byte data-parallel acceleration, is 
expected to be an acceptable configuration for a threaded and vectorized 
MPEG-2 encoder. 
4.10 Conclusions 
In this chapter the architecture specification and physical implementation of a 
parametric vector datapath were discussed. Starting at the algorithmic level, 
systematic modeling was applied and transformation techniques (adding 
processor state) were used to expose and evaluate the Data-Level-Parallelism 
of the inner loop of the Motion Estimation algorithm. In the process, a custom 
vector ISA was developed which was implemented as a tightly-coupled 
coprocessor attached to a RISC CPU. By targeting only the inner loop of the 
ME it was possible to use this microarchitecture to accelerate a number of 
Algorithmic ME methods without modification. By following a highly-
parameterized design approach, we were able to specify top-level 
architecture, microarchitecture, timing and physical implementation 
constraints which are propagated down the implementation flow via a 
scripting mechanism. In this way, it was possible to exhaustively probe the 
implementation space of the microarchitecture and converge to an 
appropriate physical solution. Ongoing work in the ESD group at 
Loughborough [41] uses this methodology at the SoC-kernel level, where the 
described accelerator data path and its control and memory pipelines are 
connected to the controlling RISC CPU, thus forming a complete SoC 
computation kernel for real-time MPEG2 video encoding. 
Another very significant source of parallelism in block-based video 
coding algorithms was then exploited, namely Thread-Level Parallelism. In 
this case, multiple processor contexts executed different sub-graphs of the 
control-flow graph of the algorithm while maintaining sequential semantics 
through Fork/Join operations. The results indicate that threading of the Full 
Search motion estimation provides a very significant extra benefit, which 
complements the Data-Level-Parallelism optimizations presented above. 
5. Theoretical Analysis of DLP and TLP 
5.1 Theoretical analysis 
One general thrust of this work is an assessment of ways of accelerating the 
calculation of Motion Vectors in Block-based Motion Estimation algorithms. 
This category of algorithms may usefully be split into two general techniques. 
First there is the Full Search (FSBME) algorithm, which imposes a fairly 
prohibitive computational burden on the CPU, but which requires only a 
relatively simple controller and allows a high level of data re-use [10]. 
The second comprises the fast motion techniques introduced in Chapter 3, 
which are significantly less burdensome but which also allow for very little 
data reuse (i.e. often the same data must be reread from external memory 
several times), making them rather inefficient. Fast methods also generally 
deliver somewhat poorer picture quality in terms of PSNR rating. 
The proving-ground for all fast methods has been a set of standard 
sequences, with current frames divided into 16 x 16 Macroblocks (MBs), and a 
search area of 32 pixels x 32 pixels centred at the location of the current frame 
MB, in the reference frame. This raises an additional difficulty for fast 
methods as their successes are not necessarily scalable, in an obvious manner, 
to those larger search areas of (2p+16) x (2p+16) pixels (allowed in MPEG2, 
MPEG4, etc.) which may be required for fast moving or complex, highly 
detailed sequences, in order to obtain an acceptable picture quality. This is 
particularly clear for example in the case of the commonly used Three Step 
Search (see Chapter 3.4). Picture quality of a transmitted sequence is 
generally quantified by the value of the peak signal-to-noise ratio (PSNR), 
which expresses on a dB scale the RMS difference between the current frame 
and the approximate version reconstructed from reference frame MBs. A larger 
value of p (a larger search area) will lead in general to a better PSNR 
(Figure 4.9). Increasing the number of steps in the fast searches, or 
lengthening the algorithm to accommodate a larger search area, is of course 
possible; however, this will potentially further increase the complexity of 
the controller in any hardware design. On the other hand, keeping the same 
algorithms as those trialled for the 32 x 32 search area reduces the 
likelihood of finding a global minimum. 
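For reference, the PSNR mentioned above is conventionally computed from the mean squared error between the current frame $f$ and its reconstruction $\hat{f}$. The expression below is the standard definition for 8-bit samples (peak value 255) and a $W \times H$ frame; it is stated here as an assumption, since the formula is not written out at this point in the text.

```latex
\mathrm{MSE} = \frac{1}{WH}\sum_{x=1}^{W}\sum_{y=1}^{H}
               \bigl(f(x,y)-\hat{f}(x,y)\bigr)^{2},
\qquad
\mathrm{PSNR} = 10\log_{10}\frac{255^{2}}{\mathrm{MSE}}\ \text{dB}
```

A smaller RMS difference therefore gives a larger PSNR, which is why a larger search range p, by allowing better matches, tends to raise it.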
A search window of (2p+16) x (2p+16) pixels corresponds to $s^2$ = (2p+1) x 
(2p+1) possible candidate blocks. A Full Search of this area involves an SAD 
calculation for all $s^2$ candidates; consequently the computational burden 
on the CPU increases with s. The central aim in Chapter 4 has been to consider 
the exploitation of any parallelism which may exist in these Block-Matching 
algorithms. For example, the regularity of the FS algorithm allows for the 
development of multiple threads, each thread performing the task of 
matching a single current frame MB to a different set of search area MBs, or of 
matching different current frame MBs to different (overlapping) search areas; 
likewise, SIMD architectures represent additional attractive possibilities. 
For the fast methods, on the other hand, whose next candidate block often 
depends upon the result of the present calculation, such architectures are 
likely to be less attractive. However it is also clear that there is a law of 
diminishing returns that operates here. Too much redundant logic on the chip 
can be wasteful; either through the use of vector registers and the 
corresponding logic support for an SIMD architecture; or through the extra 
replica processing elements required to run extra threads in addition to any 
possible higher pin-count. 
The aim of the current chapter, then, is to define some figures of merit 
for algorithmic performance with a view to investigating that level of 
parallelisation (both data-level and thread-level) which will provide an 
optimal performance for the various types of search (FS and fast methods). 
Redundant logic will necessarily consume additional energy, which is a 
commodity that is restricted (particularly for real-time and mobile 
applications), and will also occupy additional silicon real-estate. As a result, 
the following questions arise naturally: Can either the Full Search or the 
fast methods be made significantly faster using vector commands and 
multi-threading? For a search window of a particular size (i.e. given a value 
of s = 2p + 1), what is the best length of vector commands, and the best size 
and number of the vector registers? And again, what is the best number of 
threads to operate under these conditions? 
The answers to such questions naturally depend upon what is meant 
by 'best' in this context. For real-time encoding, the overriding requirement 
is that the frames be coded with a frequency greater than the frame rate 
$f_r$ (typically 25 - 30 frames per second (fps)). One means of estimating 
this frequency is the Dynamic Instruction Count (DIC), which is a 
standard technique for comparing the execution times of different algorithms. 
The DIC can be converted to an execution time using the average number of 
clock cycles required per instruction (CPI). Multiplying the DIC by the CPI 
(approximately 1.5 cycles for a scalar 32-bit CPU of a given memory 
structure) [97], [98], [99] provides an estimate of the total number of 
clock cycles required to execute the algorithm. If the CPU runs at a 
frequency f, then the time to encode a single frame is 
$$T\ (\text{seconds/frame}) = \frac{\mathrm{DIC}\ (\text{instructions/frame}) \times \mathrm{CPI}\ (\text{cycles/instruction})}{f\ (\text{cycles/second})}$$
To operate correctly this time must be less than 1/(frame rate), i.e. about 
1/30 seconds per frame, or $1/f_r$. Rearranging gives 
$$f > f_{min} = \mathrm{DIC} \times \mathrm{CPI} \times f_r \qquad (5.1)$$
As the values of CPI and $f_r$ are assumed to be fixed, the minimum 
required operating frequency for real-time coding is roughly proportional to 
the DIC. If the vector arrays in the SIMD architecture are of length n bytes 
then exploiting this DLP produces an algorithm with a DIC value of, say, 
C(n). If, in addition, a particular algorithm is performed on a system running 
m threads, i.e. with the workload distributed among m processor cores, the 
DIC will be functionally dependent on both n and m, as say C(n,m). 
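As a small worked sketch of eqn. (5.1), the functions below convert a per-frame DIC into a minimum clock frequency and a frame encoding time. The function names are hypothetical, and any DIC value passed in (for instance 10 million instructions per frame) is an illustrative figure rather than a measured result.

```c
#include <assert.h>

/* Minimum clock frequency (Hz) for real-time encoding, eqn. (5.1):
   f_min = DIC x CPI x f_r. */
double min_clock_hz(double dic_per_frame, double cpi, double frame_rate)
{
    assert(dic_per_frame >= 0.0 && cpi > 0.0 && frame_rate > 0.0);
    return dic_per_frame * cpi * frame_rate;
}

/* Time (seconds) needed to encode one frame at clock frequency f_hz. */
double frame_time_s(double dic_per_frame, double cpi, double f_hz)
{
    assert(f_hz > 0.0);
    return dic_per_frame * cpi / f_hz;
}
```

For example, with a hypothetical DIC of 1.0e7 instructions per frame, a CPI of 1.5 and a frame rate of 30 fps, `min_clock_hz` gives 450 MHz, and at that frequency `frame_time_s` gives 1/30 of a second, the real-time limit.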
In exploiting these forms of parallelism, data-level and thread-level, in 
an attempt to achieve the frequency requirement of eqn. (5.1), the area A, and 
hence the cost, will also increase. As a consequence there is a trade-off 
between $f_{min}$ and A or, equivalently, between DIC and A. A large chip, 
with a 
lower operating frequency, can be manufactured with increased parallelism, 
while a smaller chip running at a higher frequency can be obtained with a 
scalar architecture. 
With this in mind, and in addition to satisfying eqn. (5.1), it would 
appear sensible to try to optimise some combination of DIC and A. Here we 
define the (dynamic) Instruction Count-Area Product (ICAP), which is a form 
of area-delay product, to be 
ICAP(n,m) = C(n,m) x A(n,m) (5.2) 
for DLP with a scale parameter n and TLP with a scale parameter m. Thus for 
real-time coding, any design should attempt to satisfy eqn. (5.1) first, while at 
the same time trying to keep ICAP optimal or as close to optimal as possible. 
If encoding in real-time is not essential then ICAP may be a sensible cost 
function to optimize. 
For large search areas (large s values) it seems imperative that parallelism 
be exploited where possible, particularly in the case of the Full Search. As 
DLP is aimed at exploiting the parallelism within the inner-loop 'DIST1' 
function (see Figure 4.1), while TLP is aimed more at exploiting the 
parallelism in the outer loops (see Figures 5.1 and 4.21a), the two processes 
are orthogonal and may be assessed independently, gathering the results of 
both only at the end. 
for (Row_No = 1; Row_No<=18; Row_No++)
  for (Column_No = 1; Column_No<=22; Column_No++)
    for (candidate = 1; candidate<=289; candidate++)
      calculate SAD and compare to minimum;
Figure 5.1 Unthreaded pseudo-code 
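The loop bounds in Figure 5.1 (18 rows, 22 columns, 289 candidates) follow from the frame and search-window geometry: a CIF frame of 352 x 288 pixels, 16 x 16 macroblocks and a search range of p = 8. A minimal sketch of that arithmetic is given below; the function names are hypothetical and edge-of-frame effects are ignored.

```c
#include <assert.h>

/* Candidate blocks for a search range of +/-p pels: s^2 = (2p+1)^2. */
long candidates_per_mb(int p)
{
    assert(p >= 0);
    return (long)(2 * p + 1) * (2 * p + 1);
}

/* Number of 16 x 16 macroblocks in a width x height frame. */
long mbs_per_frame(int width, int height)
{
    assert(width % 16 == 0 && height % 16 == 0);
    return (long)(width / 16) * (height / 16);
}

/* SAD evaluations per frame for a Full Search (no edge clipping). */
long sad_evals_per_frame(int width, int height, int p)
{
    return mbs_per_frame(width, height) * candidates_per_mb(p);
}
```

For CIF with p = 8 this gives 396 macroblocks and 289 candidates each, i.e. 114,444 SAD evaluations per frame, each covering 256 pixel differences.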
5.1.i Thread Level Parallelism 
Motion estimation can be performed on a single high performance processor, 
or the workload of the encoder may be spread across m slower and simpler 
cores. For the purpose of illustration, we consider the case in which each MB 
is a 16 pixel x 16 pixel square, and the search area is a 32 pixel x 32 pixel 
square. Furthermore we shall consider the case of a CIF frame (352 pixels x 
288 pixels). Each motion vector thus involves finding the smallest of 17 x 17 
= 289 SAD values, and each frame contains 22 x 18 = 396 such MBs. We assume 
here that the multithreading has been done in the following manner. No 
attempt is made to thread the motion vector calculation of a single current 
frame MB (by spreading the 289 SAD comparisons around several cores); rather, 
the several slower cores operate in parallel on different current frame MBs. 
Thus the motion vector calculation (FULL_SEARCH + DIST1) of the FSBME 
algorithm is changed from that in Figure 5.1 to effectively (assuming, say, 
four processor cores) that in Figure 5.2. 
for (Row_No = 1; Row_No<=18; Row_No++)
  for (Proc_No = 1; Proc_No<=4; Proc_No++)
    for (Column_No = 1+6*(Proc_No-1); Column_No<=6*Proc_No; Column_No++)
      for (candidate = 1; candidate<=289; candidate++)
        calculate SAD and compare to minimum;
Figure 5.2 Possible multithreaded pseudo-code. 
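The column partition in Figure 5.2 gives each of four processors six columns, which nominally covers 24 columns although a CIF row has only 22 macroblocks. One way to handle this, sketched below as an assumption rather than as the scheme actually implemented, is to use chunks of ceil(columns/procs) columns and clamp the last chunk at the frame edge. The function name and signature are hypothetical.

```c
#include <assert.h>

/* Columns [*first, *last] (1-based, inclusive) handled by processor
   proc_no (1-based) when `columns` macroblock columns are split into
   contiguous chunks over `procs` processors. Returns the number of
   columns assigned, which is 0 when the processor has no work. */
int column_range(int columns, int procs, int proc_no, int *first, int *last)
{
    int chunk;
    assert(columns > 0 && procs > 0 && proc_no >= 1 && proc_no <= procs);

    chunk = (columns + procs - 1) / procs;   /* ceil(columns / procs) */
    *first = 1 + chunk * (proc_no - 1);
    *last = chunk * proc_no;
    if (*last > columns)
        *last = columns;                     /* clamp at the frame edge */
    return (*last >= *first) ? (*last - *first + 1) : 0;
}
```

With 22 columns and four processors the ranges are 1-6, 7-12, 13-18 and 19-22. With 22 processors each handles exactly one column, and any processor beyond the 22nd receives no work, in line with the observation that more than 22 contexts bring no further benefit for CIF.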
Likewise the FDCT function can be threaded in a similar manner; however, 
most of the REMAINING functions will be run by the main thread. 
In this way, with the exception of those additional commands 
performed by the main thread, which are required to provide overall control, 
the DIC is spread over the m cores (Proc_No = 1, ..., m). In general we also 
allow the size of the search area to vary, so that there will be $s^2$ rather 
than 289 candidates. Ignoring edge-of-frame effects, the DIC for the main 
thread can then be written roughly as 
$$C(s,m) = C_0 + \frac{C(s)}{m} \qquad (5.3)$$
where $C_0$ is the extra burden carried by the main thread, which is 
expected to be largely independent of s and to be only very weakly 
dependent on m. $C_0 + C(s)$ is the dynamic instruction count of a single 
processor performing the same task. 
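Eqn. (5.3) has the same shape as Amdahl's law: the term $C_0$ is not reduced by adding cores, so the achievable reduction in the main-thread DIC is bounded by $C(s)/(C_0 + C(s))$. The sketch below evaluates the model; the function names are hypothetical and the coefficients are passed in as arguments rather than taken from measurements.

```c
#include <assert.h>

/* Main-thread DIC for m cores, eqn. (5.3): C(s,m) = C0 + C(s)/m. */
double dic_main_thread(double c0, double cs, int m)
{
    assert(m >= 1);
    return c0 + cs / (double)m;
}

/* Fraction of the single-core DIC removed by using m cores. */
double dic_reduction(double c0, double cs, int m)
{
    double single = dic_main_thread(c0, cs, 1);
    return 1.0 - dic_main_thread(c0, cs, m) / single;
}
```

With illustrative values $C_0 = 1$ and $C(s) = 3$, three cores halve the main-thread DIC, and no number of cores can remove more than 75% of it.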
Naturally, with this type of threading, similar expressions will hold for 
all TLP-fast search algorithms as well. The $m^{-1}$ dependence is clearly 
seen in Figures 4.26 and 4.27 for a variety of fast methods, in which the 
(fractional DIC) $C(m) \approx 0.45 + 0.5/m$, so that for p = 8, 
$C_0/C(s=17) \approx 0.9$. Note, however, that the coefficient values (0.45 
and 0.5) are only weakly dependent on which of the fast algorithms is used. 
By contrast, in the case of the Full Search, $C(m) \approx 0.2 + 0.8/m$. It is 
also a fair assumption that the area ($\alpha$) increases linearly with m. 
For example, 
$$\alpha(m) = \alpha_0 + m\delta = \alpha_0\left(1 + m\frac{\delta}{\alpha_0}\right) = \alpha_0(1 + \theta m)\ \mathrm{mm}^2 \qquad (5.4)$$
where $\alpha_0 + \delta$ is a notional area required for the main thread 
processor and $\delta$ is the notional extra area required by each extra 
processor core, so that $\delta/(\alpha_0 + \delta)$ is the relative 'cost' 
of setting up the second thread. 
The area results for the threading method used above are given in 
Table 5.1. 
Table 5.1 Area of a single Leon2 CPU compared to 2-way and 4-way configurations 
Configuration          fmax (MHz)   Std cells + RAMs   Area (mm2) 
Single CPU core        219.3        25068 + 26         5.00 
2-way configuration    179.5        110099 + 52        9.04 
4-way configuration    168.2        215420 + 104       17.3 
Clearly in this case the area may be written as roughly $\alpha = (1 + 4m)$ 
mm2, i.e. with a value of $\theta$ (= $\delta/\alpha_0$) equal to around 4. 
For p = 8, using $C(m) \approx 0.45 + 0.5/m$ for fast search methods and 
$C(m) \approx 0.2 + 0.8/m$ for Full Search, we obtain relative ICAP values 
which increase with m according to Table 5.2. It is clear from Figures 4.26 
and 4.27 that the gain beyond m = 8 processors is not very significant and, 
given the area increase in Table 5.1, a value of 2 or 4 processors is perhaps 
reasonable for fast methods. The value of relative ICAP, Table 5.2, captures 
this information fairly clearly, where 2 - 4 processors appears to be the 
correct decision for fast methods. Likewise, for Full Search, four processors 
is perhaps best. 
Table 5.2 Relative increase in ICAP with thread number for Full and fast searches 
m               1     2      3     4      8 
Full Search     1     1.08   1.2   1.76   2.0 
Fast searches   1     1.26   1.6   1.96   3.4 
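The relative ICAP of eqn. (5.2) can be evaluated directly from the fitted forms quoted above, $C(m) \approx c_0 + c_1/m$ and an area proportional to $1 + \theta m$. The sketch below is an illustration under those assumptions; the function names are hypothetical, and values computed this way need not reproduce every entry of Table 5.2, since the tabulated figures may use a different normalisation or rounding.

```c
#include <assert.h>

/* Unnormalised ICAP for m cores, with C(m) = c0 + c1/m and an area
   proportional to (1 + theta*m). */
double icap(double c0, double c1, double theta, int m)
{
    assert(m >= 1);
    return (c0 + c1 / (double)m) * (1.0 + theta * (double)m);
}

/* ICAP for m cores relative to the single-core value. */
double relative_icap(double c0, double c1, double theta, int m)
{
    return icap(c0, c1, theta, m) / icap(c0, c1, theta, 1);
}
```

For the Full Search coefficients (0.2, 0.8) and $\theta = 4$, `relative_icap` gives 1.08 for two cores, in agreement with the first non-trivial entry of the table.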
Note that the ICAP product increases more slowly with m for the Full Search 
than for the fast searches. Taking ICAP as the penalty function, it would 
appear that the penalty incurred, for a Full Search, from using four processor 
cores might be considered reasonable, while perhaps two cores is appropriate 
for fast searches. The main reason for the poor ICAP value is that the current 
threading design assigns 26 RAM blocks to each core, so that $\theta$ turns 
out to be relatively large (around 4). A threading design in which the 
separate threads consider different candidates for the same current MB would 
not require so much additional memory, and a much lower $\theta$ would be 
obtained. As a result extensive threading would be possible at little extra 
ICAP cost. This is perhaps particularly true for the fast methods. Using the 
values obtained from Figure 4.26, i.e. p = 8 and $C(m) \approx 0.45 + 0.5/m$, 
ICAP/$\alpha_0$ is plotted as a function of m for a variety of values of 
$\theta = \delta/\alpha_0$ in Figure 5.3. It is clear that the ICAP curves are 
generally fairly flat. Only when $\theta = \delta/\alpha_0 > 1$ is there a 
significant minimum, at $m \approx 1$. 
[Figure 5.3: plot of scaled ICAP (vertical axis, 0 to 12) against m, the 
number of threads (horizontal axis, 0 to 22), with one curve per value of 
$\theta$.] 
Figure 5.3 Scaled ICAP vs m for various values of $\theta = \delta/\alpha_0$ 
for fast searches. Full curve corresponds to $\theta$ = 1, dashed curve 
$\theta$ = 2.0, dotted curve $\theta$ = 0.2, dot-dashed curve $\theta$ = 0.5. 
[Figure 5.4: plot of scaled ICAP against m, the number of threads 
(horizontal axis, 0 to 22), with one curve per value of $\theta$.] 
Figure 5.4 Scaled ICAP vs m for various values of $\theta = \delta/\alpha_0$ 
for Full Search. Full curve corresponds to $\theta$ = 1, dashed curve 
$\theta$ = 2.0, dotted curve $\theta$ = 0.2, dot-dashed curve $\theta$ = 0.5. 
By contrast, designs for which $\theta < 1$, with $1 \le m \le 4$ and even 
$1 \le m \le 8$, all give very similar ICAP values. 
If, as here, the different threads work on different current frame MBs, 
there will be a large on-chip memory requirement, as a significant portion of 
the m current frame MBs, and their respective search areas, will need to be 
stored on chip simultaneously. Consequently $\theta$, for such a design, will 
be relatively large. By contrast, if the different threads are processing 
different portions of the search area for the same MB, the memory requirement 
will be smaller and so will $\theta$. It goes without saying that there will 
also be limits to be placed on pin count in the case considered above. 
It is also clear that the optimum number of cores depends upon 
the size of the search area ($s^2$). As a result we turn our attention to 
this issue now. 
5.1.ii Data Level Parallelism
A similar analysis holds where DLP is concerned. Consider first the case of a 
singly threaded architecture. The idea here is to introduce vector commands 
such that up to 16 bytes of information can be treated in a single vector 
instruction which operates like 16 one-byte instructions. 
for (candidate = 1; candidate <= 289; candidate++)
    /* calculate SAD and compare to minimum */
    for (i = 1; i <= 16; i++)
        for (j = 1; j <= 16; j++)
            SAD = SAD + abs(MB_CF(i,j) - MB_RF(i+u,j+v));
Figure 5.5 Unvectorized (scalar) pseudo-code, where (u, v) are the coordinates of the
candidate motion vector. This is the inner loop of Figure 5.1.
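For reference, the scalar loop of Figure 5.5 can be written as a short runnable sketch. It assumes zero-based indexing, a search area stored as a list of rows of size (16 + 2p) x (16 + 2p), and a raster scan of all $(2p+1)^2$ candidates; the names `sad` and `full_search` are illustrative and are not taken from the TM5 source.

```python
def sad(cur, ref, u, v, n=16):
    """Sum of absolute differences between the n x n current block and the
    n x n reference block whose top-left corner is at (u, v) in ref."""
    total = 0
    for i in range(n):
        for j in range(n):
            total += abs(cur[i][j] - ref[i + u][j + v])
    return total


def full_search(cur, search_area, p=8, n=16):
    """Visit all (2p+1)^2 candidates (289 for p = 8) in raster order.
    Returns (best_sad, (dy, dx)) with the displacement relative to the centre."""
    best_sad, best_mv = None, None
    for u in range(2 * p + 1):
        for v in range(2 * p + 1):
            value = sad(cur, search_area, u, v, n)
            if best_sad is None or value < best_sad:
                best_sad, best_mv = value, (u - p, v - p)
    return best_sad, best_mv
```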
In this, the inner loop in Figure 5.1 is replaced so that Figure 5.5 becomes,
with n = 2,
for (candidate = 1; candidate <= 289; candidate++)
{
    /* calculate SAD and compare to minimum */
    for (i = 1; i <= 16; i++)
        for (j = 1; j <= 8; j++)
        {
            VMB_CF_0_to_7  = MB_CF(i,2*j-1);
            VMB_CF_8_to_15 = MB_CF(i,2*j);
            /* concatenate the two adjacent pixels of current block (CB) and
               put them in CB vector register VMB_CF of length 2 bytes */
            VMB_CF = (VMB_CF_0_to_7 << 8) | (VMB_CF_8_to_15);
            VMB_RF_0_to_7  = MB_RF(i+u,2*j-1+v);
            VMB_RF_8_to_15 = MB_RF(i+u,2*j+v);
            /* concatenate the two adjacent pixels of reference block (RB) and
               put them in RB vector register VMB_RF of length 2 bytes */
            VMB_RF = (VMB_RF_0_to_7 << 8) | (VMB_RF_8_to_15);
            VSAD = VSAD + vabs(VMB_CF - VMB_RF);
        }
}
Figure 5.6 Vectorized pseudo-code
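The effect of the vector length on the trip count of the inner loop can be illustrated with a small sketch. It models an n-lane vector operation in plain Python, assuming that the vabs operation of Figure 5.6 acts on each byte lane separately; `sad_lanes` is a hypothetical helper used only for this illustration.

```python
def sad_lanes(cur_row, ref_row, n):
    """SAD of one 16-pixel row processed n pixels per step.
    Returns (sad, steps), where steps is the number of vector operations."""
    if 16 % n != 0:
        raise ValueError("n must divide 16")
    total, steps = 0, 0
    for j in range(0, 16, n):
        # one 'vector instruction': n byte lanes handled together
        lanes = [abs(c - r) for c, r in zip(cur_row[j:j + n], ref_row[j:j + n])]
        total += sum(lanes)
        steps += 1
    return total, steps
```

The result is the same for every allowed n, while the number of steps per row falls as 16/n, which is the behaviour assumed for the vectorizable part of the instruction count.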
Note that the inner loop (on j) has been decreased from length 16 to length 8.
In general this line would be for j = 1:16/n so that the allowed values here,
giving integer j, are n = 1, 2, 4, 8 and 16, i.e. $2^k$. If the level of vectorization is n
(where $1 \le n \le 16$ and n = 1 corresponds to the scalar (non-vectorized) SAD
engine), the DIC will consist of two parts: those instructions that can be
vectorized and those that cannot. Thus we expect that for vectorized code,
$$C(s,n) = A(s) + \frac{B(s)}{n} \tag{5.5}$$
Likewise, the area is expected to increase linearly with n, as
$$a(n) = a_0 + n\Delta \tag{5.6}$$
A 16-byte vector unit has area characteristics given in Table 5.3 [86], while a
Full Search macro, based on a design in [41], [17], has area characteristics
given in Table 5.4.
Table 5.3 Area characteristics for a 128-bit vector unit [86].
fmax (MHz)   Utilisation (%)   Standard Cells + RAMs   Area (mm2)
312          93.02             26777 + 5               0.418
Table 5.4 Area characteristics for a Full Search Macro.
fmax (MHz)   Utilisation (%)   Standard Cells + RAMs   Area (mm2)
183          95                12134 + 31              0.566
Assuming that eqn. (5.6) holds we estimate that
$$a(n) \approx 0.566 + n\,\frac{0.418}{16} \tag{5.7}$$
As a result of eqn. (5.6), ICAP is given by
$$\mathrm{ICAP} = (a_0 + n\Delta)\left(A(s) + \frac{B(s)}{n}\right) \tag{5.8}$$
The optimum level of vectorization, which minimizes ICAP, becomes
$$n_{opt}^2 = \frac{a_0/A(s)}{\Delta/B(s)} \approx 20\,\frac{B(s)}{A(s)} \tag{5.9}$$
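The minimisation leading to eqn. (5.9) is easy to check numerically. The sketch below assumes the area figures quoted in Tables 5.3 and 5.4 ($a_0$ = 0.566 mm2 and $\Delta$ = 0.418/16 mm2 per byte) and treats A(s) and B(s) as plain inputs; the function names are illustrative.

```python
import math

A0 = 0.566           # mm^2, Full Search macro area (Table 5.4)
DELTA = 0.418 / 16   # mm^2 per byte of vector length, from the 16-byte unit (Table 5.3)


def area(n):
    """Eqn. (5.6): area grows linearly with the vector length n."""
    return A0 + n * DELTA


def icap(n, a_s, b_s):
    """Eqn. (5.8): area times instruction count A(s) + B(s)/n."""
    return area(n) * (a_s + b_s / n)


def n_opt(a_s, b_s):
    """Eqn. (5.9): continuous minimiser, n^2 = (a0/Delta) * (B/A)."""
    return math.sqrt((A0 / DELTA) * (b_s / a_s))
```

For example, with the purely illustrative choice A(s) = B(s) = 1, `n_opt(1.0, 1.0)` lies between 4 and 5, and among the allowed lengths 1, 2, 4, 8 and 16 the smallest ICAP is obtained at n = 4.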
This analysis is again independent of the search algorithm, although 
the values of A and B are search dependent, which may be a Full Search or 
may be one of the fast methods. 
As we shall see later for a Full Search, in the specific case of p = 8, s =
17, the Full Search DIC is
(5.10)
as a function of n. The balance between the constant term and the coefficient
of the term in 1/n is similar to that of the TLP expression where C(m) = 0.45
+ 0.5/m or 0.2 + 0.8/m. Consequently the ICAP curve has a similar flat
appearance to that of Figures 5.3 and 5.4, and similar conclusions may be
drawn. Thus if $\Delta/a_0 < 1$, $1 \le n \le 4$ and even $1 \le n \le 8$ all give very similar area
values. From eqn. (5.7) we estimate that $\Delta/a_0 \approx 0.05$ so that $n_{opt} \approx 4$ to 5,
although the increase to n = 8 or 16 is at little extra ICAP cost.
One might conclude that, for a search area characterised by p = 8, the
Full Search method would benefit from 4 to 8 bytes of Data Level Parallelisation
and a Thread Level Parallelisation of around four processor contexts.
Naturally these numbers may be different for larger search areas. As a result,
it is now necessary to examine the various parameters in eqn. (5.9) for their
dependence on s, the parameter describing the size of the search area. To
begin with we consider the total Dynamic Instruction Count for coding 30
frames, i.e. roughly 1 second of video, for twelve different sequences which
are believed, between them, to contain many of the different types of motion
seen in typical video clips. To display the results, the total DIC for each
sequence has been plotted against the DIC calculated for the 'Paris' sequence.
This is shown in Figure 5.7.
It is clear that for all sequences the DIC variation with search range is
given by
$$C_{sequence}(p) = \frac{\lambda_{sequence}}{\lambda_{Paris}}\,C_{Paris}(p) \tag{5.11}$$
where, importantly, the slope, i.e. the ratio $\lambda_{sequence}/\lambda_{Paris}$, is independent of
the search range parameter p. These slopes vary from around 2.4 for the
'snowfall' sequence, to around 0.1 for 'office'. To a certain extent, this
variation can be explained by the relative frame sizes of the different
sequences, which vary from 0.4 to 1.9, and the sequences are roughly in the
right order. However 'Paris', 'bowing', 'deadline' and 'stud' are CIF
sequences of frame size 352 x 288 and, while the 'deadline' curve lies almost
on top of the 'Paris' sequence, 'bowing' and 'stud' lie some way off. Likewise
although 'fog', 'snowfall', 'snow lane' and 'rotating city' (rc) are all 512 x 380
there is a significant variety in the observed slopes. The remaining variety
must then arise from the content of the individual sequences. An early
termination of the SAD calculation is applied if the current partial sum
exceeds the current best value and the effect of this is naturally sequence
dependent. The order in terms of size is 'office' (200 x 200), 'cup' (320 x 240),
{'mad' = 'tennis'} (352 x 240), {'Paris' = 'bowing' = 'deadline' = 'stud'} (352 x
288), {'fog' = 'snowfall' = 'snow lane' = 'rotating city'} (512 x 380), whereas
the order they occur in terms of scalar DIC is 'office', 'stud', 'bowing', 'mad',
'tennis', 'cup', 'snow lane', 'deadline', 'Paris', 'fog', 'rotating city', 'snowfall'.
Only for a large vector length (n = 8 or 16) is the correct order regained, and
does the DIC slope become largely proportional to the frame size.
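The early termination mentioned above is what makes the instruction count content dependent. A minimal sketch of the idea follows, assuming that the partial sum is tested once per row of the macroblock (the granularity of the test is an assumption, not something stated here); `sad_early_exit` is an illustrative name.

```python
def sad_early_exit(cur, ref, u, v, best_so_far, n=16):
    """Accumulate the SAD row by row and abandon the candidate as soon as the
    partial sum exceeds best_so_far. Returns (sum_reached, rows_processed)."""
    total = 0
    for i in range(n):
        for j in range(n):
            total += abs(cur[i][j] - ref[i + u][j + v])
        if total > best_so_far:
            return total, i + 1   # jump out: this candidate cannot win
    return total, n
```

A poor candidate is rejected after a few rows, whereas a sequence with many near-identical candidates forces most rows to be evaluated, so the work done per candidate depends on the sequence content as well as on the frame size.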
Figure 5.7 Total dynamic instruction count for a variety of sequences (deadline, cup, stud,
mad, bowing, tennis, rc, fog, snowfall, snow_lane, off, Paris) vs that of the 'Paris' sequence.
The various points correspond to the search range values of 7, 15, 25, 31, 47, 63, 81, 97, 113,
121 and 127. Here n = VLMAX = 1, i.e. scalar algorithm.
Note also that in eqn. (5.11), the $\lambda$ values are only defined to within a constant
scaling factor and $\lambda$ could be replaced by $\kappa\lambda$ without altering the analysis.
Averaging eqn. (5.11) over the whole set of twelve sequences, we find that
$$\bar{C}(p) = \frac{\bar{\lambda}}{\lambda_{Paris}}\,C_{Paris}(p) \tag{5.12}$$
or alternatively, for any sequence
$$C_{sequence}(p) = \frac{\lambda_{sequence}}{\bar{\lambda}}\,\bar{C}(p) \tag{5.13}$$
As a result we now consider the average DIC over the whole set of sequences
and use eqn. (5.13) to extract results for individual sequences. The advantage
in doing this is that it becomes necessary only to understand the variation
with p of the single function $\bar{C}(p)$, without the need to consider those parts
that are sequence dependent and those that are not. The average DIC values
may be found as fractional values in Figure 4.2, where the algorithm was
divided into the following parts: DIST1 is the section of code responsible for
calculating the SAD distance metric; FDCT is responsible for the Forward
DCT transform; FULL_SEARCH controls the raster scan of all candidate
blocks in the reference frame; and REMAINING corresponds to all
instructions which are not included as part of the previous three.
For macroblocks of size N x N (typically 16 x 16) and a Search Area of
size (N + 2p) x (N + 2p), the number of SAD calculations to be performed is
$(2p + 1)^2$ per current frame MB. From a fairly naive viewpoint, therefore, one
would expect part of the DIC to scale as $s^2 = (2p + 1)^2$ and part to be
independent of p. For example, the forward DCT (FDCT) occurs after block
matching and so is expected to be independent of p, while clearly the DIC for
DIST1 will increase with p, probably as $s^2$. Let us define $C_{FS}$, $C_{DIST1}$, $C_{FDCT}$ and
$C_{REM}$ to be those parts of the dynamic instruction count pertaining to the
various functions FULL_SEARCH, DIST1, FDCT and the remaining
instructions REMAINING, respectively. Let us also define fractional values
$C_{FS}^*$, $C_{DIST1}^*$, $C_{FDCT}^*$ and $C_{REM}^*$. From our naive perspective we should expect
$C_{REM} + C_{FDCT}$ to be constant, equal to c say, i.e. independent of window size p.
The fractional combined DIC for REM and FDCT is then
$$C_{REM}^* + C_{FDCT}^* = \frac{C_{REM} + C_{FDCT}}{C_{REM} + C_{FDCT} + C_{DIST1} + C_{FS}} \tag{5.14}$$
This fraction will decrease with p, as in Figure 4.2, since the $C_{DIST1}$ and $C_{FS}$
functions are expected to increase with p. Clearly we also have
$$\frac{C_{DIST1}^*}{C_{REM}^* + C_{FDCT}^*} = \frac{C_{DIST1}}{C_{REM} + C_{FDCT}} = \frac{C_{DIST1}}{c} \tag{5.15}$$
and similarly for $C_{FS}$. We anticipate that $C_{DIST1}$ and $C_{FS}$ will be functionally
dependent on the size of the search range, s = (2p+1), and a plot against $s^2$
reveals the expected behaviour.
Figure 5.8 shows a plot of $C_{DIST1}^*/(C_{REM}^* + C_{FDCT}^*)$ against $s^2$ together with the
straight line $1.9 + s^2/290$, and clearly the fit is excellent except that, for the
smallest window sizes, the estimate is slightly too high. Also shown is a plot
of $C_{FS}^*/(C_{REM}^* + C_{FDCT}^*)$ against $s^2$ together with the straight line $s^2/780$ and again,
except for the smallest window size, where the estimate is slightly low, the fit
is excellent. This leads to the following approximations to the average DIC
results of Figure 4.2,
$$C_{DIST1} = c\left(1.9 + \frac{s^2}{290}\right), \qquad C_{FS} = c\,\frac{s^2}{780}, \qquad C_{REM} + C_{FDCT} = c \tag{5.16}$$
Note that as we have assumed both $C_{REM}$ and $C_{FDCT}$ to be independent of p, we
should expect that
$$\frac{C_{FDCT}}{C_{REM}} = \frac{C_{FDCT}^*}{C_{REM}^*} = \alpha \tag{5.17}$$
for some constant value $\alpha$. The results from Figure 4.2 show that $\alpha \approx 1.16$ to a
good approximation ($1.14 < \alpha < 1.17$) over the range $p \in [6, 62]$. Thus, our
analysis shows that the individual contributions to the total DIC are
$$C_{DIST1} = c\left(1.9 + \frac{(2p+1)^2}{290}\right), \quad C_{FS} = c\,\frac{(2p+1)^2}{780}, \quad C_{REM} = \frac{c}{2.16}, \quad C_{FDCT} = \frac{1.16\,c}{2.16} \tag{5.18}$$
and that the total DIC, $\bar{C}(p)$, for a Full-Search with MPEG2 encoding is
$$\bar{C}(p) = C_{FS} + C_{DIST1} + C_{FDCT} + C_{REM} = c\left(2.9 + (2p+1)^2\left(\frac{1}{290} + \frac{1}{780}\right)\right) \tag{5.19}$$
The DIC for each individual sequence is then obtained from eqn. (5.13).
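As a check on the bookkeeping, the four contributions of eqn. (5.18) can be evaluated and summed. The sketch below takes c = 1 as an arbitrary scale and $\alpha$ = 1.16 from the text; the function names are illustrative.

```python
def dic_components(p, c=1.0, alpha=1.16):
    """Contributions to the scalar Full Search DIC, following eqn. (5.18)."""
    s2 = (2 * p + 1) ** 2
    return {
        "DIST1": c * (1.9 + s2 / 290.0),
        "FULL_SEARCH": c * s2 / 780.0,
        "REMAINING": c / (1.0 + alpha),
        "FDCT": alpha * c / (1.0 + alpha),
    }


def dic_total(p, c=1.0):
    """Total scalar DIC, following eqn. (5.19)."""
    return c * (2.9 + (2 * p + 1) ** 2 * (1.0 / 290.0 + 1.0 / 780.0))


def dic_fractions(p):
    """Fractional values (the starred quantities) for search range p."""
    parts = dic_components(p)
    total = sum(parts.values())
    return {name: value / total for name, value in parts.items()}
```

For p = 8 the total is roughly 4.27c, and the combined fraction of REMAINING and FDCT falls steadily as p increases, which is the trend described for Figure 4.2.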
When the algorithm is vectorized as well, the part which is most likely
to be affected is DIST1, which should possess a section of code with a
complexity varying roughly as 1/n, n being the multiplicity of datapaths with
respect to the original, scalar algorithm. By contrast $C_{FDCT}$, $C_{REM}$ and $C_{FS}$ are
unlikely to be affected: in this particular implementation of TM5 we choose
to disable the vectorized Forward DCT (FDCT) function and, as we have seen,
FULL_SEARCH is characterized by the complementary thread-level
parallelism (TLP).
Let us assume that the split between those parts that can be vectorized
and those that cannot gives us a DIC of
$$C_{DIST1}(n) = c\left(0.9 + \frac{1}{n} + \frac{(2p+1)^2}{290\,n}\right) \tag{5.20}$$
which we shall justify later through Figure 5.9. The total DIC becomes
$$C_{TOT}(n) = c\left(1.9 + \frac{1.0}{n} + (2p+1)^2\left(\frac{1}{290\,n} + \frac{1}{780}\right)\right) \tag{5.21}$$
giving a fractional reduction in Dynamic Instruction Count, on vectorizing, of
$$\frac{\delta C}{C_{SCALAR}} = \frac{\left(1.0 + \frac{(2p+1)^2}{290}\right)\left(1 - \frac{1}{n}\right)}{2.9 + (2p+1)^2\left(\frac{1}{290} + \frac{1}{780}\right)} \tag{5.22}$$
Plotting this against 2p +1 for n = 4, 8, 16 yields Figure 5.9 where the 
upper curve corresponds to n (= VLMAX ) = 16, and the lower curve to n (= 
VLMAX) = 4. These results compare favourably with the calculation from 
[86], shown as data points, although it is possible that a better fit to the data 
points could be obtained with a slightly different split in eqn. (5.17). 
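The curves of Figure 5.9 follow directly from eqns. (5.21) and (5.22). A minimal sketch, with c = 1 as an arbitrary scale and illustrative function names, is:

```python
def dic_vector(p, n, c=1.0):
    """Total DIC of the vectorized Full Search, following eqn. (5.21)."""
    s2 = (2 * p + 1) ** 2
    return c * (1.9 + 1.0 / n + s2 * (1.0 / (290.0 * n) + 1.0 / 780.0))


def fractional_reduction(p, n):
    """Fractional DIC reduction on vectorizing, closed form of eqn. (5.22)."""
    s2 = (2 * p + 1) ** 2
    numerator = (1.0 + s2 / 290.0) * (1.0 - 1.0 / n)
    denominator = 2.9 + s2 * (1.0 / 290.0 + 1.0 / 780.0)
    return numerator / denominator


def dic_floor(p, c=1.0):
    """Lower bound of eqn. (5.21) as n grows: the non-vectorizable part."""
    return c * (1.9 + (2 * p + 1) ** 2 / 780.0)
```

The closed form agrees with `1 - dic_vector(p, n) / dic_vector(p, 1)`, and for p = 8 with n = 4 it gives a reduction of about 0.35.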
A significant conclusion of this is that, for a vectorized Full-Search, the
total architectural Dynamic Instruction Count and the DIC for the DIST1
function vary according to eqn. (5.21) and eqn. (5.20) respectively. Clearly the
total DIC $C_{TOT}(n)$ is bounded below by $c\left(1.9 + (2p+1)^2/780\right)$ and no increase in n =
VLMAX will improve matters. There is also a downside to increasing n
(VLMAX), of course, in that more chip area will be consumed in performing
the DIST1 loop with increasing vector length.
Figure 5.8 Graph-based DIST1 and FULL_SEARCH (upper curve) approximations, plotted
against $s^2$.
Figure 5.9 Fractional DIC reduction over search range vs vector register file length.
VLMAX = 16 upper curve, VLMAX = 8 middle curve, VLMAX = 4 lower curve.
Indeed one would expect the area to increase linearly with n, as in eqn. (5.6),
repeated here for convenience.
$$a(n) = a_0 + n\Delta \tag{5.6}$$
where $a_0 + \Delta$ is the area occupied by logic performing scalar (non-vectorized)
instructions (FDCT, REM, FULL_SEARCH), VLMAX = 1, and $\Delta$ is the rate of
increase of area with the degree of vectorisation. For a given p (which
essentially maps onto the PSNR, e.g. see Figure 4.9) the (dynamic) Instruction
Count-Area product (ICAP) is then minimized to give an optimum
vectorization length of
$$n_{opt} = \sqrt{\frac{\left(1 + \frac{s^2}{290}\right)a_0}{\left(1.9 + \frac{s^2}{780}\right)\Delta}} \approx \frac{2}{3}\left(1 + \frac{s}{40}\right)\sqrt{\frac{a_0}{\Delta}} \tag{5.22}$$
For s in the range 5 to 61 (p = 2 to 30; commonly in MPEG2, p = 8), the
ratio of the terms in s varies roughly linearly in s, hence the approximation in
eqn. (5.22). The change in the value of (1 + s/40) over the range of s values
considered is from around 1 to 2.5. Using the values from eqn. (5.10), this
corresponds to an increase in vector length from 4 to 10 bytes. This implies
that a 32-bit architecture using 4-byte vector commands is best for small
search areas, while 64-bit or 128-bit DLP architectures would be required for
the larger search areas.
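The quality of the linear approximation can be examined with a few lines of code. The sketch below compares the square-root form with the linear form of the optimum vector length, treating the ratio $a_0/\Delta$ as a free parameter; the function names are illustrative.

```python
import math


def n_opt_exact(s, a0_over_delta):
    """Optimum vector length from minimising (a0 + n*Delta) * C_TOT(n)."""
    ratio = (1.0 + s * s / 290.0) / (1.9 + s * s / 780.0)
    return math.sqrt(ratio * a0_over_delta)


def n_opt_linear(s, a0_over_delta):
    """Linear-in-s approximation: (2/3) * (1 + s/40) * sqrt(a0/Delta)."""
    return (2.0 / 3.0) * (1.0 + s / 40.0) * math.sqrt(a0_over_delta)
```

With the ratio set to 1 the two forms agree to within about 0.02 for s = 5 and s = 17, and the linear form overestimates the exact value towards the top of the range (s = 61).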
Eqn. (5.22) also shows that if an algorithm is optimized for ICAP, for a 
given size of Search Area s = 2p + 1, and the Search Area needs to be 
increased, e.g. for greater PSNR, then the degree of vectorization should be 
increased by an amount proportional to the increase in p. 
In terms of the individual sequences, eqn. (5.13) yields
$$C_{sequence}(p) = \frac{c\,\lambda_{sequence}}{\bar{\lambda}}\left[2.9 + (2p+1)^2\left(\frac{1}{290} + \frac{1}{780}\right)\right] \tag{5.23}$$
where it might be expected that the coefficient c depends upon the set of
sequences over which we are averaging; indeed one might expect $\kappa_{FS} = c/\bar{\lambda}$
to be independent of the sequences themselves and dependent instead on the
algorithm being used for compression. This leads to an expression for the
complexity of the Full Search operating on any sequence, given by
$$C_{sequence}(p) = \kappa_{FS}\,\lambda_{sequence}\left[2.9 + (2p+1)^2\left(\frac{1}{290} + \frac{1}{780}\right)\right] \tag{5.24}$$
In other words, the numbers on the right hand side of eqn. (5.24) are fixed by
the algorithm, and not by the sequences, which merely contribute a factor
$\lambda_{sequence}$. It is possible to absorb the constant scaling factor $\kappa_{FS}$ into the
definition of $\lambda$; recall that the $\lambda$ values were only defined to within a scaling
constant. So we can write, for all sequences,
$$\frac{C_{sequence}(p)}{\lambda_{sequence}} = 2.9 + (2p+1)^2\left(\frac{1}{290} + \frac{1}{780}\right) \tag{5.25}$$
and the only sequence dependent parameter is the constant $\lambda_{sequence}$.
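Note, also, that with increasing Data Level Parallelization the relative
slopes change, with sequences of the same frame-size tending to converge to
the same limit at large n, Figures 5.10 to 5.12. The only exception to this is the
'snow-lane' sequence which appears less 'complex' than the other frames of
size 512 x 380 pixels. Consequently to a good approximation we have
$$C_{sequence}(p,n) = \lambda_{sequence}(\Sigma_{seq},n)\left[1.9 + \frac{1}{n} + (2p+1)^2\left(\frac{1}{290\,n} + \frac{1}{780}\right)\right] \tag{5.25a}$$
where $\Sigma_{seq}$ is the frame-area for the sequences. This means that we must have
$$\lim_{n\to\infty} \lambda_{sequence}(\Sigma_{seq},n) = \lambda_\infty \Sigma_{seq} \tag{5.26}$$
say. So that, in general, let us postulate that
$$\lambda_{sequence}(\Sigma_{seq},n) = \Sigma_{seq}\left(\lambda_\infty + \frac{\pi_{seq}}{n} + O\!\left(\frac{1}{n^2}\right)\right), \qquad \bar{\lambda}(n) = \left(\lambda_\infty + \frac{\bar{\pi}}{n} + O\!\left(\frac{1}{n^2}\right)\right) \tag{5.27}$$
so that, for the individual sequences,
$$C_{sequence}(p,n) = \Sigma_{seq}\left(\lambda_\infty + \frac{\pi_{seq}}{n}\right)\left[1.9 + \frac{1}{n} + (2p+1)^2\left(\frac{1}{290\,n} + \frac{1}{780}\right)\right] \tag{5.28}$$
which reduces to eqn. (5.25) when averaging over all sequences, and which,
to first order in $n^{-1}$, becomes
$$C_{sequence}(p,n) = \lambda_\infty \Sigma_{seq}\left[1.9 + \frac{1 + 1.9\,\pi_{seq}}{n} + \frac{(2p+1)^2}{780}\left(1 + \frac{2.7 + \pi_{seq}}{n}\right)\right] \tag{5.29}$$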
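The step from eqn. (5.28) to eqn. (5.29) can be made explicit by expanding the product and discarding terms of order $n^{-2}$. In the sketch below s = 2p + 1, and the symbol $\pi' = \pi_{seq}/\lambda_\infty$ is introduced purely for illustration; reading the $\pi_{seq}$ of eqn. (5.29) as being measured in units of $\lambda_\infty$ is an assumption.

```latex
\begin{align*}
C_{sequence}(p,n)
 &= \Sigma_{seq}\left(\lambda_\infty + \frac{\pi_{seq}}{n}\right)
    \left[1.9 + \frac{s^2}{780} + \frac{1}{n}\left(1 + \frac{s^2}{290}\right)\right] \\
 &= \lambda_\infty\Sigma_{seq}\left[1.9 + \frac{s^2}{780}
    + \frac{1}{n}\left(1 + \frac{s^2}{290}
    + \pi'\left(1.9 + \frac{s^2}{780}\right)\right)\right] + O(n^{-2}) \\
 &= \lambda_\infty\Sigma_{seq}\left[1.9 + \frac{1 + 1.9\,\pi'}{n}
    + \frac{s^2}{780}\left(1 + \frac{780/290 + \pi'}{n}\right)\right] + O(n^{-2})
\end{align*}
```

Since 780/290 is approximately 2.7, the last line has the same coefficients as eqn. (5.29).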
This gives a formula for the DIC of a sequence in terms of the sequence
contribution $\pi_{seq}$, the size of the search area $s^2$, the frame-size $\Sigma_{seq}$ and the
degree of vectorization n, and appears to be consistent with all observed DIC
results on the Full Search method.
A consistency check is that eqn. (5.11) should hold, i.e.
(5.30)
so that the plots of $C_{sequence}(p, n)$ against $C_{Paris}(p, n)$ should be straight lines of
slope
$$\frac{\Sigma_{seq}}{\Sigma_{Paris}}\;\frac{\pi_{seq} + \lambda_\infty n}{\pi_{Paris} + \lambda_\infty n} \tag{5.31}$$
The slope is greater than $\Sigma_{seq}/\Sigma_{Paris}$ if $\pi_{seq} > \pi_{Paris}$ (irrespective of n) and less than
$\Sigma_{seq}/\Sigma_{Paris}$ if $\pi_{seq} < \pi_{Paris}$ (irrespective of n).
Figure 5.10 DIC vs Paris sequence complexity (sequences shown: Deadline, Cup, Stud, Mad,
Bowing, Tennis, Paris). The various points correspond to the search range values of 7, 15,
25, 31, 47, 63, 81, 97, 113, 121 and 127. Here n = VLMAX = 4.
For large enough n, large enough that eqn. (5.27) holds, the sequences
converge to slopes of $\Sigma_{seq}/\Sigma_{Paris}$. Only 'snow lane' (Figure 5.13) does not fit into
this picture, as the slope of 'snow lane' is lower than that of the other 512
x 380 sequences, making $\pi_{snowlane} < \pi_{Paris}$ while for the other sequences
$\pi_{seq} \approx \pi_{Paris}$. This implies significantly less motion in 'snow lane' (less intra MB
coding) than in the other sequences and hence a significantly higher use of the early
jump-out mechanism.
Figure 5.11 DIC vs Paris sequence complexity (sequences shown: Deadline, Cup, Stud, Mad,
Bowing, Tennis, Paris). The various points correspond to the search range values of 7, 15,
25, 31, 47, 63, 81, 97, 113, 121 and 127. Here n = VLMAX = 8.
Figure 5.12 DIC vs Paris sequence complexity (sequences shown: Deadline, Cup, Stud, Mad,
Bowing, Tennis, Paris). The various points correspond to the search range values of 7, 15,
25, 31, 47, 63, 81, 97, 113, 121 and 127. Here VLMAX = 16.
We conclude this section with the observation that eqn. (5.29) does give a 
remarkably simple representation of the instruction count of the Full Search 
method. It remains now to consider the case of the fast sub-sampling 
methods. 
5.1.iii Fast Methods
For other methods, such as the three-step search (TSS), the dependence on
window size is very weak (Figures 3.15 and 4.8). One expects the DIC relative
to scalar TSS to be roughly
$$C(n) \approx A_{TSS} + \frac{B_{TSS}}{n} \tag{5.32}$$
where $B_{TSS}$ represents that part of the TSS algorithm which scales as 1/n (the
part that can be vectorised) and $A_{TSS}$ represents that part of the algorithm that
is independent of n. In this case, plotted against 1/n, the slope and intercept
should be independent of window size, but are likely to vary between
sequences. Also, relative to vectorized FS, the relative DIC is
$$\frac{A_{TSS} + B_{TSS}/n}{C_{TOT}(n)} \;\xrightarrow{\;n \to \infty\;}\; \frac{A_{TSS}}{c\left(1.9 + \frac{s^2}{780}\right)} \tag{5.33}$$
This again is bounded below for large n, which shows that
vectorization is a process which has an optimal solution. Consequently one
might expect an area-delay product, such as ICAP, to be useful as a cost
function in this regard.
5.1.iv Summary
It is important to summarize the theoretical analysis presented in this section
and assess what conclusions can sensibly be drawn. First, in attempting to
speed up the Full Search, there are two types of parallelism that one might try
to exploit. The first is to multi-thread the architecture, sharing the
workload amongst a number of simpler, slower cores in the hope that either
the frame encoding time or the power dissipated in encoding a frame may be
decreased, or that silicon area savings may be made. The second is to include
vector commands in the code with a view to reducing the number of times
that an instruction is executed. Either way some sort of additional
(redundant) hardware is involved.
Of course, for real-time applications, the overriding requirement is that
the video segment is encoded at the required frame rate, and thus eqn. (5.1)
must hold, i.e. the encoder must run at a minimum frequency of
$$f_{min} = \mathrm{DIC} \times \mathrm{CPI} \times f_r \tag{5.1}$$
With the frame rate $f_r$ and CPI relatively fixed, this frequency can only be
reduced, and with it the dissipated power $C_T V^2 f_{min}$, if DIC is reduced.
However the total capacitance $C_T$ and the total area (and hence the cost)
increase with any hardware redundancy.
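The real-time constraint of eqn. (5.1) and the associated power estimate amount to two one-line formulas. The sketch below assumes that DIC is counted per frame, so that the product has units of cycles per second; the function names and any numbers passed to them are hypothetical.

```python
def min_clock_hz(dic_per_frame, cpi, frame_rate):
    """Eqn. (5.1): minimum clock frequency for real-time encoding.
    dic_per_frame is assumed to be the dynamic instruction count per frame."""
    return dic_per_frame * cpi * frame_rate


def dynamic_power(c_total, vdd, f_hz):
    """Dynamic power estimate C_T * V^2 * f used in the text."""
    return c_total * vdd ** 2 * f_hz
```

Halving the instruction count halves the required clock frequency and, at fixed supply voltage, the dynamic power, provided the added hardware does not increase $C_T$ by a comparable factor.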
Figure 5.13 Remaining sequences (snow_lane, off, rc, snowfall, fog) plotted against 'fog'.
Here VLMAX = 8.
It makes sense to consider a cost function such as the ICAP metric, the
Dynamic Instruction Count x area product, and to ensure that, given eqn.
(5.1) is satisfied, it is as small as possible.
For non-real-time encoding, optimising ICAP corresponds to the
optimisation of the standard delay-area product. One cannot increase the
parallelisation beyond a certain limit as eventually the cost implications will
dominate any additional reduction in DIC. However, the optimum level of
redundancy depends upon the size of search window, which is a variable in
MPEG2, and which may need to be changed for complex sequences such as
'rotating city', Figure 4.9, in order to obtain acceptable PSNR values. To
this end the Full Search code has been profiled for a variety of search area
sizes.
The behaviour of the system depends on the redundancy introduced.
For a Full Search exploiting Data-Level Parallelism (DLP), in which SIMD or
vector instructions are introduced, the optimal length, in bytes, of the vector
registers is roughly (eqn. (5.22))
$$n_{opt} \approx \frac{2}{3}\left(1 + \frac{s}{40}\right)\sqrt{\frac{a_0}{\Delta}} \tag{5.22}$$
where $a_0 + \Delta$ is the chip area for a scalar architecture and $\Delta$ is the
additional area per byte of vector length in a vectorized architecture. For a
design which uses the full resolution Full Search chip and the vector unit
described above, this leads to a 32-bit vector data-path (with VLMAX = 4) for
relatively small window sizes, and a 64- or 128-bit datapath (with VLMAX = 8
or 16) for the larger search windows. By contrast little improvement can be
gained by the fast methods through DLP, largely due to the irregular dataflow
and smaller number of candidate motion vectors (CMVs) searched.
On the other hand exploiting Thread-Level Parallelism (TLP), with the
different threads accounting for different sets of MB columns, and with any
type of search, Full Search or fast methods, the area increases rapidly due to
the separate RAM requirements of the processor contexts. Consequently the
optimum number of processor cores (from ICAP) is close to m = 1, although
the penalty function (as a fraction of total cost) is less severe for Full Search
than for fast methods. Better results may be obtained from a multithreaded
architecture in which several threads work on the same current frame MB.
Consequently, it may be preferable to use the alternative TLP below
calculate SAD and compare to minimum;
Figure 5.14 Alternative multithreaded pseudo-code.
The conclusion from this analysis is that the DLP/TLP design considered
above, which uses a separate processor core for each column of current frame
MBs, should use a moderate amount of DLP, perhaps up to 4 bytes (n = 4), together
with a mild amount of TLP (perhaps m = 4 for Full Search and m
= 2 for fast methods). One question posed at the start of this
section was to what extent the exploitation of DLP and TLP could help
improve the speed of the Full Search. With a value of p = 8, and vector
registers 4 bytes long, the DIC for the full search can be reduced by around
one third, as
$$\frac{\left(1.0 + \frac{(2p+1)^2}{290}\right)\left(1 - \frac{1}{n}\right)}{2.9 + (2p+1)^2\left(\frac{1}{290} + \frac{1}{780}\right)} \approx \frac{1.5}{4.3} \approx 0.35 \tag{5.34}$$
consequently the value of fmax can also be so reduced.
Running four processor cores creates a further reduction of up
to a factor of around 2.5, as DIC $= 0.2 + 0.8/m \approx 0.4$. The combination of four
threads, vector registers of length 4 bytes, and taking p = 8 would thus only
be able to reduce the Dynamic Instruction Count and hence fmax by a factor of
perhaps four (as $2/5 \times 2/3 \approx 1/4$).
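The combined estimate can be reproduced in a few lines. The sketch below assumes, as the arithmetic in the text does, that the TLP and DLP reductions simply multiply; the TLP fit 0.2 + 0.8/m is the Full Search expression quoted above, and the function names are illustrative.

```python
def tlp_relative_dic(m, const=0.2, per_thread=0.8):
    """Relative DIC per core with m threads, using the fit 0.2 + 0.8/m."""
    return const + per_thread / m


def dlp_relative_dic(p, n):
    """Relative DIC after vectorizing to length n: C_TOT(n) / C_TOT(1)."""
    s2 = (2 * p + 1) ** 2
    scalar = 2.9 + s2 * (1.0 / 290.0 + 1.0 / 780.0)
    vector = 1.9 + 1.0 / n + s2 * (1.0 / (290.0 * n) + 1.0 / 780.0)
    return vector / scalar


def combined_relative_dic(p, n, m):
    """Product of the two factors; independence of TLP and DLP is assumed."""
    return tlp_relative_dic(m) * dlp_relative_dic(p, n)
```

For p = 8, n = 4 and m = 4 the product is about 0.26, i.e. a reduction by a factor of roughly four.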
Things do not change very much as p gets larger, when $\delta C/C_{SCALAR} \to 3(1 -
1/n)/4 = 0.56$, so that the most important effect is again obtained through
TLP. As, in many cases, this can be applied to fast methods as well as the Full
Search, it is clear that such parallel techniques alone will not speed up a Full
Search sufficiently to compete with fast searches.
In the next chapter we shall look at an additional method of reducing 
the power/area and frequency requirements of the Full Search in order to 
allow the clock frequency to be reduced further. 
5.2 Conclusions 
It has been shown that the exploitation of DLP, via vector/SIMD instruction 
set architecture extensions, and the use of ME algorithms, is vital for real-time 
execution of the complex MPEG2 TM5 video encoder. A complementary 
approach is advocated here based around an analytical complexity model for 
the vectorized MPEG2 application through extrapolating the architecture-
level simulation data. As shown, the model matches the simulation results 
quite accurately for a search window range of between 8 and 64 pels. 
Subsequently, we proposed a new complexity metric, the complexity 
(instruction count)-area-product (ICAP) to drive the optimization process 
without the need for prohibitively long, exhaustive simulation of the 
algorithmic and microarchitectural space. This chapter also quantified the 
benefits of thread and data parallel implementations of a number of fast ME 
algorithms for the real-time execution of the MPEG2 standard. The data-
parallel implementations suggest that either no or only moderate SIMD 
capability is sufficient for fast ME encoders. 
At the same time, TLP exploitation via a low-order SOC multi-
processor architecture captures a significant part of the TLP of the workload.
The work presented in this chapter suggests that a two- or four-processor SOC
architecture, with 4-byte data-parallel acceleration per processor, would
therefore represent a reasonable compromise as a general purpose silicon
engine capable of accelerating both full search and fast ME.
6. Reduced Bit Full Search Block Matching 
6.1 Introduction 
In the last chapter methods of reducing computational complexity, area and 
power consumption through efficient DLP and TLP were discussed, as well as 
their combination with fast sub-sampling algorithms. In this chapter we look 
at an efficient way of further reducing Full Search ME algorithm complexity 
through an orthogonal technique. 
We have taken the figure of merit for any algorithm to be the PSNR
value (Peak Signal to Noise Ratio), defined here as PSNR = $20\log_{10}(255/\mathrm{RMS})$
where RMS is the Root Mean Square Error between the current frame and its
approximation reconstructed from reference frame MBs. To be considered as
a realistic alternative, any fast algorithm really needs to be within a fraction of
a dB of the Full Search (regarded here as the gold standard) value.
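The PSNR definition used here translates directly into code. The following is a minimal sketch for frames stored as lists of rows of 8-bit luminance values; the function name and the choice to return infinity for identical frames are assumptions made for the illustration.

```python
import math


def psnr(current, approximation):
    """PSNR = 20*log10(255/RMS) between a frame and its motion-compensated
    approximation, both given as equal-sized lists of rows of 8-bit values."""
    count = sum(len(row) for row in current)
    squared = sum((a - b) ** 2
                  for row_c, row_a in zip(current, approximation)
                  for a, b in zip(row_c, row_a))
    if squared == 0:
        return math.inf   # identical frames: RMS error is zero
    rms = math.sqrt(squared / count)
    return 20.0 * math.log10(255.0 / rms)
```

An RMS error of one grey level corresponds to about 48.1 dB, so the 'fraction of a dB' criterion is a tight one at the PSNR values typical of these sequences.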
In order to preserve the dataflow regularity in the VLSI architecture, 
and at the same time to reduce computational complexity, a number of 
authors have suggested a full-search algorithm, but with eqn. (2.1) for the 
SAD goodness-of-fit value based on a reduced number of bits. Such schemes 
are generally referred to as Bit Truncated algorithms and provide an RBSAD 
(Reduced-Bit Sum of Absolute Difference) metric or distortion measure [101, 
102]. The downside of such a method is its reduced dynamic range. In the 
case of a 4-bit RBSAD, the resolution for pixel luminance values within the 
integer range [0, 255] is reduced from unity to steps of 16. The result is that 
although two blocks may provide the best match with the reduced resolution, 
it is possible that the match with full resolution may not be best. This leads to
an increased error matrix and consequently a higher bit-rate, or a higher
quantisation error and, possibly, a reduction in visual quality. Whilst these
are potential problems, in practice the PSNR values of the two methods are 
very close for most real sequences, Table 6.1. 
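The loss of resolution in the reduced-bit metric is easy to demonstrate. The sketch below assumes that bit truncation keeps the most significant bits of each 8-bit pixel by a right shift; `rbsad` and `full_sad` are illustrative names.

```python
def full_sad(cur, ref):
    """Full resolution SAD over two equal-sized blocks of 8-bit pixels."""
    return sum(abs(c - r) for row_c, row_r in zip(cur, ref)
               for c, r in zip(row_c, row_r))


def rbsad(cur, ref, bits=4):
    """Reduced-bit SAD: only the top `bits` bits of each pixel are compared."""
    shift = 8 - bits
    return sum(abs((c >> shift) - (r >> shift))
               for row_c, row_r in zip(cur, ref)
               for c, r in zip(row_c, row_r))
```

Pixel values 16 and 31 share the same upper four bits, so their 4-bit RBSAD is zero although the full SAD is 15, whereas 15 and 16 differ by one grey level yet give an RBSAD of one step.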
Note that the PSNR values mentioned here are different from (i.e. smaller 
than) those mentioned in Chapters 2 - 4, which are obtained from the output 
'statistics' file [40] generated by MPEG2 encoder. In MPEG2 encoding, 
reconstruction of a macroblock is obtained by adding its prediction, from the 
reference frame, to the reconstructed residual error. In this chapter, to clarify 
the effect of the change to reduced bits, the reconstruction of the original 
macroblock is made from the previous frame, through motion
estimation/compensation, and the reconstructed residual error is not added.
The only cases in which the average PSNR for the sequences studied is
significantly worse using RBSAD are the 'Claire', 'Akiyo' and 'Missa'
sequences where the average PSNR values (with full and reduced resolution)
are already large. A typical frame from the 'Claire' sequence is shown in
Figure 3.14 (Chapter 3). The purpose of this chapter is to investigate means of 
providing a correction term to convert RBSAD values to full resolution SAD 
values, for only those cases in which the RBSAD is poor. In this way the 
advantages of reduced bits (power and speed) can be maintained without 
some of the disadvantages. 
The difference between the average PSNR values for the full resolution
and with reduced bits (using 4 bits) in the case of the 'Claire' sequence is
around 1.5dB. This difference is plotted for the first one-hundred frames of
the sequence in Figure 6.2 (circles). The problem is that with this 'head and
shoulders' sequence there are several very good matches for each current
frame MB and the RBSAD method has difficulty in selecting the best. Indeed
the RBSAD calculation will not be able to distinguish between any candidate
blocks for which the RBSAD calculation is equal to zero, and this still leaves a
wide range of possible SAD values. An obvious and simple correction to the
generally very good RBSAD method is to revert to a full resolution value
whenever RBSAD = 0.
The hardware implications of such a scheme are obvious as whenever
RBSAD = 0, the full resolution calculation is simply the 4-bit RBSAD of the
lower bits.
Table 6.1 PSNR (maximum) for full resolution (FS) and reduced-bit (RBSAD) algorithms. The
corrected values correspond to algorithm I.
Seq.              Mom     Foreman  Fog     Snow Fall  Snow Lane  Claire  Akiyo   Missa
FSSAD             35.86   28.46    32.28   26.82      30.16      46.15   46.2    37.65
RBSAD             35.58   28.37    31.76   26.70      30.09      43.29   43.18   36.83
Corrected RBSAD   35.58   28.37    31.76   26.70      30.09      44.09   46.12   36.91
% recalculation   1.75    3.2      1.79    0          6.65       23.41   13      20
Consequently the calculation may be split into two: a 4-bit RBSAD
calculation for the upper four bits, which is generally used, and, in the case
that this upper bit RBSAD = 0, a 4-bit RBSAD calculation for the lower four
bits. In terms of hardware, two obvious options are possible. First, the two
4-bit calculations can be pipelined, with the correction (Table 6.1) to full
resolution applied only when RBSAD = 0. Thus a full resolution SAD
calculation will take two clock cycles. Second, two identical Full Search
hardware layouts are created, each based on 4 bits. The first runs
continuously, with the second performing a correction calculation whenever
the RBSAD value calculated by the first is zero. The result, in either case, is
a saving in power and either speed or area.
The purpose of this chapter is to investigate the accuracy of this simple 
correction to the standard reduced bit method, to investigate whether such 
correction methods in general are appropriate and to take a design of such a 
system to layout. 
6.1.1 A simple Threshold Algorithm 
We next consider here a number of test sequences, as shown in Table 6.1. 
These involve most types of motion seen in typical sequences and so are 
representative of much of what might be expected in real video clips. We 
imagine a mechanism of thresholding, in which an RBSAD calculation is 
converted into a full resolution SAD if the value of RBSAD is less than some 
threshold value T. We suppose that, in some way as yet unspecified, the 
correction from reduced bit to full resolution can be achieved in hardware. 
Thus we effectively execute the pseudocode shown in Figure 6.1.
if (RBSAD(7:4) < T) then
    output = SAD(7:0)
else
    output = RBSAD(7:4)
end;
Figure 6.1 A reduced bit pseudocode
Note that here we have used the notation that RBSAD(7:4) is an SAD
calculation based on using the top four bits (7:4) of the luminance pixel
values. In a motion estimation calculation for a given sequence, the fraction of
times n(T) that the RBSAD calculation fell below T was recorded as a function
of T; this gives a measure of the efficiency of the method for that threshold
value. Figures 6.3 and 6.4 show the derivative dn/dT for the 'Claire' and the
'snow lane' sequences respectively. It is clear that dn/dT is large at T = 1,
rapidly saturating for larger threshold values. The 'Claire' sequence shows a
particularly rapid increase in dn/dT as T → 1 and, although a similar analysis
of the other sequences shows the same overall behaviour, the behaviour at
small T is generally not quite so rapid. Nevertheless the results imply that if
we apply a correction from the RBSAD to the full resolution SAD for T = 1, i.e.
whenever we get an RBSAD value = 0, we shall correct for many of the errors 
in an RBSAD calculation. 
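The thresholding of Figure 6.1 and the measurement of n(T) can be sketched in a few lines of Python. This is only an illustrative sketch: the function names are hypothetical, blocks are assumed to be flat lists of 8-bit luminance values, and the real calculation in this chapter is a hardware one.

```python
def rbsad_upper(cur, ref):
    """RBSAD(7:4): SAD over the upper four bits of each 8-bit pixel."""
    return sum(abs((c >> 4) - (r >> 4)) for c, r in zip(cur, ref))


def full_sad(cur, ref):
    """SAD(7:0): full resolution sum of absolute differences."""
    return sum(abs(c - r) for c, r in zip(cur, ref))


def thresholded_metric(cur, ref, threshold):
    """Figure 6.1: return (metric, corrected_flag) for one candidate block."""
    rb = rbsad_upper(cur, ref)
    if rb < threshold:
        return full_sad(cur, ref), True   # converted to full resolution
    return rb, False                      # reduced-bit value kept


def correction_fraction(block_pairs, threshold):
    """n(T): fraction of (cur, ref) pairs whose RBSAD falls below T."""
    hits = sum(1 for cur, ref in block_pairs if rbsad_upper(cur, ref) < threshold)
    return hits / len(block_pairs)
```

With T = 1 the condition `rb < threshold` reduces to RBSAD = 0, which is the case singled out in the text.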
[Plot: PSNR difference against frame number]
Figure 6.2 Difference between full resolution (FSSAD) PSNR and reduced bit RBSAD PSNR
values (circles), difference between full resolution (FSSAD) PSNR and corrected reduced bit
(Algorithm I) RBSAD PSNR values (asterisks), difference between full resolution (FSSAD)
PSNR and corrected reduced bit (Algorithm IV) RBSAD PSNR values (squares), for the 100
frames of the 'Claire' sequence.
The upper four bit RBSAD value is equal to 0 only when
$$|s(i,j,k) - s(i+m,\, j+n,\, k-1)| < 16$$
for all i and j in the current frame MB, as then, restricting the analysis to only
the upper four bits (7:4), gives
$$\mathrm{RBSAD} = \sum_{i,j=1}^{16} \left| s(i,j,k)_{(7:4)} - s(i+m,\, j+n,\, k-1)_{(7:4)} \right| = 0 \qquad (6.1)$$
As a consequence only the lower four bits contribute to the SAD sum and we
may replace the full resolution SAD value by the lower four-bit calculation.
Clearly then
$$\mathrm{SAD} = \sum_{i,j=1}^{16} \left| s(i,j,k) - s(i+m,\, j+n,\, k-1) \right| = \sum_{i,j=1}^{16} \left| s(i,j,k)_{(3:0)} - s(i+m,\, j+n,\, k-1)_{(3:0)} \right|$$
[Plot: percentage corrections against threshold T]
Figure 6.3 The fraction of corrections dn(T)/dT made with a threshold value of T for the
'Claire' sequence. The region around the origin is shown as inset.
[Plot: percentage corrections against threshold T]
Figure 6.4 As Figure 6.3 for the 'snow lane' sequence. The region around the origin is shown
as inset.
This then corresponds to executing the pseudocode
if (RBSAD(7:4) = 0) then
    output = RBSAD(3:0) ≡ SAD(7:0)
else
    output = RBSAD(7:4)
end;
Figure 6.5
where the RBSAD(3:0) value is computed from a replica of the hardware (or 
indeed even the same hardware) that obtained the value of RBSAD(7:4). The 
calculation of the two stages can either be pipelined in the same unit or 
performed in parallel by separate units. Either way, depending on the 
number of corrections made, a considerable saving in computation, and hence 
power, may be achieved. Most of the current 8-bit (full resolution) 
architecture designs are suitable for modification as they are typically created 
in a bit slice fashion. 
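The claim that the lower four-bit calculation reproduces the full resolution SAD whenever RBSAD(7:4) = 0 can be checked numerically. The sketch below is a software illustration only, with hypothetical function names and a randomly generated macroblock standing in for real video data.

```python
import random


def nibble_sad(cur, ref, shift):
    """4-bit SAD over one nibble: shift=4 is RBSAD(7:4), shift=0 is RBSAD(3:0)."""
    return sum(abs(((c >> shift) & 0xF) - ((r >> shift) & 0xF))
               for c, r in zip(cur, ref))


def full_sad(cur, ref):
    """Full resolution SAD(7:0) over 8-bit pixel values."""
    return sum(abs(c - r) for c, r in zip(cur, ref))


def algorithm_one(cur, ref):
    """Figure 6.5: the lower-nibble unit runs only when the upper RBSAD is zero."""
    upper = nibble_sad(cur, ref, 4)
    if upper == 0:
        return nibble_sad(cur, ref, 0)
    return upper


random.seed(0)
cur_mb = [random.randrange(256) for _ in range(256)]   # one 16 x 16 macroblock
# Candidate sharing every upper nibble with cur_mb, so that RBSAD(7:4) = 0.
cand_mb = [(p & 0xF0) | random.randrange(16) for p in cur_mb]
print(algorithm_one(cur_mb, cand_mb) == full_sad(cur_mb, cand_mb))
```

The two 4-bit evaluations map onto the pipelined or parallel units described above; the software version simply calls the same function twice with a different shift.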
[Block diagram: on-chip memories M1 (upper 4 bits) and M2 (lower 4 bits) feed the RBSAD
units PEu and PEl; the zero flag of PEu enables PEl and an output MUX selects between them]
Figure 6.6 Upper (PEu) and lower (PEl) SAD calculations in a processing element PE.
For example if the architecture is based on a dedicated adder tree, the zero
flag of the final add-and-accumulate adder, at the output of the tree used to
evaluate eqn. (6.1), may also be used as an enable/disable for the lower bit
calculation and also as a control for the MUX which selects between the upper
and lower bit values, Figure 6.6.
Assuming that a four bit RBSAD takes roughly half the power of a full 
resolution SAD calculation, and that a recalculation is required in 23.4% of 
cases, as in the case with the 'Claire' sequence (Table 6.1), a power saving of 
roughly 76.6%/2 = 38.3% is obtained compared to the Full Search method. 
Note that in the cases for which the RBSAD calculation produces an accurate
PSNR value (i.e. all but the 'Claire', 'Missa' and 'Akiyo' sequences) there are
almost no corrections applied. In addition, either the timing should in
principle be roughly halved with only a minimal increase in area or,
alternatively, the silicon area can be roughly halved with similar timings.
6.2 Architecture I 
One possible architecture envisaged here, somewhat different from that 
presented in section 6.9, involves a number of Processing Elements (PEs), each 
of which deals with the calculation of the best fit of a single MB in the current 
frame. Thus an image with dimensions 320 x 480 pixels will involve 20 PEs. 
PE1 will work on the current frame data in columns 1 to 16, PE2 will work on
columns 17 to 32 etc. The RBSAD(7:4) calculation takes data from on-chip
memory M1, which stores the upper four bits of the pixel values as they are
read in from external memory. Unless RBSAD(7:4) = 0, the data in on-chip 
memory M2, which stores the lower four bits, will not be read. 
Not having to read all the data from on-chip memory can yield a large 
saving in power requirement. In addition using only the top four-bits yields 
cheaper, less power hungry adders (for four-bit inputs standard ripple adders 
are as fast as the so called 'fast adders' and consume a significantly smaller 
area) [103], Figure 6.6. It is likely that such a simple algorithm as above will 
not provide sufficient corrections (Table 6.1) so we now consider the more 
general case of what to do if RBSAD(7:4) is not zero. 
6.3 Corrected-RBSAD Algorithms and possible Hardware Realizations 
The downside of using the RBSAD (Reduced-Bit Sum of Absolute Difference)
metric [101, 102] is the reduced dynamic range it imposes. However it is
possible to correct the RBSAD value to full resolution by adding the term
$$A(m,n) = \sum_{i,j=1}^{16} \varepsilon_{i,j}(m,n) \left( s(i,j,k)_{(3:0)} - s(i+m,\, j+n,\, k-1)_{(3:0)} \right) \qquad (6.2)$$
[104], where the subscript (3:0) refers to the lower four bits of the pixel values,
and $\varepsilon_{i,j}(m,n)$ is the sign of $s(i,j,k)_{(7:4)} - s(i+m,\, j+n,\, k-1)_{(7:4)}$, unless
it is zero, when $\varepsilon_{i,j}(m,n)$ defaults to the sign of
$s(i,j,k)_{(3:0)} - s(i+m,\, j+n,\, k-1)_{(3:0)}$. To implement this
correction consequently requires the signs of the absolute difference values at
the input to the adder tree, and their zero flag output, to be saved. This
corresponds to a slight increase in the, generally large, on-chip memory
requirement for the processing elements which perform the Motion
Estimation. There is also a slight overhead in the overall size of the adder tree.
An adder tree required to calculate an 8-bit SAD is generally slightly smaller
than the sum of that required for a 4-bit RBSAD, eqn. (6.1), and that required
to calculate the correction term, eqn. (6.2). However it is clear that the power
savings can be significant. The main question for investigation, and the focus
of this section, is: under what conditions should this correction (to full
resolution) be applied to obtain the best results?
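The correction term of eqn. (6.2) can be checked with a short Python sketch. The function names and the random test blocks are hypothetical. The sketch also assumes that the RBSAD is formed from the shifted nibble values (0 to 15), so that it carries a weight of 16 when combined with the correction; that weighting follows from the bit positions rather than being stated explicitly in the text.

```python
import random


def sign(x):
    """Return -1, 0 or +1 according to the sign of x."""
    return (x > 0) - (x < 0)


def rbsad_with_correction(cur, ref):
    """Return (RBSAD(7:4), A) for two equal-length lists of 8-bit pixels."""
    rbsad, correction = 0, 0
    for c, r in zip(cur, ref):
        d_upper = (c >> 4) - (r >> 4)
        d_lower = (c & 0xF) - (r & 0xF)
        # epsilon: sign of the upper difference, or of the lower one if the upper is zero
        eps = sign(d_upper) if d_upper != 0 else sign(d_lower)
        rbsad += abs(d_upper)
        correction += eps * d_lower
    return rbsad, correction


random.seed(1)
cur_mb = [random.randrange(256) for _ in range(256)]
ref_mb = [random.randrange(256) for _ in range(256)]
rbsad, corr = rbsad_with_correction(cur_mb, ref_mb)
sad = sum(abs(c - r) for c, r in zip(cur_mb, ref_mb))
print(sad == 16 * rbsad + corr)   # assumed weighting of the upper nibble
```

Note that the correction can be negative for an individual pixel, which is why the sign and zero information from the upper-bit stage has to be kept.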
[Plot: PSNR difference against frames]
Figure 6.7 The difference between FSSAD and RBSAD for the 'Akiyo' sequence. The solid line
represents the standard RBSAD while the squares represent the corrected curve
corresponding to Algorithm I.
6.4 The Corrected-RBSAD algorithm 
For those cases in which the RBSAD is poor, the aim is to generate an
algorithm which maintains the advantages of reduced bits (power and speed
or area) but which also avoids some of the disadvantages (visual quality, bit
rate, etc). The difference between the average PSNR for the full resolution and
with reduced bits (using 4 bits) is around 1.5 dB for the 'Claire' sequence,
reaching a maximum value of 2.86 dB. This difference is plotted for the first
one hundred frames of the sequence in Figure 6.2 (circles) for the 'Claire'
sequence and in Figure 6.9 (solid lines) for the 'Akiyo' and 'Missa' sequences
and is clearly too large for the RBSAD method to be considered as an
alternative to the FSBME in this case. The results show that the RBSAD metric
performs less well for head-and-shoulders sequences like 'Claire', which are
mostly reasonably static. The lack of motion of MBs may be tested for,
through the temporal and/or spatial correlation of motion vectors between 
and within frames. 
In references [21] and [23], spatial and temporal correlations are used
in ME algorithms to vary the window size and limit the search space, so as to
make the algorithm faster and less power consuming. In this work spatial and
temporal correlations are also used to limit the search space, but with the
RBSAD metric, to make the algorithmic implementations fast and power
efficient.
I.    if (RBSAD = 0) then apply correction
II.   if (mvT = 0) then apply correction in range of
      mv's given by [-1,1]x[-1,1]
III.  if (mvS = 0) then apply correction in range
      [-1,1]x[-1,1]
IV.   if (mvS = 0 or mvT = 0) then apply correction
      in range [-1,1]x[-1,1]
V.    if (RBSAD = 0) then apply correction and
      if (mvS = 0) then apply correction in
      range [-1,1]x[-1,1]
VI.   if (RBSAD = 0) then apply correction and
      if (mvT = 0) then apply correction in
      range [-1,1]x[-1,1]
VII.  if (RBSAD = 0) then apply correction and
      if (mvT = 0 or mvS = 0) then apply
      correction in range [-1,1]x[-1,1]
VIII. if (RBSAD = current minimum) and [...]
      then (update motion vector)
Figure 6.8 RBSAD Algorithms
The Maximum error per macroblock in Figures 2(a) and 2(b) of reference [104] 
suggests that RBSAD performance is worse for 'head and shoulders' 
sequences (also see section 6.1). In order to determine when the RBSAD 
metric should be used and when the correction to full resolution should be 
applied, we consider the eight algorithms which implement these notions. 
The algorithms (labelled I to VIII) are listed in Figure 6.8, where 'mv_xc',
'mv_yc', 'mv_xp', 'mv_yp' are the x and y components of the current and
previous motion vectors in the same search area respectively; 'mvT' is the
motion vector of the macroblock in the same position in the previous frame;
and 'mvS' is the motion vector of the macroblock at the left of the current
macroblock in the current frame. Algorithm VIII will be discussed
separately as it may be applied simultaneously with any of the searches, the
notion being that it allows a raster scan to behave in a pseudo-spiral manner.
Under Algorithm I, the correction to the full resolution metric (see
Table 6.1 and Figures 6.2 and 6.7) is applied whenever the RBSAD is equal to
zero [103]. The reason that the RBSAD metric performs less well for
head-and-shoulders sequences like 'Claire' is that they are mostly reasonably
static. However, such sequences have high spatial and temporal correlation
which can be used to generate potential conditions for the application of the
correction. Algorithm II makes use of the temporal correlation between
frames. Here mvT is the motion vector of the corresponding macroblock in the
previous frame (i.e. in the same position). The notion is that if the block did
not move in the previous frame then it is unlikely to move far in the current
frame so that the correction should be applied in the nine cases m = -1, 0, +1
and n = -1, 0, +1 (although this could be extended to -2, -1, 0, +1, +2, etc.).
Algorithm III assumes that a degree of spatial correlation exists, where mvS is
the motion vector of the macroblock to the left (or above, if no such MB exists)
of the current macroblock in the same frame. Algorithm IV utilizes both
spatial and temporal correlation, and is a simple hybrid of Algorithms II and
III. Likewise Algorithm V is a hybrid of Algorithms I and III, Algorithm VI is
a hybrid of Algorithms I and II and Algorithm VII is a hybrid of Algorithms I
and IV.
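The correction conditions of Algorithms I to VII can be summarised as a single predicate. The Python below is a sketch under stated assumptions: the function and argument names are hypothetical, the hybrid algorithms are read as a logical OR of their component conditions, and the fallback to the macroblock above when there is no left neighbour is left to the caller.

```python
def in_small_range(m, n, radius=1):
    """True when the candidate displacement lies in [-radius, radius] x [-radius, radius]."""
    return abs(m) <= radius and abs(n) <= radius


def apply_correction(algorithm, rbsad, mv_t, mv_s, m, n):
    """Decide whether candidate (m, n) is corrected to full resolution.

    rbsad : RBSAD(7:4) of the candidate
    mv_t  : motion vector of the co-located MB in the previous frame (mvT)
    mv_s  : motion vector of the neighbouring MB in the current frame (mvS)
    """
    zero_rbsad = rbsad == 0
    static_t = mv_t == (0, 0) and in_small_range(m, n)
    static_s = mv_s == (0, 0) and in_small_range(m, n)
    rules = {
        "I": zero_rbsad,
        "II": static_t,
        "III": static_s,
        "IV": static_t or static_s,
        "V": zero_rbsad or static_s,
        "VI": zero_rbsad or static_t,
        "VII": zero_rbsad or static_t or static_s,
    }
    return rules[algorithm]
```

For Algorithms II to IV the outcome does not depend on the RBSAD value, so it is known before the metric is computed, whereas any algorithm containing condition I must wait for the upper-bit result.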
Table 6.2 Average corrected-PSNR values, % corrected calculations and % power saving for
the 'Claire' sequence.
Algorithm  Average PSNR  Average % corrections  % Power saving
FSBME      41.49         100                    0.0
RBSAD      40.27         0                      50.0
I          40.61         23                     38.5
II         41.11         3                      48.5
III        41.27         3                      48.5
IV         41.47         3                      48.5
V          41.37         25                     37.5
VI         41.27         25                     37.5
VII        41.47         25                     37.5
Table 6.3 Average corrected-PSNR values, % corrected calculations and % power saving for
the 'Akiyo' sequence.
Algorithm  PSNR   % corrections  % Power saving
FSBME      46.2   100            0.0
RBSAD      43.18  0              50.0
I          46.12  13             43.5
II         45.64  3              48.5
III        45.46  3              48.5
IV         46.1   3              48.5
V          46.13  15             42.5
VI         46.16  15             42.5
VII        46.15  15             42.5
The average corrected-PSNR values obtained as a result of application of
these algorithms on the 'Claire' and 'Akiyo' sequences (using the first 100
frames for 'Claire' and 25 frames for 'Akiyo'), along with the average
percentage of cases for which corrections are applied, are given in Tables 6.2
and 6.3 respectively. It is clear that, for these sequences, those algorithms that
involve a correction criterion based on the spatial and/or temporal
redundancy in the video clip offer the best power savings and require the
lowest number of corrections. The power saving figures are based on a
relatively simplified calculation.
Table 6.4 Average PSNR values corresponding to Algorithm VIII
Sequence       Full Resolution  RBSAD raster  RBSAD centre-biased raster
Claire         41.49            40.27         41.21
Akiyo          46.2             43.18         46.15
Miss America   37.65            36.83         37
Salesman       34.87            34.6          34.6
Rotating City  19.53            19.5          19.5
Foreman        33.36            33.2          33.2
Assuming a fraction (α) of the calculations are performed at full resolution
and full power P, and the remainder, a fraction (1 - α), are calculated at
power P/2 (i.e. 4 bits), then the fractional power dissipated relative to FSBME
is (1 + α)/2. In truth a full SAD calculation, in the current model, requires
slightly more than P, say P(1 + β), where β is small but may be readily
estimated for any given architecture.
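The simplified power model can be written out as a small Python sketch, which reproduces the power saving columns of Tables 6.2 and 6.3 from the percentage of corrections. The function names are hypothetical, and the overhead parameter for a corrected calculation defaults to zero because the text gives no numerical value for it.

```python
def relative_power(alpha, overhead=0.0):
    """Power relative to FSBME.

    alpha    : fraction of calculations corrected to full resolution
    overhead : small extra cost of a corrected SAD, so that it costs
               P * (1 + overhead); zero recovers the (1 + alpha) / 2 model
    """
    return alpha * (1.0 + overhead) + (1.0 - alpha) * 0.5


def power_saving_percent(alpha, overhead=0.0):
    """Percentage power saving relative to the full resolution search."""
    return 100.0 * (1.0 - relative_power(alpha, overhead))


# Percent corrections taken from Table 6.2 ('Claire'): 0, 3, 23 and 25.
for percent in (0, 3, 23, 25):
    print(percent, round(power_saving_percent(percent / 100.0), 1))
```

A non-zero overhead lowers every saving slightly, in proportion to the fraction of corrected calculations.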
Figure 6.2 shows PSNR differences for the 'Claire' sequence and
Figures 6.9a and 6.9b show PSNR differences for the 'Akiyo' and 'Missa'
sequences respectively. Squares indicate the difference between FSBME and
the corrected-RBSAD using Algorithm IV. Note that the maximum PSNR
difference for the 'Claire' sequence in Figure 6.2 has fallen from 2.86 dB to
around 0.35 dB. While Algorithm IV generates the best results for the above
mentioned sequences, there will be occasions when restricting the SAD
calculation to the ranges m = -1, 0, +1 and n = -1, 0, +1 will be too
restrictive; as a result it is recommended that Algorithm VII, which also
calculates SAD values for those cases in which RBSAD = 0, should be applied
in general. Alternatively the range {-1, 0, +1} could be widened.
[Plot: PSNR difference against frames]
Figure 6.9a PSNR differences for 'Akiyo' sequence (Algorithm IV)
[Plot: PSNR difference against frames]
Figure 6.9b PSNR differences for 'Missa' sequence (Algorithm IV)
Figure 6.10a MV distribution of full resolution full search algorithm in 'Claire' sequence.
Figure 6.10b MV distribution of reduced bit full search centre biased raster algorithm in
'Claire' sequence.
The idea of Algorithm VIII comes from the MV (motion vector) distributions
of the full resolution (Figure 6.10a) and reduced bit full search algorithms and
from the percentage of correct motion vectors obtained. The MV distribution
of a centre biased raster (Figure 6.10b) reduced bit full search algorithm
resembles the MV distribution of the full resolution Full Search algorithm more
closely than that without bias (left to right raster) (Figure 6.10c). With
centre-biased raster search, 81% of motion vectors were obtained correctly,
whereas with left to right raster search, this was only 51%.
Figure 6.10c MV distribution of reduced bit full search left to right raster algorithm in
'Claire' sequence.
The results of average PSNR are shown in Table 6.4 with theoretical power 
saving of around 50%. 
6.5 The general form of the hardware 
Motion Estimation hardware is generally designed around the concept of the 
processing element (PE), a number of which work in parallel to perform the 
calculations. The RBSAD calculation for a MB is done on a row by row basis 
or column by column, taking one row of 16 (4-bit) pixel values from each of 
the current and reference frames. The general form of the processing element 
(PE) envisaged here is depicted in Figure 6.11, along with the memory used in 
the hardware architecture for the motion estimation process. The PE is 
divided into two blocks; the left-hand block (as depicted here) contains an 
absolute difference unit (AD), with 16 carry trees to calculate signs and 16 x
four-bit subtracters, and an adder tree containing fast adders. As the tree
requires only small adders (16 x 4-bit adders, 8 x 5-bit adders, 4 x 6-bit adders,
2 x 7-bit adders and 1 x 8-bit adder) simple ripple adders can be used, which
are as fast as other choices, for such small bit numbers, but consume less
power and occupy a smaller silicon area [105].
[Block diagram: Reference RAM and Current RAM (16x8) supply the upper bits (16x4) and
lower bits (16x4) to two AD and adder tree blocks, the lower-bit block being gated by
'If (COND)'; the outputs are the minimum SAD and the MV]
Figure 6.11 A general form of processing element
The right-hand block in Figure 6.11, which will calculate the correction term
for the SAD value due to the lower four bits, i.e. A(m,n) (eqn. (6.2)), is very
similar to the left-hand block, the major difference being the addition of a
condition unit, so that the adders and subtracters will only be enabled when
the appropriate algorithmic condition is met. Under Algorithm I, and any
hybrids containing it, the left-hand block and right-hand (or A-) block will run
in a pipelined fashion, the A-block running if the RBSAD has been calculated
to be zero, while under other algorithms the blocks may also run in parallel as
the truth of the condition is known before the start of the calculation. An SAD
metric will then take sixteen clock cycles to be executed, using a single row of 
sixteen pixels at a time. Using the FSBME method, or the corrected-RBSAD 
method, the best match of a current frame macroblock to a similar macroblock 
in a search area of 16 x 16 macroblocks, will take 16 x 16 x 16 clock cycles. 
The hardware implementation for Algorithm I, i.e. the condition
RBSAD(7:4) = 0, was considered in section 6.1. The other algorithmic
conditions (mvT = 0 and mvS = 0) are equally simple, but require registers at
the output of each processing element to store the zero condition of the spatial
and/or temporal motion vectors. These algorithms also require a number of
memory locations within the processing element to store the sign and zero
information required by the A-block. The functional block which calculates
A(m,n) is almost identical to that which calculates RBSAD, one difference
being in the selection of the inputs in the absolute difference unit, which only
depends upon the sign of $s(i,j,k)_{(7:4)} - s(i,j,k-1)_{(7:4)}$ in the RBSAD
calculation, where $s(i,j,k)_{(7:4)}$ is the upper four bit value of the current
block pixel and $s(i,j,k-1)_{(7:4)}$ is that of the reference block pixel, but is
slightly more complicated for the correction term A(m,n). The power
implications of such a scheme are obvious as, whenever the algorithmic
condition is not met, the A-block is disabled.
We have considered a number of test sequences, Table 6.2. These 
involve most types of motion seen in typical sequences and so are 
representative of what might be expected in real video clips. The hardware 
essentially executes the code as shown in Figure 6.12. 
if (condition) then Output := SAD(7:0)
else if (not(correcting)) Output := RBSAD(7:4)
     else Output := xFFFF;
     end;
end;
Figure 6.12 A pseudocode for the architecture of Figure 6.11.
Note that if the 'correcting' flag is set the RBSAD minimization is halted and 
only correct SAD values are used to calculate the motion vectors. Note also 
that, for those sequences for which the RBSAD calculation produces an 
accurate PSNR value, almost no corrections are applied. 
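The output selection of Figure 6.12 can be mirrored in a few lines of Python. The names below are hypothetical, and how and when the 'correcting' flag is set by the control logic is not modelled here; the sketch only shows the three possible outputs of the processing element.

```python
MAX_METRIC = 0xFFFF  # value that can never win the minimum search


def pe_output(condition, correcting, sad_full, rbsad_upper):
    """Select the metric passed on to the compare unit (Figure 6.12).

    condition   : True when the algorithmic condition for correction is met
    correcting  : True once only corrected SAD values are to be trusted
    sad_full    : SAD(7:0) for this candidate
    rbsad_upper : RBSAD(7:4) for this candidate
    """
    if condition:
        return sad_full
    if not correcting:
        return rbsad_upper
    return MAX_METRIC


# The four input combinations for one candidate with SAD = 12 and RBSAD = 3.
for condition in (True, False):
    for correcting in (True, False):
        print(condition, correcting, pe_output(condition, correcting, 12, 3))
```

Forcing the output to its maximum value is what halts the RBSAD minimisation: an uncorrected candidate can then never replace a corrected one as the current best match.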
6.6 Reduced-bit VLSI SAD Engine 
It has generally been found that the goodness-of-fit metric between the
current and reference frame MBs, the (SAD) error term, gives an error that
increases monotonically as one moves away from the global minimum, and
that this minimum often lies close to the starting point. If the position of the
global minimum, and hence the motion vector (MV), is biased towards the
centre of the search window, the error metric is said to possess a centre bias
[106]. Assuming a Gaussian error surface,
(6.3) 
the equivalent points, the points of constant error, are the circles $x^2 + y^2 = c^2$,
while assuming a Laplacian error surface
$$E(x, y) = \frac{1}{a^2} \exp\left( -\frac{2(|x| + |y|)}{a} \right) \qquad (6.4)$$
the equivalent points lie on the square |y| = c - |x|. Which of these is most
realistic depends upon the type of motion in the video sequence. For random 
motions a Gaussian error surface will give a better representation while for 
more rectilinear dominated motion, such as camera panning, a Laplacian 
error surface may be better. If two motion vectors evaluate to the same error 
metric then the notion of a centre-bias would imply that the point 'closest' to 
origin should be chosen. With a Gaussian error surface, this would be the 
point on the circle of smallest radius, while with a Laplacian error surface it 
would mean the point on the square of smallest side length. Based on this
assumption of centre-bias, square-shaped search patterns are used in many of
the fast searches, e.g. TSS, NTSS and FSS.
The centre-bias also implies that a large number of blocks in the 
current frame could be regarded as stationary or quasi-stationary. As a result 
knowledge of their motion in the immediately preceding frame can give 
useful information for restricting the current search. Likewise, in most frames 
the motion of many blocks is correlated with that of its neighbours, so that the 
value of recently calculated MVs can also aid in the restriction of the current 
search [6]. In the implementation of algorithm VIII a Laplacian error surface is 
assumed so that, for equivalent SAD points, the one on the square nearest the 
origin, i.e. MV closest to (0,0), is preferred. 
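The centre-biased tie-break can be illustrated with a raster scan in Python. This is a sketch only: the function names are hypothetical, a symmetric search range is assumed for simplicity, and 'closest to (0,0)' is taken in the |x| + |y| sense implied by the Laplacian contours above.

```python
def l1_distance(mv):
    """Distance from the origin along the Laplacian contours |x| + |y| = c."""
    return abs(mv[0]) + abs(mv[1])


def centre_biased_raster_search(metric, search_range):
    """Raster scan from top-left to bottom-right with a centre-biased tie-break.

    metric(m, n) returns the (RB)SAD of the candidate displaced by (m, n).
    """
    best_mv, best_err = None, None
    for n in range(-search_range, search_range + 1):        # rows
        for m in range(-search_range, search_range + 1):    # columns
            err = metric(m, n)
            better = best_err is None or err < best_err
            tie_nearer = (best_err is not None and err == best_err
                          and l1_distance((m, n)) < l1_distance(best_mv))
            if better or tie_nearer:
                best_mv, best_err = (m, n), err
    return best_mv, best_err


# A flat error surface: every candidate ties, so the zero vector is chosen.
print(centre_biased_raster_search(lambda m, n: 0, 8))
```

A plain left to right raster with a strict 'less than' test would instead keep the first tied candidate it meets, at the top-left corner of the search window, which is the behaviour the centre bias is meant to avoid.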
6.7 Generalised-SAD Architecture 
For a comparison, we have based our architecture on that of the reference
hardware description issued by ISO/IEC in 2002, the details of which may be
found in ref [109]. Briefly, it has a SIMD stream architecture that processes the
video data in parallel to estimate the motion vector for each block. Figure 6.13 
shows a block diagram of the architecture (here the full resolution SAD 
engine is shown). It consists of: a Search Window Memory block consisting of 
31 RAM blocks, each of 31 (locations) x 8 bits; an address generator to 
generate the addresses for reading the reference block data from the memory; 
sixteen processing elements (PEs) to compute the SAD values; a circulatory 
shift register to store the current frame macroblock and a compare unit to find 
the block with the minimum SAD value amongst all the candidate blocks. The 
output of each processing element (PE) is pipelined to the SAD compare unit. 
A search window of 31x31 pixels (surrounding the current macroblock) and a 
macroblock of 16 x 16 pixels are used, which means with the full search block 
matching algorithm we have 256 (16 x 16) candidate blocks. 
The search window memory consists of 31 RAMs. Each RAM size is 
8x31, where '8' is the number of bits in each location (8 bits/pixel for full 
resolution) and '31' is the number of locations or addresses. The RAMs are 
arranged together such that their cells form a 31x31 matrix so that the first
row of the search area will be written to the first row of this matrix, as shown 
in Figure 6.15. Each column of this matrix will be connected with one PE 
(processing element). No vertical address generator is needed because all the 
memory bandwidth (248 bits) is used. Using the whole memory bandwidth 
reduces power consumption and improves the performance. The search 
window memory stores the candidate blocks and spreads them to the 
processing elements through the memory bus. 
[Block diagram: Search Window Memory (31 RAM blocks, each 31 locations x 8 bits) with
row address generator and data in, current frame MB, SAD compare unit, motion vectors out]
Figure 6.13 Full-width SAD engine architecture
The 16 processing elements compute the difference between the candidate 
blocks in the search window memory and the current block. The current 
block is stored in a queue of 256 bytes (16 x 16) as one row and the current 
block pixels are broadcast to all the processing elements at the same time as
shown in Figure 6.14. 
One processing element is used per column of candidate blocks. The 
31 memory columns (RAMs) are multiplexed and connected to the 16 
processing elements through the memory bus (MUXs and registers). The 
memory bus consists of 16 multiplexers, each multiplexer has 16 input 
channels of 8 bit width. All multiplexers have the same select lines as shown 
in Figure 6.16. Through the memory bus, each processing element can read 
data from 16 memory columns. These memory columns represent a column of 
candidate blocks; each block is (16 x 16) pixels. Columns 1 to 16 are assigned 
to the first MUX, 2 to 17 to the second MUX, 3 to 18 to the third MUX and so 
on. These MUXs enable the calculation of SAD. 
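The wiring of memory columns to multiplexers described above can be written down as a small Python sketch. The constant and function names are hypothetical, and columns and processing elements are numbered from 1 as in the text.

```python
NUM_RAMS = 31      # memory columns in the search window memory
NUM_PES = 16       # processing elements, one multiplexer each
MUX_INPUTS = 16    # 8-bit input channels per multiplexer


def mux_columns(pe_index):
    """Memory columns (1-based) wired to the multiplexer of PE number pe_index."""
    if not 1 <= pe_index <= NUM_PES:
        raise ValueError("pe_index must be between 1 and 16")
    return list(range(pe_index, pe_index + MUX_INPUTS))


def selected_column(pe_index, select):
    """Column read by a PE when the shared 4-bit select lines hold `select` (0..15)."""
    if not 0 <= select < MUX_INPUTS:
        raise ValueError("select must be between 0 and 15")
    return pe_index + select


print(mux_columns(1))
print(mux_columns(16))
```

Because the select lines are shared, all sixteen processing elements step through their sixteen columns in lockstep, each offset by one column from its neighbour.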
[Block diagram: Search Window Memory (31 RAMs, each 8x31), memory bus structure (MUXs
and REGs), processing elements, current block]
Figure 6.14 Parallel architecture for SAD computation of candidate motion vector (CMV).
With this memory structure, the SAD values for all the candidate blocks in the
same row are computed at the same time using the whole on-chip memory
bandwidth. The output of the current block queue is connected to all the
processing elements, which means the first pixel of the current block is
compared at the same time with the first pixel of all the candidate blocks in
the same row. The select lines are then changed to point to the second pixel in
all the candidate blocks and the current block queue is shifted by one byte, so
the second pixel of the current block will be compared with the second pixel
of all the candidate blocks in the same row, and so on for subsequent pixels.
When one row of pixels is finished, the address generator will be shifted to
point to the next row of pixels in the search window memory. This sequence
will be repeated until the end of the row of candidate blocks, which consists
of sixteen rows of pixels. The address generator is then altered to point to the
first row of pixels in the next row of candidate blocks. For example, after
processing the first sixteen rows of pixels, which make up the first row of
candidate blocks, the address generator will be shifted to point to the second
row of pixels, which is the first row of pixels for the second row of candidate
blocks.
[Block diagram: data in, RAM 1 to RAM 31 built from 1-byte memory cells (1 pixel each),
31 horizontal address enables from the address generator, data out]
Figure 6.15 Internal structure of the search window memory
The output of the current block queue (circulatory shift register, Figure
6.14) is fed back to its input to ensure the continuity of the SAD computation
operations. At the end of the 256 comparisons (16 x 16 absolute differences), 
the first pixel of the current block will be available at the output of the queue 
for the processing elements to start computing the SAD values of the next row 
of candidate blocks. 
[Block diagram: 8-bit inputs to 16x1 MUXs sharing 4-bit select lines, 8-bit outputs]
Figure 6.16 Memory Bus structure.
Each processing element consists of a subtractor and an accumulator as 
shown in Figure 6.17a. The processing element has two inputs, one coming 
from the current block queue and another coming from the candidate block in 
the search window memory via the memory bus. To accumulate the absolute 
value needed in the SAD computation of eqn. (2.1), the accumulator adds or 
subtracts the subtractor output according to its sign bit as shown in Figure 
6.17a. There is another version of the processing element, shown in Figure
6.17b, consisting of an 8-bit subtractor, comparator and 16-bit accumulator.
The subtractor and comparator together act as an absolute difference
'abs(a-b)' unit. The control unit resets all the SAD registers in the processing
elements at the beginning of the SAD computation operation. At the end of 
all the SAD computation operations for a row of candidate blocks, the SAD 
registers will contain the SAD values for those candidate blocks. Those SAD 
values will be stored in the pipelined registers (Figure 6.13) to be used by the 
SAD compare unit so that the processing elements can begin to compute the 
new SAD values for the next row of candidate blocks. 
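The behaviour of a single processing element can be illustrated in software. The following is only a minimal behavioural sketch, not the RTL of the thesis: the function name `pe_accumulate` and the use of Python lists for the pixel streams are assumptions made for illustration.

```python
def pe_accumulate(current_pixels, candidate_pixels):
    """Behavioural model of one PE: a subtractor followed by an accumulator.

    The accumulator adds or subtracts the subtractor output depending on
    its sign, so the SAD register always grows by |cur - cand|.
    """
    sad_register = 0  # reset by the control unit before each candidate block
    for cur, cand in zip(current_pixels, candidate_pixels):
        diff = cur - cand            # subtractor output (signed)
        if diff < 0:                 # sign bit set: subtract the negative value
            sad_register -= diff
        else:                        # sign bit clear: add the value
            sad_register += diff
    return sad_register


# Three-pixel example: |10-12| + |200-100| + |5-5| = 102
example_sad = pe_accumulate([10, 200, 5], [12, 100, 5])
```

In the hardware described above, 256 such pixel pairs (one 16 x 16 block) would pass through the PE before the SAD register is copied to the pipeline registers.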
Figure 6.17a Architecture of a PE
The SAD compare unit compares the 16 SAD values computed by the 
processing elements and generates the required horizontal and vertical 
motion vectors. Figure 6.18 shows the block diagram of the SAD compare 
unit. The inputs to this unit are pipelined through the 16 registers shown in 
Figure 6.13, which enables it to compare a total of 256 computed SAD values 
per search area (16 values at a time) while the processing elements compute 
the SAD values for the next row of blocks. 
Figure 6.17b Another architecture of a processing element (8-bit inputs, comparator, subtractor, zero extension to 16 bits, 16-bit accumulator).
With this architecture a row of Candidate Motion Vectors (CMVs) is
calculated in 16 x 16 clock cycles, i.e. one current frame MB is processed in 16
x 256 clock cycles. The circulatory shift register that holds the current macroblock
takes 256 clock cycles to be written serially and 31 clock cycles are taken by
the RAM with one clock cycle per row. Thus a CIF sized video frame (352 pixels x
288 pixels) will take (352/16 x 288/16) x (16 x 256 + 256 + 31) clock cycles. A
layout for this architecture is shown in Figure 6.21.
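The cycle count quoted for a CIF frame can be evaluated directly. The short sketch below simply works through the arithmetic of the expression in the text; the function and parameter names are hypothetical and chosen only for readability.

```python
def cif_frame_cycles(width=352, height=288, mb_size=16,
                     cmv_rows=16, cycles_per_cmv_row=256,
                     queue_load_cycles=256, ram_row_cycles=31):
    """Evaluate (W/16 x H/16) x (16 x 256 + 256 + 31) from the text."""
    macroblocks = (width // mb_size) * (height // mb_size)
    cycles_per_mb = (cmv_rows * cycles_per_cmv_row
                     + queue_load_cycles + ram_row_cycles)
    return macroblocks, cycles_per_mb, macroblocks * cycles_per_mb


macroblocks, cycles_per_mb, total_cycles = cif_frame_cycles()
# macroblocks = 396, cycles_per_mb = 4383, total_cycles = 1735668
```

With the default values this gives 396 macroblocks per frame, 4383 clock cycles per macroblock and about 1.74 million clock cycles per CIF frame.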
The Reduced-bit SAD unit is organized in a similar fashion to that of 
Figure 6.13 and its layout is shown in Figure 6.22. The only differences from 
the architecture of Figure 6.13 are: the inclusion of some extra Flops to store 
the data, if needed, for assigning the value of Il,im,n) in eqn. (6.2); and the 
extra cells required to implement the pseudo-code below. To ensure a simple 
dataflow the search is performed in a raster fashion. The search starts at the 
top-left of the search window, scanning a row at a time, and finishes at the 
bottom-right. If an RBSAD(7:4) value is found which is smaller than the current
lowest value, then the new value is stored along with the new motion
vector. However, as RBSAD(7:4)(m,n) only has a resolution of 16 (i.e. it takes
on values 16N, for integer N), there may be a number of motion vectors which
generate the same metric. We resolve this conflict by choosing the motion
vector according to the centre-bias assumption, choosing the motion vector
with the lowest value of |x| + |y|, which corresponds to the assumption of a
Laplacian error surface, eqn. (6.4). This 'centre-biased raster-scan' may be
realized by the pseudo-code in Figure 6.19.
Figure 6.18 SAD comparator and coordinate counter.
if abs(v_x) + abs(v_y) < abs(mv_x) + abs(mv_y)
Figure 6.19 Implementation of pseudo-spiral search pattern.
Here (mv_x, mv_y) and (v_x, v_y) are the currently held best motion vector and the
motion vector of the candidate block under test respectively. Around the
square |x| + |y| = D/2, for points yielding the same metric value, this
pseudo-spiral scan will tend to choose points towards the bottom of the
square and then on the right hand side. Thus, from a group of points with equal
RBSAD or SAD values, the one with the smallest value of |x| + |y| is chosen
first. If there is more than one such point the unique MV is chosen (arbitrarily)
according to the criterion first smallest y value and then largest x value,
reflecting the nature of the actual raster scan.
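The centre-biased selection described above can be sketched in software as follows. This is an illustrative model only: the function name, the callable `metric(x, y)` and the symmetric search range are assumptions, and any tie that remains after the |x| + |y| test is resolved here simply by scan order (the first candidate met is kept), which is not claimed to reproduce the exact tie-break of the hardware.

```python
def centre_biased_raster_search(metric, search_range):
    """Raster scan over the search window with a centre-biased tie-break.

    metric(x, y) returns the RBSAD or SAD value of candidate vector (x, y).
    A candidate replaces the held best vector if its metric is smaller, or
    if the metric is equal and |x| + |y| is strictly smaller.
    """
    best_mv, best_val = None, None
    for y in range(-search_range, search_range + 1):      # one row at a time
        for x in range(-search_range, search_range + 1):  # left to right
            val = metric(x, y)
            if best_mv is None or val < best_val:
                best_mv, best_val = (x, y), val
            elif (val == best_val
                  and abs(x) + abs(y) < abs(best_mv[0]) + abs(best_mv[1])):
                best_mv = (x, y)
    return best_mv, best_val


# Three candidates share the minimum metric 0; all others return 16.
zero_points = {(-2, -2), (0, -1), (1, 0)}
chosen_mv, chosen_val = centre_biased_raster_search(
    lambda x, y: 0 if (x, y) in zero_points else 16, search_range=2)
```

In this example (-2, -2) is met first, is then displaced by (0, -1) because 1 < 4, and (1, 0) does not displace it because its |x| + |y| is not strictly smaller.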
6.8 Results 
The performance of the proposed SAD engine naturally depends upon the
conditions under which the RBSAD calculations are corrected to full
resolution. This could be determined by the type of video sequence involved.
For the 'Claire', 'Akiyo' and 'Missa' sequences, the algorithm shown in Figure
6.20 (also see section 6.4) was used for applying corrections [110]. It
makes use of the temporal and spatial correlations that exist between the
same block in successive frames, and between neighbouring blocks in the
same frame.
Here mv_r is the motion vector of the corresponding macroblock in the
previous frame (i.e. in the same position) and mv_s is the motion vector of the
macroblock to the left (or above, if no such MB exists) of the current
macroblock in the same frame.
if (mv_r = 0 or mv_s = 0)
then apply the correction (the A-term) in range
[m, n] = {-1,0,1} x {-1,0,1}
Figure 6.20 A pseudocode for correction application
The notion (as discussed above) is that if a particular block did not move in 
the previous frame then it is less likely to move far in the current frame. 
Likewise if the block to the left in the current frame (whose MV has been 
already determined with the raster-scan defined above) has not moved it is 
less likely that the current block under analysis will move far. In either case 
the correction is limited to the nine cases m = -1, 0, +1 and n = -1, 0, +1 
(although again of course this could be extended to -2, -1, 0, +1, +2, etc.). This 
is one possibility for determining when corrections to full resolution should 
be applied; however any of the range of possibilities discussed earlier (section 
6.4) could be used. The performance of these algorithms can be enhanced by 
hybridising them with the scene cut detection algorithm mentioned in section 
2.3.1 which is based on the variances of the residual error (motion 
compensated and non motion compensated) and the current macroblock. 
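The decision rule of Figure 6.20 can be written as a small function. This is a sketch under stated assumptions: motion vectors are represented as (x, y) tuples, a 'zero' motion vector is taken to mean (0, 0), and the names `correction_window`, `mv_prev_frame`, `mv_neighbour` and `radius` are hypothetical. The `radius` parameter reflects the remark that the range could be extended to -2, ..., +2.

```python
def correction_window(mv_prev_frame, mv_neighbour, radius=1):
    """Return the set of offsets (m, n) at which RBSAD is corrected.

    Correction is applied only if the co-located MB in the previous frame
    or the neighbouring MB in the current frame had a zero motion vector.
    """
    if mv_prev_frame == (0, 0) or mv_neighbour == (0, 0):
        offsets = range(-radius, radius + 1)
        return {(m, n) for m in offsets for n in offsets}
    return set()  # no correction: the 4-bit RBSAD is used everywhere


nine_cases = correction_window((0, 0), (3, -1))   # temporal condition met
no_cases = correction_window((2, 1), (3, -1))     # neither condition met
```

With the default radius the first call returns the nine offsets m, n in {-1, 0, +1}, and the second returns an empty set.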
For other sequences (other than those of 'head and shoulder' type) it is
generally the case that the 4-bit SAD (RBSAD) is already perfectly adequate
[110]; however, this architecture allows for the possibility of corrections when
they are necessary. When using Algorithm IV, it is important to stress that
the correction, although applied to the search for the MV of every current
frame MB, is only applied to 3% of the MBs in the search window, leading to
power savings of close to 50%.
6.9 VLSI Implementation 
RTL code describing the above architecture was verified using Model
Simulator (ModelSim) and synthesized with the UMC 0.13 µm 8-Cu
technology design library. The 0.13 micron process technology was chosen as
it offers a good combination of density, speed and power. The process design
rules are reported to allow densities of over 200,000 gates per square
millimetre, and the process performance boasts, for example, logic speeds over 1
GHz and gate power dissipation of less than 4 nW/MHz, with a 1.2 V supply.
Synopsys Design Compiler was used for logic synthesis (in which VHDL is
converted into a Verilog netlist), while the final layouts were generated using
the Cadence Encounter/Virtuoso combination. The generalized, reduced-bit
SAD engine, Figure 6.22, occupies an area of ~3.0 x 10^5 µm^2 (Table 6.6),
which is roughly half that of the full resolution engine, Figure 6.21, which has
an area of ~5.7 x 10^5 µm^2 (Table 6.5). The maximum frequency of the
generalized SAD engine is also greater at 205 MHz compared to 183 MHz for
the full resolution engine.
As mentioned earlier, the addition has been performed here using
Processing Elements containing 4-bit adders and a 12-bit accumulator, for the
RBSAD engine, and 8-bit adders and a 16-bit accumulator for full resolution.
More significant increases in maximum operating frequency could be
achieved using adder trees optimized to the purpose. The RBSAD unit would
require 16 x 4-bit, 8 x 5-bit, 4 x 6-bit, 2 x 7-bit adders and one 8-bit adder,
while the full resolution engine would require 16 x 8-bit, 8 x 9-bit, 4 x 10-bit,
2 x 11-bit adders and one 12-bit adder. Additional power savings can also be
gained as small (4-bit, 5-bit or 6-bit) adders can use simpler ripple designs
which are as fast as the so called 'fast-adders' of the same bit size but occupy
a smaller area in silicon [105].
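The adder counts and widths quoted above follow a simple pattern: each level of the tree halves the number of adders and widens them by one bit. The sketch below reproduces the two lists; the function name `adder_tree_levels` is hypothetical and the number of first-level adders is assumed to be a power of two.

```python
def adder_tree_levels(first_level_adders, input_bits):
    """List (number of adders, adder width in bits) for each tree level.

    Each level has half as many adders as the one before it, and each
    adder is one bit wider to hold the carry of the previous level.
    """
    levels = []
    count, width = first_level_adders, input_bits
    while True:
        levels.append((count, width))
        if count == 1:
            return levels
        count //= 2
        width += 1


rbsad_tree = adder_tree_levels(16, 4)      # reduced-bit engine
full_tree = adder_tree_levels(16, 8)       # full resolution engine
```

For the RBSAD unit this yields 16 x 4-bit, 8 x 5-bit, 4 x 6-bit, 2 x 7-bit and 1 x 8-bit adders, and for the full resolution engine 16 x 8-bit down to 1 x 12-bit, in agreement with the text.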
6.10 Conclusions 
In this chapter we have proposed an alternative implementation of the 
standard Full Search Block Matching algorithm for the estimation of Motion 
Vectors in video sequencing. The method is based around the general good 
performance of the method of Bit Truncation in which typically the upper 
four bits are used in the Sum of Absolute Difference calculation. It has all the 
advantages of the reduced bit method in that the Reduced Bit SAD (RBSAD) 
may be calculated with power and time savings, but essentially uses the 
calculated RBSAD value as an early termination of the calculations. 
Figure 6.21 Full-width SAD engine
Table 6.5: Full-resolution SAD VLSI engine characteristics
Parameter           Value
Cells (Macros)      12134 (31)
Core size (µm^2)    566188.52
Utilization         94.49%
Fmax (period)       183.4 MHz (5.45 ns)
Std cell rows       203
Figure 6.22: RBSAD engine VLSI macro
Table 6.6: Reduced SAD VLSI engine characteristics
Parameter           Value
Cells (Macros)      6557 (16)
Core size (µm^2)    298667.71
Utilization         91.05%
Fmax (period)       204.9 MHz (4.88 ns)
Std cell rows       148
For the 'Claire' sequence and other 'head & shoulder' sequences, where the
RBSAD value is significantly lower than the SAD value, we have shown that
the majority of fixes for the RBSAD value occur when the match is good and
RBSAD(7:4) = 0. Correcting the SAD output to the true value SAD(7:0) is
possible by repeating the calculation on the lower four bits, i.e. RBSAD(3:0).
On the other hand for the other sequences, in which the values of SAD and
RBSAD are very close, virtually no extra computations are required.
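The reason the lower four bits suffice when RBSAD(7:4) = 0 is that all upper nibbles of the two blocks are then equal, so every pixel difference is determined by the lower nibbles alone. The following sketch illustrates this for 8-bit pixels; the function names are hypothetical, and the scaling of the upper-nibble sum by 16 is an assumption that follows the earlier remark that RBSAD(7:4) takes on values 16N.

```python
def sad_full(cur, ref):
    """Full resolution SAD(7:0) over 8-bit pixels."""
    return sum(abs(c - r) for c, r in zip(cur, ref))


def rbsad_upper(cur, ref):
    """RBSAD(7:4): SAD of the upper nibbles, scaled so values are 16N."""
    return 16 * sum(abs((c >> 4) - (r >> 4)) for c, r in zip(cur, ref))


def rbsad_lower(cur, ref):
    """RBSAD(3:0): SAD of the lower nibbles only."""
    return sum(abs((c & 0xF) - (r & 0xF)) for c, r in zip(cur, ref))


def corrected_sad(cur, ref):
    """Use RBSAD(7:4); when it is zero, correct using the lower nibbles."""
    upper = rbsad_upper(cur, ref)
    return rbsad_lower(cur, ref) if upper == 0 else upper


# Upper nibbles agree (1, 3, 8), so the correction recovers the full SAD.
cur_pixels = [0x12, 0x3F, 0x80]
ref_pixels = [0x15, 0x30, 0x8A]
corrected = corrected_sad(cur_pixels, ref_pixels)   # 3 + 15 + 10 = 28
```

When the upper nibbles differ, the sketch returns the uncorrected RBSAD(7:4) value, mirroring the case in which no extra computation is carried out.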
The majority of standard architectures based on a bit slice design and a
single processing element per current frame MB can be easily modified to
implement this method. The PSNR results, shown in Figure 6.2 (asterisks)
for 100 frames of the 'Claire' sequence, in Figure 6.7 (squares) for 25
frames of the 'Akiyo' sequence, and in Table 6.1 for a group of other sequences,
show a significant improvement in accuracy towards the Full Search but
with most of the power and time savings of the Reduced-Bit SAD method.
This technique was then extended by converting RBSAD to the full resolution
metric only when RBSAD = 0, i.e. T = 1.
Various combinations of conditions under which the RBSAD value
could be corrected to full resolution have also been considered. The sequences
requiring most correction were the standard 'Claire' sequence and other 'head
& shoulder' sequences, for which uncorrected bit-truncation of the SAD
calculation performs significantly less well than full resolution FSBME.
The best performing algorithms are those retaining some temporal memory 
(the zero condition for the motion vector of the same MB in the reference 
frame) and/or spatial memory (the zero condition for the motion vector of 
neighbouring current frame MBs) and scene cut detection (see section 2.3.1). 
Using both spatial and temporal correlation along with MPEG2 [40] 
style adaptive switching between intra and non-intra MBs (based upon 
variances of current MB and motion compensated prediction error), 
Algorithm IV and other spatio-temporal algorithms can give additional
advantage in that they can handle sudden scene changes, which destroy 
temporal correlation. 
Figure 6.23 shows the effect of abrupt scene change on the performance 
of algorithm IV in the 'Flower' sequence. The solid line represents the PSNR 
difference corresponding to RBSAD only, whereas the squared line represents 
the PSNR difference corresponding to algorithm IV. As shown there is a large 
PSNR difference due to abrupt scene change. 
Figure 6.23 Solid line shows PSNR difference corresponding to RBSAD only whereas squared
line shows PSNR difference corresponding to algorithm IV in Flower sequence (x-axis: Frames).
Figure 6.24 Solid line shows PSNR difference corresponding to RBSAD only whereas squared
line shows PSNR difference corresponding to algorithm IV with adaptive switching (based on
variance) in Flower sequence (x-axis: Frames).
Figure 6.25 Percentage of macroblocks coded as intra (x-axis: Frames).
By using adaptive switching between intra and non-intra MBs [101, this 
difference can be reduced as shown in Figure 6.24. Figure 6.25 shows the
percentage of macroblocks selected as intra with a maximum value of 
approximately 2.6%. Note that in Figure 6.24 there is a large negative peak 
and other negative values of the PSNR difference. This is due to the presence 
of intra MBs in the reconstructed frame. Other spatio-temporal algorithms 
give similar results for other sequences containing abrupt discontinuities. 
The algorithms show a significant improvement in accuracy towards
the FSBME method, but with most of the power and time savings of the
Reduced-Bit SAD method. While Algorithm IV is best for the 'Claire' and
other 'head & shoulder' sequences it is likely that Algorithm VII, which
generates a super-set of Algorithm IV, should be applied in general. The
pseudo raster algorithm VIII (Figure 6.19) shows best performance in terms of
quality, power and area saving and complexity reduction.
7. Summary 
7.1 Summary 
This thesis has discussed the architectural specification and physical
implementation of a parametric vector/SIMD accelerator for real-time
MPEG2 encoding. Vector instructions were implemented into the operation
code space of the SimpleScalar ISA to form an extended vector simulator. The
MPEG2 TM5 reference code was then systematically optimized through
tapping the Data-Level-Parallelism (DLP) of the inner loop of Motion
Estimation (ME) via custom vector extension instructions. After running
intense simulations with scalar and vector (extended) binaries (both in Linux
x86 and SS (simulated) mode) for a number of video sequences, search ranges
and vector datapath lengths, it was shown that, for a search window of 81x81
and using full-search motion estimation, these instructions reduce the
computational complexity of the encoding process by up to 60% for a vector
register length of 4 bytes, up to 68% for a vector register length of 8 bytes and
up to 72% for a vector register length of 16 bytes.
The work then progresses towards the physical implementation of this vector
datapath in the form of a vector coprocessor to be coupled with the 5-stage
pipelined RISC CPU (LEON 2).
The RTL for the vector datapath was written, simulated and
synthesized. The combined RISC CPU/Vector accelerator has been
implemented as a hard macro. This work focuses on the flow from
algorithmic C to a placed and routed database for the datapath of the vector
accelerator in a high-performance 0.13 µm, 8-layer copper silicon process, for
a number of register file configurations.
The performance improvement of a number of sub-sampling motion
estimation algorithms due to vector/SIMD instruction set extensions in the
MPEG2 TM5 video encoder was then experimentally evaluated.
Simulation-based results (following the same procedure as for Full Search above)
indicated a substantial complexity metric reduction for Full-Search, Three
Step Search, Four Step Search and Diamond Search, making the latter three
appropriate for execution on a high performance embedded VLSI platform.
Although Fast methods reduce complexity by up to 90% (relative to the scalar Full
Search algorithm), the results showed that wide data parallelism is less
effective for Fast methods and a datapath of 4 bytes is sufficient to exploit the
DLP in Fast methods.
A second, efficient technique, Thread Level Parallelism (TLP), was then
utilized in search of a further reduction in the workload of the encoding
process. In this technique multiple processor contexts are allocated to current
macroblocks so that the process of finding the best match for a current
macroblock is parallelized. For this purpose a Multithreaded Instruction Set
Simulator (MTISS) was implemented by the ESD group by extending the
SimpleScalar ISA with multithreaded (MT) instructions under the
no-operation implementation macro (NOP_IMPL). Those MT instructions were
then introduced into the MPEG2 TM5 model, by the ESD group, in the ME
loop to see the effect of MT on Full Search ME. Relative DIC values were then
obtained (corresponding to thread '0' as it contains the additional instruction
count of non-threaded portions of the program as well as system calls) for a
number of video sequences corresponding to a search window size of 17x17.
Results (corresponding to MT ME and MT DCT, IDCT) showed a complexity
reduction in the range of 70% to 75% by using up to 22 processor contexts.
DIC results corresponding to MT ME alone showed a 10% smaller DIC
reduction compared to the DIC reduction obtained by combined
multithreading of the ME and Transformation (DCT, IDCT) loops. The work in
this thesis then turned to the use of MT processor contexts with the Fast ME
methods. Again relative DIC values were obtained to determine their
computational requirements in an embedded environment. Results showed
that Fast methods give a complexity reduction in the range of 50% to 60% by
using up to 22 processor contexts relative to their non-threaded, Fast ME
version.
The physical implementation of the MT processor contexts (not
included in this thesis) was carried out by the ESD group in which multiple
instances of a RISC based, SPARC-V8 compliant, configurable and extensible
LEON2 CPU processor were instantiated to form a realizable 2-way (two
processor contexts) and 4-way (four processor contexts) multiprocessor SOC
system (augmented with data-parallel coprocessors) with shared memory
(exclusive read, exclusive write).
Results of DIC obtained from DLP and TLP analysis conclusively 
demonstrate that both thread and data level parallelism should be exploited 
for cases where full-search motion estimation is a requirement. By contrast 
however, all fast methods demonstrate that wide data-parallel hardware 
provides little performance improvement over a conservative, 4-byte single-
instruction, multiple-data (SIMD) sum-of-absolute differences (SAD) 
coprocessor. In the context of portable, consumer applications, both sets of 
results strongly suggest a multi-core approach with moderate data-parallel 
infrastructure. 
A theoretical explanation behind the experimental results both for Full
Search as well as sub-sampling ME was then developed and a new compound
performance/area metric, the ICAP, for algorithmic optimization in vectorized
as well as MT applications for low-power, consumer devices was proposed.
This theoretical analysis suggests that for optimisation purposes (i.e. 
optimised area, speed and power) four processor contexts for Full Search and 
two to four processor contexts for Fast ME methods should be used. Similarly 
a vector register length of 4 bytes for small search ranges and 8 to 16 bytes for 
large search ranges should be used for both Full Search and Fast ME methods. 
An alternative use of TLP is also proposed in which Candidate Motion 
Vectors (CMVs) are parallelized. MT instructions are then introduced into the 
Full Search algorithm. For this purpose the default Full Search algorithm, 
implemented in a spiral fashion, is converted into raster fashion and MT 
instructions are introduced into the loop surrounding the CMVs. Although 
the DIC values obtained showed a slightly greater complexity compared to 
the TLP version described above, the hardware implementation of this 
alternative TLP may lead to reduced area and power as the thread/processor 
context in this case. 
SAD calculations based on a reduced number of bits (RBSAD) have
been suggested (as an alternative to previous complexity reducing methods)
which give power and time savings, but the reduced dynamic range means
that picture quality (PSNR) can be compromised. Consequently a
corrected-RBSAD algorithm, which corrects RBSAD to full SAD resolution under
certain circumstances, was presented. The compromise achieved provides the
low power of the reduced bits with a higher accuracy, closer to that of a Full
Search. This is an alternative which is orthogonal to the previous methods of
accelerating the ME algorithms.
A number of other algorithms, which correct the RBSAD to full
resolution under appropriate conditions, were then presented. The results
demonstrate that the optimal conditions for correction include knowledge of
the motion vectors of neighbouring blocks in space and/or time.
Finally a VLSI layout for a new low-power Motion Estimation engine
was presented. It is based on the use of a Reduced-Bit Sum of Absolute
Differences (SAD) metric with an optional correction to the full SAD
calculation for attractive candidate blocks. Reductions in power of close to
50% can be achieved with minimal loss in PSNR values. The design can be
further optimized for power/area by using an optimized adder tree to
compute the SAD values. The generalized reduced-bit SAD engine occupies
an area of around half that for full resolution. Pipelining two of these units on
a single chip would allow faster SAD calculations, using roughly the same
area of silicon, without any additional memory reads and provide power
savings of around 50%.
The combination of DLP (giving a factor of around 2/3), TLP (giving a
factor of 2/5) and RBSAD (giving a factor of 1/2) should reduce the DIC
value and hence clock frequency by a total of 2/3 x 2/5 x 1/2 = 2/15, i.e. close
to an order of magnitude reduction in f_min.
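The combined factor can be checked with exact rational arithmetic. The sketch below only multiplies the three factors quoted in the text; the function name `combined_factor` is hypothetical.

```python
from fractions import Fraction


def combined_factor(*factors):
    """Multiply independent reduction factors exactly."""
    result = Fraction(1)
    for factor in factors:
        result *= factor
    return result


# DLP, TLP and RBSAD factors as quoted in the text.
total = combined_factor(Fraction(2, 3), Fraction(2, 5), Fraction(1, 2))
reduction_ratio = 1 / total   # how many times smaller the DIC becomes
```

The product is 2/15, i.e. the DIC and hence the required clock frequency are reduced by a ratio of 15/2 = 7.5, which is the 'close to an order of magnitude' figure stated above.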
APPENDIX I 
List Of Publications 
1) Chouliaras, V., Nunez-Yanez, J.L. and Agha, S., 'Silicon Implementation
of a Parametric Vector Datapath for Real-Time MPEG2 Encoding', 6th
IASTED International Conference on Signal and Image Processing, Honolulu,
Hawaii, August 2004, pp 298-303, ISBN 0-88986-442-X.
2) V. M. Dwyer, S. Agha and V. Chouliaras, 'Low Power Full-Search Block
Matching using reduced bit SAD values for early termination', Proceedings
of Mirage 2005 International conference on Computer Vision/Computer
Graphics collaboration techniques, INRIA Rocquencourt Paris, France, March
2005, pages 191-196
3) S. Agha, V. M. Dwyer and V. A. Chouliaras, 'Motion estimation with low
resolution distortion metric', Electronics Letters, Vol. 41, No. 12, June 2005,
pages 693-694
4) Vincent M Dwyer, Shahrukh Agha and Vassilios A Chouliaras,
'Reduced-Bit, Full Search Block-Matching Algorithms and their Hardware
Realizations', Proceedings of the 7th International Conference in Advanced
Concepts for Intelligent Vision Systems (ACIVS 2005), September 2005, pages
372-380, Antwerp, Belgium
5) Vassilios A. Chouliaras, Vincent M. Dwyer and Shahrukh Agha, 'On the
performance improvement of sub-sampling MPEG-2 Motion Estimation
Algorithms with vector/SIMD architectures', Proceedings of the 7th
International Conference in Advanced Concepts for Intelligent Vision Systems
(ACIVS 2005), September 2005, pages 595-602, Antwerp, Belgium
6) V. A. Chouliaras, V. M. Dwyer, S. Agha, J. L. Nunez-Yanez, D. Reisis, K.
Nakos, K. Manolopoulos, 'Customization of an embedded RISC CPU with
SIMD extensions for video encoding: A case study', submitted to
Integration, the VLSI journal, 24-10-05
7) V. A. Chouliaras, S. Agha, T. R. Jacobs, V. M. Dwyer, 'Quantifying the
benefit of thread and data parallelism for fast motion estimation in
MPEG-2', submitted to IEE Electronics Letters, 17-11-05
8) V. M. Dwyer, S. Agha and V. A. Chouliaras, 'Reduced-bit VLSI SAD
Engine for low-power Motion Estimation', submitted to IEEE Transactions
on Consumer Electronics
SILICON IMPLEMENTATION OF A PARAMETRIC
VECTOR DATAPATH FOR REAL-TIME MPEG2
ENCODING
V. A. Chouliaras, J. L. Nunez-Yanez, S. Agha
6th IASTED International Conference on Signal and Image Processing, Honolulu,
Hawaii, August 2004, pp 298-303, ISBN 0-88986-442-X.
ABSTRACT 
We discuss the architecture specification, RTL development and ongoing 
physical implementation of a parametric vector/SIMD accelerator for real-
time MPEG2 encoding. The MPEG2 TM5 reference code was systematically 
optimized through tapping the Data-Level-Parallelism (DLP) of the inner 
loop of Motion Estimation (ME) via custom vector extension instructions. We 
show that these instructions reduce the computational complexity of the 
encoding process by up to 60% for full-search motion estimation. The 
combined RISC CPU/Vector accelerator is being implemented as a hard 
macro. This work focuses on the flow from algorithmic C to a placed and
routed database for the datapath of the vector accelerator in a
high-performance 0.13 µm, 8-layer copper silicon process, for a number of register
file configurations.
Summary 
This paper presents software (vector instructions) and hardware (VLSI
macro) implementations of a parametric vector coprocessor by extending the
RISC based configurable LEON2 processor in order to exploit the data
parallelism inherent in the DIST1 function, mainly for the Full Search
algorithm. An explanation of the DIC calculation is presented in section 4.3.
Vectorized instructions/macros are inserted into the instruction space of the
SimpleScalar ISA (32-bit RISC with 64-bit instruction operation code).
Vectorized binaries are run both in x86 (for validating the equivalency of
bitstreams generated by the non-vectorized MPEG2 TM5 and vectorized
binary in x86 using 'cmp' (Linux utility)), as well as in SimpleScalar (SS) mode,
corresponding to three vector register lengths (i.e. a vector datapath of
respectively 4, 8, and 16 bytes) for a number of video sequences and search
ranges to get a vectorized DIC. Similarly a scalar binary was run in SS mode
only to get a scalar DIC.
The work in this paper then presents the hardware implementation of the
vector datapath (the Vector ISA and DLP microarchitecture presented in
Section 4.4 of this thesis) in the form of a vector coprocessor tightly coupled
with the RISC (LEON2) CPU. The RTL code of the datapath is simulated and
synthesized and three layouts are generated corresponding to three vector
register configurations (see Sections 4.5 and 4.6).
The work in this paper and these sections shows that these instructions
reduce the computational complexity of the encoding process by up to 60%
corresponding to a datapath of 4 bytes for full-search motion estimation.
LOW POWER FULL-SEARCH BLOCK MATCHING 
USING REDUCED BIT SAD VALUES FOR EARLY 
TERMINATION 
V M Dwyer, S Agha and V Chouliaras 
Proceedings of Mirage 2005 International conference on Computer Vision/Computer 
Graphics collaboration techniques, INRIA Rocquencourt Paris, France, March 2005, 
pages 191 -196 
ABSTRACT 
Full-search motion estimation is often employed for selection of the best
motion vector through a minimum SAD by iterating over all candidate
motion vectors of the search area. However, although the dataflow is regular
and the architectures straightforward, the computational complexity is high.
Considering all possible candidate motion vectors and calculating a
distortion measure at every search position produces a high computational
burden, typically 60-80% of a video encoder's computational load. This
makes it unsuitable for real time video applications. To alleviate the problem
SAD calculations based on reduced numbers of bits (RBSAD) have been
suggested which give power and time savings, but the reduced dynamic
range means that picture quality can be compromised. This current work
presents a corrected-RBSAD algorithm, which corrects to full SAD resolution
under certain circumstances. The compromise achieved provides the low
power of the reduced bits and a higher accuracy, closer to that of a Full
Search.
Summary 
This paper investigates an implementation of an ME engine/video encoder
using the RBSAD metric with conditional correction. The main advantages of
using the RBSAD (upper four bits only) include reduced power and area, and
increased speed. The drawback is its reduced dynamic range, and hence
reduced quality for 'head and shoulder' type sequences. An obvious and
simple criterion to improve this reduction in quality is to revert to a full
resolution value whenever RBSAD = 0 (Section 6.4, Algorithm I).
The hardware implications of such a scheme are obvious as whenever
RBSAD = 0, the full resolution calculation is simply the 4-bit RBSAD of the
lower bits.
Two possible hardware options are presented. One implies pipelining
the upper and lower four-bit calculations and the second option implies creating
two identical layouts of the RBSAD architecture. The lower bit portion will only
be activated when correction is required. In either case there are savings in
power and either speed or area. The results (Table 6.1) show that the quality
degradation can be improved with a small percentage of corrections while
keeping the simplicity of the hardware architecture.
MOTION ESTIMATION WITH LOW RESOLUTION 
DISTORTION METRIC 
S Agha, V M Dwyer and V. A. Chouliaras 
Electronics Letters, Vol. 41, No. 12, June 2005, pages 693-694
ABSTRACT 
Considering all the possible candidate motion vectors in a given search area 
and calculating a distortion measure at every search position, as with the 
Full-search motion estimation algorithm, places a prohibitively high 
computational burden on the video encoder, making it unsuitable for real-
time/portable video applications. Here we present a version of the Reduced 
Bit Sum of Absolute Difference (RBSAD) algorithm with an optional 
correction to full bit resolution as a means of reducing the computational
complexity. 
Summary 
In this paper, the behaviour of the low resolution SAD distortion metric, the
4-bit RBSAD, with the added option to correct to the full resolution (8-bit)
SAD, is studied. A number of algorithms (Section 6.4, Algorithms I - IV) have
been proposed along with their hardware implementations to act as a
criterion for the correction of RBSAD (4-bit, Reduced Bit) to SAD (8-bit, Full
Resolution), for highly predictable sequences such as the newscaster
sequence 'Claire', for which RBSAD performs significantly less well. As a result,
a corrected-RBSAD may be used which produces around 50% power saving
for sequences with significant motion between frames, and around 40-45% for
those sequences, like the 'Claire' sequence, that do not. This will also provide,
dependent on the actual VLSI architecture used, substantial savings in time.
This work is presented here in Sections (6.1 - 6.5).
REDUCED-BIT, FULL SEARCH BLOCK-MATCHING 
ALGORITHMS AND THEIR HARDWARE 
REALIZATIONS 
Vincent M Dwyer, Shahrukh Agha and Vassilios A Chouliaras 
Proceedings of the 7th International Conference in Advanced Concepts for Intelligent
Vision Systems (ACIVS 2005), September 2005, pages 372-380, Antwerp, Belgium
ABSTRACT 
The Full Search Block-Matching Motion Estimation (FSBME) algorithm is 
often employed in video coding for its regular dataflow and straightforward 
architectures. By iterating over all candidates in a defined Search Area of the 
reference frame, a motion vector is determined for each current frame 
macroblock by minimizing the Sum of Absolute Differences (SAD) metric.
However, the complexity of the method is prohibitively high, amounting to 
60-80% of the encoder's computational burden, and making it unsuitable for 
many real-time video applications. One means of alleviating the problem is to
calculate SAD values using fewer bits (Reduced-Bit SAD); however, the
reduced dynamic range may compromise picture quality. The current work
presents an algorithm which corrects the RBSAD to full resolution under
appropriate conditions. Our results demonstrate that the optimal conditions
for correction include knowledge of the motion vectors of neighboring blocks
in space and/or time.
Summary 
This paper investigates hardware implementations of a number of algorithms,
including algorithms based on spatio-temporal correlations, that act as a
criterion for correction (Section 6.4). The hardware implementation of an
RBSAD ME engine with these algorithms requires additional registers to store
temporal and spatial MVs, an equality detector, a zero detector, MUXs, 'AND'
and 'OR' logic, etc. (Sections 6.6 and 6.7). The majority of standard
architectures, based on a bit slice design and a single processing element per
current frame MB, can easily be adapted to implement these algorithms. The
algorithms (Section 6.4) show a significant improvement in accuracy towards the
FSBME method but with most of the power and time savings of the Reduced-Bit SAD
method. While Algorithm IV is best for the 'Claire' sequence, it is likely that
Algorithm VII, which generates a super-set of Algorithm IV, should be
applied in general.
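As a minimal illustration of the two metrics discussed in this paper (a sketch
only, not the hardware algorithms of Section 6.4), the Python fragment below
computes the full resolution SAD and a reduced-bit SAD that keeps only the upper
4 bits of each 8-bit pel, and uses either one as the cost in an exhaustive
search. The function names, the frame layout (a list of rows) and the
truncation by right shift are assumptions made for illustration; the correction
term of eqn. (6.2) and the correction criteria of the numbered algorithms are
not reproduced here.

```python
def block(frame, y, x, n):
    """Return the n x n block of `frame` whose top-left pel is (y, x)."""
    return [row[x:x + n] for row in frame[y:y + n]]


def sad(cur, ref):
    """Full resolution (8-bit) Sum of Absolute Differences of two blocks."""
    return sum(abs(c - r)
               for row_c, row_r in zip(cur, ref)
               for c, r in zip(row_c, row_r))


def rbsad(cur, ref, bits=4):
    """Reduced-bit SAD: compare only the upper `bits` bits of each 8-bit pel.

    Truncation by right shift is an assumption of this sketch.
    """
    shift = 8 - bits
    return sum(abs((c >> shift) - (r >> shift))
               for row_c, row_r in zip(cur, ref)
               for c, r in zip(row_c, row_r))


def full_search(cur_blk, ref_frame, y0, x0, n, p, metric):
    """Exhaustive block matching over displacements in [-p, p] x [-p, p].

    (y0, x0) is the position of the current block; candidates that fall
    outside the reference frame are skipped. Returns (cost, (dy, dx)) for
    the first candidate, in raster order, with the lowest cost.
    """
    height, width = len(ref_frame), len(ref_frame[0])
    best = None
    for dy in range(-p, p + 1):
        for dx in range(-p, p + 1):
            y, x = y0 + dy, x0 + dx
            if y < 0 or x < 0 or y + n > height or x + n > width:
                continue
            cost = metric(cur_blk, block(ref_frame, y, x, n))
            if best is None or cost < best[0]:
                best = (cost, (dy, dx))
    return best
```

Two pels that differ only in their lower four bits, for example 16 and 31,
contribute nothing to the 4-bit RBSAD. This is the loss of dynamic range that
the correction to full resolution is intended to recover for promising
candidates.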
ON THE PERFORMANCE IMPROVEMENT OF SUB-
SAMPLING MPEG-2 MOTION ESTIMATION 
ALGORITHMS WITH VECTOR/SIMD 
ARCHITECTURES 
Vassilios A. Chouliaras, Vincent M. Dwyer and Shahrukh Agha 
Proceedings of the 7th International Conference in Advanced Concepts for Intelligent
Vision Systems (ACIVS 2005), September 2005, pages 595 - 602, Antwerp, Belgium 
ABSTRACT 
The performance improvements of a number of motion estimation algorithms
are evaluated following vector/SIMD instruction set extensions for MPEG-2
TM5 video encoding. Simulation-based results indicate a substantial
complexity metric reduction for Full-Search, Three step search, four step
search and diamond search, making the latter three appropriate for execution
on a high performance embedded VLSI platform. We develop a theoretical
explanation behind the experimental results both for Full Search as well as
sub-sampling ME and propose a new compound performance/power metric,
the complexity-power-product (CPP), for algorithmic optimization in
vectorized applications for low-power, consumer devices.
Summary 
This paper demonstrates that the exploitation of DLP, via vector/SIMD
instruction set architecture extensions and the use of fast (sub-sampling) ME
algorithms, is absolutely vital for the real-time execution of the complex
MPEG2-TM5 video encoder. CPU architects utilize architectural and/or
trace-driven simulation to determine the optimal mix of microarchitecture (DLP)
and algorithmic optimisations. A complementary approach is advocated
which is based around the development of an analytical complexity model for
the vectorized MPEG-2 application through extrapolating the experimental
(architecture-level) data, as shown in Section 5.1. As shown, the model quite
closely matches the experimental results for a search window range of
between 8 and 64 pels (Section 4.3). Subsequently, we proposed a new
complexity metric, the complexity-power-product (CPP), as well as a
complexity-area product (ICAP) (in Section 5.1), for both TLP and DLP and
for both Full Search and Fast ME algorithms, to drive the
optimization process without the need for prohibitively long, exhaustive
simulation of the algorithmic and microarchitectural space. In addition to this,
an alternative TLP implementation is suggested, i.e. inserting MT instructions
in the Full Search function, which may lead to a smaller size of processor
context.
The results (Sections 5.1 and 5.2) suggest that a two- or four-processor
SoC architecture, with 4-byte data-parallel acceleration per processor, would
therefore represent a reasonable compromise as a general purpose silicon
engine capable of accelerating both full search and fast ME.
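To indicate why the fast (sub-sampling) searches named in the abstract are so
much cheaper than Full Search, the following is a minimal sketch of a three step
search in its common textbook form, with step sizes 4, 2 and 1. It is not
claimed to be the variant used in the paper or in TM5; the function names, the
frame layout (a list of rows) and the evaluation counter are assumptions made
for illustration.

```python
def block(frame, y, x, n):
    """Return the n x n block of `frame` whose top-left pel is (y, x)."""
    return [row[x:x + n] for row in frame[y:y + n]]


def sad(cur, ref):
    """Sum of Absolute Differences of two equal-size blocks."""
    return sum(abs(c - r)
               for row_c, row_r in zip(cur, ref)
               for c, r in zip(row_c, row_r))


def three_step_search(cur_blk, ref_frame, y0, x0, n, metric, step=4):
    """Textbook three step search around the block position (y0, x0).

    At each step the eight neighbours at distance `step` from the current
    best displacement are tested, the centre moves to the best of them,
    and the step is halved. Returns (cost, (dy, dx), evaluations).
    """
    height, width = len(ref_frame), len(ref_frame[0])
    cy, cx = 0, 0                                   # best displacement so far
    best = metric(cur_blk, block(ref_frame, y0, x0, n))
    evaluations = 1
    while step >= 1:
        by, bx = cy, cx
        for dy in (-step, 0, step):
            for dx in (-step, 0, step):
                if dy == 0 and dx == 0:
                    continue                        # centre already evaluated
                y, x = y0 + cy + dy, x0 + cx + dx
                if y < 0 or x < 0 or y + n > height or x + n > width:
                    continue                        # candidate outside frame
                cost = metric(cur_blk, block(ref_frame, y, x, n))
                evaluations += 1
                if cost < best:
                    best, by, bx = cost, cy + dy, cx + dx
        cy, cx = by, bx
        step //= 2
    return best, (cy, cx), evaluations
```

With all candidates inside the frame, this visits 1 + 3 x 8 = 25 positions,
against 15 x 15 = 225 for Full Search over the same +/-7 pel range. The price is
that the search can settle on a local minimum of the distortion metric.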
Customization of an embedded RISC CPU with SIMD 
extensions for video encoding: A case study 
V. A. Chouliaras, V. M. Dwyer, S. Agha 
Submitted to Integration, the VLSI journal, 24-10-05 
ABSTRACT 
This work presents a detailed case study in customizing a configurable, 
extensible, 32-bit RISC processor with vector/SIMD instruction extensions for
the efficient execution of block-based video-coding algorithms utilizing a 
proprietary co-design environment. In addition to the default Full-Search 
motion estimation of MPEG-2 TM5, we implemented fourteen fast ME 
algorithms in both scalar and vector form. Results demonstrate a reduction of 
up to 68% in the dynamic instruction count of the full search-based encoder 
whereas the fast ME algorithms achieved a reduction of 90%, both accelerated 
via three 128-bit vector/SIMD instructions. We address in detail the profiling,
vectorization and the development of these vector instruction set extensions, 
discuss in depth the implementation of a parametric vector accelerator that 
implements these instructions and the introduction of that accelerator into a 
32-bit RISC processor pipeline, in a closely-coupled configuration. 
Summary 
In this paper Data Level Parallelism (SIMD instructions) is employed in the
inner loops (DIm function) of both Full Search as well as Fast ME
algorithms by implementing a vectorized data path (vector coprocessor) as an
extension to a configurable LEON2 (SPARC-V8 compliant) processor. Detailed
profiling of the MPEG2 codec is addressed with and without vector instructions
added, for both Full Search and Fast ME algorithms, as presented in Sections
4.3 and 4.9. An in-depth implementation of a parametric vector accelerator
that implements these instructions and the introduction of that accelerator
into a 32-bit RISC processor pipeline, in a closely-coupled configuration, are
also discussed (Sections 4.4 - 4.6). Results (Sections 4.3 and 4.9) demonstrate
a reduction of up to 68% in the dynamic instruction count of the full
search-based encoder, whereas the fast ME algorithms achieved a reduction of 90%,
both accelerated via three 128-bit vector/SIMD instructions.
QUANTIFYING THE BENEFIT OF THREAD AND DATA 
PARALLELISM FOR FAST MOTION ESTIMATION IN 
MPEG-2 
V. A. Chouliaras, S. Agha, T. R. Jacobs, V. M. Dwyer 
Submitted to IEE Electronics Letters 17-11-05
ABSTRACT 
This work presents preliminary results of a concise investigation of the 
performance benefits obtained by exploiting thread and data parallelism in 
fast motion estimation algorithms in MPEG2. Thirteen such fast ME 
algorithms were implemented using both thread-parallel and data-parallel 
schemes to determine their computational requirements in an embedded 
environment. The results are then compared to both the default 
(non-parallelized, full-search) as well as their respective (non-parallelized,
fast) versions. Results conclusively demonstrate that both thread and data 
level parallelism should be exploited for cases where full-search motion 
estimation is a requirement. By contrast however, all fast methods 
demonstrate that wide data-parallel hardware provides little performance 
improvement over a conservative, 4-byte single-instruction, multiple-data 
(SIMD) sum-of-absolute-differences (SAD) coprocessor. In the context of 
portable, consumer applications, both sets of results strongly suggest a multi-
core approach with moderate data-parallel infrastructure. 
Summary 
This paper presents an investigation into the benefits of exploiting DLP and
TLP in fast ME algorithms. For this purpose the instruction space of the SS ISA
is extended with vectorized as well as MT instructions to create a
Multithreaded Instruction Set Simulator (Sections 4.7 and 4.8). Simulations
are then carried out for thirteen fast ME algorithms implemented using both
DLP and TLP schemes. These simulations involve running the scalar binary
(unparallelized MPEG2) as well as parallelized (DLP and TLP) binaries. The
results (presented in Section 4.9) demonstrate that TLP in fast ME methods
gives a complexity reduction in the range 50 - 60% with up to 22 processor
contexts, and TLP in Full Search gives a 70 - 75% complexity reduction for a
search window of 17x17 with up to 22 contexts. Similarly, DLP gives a
complexity reduction of 90% in fast methods and of up to 68% in Full Search
with a datapath size of 8 bytes (Section 4.9). This also demonstrates that both
thread and data level parallelism should be exploited for cases where
full-search motion estimation is a requirement. By contrast, however, all fast
methods demonstrate that wide data-parallel hardware provides little
performance improvement over a conservative, 4-byte single-instruction,
multiple-data (SIMD) sum-of-absolute-differences (SAD) coprocessor. For
portable consumer applications these results strongly suggest the use of
moderate DLP along with TLP.
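The idea of a 4-byte SIMD SAD operation can be made concrete with a small
software emulation. The sketch below is illustrative only: the names sad4,
pack4 and row_sad_simd are hypothetical and do not correspond to the actual
instruction set extensions of the paper, and the row length is assumed to be a
multiple of four pels.

```python
def pack4(pixels):
    """Pack four 8-bit pels into one 32-bit word, first pel in the low byte."""
    return pixels[0] | (pixels[1] << 8) | (pixels[2] << 16) | (pixels[3] << 24)


def sad4(word_a, word_b):
    """Emulate a hypothetical 4-byte SIMD SAD operation on two 32-bit words.

    Each of the four byte lanes is differenced independently and the four
    absolute differences are summed into a single scalar result.
    """
    total = 0
    for lane in range(4):
        a = (word_a >> (8 * lane)) & 0xFF
        b = (word_b >> (8 * lane)) & 0xFF
        total += abs(a - b)
    return total


def row_sad_simd(row_a, row_b):
    """SAD of two rows of pels using one sad4 operation per four pels.

    Returns (sad, operations), where `operations` counts sad4 calls.
    """
    total, operations = 0, 0
    for i in range(0, len(row_a), 4):
        total += sad4(pack4(row_a[i:i + 4]), pack4(row_b[i:i + 4]))
        operations += 1
    return total, operations
```

A 16-pel macroblock row then needs four such operations in place of sixteen
scalar absolute differences, which gives a feel for the kind of saving a modest
4-byte datapath offers before wider hardware is considered.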
REDUCED-BIT VLSI SAD ENGINE FOR LOW-POWER 
MOTION ESTIMATION 
V. M. Dwyer, S. Agha and V. A. Chouliaras 
Submitted to IEEE Transactions on Consumer Electronics 
ABSTRACT 
A VLSI layout for a new low-power Motion Estimation engine is presented. 
It is based on the use of a Reduced-Bit Sum of Absolute Differences (SAD) 
metric with an optional correction to the full SAD calculation for attractive 
candidate blocks. Reductions in power of close to 50% can be achieved with 
minimal loss in PSNR values. The design can be further optimized for 
power/area by using an optimized adder tree to compute the SAD values. 
The generalized reduced-bit SAD engine occupies an area of around half that 
for full resolution. Pipelining two of these units on a single chip would allow 
faster SAD calculations, using roughly the same area of silicon, without any 
additional memory reads and provide power savings of around 50%. 
Summary 
A study of the algorithms, including Algorithm VIII, the centre-biased raster
RBSAD algorithm (Sections 6.4, 6.6 of the thesis), is presented. Then hardware
architectures for full resolution and RBSAD calculations are proposed
(Section 6.7) along with their VLSI layouts (Sections 6.9 and 6.10). The work
here discusses means of maintaining picture quality by updating the RBSAD
value, for promising candidate blocks, to full resolution by adding the
correction term (Section 6.3, eqn. (6.2)) through a novel recovery mechanism
(Sections 6.6 and 6.7). Full resolution may be restored by pipelining the two
halves of the calculation. The advantage of this approach is that for most
sequences the correction term is unnecessary; for example, only 3% of the
RBSAD calculations require corrections for 100 frames of the 'Claire'
sequence. The silicon area for the pipelined solution is roughly half that of the
full resolution case and the operating frequency is also higher. A VLSI
implementation incorporating two parallel blocks, with the correction block
running only 3% of the time, will result in power savings of close to 50%
with only a 5.6% increase in area (Sections 6.9 and 6.10).
References 
[1] www.microsoft.com/windowsmobiIe/devices/portablemediacenter lover 
viewmspx 
[2] http://wwwmacpower.com.tw/cgi-bin/helpdesk/kb.cgi?view=92&Iang=en 
[3] pcworId.about.com/news/Oct022000idI8693.htm 
[4] imm.demokritos.gr/ doriforiko/imgs/ docs/VoD.htm 
[5] Richard Brice, "Newnes Guide to Digital Television", Oxford: Newnes, 
2000. Includes index. ISBN: 0750645865. 
[6] Digital compression for multimedia :principles and standards /Jerry D. 
Gibson. San Francisco, Calif. : Morgan Kaufmann Publishers, 1998. Includes 
bibliographical references (p. [437]-446) and index. ISBN: 1558603697. 
[7] K. R. Rao, J. J. Hwang, 'Techniques and standards for image, video and 
audio coding', Upper Saddle River, N.J. : Prentice Hall PTR; London 
: Prentice Hall International (UK), 1996 Includes index. ISBN: 0133099075 
[8] A. Murat Tekalp, "Digital video Processing", Hemel Hempstead, Prentice 
HaIl Signal Processing Series, 1995. ISBN: 0131900757 
[9] Arch C. Luther, "Principles of Digital Audio and Video", Norwood, Mass., 
USA and London, UK, 1997, Includes bibliography and index. ISBN: 
0890068925. 
[10] Peter Kuhn: "Algorithms, Complexity Analysis and VLSI Architectures 
for MPEG4 Motion Estimation", Technical University of Munich, Germany, 
Kluwer Academic Publishers. ISBN 0-7923-8516-0, 2003. 
[11] http://www.licensingphilips.com/information/mpeg/ documents355.html 
[12] http://tns-www.lcs.mit.edu/manuals/mpeg2/FAQ 
[13] http://brnrc.berkeley.edu/frame/research/mpeg/fag/mpeg2.html 
[14] http://wi-fiplanet.webopediacom/TERM/L/lossless compression.html 
[15] http://en.wikipedia.org/wikilLossless data compression 
[16] http://en.wikipedia.org/ wikilIPEG 
[17] www.mpeg.org 
[18] http://computing-dictionarv.thefreedictionary.com/MPEG-3 
206 
[19] Viet-Anh Nguyen; Yap-Peng Tan, "Efficient block-matching motion 
estimation based on Integral frame attributes", IEEE Transactions on Circuits 
and Systems for Video Technology, March 2006, Volume 16, Issue: 3, page(s): 
375-385. 
[20] Yeping SUi Ming-Ting Sun, "Fast multiple reference frame motion 
estimation for H.264/ A VC", IEEE Transactions on Circuits and Systems for 
Video Technology, March 2006, Volume 16, Issue: 3,page(s): 447-452. 
[21] Saponara, S., Fanucci, L, "Data-adaptive motion estimation algorithm and 
VLSI architecture design for Iow-power video systems", Computers and 
Digital Techniques lEE Proceedings, Jan. 2004, Volume 151, Issue: 1, page(s) 
51-59. 
[22] Melani, M., Fanned, L., Saponara, S., " Algorithmic/ architectural design 
for H.264/MPEG-4 AVC Iow-power video codec ", Research in 
Microelectronics and Electronics, 2005 PhD, Volume 1, 25-28 July 2005, 
page(s) 209-212. 
[23] Ahmad, 1.; Weiguo Zheng; Jiancong Luo; Ming Liou, "A fast adaptive 
motion estimation algorithm", IEEE Transactions on Circuits and Systems for 
Video Technology, March 2006, Volume 16, Issue: 3, page(s) 420-438. 
[24] Shih-Yu Huang; Chuan-Yu Cho; Jia-Shung Wang, "Adaptive fast block-
matching algorithm by switching search patterns for sequences with wide-
range motion content", IEEE Transactions on Circuits and Systems for Video 
Technology, Nov. 2005, Volume 15, Issue:ll, page(s): 1373-1384. 
[25] Chien-Min Ou, Chian-Feng Le, Wen-Jyi Hwang, "An efficient VLSI 
architecture for H.264 variable block size motion estimation", IEEE 
Transactions on Consumer Electronics, Nov. 2005, Vo!. 51, Issue: 4, page(s) 
1291-1299. 
[26] Chouliaras, V.A., Nunez, J.L., MuIvaney, D.J., Rovati, F.S., Alfonso, D, "A 
multi-standard video accelerator based on a vector architecture", IEEE 
Transactions on Consumer Electronics, Feb. 2005, Volume 51, Issue: 1, page(s) 
160-167. 
207 
[27] Sung-Tae Jung; Sang-Seol Lee, "A 4-way pipelined processing 
architecture for three step search block matching motion estimation ", IEEE 
Transactions on Consumer Electronics, May 2004, Volume 50, Issue: 2, page(s) 
674-681. 
[28] Luca Benini, Davide Bertozzi, Alessandro Bogliolo, Francesco Menichelli, 
Mauro Olivieri, "Exploring the Multi-Processor SoC Design Space with 
SystemC", The Journal of VISI Signal Processing Systems for Signal, Image, 
and Video Technology, publisher: Springer Netherlands, Sept. 2005, Volume 
41, Issue: 2, page(s) 169-182. 
[29] N. C. Paver, M. H. Khan, B. C. Aldrich and C. D. Emmons, " 
Accelerating Mobile Video: A 64-Bit SIMD Architecture for Handheld 
Applications ", The Journal of VISI Signal Processing Systems for Signal, 
Image, and Video Technology, Springer Netherlands, August 2005, Volume 
41, Issue: 1, page(s) 21-34. 
[30] Tero Kangas, Timo D. HamaIiiinen and Kimmo Kuusilinna, " Scalable 
Architecture for SoC Video Encoders", The Journal of VISI Signal Processing 
Systems for Signal, Image, and Video Technology, Springer Netherlands, 
August 2006, Volume 44, Issue: 1-2, page(s) 79-95. 
[31] Yunju Baek, Hwang-Soek Oh and Heung-Kyu Lee: "Block-matching 
criterion for efficient VISI implementation for motion estimation", lEE 
Electronics Letters, 20th June 1996, vol. 32, no. 13, Jun 1996, pp 1184-1185 
[32] SimpleScalar LLC http://www.simplescalar.com/ 
[33] http://www.mpeg.org/MPEG/video.html#video-fags 
[34] http://bmrc.berkeley.edu/frame/research/mpeg/mpeg overview.html 
[35] mae. ucdavis.edu/ -biosport/ presentations/Digital %20Video.doc 
[36] ISO/IEC CD 11172: 'Coding of Moving Pictures and Associated Audio 
for Digital Storage Media at up to 1.5 Mbps', Dec. 1991. 
[37] http://bmrc.berkeley.edu/frame/ research/mpegl mpeg2fag.html 
[38] http://www.mpeg.org/MPEG/MSSG/tm5/ 
[39] http://www mathworks com/products/matlab 
[40] www.mpeg.org/MPEG/MSSG 
208 
[41] www.lboro.ac.uk!departments!el/research!esd! 
[42] Peter Pirsch, Nicolas Demassieux, Winfried Gehrke, "VLSI Architectures 
for Video Compression - A Survey", Proceedings of the IEEE, vol. 83, no. 2, 
feb. 1995, p. 220-246. 
[43] Kuhn, P., Stechele, W., "Complexity Analysis of the Emerging MPEG4 
Standard as a Basis for VLSI Implementation", vol. SPIE 3309 Visual 
Communications and Image Processing, San Jose, Jan. 1998, pp. 498-509. 
[44] Marius Tico, Sakari Alenius, Markku Vehvilainen, "Method Of Motion 
Estimation For Image Stabilization", IEEE International Conference on 
Acoustics, Speech, and Signal Processing, Paper: IMDSP-Pl.l0, May 14-19, 
2006, Toulouse, France. 
[45] E. R. Davies (2005). Machine Vision: Theory, Algorithms, Practicalities. 
Morgan Kaufmann. ISBN 0-12-206093-8 
[46] http://en.wikipedia.org!wiki/Computer vision 
[47] Chang, M.M., Tekalp, A.M., and Sezan, M.1. (1997). 
"Simultaneous motion estimation and segmentation" . 
IEEE Transactions on Image Processing, 6(9):1326-1333 
[48] www.gel.ulavaLca!-parizeau!vi99!Proceedings!node73.htmI 
[49] www.cipprs.org!vi2002!pdfls2-1.pdf 
[50] J. V. Gisladottir, K Ramchandran and M. Orchard, "Motion-based 
representation of Video Sequences using Variable Block Sizes", SPIE Visual 
Communications and Image Processing, Vol. 2727, 1996, Orlando, p 368-374. 
[51] Eric Haratsch, "Core Experiment P5 for MPEG-4: Entropy Constrained 
Variable Block Size Motion Estimation, Motion Compensation", 
Studienarbeit, Institute for Integrated Circuits, Technical University of 
Munich, Germany, 1996. 
[52] Botho Furht, Joshua Greenberg, Raymond Westwater, "Motion 
Estimation Algorithms for Video Compression", Kluwer Academic 
Publishers, Boston/ Dordrecht/ London, 162 pp, 1997. 
209 
[53] Hans Georg Musmann, Peter Pirsch, Hans-Joachirn Grallert, "Advances 
in Picture Coding", Proceedings of the IEEE, vol. 73, no. 4, Apr. 1985, pp. 523-
548. 
[54] F. Dufaux and F. Moscheni, "Motion estimation techniques for digital TV: 
A review and a new contribution", Proc. IEEE, Vol. 83, no. 6, pp 858-876, 1995. 
[55] Joon-Seek Kim, Rae-Hong Park, "A fast feature-based block matching 
algorithm using integral projections", IEEE Journal on Selected areas in 
communications, vol. 10, no. 5, Jun 1992, pp 968-971. 
[56] Ken Sauer and Brian Schwartz, "Efficient Motion Estimation using 
Integral Projections", Transactions on Circuits and Systems for Video 
Technology, vol. 6, no. 5, Oct. 1996, pp 513-518. 
[57] J. D. Robbins and A. N. Netravali, "Recursive motion compensation: A 
review", in Image Sequence Processing and Dynamic Scene Analysis, T. S. 
Huang, ed., pp. 76-103, Berlin, Germany: Springer-Verlag, 1983. 
[58] Bemd Girod, "Motion-Compensating Prediction with Fractional Pel 
Accuracy", Transactions on Communications, vol. 41, no. 4, April 1993, pp 
604-612. 
[59] T. Koga, K. Iinuma, A. Hirano, Y. Ijima, T. Ishiguro: "Motion 
Compensated interframe coding for video conferencing" Proceedings NTC'l 
(IEEE), 1981, p G.5.3.1 - G.5.3.4. 
[60] R.Li, B.Zeng, M.L.Liu: "A new three step search algorithm for block 
motion estimation", IEEE Transactions on Circuits and Systems for Video 
Technology, vol.4, Issue 4, Aug. 1994, Page 438 - 442 
[61] Lai-Man Po, Wing-chung Ma: "A novel four step search algorithm for fast 
block matching", IEEE Transactions on circuits and systems for video 
technology, Vol. 6, pp 313 - 317, Jun 1996. 
[62] J.R.Jain, A.K.Jain:"Displacement Measurement and its application in 
interframe image coding", IEEE Transactions Commun. Vol. COM-29, Dec. 
1981, pp 1799 -1808. 
210 
[63] A.Puri, H.M.Hang, D.L.Schilling: "An efficient block matching algorithm 
for motion compensated coding", Proc. IEEE. Conf. Acoustics, Speech and 
Signal processing, 1987, pp 25.4.1 - 25.4.4. 
[64] Shan Zhu "A new diamond search algorithm for fast block matching", 
IEEE Transactions on Image Processing. Vol. 9, Issue 2, Feb. 2000, Page(s): 287 
-290 
[65] Ram Srinivasan and K.R.Rao: "Predictive Coding based on efficient 
Motion Estimation", IEEE Transactions on Communications, vol. COM-33, 
No. 8, Aug. 1985, pp 888 - 896. 
[66] M. Ghanbari: "The cross search algorithm for motion estimation", IEEE 
Transactions on communications, vol. 38, no. 7, Jul. 1990, pp 950 - 953. 
[67] Liang-Gee Chen, Wai-Ting Chen, Yeu-Shen Jehng "An efficient parallel 
motion estimation algorithm for digital image processing", IEEE Transactions 
on Circuits and Systems for video technology, vol. 1, no. 4, Dec. 1991, pp 378 -
385. 
[68] T. Zahariadis and D. Kalivas, "A Spiral Search Algorithm for Fast 
Estimation of Block Motion Vectors", VIII European Signal Processing 
Conference, Trieste, Italy, Sept. 10-13, pp. 1079-1082, Vol. 11, 1996. 
[69] Lurng-Kuo Liu, Ephraim Feig, "A Block based Gradient descent search 
algorithm for block based motion estimation in video coding", IEEE 
Transactions on Circuits and Systems for Video Technology, vol. 6, no. 4, 
Aug. 1996, pp 419 - 422. 
[70] P. Orbaek, 'A real-time software video codec based on wavelets', Proc. Of 
Intl. Conf. On Communication Technology (IFIP), page(s): 1149-1156, vol. 2, 
2000. 
[71] J. Streit, L. Hanzo, 'A Fractal Video Communicator', IEEE Vehicular 
Technology Conference (VTC), pp. 1030-1034, Stockholm, Sweden, 1994. 
[72] S.Vassiliadis, G. Kuzmanov, S. Wong, 'MPEG-4 and the New Multimedia 
Architectural Challenges', Proc. 15th International Conference on Systems for 
Automation of Engieering and Research (SAER-2001), pp. 24-31, Bulgaria, 
2001. 
211 
[73] 'Emerging H.26L Standard: Overview and TMS320C64x Digital Media 
Platfonn Implementation', White Paper, UB Video Inc., Vancouver, Canada, 
2002. 
[74] V. A. Chouliaras, V. M. Dwyer, Shahrukh Agha, ' Customization of an 
embedded RISC CPU with SIMD extensions for video encoding: A Case 
Study', Submitted to VLSI Journal. 
[75] http://www.arc.com/ 
[76] http://www.tensilica.com/ 
[77] http://www.anncom/products/CPUs/embedded.html 
[78] http://www.mips.com/content/Products/Cores/32-BitCores 
[79] www.siliconhive.com 
[80] www.aspex-semi.com 
[81] www.eIixentcom 
[82] www.cradle.com 
[83] http://www.arc.com/configurablecores/ architectl configurability.html 
[84] http://www.arc.com/configurablecores/ architect! extendibility.html 
[85] D. Burger, T. Austin, 'Evaluating Future Microprocessors: The 
Simplescalar Tool Set', 1996, http://www.simplescalar corn 
[86] V A Chouliaras, J L Nunez-Yanez, S Agha, 'Silicon Implementation of a 
Parametric Vector Datapath for real-time MPEG2 encoding', paper 444-252, 
Proc. lASTED (SIP) 2004, Honolulu, Hawaii, USA. 
[87] www.simplescalar.com/docs/README-def.txt 
[88] http://ceet.tudelft.nl/-demid/SSIAT I 
[89]http://www.lboro.ac.uk/ departments I el/research I esd/PROJECTS/VP 
D/vdp gallery.html 
[90] 'The Leon-2 processor User's manual, XST edition, www.gaisler.com 
[91] 'The SPARC Architecture Manual Version 8' http://wwwsparccom/ 
[92] T Sikora, 'MPEG Digital Video-Coding Standards,' IEEE Signal Proc. 14 
(1997) 82 -100. 
212 
[93] D. Burger, T. Austin, 'Evaluating Future Microprocessors: the 
SimpleScalar Tool Set', Technical report UWT96, University of Michigan, Ann 
Arbor, 1996. 
[94] Mark Walmsley, 'Multi-Threaded Programming in C++', London 
Springer, 2000. include index. ISBN: 1852331461 
[95] John O'Gorman 'Operating Systems', Basingstoke: Macmillan, 2000. 
[96] S. Agha, et al., 'A Thread-level parallelization paper', Submitted to lEE 
Electronics Letters. 
[97] www.engr mun ca/-venky/Notes Mar8.pPt 
[98] en.wikipedia.org/ wikij Classic_RlSC_pipeline 
[99] www.eng.auburn.edu/-agrawvd/E5200/EXAMS/Testl 5200 sol.doc 
[100] www.Vassilios-Chouliaras.com 
[101] Y Baek, H S Oh and H K Lee,' An efficient block-matching criterion for 
motion estimation and its VLSI implementation', IEEE Trans Consumer 
Electron.,42 885-892 (1996). 
[102] 5 Lee, J-M Kim and S-I Chae, 'New Motion Estimation algorithm using 
an adaptively quantized low bit-resolution image and its VLSI architecture for 
MPEG2 video encoding', IEEE Trans. Circuits Syst. Video Technol. 8, 734-744 
(1998). 
[103] V M Dwyer, S Agha and V Chouliaras, Low power full search block 
matching using reduced bit sad values for early termination, Mirage 2005, 
Versailles, France, March 2-3, 2005, page(s): 191-196. 
[104] 5 Agha, V M Dwyer and V Chouliaras, Motion Estimation with Low 
Resolution Distortion Metric, IEEE Elec. Letters, 2005, Vol. 41, No. 12, Pages 
693-694. 
[105] A.Th. Schwarzbacher, J.P. Silvennoinen and J.T.Timoney, Benchmarking 
CMOS Adder Structures, Irish Systems and Signals Conference, Cork, Ireland, 
pp 231-234, June 2002. 
[106] P 144, Botho Furht, Joshua Greenberg, Raymond Westwater, 'Motion 
Estimation Algorithms for Video Compression', Kluwer Academic Publishers, 
1997. 
213 
[107] See ref [10] p 29, 51, 52 <re: Temporal and spatial correlation of MB 
MVs>. 
[108] Feng, J., Lo, K.T., Mehrpour, H. Karbowiak, A. E: "Adaptive block 
matching motion estimation algorithm for video coding", lEE Electron. 
Letters, vol. 31, no. 18, 1995, pp 1542-1543. 
[109] lSO/IEC JTC1/SC29/WGl1-13l3-1, Coding of moving pictures and 
associated audio, 1994. 
[110] V. M. Dwyer, S. Agha and V. A. Chouliaras, 'Reduced-Bit, Full Search 
Block-Matching Algorithms and their Hardware Realizations', ACIVS 2005 
Conference, Belgium, page(s): 372-380. 
[111] V. M. Dwyer, S. Agha, V. Chouliaras " Reduced-bit VLSI SAD Engine 
for Low Power Motion Estimation ", Submitted to IEEE Transactions on 
Consumer Electronics. 
[112] COTT SG XV, "Recommendation H.261 - Video codec for audiovisual 
services", 1990. 
[113] P M Kuhn, "Fast MPEG-4 Motion Estimation: Processor based and 
flexible VLSI implementation" J. VLSI Signal Processing 23 67-92 (1999). 
[114] V G Moshnyaga, "A new computationally adaptive formulation of 
block-matching Motion Estimation", IEEE Trans. Circuits Syst. Video 
Technol.ll, 118-124 (2001). 
[115] H Jong, L Chen and T Chieuh, "Accuracy improvement and cost 
reduction of three-step search block matching algorithm for video coding", 
IEEE Trans. Circuits Syst. Video Technol. 4, 88-91 (1994). 
[116] R Srinivasan and K Rao, "Predictive coding based on efficient Motion 
Estimation", IEEE Trans. Commun., 38 950-953 (1990). 
[117] Suh, J.W.; Jechang Jeong, "Fast sub-pixel motion estimation techniques 
having lower computational complexity", IEEE Transactions on Consumer 
Electronics, Aug. 2004, Volume: 50, Issue: 3, page(s): 968- 973. 
[118] Soongsathitanon, 5.; Woom, W.L.; Dlay, 5.5, "Fast Search Algorithms for 
Video Coding using orthogonal logarithmic search algorithm", IEEE 
214 
Transactions on Consumer Electronics, May 2005, Volume 51, Issue 2, page(s) 
552-559. 
[119] Yatabe, Y.; Fujimoto, M.; Sodeyama, K.; Komi, H, "An MPEG2J4 dual 
codec with sharing motion estimation", IEEE Transactions on Consumer 
Electronics, May 2005, Volume 15, Issue 2, page(s) 660-664. 
[120] Mietens, 5.; de With, P.H.N.; Hentschel, C, "Computational-complexity 
scalable motion estimation for mobile MPEG encoding", IEEE Transactions on 
Consumer Electronics, February 2004, Volume 50, Issue: 1, page(s) 281-291. 
[121] Ramkishor, K.; Gupta, P.S.S.B.K.; Raghu, T.S.; Suman, K, "Algorithmic 
optimisations for software-only MPEG2 encoding", IEEE Transactions on 
Consumer Electronics, Feb 2004, Volume 50, Issue: 1, page(s): 366-375. 
[122] Xuan-Quang Banh; Yap-Peng Tan, "Adaptive dual-cross search 
algorithm for block matching motion estimation", IEEE Transactions on 
Consumer Electronics, May 2004, Volume 50, Issue: 2, page(s) 766-775. 
[123] Byung CheolSong, Kang-Wook Chun, "Multi-resolution block matching 
algorithm and its VLSI architecture for fast motion estimation in an MPEG2 
video encoder", IEEE Transactions on Circuits and Systems for Video 
Technology, 5ept. 2004, Volume 14, Issue: 9, page(s) 1119-1137. 
[124] PM Kuhn, G Diebel, 5 Herman, A Keil, H Mooshofer, A Karp, R Mayer 
and W Stechele, "Complexity and PSNR-comparison of several fast Motion 
Estimation algorithms for MPEG-4", SPIE 3460, Applications of Digital Image 
Processing XXI, San Diego, USA, 486 - 499 (1998). 
215 

