Variable block size motion estimation hardware for video encoders. by Li, Man Ho. & Chinese University of Hong Kong Graduate School. Division of Computer Science and Engineering.
Variable Block Size Motion
Estimation Hardware for Video
Encoders
LI, Man Ho
A Thesis Submitted in Partial Fulfilment 
of the Requirements for the Degree of 
Master of Philosophy 
in 
Computer Science and Engineering 
© The Chinese University of Hong Kong
November 2006 
The Chinese University of Hong Kong holds the copyright of this thesis. Any 
person(s) intending to use a part or whole of the materials in the thesis in 
a proposed publication must seek copyright release from the Dean of the 
Graduate School. 
Thesis/Assessment Committee 
Professor Wong Tien Tsin (Chair) 
Professor LEONG Heng Wai Philip (Thesis Supervisor)
Professor Lee Kin Hong (Committee Member) 






Abstract of thesis entitled: 
Variable Block Size Motion Estimation Hardware for Video 
Encoders 
Submitted by LI Man Ho 
for the degree of Master of Philosophy 
at The Chinese University of Hong Kong in Nov 2006 
Multimedia has experienced massive growth in recent years due 
to improvements in algorithms and technology. An important 
underlying technology is video coding and in recent years, com-
pression efficiency and complexity have also improved signifi-
cantly. Applications of video coding have moved from set-top 
boxes to internet delivery and mobile communications. 
H.264/AVC is the latest video coding standard, adopting variable
block size, quarter-pixel accuracy, motion vector prediction and
multiple reference frames for motion estimation. These new
features result in higher computation requirements than those of
previous coding standards. In this thesis, we propose a family of
motion estimation processors to balance tradeoffs between
performance, area, bandwidth and power consumption on a
field programmable gate array (FPGA) platform.
We combine algorithmic and arithmetic optimizations for motion
estimation. At the algorithmic level, we compare different
algorithms and analyze their complexities. At the arithmetic
level, we explore bit-parallel and bit-serial designs, which employ
non-redundant and redundant number systems. In our bit-serial
design, we study tradeoffs between least significant bit first
(LSB-first) and most significant bit first (MSB-first) modes.
Finally, we offer a library of motion estimation processors
to suit different applications. For bit-parallel processors, we
offer 1-dimensional and 2-dimensional systolic array based
architectures. Together with tree architectures and our proposed
bit-serial architecture, our family of processors is able to cover
a range of applications.
The bit-serial processor is able to support full search, three
step search and diamond search. An early termination scheme
has been introduced to further shorten the encoding time, and
the standard technique is further optimized via H.264/AVC motion
vector prediction.
In this thesis, the first reported MSB-first bit-serial variable 
block size motion estimation processor is introduced. It operates 
at a maximum clock frequency of 420 MHz. This processor is 
capable of performing common intermediate format (CIF) resolution
full search encoding in real time at 23068 macroblocks
per second within a -16 to 15 search range and occupies 2133 
slices on a Xilinx Virtex-II Pro FPGA. 
Our architectures were implemented on an FPGA platform
and comparisons made. The resulting implementations are able to
support H.264/AVC variable block size motion estimation for 





























Firstly, I would like to give sincere thanks to my supervisor,
Professor Philip Leong, who encouraged and challenged me
throughout my research program, guided me in writing conference
papers and this thesis, and put effort into our work. Without his
effort, this dissertation could not have been written. Since the
final year project, Professor Leong has given me a vast amount of
resources to help me grow into a well-equipped research student.
Secondly, I would like to thank Professor Lee Kin Hong, who
offered me an opportunity to teach Computer Architecture, which
helped me a lot in my research. At the same time, my presentation
and teaching skills improved significantly.
Moreover, I would like to thank my MPhil classmates, Mr. Lau
Wai Shing, Mr. Wong Chun Kit, Mr. Cheung Yu Hoi Ocean and
Mr. Lam Yuet Ming, for their aid with my daily research problems.
Finally, I would like to thank my dearest family and Elaine
for giving me full support in daily care, financial subsidies
and a cheerful life from time to time.





1 Introduction 1 
1.1 Motivation 3 
1.2 The objectives of this thesis 4 
1.3 Contributions 5 
1.4 Thesis structure 6 
2 Digital video compression 8 
2.1 Introduction 8 
2.2 Fundamentals of lossy video compression 9 
2.2.1 Video compression and human visual sys-
tems 10 
2.2.2 Representation of color 10 
2.2.3 Sampling methods - frames and fields ... 11 
2.2.4 Compression methods 11 
2.2.5 Motion estimation 12 
2.2.6 Motion compensation 13 
2.2.7 Transform 13 
2.2.8 Quantization 14 
2.2.9 Entropy Encoding 14 
2.2.10 Intra-prediction unit 14 
2.2.11 Deblocking filter 15 
2.2.12 Complexity analysis of different compression
stages 16
2.3 Motion estimation process 16 
2.3.1 Block-based matching method 16 
2.3.2 Motion estimation procedure 18 
2.3.3 Matching Criteria 19 
2.3.4 Motion vectors 21 
2.3.5 Quality judgment 22 
2.4 Block-based matching algorithms for motion es-
timation 23 
2.4.1 Full search (FS) 23
2.4.2 Three-step search (TSS) 24 
2.4.3 Two-dimensional Logarithmic Search Al-
gorithm (2D-log search) 25 
2.4.4 Diamond Search (DS) 25 
2.4.5 Fast full search (FFS) 26 
2.5 Complexity analysis of motion estimation 27 
2.5.1 Different searching algorithms 28 
2.5.2 Fixed-block size motion estimation . . . . 28 
2.5.3 Variable block size motion estimation ... 29 
2.5.4 Sub-pixel motion estimation 30 
2.5.5 Multi-reference frame motion estimation . 30 
2.6 Picture quality analysis 31 
2.7 Summary 32 
3 Arithmetic for video encoding 33 
3.1 Introduction 33 
3.2 Number systems 34 
3.2.1 Non-redundant Number System 34 
3.2.2 Redundant number system 36 
3.3 Addition/subtraction algorithm 38 
3.3.1 Non-redundant number addition 39 
3.3.2 Carry-save number addition 39 
3.3.3 Signed-digit number addition 40 
3.4 Bit-serial algorithms 42 
3.4.1 Least-significant-bit (LSB) first mode ... 42 
3.4.2 Most-significant-bit (MSB) first mode ... 43 
3.5 Absolute difference algorithm 44 
3.5.1 Non-redundant algorithm for absolute dif-
ference 44 
3.5.2 Redundant algorithm for absolute difference 45 
3.6 Multi-operand addition algorithm 47 
3.6.1 Bit-parallel non-redundant adder tree im-
plementation 47 
3.6.2 Bit-parallel carry-save adder tree imple-
mentation 49 
3.6.3 Bit serial signed digit adder tree imple-
mentation 49 
3.7 Comparison algorithms 50 
3.7.1 Non-redundant comparison algorithm ... 51 
3.7.2 Signed-digit comparison algorithm 52 
3.8 Summary 53 
4 VLSI architectures for video encoding 54 
4.1 Introduction 54 
4.2 Implementation platform - (FPGA) 55 
4.2.1 Basic FPGA architecture 55
4.2.2 DSP blocks in FPGA device 56
4.2.3 Advantages employing FPGA 57
4.2.4 Commercial FPGA Device 58
4.3 Top level architecture of motion estimation proces-
sor 59 
4.4 Bit-parallel architectures for motion estimation . 60 
4.4.1 Systolic arrays 60 
4.4.2 Mapping of a motion estimation algorithm 
onto systolic array 61 
4.4.3 1-D systolic array architecture (LA-1D) . . 63
4.4.4 2-D systolic array architecture (LA-2D) . . 64
4.4.5 1-D Tree architecture (GA-1D) 64
4.4.6 2-D Tree architecture (GA-2D) 65
4.4.7 Variable block size support in bit-parallel 
architectures 66 
4.5 Bit-serial motion estimation architecture 68 
4.5.1 Data Processing Direction 68 
4.5.2 Algorithm mapping and dataflow design . 68 
4.5.3 Early termination scheme 69 
4.5.4 Top-level architecture 70 
4.5.5 Non redundant positive number to signed 
digit conversion 71 
4.5.6 Signed-digit adder tree 73 
4.5.7 SAD merger 74
4.5.8 Signed-digit comparator 75 
4.5.9 Early termination controller 76 
4.5.10 Data scheduling and timeline 80 
4.6 Decision metric in different architectural types . . 80 
4.6.1 Throughput 81 
4.6.2 Memory bandwidth 83 
4.6.3 Silicon area occupied and power consump-
tion 83 
4.7 Architecture selection for different applications . . 84 
4.7.1 CIF and QCIF resolution 84
4.7.2 SDTV resolution 85
4.7.3 HDTV resolution 85
4.8 Summary 86 
5 Results and comparison 87 
5.1 Introduction 87 
5.2 Implementation details 87 
5.2.1 Bit-parallel 1-D systolic array 88 
5.2.2 Bit-parallel 2-D systolic array 89 
5.2.3 Bit-parallel Tree architecture 90 
5.2.4 MSB-first bit-serial design 91 
5.3 Comparison between motion estimation architec-
tures 93 
5.3.1 Throughput and latency 93 
5.3.2 Occupied resources 94 
5.3.3 Memory bandwidth 95 
5.3.4 Motion estimation algorithm 95 
5.3.5 Power consumption 97 
5.4 Comparison to ASIC and FPGA architectures in
past literature 99
5.5 Summary 101 
6 Conclusion 102 
6.1 Summary 102 
6.1.1 Algorithmic optimizations 102 
6.1.2 Architecture and arithmetic optimizations 103 
6.1.3 Implementation on an FPGA platform . . . 104
6.2 Future work 106 
A VHDL Sources 108 
A.1 Online Full Adder 108
A.2 Online Signed Digit Full Adder 109
A.3 Online Full Adder Tree 110
A.4 SAD merger 112
A.5 Signed digit adder tree stage (top) 116 
A.6 Absolute element 118 
A.7 Absolute stage (top) 119 
A.8 Online comparator element 120 
A.9 Comparator stage (top) 122 
A.10 MSB-first motion estimation processor 134
Bibliography 137 
List of Figures 
2.1 Hybrid video coder for H.264/AVC 12
2.2 Improvement made by deblocking filter - Left: 
improved 15 
2.3 Selection of block sizes within a frame 17 
2.4 Motion estimation and motion vector 19 
2.5 Sub-macroblock partitions in H.264/AVC 29 
2.6 Integer, half-pixel and quarter-pixel motion esti-
mation search positions (pel stands for pixel) ... 31 
2.7 Matlab simulation on the quality of different mo-
tion estimation algorithms on Foreman 32 
3.1 LSB-first bit-serial addition algorithm 43 
3.2 MSB-first bit-serial addition algorithm 44 
3.3 Signed-digit number based sign detection algorithm 46 
3.4 Sequential SAD computation in general purpose 
processor 47 
3.5 Bit-Parallel 4x2-operand adder tree 48 
3.6 Bit-serial signed-digit adder (ol-CSFA stands for 
on-line carry-save full adder) 51 
4.1 FPGA Logic Cell Architecture (Xilinx Virtex-II
Pro series) 56
4.2 DSP architecture in Xilinx Virtex-5 FPGA . . . . 57
4.3 Model of motion estimation processor 59 
4.4 Data flow in systolic array over general implemen-
tation 60 
4.5 Variable block size motion estimation algorithm . 61 
4.6 Fundamental elements in systolic and tree archi-
tectures 62 
4.7 1-D systolic architecture 63 
4.8 2-D systolic architecture 65 
4.9 1-D Tree architecture 66 
4.10 2-D tree architecture 67 
4.11 H.264/AVC motion vector prediction 69 
4.12 Top level architecture of bit-serial motion estima-
tion unit 71 
4.13 Flow chart of non-redundant to signed-digit num-
ber conversion 72 
4.14 Signed-digit adder tree that generates 41 SADs . 73 
4.15 On-line carry save and signed digit adders . . . . 74 
4.16 A 16-operand carry save adder tree 75 
4.17 16-operand signed-digit adder tree for 4x4 SADs . 76 
4.18 SAD merger 77
4.19 Architecture of on-line comparator 78 
4.20 Timeline of bit-serial design for whole motion es-
timation computation process 79 
5.1 Throughput of different motion estimation archi-
tectures at different resolutions 93 
5.2 Occupied slices of different motion estimation ar-
chitectures 94 
5.3 Bandwidth requirements of different motion estimation
architectures at CIF 30 fps 95
5.4 Throughput of different architectures at different 
motion estimation algorithms 96
5.5 Maximum throughput per slice of different mo-
tion estimation architectures 97 
5.6 Power consumptions of different architectures . . 98 
5.7 Power efficiencies of different motion estimation 
architectures 98 
6.1 Area vs throughput in different motion estima-
tion architectures 105 
List of Tables 
2.1 Complexity profile of each compression stage in 
H.264/AVC 16
2.2 Complexity of block-based searching algorithms 
(measured in MOPS) 28
3.1 Example for online signed-digit adder 50 
4.1 Number of cycles to complete comparison stage 
for different scenes using different starting strat-
egy (16 cycles for no early termination scheme) . 70 
4.2 On-line delay of different SAD types 76
4.3 Delays of primitive operations employed in bit-
parallel motion estimation architectures 81 
4.4 Areas of primitive component employed in bit-
parallel motion estimation architectures 81 
4.5 Throughput of different architectural types (N: 
block size) 82 
4.6 Bandwidth requirement of different architectural 
types 83 
4.7 Area estimation of different architectural types 
(Variable block sizes not supported) 84 
4.8 Power estimation of different architectural types 
(Variable block sizes not supported) 84 
5.1 Results of 1D systolic array processor 89
5.2 Results of 2D systolic array processor 90 
5.3 Results of 1D tree-based motion estimation processor 91
5.4 Results of 2D tree-based motion estimation processor 92
5.5 Results of MSB-first bit-serial processor 92
5.6 Results and comparison of motion estimation processors
on FPGA devices 100
5.7 Results and comparison of motion estimation proces-




Digital video coding has gradually increased in importance since
the 1990s, when MPEG-1 [16] first emerged. It has had a large
impact on video delivery, storage and presentation. Compared
to analog video, video coding achieves higher data compression
rates without significant loss of subjective picture quality. This
eliminates the need for the high bandwidth required in analog
video delivery. With this important characteristic, many
application areas have emerged, for example set-top box video
playback using compact disc, video conferencing over IP
networks, P2P video delivery and mobile TV broadcasting. The
specialized nature of video applications has led to the
development of video processing systems of different size, quality,
performance, power consumption and cost.
Digitization of video scenes was an inevitable step since it has 
many advantages over analog video. Digital video is virtually 
immune to noise, easier to transmit and is able to provide a 
more interactive interface to users. Furthermore, the amount 
of video content, e.g. TV content, can be made larger through
improved video compression because the bandwidth required for 
analog delivery can be used for more channels in a digital video 
delivery system. With today's sophisticated video compression 
systems, end users can also stream video, edit video and share 
video with friends via the internet or IP networks. In contrast, 
CHAPTER 1. INTRODUCTION 2 
analog signals are difficult to manipulate and transmit.
Generally speaking, video compression is a technology for 
transforming video signals that aims to retain original quality 
under a number of constraints, e.g. storage constraint, time de-
lay constraint or computation power constraint. It takes advan-
tage of data redundancy between successive frames to reduce the 
storage requirement by applying computational resources. The 
design of data compression systems normally involves a tradeoff 
between quality, speed, resource utilization and power consump-
tion. 
In a video scene, data redundancy arises from spatial, tem-
poral and statistical correlation between frames. These corre-
lations are processed separately because of differences in their 
characteristics. Hybrid video coding architectures have been 
employed since the first generation of video coding standards, 
i.e. MPEG. MPEG consists of three main parts to reduce data
redundancy from the three sources described above. Motion es-
timation and compensation are used to reduce temporal redun-
dancy between successive frames in the time domain. Transform 
coding, also commonly used in image compression, is employed 
to reduce spatial dependency within a frame in the spatial do-
main. Lastly, entropy coding is used to reduce statistical redun-
dancy over the residue and compression data. This is a lossless 
compression technique commonly used in file compression. 
Hardware video compression systems can be implemented in 
application-specific integrated circuit (ASIC) and field program-
mable gate array (FPGA) technologies and, depending on the 
desired quality, real-time video encoding can be realized in both 
hardware and software technologies. Advances in codecs have
continued because video delivery must be enabled over new media
such as IP networks. As a result, H.264 [48] and MPEG-4
were developed to suit these network applications through en-
hanced compression efficiency and picture quality under very 
low bit-rates. Unfortunately, the complexity of the latest video
codecs for network applications has increased considerably over
previously defined standards such as MPEG-1 and MPEG-2 [17].
Real-time and low power encoding requirements create great 
challenges for software and hardware engineers. 
1.1 Motivation 
Among all the blocks in a video coder, motion estimation is the 
most demanding [8]. It is also the critical part that affects the 
video quality and compression efficiency. For this reason, many 
algorithms and architectures have been proposed to optimize 
this process. With advancement of video codec standards, the 
requirements of motion estimation have increased and thus both 
software (algorithmic) and hardware (architectural) optimiza-
tions must be continuously improved to cope with the increased 
complexity. 
Power and speed are important considerations in codecs,
especially when running on mobile devices. Pure software
implementations of video codecs usually result in large power
consumption and low speed. To solve these issues, hardware support
is often employed. Hardware can use parallelism to obtain higher
performance. It also reduces the high data bandwidth and
the instruction fetching associated with software, which are a
major source of power wastage. As a result, a hardware
implementation consumes less power than a corresponding software
implementation. Realization of video codecs in hardware therefore
plays an important role in mobile applications.
Since the introduction of advanced motion estimation techniques
in the latest video codecs such as MPEG-4 and H.264,
previous motion estimation architectures are no longer fully ap-
plicable. As a result, a family of new motion estimation architec-
tures has been proposed to fit different application requirements 
while still efficiently utilizing resources. 
1.2 The objectives of this thesis 
Hardware assistance for video coding has become an impor-
tant tool for optimizing system performance. Among hybrid 
video coding architectures, motion estimation introduces most 
of the complexity. Although many fast algorithms have been 
proposed to reduce the computational complexity, quality and 
compression performance still may not meet the application
requirements. The complexity of motion estimation has greatly
increased compared to MPEG-1 or MPEG-2 since the introduction
of variable block size motion estimation in MPEG-4. Together
with multi-reference frames and sub-pixel motion estimation
support, this has made prior architectures unsuitable.
In this thesis, a family of motion estimation architectures and
algorithms is studied and analyzed. These architectures were
implemented on an FPGA platform to measure
the efficiency of different architectural approaches. The VLSI
architectures suggested in this thesis are also applicable to ASIC
design. We also analyze the impact of computer arithmetic on
motion estimation. By optimizing the arithmetic, we
discovered efficient ways to implement motion estimation for
low to high-end applications.
A full range of applications is supported by this work, from
low demand applications like low-resolution video conferencing
and mobile digital video broadcasting, to high demand applications
like HDTV video encoding. A family of motion estimation
architectures is suggested to target different applications
while efficiently utilizing computational resources, silicon area 
and power. Using the reconfigurable feature of FPGAs, we can 
provide the most efficient option to users which meets their re-
quirement without changing the underlying hardware device. 
In commercial design situations, the architecture selected is 
often suboptimal because of limited design time and resources. 
This work provides an overview of architectural alternatives to 
realize products in a more efficient manner. In terms of computer
architecture and computer arithmetic, this thesis provides
an architecture design space to explore optimizations for different
kinds of application domains. The ideas suggested in this
work can also be applied to other areas, such as transform or
filtering in a video encoder. With such implementation information,
good tradeoffs between architecture and algorithm can be
made to deliver a design with satisfactory performance, power
and occupied silicon area.
1.3 Contributions 
This thesis presents a family of FPGA-based motion estimation 
architectures for variable block size motion estimation. The
following novel contributions result from this work. 
• A study of the complexity, quality and performance associated
with different motion estimation algorithms jointly
with hardware architectures for H.264/AVC was made. The
hardware architectures were implemented on a Xilinx
FPGA platform. To the best of my knowledge, no study
considering all these issues has been previously reported in
the literature.
• Computer arithmetic optimizations were employed for motion
estimation. Although MSB-first bit-serial architectures with
early termination have been proposed [46], this is the first
reported architecture supporting variable block size motion
estimation.
• The initialization scheme proposed in section 4.5.3 is an 
improvement over the standard one and enables earlier ter-
mination. 
• A family of architectures capable of supporting the latest 
codec standard H.264/AVC was developed which can meet 
the requirements of different kinds of applications under 
different constraints, e.g. performance, area, bandwidth, 
quality and power consumption. 
1.4 Thesis structure 
The main theme of this thesis is to study different implemen-
tation methods for motion estimation in the latest and future 
video coding standards. The reference coding standard in this 
thesis is H.264/AVC or MPEG-4 part 10. All the work stated in 
this thesis assumes the definition given in H.264/AVC standard. 
Chapter 2 reviews the background of digital video compression
and explains the role of motion estimation and the underlying
algorithms. A number of commonly used motion estimation
algorithms are introduced in detail. A comparison between
different algorithms is made in terms of efficiency,
performance and data dependency. Lastly, the newly added features in
H.264/AVC are presented and their effects are analyzed.
Chapter 3 contains background associated with computer 
arithmetic for motion estimation. Number systems and cor-
responding algorithms for addition, absolute difference, multi-
operand addition and comparison are presented using both bit-
parallel and bit-serial (MSB-first or LSB-first) approaches. 
Chapter 4 presents various VLSI architectures for variable 
block size motion estimation. Concepts including systolic ar-
rays and tree architectures are described for bit-parallel systems. 
Next, the implementation of bit-serial system is presented in a 
top-down manner. Lastly, metrics of different architectures are 
analyzed and capabilities of these architectures are presented. 
Chapter 5 presents the results of this thesis. Several compar-
isons have been made on different architectures and technologies. 
Lastly, conclusions are drawn in chapter 6. 
Chapter 2 
Digital video compression 
2.1 Introduction 
Compression is the process of compacting data into a smaller size 
in terms of number of bytes in digital media. Text files, pictures, 
voice, and in fact any data that contains redundancy can be 
made smaller by employing compression. Since an uncompressed 
video scene can occupy a large amount of storage space, video 
compression has an important role in the digital world. 
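To give a sense of scale, the raw data rate of an uncompressed sequence can be estimated as sketched below. This is an illustration only: the function name `raw_video_rate` is hypothetical, and the figures assume CIF resolution (352x288 pixels), 30 frames per second and 8-bit 4:2:0 sampling, i.e. 1.5 bytes per pixel on average.

```python
def raw_video_rate(width, height, fps, bytes_per_pixel=1.5):
    """Return the uncompressed data rate in bytes per second.

    bytes_per_pixel=1.5 assumes 8-bit 4:2:0 sampling (an assumption made
    for this illustration): one luminance byte per pixel plus two
    chrominance planes, each at a quarter of the luminance resolution.
    """
    return width * height * bytes_per_pixel * fps


# CIF (352x288) at 30 frames per second.
cif_rate = raw_video_rate(352, 288, 30)
print(cif_rate)            # bytes per second
print(cif_rate * 8 / 1e6)  # megabits per second
```

Under these assumptions the raw stream needs roughly 4.6 megabytes per second (about 36 Mbit/s), which indicates why compression is needed before storage or transmission.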
Every compression system involves complementary units, an 
encoder (compression unit) and a decoder (decompression unit). 
The encoder exploits the redundancy among the given data and 
converts it to a compressed data stream. The decoder interprets 
the compressed data stream and restores it into the original 
format. To ensure the compressed size is satisfactorily small, the
restored data may not be exactly the same as the original data,
hence some details may be lost. These kinds of compression
systems are also called lossy compression systems.
The compression system must be well defined and the com-
pressed data stream format known in both the encoder and de-
coder. The encode/decode pair is often described as a codec 
(coder/decoder). MPEG-1, MPEG-2, H.263 [3], H.264, etc. are
codecs defined by standardization bodies. Standardization bodies
include the Moving Picture Experts Group (MPEG), a group of
CHAPTER 2. DIGITAL VIDEO COMPRESSION 9 
the International Organization for Standardization; the International
Telecommunication Union (ITU-T); the Video Coding Experts Group
(VCEG), etc.
Video compression exploits three kinds of data redundancy 
within a video scene. In a standard lossy video compression scheme such
as H.264, the temporal redundancy between adjacent frames is 
the most important redundancy in video-type data and a large 
proportion of data dependency can be reduced through motion 
estimation. Spatial redundancy, which can be exploited within 
a frame, is reduced via transform techniques. It also consti-
tutes a significant portion of redundancy since there is usually a 
high correlation between neighbouring pixels. Lastly, statistical
redundancy, which must occur in any kind of data source, is 
reduced by an entropy coder in the last stage of the encoder. 
2.2 Fundamentals of lossy video compression 
In digital video, lossy compression is often employed to en-
sure good compression performance. Although the quality is 
inevitably degraded, there is minimal impact on perceived qual-
ity since only the high-frequency component is eliminated and 
human eyes are not sensitive to these components. The video en-
coding system consists of two kinds of compression units, namely 
lossy and lossless. Lossless data compression is a class of data 
compression algorithms in which the exact original data can be 
reconstructed. In contrast, lossy data compression must intro-
duce some data loss during compression. The lossy compres-
sion unit contains temporal and spatial redundancy compres-
sors. The statistical compressor in the last stage is a lossless 
one. The video coder contains five stages in total [42], handling 
different kinds of compression. In the following subsections, each
compression unit is discussed. Figure 2.1 shows the block dia-
gram of a hybrid video encoder with the three compressor types 
described [42]. 
2.2.1 Video compression and human visual systems 
A video scene consists of multiple objects each with their shapes, 
textures, illuminations, colors, etc. The motion, color and bright-
ness of a scene are interpreted by the human visual systems. 
Human eyes are less sensitive to some aspects than to others. For
example, human eyes cannot detect blue as well as green. In dark
environments, only black and white can be detected. Moreover,
the image formed on the retina remains for 10 to 30 milliseconds.
As a result, the human eye cannot detect the difference between 60
fps and 120 fps video. Human vision also cannot perceive very
detailed objects, i.e. high resolution objects. The objective of
video compression is to exploit these properties of the human 
visual system, maximizing compression efficiency while keeping 
the impact of objective quality loss to a minimum. Since the 
non-sensitive components are high frequency components, e.g. 
color, frame rate, their removal results in high compression ra-
tios. 
2.2.2 Representation of color 
We need at least three basic colors, red, blue and green, to
display a true color picture. As a result, each pixel consists of
at least three distinct channels of information.
Common color representation methods include RGB and
YUV [42]. RGB is a representation that includes three basic
colors together with brightness information. It is a common
method for monitor displays. YUV is a representation
which divides the color space into luminance (brightness) and
chrominance (color). The three original colors can be derived from
YUV data. Since humans are less sensitive to color than
luminance, color information can be suppressed compared to
luminance data without significant quality loss. For example, in
4:2:0 YUV, we have four times as much luminance information as
chrominance information. Other representations, such as 4:2:2 and
4:4:4, differ in their chrominance sampling frequencies. In video
compression, YUV representation is a better method to specify a
pixel since it more closely tracks human perception and can enhance
compression efficiency.
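The relationship between RGB, YUV and 4:2:0 subsampling can be sketched as follows. This is a minimal illustration rather than the conversion used in any particular codec: the weights are the commonly quoted ITU-R BT.601 values (an assumption, since the text does not specify a conversion matrix), the 2x2 averaging filter is one simple choice among several, and the function names are hypothetical.

```python
def rgb_to_yuv(r, g, b):
    """Convert one RGB pixel to YUV using BT.601-style weights (assumed)."""
    y = 0.299 * r + 0.587 * g + 0.114 * b  # luminance (brightness)
    u = 0.492 * (b - y)                    # blue-difference chrominance
    v = 0.877 * (r - y)                    # red-difference chrominance
    return y, u, v


def subsample_420(plane):
    """Halve a chrominance plane in both directions by 2x2 averaging.

    `plane` is a list of rows with even width and even height.
    """
    out = []
    for i in range(0, len(plane), 2):
        row = []
        for j in range(0, len(plane[0]), 2):
            total = (plane[i][j] + plane[i][j + 1]
                     + plane[i + 1][j] + plane[i + 1][j + 1])
            row.append(total / 4)
        out.append(row)
    return out


def sample_counts_420(width, height):
    """Return (luminance samples, samples per chrominance channel)."""
    return width * height, (width // 2) * (height // 2)


y, u, v = rgb_to_yuv(128, 128, 128)      # a gray pixel carries no chrominance
small = subsample_420([[10, 20, 30, 40],
                       [10, 20, 30, 40]])
luma, chroma_each = sample_counts_420(352, 288)
print(small)               # [[15.0, 35.0]]
print(luma // chroma_each) # 4
```

For a CIF frame this gives 101376 luminance samples against 25344 samples in each chrominance channel, which is the factor of four mentioned above.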
2.2.3 Sampling methods - frames and fields 
A video can be sampled as frames or fields. These are called
progressive sampling and interlaced sampling respectively. For each
time instance, a frame consists of both odd-numbered lines and
even-numbered lines forming a picture. A field consists of either
the odd-numbered lines or the even-numbered lines, alternately. As a
field occupies half the data of a frame, fields can be sampled at
twice the rate of frames. The advantage of interlaced sampling
is that it gives the viewer smoother motion while, at the same time,
the data rate is reduced. The downside is the introduction of
interlacing artifacts when displayed on a progressive scan monitor.
Both sampling methods are employed in video compression and
the best choice depends on the application.
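The split of a frame into two fields can be sketched as below. This is an illustration under stated assumptions: lines are numbered from 1 (so list index 0 holds line 1), the frame has an even number of lines, and the function names are hypothetical.

```python
def split_into_fields(frame):
    """Split a frame (list of lines) into its odd and even fields.

    Lines are numbered from 1, so index 0 is the first odd-numbered line.
    """
    odd_field = frame[0::2]   # lines 1, 3, 5, ...
    even_field = frame[1::2]  # lines 2, 4, 6, ...
    return odd_field, even_field


def weave(odd_field, even_field):
    """Interleave two fields of equal height back into a full frame."""
    frame = []
    for odd_line, even_line in zip(odd_field, even_field):
        frame.append(odd_line)
        frame.append(even_line)
    return frame


frame = [[1, 1], [2, 2], [3, 3], [4, 4]]
odd_field, even_field = split_into_fields(frame)
print(odd_field)   # [[1, 1], [3, 3]]
print(even_field)  # [[2, 2], [4, 4]]
```

Each field holds half the lines of the frame. Weaving the two fields restores the frame exactly only when both were captured at the same time instance; when they are captured at different instants, moving objects produce the interlacing artifacts described above.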
2.2.4 Compression methods 
Since a video scene is rectangular, block-based coding is a suitable
choice of processing element. However, other compression
methods also exist. One example is wavelet coding, which is
employed for image encoding in JPEG2000 [19]. A three-dimensional
version of wavelet coding for video compression has been
suggested [15]; it gives better visual quality, but the computational
complexity is much higher. Another method is arbitrary-shaped
coding, which is employed in MPEG-4 [18]. This is based
on different moving objects whose motion is combined to form
a complete frame. The performance is improved by accurate
prediction in motion estimation. As a tradeoff, it has increased
complexity.
[Figure 2.1 block diagram. Labels recoverable from the extraction: Input Video Signal, Coder Control, Control Data, Split into Macroblocks (16x16 pixels), Decoder, Entropy Coding, De-blocking Filter, Intra-frame Prediction, Output Video Signal, Motion Compensation, Motion Data, Motion Estimation]
Figure 2.1: Hybrid video coder for H.264/AVC
2.2.5 Motion estimation 
The motion estimation unit, shown in figure 2.1, is the first
stage. The uncompressed video sequence input undergoes temporal
redundancy reduction by exploiting similarities between
neighbouring video frames. Temporal redundancy arises since
two successive frames are usually similar, especially at high
frame rates, because the objects in the scene can only make
small displacements. With motion estimation, the difference
between successive frames can be made smaller still, since the
motion-compensated prediction is more similar to the frame being
coded. Compression is achieved by
predicting the next frame relative to the original frame. The
predicted data are the residue between the current and reference
pictures, and a set of motion vectors which represent the
predicted motion direction. The process of finding the motion
vector is optimal or suboptimal depending on the block matching
algorithm chosen. Since the correlation between successive
frames is inherently very high, the compression in this stage
has a large impact on the overall performance of the whole system.
The motion predicted frames are usually called P-frames
(Predicted frames). The other type of predicted frame is called
a B-frame (Bi-predicted frame). In this case the frame is predicted
from two or more previously decoded reference frames.
2.2.6 Motion compensation 
The motion compensation unit constructs a compensated frame, 
also called a residue frame, from the original frame and motion 
vectors. It then calculates the residue between the compensated
frame and the reference frame. It is often employed together
with the motion estimation unit to reduce the temporal redundancy of
a video sequence. On the decoder side, the compensation unit
acts as a reconstruction engine that combines the residue and 
motion vector to form the original frame. The frame is divided 
into subblocks so that the engine will act on each subblock se-
quentially until the whole picture is constructed. 
2.2.7 Transform 
The transform unit [32] reduces spatial redundancy within a pic-
ture. Its input is the residue picture calculated by the motion 
estimation unit. Since the residue picture has high correlation 
between neighbouring pixels, the transformed data is easier to 
compress than the original residue data since the energy of the 
transformed data is localized. The transformed data are called 
transform coefficients and they are passed to the quantization 
unit. The transformation can be done by many methods, including
the Cosine Transform, Integer Transform, Karhunen-Loeve
Transform etc. Details of these can be found in [41].
2.2.8 Quantization 
The quantization unit is the only lossy compression unit in the 
system. It serves to eliminate high frequency transform coeffi-
cients so that the quantized transform coefficients are more eas-
ily compressed. The elimination of high frequencies is justified 
because of the insensitivity of human vision to high frequency 
components. Subjectively, the quality of a video scene after 
quantization will not be significantly degraded if the bit-rate is 
not highly constrained. In each video coding standard, there
exists a defined set of quantization parameters for providing the
best compression-to-quality ratios for different applications.
2.2.9 Entropy Encoding 
The entropy encoding unit is the last stage in a video compres-
sion system. In this stage, mostly statistical redundancy remains 
in the data. The motion vectors output by the motion estima-
tion unit and quantized transform coefficients from the trans-
form unit are accepted in this stage to produce the compressed 
bit stream that can be transmitted or stored. Typically, there 
are two kinds of entropy coder. The first is a variable-length
coder [4] in which the statistical information is defined in advance.
The second is an arithmetic coder [33] in which the statistical
information is determined online. Most modern entropy coders are
content-adaptive [33], so the compressed data can be optimized
adaptively, independent of the nature of the video scene.
2.2.10 Intra-prediction unit 
The intra-prediction unit is activated when the difference between
consecutive frames is too large, as occurs in a scene change
or very fast moving pictures. In this case the frame is predicted
by predefined block patterns instead of motion estimation
and compensation. When such effects occur, the output bit stream
is usually smaller than it would be with motion-compensated
prediction. H.264/AVC supports 13 prediction
patterns in its intra-prediction unit [42]. The frames coded by
the intra-prediction unit are usually called I-frames (Intra-predicted
frames).
Figure 2.2: Improvement made by deblocking filter - Left: improved
2.2.11 Deblocking filter 
Since most video coders employ block-based motion compensation,
blocking artifacts may be visible when the scene is
reconstructed. The lower the bit-rate, the more pronounced the
blocking effect. To reduce this blocking artifact, a deblocking
filter is included within the encoding loop. It employs adaptive
filtering techniques so that edges are correctly filtered. Including
the deblocking filter improves the visual quality in terms of
both objective (PSNR, section 2.3.5) and subjective (human vision)
judgment. The effect of the deblocking filter is illustrated in
figure 2.2.
Compression stage             Proportion
Motion vector search          67.31%
Mode selection                 8.19%
Rate distortion opt.           3.37%
Transform and quantization     6.95%
Entropy coder                  6.19%
Deblocking filter              0.03%
Others                         7.96%
Table 2.1: Complexity profile of each compression stage in H.264/AVC
2.2.12 Complexity analysis of different compression
stages
Each unit in the video coder contributes additional computational
complexity to the overall system. Among all compression
units, the motion estimation unit occupies most of the computation
resources. In a software implementation, this is more than
65 percent of the total computation time. The transform unit,
entropy coder and deblocking filter add up to roughly 13 percent. The
remaining 20 percent is due to mode selection and other overheads.
Thus, the motion estimation unit is the stage that stands to
benefit most from hardware acceleration. Table 2.1 shows the profiling of
H.264/AVC encoding on a Pentium-III platform by [8].
2.3 Motion estimation process 
2.3.1 Block-based matching method 
Block-based matching is the most widely used motion
estimation method for video coding since pictures are normally
rectangular in shape and block division can be easily done. Usually,
standards bodies, e.g. MPEG, define the standard block
sizes for motion estimation. This can be 16 by 16, 8 by 8, etc.,
depending on the target application of the video codec. In the
latest codec standards such as MPEG-4 or H.264/AVC, variable
block sizes are supported which can be 4 by 4, 8 by 8 and 16
by 16. The goal of motion estimation is to predict the next
frame from the current frame by associating motion vectors
with picture macroblocks as accurately as possible. The block size
determines the quality of prediction [12] and thus the accuracy.
Figure 2.3 shows the distribution of block sizes within a picture.
It is easy to see that the detailed region is associated with small
blocks whereas the large uniform region is associated with large
blocks.
Figure 2.3: Selection of block sizes within a frame
2.3.2 Motion estimation procedure 
After motion estimation, a picture residue and a set of motion 
vectors are produced. The following procedure is executed for 
each block (16x16, 8x8 or 4x4) in the current frame. 
1. In the reference frame, a search area is defined for each
block in the current frame. The search area is typically
sized at 2 to 3 times the macroblock size (16x16). Using the
fact that the motion between consecutive frames is statistically
small, the search range is confined to this area. After
the search process, a "best" match will be found within
the area. The "best" match usually means the candidate with the
lowest energy in the residual formed by subtracting the
candidate block in the search region from the current block
located in the current frame. The process of finding the best match
block by block is called block-based motion estimation.
2. When the best match is found, the motion vectors and
residues between the current block and reference block are
computed. The process of getting the residues and motion
vectors is known as motion compensation.
3. The residues and motion vectors of the best match are encoded
by the transform unit and entropy unit and transmitted to
the decoder side.
4. At the decoder side, the process is reversed to reconstruct the
original picture.
Figure 2.4 shows an illustration of the above procedure. In
modern video coding standards, the reference frame can be a
previous frame, a future frame or a combination of two or more
previously coded frames. The number of reference frames needed
depends on the required accuracy. The more reference frames
the current block can refer to, the more accurate the prediction is.
Figure 2.4: Motion estimation and motion vector
2.3.3 Matching Criteria 
Block-based motion estimation obtains the best match by minimizing
a cost function. Various cost functions have been proposed
and analyzed in the literature, varying in complexity and
efficiency. In this section, the mean absolute difference (MAD),
mean square error (MSE), sum of absolute difference (SAD) and
sum of absolute transformed difference (SATD) block matching
criteria are explained. $I_n$ and $I_{n-1}$ in the formulae below
represent the macroblock in the current and reference frame respectively
(the subscript denotes the frame number).
$m$ and $n$ are the components of the candidate motion vector
(the search location) and $N$ is the
block size. $k$ and $l$ represent the index of the macroblock.
Mean absolute difference (MAD)
The MAD cost function [43] is defined as:
$$\mathrm{MAD}(k,l;m,n) = \frac{1}{N^2} \sum_{i=0}^{N-1} \sum_{j=0}^{N-1} \left| I_n(k+i,\,l+j) - I_{n-1}(k+i+m,\,l+j+n) \right| \tag{2.1}$$
The advantage of the MAD cost function is its simplicity and ease
of implementation in hardware. Unfortunately, MAD tends
to overemphasize small differences, giving an inferior result to
MSE.
Mean square error (MSE)
MSE [47] is a cost function measuring the energy remaining in
the difference block. The MSE cost function is defined as:
$$\mathrm{MSE}(k,l;m,n) = \frac{1}{N^2} \sum_{i=0}^{N-1} \sum_{j=0}^{N-1} \left( I_n(k+i,\,l+j) - I_{n-1}(k+i+m,\,l+j+n) \right)^2 \tag{2.2}$$
The advantage of the MSE cost function is its accuracy, but its
complexity is high for both software and hardware implementations.
Sum of absolute difference (SAD)
SAD [43] is the most common matching criterion chosen in video
coding because of its low complexity, good performance and ease
of hardware implementation. The SAD cost function is defined
as:
$$\mathrm{SAD}(k,l;m,n) = \sum_{i=0}^{N-1} \sum_{j=0}^{N-1} \left| I_n(k+i,\,l+j) - I_{n-1}(k+i+m,\,l+j+n) \right| \tag{2.3}$$
The only difference between SAD and MAD is that SAD takes
the sum over all pixels while MAD takes the average per
pixel. Since the block size is the same for every candidate being
compared, the division by the number of pixels does not change which
candidate is the minimum. A divide operation is
saved and the overall computation is simplified.
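The three pixel-domain criteria in equations 2.1 to 2.3 can be sketched in a few lines of Python. This is an illustrative sketch only, not code from the thesis: frames are assumed to be lists of rows of integer pixel values, the first index of I is taken to be the row, and the function names are hypothetical.

```python
def _diffs(cur, ref, k, l, m, n, N):
    """Yield I_n(k+i, l+j) - I_{n-1}(k+i+m, l+j+n) for an N x N block.
    Assumption: frames are lists of rows and the first index is the row."""
    for i in range(N):
        for j in range(N):
            yield cur[k + i][l + j] - ref[k + i + m][l + j + n]


def sad(cur, ref, k, l, m, n, N):
    """Sum of absolute differences, equation (2.3)."""
    return sum(abs(d) for d in _diffs(cur, ref, k, l, m, n, N))


def mad(cur, ref, k, l, m, n, N):
    """Mean absolute difference, equation (2.1): SAD divided by N^2."""
    return sad(cur, ref, k, l, m, n, N) / (N * N)


def mse(cur, ref, k, l, m, n, N):
    """Mean square error, equation (2.2)."""
    return sum(d * d for d in _diffs(cur, ref, k, l, m, n, N)) / (N * N)


# Toy 4x4 frames: a bright 2x2 patch sits one pixel further down and
# to the right in the reference frame than in the current frame.
cur = [[10, 10, 10, 10],
       [10, 20, 20, 10],
       [10, 20, 20, 10],
       [10, 10, 10, 10]]
ref = [[10, 10, 10, 10],
       [10, 10, 10, 10],
       [10, 10, 20, 20],
       [10, 10, 20, 20]]
```

For the 2x2 block at (1, 1), SAD and MAD rank the two candidates (0, 0) and (1, 1) identically, which is why the division in MAD can be dropped.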
Sum of absolute transformed difference (SATD)
SATD is another way to compute the residues between two
blocks, where the pixel values are pre-transformed by the Hadamard
Transform [42]. Since the transformed coefficients are closer to the
final bit stream, it offers better matching accuracy than SAD.
The SATD cost function is defined as:
$$\mathrm{SATD}(k,l;m,n) = \sum_{i=0}^{N-1} \sum_{j=0}^{N-1} \left| T_n(k+i,\,l+j) - T_{n-1}(k+i+m,\,l+j+n) \right| \tag{2.4}$$
In the above equation, $T_n$ and $T_{n-1}$ represent the transformed
coefficients of each block in the current and reference frames
respectively. Although it offers a significant improvement in prediction
quality, transform hardware must be added within the motion
estimation loop and hardware complexity is increased. In
addition, the latency and thus the performance of motion
estimation will be degraded due to the added transform hardware.
This technique is applied in H.264/AVC when performing
quarter pixel accuracy motion estimation. The high accuracy is
needed since it is the final stage of motion estimation [42].
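A minimal sketch of equation 2.4 for a single 4x4 block is given below. It assumes an unnormalized 4x4 Hadamard matrix and applies the transform to each block before differencing, as the equation is written; because the transform is linear, transforming the difference block gives the same result. The names are hypothetical and the scaling used by any particular encoder is not reproduced here.

```python
# Unnormalized 4x4 Hadamard matrix (symmetric, so H4 equals its transpose).
H4 = [[1, 1, 1, 1],
      [1, 1, -1, -1],
      [1, -1, -1, 1],
      [1, -1, 1, -1]]


def matmul(a, b):
    """Plain matrix product of two lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


def hadamard4(block):
    """2-D Hadamard transform of a 4x4 block: H4 * block * H4."""
    return matmul(matmul(H4, block), H4)


def satd4(cur_block, ref_block):
    """Sum of absolute differences between transformed coefficients."""
    t_cur = hadamard4(cur_block)
    t_ref = hadamard4(ref_block)
    return sum(abs(t_cur[i][j] - t_ref[i][j])
               for i in range(4) for j in range(4))


flat12 = [[12] * 4 for _ in range(4)]
flat10 = [[10] * 4 for _ in range(4)]
```

With these two flat blocks the whole difference lands in the single DC coefficient, which illustrates how the transform concentrates the energy of a smooth residue.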
2.3.4 Motion vectors 
To represent the motion of each block, a motion vector is defined
as the relative displacement between the current candidate
block and the best matching block within the search window in
the reference frame. It is a directional pair representing the
displacement in the horizontal (x-axis) direction and the vertical
(y-axis) direction. The maximum value of the motion vector is determined
by the search range. The larger the search range, the more
bits are needed to code the motion vector. Designers need to make
tradeoffs between these two conflicting parameters. The motion
vector is illustrated in figure 2.4.
Traditionally one motion vector is produced for each macroblock
in the frame. MPEG-1 and MPEG-2 employ this approach.
Since the introduction of variable block size motion
estimation in MPEG-4 and H.264/AVC, one macroblock can
produce more than one motion vector due to the existence of
different kinds of subblocks. In H.264, 41 motion vectors are
produced [42] for one macroblock and they are passed to
rate-distortion optimization to choose the best combination. This is
known as mode selection.
2.3.5 Quality judgment 
The quality of a video scene can be determined using both objective
and subjective approaches. The most widely used objective
measure is the peak-signal-to-noise-ratio (PSNR) [42] which is
defined as:
$$\mathrm{PSNR} = 10 \log_{10} \left( \frac{255^2}{\mathrm{MSE}} \right) \tag{2.5}$$
where the MSE is the mean square error between the decoded frame
and the original frame (refer to section 2.3.3 for the exact
formula). The peak value is 255 since the pixel value is 8 bits in
size.
The higher the PSNR, the higher the quality of the encoding.
PSNR and bit-rate are usually conflicting goals, and the most
appropriate operating point is determined by the application. Although
PSNR can objectively represent the quality of coding, it does
not equal the subjective quality. Subjective quality is determined
by a number of human testers and a conclusion is drawn
based on their opinions. There exist cases for which high PSNR
results in low subjective quality [42]. However, in most cases,
PSNR provides a good approximation to the subjective measure
and we use this measure in the rest of the thesis.
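Equation 2.5 can be evaluated as in the following sketch, which assumes 8-bit frames stored as lists of rows. The function name and the choice to return infinity for identical frames are assumptions made for illustration.

```python
import math


def psnr(original, decoded, peak=255):
    """PSNR in dB between two equally sized frames (lists of rows)."""
    squared_error = 0
    count = 0
    for row_o, row_d in zip(original, decoded):
        for p_o, p_d in zip(row_o, row_d):
            squared_error += (p_o - p_d) ** 2
            count += 1
    mse = squared_error / count
    if mse == 0:
        return math.inf  # identical frames: PSNR is unbounded
    return 10 * math.log10(peak ** 2 / mse)


original = [[100, 101], [102, 103]]
off_by_one = [[101, 102], [103, 104]]  # every pixel differs by 1, so MSE = 1
```

An error of one grey level in every pixel already corresponds to a PSNR of about 48 dB, which gives a feel for the scale of the measure.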
2.4 Block-based matching algorithms for motion
estimation
Block-based matching algorithms are procedures for finding the
minimum-cost motion vector in the motion estimation process. Different
algorithms can give different motion vectors. Among
all algorithms proposed, only full search gives the optimal result
within the search range. Other algorithms give near-optimal
results at significantly lower complexity by reducing the
number of search points. In this section, we describe several
block-matching algorithms which are commonly used in modern
coding standards.
2.4.1 Full search (FS) 
The full search algorithm finds the globally optimal motion vector
among all candidate blocks within the search window. If
our search window is 48 by 48 pixels and the block size is 16 by
16, there will be 32 * 32 = 1024 candidates that need to undergo
SAD computation. FS exhaustively evaluates all candidates to find
the motion vector with minimum cost, and it is the algorithm with
the simplest data flow and control. It is suitable for hardware
implementation since the high computational complexity can be
overcome by parallelism and the high bandwidth can be overcome
by systolic architectures.
For software implementations, full search is often ignored [8]
as contemporary microprocessors cannot handle full search with
acceptable performance, especially for real time applications.
The number of search locations to be examined by full search
is directly proportional to the square of the search range $r$. The
number of search locations in the search area is $(2r + 1)^2$. So the
algorithmic complexity of full search is $O(r^2)$.
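The exhaustive search described above can be sketched as follows. This is an illustrative Python sketch under stated assumptions: frames are lists of rows, SAD is the matching criterion, candidates that would fall outside the reference frame are skipped, and the function names are hypothetical.

```python
def sad(cur, ref, k, l, m, n, N):
    """SAD between the N x N block at row k, column l of `cur` and the
    block displaced by (m, n) in `ref`."""
    return sum(abs(cur[k + i][l + j] - ref[k + i + m][l + j + n])
               for i in range(N) for j in range(N))


def full_search(cur, ref, k, l, N, r):
    """Return (best_sad, m, n) over all displacements within +/- r."""
    rows, cols = len(ref), len(ref[0])
    best = None
    for m in range(-r, r + 1):
        for n in range(-r, r + 1):
            # Skip candidates that fall outside the reference frame.
            if not (0 <= k + m <= rows - N and 0 <= l + n <= cols - N):
                continue
            cost = sad(cur, ref, k, l, m, n, N)
            if best is None or cost < best[0]:
                best = (cost, m, n)
    return best


# Toy 8x8 frames: a 2x2 object sits at (row 3, col 4) in the reference
# frame and at (row 2, col 2) in the current frame.
ref = [[0] * 8 for _ in range(8)]
cur = [[0] * 8 for _ in range(8)]
for i, row in enumerate([[50, 60], [70, 80]]):
    for j, value in enumerate(row):
        ref[3 + i][4 + j] = value
        cur[2 + i][2 + j] = value

print(full_search(cur, ref, 2, 2, 2, 2))  # (0, 1, 2)
```

The two nested loops over m and n are independent of each other, which is the property that makes the algorithm easy to parallelize in hardware.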
2.4.2 Three-step search (TSS) 
The three step search algorithm was proposed by Koga [23]
and implemented by Lee et al [27]. This algorithm makes the
assumption that the residue value increases monotonically with
distance from the absolute minimum point within the search area. The
algorithm searches for the direction of the greatest decrease in
residue and from there continues to find the minimum point.
In the first step, TSS compares nine search points, the center
point and eight points around it, with step size $p$ equal to or
larger than half of the maximum search range $r$. Among the 9 search
points, a minimum is selected and becomes the center of the next step.
Next, the step size is halved and 8 new search points (excluding
the center) are searched and again a minimum is selected.
The step size is halved again and the search continues until the step
size is equal to one. The minimum search point is found at this
stage.
From the above described procedure, it can be observed that
TSS constantly divides the search step size by two and
is therefore a logarithmic search. The total number of search
points is $1 + 8\log_2(p)$. Except for the first step, 8 search
points are calculated in each iteration and the algorithmic
complexity for TSS is $O(8\log_2(p)) = O(\log(r/2))$ where $p$ is the
initial step size and typically equals $r/2$.
This algorithm has the advantages of much lower complexity than
FS in terms of the number of candidates to evaluate, and of efficient
implementations in both software and hardware. Even for software
implementations, this algorithm [25] can offer real time encoding.
It has the drawback of degraded quality since it can easily
be trapped in a local minimum. Also, its large initial step size
can lead to poor results. Furthermore, its data dependencies
restrict parallelism in hardware implementations.
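The procedure above can be sketched in Python as follows. To keep the sketch short, the matching criterion is passed in as a hypothetical callable `cost(m, n)`; in an encoder this would be the SAD of the current block against the candidate displaced by (m, n). Search range clipping is omitted, and the function name is an assumption.

```python
def three_step_search(cost, step):
    """Logarithmic search around (0, 0).
    cost: callable taking a displacement (m, n) and returning its cost.
    step: initial step size, halved after every stage until it passes 1.
    Returns (best_cost, m, n, number_of_evaluated_points)."""
    best_m, best_n = 0, 0
    best_cost = cost(0, 0)
    evaluated = 1
    while step >= 1:
        centre_m, centre_n = best_m, best_n
        # Evaluate the 8 neighbours of the current centre at this step size.
        for dm in (-step, 0, step):
            for dn in (-step, 0, step):
                if dm == 0 and dn == 0:
                    continue
                c = cost(centre_m + dm, centre_n + dn)
                evaluated += 1
                if c < best_cost:
                    best_cost = c
                    best_m, best_n = centre_m + dm, centre_n + dn
        step //= 2
    return best_cost, best_m, best_n, evaluated


def bowl(m, n):
    """Hypothetical smooth cost surface with its minimum at (3, -2)."""
    return (m - 3) ** 2 + (n + 2) ** 2
```

Each stage evaluates 8 new points after the initial center, so an initial step of 8 (stages 8, 4, 2, 1) costs 1 + 8 * 4 = 33 evaluations, the TSS figure used in Table 2.2. On a surface with several local minima the same code can converge to the wrong point, which is the quality drawback noted above.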
2.4.3 Two-dimensional Logarithmic Search Algorithm 
(2D-log search) 
Two-dimensional logarithmic search is another logarithmic search,
proposed by Jain and Jain [21]. It has fewer search points than
TSS but its prediction is more accurate. It also defines a step
size at the beginning and terminates when the step size equals
one.
The algorithm begins by calculating the SAD at the center
of the search area and at another four points ±p pixels away from
the center in the horizontal and vertical directions. If the minimum
SAD is located at the center, the step size is halved. Otherwise,
one of the 4 search points will become the center and another 4
search points ±p pixels away from the new center will be calculated.
In this case the step size is kept until the minimum SAD
is located at the center. When the step size is reduced to 1,
instead of four search points, the nine search points surrounding
the current center are searched and a minimum point is found.
The complexity of 2D-log search is similar to TSS and equals
$O(\log(r/2))$. The difference is that the Big-O constant is
lower than that of TSS.
This algorithm has the advantage of better prediction quality
than TSS, with some additional control overhead introduced.
Again, its data dependencies do not favor fully parallel hardware
implementations.
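A compact sketch of the 2D-log procedure is shown below. As an assumption for illustration, the matching criterion is a hypothetical callable `cost(m, n)` (for example a SAD evaluation), candidates are limited to ±r, and ties are resolved in favor of the center; none of these details are taken from the thesis.

```python
def two_d_log_search(cost, step, r):
    """2-D logarithmic search around (0, 0) within +/- r.
    Returns (best_cost, m, n, number_of_distinct_points_evaluated)."""
    cache = {}

    def c(p):
        # Each search point is evaluated at most once.
        if p not in cache:
            cache[p] = cost(*p)
        return cache[p]

    def in_range(p):
        return abs(p[0]) <= r and abs(p[1]) <= r

    centre = (0, 0)
    while step > 1:
        cm, cn = centre
        cross = [centre, (cm - step, cn), (cm + step, cn),
                 (cm, cn - step), (cm, cn + step)]
        best = min((p for p in cross if in_range(p)), key=c)
        if best == centre:
            step //= 2      # minimum at the centre: halve the step size
        else:
            centre = best   # otherwise move the centre and keep the step
    # Final stage: the nine points surrounding the current centre.
    cm, cn = centre
    square = [(cm + dm, cn + dn) for dm in (-1, 0, 1) for dn in (-1, 0, 1)]
    best = min((p for p in square if in_range(p)), key=c)
    return c(best), best[0], best[1], len(cache)


def bowl(m, n):
    """Hypothetical smooth cost surface with its minimum at (3, -2)."""
    return (m - 3) ** 2 + (n + 2) ** 2
```

Because the next center depends on the result of the current stage, the stages cannot be evaluated concurrently, which is the data dependency mentioned above.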
2.4.4 Diamond Search (DS) 
The diamond search algorithm was proposed by Zhu and Ma
in 1997 [56, 57] and, as its name suggests, employs diamond
search patterns. It is commonly employed for fast searches since
it provides the best quality to complexity ratio. It makes use
of two diamond shapes for searching: a large diamond with 9
search points and a small diamond with 5 points.
This algorithm initially employs the large diamond shape
and searches for the minimum SAD location. If the
minimum location is not at the center, the center is reset
to the minimum location just found and the search is continued
using the large diamond pattern until the minimum value is located
at the center. Then the algorithm switches to the small diamond
pattern and the minimum point is found in this final stage.
The complexity of this algorithm is lower than that of all the
logarithmic searches described above. Its complexity is of the order
of $O(\log(r/2))$. The average number of search points in DS is less
than 20 for a normal scene [49].
DS gives better prediction quality and lower complexity than
TSS and 2D-log search. It is the best choice for software
implementations. For hardware implementation, the two diamond
sizes add a small but acceptable complexity to the control circuits.
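The two-pattern procedure can be sketched as below. The pattern offsets are the commonly used large (9-point) and small (5-point) diamonds; the callable `cost(m, n)`, the ±r clipping and the tie-breaking in favor of the center are assumptions made for this illustration.

```python
# Offsets (dm, dn) of the two search patterns; the centre comes first so
# that ties are resolved in its favour.
LARGE_DIAMOND = [(0, 0), (-2, 0), (2, 0), (0, -2), (0, 2),
                 (-1, -1), (-1, 1), (1, -1), (1, 1)]
SMALL_DIAMOND = [(0, 0), (-1, 0), (1, 0), (0, -1), (0, 1)]


def diamond_search(cost, r):
    """Diamond search around (0, 0) within +/- r.
    Returns (best_cost, m, n, number_of_distinct_points_evaluated)."""
    cache = {}

    def c(p):
        if p not in cache:
            cache[p] = cost(*p)
        return cache[p]

    def best_of(centre, pattern):
        points = [(centre[0] + dm, centre[1] + dn) for dm, dn in pattern]
        return min((p for p in points
                    if abs(p[0]) <= r and abs(p[1]) <= r), key=c)

    centre = (0, 0)
    while True:
        best = best_of(centre, LARGE_DIAMOND)
        if best == centre:
            break           # minimum at the centre: switch patterns
        centre = best       # otherwise recentre the large diamond
    best = best_of(centre, SMALL_DIAMOND)
    return c(best), best[0], best[1], len(cache)


def bowl(m, n):
    """Hypothetical smooth cost surface with its minimum at (3, -2)."""
    return (m - 3) ** 2 + (n + 2) ** 2
```

Caching the evaluated points matters here because consecutive large diamonds overlap, so only the newly exposed points need a fresh cost evaluation.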
2.4.5 Fast full search (FFS) 
In the full search algorithm, the SAD is computed for all candidate
positions. When a smaller value is found, it is recorded
as the current minimum SAD. It is possible to speed up the
rejection of incorrect candidates, without any quality loss, via
mathematical techniques such as the successive elimination
algorithm (SEA) [29] and the progressive norm successive
algorithm (PNSA) [37]. Partial distortion elimination (PDE) [13] is a
method of early comparison that compares the partial SAD with the
current minimum SAD. The speedup is possible because these
techniques employ other matching criteria, e.g. SEA, PNSA and
partial SAD, which are easier to calculate than the full SAD, so that
early elimination of impossible candidates can be achieved.
The algorithms suggested above focus on mathematical optimizations.
In hardware implementations, arithmetic optimizations
can also improve the efficiency of full search. For example,
in a bit serial implementation of SAD, computation can be saved
by employing early termination in the comparison stage. The
early termination is made possible by the fact that comparison is a
most-significant-bit-first process. In the next chapter we will
explain the early termination technique in detail.
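The partial distortion elimination idea can be illustrated with a short Python sketch. It assumes the partial SAD is checked once per row of the block (the granularity of the check is a design choice, not taken from the thesis) and uses a hypothetical function name.

```python
def sad_with_pde(cur, ref, k, l, m, n, N, best_so_far):
    """Accumulate the SAD row by row and stop as soon as the partial sum
    can no longer beat `best_so_far`.
    Returns the full SAD, or None if the candidate was rejected early."""
    partial = 0
    for i in range(N):
        for j in range(N):
            partial += abs(cur[k + i][l + j] - ref[k + i + m][l + j + n])
        # The partial sum never decreases, so rejecting here is lossless.
        if partial >= best_so_far:
            return None
    return partial


cur = [[10, 10],
       [10, 10]]
ref = [[0, 0],
       [10, 10]]
```

Since every term added to the SAD is non-negative, a candidate whose partial sum has already reached the current minimum cannot become the new minimum, so the result is identical to that of a plain full search.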
2.5 Complexity analysis of motion estimation 
In video compression applications, we define the complexity of
an algorithm in terms of the number of required operations and
express the complexity in MOPS (million operations per second).
In the following subsections we compare the complexity
of motion estimation algorithms in terms of MOPS. The
following assumptions have been made for the comparisons.
1. The macroblock size is 16 by 16.
2. The SAD cost function requires 2 * (16 * 16) = 512 data loads,
16 * 16 = 256 subtraction operations, 256 absolute operations,
256 accumulate operations, 1 compare operation and 1 data
store operation. In total, 1282 operations are needed for one
SAD computation.
3. CIF resolution is 352 * 288 and HDTV 720p resolution is
1280 * 720. The number of macroblocks in a CIF frame is
396 and for HDTV 720p is 3600.
4. The frame rate is 30 frames per second.
5. The total number of operations per second required to encode CIF
video in real time is 1282 * 396 * 30 * (# of search points),
i.e. 15.23016 * (# of search points) MOPS. To encode an HDTV
720p video signal, 138.456 * (# of search points) MOPS is
needed.
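The arithmetic behind these assumptions can be reproduced with the short sketch below. The constant and function names are hypothetical; the numbers are the ones stated in the list above.

```python
# Operations for one 16x16 SAD: loads, subtractions, absolute values,
# accumulations, one compare and one store.
OPS_PER_SAD = 2 * 256 + 256 + 256 + 256 + 1 + 1


def macroblocks(width, height, mb_size=16):
    """Number of whole macroblocks in a frame."""
    return (width // mb_size) * (height // mb_size)


def mops(width, height, search_points, fps=30):
    """Million operations per second for real-time motion estimation."""
    per_second = OPS_PER_SAD * macroblocks(width, height) * fps * search_points
    return per_second / 1e6


cif_per_point = mops(352, 288, 1)      # about 15.23 MOPS per search point
hdtv_per_point = mops(1280, 720, 1)    # about 138.46 MOPS per search point
```

Multiplying the per-point figures by the number of search points of a given algorithm gives its real-time requirement for each resolution.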
Algorithm        Number of search points     CIF       HDTV
                 (search range = ±16)
FS               1024                        ~15600    142000
TSS              33                          502       4600
2D-log search    30                          456       4200
DS               25                          381       3500
Table 2.2: Complexity of block-based searching algorithms (measured in
MOPS)
2.5.1 Different searching algorithms 
Assume that the search range is confined to ±16. Table 2.2
shows the number of search points needed for each of the algorithms
described above. The last 2 columns show the number
of operations needed for CIF and HDTV 720p resolutions
respectively. The large computational requirements limit the
ability of general purpose processors to perform these searches
in real time. The performance of a modern general purpose
processor such as the Intel Pentium-4 is a few giga operations per
second (GOPS), so a full search is impractical in software.
Digital signal processors, ASIC or FPGA technologies are the only
choices available for FS.
2.5.2 Fixed-block size motion estimation 
In the first generation coding standards, the block size is confined
to 8 by 8 or 16 by 16. A large block size favors encoding
of a uniform area whereas small block sizes favor detailed area
encoding [12]. Within a picture, detailed and uniform areas coexist,
and fixed block sizes must sacrifice prediction quality to reduce
complexity.
Figure 2.5: Sub-macroblock partitions in H.264/AVC (panels: 16x16,
16x8, 8x16, 8x8, 8x4, 4x8 and 4x4, with the sub-blocks numbered 1 to 41)
2.5.3 Variable block size motion estimation 
In order to adaptively select a suitable block size for each picture
macroblock, variable block size motion estimation has been added in
the latest codec standards, e.g. H.264. In H.264 [48], each picture
(frame) is segmented into macroblocks. Each macroblock
is further divided into sub-blocks of 7 different block
sizes (4x4, 4x8, 8x4, 8x8, 8x16, 16x8 and 16x16) as shown in
Figure 2.5. Across these 7 sizes, a macroblock has 41 sub-blocks in
total. In variable block size motion estimation, a motion vector is
produced for each sub-block.
In total 41 motion vectors are calculated per macroblock.
Since the number of motion vectors is increased from 1 to 41
in variable block size motion estimation, the number of comparison
operations following the computation of a SAD is also increased.
The number of operations to find the motion vectors
is 1282 + 40 = 1322 operations. In a software implementation this
is not a big increase, but for a hardware implementation this increase
is significant as the number of comparators is increased
from 1 to 41, contributing a significant hardware cost.
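The count of 41 sub-blocks can be checked with the sketch below. It rests on an assumption that is not spelled out in the passage above: the SADs of the larger sub-blocks are obtained by adding up the sixteen 4x4 SADs, so that the pixel differences are computed only once. Block sizes are written as width x height, and the function name is hypothetical.

```python
def variable_block_sads(cur_mb, ref_mb):
    """Return a dict mapping 'WxH' to the list of SADs for every sub-block
    of that size in one 16x16 macroblock (lists of 16 rows of 16 pixels)."""
    # s[a][b] is the SAD of the 4x4 block in block-row a, block-column b.
    s = [[sum(abs(cur_mb[4 * a + i][4 * b + j] - ref_mb[4 * a + i][4 * b + j])
              for i in range(4) for j in range(4))
          for b in range(4)]
         for a in range(4)]
    # e[a][b] is the SAD of the 8x8 block in quadrant (a, b).
    e = [[s[2 * a][2 * b] + s[2 * a][2 * b + 1]
          + s[2 * a + 1][2 * b] + s[2 * a + 1][2 * b + 1]
          for b in range(2)]
         for a in range(2)]
    return {
        "4x4": [s[a][b] for a in range(4) for b in range(4)],
        "8x4": [s[a][2 * c] + s[a][2 * c + 1]
                for a in range(4) for c in range(2)],
        "4x8": [s[2 * a][b] + s[2 * a + 1][b]
                for a in range(2) for b in range(4)],
        "8x8": [e[a][b] for a in range(2) for b in range(2)],
        "16x8": [e[0][0] + e[0][1], e[1][0] + e[1][1]],
        "8x16": [e[0][0] + e[1][0], e[0][1] + e[1][1]],
        "16x16": [e[0][0] + e[0][1] + e[1][0] + e[1][1]],
    }


ones = [[1] * 16 for _ in range(16)]
zeros = [[0] * 16 for _ in range(16)]
sads = variable_block_sads(ones, zeros)
total_sub_blocks = sum(len(v) for v in sads.values())  # 16+8+8+4+2+2+1
```

Each of the 41 SADs then needs its own comparison against the running minimum for that sub-block, which is where the additional comparators come from.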
2.5.4 Sub-pixel motion estimation 
Sub-pixel motion estimation involves searching sub-sample
interpolated positions as well as integer-sample positions. Since
the true motion of a macroblock between two successive frames
often falls between two integer positions, the additional
search at sub-pixel accuracy improves the overall
quality of the prediction. In H.264/AVC [48], quarter-pixel
accuracy motion estimation is supported in addition to half-pixel
accuracy.
In the first stage, motion estimation finds the best match
on the integer samples. The encoder then searches the half-sample
positions immediately next to this best match to see whether the
match can be improved. If required, quarter-pixel samples are
searched for further improvement. The added complexity is due
to the interpolation of sub-pixels and the added search positions.
Assume interpolation at half-pixel accuracy needs 11
operations for each pixel and 3 operations are needed at quarter
pixel accuracy. We have 9 more search points for half-pixel
accuracy and 9 more for quarter pixels. We have in addition
9 * 256 * (11 + 3) = 32256 filtering operations per macroblock
and the number of search points is increased by 18. Figure 2.6
shows the integer, half-pixel and quarter-pixel positions
of search candidates.
2.5.5 Multi-reference frame motion estimation 
In previous standards, the prediction of macroblocks requires
references only to the immediately preceding coded I-picture or
P-picture. In H.264/AVC, this restriction is relaxed to enable
efficient coding by allowing selection among a larger number of
pictures that have been decoded and stored. Up to five reference
frames can be used in H.264/AVC [48].
Motion estimation algorithms are modified accordingly in this
case. Depending on the searching algorithm, the complexity can
be increased by five times in the worst case.
Figure 2.6: Integer, half-pixel and quarter-pixel motion estimation search
positions (pel stands for pixel)
2.6 Picture quality analysis 
The performance of the motion estimation algorithms FS, TSS,
2D-log and DS is simulated in software using Matlab. The
simulation result on the standard Foreman sequence is shown in
figure 2.7. This video consists of a slow moving background and
detailed facial motions. As expected, full search performs the best
among all algorithms. For fast search algorithms, DS performs
better than TSS and 2D-log search. As a result, DS is a common
choice in fast search motion estimation.
Figure 2.7: Matlab simulation on the quality of different motion estimation
algorithms on Foreman (PSNR against frame number for diamond search,
full search (integer pixel), 3-step search and 2D-log search, search size 16)
2.7 Summary 
In this chapter we gave background information on video coding
with an emphasis on motion estimation. The motion estimation
procedure, matching criteria and quality judgment were presented.
We also reviewed algorithms for performing motion vector
prediction, comparing their complexities and qualities. Advanced
motion estimation features such as fractional motion estimation
and multi-reference frames were also discussed.
Chapter 3 
Arithmetic for video encoding 
3.1 Introduction 
Computer arithmetic is an important area of digital computer 
organization concerned with the realization of arithmetic func-
tions to support computer architectures as well as arithmetic 
algorithms for software implementation. Architectural and al-
gorithmic optimizations are studied to maximize the efficiency 
of arithmetic operations. 
In a general purpose processor, efficient digital circuits for
mathematical primitives such as +, −, ×, ÷, log, sin, cos are employed.
When implementing a complex problem, e.g. motion estimation,
on a general purpose processor, the designer searches for
an appropriate algorithm and implements it with acceptable
complexity using combinations of the mathematical primitives provided.
The optimization is often restricted since the maximum
parallelism cannot be exploited.
To address this issue, algorithm specific optimizations can be
applied to greatly improve efficiency. In the following sections we
use motion estimation as an example to show how the arithmetic
circuit can be optimized using an arithmetic approach.
3.2 Number systems 
We have been familiar with decimal numbers in calculation for
thousands of years, whereas number representations for computer
systems have only been developed in the past century. In digital
systems, numbers are encoded by means of binary digits. As
the representation has a significant effect on algorithm and circuit
complexity, a suitable representation should be chosen
for any special application if designers want to fully exploit the
optimizations.
In this section we will introduce the non-redundant num-
ber systems such as the sign-and-magnitude and complement 
number schemes. For redundant number systems we introduce 
carry save and signed-digit number schemes. Other representa-
tions also exist, e.g. residue number system, logarithmic number 
system and floating point number system. Since they are not 
employed in our applications, interested readers can refer to an 
arithmetic textbook [10] [28] [40] for further details. 
3.2.1 Non-redundant Number System 
A number N can be represented by a string of n digits with r
being the radix as follows.
$$N = (d_{n-1} d_{n-2} \cdots d_1 d_0)_r$$
where $d_i$, $0 \le i \le n-1$, is a digit and $d_i \in \{0, 1, \ldots, r-1\}$. It is
a positional weighted system where the position of $d_i$ matters.
The value of N is defined by:
$$N = d_{n-1} \times r^{n-1} + d_{n-2} \times r^{n-2} + \cdots + d_1 \times r^1 + d_0 \times r^0 = \sum_{i=0}^{n-1} d_i \times r^i$$
$d_{n-1}$ is called the most significant digit (MSD) and $d_0$ is called
the least significant digit (LSD). In a binary encoding system, each
digit is called a bit and the above terms become MSB and LSB.
Sign-and-magnitude number representation
There are several ways to represent a negative number in a
non-redundant number system. In standard mathematical notation, a
± sign is prepended to the string of digits to indicate whether
the number is positive or negative. In computer systems, a common
convention is to add one extra bit in front of the MSD of the
binary string: “1” represents a negative number and “0” a positive
one. Such a representation is called the sign-and-magnitude
representation scheme. As a result, the number of bits representing
the number is increased by 1. A sign-and-magnitude number S is
represented as:
$$S = (d_n d_{n-1} d_{n-2} \cdots d_1 d_0)_r$$
where $d_n = 0$ when S is positive and $d_n = 1$ otherwise. The
value of S is defined by:
$$S = (-1)^{d_n} \times (d_{n-1} \times r^{n-1} + d_{n-2} \times r^{n-2} + \cdots + d_0 \times r^0) = (-1)^{d_n} \sum_{i=0}^{n-1} d_i \times r^i$$
Advantages of the sign-and-magnitude representation include its
conceptual simplicity, symmetric range and simple negation by
flipping the sign bit. A disadvantage is that addition needs to be
handled differently when the signs of the two numbers are
different.
Complement representation 
Another way to represent a signed number is to employ a
complement number representation. The negation of a number X
is represented as the unsigned value M − X, where M is a
suitably large constant greater than X. Representing integers in
the range [−N, +P] requires M ≥ N + P + 1, with M = N + P + 1
giving maximum coding efficiency. For example, to code [−7, +8],
M = 7 + 8 + 1 = 16. The value X of an r's complement number
is defined as follows:
$$X = -d_n \times r^n + d_{n-1} \times r^{n-1} + d_{n-2} \times r^{n-2} + \cdots + d_0 \times r^0 = -d_n \times r^n + \sum_{i=0}^{n-1} d_i \times r^i$$
Addition and subtraction can be performed easily in hard-
ware using complement forms since the adder and subtracter 
can be combined in a unit. For this reason, 2's complement 
scheme is often used in computer systems. 
3.2.2 Redundant number system 
The number systems introduced in the previous sections are
non-redundant, positional weighted systems. Each digit has only a
non-negative value which is less than the radix r. In this section
we present redundant number systems, including the signed-digit
number scheme and the carry-save number scheme. These number
systems are redundant because a value can be represented in more
than one way. In the signed-digit number scheme, each digit can
have either a positive or a negative value bounded by ±a. In the
carry-save number scheme, each digit is a non-negative value which
is bounded by 2r − 1.
Signed-Digit number 
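A signed-digit number is a redundant number representation
satisfying the following constraints. Suppose we have a signed-digit
number SD. It is represented by:
$$SD = (x_{n-1} x_{n-2} \cdots x_1 x_0)_r$$
where radix $r \ge 2$, $-a \le x_i \le a$ and $a \le r - 1$, with $a$ large
enough for the digit set to be redundant ($2a + 1 > r$).
To represent the signed-digit number with minimum redundancy
[10], $a$ is set to the smallest value that satisfies this condition.
The value of SD is defined as:
$$\text{value of } SD = \sum_{i=0}^{n-1} x_i \times r^i \quad \text{where } x_i \in \{-a, \ldots, -1, 0, 1, \ldots, a\}$$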
Carry-save number 
A carry-save number is also a number system with redundancy.
Suppose we have a carry-save number CS. It is represented by:
$$CS = (x_{n-1} x_{n-2} \cdots x_1 x_0)_r$$
where radix $r \ge 2$ and $0 \le x_i \le 2r - 1$ since carry $\in \{0, 1\}$. The
value of a carry-save number is defined as:
$$\text{value of } CS = \sum_{i=0}^{n-1} x_i \times r^i \quad \text{where } x_i \in \{0, 1, \ldots, 2r - 1\}$$
In a redundant number system, addition time can be independent
of word length. Since the maximum length of carry propagation is
reduced, the adder can work at higher efficiency. This favors
operations for which both the input and output operands are in
redundant representations.
The drawback is the increase in the number of bits required
for representing a number. Moreover, magnitude comparison and
sign detection become more complex than in non-redundant number
systems. Finally, it takes time to convert redundant numbers to
non-redundant ones when non-redundant results are desired.
Application of redundant number systems 
Applications of redundant number systems appear in many
areas such as cryptography [36], digital signal processing [44],
etc. These fields involve calculations with either very long
operands or multi-operand operations, both of which can benefit
from redundant number representations. For example, RSA decryption
involves 1024-bit arithmetic and is slow in non-redundant number
systems. Redundant number systems can be used to reduce the time
complexity. Digital signal processing (DSP) operations often
involve filtering, which employs multi-operand operations.
Multi-operand addition can be sped up by employing redundant
number systems. In these applications, since the performance
speedup outweighs the area overhead, redundant number systems
are commonly chosen. For motion estimation, which involves
many multi-operand operations, the advantages of redundant
number systems such as low latency and high pipelinability are
compelling.
3.3 Addition/subtraction algorithm 
Addition and subtraction are basic operations in computer
arithmetic and also in our problem domain, motion estimation.
Many algorithms and architectures have been proposed in the
literature to perform addition or subtraction in hardware.
Both non-redundant and redundant approaches are applicable
to these algorithms. Since radix 2 is commonly used in computer
systems, it is assumed in all algorithms below.
3.3.1 Non-redundant number addition 
This is the simplest addition algorithm in computer arithmetic.
Suppose we have two n-bit operands, x and y, to be added, such
that $0 \le x, y \le 2^n - 1$. Together with the carry-in $c_{in} \in \{0, 1\}$ as
inputs, the output sum s is calculated, where $0 \le s \le 2^n - 1$ and
the carry-out $c_{out} \in \{0, 1\}$. We have,
$$x + y + c_{in} = 2^n c_{out} + s \qquad (3.1)$$
The solution to equation 3.1 is,
$$s = (x + y + c_{in}) \bmod 2^n$$
and
$$c_{out} = \begin{cases} 1 & \text{if } (x + y + c_{in}) \ge 2^n \\ 0 & \text{otherwise} \end{cases} = \left\lfloor \frac{x + y + c_{in}}{2^n} \right\rfloor$$
When n = 1, the addition algorithm reduces to a primitive
module called a full adder (FA), which is the fundamental element
for building larger adders in hardware. Applying the n = 1
equation recursively, with the carry-out of each bit position
feeding the carry-in of the next, gives the implementation of a
word adder built from an array of FAs.
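The recursion can be sketched in Python as a ripple-carry adder that chains n full adders. This is a behavioral illustration under our own naming (`full_adder` and `ripple_carry_add` are hypothetical helpers), not a description of any particular circuit.

```python
def full_adder(x: int, y: int, c_in: int) -> tuple[int, int]:
    """Equation 3.1 for n = 1: returns (s, c_out) for single-bit inputs."""
    total = x + y + c_in
    return total % 2, total // 2


def ripple_carry_add(x: int, y: int, n: int, c_in: int = 0) -> tuple[int, int]:
    """Add two n-bit operands by chaining n full adders, LSB first."""
    s = 0
    carry = c_in
    for i in range(n):
        bit, carry = full_adder((x >> i) & 1, (y >> i) & 1, carry)
        s |= bit << i
    return s, carry  # carry is c_out of the most significant full adder


s, c_out = ripple_carry_add(200, 100, 8)
print(s, c_out)  # 200 + 100 = 2**8 * c_out + s
```

The carry has to travel through all n full adders in the worst case, which is the long carry propagation that the redundant schemes in the following sections avoid.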
3.3.2 Carry-save number addition 
A carry-save representation allows us to eliminate the long carry
propagation of a non-redundant number addition algorithm.
Suppose we have three radix-2 n-bit non-redundant numbers x, y
and z such that $0 \le x, y, z \le 2^n - 1$ as inputs, and produce as
outputs the sum vector sv where $0 \le sv \le 2^n - 1$ and the carry
vector cv where $0 \le cv \le 2^{n+1} - 1$. We have:
$$x + y + z = cv + sv$$
where
$$sv_i = (x_i + y_i + z_i) \bmod 2$$
$$cv_{i+1} = \lfloor (x_i + y_i + z_i) / 2 \rfloor$$
$$cv_0 = 0$$
$$0 \le i \le n - 1$$
The sum of the three numbers is in carry-save format {cv, sv}.
To convert it to a non-redundant number, a final non-redundant
addition is needed.
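A short Python sketch of the carry-save addition defined above is given here for illustration. Every bit position is computed independently of the others, so no carry ripples between positions; the function name and the 4-bit operands are our own choices.

```python
def carry_save_add(x: int, y: int, z: int, n: int) -> tuple[int, int]:
    """Reduce three n-bit operands to a carry vector cv and a sum vector sv."""
    sv = 0
    cv = 0  # cv_0 = 0 by construction
    for i in range(n):
        total = ((x >> i) & 1) + ((y >> i) & 1) + ((z >> i) & 1)
        sv |= (total % 2) << i          # sv_i
        cv |= (total // 2) << (i + 1)   # cv_{i+1}
    return cv, sv


cv, sv = carry_save_add(13, 7, 9, 4)
print(cv, sv, cv + sv)  # the final addition cv + sv restores 13 + 7 + 9
```

Only the last step, `cv + sv`, needs a carry-propagate adder, which is why carry-save stages are attractive when many operands are accumulated before a single final conversion.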
3.3.3 Signed-digit number addition 
The objective of signed-digit addition, like carry save addition, 
is to eliminate the carry propagation for long word length num-
bers. Suppose we have two signed-digit numbers, x and y. The 
procedure of performing signed-digit addition consists of two 
steps. 
1. Compute the interim sum (w) and transfer (t) such that
$$x_i + y_i = w_i + r t_{i+1} \qquad (3.2)$$
at digit level $0 \le i \le n-1$, where n is the number of digits
and r is the radix. w acts like the sum in carry-save
addition but, instead of taking values between 0 and r, it is
bounded by $-a \le w \le a$. t acts like a carry to the next
position with $t \in \{-1, 0, 1\}$ instead of $\{0, 1\}$ as in the
carry-save representation scheme.
2. Compute the sum s of w and t such that for $0 \le i \le n-1$,
$$s_i = w_i + t_i$$
The carry-free property can be ensured by avoiding carry
generation when adding w and t. Given $-a \le s \le a$, the
following constraint should be satisfied by the values of w and
t so that no carry is produced:
$$-a + t^- \le w_i \le a - t^+$$
where
$$-t^- \le t_{i+1} \le t^+$$
For a radix-2 signed-digit addition, this constraint can never be
met: if the radix is 2, a = 1 and $-1 + t^- \le w_i \le 1 - t^+$
implies either $w_i = 0$ or $t_{i+1} = 0$, which makes it impossible to
satisfy equation 3.2 [10].
To allow signed-digit addition on radix-2 operands, a
recoding modification is made [10]. As the signed-digit addition
can be viewed as a recoding of the digit set of
$x_i + y_i \in \{-2, -1, 0, 1, 2\}$ into the digit set $\{-1, 0, 1\}$, two
recodings are performed. First, the digit set is recoded from
$\{-2, -1, 0, 1, 2\}$ to $\{-2, -1, 0, 1\}$:
$$x_i + y_i = 2h_{i+1} + z_i$$
such that $h_i \in \{0, 1\}$ and $z_i \in \{-2, -1, 0\}$. The sum of $h_i$ and
$z_i$ has the digit set $\{-2, -1, 0, 1\}$. Then this digit set is recoded
to $\{-1, 0, 1\}$ by setting
$$z_i + h_i = 2t_{i+1} + w_i$$
such that $t_i \in \{-1, 0\}$ and $w_i \in \{0, 1\}$, so that the final sum
digit $s_i = w_i + t_i$ lies in $\{-1, 0, 1\}$.
Details and examples of radix-2 signed-digit addition are
presented in [10]. In this work we employ a 2-level recoding
approach to implement signed-digit multi-operand addition in
our bit-serial architecture in chapter 4.
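The two recodings can be checked with a small behavioral model. The Python sketch below is an illustration only, assuming digits are stored least significant first as integers in {-1, 0, 1}; the helper names are hypothetical and the code does not model the bit-serial hardware of chapter 4. The example operands are the values -151 and 245 used in Table 3.1.

```python
def sd_value(digits: list[int]) -> int:
    """Value of a radix-2 signed-digit number, least significant digit first."""
    return sum(d * (1 << i) for i, d in enumerate(digits))


def sd_add(x: list[int], y: list[int]) -> list[int]:
    """Carry-free radix-2 signed-digit addition using two recodings."""
    n = len(x)
    assert len(y) == n
    h = [0] * (n + 1)
    z = [0] * (n + 1)
    for i in range(n):
        total = x[i] + y[i]               # in {-2, ..., 2}
        h[i + 1] = 1 if total > 0 else 0  # h in {0, 1}
        z[i] = total - 2 * h[i + 1]       # z in {-2, -1, 0}
    t = [0] * (n + 2)
    w = [0] * (n + 1)
    for i in range(n + 1):
        q = z[i] + h[i]                   # in {-2, ..., 1}
        t[i + 1] = -1 if q < 0 else 0     # t in {-1, 0}
        w[i] = q - 2 * t[i + 1]           # w in {0, 1}
    # Each sum digit depends only on positions i, i-1 and i-2.
    return [w[i] + t[i] for i in range(n + 1)]


a = [-1, -1, 1, -1, 1, -1, 0, -1]   # -151, least significant digit first
b = [-1, 1, -1, 1, 1, 1, 1, 1]      # 245, least significant digit first
print(sd_value(sd_add(a, b)))
```

Because each output digit depends on a fixed window of three input positions, the delay of the addition does not grow with the word length.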
3.4 Bit-serial algorithms 
Bit-serial arithmetic [10] has the advantages of smaller silicon
area, low pin count and higher operating frequency. Since the
routing in bit-serial designs can be much shorter than in the
corresponding bit-parallel implementation, the throughput can be
competitive with bit-parallel approaches. Moreover, performance
per gate in a bit-serial design is often much higher. It is often a
good choice for implementation platforms where storage elements
are abundant, as is the case for FPGAs.
Bit-serial implementations can be categorized by processing
order: least-significant-bit first (LSB-first) and
most-significant-bit first (MSB-first). MSB-first arithmetic is also
called on-line arithmetic in the literature since it adds an on-line
delay from input to output. Some operations like addition and
subtraction are more easily implemented LSB-first, while
comparisons, square roots and divisions are more efficient MSB-first.
3.4.1 Least-significant-bit (LSB) first mode 
LSB-first mode is an arithmetic computation starting from the
least significant digit. LSB-first arithmetic can be employed in
non-redundant or redundant number systems. With a word size of
N bits, the number of iterations to produce the result is N.
LSB-first arithmetic can produce the output bits without
input-to-output delays when LSB-first favored algorithms are
implemented. Once the input bits are ready, the corresponding
output bits can be calculated immediately in the next computation
cycle.
LSB-first favored algorithms include additions, subtractions,
multiplications, etc. As a result, LSB-first mode is applicable to
almost all the simple primitive mathematical operations. For
example, an LSB-first addition algorithm for radix-2 is shown in
figure 3.1.
Algorithm LsbBsAdd
Input: Operands x, y and carry_0
Output: Sum s and carry out carry_N
1. for (i from 1 to N)
2.     do s_i = (x_i + y_i + carry_{i-1}) mod 2;
3.        carry_i = (x_i + y_i + carry_{i-1}) >> log2 2;
4. return s and carry_N
Figure 3.1: LSB-first bit-serial addition algorithm
where carry_0 = 0 and the result is N + 1 bits including the
carry out. This addition takes N cycles to complete.
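The algorithm of figure 3.1 translates almost line by line into Python. In the sketch below each loop iteration stands for one clock cycle and the single `carry` variable plays the role of the carry flip-flop; the helper names and the 4-bit example are illustrative assumptions.

```python
def to_lsb_bits(value: int, n: int) -> list[int]:
    """Split an n-bit unsigned value into bits, least significant first."""
    return [(value >> i) & 1 for i in range(n)]


def from_lsb_bits(bits: list[int]) -> int:
    return sum(b << i for i, b in enumerate(bits))


def lsb_first_add(x_bits: list[int], y_bits: list[int]) -> tuple[list[int], int]:
    """Bit-serial addition: one sum bit per cycle, one stored carry."""
    carry = 0                      # carry_0 = 0
    s_bits = []
    for x_i, y_i in zip(x_bits, y_bits):
        total = x_i + y_i + carry
        s_bits.append(total % 2)   # s_i
        carry = total >> 1         # carry_i (shift by log2 of the radix)
    return s_bits, carry           # N sum bits plus carry_N


s_bits, carry_n = lsb_first_add(to_lsb_bits(11, 4), to_lsb_bits(6, 4))
print(from_lsb_bits(s_bits) + (carry_n << 4))
```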
3.4.2 Most-significant-bit (MSB) first mode 
MSB-first arithmetic, also known as on-line arithmetic, is a
bit-serial scheduling technique in which the calculation is started
from the most significant digit [11]. Its idea is to overlap
computation with digit by digit communication of the operands.
Division and square root operations can be implemented
efficiently using MSB-first calculation. An important
characteristic of on-line arithmetic is that an on-line delay δ is
introduced such that the calculation for input digit j completes at
cycle j + δ.
Addition and subtraction must employ a redundant number
representation to be computed in an MSB-first manner. A commonly
used redundant number set is the signed-digit numbers. For
example, an MSB-first addition algorithm for radix r (r > 2) is
shown in figure 3.2, where $w_N = 0$ and $z_0 = 0$. The result is in
signed-digit format and a 1-cycle on-line delay is introduced. In
contrast, a 2-cycle on-line delay is introduced for radix-2
two-operand additions [10].
Algorithm MsbBsAdd
Input: Operands x, y
Output: Sum z
1. for (i from N-1 down to -1)
2.     do w_i = (x_i + y_i) mod r;
3.        t_{i+1} = (x_i + y_i) >> log2 r;
4.        z_{i+1} = w_{i+1} + t_{i+1};
5. return z
Figure 3.2: MSB-first bit-serial addition algorithm
3.5 Absolute difference algorithm 
This section describes how the operation "absolute difference"
(|a − b|) can be done in bit-parallel and bit-serial approaches.
The procedure for computing an absolute difference is as follows.
1. Subtraction of the two numbers a and b. 
2. Check if the result is negative. 
3. Convert the negative number to a positive result if the dif-
ference is negative. 
The following subsections describe how to realize these three
steps in non-redundant and redundant number systems.
3.5.1 Non-redundant algorithm for absolute difference 
In the first step, the subtraction is done by converting b to its
two's complement format and performing an addition. This involves
an (N+1)-bit addition. The sign of the difference can be
determined by looking at the most significant bit of the difference.
Lastly, the result is negated, by taking its 2's complement, if it
is negative. At the hardware level, the worst case is two
(N+1)-bit additions and a sign detection.
The non-redundant number system is commonly used in bit-parallel
absolute difference implementations since the sign can be
determined in a short time. For SAD in motion estimation, however,
the absolute difference need not be presented as a non-redundant
number, since the multi-operand additions in the subsequent stage
can accept redundant inputs. As a result, the absolute difference
is modified to:
1. Perform $\bar{A} + B$, where $\bar{A}$ is the bit-wise NOT of A.
2. Check if carry out = 1. If yes, B > A.
3. If carry out = 1, set $(\bar{A}, B)$ as output. Otherwise, set
$(A, \bar{B})$ as output.
In this way, the result is represented by a pair of N-bit numbers.
The worst case is one N-bit addition and one comparison only
(neglecting the cost of the inverter operation). This approach has
been used in a SAD processor proposed in [53].
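The three steps can be checked exhaustively for 8-bit pixels with the Python sketch below. Two points are our own assumptions rather than statements from the text: the function names are hypothetical, and the `resolve` helper adds the two outputs plus a correction of 1 modulo 2^N, which follows from the two's complement identity −X = NOT(X) + 1 and which a real design would presumably fold into the subsequent multi-operand addition.

```python
def abs_diff_pair(a: int, b: int, n: int = 8) -> tuple[int, int]:
    """Return the two N-bit outputs that jointly encode |a - b|."""
    mask = (1 << n) - 1
    not_a = ~a & mask
    carry_out = (not_a + b) >> n      # 1 exactly when b > a
    if carry_out:
        return not_a, b               # encodes b - a
    return a, ~b & mask               # encodes a - b


def resolve(pair: tuple[int, int], n: int = 8) -> int:
    """Assumed final step: sum of the pair plus 1, modulo 2**n."""
    p, q = pair
    return (p + q + 1) & ((1 << n) - 1)


print(resolve(abs_diff_pair(17, 200)), resolve(abs_diff_pair(200, 17)))
```

Deferring the resolution keeps the per-pixel hardware down to one addition, since only the carry out is needed to select which operand is inverted.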
3.5.2 Redundant algorithm for absolute difference 
When we perform absolute difference in a signed-digit number
system, we have to redefine the comparison and absolute value
operations. In motion estimation, a and b in |a − b| are 8-bit
positive numbers representing the intensity of a pixel. The first
step, a − b, can be done by converting a − b into signed-digit
format directly.
In the signed-digit number system, each digit takes a value in
{−1, 0, 1}, encoded in binary as a pair
$(x_i^+, x_i^-) \in \{(0,0), (0,1), (1,0), (1,1)\}$. The pairs (0,0) and (1,1)
are redundant and both represent zero. As a result, a − b can be
represented in signed-digit format by setting
$(x_i^+ = a_i, x_i^- = b_i)$ where $0 \le i \le N-1$ and N = 8 in
this case. Second, a − b in signed-digit format is checked to see
whether it is positive or negative. Finally, it is converted to a
positive signed-digit number as output. The conversion is achieved
by just interchanging $x^+$ with $x^-$ if a negative signed-digit
number is detected, i.e. $(x_i^+ = b_i, x_i^- = a_i)$. The sign
checking for signed-digit numbers is described in figure 3.3.
Notice that N = 8 is the word length and the checking is performed
in an MSB-first manner.
Algorithm SignDetectSignNum
Input: Current block pixel: a, Reference block pixel: b
Output: |a − b| in signed-digit format: x
1. a_larger = 0;
2. for (i from N downto 1)
3.     do x_i = (a_i, b_i);
4. for (i from N downto 1)
5.     do if (x_i = (0,0) or x_i = (1,1))
6.         then continue;
7.     if (x_i = (1,0))
8.         then a_larger = 1;
9.         break;
10.    if (x_i = (0,1))
11.        then a_larger = 0;
12.        break;
13. if (a_larger = 0)
14.     then for (i from N to 1)
15.         do x_i = (b_i, a_i);
16.     else for (i from N to 1)
17.         do x_i = (a_i, b_i);
18. return x
Figure 3.3: Signed-digit number based sign detection algorithm
x is the result of |a − b| in signed-digit format. In hardware,
a signed-digit absolute difference requires no addition or
subtraction. The sign detection can be done on-the-fly as it is an
MSB-first favored algorithm, and thus the hardware latency is
not significant compared to a conventional two's complement
approach.
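The sign detection of figure 3.3 can be exercised with a short Python model. This is a behavioral sketch only: the function names are hypothetical, the digit pairs are stored most significant first, and the decoding helper exists purely to check the result.

```python
def sd_abs_diff(a: int, b: int, n: int = 8) -> list[tuple[int, int]]:
    """Return |a - b| as MSB-first (x_plus, x_minus) digit pairs."""
    a_bits = [(a >> i) & 1 for i in reversed(range(n))]
    b_bits = [(b >> i) & 1 for i in reversed(range(n))]
    a_larger = False
    for a_i, b_i in zip(a_bits, b_bits):
        if a_i == b_i:
            continue                  # (0,0) or (1,1): digit is zero
        a_larger = (a_i, b_i) == (1, 0)
        break                         # first non-zero digit fixes the sign
    if a_larger:
        return list(zip(a_bits, b_bits))
    return list(zip(b_bits, a_bits))  # swap x+ and x- to negate


def sd_pairs_value(pairs: list[tuple[int, int]]) -> int:
    """Decode MSB-first digit pairs, each worth x_plus - x_minus."""
    n = len(pairs)
    return sum((p - m) * (1 << (n - 1 - k)) for k, (p, m) in enumerate(pairs))


print(sd_pairs_value(sd_abs_diff(23, 180)))
```

No adder appears anywhere in `sd_abs_diff`: the result is produced by wiring and a swap, which is the property that makes the signed-digit form attractive here.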
Algorithm SadSeq
Input: Current block pixel: c, Reference block pixel: r
Output: Sum of absolute difference: SAD
1. SAD[0] = 0;
2. for (i from 1 to MN)
3.     do SAD[i] = SAD[i-1] + |c_i − r_i|;
4. return SAD
Figure 3.4: Sequential SAD computation in general purpose processor
3.6 Multi-operand addition algorithm 
Multi-operand addition plays an important role in motion
estimation as it accounts for most of the complexity in a SAD
calculation. The SAD operation is given by:
$$SAD = \sum_{i=1}^{MN} |c_i - r_i|$$
where M and N are the width and height of a macroblock
respectively. The summation can be performed in parallel or
sequentially. In general purpose processors, the summation is
executed sequentially as shown in figure 3.4, but a parallel
approach is often chosen for high-end applications. In the
following sections, three parallel implementations are presented.
3.6.1 Bit-parallel non-redundant adder tree implementation
Multi-operand addition based on a non-redundant adder tree is
the easiest way to implement multi-operand additions. Suppose
we have an N-bit 2-operand adder (2N-Add) as a calculation
element. The M × N-operand adder tree can be built as follows.
1. $\frac{MN}{2}$ 2N-Add adders are needed for the first level addition.
$\frac{MN}{2}$ result operands of N + 1 bits are produced.
2. Feed the result operands into another series of $\frac{MN}{4}$
2N-Add adders as inputs and produce half the number of
result operands calculated at step 1.
3. Continue to add the result operands until the number of result
operands is equal to 1 and the summation terminates.
Figure 3.5: Bit-Parallel 4x2-operand adder tree. Eight operands
(Op1 to Op8) feed four 8-bit adders, whose outputs feed two 8-bit
adders and then a final 8-bit adder that produces the result.
This type of adder tree can be pipelined to increase the
throughput. Its cycle time depends on the word size and the
number of operands to be added. For example, if we have 256 8-bit
operands to be added, the final adder in the hardware pipeline is
log2(256) + 8 = 16 bits wide, so the critical path delay is that of
a 16-bit addition.
The speedup of this adder tree over the sequential approach is
MN / log2(MN). Assuming that the cycle time of the sequential
implementation is equal to that of the adder tree and that
execution is not pipelined, MN = 256 implies a speedup of
256 / log2(256) = 32, which is a significant improvement over
general purpose processors. With pipelined execution, the speedup
can be even more significant. An example of a 4x2-operand adder
tree is shown in figure 3.5.
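The level count and the word growth quoted above can be reproduced with a small Python model of the tree. It is a sketch under our own assumptions: the function name is hypothetical, an odd number of operands is padded with a zero, and each list comprehension stands for one level of 2-operand adders working in parallel.

```python
def adder_tree_sum(operands) -> tuple[int, int]:
    """Sum operands pairwise, returning (total, number of tree levels)."""
    level = list(operands)
    depth = 0
    while len(level) > 1:
        if len(level) % 2:
            level.append(0)           # pad so every adder has two inputs
        level = [level[i] + level[i + 1] for i in range(0, len(level), 2)]
        depth += 1
    return level[0], depth


total, depth = adder_tree_sum([255] * 256)   # worst case for 8-bit operands
print(total, depth, total.bit_length())      # result width is 8 + log2(256)
print(256 / depth)                           # speedup over 256 sequential steps
```

For the 4x2 example of figure 3.5 the same function reports three levels for eight operands.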
3.6.2 Bit-parallel carry-save adder tree implementation
The non-redundant number system is employed in non-redundant
adder trees. The cycle time of such an adder tree can be reduced
by employing carry-save adders, in which the carry propagation is
independent of the operand size. To illustrate the carry-save
based multi-operand adder tree, a computation element, the 4-to-2
compressor, is used as the primitive calculation. It adds 4
operands and produces 2 result operands. An example of a 4-to-2
compressor based adder tree for the addition of 16 operands is
shown in [10].
When using a 16-bit 4-to-2 compressor as the primitive
component, the construction of 4-to-2 compressor based adder trees
is similar to that of non-redundant adder trees. The number of
adder tree levels is reduced by one in carry-save adder trees. The
output of the carry-save adder tree is in carry-save format, and
the final output should be added using a carry propagation adder
to convert it into a non-redundant number when desired. As
a result, the effective pipeline levels of the two trees are equal.
In contrast, the cycle time of a 4-to-2 compressor is less than that
of a non-redundant adder since its carry-save nature shortens the
carry propagation delay. As a result, the speedup over general
purpose processors is greater than that of non-redundant adder
trees.
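A behavioral Python sketch of a 4-to-2 compressor tree is given below. It assumes, for illustration, that each compressor is built from two cascaded carry-save adders operating on whole words with bitwise logic; the function names are hypothetical and Python's unbounded integers stand in for sufficiently wide registers.

```python
def csa(x: int, y: int, z: int) -> tuple[int, int]:
    """Word-level carry-save adder: returns (carry vector, sum vector)."""
    carry = ((x & y) | (x & z) | (y & z)) << 1
    return carry, x ^ y ^ z


def compress_4_to_2(a: int, b: int, c: int, d: int) -> tuple[int, int]:
    """Reduce four operands to two whose sum equals a + b + c + d."""
    c1, s1 = csa(a, b, c)
    return csa(s1, c1, d)


def csa_tree_sum(operands) -> tuple[int, int]:
    """Reduce operands with 4-to-2 compressors, then do one final add."""
    level = list(operands)
    levels = 0
    while len(level) > 2:
        while len(level) % 4:
            level.append(0)           # pad to a multiple of four
        reduced = []
        for i in range(0, len(level), 4):
            reduced.extend(compress_4_to_2(*level[i:i + 4]))
        level = reduced
        levels += 1
    return sum(level), levels         # sum() models the carry propagation adder


print(csa_tree_sum(range(16)))
```

For 16 operands the model needs three compressor levels, one fewer than the four levels of a 2-operand adder tree, and the single carry-propagate addition at the end accounts for the remaining level.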
3.6.3 Bit serial signed digit adder tree implementation 
The adder trees mentioned above are implemented in a
word-parallel manner. A primitive computation element involves
N-bit additions and occupies a large silicon area. To address this
issue, we implement multi-operand additions using a bit-serial
approach. Only the MSB-first approach is discussed here.
Signed-digit representation must be used to enable MSB-first
addition. Signed-digit adders [10] are used throughout the adder
tree. Figure 3.6 shows the architecture of a signed-digit bit-serial
adder for 2-operand additions. Table 3.1 shows an example of
how 2-operand addition is performed for signed-digit inputs.
Cycle  A+  A-  B+  B-  R-  R+
0      0   1   1   0   0   0
1      0   0   1   0   0   0
2      0   1   1   0   1   1
3      1   0   1   0   0   1
4      0   1   1   0   0   1
5      1   0   0   1   0   0
6      0   1   1   0   1   1
7      0   1   0   1   0   0
8      0   0   0   0   1   0
9      0   0   0   0   0   0
$A = \bar{1}0\bar{1}1\bar{1}1\bar{1}\bar{1}$ $(-151_{10})$
$B = 11111\bar{1}1\bar{1}$ $(245_{10})$
$R = 00011000\bar{1}0$ $(94_{10})$
Table 3.1: Example for online signed-digit adder
The construction of an adder tree based on signed-digit adders
is also similar to that of a non-redundant adder tree, but the cycle
time is shorter since it is carry-free and bit-serial. The
construction of signed-digit adder trees is discussed in [50], which
gives several optimizations of silicon area and delay for different
numbers of operands.
3.7 Comparison algorithms 
Figure 3.6: Bit-serial signed-digit adder (ol-CSFA stands for
on-line carry-save full adder). Inputs A+, A−, B+ and B− pass
through two ol-CSFA stages with delay elements (D) to produce the
outputs R− and R+; the complete unit is labeled ol-SDFA.
In motion estimation, each SAD computed should be passed to
a comparison unit and checked to see if its value is less than the
current minimum value. The minimum SAD and motion vector
are updated when this is the case. In MPEG-1 and MPEG-2, the
macroblock size is 16 by 16 and the final SAD size is 16 bits, so
a 16-bit comparison should be made. In contrast to MPEG-2,
H.264/AVC supports variable block size motion estimation and
the number of comparisons for each SAD is increased to 41. The
comparisons are divided into 12-bit, 13-bit, 14-bit, 15-bit and
16-bit types for the 41 minimum SAD values. Comparison
algorithms can be MSB-first or LSB-first, and the mode selected
affects only the complexity. MSB-first is common since the
algorithm can be terminated earlier. In the following sections an
MSB-first comparison algorithm is described.
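The figure of 41 comparisons and the five comparator widths can be reproduced with a short calculation. The Python sketch below assumes the seven H.264/AVC partition shapes of a 16 by 16 macroblock and 8-bit pixels, so the largest possible absolute difference per pixel is 255; the function and constant names are illustrative.

```python
# (width, height) of each H.264/AVC partition shape of a 16x16 macroblock
BLOCK_SHAPES = [(16, 16), (16, 8), (8, 16), (8, 8), (8, 4), (4, 8), (4, 4)]


def sad_comparisons(mb_size: int = 16, max_diff: int = 255):
    """Return (shape, blocks per macroblock, bits of the largest SAD)."""
    table = []
    for width, height in BLOCK_SHAPES:
        count = (mb_size // width) * (mb_size // height)
        bits = (width * height * max_diff).bit_length()
        table.append(((width, height), count, bits))
    return table


for shape, count, bits in sad_comparisons():
    print(shape, count, bits)
print(sum(count for _, count, _ in sad_comparisons()))
```

The widths run from 12 bits for the sixteen 4x4 blocks up to 16 bits for the single 16x16 block, which matches the five comparison types listed above.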
3.7.1 Non-redundant comparison algorithm 
In non-redundant number systems, comparing two numbers
starting from the MSB is simple and the process can be terminated
early without examining all bits. For example, it can be deduced
that “0111” is larger than "0011" in two's complement format
after examining two bits. For negative numbers, the process is the
same but the criterion determining which one is larger is changed.
For example, “1100” is smaller than “1011” in sign-and-magnitude
representation. In both cases, the distinction is drawn as soon as
the first different digit appears.
3.7.2 Signed-digit comparison algorithm 
In the signed-digit number system, the comparison of two
numbers is not as simple. Because of number redundancy, the
distinction cannot be drawn as soon as the first different digit
appears. For example, $01\bar{1}\bar{1}$ (= 1) is smaller than $0010$ (= 2).
Similar cases occur for negative numbers.
An algorithm to perform on-line comparison is suggested in [45].
It uses the fact that if the difference between the two digit
strings examined so far is equal to or greater than two, the
comparison can be terminated and the result can be determined.
The argument is outlined below [45].
Given two radix-2 signed-digit numbers P and Q which are
N-digit numbers such that
$$P = \sum_{l=N-1}^{0} p_l \times 2^l \quad \text{and} \quad Q = \sum_{l=N-1}^{0} q_l \times 2^l$$
Suppose the comparison can determine that P is larger than
Q at the k-th digit. P and Q can be divided into
$$P = \sum_{l=N-1}^{k} p_l \times 2^l + \sum_{l=k-1}^{0} p_l \times 2^l$$
$$Q = \sum_{l=N-1}^{k} q_l \times 2^l + \sum_{l=k-1}^{0} q_l \times 2^l$$
where $0 \le k \le N-1$ and $p_l, q_l \in \{\bar{1}, 0, 1\}$.
As a result, if the comparison of the two digit strings reaches
the state that $(p_{N-1} p_{N-2} \cdots p_k - q_{N-1} q_{N-2} \cdots q_k) \ge 2$, the
comparison can be terminated at the k-th digit without examining
the less significant digits.
To sum up, in a radix-2 signed-digit comparator, once we find
out that the difference is larger than or equal to two, the decision
can be made and the process can be terminated. As a result,
early termination is still valid in signed-digit computation.
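The termination rule can be illustrated with a Python sketch of an MSB-first comparator. The reasoning behind the threshold is that the unexamined low-order digits of P and Q can each contribute at most 2^k − 1 in magnitude, so they can change the difference by at most 2^(k+1) − 2, which is less than the weight of a prefix difference of two. The symmetric test for a difference of −2 or less is our own extension of the statement above, and the function name is hypothetical.

```python
def sd_compare_msb_first(p: list[int], q: list[int]) -> tuple[int, int]:
    """Compare MSB-first digit lists over {-1, 0, 1}.

    Returns (sign of P - Q, number of digit positions examined).
    """
    diff = 0
    for examined, (p_k, q_k) in enumerate(zip(p, q), start=1):
        diff = 2 * diff + (p_k - q_k)   # difference of the prefixes so far
        if diff >= 2:
            return 1, examined          # P is larger, stop early
        if diff <= -2:
            return -1, examined         # Q is larger, stop early
    return (diff > 0) - (diff < 0), len(p)


print(sd_compare_msb_first([1, 1, 0, 0], [0, 0, 1, 1]))    # 12 versus 3
print(sd_compare_msb_first([0, 1, -1, -1], [0, 0, 1, 0]))  # 1 versus 2
```

In the first call the decision is reached after two digits. In the second call the first differing digit favours P, yet P is the smaller number, so the comparator correctly keeps going until the last digit.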
3.8 Summary 
In this chapter we have explained how different operators can
be implemented using different representations and processing
orders which can be employed for motion estimation. MSB-first
processing of addition, absolute difference, summation and
comparison was explained and, as we shall see, can lead to
bit-level pipelining and high hardware efficiency. Similar
techniques can be used in other parts of video coding such as the
integer transform, the deblocking filter, etc.
Chapter 4 
VLSI architectures for video 
encoding 
4.1 Introduction 
In this chapter variable block size motion estimation
architectures are evaluated for the requirements of the H.264/AVC
standard. Suitable consideration of different design tradeoffs can
lead to an efficient architecture design for a given motion
estimation algorithm. The purpose of this chapter is to evaluate
motion estimation algorithms, mainly full-search, from a hardware
point of view, assuming H.264/AVC. A new design metric that
considers processing speed in terms of throughput, silicon area
occupied, memory bandwidth requirement and power consumption
is introduced. Various VLSI architectures for full-search
motion estimation are evaluated based on this metric. We employ
1-D and 2-D systolic array, tree and signed-digit bit-serial
architectures in our family of motion estimation processors. In the
next section, our implementation platform is introduced. In the
subsequent sections, the different processor architectures are
described in detail, including the first reported MSB-first
bit-serial motion estimation processor. The real implementation
will be presented in the next chapter. A theoretical method to
determine the efficiency of different architectures is given in the
last section.
4.2 Implementation platform - (FPGA) 
A Field Programmable Gate Array (FPGA) [9] is a device that
consists of a large array of reconfigurable cells in a single chip.
Each of these cells is a computation unit which can be used to
implement logic functions. Configurable routing is used to allow
inter-cell communication across the reconfigurable cells. Xilinx
is a leading commercial company producing FPGA products and
the description that follows uses their terminology. A commonly
used family is the Virtex series, the flagship product being the
Virtex-5, which is based on 65 nm process technology.
A typical FPGA contains an array of individual cells called
logic cells (LC) interconnected by a matrix of wires and
programmable switches. Logic circuits are built from these
cells and interconnect. FPGAs also contain dedicated hardware
for commonly used building blocks like block memories, I/O
pins, clock management blocks (DCM), digital signal processing
blocks (DSP) and embedded microprocessors. These building
blocks enable system on chip (SOC) development on FPGA
platforms.
4.2.1 Basic FPGA architecture 
Each logic cell (LC) has look-up tables (LUT), D-type flip-flops
(DFF) and fast carry logic. An N-input LUT in the FPGA, with
N = 4 or N = 6, is a memory-like component that can be programmed
to compute any function of up to N inputs. One output is produced
by each LUT. DFFs can be used for registers, pipeline storage,
state machines, etc. Fast carry logic is a dedicated feature for
speeding up carry-based computation like addition. The basic
structure of an LC in the Xilinx Virtex-II Pro series, with two
4-input LUTs, two DFFs or latches, and a fast carry logic chain,
is shown in figure 4.1 [1].
Figure 4.1: FPGA Logic Cell Architecture (Xilinx Virtex-II Pro series)
Based on these logic cells and interconnects, an FPGA chip is
realized. Together with the clock management blocks, I/O blocks
and microprocessors in the FPGA, the generic FPGA architecture is
produced.
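The memory-like behavior of a LUT can be illustrated in a few lines of Python: programming the LUT fills a 2^N-entry truth table, and evaluating it is a single indexed read. The sketch is a conceptual model with hypothetical function names and does not reflect the Xilinx configuration format.

```python
def program_lut(func, n_inputs: int = 4) -> list[int]:
    """Fill a 2**n_inputs entry truth table from any boolean function."""
    table = []
    for address in range(1 << n_inputs):
        inputs = [(address >> i) & 1 for i in range(n_inputs)]
        table.append(func(*inputs) & 1)
    return table


def lut_read(table: list[int], inputs: list[int]) -> int:
    """Evaluate the programmed function: the inputs form the address."""
    address = sum(bit << i for i, bit in enumerate(inputs))
    return table[address]


# Example: the sum output of a full adder, using three of the four inputs.
sum_lut = program_lut(lambda a, b, c, d: a ^ b ^ c)
print(len(sum_lut), lut_read(sum_lut, [1, 1, 1, 0]))
```

Any function of up to four inputs costs the same single table, which is why logic depth on an FPGA is usually counted in LUT levels rather than in gates.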
4.2.2 DSP blocks in FPGA device 
Recently, DSP blocks have been included in FPGA fabrics [2] for
high-speed digital signal processing applications. In the first
generation of DSP blocks, e.g. in the Xilinx Virtex-II, only
multipliers were included. Later, multiply-and-add and
multiply-and-subtract were also included for different kinds of
DSP applications. Filtering and transform operations in video
codecs benefit from DSP resources. Their high performance and low
power consumption make DSP blocks attractive compared with the
same functions implemented in configurable logic blocks. The
second generation DSP block architecture in Xilinx Virtex-5 FPGAs
is shown in figure 4.2 [2].
Figure 4.2: DSP architecture in Xilinx Virtex-5 FPGA
4.2.3 Advantages employing FPGA 
Moore's Law [35] directly benefits the development of high-speed,
high-density and low-power FPGA devices. Advances in FPGA
architectures and process technologies have narrowed the
performance, area and power gaps between FPGA and ASIC devices
[14]. Modern FPGA devices are suitable for complex, large systems
that in the past could only be implemented in ASICs. Because
FPGAs are reconfigurable, designs can be updated at the hardware
level easily without replacement of the whole chip, which reduces
the overall system cost. Moreover, with advances in dynamic
reconfiguration technologies, an FPGA can selectively load
hardware for the period in which a function is needed and detach
it afterwards [5]. As a result, fewer logic resources are idle or
wasted compared to ASIC implementations in which all logic is
fixed.
FPGAs benefit motion estimation by providing excellent logic
performance, large amounts of silicon resources and flexible
hardware. Because the device is reconfigurable, a motion
estimation processor can be customized based on application
requirements. Our family of motion estimation processors is a set
of architectural choices for reconfiguration. This "load when
needed" methodology reduces the required resources and power
consumption and better fits the application requirements.
4.2.4 Commercial FPGA Device 
Many commercial FPGA families are currently available on the
market. Xilinx (www.xilinx.com) offers the Virtex and Spartan
series, Altera (www.altera.com) offers the Stratix and Cyclone
series and Actel (www.actel.com) offers the Fusion series. These
product lines range from high-end to low-end devices. We choose
the Virtex-II Pro platform from Xilinx for our implementation
because it satisfies most of the requirements for our motion
estimation processors. These devices are expensive, but a less
expensive choice, the Spartan series from Xilinx, is a good
substitute. The downside is that Spartan offers a smaller amount
of hardware resources and lower performance. Fortunately, it is
enough to fulfill low and mid-end applications in video
processing [14].
[Block diagram: search area memory, current block memory, address
generation unit, processing array, motion vector and minimum SAD
registers, global address bus and global data bus.]
Figure 4.3: Model of motion estimation processor
4.3 Top level architecture of motion estimation processor
The top level architecture of an H.264/AVC motion estimation
co-processor is shown in figure 4.3. In the motion estimation
processor, there are four fundamental units, namely the pixel
memory, processing array, address generation unit and motion
vector memory. Typically, there are two memories storing the
search area and the current block pixels. The processing array is
designed to calculate the required SADs. The address generation
unit calculates the memory addresses of the data to be fetched
next. For different search algorithms, different address
generation schemes are used. The resulting SADs and motion
vectors are stored in a small memory accessible via an external
bus, which acts as a communication bridge between the motion
estimation processor and a general purpose processor.
Figure 4.4: Data flow in systolic array over general implementation 
4.4 Bit-parallel architectures for motion estimation
In bit-parallel motion estimation architectures, systolic arrays
are commonly employed since they can effectively sustain high
bandwidth between memory and computation cores while at the
same time providing good performance by utilizing many
computation elements in parallel. We will give background on
systolic arrays and explore alternatives in using systolic arrays
for motion estimation.
4.4.1 Systolic arrays 
A systolic array [26] is an array processor architecture that
consists of a number of identical processing elements
interconnected via local communication links. The computation is
performed in a pipelined manner and results are passed through
the processing elements. Its advantages are low communication
overhead, simplicity and suitability for VLSI implementation.
Figure 4.4 demonstrates the data flow in a systolic array as
compared with a general approach.
Variable block size motion estimation algorithm
$SAD_{P,Q}(m,n) = \sum_{i=0}^{P-1} \sum_{j=0}^{Q-1} |c(i,j) - r(i+m, j+n)|$  (4.1)
$-range < m, n < +range$  (4.2)
$P, Q \in \{4, 8, 16\}$  (4.3)
$u = \min_{m,n} \{ SAD_{P,Q}(m,n) \}$  (4.4)
$MV_{min}(P,Q) = (m,n)$  (4.5)
where range is the search range having values ±16 or ±32.
Figure 4.5: Variable block size motion estimation algorithm
4.4.2 Mapping of a motion estimation algorithm onto systolic array
The algorithm is first decomposed into basic operations and con-
verted into a form where each result is assigned to a unique 
variable. Referring to chapter 2, the variable block size motion 
estimation algorithm for full search is defined in figure 4.5. 
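As a software reference for the full search defined in figure 4.5, the following minimal Python sketch evaluates the SAD of equation 4.1 at every displacement and keeps the minimum. The function name `full_search`, the layout of the search area array and the inclusive treatment of the search bounds are assumptions made for illustration; they are not taken from the thesis.

```python
def full_search(cur, area, P, Q, R):
    """Exhaustively evaluate equation 4.1 over the search window.

    cur  : P x Q current block, indexed cur[i][j]
    area : (P + 2R) x (Q + 2R) search area; area[i + R][j + R] is the
           pixel co-located with cur[i][j] (an assumed layout)
    R    : search range; displacements m, n run over -R..R inclusive
           (an assumption of this sketch)
    Returns (minimum SAD, motion vector) as in equations 4.4 and 4.5.
    """
    best_sad, best_mv = None, None
    for m in range(-R, R + 1):
        for n in range(-R, R + 1):
            sad = sum(abs(cur[i][j] - area[i + m + R][j + n + R])
                      for i in range(P) for j in range(Q))
            if best_sad is None or sad < best_sad:
                best_sad, best_mv = sad, (m, n)
    return best_sad, best_mv


# Small example: a 4x4 block cut out of an 8x8 search area at offset (1, -2).
area = [[8 * a + b for b in range(8)] for a in range(8)]
cur = [[area[i + 1 + 2][j - 2 + 2] for j in range(4)] for i in range(4)]
result = full_search(cur, area, 4, 4, 2)
```

The four nested loops correspond to the four-dimensional index space (i, j, m, n); a hardware mapping chooses which of these loops are unrolled into processing elements.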
Parameters P and Q have been added to reflect the variable block
size. In motion estimation, the SAD is computed over a
four-dimensional index space (i, j, m, n) for each macroblock in
a frame. Equation 4.1 involves two 2-D index spaces. The first is
generated by the indexes i, j used for calculating
$SAD_{P,Q}(m,n)$. The second is generated by m, n. The minimum
SAD is found and a motion vector deduced after exploring all
(m, n) pairs in the full search. The indexes i, j can be
projected onto a 1D or 2D systolic array, in which case the
number of computation nodes depends on the block size. The other
way is to project i, m onto the systolic plane, which results in
an array dependent on block size and search range. The
parallelism can be higher if the search range is larger than the
macroblock size. For different mapping alternatives, a number of
examples of mapping motion estimation algorithms to systolic
arrays are given in [24]. Each computation node handles a
subtraction, an absolute value operation, and an accumulation in
the systolic architectures.
[Block diagrams of the AD-1D and AD-2D processing elements, the
ADD node and the MIN node.]
Figure 4.6: Fundamental elements in systolic and tree architectures
For the description of motion estimation architectures some
basic elements are defined. They are the processing elements
(PE), such as AD-1D and AD-2D, the ADD node and the MIN node.
The detailed dataflow for these nodes is shown in figure 4.6. In
figure 4.6, FF stands for flip-flop, and the absolute difference,
addition and comparison units are implemented in a bit-parallel
fashion.
Figure 4.7: 1-D systolic architecture
4.4.3 1-D systolic array architecture (LA-1D)
Figure 4.7 depicts the 1D systolic processing array. r(x, y) and
c(x, y) stand for the reference block from the search area and
the current block respectively. It is classified as a local
accumulation architecture since it involves summation within the
processing node. This design requires external memories to store
the search area and current block, resulting in a high memory
bandwidth requirement.
Each AD PE calculates the absolute difference of two pixels
from the reference block and current block, adds the result to
the already calculated partial sum for the same search position
given by the neighbour PE, and passes the result to the next
PE. At the end of the chain of PEs, the SAD calculation is
finished. The result is compared with the previous minimum SAD
within the MIN element. This architecture is scalable in search
range and block size. The area used is small but the performance
is low compared to a 2D array. It is suitable for low-end
applications.
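The local accumulation described above can be illustrated with a purely behavioural Python sketch in which each PE adds its absolute difference to the partial sum received from its neighbour, and a MIN element tracks the best candidate. The names `ad_pe`, `chain_sad` and `MinNode` are hypothetical, and the model ignores timing, pipelining and memory access.

```python
def ad_pe(cur_pixel, ref_pixel, partial_in):
    """One AD processing element: |c - r| added to the incoming partial sum."""
    return partial_in + abs(cur_pixel - ref_pixel)


def chain_sad(cur_pixels, ref_pixels):
    """Push a partial sum through a chain of PEs, one pixel pair per PE."""
    partial = 0
    for c, r in zip(cur_pixels, ref_pixels):
        partial = ad_pe(c, r, partial)
    return partial


class MinNode:
    """MIN element: keeps the smallest SAD seen so far and its motion vector."""

    def __init__(self):
        self.min_sad = None
        self.mv = None

    def update(self, sad, mv):
        if self.min_sad is None or sad < self.min_sad:
            self.min_sad, self.mv = sad, mv


cur = [10, 20, 30, 40]
candidates = {(0, 0): [12, 25, 30, 41], (1, 0): [10, 21, 30, 40]}
node = MinNode()
for mv, ref in candidates.items():
    node.update(chain_sad(cur, ref), mv)
```

In this toy example the candidate at (1, 0) gives the smaller SAD, so the MIN element ends up holding that motion vector.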
4.4.4 2-D systolic array architecture (LA-2D) 
The 2D systolic processing array is a two dimensional version of
the LA-1D architecture. Figure 4.8 shows its datapath and the
timing of the data flow. The reference data is passed
horizontally from one AD-2D to the next. Data reuse is possible
by making use of delay lines and by moving data from one PE to
the next. This architecture offers the advantage of further
reducing memory bandwidth compared to the LA-1D architecture.
The current data is initially shifted to each PE and is stored
and reused until the current block motion vector is found. This
architecture requires a large area but the performance is high
because of the number of parallel computations. As a result, it
is suitable for high-end applications.
4.4.5 1-D Tree architecture (GA-1D)
The global accumulation architecture, GA-1D, is also referred
to as a "tree architecture". Figure 4.9 shows this architecture
for a 4x4 macroblock size. The absolute difference of a previous
and current pixel is calculated in the absolute value PE and the
result is accumulated in an adder tree external to the PE array.
The adder tree is usually implemented as a non-redundant adder
tree or a carry-save adder tree with pipeline registers inserted
between the stages. The previous data are fed continuously into
the PEs whereas the current block data are loaded when the
current block is changed and are kept in registers if local
caches are added.
Figure 4.8: 2-D systolic architecture
This architecture reduces the memory bandwidth for current pixel
data compared to the LA-1D architecture. It can also be fully
pipelined in an FPGA or ASIC. In addition, advanced features
such as variable block size motion estimation can be easily
supported in this architecture by adding intermediate registers
to store the SADs of subblocks.
4.4.6 2-D Tree architecture (GA-2D) 
The GA-2D architecture depicted in figure 4.10 is the two
dimensional extension of the GA-1D architecture, in which N x N
PEs are used; the figure illustrates the 4x4 block size case. As
shown in the figure, the search area pixels are fed continuously
into the PEs whereas the current block data is loaded only once
when the current block is changed.
Figure 4.9: 1-D Tree architecture 
The local communication between PEs reduces the memory bandwidth
for reference data compared with GA-1D. Since it avoids local
accumulation and can be fully pipelined efficiently in an FPGA,
the performance of this architecture is the best among the four
implementation alternatives.
4.4.7 Variable block size support in bit-parallel architectures
There are two methods to support variable block sizes in
bit-parallel architectures. For local accumulation architectures
like LA-1D and LA-2D, the first method is to include 16 partial
SAD registers in each node for the 4x4 subblocks within a 16x16
macroblock so that each register stores its corresponding
subblock SAD [6]. The second method is to divide the systolic
array into its smallest possible block size architecture
(sub-systolic array). For example, a 16x16 systolic array is
divided into sixteen 4x4 systolic arrays, each handling a 4x4
SAD. The sixteen 4x4 SADs are then combined to form larger SADs
via the SAD-reuse technique through an adder tree [7]. The first
method imposes a large register overhead on each processing node
and can have a large impact on silicon area. The second may
increase the required bandwidth as the local communications
between sub-systolic arrays are broken.
Figure 4.10: 2-D tree architecture
In tree architectures, the implementation is easier since there
is no partial SAD stored in the systolic elements. The support
of variable block sizes can be done in the adder tree by adding
intermediate registers connected to comparison units in
different stages of the adder tree. Similarly, using the
SAD-reuse technique, variable block size motion estimation is
supported without significantly sacrificing memory bandwidth and
silicon area.
4.5 Bit-serial motion estimation architecture 
4.5.1 Data Processing Direction 
In bit-serial implementations, processing data from the MSB or
from the LSB leads to the same result, but the performance may
differ. Generally, addition and subtraction favor LSB-first
processing but can be implemented MSB-first by deploying a
redundant number system. The final operation in motion
estimation is a comparison, which favors MSB-first processing:
the comparison can finish earlier if it is processed MSB-first.
As a result, for better throughput, we choose an MSB-first
implementation for bit-serial motion estimation.
4.5.2 Algorithm mapping and dataflow design 
When employing a bit-serial approach, the inherent bandwidth
requirement is reduced, since in each cycle the bandwidth
required is divided by n for n-bit pixels. As the bit-serial
architecture is not deeply pipelined, the pipeline flushing
caused by data hazards in data dependent algorithms such as TSS
and DS is smaller than that in systolic and tree designs. As a
result, bit-serial implementations are suitable for fast
algorithms and thus for low to mid-end applications.
[Diagram labels: BLK A, BLK B, BLK C, BLK D, current block,
BLK P, search window, starting point, previous motion vector,
predicted motion vector.]
$MV_D = \mathrm{Median}\{MV_A, MV_B, MV_C\}$
Figure 4.11: H.264/AVC motion vector prediction
4.5.3 Early termination scheme 
There are two related advantages to having a good initial value
for the minimum SAD. The first is that early termination of
comparisons against the current minimum SAD can be effected more
frequently. The second is that updates to the minimum SAD value
take extra cycles, and better initialization reduces their
occurrence. H.264/AVC uses a motion vector prediction mode
(figure 4.11), and we can initialize the search at the predicted
location.
In the typical case, this serves to reduce the number of SAD
updates as the search is started with a near-minimum value.
Table 4.1 summarizes our simulation results in Matlab, showing
the number of clock cycles needed to complete the comparison
operation for different video scenes with different motion
vector initialization strategies. An implementation without
early termination requires 16 cycles since the longest word size
is 16 bits for the comparison.
Video type   Sequential   Zero MV   Predicted MV
News         6.95         5.39      5.4
Flower       [illegible in source]
Stefan       7.26         6.54      6.46
Table 4.1: Number of cycles to complete comparison stage for different scenes
using different starting strategy (16 cycles for no early termination scheme)
The News scene is an almost still video. Zero-assumed motion and
predicted MV initialization perform better than a standard
sequential scheme. In fast motion scenes, such as Flower and
Stefan, which consist of a fast moving background and a sports
scene respectively, the H.264/AVC predicted MV initialization
scheme performs the best and has an average of 5.79 cycles. On
average our scheme offers a (16-5.79)/16 = 63.8% saving in
comparison operations. For the entire motion estimation
computation, in total (12+16) = 28 cycles (figure 4.20) are
required in the worst case, and on average our scheme offers a
36.5% improvement.
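The two percentages quoted above can be checked with a few lines of Python. The first follows the formula given in the text. For the second, the text does not spell out the calculation; dividing the same number of saved cycles by the 28-cycle worst case is one reading that reproduces the quoted 36.5%, and it is offered here as an assumption.

```python
FULL_COMPARE_CYCLES = 16       # comparison stage without early termination
WORST_CASE_CYCLES = 12 + 16    # whole computation in the worst case
AVG_COMPARE_CYCLES = 5.79      # average quoted for predicted MV initialization

saved_cycles = FULL_COMPARE_CYCLES - AVG_COMPARE_CYCLES

# Saving relative to the comparison stage alone, as stated in the text.
compare_saving = saved_cycles / FULL_COMPARE_CYCLES

# Saving relative to the whole 28-cycle computation (assumed reading).
overall_saving = saved_cycles / WORST_CASE_CYCLES

print(f"comparison stage: {compare_saving:.1%}, overall: {overall_saving:.1%}")
```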
4.5.4 Top-level architecture 
[Block diagram labels: current and reference block memory, 256
pairs, conventional number to signed-digit conversion, SD-adder
tree, 41 SADs, comparator, termination detector.]
Figure 4.12: Top level architecture of bit-serial motion estimation unit
Motion estimation involves the calculation of SAD values between
the current block and the reference block as shown in equation
2.4. By rewriting equation 2.4 in bit-serial fashion, we get
equation 4.6 with a triple summation.
$SAD_{P,Q}(m,n) = \sum_{i=0}^{P-1} \sum_{j=0}^{Q-1} \left| \sum_{k=0}^{7} 2^{k} \times (c(i,j,k) - r(i+m, j+n, k)) \right|$  (4.6)
The double summation (over P, Q) is mapped to the signed-digit
adder tree and computed spatially, while the innermost summation
(0 to 7) over the bits is computed iteratively. The remaining
problem is how to generate signed-digit numbers from current and
reference pixel values. Both current and reference pixels are
positive 8-bit integers. The computation of their difference in
signed-digit representation can be done easily by making the
current pixel positively weighted and the reference pixel
negatively weighted as discussed in section 3.5. The absolute
value operation can be done by on-the-fly checking of the
signed-digit number until 1 or -1 is detected for the first
non-zero digit. If -1 is encountered, the positively weighted
part is interchanged with the negatively weighted part to
complete the absolute value operation.
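The on-the-fly absolute value described above can be sketched in Python as follows. The current pixel supplies the positively weighted bits and the reference pixel the negatively weighted bits; the sign of the first non-zero digit decides whether the two parts are swapped. The function names and the representation of a signed-digit number as a list of (plus, minus) bit pairs are assumptions for illustration only.

```python
def to_bits_msb_first(value, width=8):
    """Bits of a non-negative integer, most significant bit first."""
    return [(value >> k) & 1 for k in range(width - 1, -1, -1)]


def abs_diff_signed_digit(c, r, width=8):
    """Return |c - r| as MSB-first (plus, minus) bit pairs.

    Each pair encodes the digit plus - minus, which lies in {-1, 0, 1}.
    """
    swap = False
    decided = False
    out = []
    for p, q in zip(to_bits_msb_first(c, width), to_bits_msb_first(r, width)):
        if not decided and p != q:
            decided = True       # first non-zero digit found
            swap = p < q         # digit is -1: the difference is negative
        out.append((q, p) if swap else (p, q))
    return out


def sd_value(pairs):
    """Evaluate an MSB-first signed-digit number."""
    value = 0
    for p, q in pairs:
        value = 2 * value + (p - q)
    return value


example = sd_value(abs_diff_signed_digit(13, 200))
```

Digits before the first non-zero digit are zero, so swapping them would make no difference; only the digits from that point onwards need the multiplexers.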
Then, we describe the entire bit serial motion estimation 
process in 4 stages: non-redundant positive number to signed-
digit number conversion, summation, comparison and early ter-
mination. The top level system is shown in figure 4.12. 
4.5.5 Non redundant positive number to signed digit conversion
As described in chapter 3, the $|c_i - r_i|$ operation, where
$c_i$ and $r_i$ are 8-bit positive integers from the current and
reference blocks, can be converted to an SD representation by
treating $c_i$ and $r_i$ as positively and negatively weighted
respectively, and finally performing a sign detection to check
whether changing the sign of the result is necessary. The circuit
that implements this functionality requires few hardware
resources and introduces little computation delay. A finite
state machine which detects the first non-zero digit is required
for the absolute value. Together with a pair of multiplexers for
interchanging the signed-digit parts, $|c_i - r_i|$ in
signed-digit form is produced.
[Flow chart labels: from current pixel and ref. pixel;
{R+, R-} = {r, c}; get next significant bit; reset?]
Figure 4.13: Flow chart of non-redundant to signed-digit number conversion
Figure 4.13 shows the flow chart for sign-detection of the 
signed-digit number. In total there are 16 x 16 = 256 absolute 
difference stages in our motion estimation processor. 
Figure 4.14: Signed-digit adder tree that generates 41 SADs 
4.5.6 Signed-digit adder tree 
The macroblock size of H.264/AVC is 16 pixels by 16 pixels with
a 4x4 block as its smallest sub-block. To find all the minimum
motion vectors of a 16x16 block and its subblocks, we employ a
SAD-reuse strategy [7]. As a result, the 4x4-SAD computation
becomes our primitive element and is reused to form other SADs.
Since the different macroblock modes overlap in the spatial
domain (figure 2.5), the larger SADs can be obtained from the
4x4-SADs through a sequence of merging steps. For example, 8x4
and 4x8-SADs can be formed by combining the corresponding
4x4-SADs (e.g. the 4x4-SADs of blocks 1 and 2 in figure 2.5 form
the 8x4-SAD of block 17). Similarly, an 8x8-SAD can be formed
from 4x8-SADs, 16x8 and 8x16-SADs can be calculated from
8x8-SADs and finally a 16x16-SAD is combined from 16x8-SADs. The
top level adder tree is shown in figure 4.14.
The SAD for a 4x4 subblock is produced by summing 16 pairs of
operands in signed-digit format, implying that we need to add 32
single-bit operand streams (16 positively and 16 negatively
weighted) in our adder tree. According to [50], we could
implement a 16-operand signed-digit adder tree based on two
ol-CSFA trees and an ol-SDFA. These primitives are shown in
figure 4.15. With this combination, the hardware utilization in
the adder tree is minimized [50]. The adder tree is illustrated
in figures 4.16 and 4.17 and consists of 8 levels with 8 cycles
of on-line delay. The total number of cycles to calculate the
12-bit summation including the on-line delay is 8 + 8 = 16
cycles. The output of a SAD4x4 adder tree is the SAD value of a
4x4 subblock in signed-digit format. This value is passed to the
SAD merger unit to calculate the other necessary SADs.
Figure 4.15: On-line carry save and signed digit adders
4.5.7 SAD merger 
Figure 4.16: A 16-operand carry save adder tree
In our design we need sixteen SAD4x4 adder trees to compute the
SADs of the 16 subblocks in parallel. The sixteen SAD4x4 values
are passed to the SAD merger as inputs (figure 4.18). The SAD
merger is a series of ol-SDFAs which combines the sixteen
4x4-SADs to form the 4x8, 8x4, 8x8, 16x8, 8x16 and 16x16 SADs.
The numbers shown in figure 4.18 indicate which block's SAD is
calculated at each node. The block indexes are shown in figure
2.5. In total, the number of ol-SDFAs in the SAD merger is
8+8+4+2+2+1 = 25. Pipeline registers are added between the
SAD4x4 adder trees and the SAD merger to split the combinatorial
path and boost the operating frequency. In our FPGA prototype,
one pipeline register gives a good balance between maximum
frequency and latency.
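The SAD-reuse merging described above can be modelled with ordinary integers in Python. This is a minimal sketch of the arithmetic only, not of the on-line signed-digit hardware. The function name `merge_sads`, the row-major indexing of the 4x4 grid and the width x height naming of block sizes are assumptions for illustration; the block numbering of figure 2.5 is not reproduced.

```python
def merge_sads(s):
    """Combine a 4x4 grid of 4x4-SADs (s[row][col]) into all larger SADs.

    Block sizes are written width x height, so an 8x4 block is two
    horizontally adjacent 4x4 blocks (an assumed naming convention).
    """
    sad8x4 = [s[r][c] + s[r][c + 1] for r in range(4) for c in (0, 2)]    # 8 adders
    sad4x8 = [[s[r][c] + s[r + 1][c] for c in range(4)] for r in (0, 2)]  # 8 adders
    sad8x8 = [[sad4x8[r][c] + sad4x8[r][c + 1] for c in (0, 2)]
              for r in range(2)]                                          # 4 adders
    sad16x8 = [sad8x8[r][0] + sad8x8[r][1] for r in range(2)]             # 2 adders
    sad8x16 = [sad8x8[0][c] + sad8x8[1][c] for c in range(2)]             # 2 adders
    sad16x16 = sad16x8[0] + sad16x8[1]                                    # 1 adder
    return {
        "4x4": [v for row in s for v in row],
        "8x4": sad8x4,
        "4x8": [v for row in sad4x8 for v in row],
        "8x8": [v for row in sad8x8 for v in row],
        "16x8": sad16x8,
        "8x16": sad8x16,
        "16x16": [sad16x16],
    }


grid = [[4 * r + c + 1 for c in range(4)] for r in range(4)]  # 1..16, row-major
sads = merge_sads(grid)
total = sum(len(v) for v in sads.values())
```

The adder counts in the comments add up to 8+8+4+2+2+1 = 25, matching the number of ol-SDFAs quoted in the text, and the dictionary holds 41 values in total.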
Finally, the 41 SAD values are passed to an on-line comparator
for final stage processing. Since the arrival times of the
different SAD results differ, the completion times for
determining the minimum SADs vary. Table 4.2 shows the delay for
each type of SAD.
4.5.8 Signed-digit comparator 
Figure 4.17: 16-operand signed-digit adder tree for 4x4 SADs
SAD type      Delay (cycles)
4x4           16
4x8, 8x4      19
8x8           21
8x16, 16x8    23
16x16         25
Table 4.2: On-line delay of different SAD types
In the comparison stage, we compare the current SAD to the
current minimum SAD for each subblock type in an MSB-first
manner. A signed-digit comparator is used for this purpose. The
architecture of the comparator suggested in [45] is shown in
figure 4.19. If the values of the two numbers being compared,
taken over the digits examined so far, differ by two or more, we
can determine which SD number is bigger. The on-line comparator
stops when this situation arises. A proof for this algorithm is
given in [45] and described in section 3.7.2. The on-line
comparator can determine the result in a minimum of 2 cycles.
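The stopping rule of the on-line comparator can be illustrated with a behavioural Python sketch. It keeps a running difference of the digits seen so far and stops once its magnitude reaches two, because the remaining digits can no longer change the sign. This is an interpretation of the rule stated above, not the circuit of [45]; the function name is hypothetical, and the cycle count returned simply counts digits consumed, ignoring any pipeline latency in the hardware.

```python
def online_compare(a_digits, b_digits):
    """Compare two signed-digit numbers supplied MSB first.

    Digits are in {-1, 0, 1} and both lists have the same length.
    Returns (sign, cycles): sign is 1 if a > b, -1 if a < b, 0 if equal;
    cycles is the number of digit positions examined.
    """
    diff = 0
    for cycle, (a, b) in enumerate(zip(a_digits, b_digits), start=1):
        diff = 2 * diff + (a - b)          # running difference, MSB first
        if abs(diff) >= 2:                 # later digits cannot flip the sign
            return (1 if diff > 0 else -1), cycle
    return (diff > 0) - (diff < 0), len(a_digits)


# 6 = [1, 0, -1, 0] versus 3 = [0, 0, 1, 1]: decided after two digits.
early = online_compare([1, 0, -1, 0], [0, 0, 1, 1])
```

The rule is safe because each remaining digit pair can change the running difference by at most 2 at its own weight, and the sum of those contributions stays strictly below the weight of a difference of two at the current position.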
4.5.9 Early termination controller 
[Diagram: ol-SDFA tree merging the sixteen SAD4x4 values, indexed
1 to 16 in row-major order, into the larger SADs. All connection
lines are 2-bit lines (signed-digit numbers).]
Figure 4.18: SAD merger
Early termination of the SAD computation avoids redundant
calculations. In terms of processor throughput, a 100% speed-up
can be achieved when 50% of the calculations are eliminated. In
our case, we have to deal with the effect of variable block
sizes on our early termination scheme. Since we have to compute
41 comparisons in parallel, some can be terminated earlier than
others. There exist dependencies between successive types of
SADs, e.g. 8x4 depends on 4x4, so we cannot terminate the 4x4
summation process even if we are sure the current 4x4 SAD must
be rejected. For the sake of simplicity, we check for early
termination on all SADs; when all have terminated, the current
summation is ended and the next search position begins.
Termination can be detected by OR-ing all the comparator
outputs.
[Figure 4.19: SD comparator]
4.5.10 Data scheduling and timeline 
In a bit-serial architecture, we need to handle word-to-serial
conversion, which is unnecessary in a bit-parallel design. In
addition, we have to handle the extra scheduling brought about
by MSB-first arithmetic. For example, the summation of sixteen
8-bit signed-digit numbers gives a 12-bit result, which involves
8 cycles of on-line delay. We have to feed 8 consecutive cycles
of all-zero operands into the adder tree to compensate for the
on-line delay. Similarly, a 16x16 SAD requires 12 consecutive
cycles of zeros as shown in figure 4.20. The 16-bit 16x16-SAD
result is calculated in 28 cycles in the worst case.
4.6 Decision metric in different architectural types
In this section we analyze different bit-parallel architectures
in terms of throughput, occupied silicon area, memory bandwidth
requirement and power consumption. Their values are estimated
theoretically. The analysis gives designers a first impression
of how these characteristics are affected by the choice of
architecture. We make the following assumptions in the analysis
that follows. The unit areas for primitive blocks are collected
from the Xilinx ISE implementation tools.
1. The delay of a full adder constitutes 1 unit time.
2. A 1-bit full adder occupies 12 unit areas.
3. A multiplexor (MUX) occupies 6 unit areas.
4. A FF occupies 8 unit areas.
5. The maximum frequency of processors depends on the total
unit time delay between pipeline registers.
6. The block size of a macroblock is N x N.
Operation          Unit time required
N-bit addition     $N \times 1 = N$
N-bit abs(a-b)     $(N + 1) \times 2 + 1 = 2N + 3$
N-bit comparison   $N + 1 + 1 = N + 2$
Table 4.3: Delays of primitive operations employed in bit-parallel motion
estimation architectures
Component   Unit area occupied
PE(1D)      12 x 12 + 2 x 9 x 12 + 6 x 8 + 8 x 12 = 504
PE(2D)      12 x 12 + 2 x 9 x 12 + 6 x 8 + 8 x 8 + 8 x 12 = 568
PE(Tree)    8 x 8 + 2 x 9 x 12 + 6 x 8 = 328
ADD         16 x 12 + 16 x 8 = 320
COMP        17 x 12 + 6 x 16 + (12 + 16) x 8 = 524
Table 4.4: Areas of primitive component employed in bit-parallel motion
estimation architectures
The delays and areas for primitives are deduced in tables 4.3
and 4.4. The PE for 2D systolic arrays has the largest hardware
demand and the PE for tree architectures the least. Since the
calculation method is different between bit-serial and
bit-parallel approaches, the comparison shown here doesn't
include bit-serial architectures. A comparison between the two
will be given in chapter 5 based on real data.
4.6.1 Throughput 
We define throughput in terms of the number of operations
performed in a cycle time. For systolic array type
architectures, a subtraction, an absolute value and an
accumulation are performed per cycle time. For the tree type
architectures, a subtraction and an absolute value are performed
per cycle time. As a result, the delay in tree PEs is less than
that of systolic PEs by an accumulation delay. Given that:
1. The delay of a systolic PE per unit cycle is (9+1+9)+12 = 31
units.
2. The delay of a tree PE per unit cycle is (9+1+9) = 19 units.
a tree architecture can ideally operate at a frequency
31/19 = 1.63 times higher than that of the systolic array. The
maximum throughput in table 4.5 can then be deduced.
Architectures   Operations per unit systolic cycle time
1D systolic     $((2N + 3) + (N + 4))N + N + (N + 2) = 3N^2 + 9N + 2$
2D systolic     $((2N + 3) + (N + 4))N^2 + \dots + N + 2$ [partly illegible in source]
1D tree         $1.63 \times ((2N + 3)N + N^2 + N + 2) = 1.63 \times (3N^2 + 4N + 2)$
2D tree         $1.63 \times ((2N + 3)N^2 + (N^2 - 1)N + N + 2) = 1.63 \times (3N^3 + 3N^2 + 2)$
Table 4.5: Throughput of different architectural types (N: block size)
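The delay figures above can be reproduced from the primitive delays of table 4.3 with a short Python sketch. It assumes 8-bit pixels for the absolute difference and reads the "+12" in the systolic PE delay as a 12-bit accumulation; both are interpretations rather than statements from the text. Only the two operation counts of table 4.5 that can be read unambiguously (1D systolic and 2D tree) are included, and the function names are hypothetical.

```python
def add_delay(bits):
    """N-bit addition: N unit times (table 4.3)."""
    return bits


def absdiff_delay(bits):
    """N-bit abs(a-b): (N + 1) * 2 + 1 = 2N + 3 unit times (table 4.3)."""
    return (bits + 1) * 2 + 1


PIXEL_BITS = 8   # assumed pixel width
ACC_BITS = 12    # assumed accumulator width behind the "+12" term

tree_pe_delay = absdiff_delay(PIXEL_BITS)                   # 9 + 1 + 9
systolic_pe_delay = tree_pe_delay + add_delay(ACC_BITS)     # plus accumulation
speedup = systolic_pe_delay / tree_pe_delay                 # tree vs systolic


def ops_1d_systolic(n):
    """Operation count for the 1D systolic array (table 4.5)."""
    return ((2 * n + 3) + (n + 4)) * n + n + (n + 2)


def ops_2d_tree_unscaled(n):
    """Operation count for the 2D tree before the 1.63 frequency factor."""
    return (2 * n + 3) * n * n + (n * n - 1) * n + n + 2


print(tree_pe_delay, systolic_pe_delay, round(speedup, 2))
print(ops_1d_systolic(16), round(speedup * ops_2d_tree_unscaled(16)))
```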
Table 4.5 shows the maximum throughput that can be achieved in
these architectures. With more hardware resources, 2D
architectures can perform more operations than 1D architectures.
Tree architectures perform better than the others because of
their higher frequency.
Usually, data dependencies prevent fully parallel operation of
the processing elements, whereas the calculations assume 100%
efficiency. 2D architectures are more sensitive to data
dependencies: their degradation is more significant than that of
1D arrays when data hazards occur.
Architectures          Bandwidth (bytes per unit systolic cycle time)
1D systolic            $2N$
2D systolic            $N + 1$
1D tree (No cache)     $3.26N$
1D tree (With cache)   $1.63N + 1.63$
2D tree                $1.63N + 1.63$
Table 4.6: Bandwidth requirement of different architectural types
4.6.2 Memory bandwidth 
Bandwidth requirements come from reading the current and reference
block pixels. Each pixel is 1 byte. In FS, the reference
block of one search point usually overlaps with those of
previous search points. How often the same pixels are re-read
within the FS determines the memory bandwidth. If there are
no local caches for the overlapped pixels, the resulting bandwidth
can be very high. Assuming that tree architectures operate at a frequency 1.63
times higher than that of the systolic arrays, the maximum bandwidth
requirements are as given in table 4.6.
The operating frequencies required for motion estimations are 
adjustable depending on the motion estimation algorithm. Full 
search requires the highest frequency while fast algorithms are 
less demanding. As a result, the memory bandwidth can be 
lowered by employing a fast search algorithm. 
4.6.3 Silicon area occupied and power consumption 
Silicon area depends approximately on the operation count per unit
cycle, while power consumption depends on the bandwidth requirement
and the throughput. Based on the assumptions made
above and on table 4.4, tables 4.7 and 4.8 were calculated.
$\alpha$ and $\beta$ are weighting parameters combining the effects of
throughput and bandwidth. Typically the bandwidth constant $\beta$
is larger, since reading or writing data over the memory bus
consumes more power.
Architecture            Silicon area
1D systolic             $504N + 320 + 524$
2D systolic             $568N^2 + 320N + 524$
1D tree (no cache)      $328N + 320N + 524$
1D tree (with cache)    $328N + 320N + 524 + 64N^2$
2D tree                 $328N^2 + 320(N^2 - 1) + 524$
Table 4.7: Area estimation of different architectural types (Variable block
sizes not supported)
Architecture            Power consumption
1D systolic             $\alpha(3N^2 + 9N + 2) + \beta(2N)$
2D systolic             $\alpha(3N^3 + 7N^2 + N + 2) + \beta(N + 1)$
1D tree (no cache)      $1.63\alpha(3N^2 + 4N + 2) + 1.63\beta(2N)$
1D tree (with cache)    $1.63\alpha(3N^2 + 4N + 2) + 1.63\beta(N + 1)$
2D tree                 $1.63\alpha(3N^3 + 3N^2 + 2) + 1.63\beta(N + 1)$
Table 4.8: Power estimation of different architectural types (Variable block
sizes not supported)
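The power model of table 4.8 can be explored numerically. The Python sketch below is illustrative only: the function names are hypothetical, and the thesis gives no values for alpha and beta, so any values passed in are placeholders chosen by the reader (with beta larger than alpha, as stated above).

```python
TREE_SPEEDUP = 1.63  # tree-to-systolic frequency ratio used in tables 4.5 to 4.8


def power_estimate(ops, bandwidth, alpha, beta, speedup=1.0):
    """Weighted sum of throughput (ops per cycle) and bandwidth (bytes per cycle)."""
    return speedup * (alpha * ops + beta * bandwidth)


def power_table(n, alpha, beta):
    """Evaluate the five rows of table 4.8 for block size n."""
    ops_1d_tree = 3 * n**2 + 4 * n + 2
    return {
        "1D systolic": power_estimate(3 * n**2 + 9 * n + 2, 2 * n, alpha, beta),
        "2D systolic": power_estimate(3 * n**3 + 7 * n**2 + n + 2, n + 1, alpha, beta),
        "1D tree (no cache)": power_estimate(ops_1d_tree, 2 * n, alpha, beta, TREE_SPEEDUP),
        "1D tree (with cache)": power_estimate(ops_1d_tree, n + 1, alpha, beta, TREE_SPEEDUP),
        "2D tree": power_estimate(3 * n**3 + 3 * n**2 + 2, n + 1, alpha, beta, TREE_SPEEDUP),
    }


if __name__ == "__main__":
    # alpha = 1 and beta = 10 are placeholder weights, not values from the thesis.
    for name, value in power_table(16, alpha=1.0, beta=10.0).items():
        print(f"{name:22s} {value:10.1f}")
```

With any positive beta, the cached 1D tree comes out below the uncached one, because only the bandwidth term differs between the two rows.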
4.7 Architecture selection for different applications
4.7.1 CIF and QCIF resolution 
CIF and QCIF resolutions are 352x288 and 176x144 respectively,
and 30 frames per second is a standard frame rate. For CIF, the motion
estimation hardware is thus required to process $\frac{352 \times 288}{16 \times 16} \times 30 = 11880$
macroblocks per second, which translates to 170 Gops/sec.
These resolutions are commonly used in mobile video delivery and
video conferencing, where real-time encoding is
desirable. Power consumption is also an issue in these applications.
All architectures described are capable of handling the
encoding in real time whichever algorithm is employed. 1D architectures
require 200MHz and 2D architectures require 15MHz
for real-time processing. The architecture finally chosen depends
on other requirements such as area, power consumption or
memory bandwidth.
4.7.2 SDTV resolution 
In standard-definition television, 704 x 480 and 640 x 480 at 30 fps have
been commonly used for the last two decades. Careful design
is needed in order to achieve real-time encoding. At 704 x 480, the
processing requires $\frac{704 \times 480}{16 \times 16} \times 30 = 39600$ macroblocks per second,
which translates to 570 Gops/sec. Our 1D and bit-serial architectures
are not able to process this in real time if full search is
employed. An operating frequency over 600MHz is required for 1D
architectures, which is not achievable in modern FPGA devices.
In contrast, only 50MHz is required for 2D architectures. Alternatively,
fast algorithms, e.g. TSS or DS, can be used with a
bit-serial design to achieve real-time operation with some loss of picture
quality. In addition, power consumption is usually not an
issue as these encoders are employed in non-mobile devices, e.g. set-top
boxes and televisions. As a result, 2D architectures, or bit-serial architectures
with fast search algorithms, suit this application domain.
4.7.3 HDTV resolution 
In HDTV applications, the resolutions involved are 720p, 1080i and
1080p. They are used in high-quality video editing or cinema
movie playback. The complexity is 4 to 10 times higher
than for DVD video. Particularly when variable block size,
multi-reference frames and sub-pixel motion estimation features
are used, the complexity is far beyond that of the MPEG-2
standard. To encode real-time video in these cases, a fully parallel
architecture with the highest parallelism is required. For
the highest profile, 1080p, the motion estimation processor has
to process $\frac{1920 \times 1080}{16 \times 16} \times 30 = 243000$ macroblocks per second (3499
Gops/sec), which is about 20 times higher than encoding CIF video
in real time. Typically, only 2D bit-parallel architectures at
300MHz with ASIC technology or a high-end FPGA implementation
can achieve this throughput.
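The macroblock rates and clock requirements quoted in section 4.7 can be reproduced with a few lines of Python. This is a sketch under stated assumptions: the function names are illustrative, and the cycle counts per macroblock (16401 for 1D and 1041 for 2D) are the full-search figures reported for the fixed-block-size systolic arrays in chapter 5. The computed frequencies are lower bounds, which the text rounds up.

```python
MB_SIZE = 16  # macroblock edge length in pixels

# Full-search cycles per macroblock, as reported in chapter 5.
CYCLES_1D = 16401
CYCLES_2D = 1041


def macroblocks_per_second(width, height, fps=30):
    """Number of 16x16 macroblocks to process each second."""
    return width * height / (MB_SIZE * MB_SIZE) * fps


def required_mhz(width, height, cycles_per_mb, fps=30):
    """Minimum clock frequency (MHz) for real-time full search."""
    return macroblocks_per_second(width, height, fps) * cycles_per_mb / 1e6


if __name__ == "__main__":
    for name, (w, h) in {"CIF": (352, 288), "SDTV": (704, 480), "1080p": (1920, 1080)}.items():
        print(name, macroblocks_per_second(w, h),
              round(required_mhz(w, h, CYCLES_1D), 1),
              round(required_mhz(w, h, CYCLES_2D), 1))
```

For CIF this gives roughly 195 MHz for a 1D array and 12 MHz for a 2D array, in line with the 200MHz and 15MHz figures in section 4.7.1.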
4.8 Summary 
In this chapter, several architecture alternatives were discussed
together with their implementation strategies. Bit-parallel 1-D,
2-D and tree architectures were discussed and their variable block
size support analyzed. For bit-serial designs, we suggested an
MSB-first architecture supporting variable block sizes. We also
presented an early termination scheme to save power and boost
performance. Several applications were suggested for mapping onto the
architectures. The supported applications range from
low-end to high-end.
Chapter 5 
Results and comparison 
5.1 Introduction 
In this chapter the results and a comparison of the different architectures
are presented. Based on the results, the family of motion
estimation processors supporting variable block size is built.
Lastly, we compare our architectures to previous work in the
literature.
5.2 Implementation details 
The proposed motion estimation architectures are written in
the VHDL hardware description language and implemented, simulated and verified
on a Xilinx Virtex-II Pro device of -6 speed grade. The motion
estimation processors are synthesized using Synplify Pro 8.4.
Place-and-route is performed and the power consumption is estimated
using XPower, provided with the Xilinx ISE tools. The area
utilization, maximum frequency and estimated power consumption
are shown in the following sections. In order to make a fair
comparison between different architectures, "performance per
slice/gate" and "power consumption per slice/gate" are
reported to show the efficiency of our motion estimation processors on the
FPGA platform.
We compare implementations of the architectures supporting
fixed block size and variable block size. The search
range is fixed to -16 to +15 so that the search area is 48 x
48 pixels in size. The macroblock size is 16 by 16. The full
search algorithm is employed. To measure the efficiency of the different
architectures, we ignore the area occupied by data storage
for current block pixels, reference block pixels, motion vectors
and minimum SADs since they are required in other parts of
an H.264/AVC coder. The amount of area that can be ignored
depends on the architecture. For 2D architectures the data storage
for the current block cannot be ignored since it is implemented
inside the systolic array. In contrast, for bit-serial architectures,
the minimum SADs and motion vectors can be stored in external
memories and this area is not counted. The throughput of the
architectures, in macroblocks per second, is given by
$\text{throughput} = \frac{\text{max frequency}}{\text{cycles per macroblock}}$, where "cycles
per macroblock" depends on the particular motion estimation
algorithm.
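The throughput relation can be checked against the legible entries of the result tables. The Python sketch below is a minimal illustration; the function name is hypothetical, and the frequency and cycle figures are those quoted in tables 5.1 and 5.2 and the accompanying text.

```python
def throughput_mb_per_s(max_freq_mhz, cycles_per_macroblock):
    """Macroblocks processed per second at the maximum clock frequency."""
    return max_freq_mhz * 1e6 / cycles_per_macroblock


if __name__ == "__main__":
    # (max frequency in MHz, cycles per macroblock) for three reported designs
    designs = {
        "1D systolic, variable block size": (230, 16404),
        "2D systolic, fixed block size": (232, 1041),
        "2D systolic, variable block size": (227, 1044),
    }
    for name, (freq, cycles) in designs.items():
        print(f"{name:34s} {throughput_mb_per_s(freq, cycles):10.0f}")
```

The results agree with the tabulated throughputs of 14021, 222862 and 217433 macroblocks per second to within one unit of rounding.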
5.2.1 Bit-parallel 1-D systolic array 
Column 1 in Table 5.1 summarizes the 1D systolic array implementation
without variable block size support. It consists of 16
processing elements, 1 ADD and 1 COMP component. With
pipelining, it takes 16 cycles on average to calculate 1
SAD. The total number of cycles needed is 32 x 32 x 16 + 17 = 16401
cycles to process one 16x16 macroblock.
Column 2 in Table 5.1 shows an implementation of a 1D
systolic array with variable block size support. Variable block
size support is achieved by converting the single 16-PE array into
four 4-PE arrays. An adder tree is connected to the outputs of
these four arrays to produce SADs for larger block sizes. In this
design, we have to add 3 ADD and 3 COMP components for
4x4 SAD comparisons. It takes 32 x 32 x 16 + 20 = 16404 cycles
to process a 16x16 macroblock.
                                 Fixed block size   Variable block size
Design strategy                  Bit parallel       Bit parallel
Max frequency (MHz)              [illegible]        230
Area (Slices)                    [illegible]        1457
Area (Gates)                     16788              25158
Throughput (MB/s)                14424              14021
Max bandwidth required (MB/s)    7615               6228
Performance/Slice                17.3               9.6
Performance/Gate                 0.859              0.557
Total power (mW)                 [illegible]        1794
Power/Slice (mW/slice)           1.61               1.23
Power/Gate (mW/gate)             0.080              0.071
Table 5.1: Results of 1D systolic array processor
5.2.2 Bit-parallel 2-D systolic array 
Column 1 in Table 5.2 shows the implementation results for a 2D
systolic array without variable block size support. It consists of
16 x 16 = 256 processing elements (PEs), 16 ADD and 1 COMP
component. Because of the pipeline stages, it takes 16 cycles of
latency to calculate 1 SAD. When employing algorithms free of data
dependencies, e.g. full search, only 1 cycle is required on average
to calculate 1 SAD because of pipelining. As a
result, the total number of cycles needed is 32 x 32 + 17 = 1041
cycles for one 16x16 macroblock.
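The full-search cycle counts quoted in this section follow one pattern, which the short Python sketch below makes explicit. The function and parameter names are illustrative, and reading the additive constants (17 for fixed and 20 for variable block size) as a fixed overhead is an interpretation rather than something the text states.

```python
def full_search_cycles(cycles_per_sad, overhead, positions_per_axis=32):
    """Cycles to evaluate every search position of one macroblock.

    positions_per_axis: a search range of -16 to +15 gives 32 positions per axis.
    overhead: additive constant quoted in the text (interpreted as overhead).
    """
    return positions_per_axis * positions_per_axis * cycles_per_sad + overhead


if __name__ == "__main__":
    print("1D systolic, fixed block size:   ", full_search_cycles(16, 17))
    print("1D systolic, variable block size:", full_search_cycles(16, 20))
    print("2D systolic, fixed block size:   ", full_search_cycles(1, 17))
    print("2D systolic, variable block size:", full_search_cycles(1, 20))
```

The sixteen-fold difference in cycles per SAD is what separates the 1D and 2D cycle counts; the overhead term is negligible in both cases.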
Column 2 in Table 5.2 shows the implementation results for
variable block size support, achieved by converting the 16x16-PE array into
sixteen 4x4-PE arrays. An adder tree is connected to the outputs
of these 16 arrays to form SADs of larger block sizes. In this
design, we need to add 48 ADD units, 15 COMP units and an adder
tree to support variable block size. Similar to the fixed block size
architecture, it takes 32 x 32 + 20 = 1044 cycles to process a
16x16 macroblock.
                                 Fixed block size   Variable block size
Design strategy                  Bit parallel       Bit parallel
Max frequency (MHz)              232                227
Area (Slices)                    9478               10794
Area (Gates)                     193345             215315
Throughput (MB/s)                222862             217433
Max bandwidth required (MB/s)    7424               29056
Performance/Slice                23.5               20.1
Performance/Gate                 1.15               1.01
Total power (mW)                 26755              29875
Power/Slice (mW/slice)           2.82               [illegible]
Power/Gate (mW/gate)             0.138              0.139
Table 5.2: Results of 2D systolic array processor
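The reuse of partial SADs can be illustrated in software. The Python sketch below is not a model of the hardware adder tree; it only shows how the sixteen 4x4 SADs of a macroblock can be summed into the SADs of the larger H.264/AVC partitions. The function name and the width-by-height naming of the partitions are choices made for this example.

```python
def variable_block_sads(sad4x4):
    """Combine 4x4 SADs into SADs for all larger partitions of a 16x16 macroblock.

    sad4x4[r][c] is the SAD of the 4x4 block in row r, column c (0..3 each).
    Partition names are width x height.
    """
    s = sad4x4
    out = {"4x4": [s[r][c] for r in range(4) for c in range(4)]}
    # Two horizontally adjacent 4x4 blocks form an 8x4 block.
    out["8x4"] = [s[r][c] + s[r][c + 1] for r in range(4) for c in (0, 2)]
    # Two vertically adjacent 4x4 blocks form a 4x8 block.
    out["4x8"] = [s[r][c] + s[r + 1][c] for r in (0, 2) for c in range(4)]
    # Four 4x4 blocks form an 8x8 block (order: TL, TR, BL, BR).
    out["8x8"] = [s[r][c] + s[r][c + 1] + s[r + 1][c] + s[r + 1][c + 1]
                  for r in (0, 2) for c in (0, 2)]
    b = out["8x8"]
    out["16x8"] = [b[0] + b[1], b[2] + b[3]]   # top half, bottom half
    out["8x16"] = [b[0] + b[2], b[1] + b[3]]   # left half, right half
    out["16x16"] = [sum(b)]
    return out


if __name__ == "__main__":
    grid = [[4 * r + c for c in range(4)] for r in range(4)]  # dummy SAD values
    for size, values in variable_block_sads(grid).items():
        print(size, values)
```

Every pixel difference is computed once at the 4x4 level; the 41 partition SADs then cost only additions.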
5.2.3 Bit-parallel Tree architecture 
Table 5.3 shows the implementation results of the 1D tree architectures,
including variable block size support and pipelining.
The 1D tree is the most efficient of all the 1D architectures.
It consists of 16 PEs, each handling an
absolute difference operation. Since accumulation is eliminated,
its operating frequency is higher. In total, it takes 16390
cycles to perform a full search of 1 macroblock. For variable
block size support, four 4x1 trees are needed, with 3 additional
comparison units. It then takes 16393 cycles to perform a full
search of 1 macroblock.
Table 5.4 shows the implementation results for the 2D tree architectures.
The 2D tree has the highest throughput among all architectures
and stores the current block pixels in each of its PEs. Thus its
area may appear larger. This is a tradeoff, as memory bandwidth
can be greatly reduced. Variable block size is supported by attaching
comparison elements at the appropriate positions of the adder
tree to produce SADs of larger block sizes. Both architectures
take 1049 cycles to process one 16x16 macroblock.
                                 Fixed block size   Variable block size
Design strategy                  Bit parallel       Bit parallel
Max frequency (MHz)              240                [illegible]
Area (Slices)                    350                925
Area (Gates)                     [illegible]        20614
Throughput (MB/s)                14643              14641
Max bandwidth required (MB/s)    7680               7680
Performance/Slice                41.8               15.8
Performance/Gate                 1.768              0.71
Total power (mW)                 1160               2834
Power/Slice (mW/slice)           3.31               3.06
Power/Gate (mW/gate)             0.140              0.137
Table 5.3: Results of 1D tree-based motion estimation processor
5.2.4 MSB-first bit-serial design 
The bit-serial design fills the gap between 1D and 2D architectures
and balances throughput, bandwidth, area and power consumption.
It takes 18432 cycles to process one macroblock with the full
search algorithm. Table 5.5 gives its implementation results and
capability with variable block size support.
                                 Fixed block size   Variable block size
Design strategy                  Bit parallel       Bit parallel
Max frequency (MHz)              [illegible]        [illegible]
Area (Slices)                    5789               8513
Area (Gates)                     142234             18432[?]
Throughput (MB/s)                230547             228924
Max bandwidth required (MB/s)    3920               3920
Performance/Slice                39.8               26.9
Performance/Gate                 1.62               1.24
Total power (mW)                 19324              32337
Power/Slice (mW/slice)           3.34               3.79
Power/Gate (mW/gate)             0.136              0.175
Table 5.4: Results of 2D tree-based motion estimation processor
                                 Variable block size
Design strategy                  Bit serial
Max frequency (MHz)              420
Area (Slices)                    2133
Area (Gates)                     55301
Throughput (MB/s)                23068
Max bandwidth required (MB/s)    13440
Performance/Slice                10.8
Performance/Gate                 0.417
Total power (mW)                 13919
Power/Slice (mW/slice)           6.5
Power/Gate (mW/gate)             0.252
Table 5.5: Results of MSB-first bit-serial processor
Figure 5.1: Throughput of different motion estimation architectures at different
resolutions (bar chart over the architectures 1DSYS, 2DSYS, 1DTree, 2DTree and BS;
series: CIF and HDTV)
5.3 Comparison between motion estimation architectures
5.3.1 Throughput and latency 
The throughputs of different architectures at different resolutions
are shown in figure 5.1. The throughput of the 2D bit-parallel
designs is the highest among all alternatives owing to their pipelined
and parallel characteristics. For example, 2D tree architectures
have ten times the throughput of the bit-serial design. They also
have one tenth of the latency of the bit-serial design.
Comparing the 1D architectures to the bit-serial design, the bit-serial
architecture has better throughput when both are working
at their maximum frequency. Latency is similar between the 1D
systolic and bit-serial architectures.
Two-dimensional architectures are suitable for
high-throughput applications such as 1080p encoding and cinema-quality
video creation. Low-throughput applications, such as video
conferencing, are suited to the 1D systolic and bit-serial architectures.
Figure 5.2: Occupied slices of different motion estimation architectures
(bar chart over the architectures 1DSYS, 2DSYS, 1DTree, 2DTree and BS)
5.3.2 Occupied resources 
The areas occupied by the different architectures are shown in figure
5.2. The 2D architectures occupy the most resources, followed by
the bit-serial architecture. The 1D systolic array occupies
the least. In general, the area occupied is
proportional to performance.
As area directly affects the price of a hardware device,
a suitable architecture should be selected to minimize production
cost. With current technology, implementing 2D architectures
on an FPGA is still expensive since it requires over 10k
slices, which are usually provided only by high-end FPGA devices.
The 1D and bit-serial architectures are a moderate choice for
cost-constrained applications.
Figure 5.3: Bandwidth requirements of different motion estimation architectures
at CIF 30 fps (bar chart over the architectures 1DSYS, 2DSYS, 1DTree, 2DTree and BS)
5.3.3 Memory bandwidth 
The bandwidths required by the different architectures for CIF at 30 fps
are shown in figure 5.3. The bit-serial design has the largest bandwidth
requirement, 64 bytes/cycle, a consequence of its non-systolic
architecture. The local communication in the 1D and 2D systolic and
tree architectures significantly reduces bandwidth requirements.
Among the 1D and 2D architectures, the 1D architectures require more
bandwidth.
Memory bandwidth significantly affects power consumption.
In battery-powered applications, high-bandwidth architectures
should be avoided. Memory is also slow
compared to computation logic, so a smaller bandwidth requirement
means that data can be processed at a higher throughput.
5.3.4 Motion estimation algorithm 
Pipelining impedes efficient processing of data-dependent algorithms
such as fast motion estimation algorithms. The pipeline
must be flushed during the decision-making process, which is a kind
of data hazard. Flushing a pipeline wastes resources
and execution time. In 1D and 2D systolic arrays, 16 and 32 cycles
are flushed respectively when a data hazard occurs. For example,
TSS in a 2D systolic array requires 448 cycles to calculate
SADs for 25 search points. The number of cycles per search
point increases from 1.017 to 17.92 cycles/search point.
Figure 5.4: Throughput of different architectures at different motion estimation
algorithms (bar chart over the architectures 1DSYS, 2DSYS, 1DTree, 2DTree and BS)
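The cost of pipeline flushing can be seen from the cycles spent per search point. The Python sketch below simply divides the cycle totals quoted in this section by the number of search points; the function name is illustrative, and the bit-serial TSS total of 450 cycles is the approximate figure given in the text.

```python
def cycles_per_search_point(total_cycles, search_points):
    """Average number of cycles spent on each evaluated search position."""
    return total_cycles / search_points


if __name__ == "__main__":
    # 2D systolic array: full search (32 x 32 points) versus three step search.
    fs_2d = cycles_per_search_point(1041, 32 * 32)
    tss_2d = cycles_per_search_point(448, 25)
    # Bit-serial design: full search versus three step search (about 450 cycles).
    fs_bs = cycles_per_search_point(18432, 32 * 32)
    tss_bs = cycles_per_search_point(450, 25)
    print(f"2D systolic: FS {fs_2d:.3f}, TSS {tss_2d:.2f} cycles/point")
    print(f"bit-serial:  FS {fs_bs:.1f}, TSS {tss_bs:.1f} cycles/point")
```

The systolic array loses a factor of about 17.6 per search point when moving from FS to TSS, whereas the bit-serial figure stays at about 18 cycles per point in both cases.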
Since our bit-serial architecture is not systolic,
its efficiency for fast algorithms can be much higher. It is able
to process TSS in around 450 cycles compared to 18432 for FS.
The number of cycles per search point is kept almost constant.
Figure 5.4 shows the effect of TSS on the different architectures.
The bit-serial design performs best among the architectures.
Figure 5.5 summarizes the efficiency of the different architectures
per slice. FS and TSS are both given for comparison.
As a result, the algorithmic flexibility of the bit-serial design
is the highest among all architectures. It gives algorithm designers
a larger design space in which to develop algorithms
without being constrained by the hardware.
Figure 5.5: Maximum throughput per slice of different motion estimation
architectures (bar chart over the architectures 1DSYS, 2DSYS, 1DTree, 2DTree and BS;
series: FS and TSS)
5.3.5 Power consumption 
Power consumption is due to four main factors: the area occupied,
the operating frequency, the bandwidth requirement, and the algorithm
involved. Bandwidth can be reduced by introducing more
local memories, but the gate count of the architecture increases.
The power consumptions are shown in figure 5.6. The
high power consumption of the bit-serial architecture is due to the
high frequency it requires. Employing fast algorithms rather than
full search can greatly reduce its power consumption. Although
some quality may be lost, this is acceptable in many low-end applications.
As a result, the bit-serial design is still energy-efficient
with fast algorithms.
Since pipeline-based architectures favor full search, those architectures
are not a good choice for minimizing power, although
they require less bandwidth and a lower operating frequency. The bit-serial
design is a good solution as it does not discard any calculations in fast
algorithms. Figure 5.7 summarizes the power efficiencies of the different
architectures per slice.
Figure 5.6: Power consumptions of different architectures
(bar chart over the architectures 1DSYS, 2DSYS, 1DTree, 2DTree and BS)
Figure 5.7: Power efficiencies of different motion estimation architectures
(bar chart over the architectures 1DSYS, 2DSYS, 1DTree, 2DTree and BS)
5.4 Comparison to ASIC and FPGA architectures in past literature
In this section we compare our MSB-first bit-serial architecture
to previously reported
FPGA and ASIC designs. The bit-serial architecture is chosen for
comparison as it is the first reported bit-serial motion estimation
processor for H.264/AVC.
The processors compared include bit-parallel and bit-serial architectures
with or without variable block size support. Since we can
obtain an equivalent gate count from the Xilinx ISE tools, we are
able to compare our architectures to ASIC implementations. Note
that the gate count reported by the ISE tools is likely to be
overestimated; the equivalent
gate counts reported by ASIC tools are usually smaller. Table
5.6 and table 5.7 show the comparisons.
In the FPGA implementations, since variable block size is not
supported in some cases, their area utilizations are underestimated.
Thus, for those which do not support variable block size,
the performance per slice is not used for comparison. Among the
remainder, our bit-serial processor outperforms the other processors
supporting variable block size in performance per slice. Our
architecture is operating at the highest frequency among all ar-















































































































































































































































































                       [39]     [22]     [54]    [55]          Our bit-serial
Design strategy        BP       BP       BP      BP            BS
Num. PEs               256      256      16      16            N/A
Max frequency (MHz)    200      100      100     [illegible]   420
Area (Gates)           597k     154k     108k    61k           55k
Throughput (MB/s)      195313   97560    5560    [illegible]   23068
Performance/Gate       0.327    0.634    0.051   0.292         0.417
Table 5.7: Results and comparison of motion estimation processors on ASIC
devices
In the ASIC comparison, we select only architectures supporting H.264/AVC
for a fair comparison. The number of PEs in table
5.7 indicates the class to which each motion estimation processor belongs.
Typically, 16-PE architectures belong to the 1D class and 256-PE
architectures belong to the 2D class. In typical 1D implementations,
performance per gate is lowest, while fully parallel 2D architectures
obtain the highest score. Our bit-serial design scores in
between, and sometimes better than 2D architectures, even though its
gate count is overestimated.
5.5 Summary 
In this chapter a comparison of the different architectures was presented
and analyzed. We created a family of hardware supporting
a range of throughput, bandwidth, area, power, and
flexibility. For any application with well-defined requirements,
a suitable architecture can be selected. This chapter also gives
an overview for designers to pick an appropriate architecture.
Chapter 6
Conclusion
In this work, we studied and analyzed the hardware architectures
for motion estimation in the latest video codec standard,
H.264/AVC. Through algorithmic, architectural and arithmetic
optimizations, we suggested and implemented a family of motion
estimation processors on an FPGA platform. Modifications
were made to architectures proposed in previous literature to
support variable block sizes. We proposed a family of architectures
with different throughputs, area utilizations, memory
bandwidths, power consumptions and algorithm flexibilities. As
a result, designers can select the appropriate one when all these
metrics are known.
6.1.1 Algorithmic optimizations
We studied several motion estimation algorithms from the
literature. The algorithms can be classified into two categories:
exhaustive search and fast search. The former is commonly
known as full search. The latter comprises a number
of different approaches that perform motion estimation via
heuristic techniques. Those studied were three step search,
two-dimensional logarithmic search and diamond search, which all
have slight differences in their search qualities and complexities.
We studied their computational requirements, search
qualities and ease of implementation in hardware. The available
algorithms enable tradeoffs between the throughput and picture
quality in our motion estimation processors.
Our family of motion estimation processors is able to process
the motion estimation algorithms presented at close to 100%
processing element utilization. High utilization of the systolic
and tree architectures can be achieved by employing full
search. For data-dependent algorithms such as TSS and DS, we
employ our bit-serial architectures to achieve a high utilization
ratio. As a result, our family of architectures can implement any
standard algorithm with a high utilization rate, in terms of
maximizing the logic performance. This flexibility is important
in many areas. For a given quality, developers always want
to achieve the highest performance via algorithm optimizations.
6.1.2 Architecture and arithmetic optimizations 
It is possible to map motion estimation algorithms efficiently
onto systolic arrays. Through systolic arrays, we can fully parallelize
the computations and reduce the required bandwidth. 1D
and 2D systolic and tree architectures were presented. We also
made modifications to systolic-based motion estimation hardware
to support variable block size motion estimation by employing
adder trees to enable the reuse of partial SADs.
At the computer arithmetic level, we studied both bit-parallel
and bit-serial approaches. In the bit-parallel architectures, we employ
conventional number systems to perform the
calculations. The 2D systolic array and 2D tree architectures were
developed for high-end applications. Small-area architectures such
as the 1D systolic array and 1D tree architectures were also
developed to support low-end applications. We also proposed a
bit-serial motion estimation processor for mid-end applications.
In the bit-serial design, we redefined the SAD operations
and employed redundant number and signed digit number systems.
After analyzing the properties of the comparison operation,
we employed an MSB-first approach to perform motion estimation
jointly with the early termination scheme. We further optimized
the early termination scheme with a more accurate starting point
obtained from the H.264/AVC motion vector prediction technique,
which can reduce the number of updates of the minimum SAD during
comparisons.
Hardware developers often search for the best tradeoffs between
performance, bandwidth, area and power. Our motion
estimation processors provide different characteristics:
some are optimized for performance, some for area, etc.
With predefined constraints in hand, hardware developers are
able to make tradeoffs within a short time, thus shortening the
development cycle. Without these measurements, designers can
only estimate the performance, area, etc. based on experience.
6.1.3 Implementation on a FPGA platform 
The different architectures are synthesized, implemented and placed-and-routed
on a Xilinx Virtex-II Pro device using Xilinx ISE and
Synplicity as synthesis and implementation tools. The maximum
frequency, slices occupied and power consumption are reported.
Among the bit-parallel architectures, encoding of HDTV at 28
fps can be achieved by the 2D architectures. Our bit-serial architecture
can perform encoding of CIF at 58 fps. On our FPGA
platform, our architectures are able to perform
real-time encoding up to 1080p. An area-throughput chart for the
architectures is shown in figure 6.1, showing the different architectures
mapped to different applications. The area is in terms of
Xilinx Virtex-II slices and the throughput calculations assume
adequate memory bandwidth for I/O to the video coder core.
Figure 6.1: Area vs throughput in different motion estimation architectures
(x-axis: throughput at CIF (fps); y-axis: area in slices)
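The frame rates quoted above follow from the measured throughputs in macroblocks per second. The Python sketch below shows the conversion; the function name is illustrative, HDTV is assumed here to mean 1920x1080, and the throughput figures are those reported for the bit-serial design (23068) and the variable-block-size 2D tree (228924) in chapter 5.

```python
def frames_per_second(throughput_mb_per_s, width, height, mb_size=16):
    """Frame rate sustained by a motion estimator of the given throughput."""
    macroblocks_per_frame = (width * height) / (mb_size * mb_size)
    return throughput_mb_per_s / macroblocks_per_frame


if __name__ == "__main__":
    # Bit-serial design at CIF (352x288, 396 macroblocks per frame).
    print(f"bit-serial, CIF:  {frames_per_second(23068, 352, 288):.1f} fps")
    # 2D tree with variable block size at 1080p (assumed meaning of HDTV here).
    print(f"2D tree, 1080p:   {frames_per_second(228924, 1920, 1080):.1f} fps")
```

Both results round down to the 58 fps and 28 fps figures stated in this section.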
When employing general purpose processors, previous work
[20] shows that a Pentium 4 2.8E GHz processor can only
encode CIF video at 0.28 fps when no algorithmic
optimizations are applied. Motion estimation occupies 65% of the
total encoding time. As a result, our designs are at least 81
times faster than motion estimation implementations on general
purpose processors. The systolic, tree and bit-serial architectures
proposed in this work show that an FPGA design can
have much higher performance than general purpose processors.
Moreover, the power consumption and memory bandwidth are
reduced at the same time, as the power needed for microprocessors
is in the range of 70 to 100 Watts (http://www.intel.com and
http://www.amd.com).
In this work, we proposed the first MSB-first bit-serial variable
block size motion estimation architecture for H.264/AVC.
The architecture, with a careful choice of algorithm, is able to
perform motion estimation efficiently on a low-cost FPGA device.
As an example, our bit-serial architecture is able to fit in
a low-cost Spartan-3 device such as the XC3S200.
6.2 Future work 
Video coding systems can be accelerated by hardware because
of inherent opportunities for parallelism. Optimizations of video
coding through software are limited since modern general purpose
processors are not able to compute at several giga-operations
per second. Algorithms such as motion estimation cannot
be implemented as efficiently on general purpose processors as with
ASIC or FPGA technologies, making the latter necessary for high
performance solutions.
Besides motion estimation, many algorithms in a video codec
can be accelerated:
1. Interpolation for fractional motion estimation that involves 
a large amount of pixel filtering. 
2. Integer transform from residue values to transformed coef-
ficients in the transform stage. 
3. Deblocking filtering of pixels between blocks in the deblock-
ing stage. 
Furthermore, higher radix (e.g. radix-4) bit-serial implemen-
tations of motion estimation processors may have performance 
advantages (although at the cost of increased area) and may bet-
ter exploit the 6-input LUTs in the recently announced Xilinx 
Virtex-5 device. 
A family of hardware cores for video codecs can be built in 
a similar fashion to this work. Although the complexities of 
these stages appear smaller than that of the motion estimation, 
they tend to be complicated in modern and future codecs. The 









Appendix A
VHDL sources
library IEEE;
use IEEE.STD_LOGIC_1164.ALL;

-- Online full adder: combinational carry, registered sum.
entity olFA is
    Port ( clk : in  STD_LOGIC;
           x   : in  STD_LOGIC;
           y   : in  STD_LOGIC;
           z   : in  STD_LOGIC;
           s   : out STD_LOGIC;
           c   : out STD_LOGIC);
end olFA;

architecture Behavioral of olFA is
    signal s2 : std_logic;
begin
    -- Sum and carry of a 1-bit full adder.
    s2 <= x xor y xor z;
    c  <= (x and y) or (x and z) or (y and z);

    -- Register the sum on the rising clock edge.
    process(clk)
    begin
        if (clk'event and clk = '1') then
            s <= s2;
        end if;
    end process;
end Behavioral;
A.2 Online Signed Digit Full Adder 
library IEEE; 
use IEEE.STD_LOGIC_1164.ALL; 
use IEEE.STD_LOGIC_ARITH.ALL; 
use IEEE.STD_LOGIC_UNSIGNED.ALL; 

entity olSDA is 
    Port ( clk : in  STD_LOGIC; 
           a   : in  STD_LOGIC; 
           b   : in  STD_LOGIC; 
           c   : in  STD_LOGIC; 
           d   : in  STD_LOGIC; 
           neg : out STD_LOGIC; 
           pos : out STD_LOGIC); 
end olSDA; 

architecture Behavioral of olSDA is 
    component olFA 
        Port ( clk : in  STD_LOGIC; 
               x   : in  STD_LOGIC; 
               y   : in  STD_LOGIC; 
               z   : in  STD_LOGIC; 
               s   : out STD_LOGIC; 
               c   : out STD_LOGIC); 
    end component; 

    signal c1, s1: std_logic; 
    signal c2, s2: std_logic; 
    signal d_p1: std_logic; 
    signal b_n: std_logic; 

begin 
    b_n <= not(b); 
    neg <= not(c2); 
    pos <= s2; 
    u1: olFA port map (clk, a, b_n, c, s1, c1); 
    u2: olFA port map (clk, c1, s1, d_p1, s2, c2); 

    process (clk) 
    begin 
        if (clk'event and clk = '1') then 
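The arithmetic behind the two-full-adder structure above can be sketched in Python as a digit-parallel model. Each operand digit is a (pos, neg) pair with value pos - neg. The first full adder absorbs a, not b and c; the second combines the carry from the next lower position, the first sum and the inverted d, and its outputs are read as pos = s2 and neg = not c2. The assumption that the missing part of the process registers the inverted d, as well as all function names, belong to this sketch and are not confirmed by the listing.

```python
from itertools import product

def full_adder(x, y, z):
    total = x + y + z
    return total & 1, total >> 1          # (sum, carry)

def sd_value(digits):
    """Value of MSD-first (pos, neg) digits."""
    value = 0
    for pos, neg in digits:
        value = 2 * value + (pos - neg)
    return value

def sd_add(x, y):
    """Add two equal-length signed-digit numbers; returns n + 1 digits."""
    n = len(x)
    stage1 = [full_adder(a, 1 - b, c) for (a, b), (c, _) in zip(x, y)]
    s2, c2 = [], []
    for j in range(-1, n):                # position -1 is the extra MSD
        carry_in = stage1[j + 1][1] if j + 1 < n else 0
        s_in = stage1[j][0] if j >= 0 else 1
        d_in = y[j][1] if j >= 0 else 0
        s, c = full_adder(carry_in, s_in, 1 - d_in)
        s2.append(s)
        c2.append(c)
    # The borrow (not c2) of the next lower position lands on this digit.
    return [(s2[k], 1 - c2[k + 1] if k + 1 <= n else 0) for k in range(n + 1)]

def all_sd_numbers(n):
    """Every n-digit operand, including the redundant (1, 1) zero digit."""
    return [list(d) for d in product([(0, 0), (1, 0), (0, 1), (1, 1)], repeat=n)]
```

No carry propagates further than one position, which is what allows the hardware to emit one result digit per cycle with a fixed latency.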




A.3 Online Full Adder Tree 
library IEEE; 
use IEEE.STD_LOGIC_1164.ALL; 
use IEEE.STD_LOGIC_ARITH.ALL; 
use IEEE.STD_LOGIC_UNSIGNED.ALL; 

entity olFA_tree_16op is 
    Port ( clk   : in  STD_LOGIC; 
           x1    : in  STD_LOGIC; 
           x2    : in  STD_LOGIC; 
           x3    : in  STD_LOGIC; 
           x4    : in  STD_LOGIC; 
           x5    : in  STD_LOGIC; 
           x6    : in  STD_LOGIC; 
           x7    : in  STD_LOGIC; 
           x8    : in  STD_LOGIC; 
           x9    : in  STD_LOGIC; 
           x10   : in  STD_LOGIC; 
           x11   : in  STD_LOGIC; 
           x12   : in  STD_LOGIC; 
           x13   : in  STD_LOGIC; 
           x14   : in  STD_LOGIC; 
           x15   : in  STD_LOGIC; 
           x16   : in  STD_LOGIC; 
           sum   : out STD_LOGIC; 
           carry : out STD_LOGIC); 
end olFA_tree_16op; 

architecture Behavioral of olFA_tree_16op is 
    signal s1,s2,s3,s4,s5,s6,s7,s8,s9,s10,s11,s12,s13,s14: std_logic; 
    signal c1,c2,c3,c4,c5,c6,c7,c8,c9,c10,c11,c12,c13,c14: std_logic; 
    signal s5_p1,s5_p2: std_logic; 
    signal s12_p1: std_logic; 
    signal x16_p1,x16_p2,x16_p3: std_logic; 
    signal s11_sp,s12_sp,c11_sp,c12_sp: std_logic; 
    signal x16_p2_sp: std_logic; 
    signal s6_sp,s7_sp,s8_sp,c6_sp,c7_sp,c8_sp: std_logic; 
    signal s5_p1_sp: std_logic; 

    component olFA 
        Port ( clk : in  STD_LOGIC; 
               x   : in  STD_LOGIC; 
               y   : in  STD_LOGIC; 
               z   : in  STD_LOGIC; 
               s   : out STD_LOGIC; 
               c   : out STD_LOGIC); 
    end component; 

begin 
    -- level 1 
    u1: olFA port map (clk,x1,x2,x3,s1,c1); 
    u2: olFA port map (clk,x4,x5,x6,s2,c2); 
    u3: olFA port map (clk,x7,x8,x9,s3,c3); 
    u4: olFA port map (clk,x10,x11,x12,s4,c4); 
    u5: olFA port map (clk,x13,x14,x15,s5,c5); 

    -- level 2 
    -- No pipeline 
    u6: olFA port map (clk,c1,s1,c2,s6,c6); 
    u7: olFA port map (clk,s2,c3,s3,s7,c7); 
    u8: olFA port map (clk,c4,s4,c5,s8,c8); 

    -- level 3 
    u9: olFA port map (clk,c6,s6,c7,s9,c9); 
    u10: olFA port map (clk,s7,c8,s8,s10,c10); 

    -- level 4 
    -- No pipeline 
    u11: olFA port map (clk,c9,s9,c10,s11,c11); 
    u12: olFA port map (clk,s10,s5_p2,x16_p3,s12,c12); 

    -- level 5 
    u13: olFA port map (clk,c11,s11,c12,s13,c13); 

    -- level 6 
    u14: olFA port map (clk,c13,s13,s12_p1,s14,c14); 

    process(clk) 
    begin 
        if (clk'event and clk = '1') then 
            x16_p1 <= x16; 
            x16_p2 <= x16_p1; 
            x16_p3 <= x16_p2; 
            s12_p1 <= s12; 
            s5_p1 <= s5; 
            s5_p2 <= s5_p1; 
        end if; 
    end process; 

    sum <= s14; 
    carry <= c14; 
end Behavioral; 
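A cycle-level Python sketch of the adder tree above is given below. It wires fourteen full adders exactly as in the port maps (u1 to u14), with registered sums, combinational carries and the three delay lines for x16, s5 and s12. Registers are assumed to start at zero, operands are assumed to be unsigned and fed MSB first, and the output stream (sum plus carry) is interpreted with a latency of six cycles. These interpretation details and all class and function names are assumptions of the sketch.

```python
from collections import deque

class OnlineFA:
    """Full adder with a registered sum and a combinational carry."""
    def __init__(self):
        self.s = 0
    def step(self, x, y, z):
        total = x + y + z
        s_out, self.s = self.s, total & 1   # output old sum, latch new one
        return s_out, total >> 1

class Delay:
    """n-cycle register chain."""
    def __init__(self, n):
        self.buf = deque([0] * n)
    def step(self, v):
        self.buf.append(v)
        return self.buf.popleft()

def tree_sum(operands, width):
    """Sum 16 unsigned operands that arrive one bit per cycle, MSB first."""
    assert len(operands) == 16
    u = {k: OnlineFA() for k in range(1, 15)}
    s5_d, x16_d, s12_d = Delay(2), Delay(3), Delay(1)
    total = 0
    for t in range(width + 6):              # six extra cycles flush the pipe
        x = [(v >> (width - 1 - t)) & 1 if t < width else 0 for v in operands]
        s1, c1 = u[1].step(x[0], x[1], x[2])        # level 1
        s2, c2 = u[2].step(x[3], x[4], x[5])
        s3, c3 = u[3].step(x[6], x[7], x[8])
        s4, c4 = u[4].step(x[9], x[10], x[11])
        s5, c5 = u[5].step(x[12], x[13], x[14])
        s6, c6 = u[6].step(c1, s1, c2)              # level 2
        s7, c7 = u[7].step(s2, c3, s3)
        s8, c8 = u[8].step(c4, s4, c5)
        s9, c9 = u[9].step(c6, s6, c7)              # level 3
        s10, c10 = u[10].step(s7, c8, s8)
        s11, c11 = u[11].step(c9, s9, c10)          # level 4
        s12, c12 = u[12].step(s10, s5_d.step(s5), x16_d.step(x[15]))
        s13, c13 = u[13].step(c11, s11, c12)        # level 5
        s14, c14 = u[14].step(c13, s13, s12_d.step(s12))  # level 6
        total += (s14 + c14) << (width + 5 - t)     # weight of this cycle
    return total
```

Every adder level adds one cycle of latency, which is why s5 and x16 are delayed before they join the tree at level 4 and s12 before level 6.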
A.4 SAD merger 
library IEEE; 
use IEEE.STD_LOGIC_1164.ALL; 
use IEEE.STD_LOGIC_ARITH.ALL; 
use IEEE.STD_LOGIC_UNSIGNED.ALL; 

entity olSDFA_tree is 
    Port ( clk         : in  STD_LOGIC; 
           SAD_4x4_p   : in  STD_LOGIC_VECTOR (15 downto 0); 
           SAD_4x4_n   : in  STD_LOGIC_VECTOR (15 downto 0); 
           oSAD_4x8_p  : out STD_LOGIC_VECTOR (7 downto 0); 
           oSAD_4x8_n  : out STD_LOGIC_VECTOR (7 downto 0); 
           oSAD_8x4_p  : out STD_LOGIC_VECTOR (7 downto 0); 
           oSAD_8x4_n  : out STD_LOGIC_VECTOR (7 downto 0); 
           oSAD_8x8_p  : out STD_LOGIC_VECTOR (3 downto 0); 
           oSAD_8x8_n  : out STD_LOGIC_VECTOR (3 downto 0); 
           oSAD_8x16_p : out STD_LOGIC_VECTOR (1 downto 0); 
           oSAD_8x16_n : out STD_LOGIC_VECTOR (1 downto 0); 
           oSAD_16x8_p : out STD_LOGIC_VECTOR (1 downto 0); 
           oSAD_16x8_n : out STD_LOGIC_VECTOR (1 downto 0); 
           oSAD_16x16_p : out STD_LOGIC; 
           oSAD_16x16_n : out STD_LOGIC); 
end olSDFA_tree; 

architecture Behavioral of olSDFA_tree is 
    component olSDA 
        Port ( clk : in  STD_LOGIC; 
               a   : in  STD_LOGIC; 
               b   : in  STD_LOGIC; 
               c   : in  STD_LOGIC; 
               d   : in  STD_LOGIC; 
               neg : out STD_LOGIC; 
               pos : out STD_LOGIC); 
    end component; 

    signal SAD_8x4_p, SAD_8x4_n : std_logic_vector(7 downto 0); 
    signal SAD_4x8_p, SAD_4x8_n : std_logic_vector(7 downto 0); 
    signal SAD_8x8_p, SAD_8x8_n : std_logic_vector(3 downto 0); 
    signal SAD_8x16_p, SAD_8x16_n : std_logic_vector(1 downto 0); 
    signal SAD_16x8_p, SAD_16x8_n : std_logic_vector(1 downto 0); 
    signal SAD_4x8_pipe_n, SAD_4x8_pipe_p: std_logic_vector(7 downto 0); 
    signal SAD_8x8_pipe_n, SAD_8x8_pipe_p: std_logic_vector(3 downto 0); 
    signal SAD_8x16_pipe_n, SAD_8x16_pipe_p: std_logic_vector(1 downto 0); 
    signal SAD_16x8_pipe_n, SAD_16x8_pipe_p: std_logic_vector(1 downto 0); 

begin 
    -- 4x8 SAD 
    SAD_4x8_1_2: olSDA port map (clk, SAD_4x4_p(0), SAD_4x4_n(0), 
        SAD_4x4_p(1), SAD_4x4_n(1), 
        SAD_4x8_pipe_n(0), SAD_4x8_pipe_p(0)); 
    SAD_4x8_3_4: olSDA port map (clk, SAD_4x4_p(2), SAD_4x4_n(2), 
        SAD_4x4_p(3), SAD_4x4_n(3), 
        SAD_4x8_pipe_n(1), SAD_4x8_pipe_p(1)); 
    SAD_4x8_5_6: olSDA port map (clk, SAD_4x4_p(4), SAD_4x4_n(4), 
        SAD_4x4_p(5), SAD_4x4_n(5), 
        SAD_4x8_pipe_n(2), SAD_4x8_pipe_p(2)); 
    SAD_4x8_7_8: olSDA port map (clk, SAD_4x4_p(6), SAD_4x4_n(6), 
        SAD_4x4_p(7), SAD_4x4_n(7), 
        SAD_4x8_pipe_n(3), SAD_4x8_pipe_p(3)); 
    SAD_4x8_9_10: olSDA port map (clk, SAD_4x4_p(8), SAD_4x4_n(8), 
        SAD_4x4_p(9), SAD_4x4_n(9), 
        SAD_4x8_pipe_n(4), SAD_4x8_pipe_p(4)); 
    SAD_4x8_11_12: olSDA port map (clk, SAD_4x4_p(10), SAD_4x4_n(10), 
        SAD_4x4_p(11), SAD_4x4_n(11), 
        SAD_4x8_pipe_n(5), SAD_4x8_pipe_p(5)); 
    SAD_4x8_13_14: olSDA port map (clk, SAD_4x4_p(12), SAD_4x4_n(12), 
        SAD_4x4_p(13), SAD_4x4_n(13), 
        SAD_4x8_pipe_n(6), SAD_4x8_pipe_p(6)); 




    SAD_8x4_1_5: olSDA port map (clk, SAD_4x4_p(0), SAD_4x4_n(0), 
        SAD_4x4_p(4), SAD_4x4_n(4), 
        SAD_8x4_n(0), SAD_8x4_p(0)); 
    SAD_8x4_2_6: olSDA port map (clk, SAD_4x4_p(1), SAD_4x4_n(1), 
        SAD_4x4_p(5), SAD_4x4_n(5), 
        SAD_8x4_n(1), SAD_8x4_p(1)); 
    SAD_8x4_3_7: olSDA port map (clk, SAD_4x4_p(2), SAD_4x4_n(2), 
        SAD_4x4_p(6), SAD_4x4_n(6), 
        SAD_8x4_n(2), SAD_8x4_p(2)); 
    SAD_8x4_4_8: olSDA port map (clk, SAD_4x4_p(3), SAD_4x4_n(3), 
        SAD_4x4_p(7), SAD_4x4_n(7), 
        SAD_8x4_n(3), SAD_8x4_p(3)); 
    SAD_8x4_9_13: olSDA port map (clk, SAD_4x4_p(8), SAD_4x4_n(8), 
        SAD_4x4_p(12), SAD_4x4_n(12), 
        SAD_8x4_n(4), SAD_8x4_p(4)); 
    SAD_8x4_10_14: olSDA port map (clk, SAD_4x4_p(9), SAD_4x4_n(9), 
        SAD_4x4_p(13), SAD_4x4_n(13), 
        SAD_8x4_n(5), SAD_8x4_p(5)); 
    SAD_8x4_11_15: olSDA port map (clk, SAD_4x4_p(10), SAD_4x4_n(10), 
        SAD_4x4_p(14), SAD_4x4_n(14), 
        SAD_8x4_n(6), SAD_8x4_p(6)); 




    -- 8x8 SAD 
    SAD_8x8_0: olSDA port map (clk, SAD_4x8_p(0), SAD_4x8_n(0), 
        SAD_4x8_p(2), SAD_4x8_n(2), 
        SAD_8x8_pipe_n(0), SAD_8x8_pipe_p(0)); 
    SAD_8x8_1: olSDA port map (clk, SAD_4x8_p(1), SAD_4x8_n(1), 
        SAD_4x8_p(3), SAD_4x8_n(3), 
        SAD_8x8_pipe_n(1), SAD_8x8_pipe_p(1)); 
    SAD_8x8_2: olSDA port map (clk, SAD_4x8_p(4), SAD_4x8_n(4), 
        SAD_4x8_p(6), SAD_4x8_n(6), 
        SAD_8x8_pipe_n(2), SAD_8x8_pipe_p(2)); 
    SAD_8x8_3: olSDA port map (clk, SAD_4x8_p(5), SAD_4x8_n(5), 
        SAD_4x8_p(7), SAD_4x8_n(7), 
        SAD_8x8_pipe_n(3), SAD_8x8_pipe_p(3)); 

    -- 8x16 SAD 
    SAD_8x16_0: olSDA port map (clk, SAD_8x8_p(0), SAD_8x8_n(0), 
        SAD_8x8_p(1), SAD_8x8_n(1), 
        SAD_8x16_pipe_n(0), SAD_8x16_pipe_p(0)); 
    SAD_8x16_1: olSDA port map (clk, SAD_8x8_p(2), SAD_8x8_n(2), 
        SAD_8x8_p(3), SAD_8x8_n(3), 
        SAD_8x16_pipe_n(1), SAD_8x16_pipe_p(1)); 

    -- 16x8 SAD 
    SAD_16x8_0: olSDA port map (clk, SAD_8x8_p(0), SAD_8x8_n(0), 
        SAD_8x8_p(2), SAD_8x8_n(2), 
        SAD_16x8_pipe_n(0), SAD_16x8_pipe_p(0)); 




    -- 16x16 SAD 
    SAD_16x16: olSDA port map (clk, SAD_8x16_p(0), SAD_8x16_n(0), 
        SAD_8x16_p(1), SAD_8x16_n(1), 
        oSAD_16x16_n, oSAD_16x16_p); 

    process (clk) 
    begin 
        if (clk'event and clk = '1') then 
            oSAD_4x8_p <= SAD_4x8_p; 
            oSAD_4x8_n <= SAD_4x8_n; 
            oSAD_8x4_p <= SAD_8x4_p; 
            oSAD_8x4_n <= SAD_8x4_n; 
            oSAD_8x8_p <= SAD_8x8_p; 
            oSAD_8x8_n <= SAD_8x8_n; 
            oSAD_8x16_p <= SAD_8x16_p; 
            oSAD_8x16_n <= SAD_8x16_n; 
            oSAD_16x8_p <= SAD_16x8_p; 
            oSAD_16x8_n <= SAD_16x8_n; 

            SAD_4x8_p <= SAD_4x8_pipe_p; 
            SAD_4x8_n <= SAD_4x8_pipe_n; 
            SAD_8x8_p <= SAD_8x8_pipe_p; 
            SAD_8x8_n <= SAD_8x8_pipe_n; 
            SAD_8x16_p <= SAD_8x16_pipe_p; 
            SAD_8x16_n <= SAD_8x16_pipe_n; 
            SAD_16x8_p <= SAD_16x8_pipe_p; 
            SAD_16x8_n <= SAD_16x8_pipe_n; 
        end if; 
    end process; 
end Behavioral; 
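The merging pattern of the listing above can be summarised with an integer-level Python sketch. It ignores the signed-digit, digit-serial representation and the pipeline registers and only reproduces which partial SADs are added. The block labels follow the listing. The index pairs for the last 4x8 block (14, 15), the last 8x4 block (11, 15) and the second 16x8 block are not visible in the listing and are inferred from the pattern, so they are assumptions of this sketch.

```python
def merge_sads(sad4x4):
    """Combine sixteen 4x4 SADs into the 41 variable-block-size SADs."""
    s = list(sad4x4)
    assert len(s) == 16
    sad4x8 = [s[2 * k] + s[2 * k + 1] for k in range(8)]
    pairs_8x4 = [(0, 4), (1, 5), (2, 6), (3, 7),
                 (8, 12), (9, 13), (10, 14), (11, 15)]
    sad8x4 = [s[a] + s[b] for a, b in pairs_8x4]
    sad8x8 = [sad4x8[0] + sad4x8[2], sad4x8[1] + sad4x8[3],
              sad4x8[4] + sad4x8[6], sad4x8[5] + sad4x8[7]]
    sad8x16 = [sad8x8[0] + sad8x8[1], sad8x8[2] + sad8x8[3]]
    sad16x8 = [sad8x8[0] + sad8x8[2], sad8x8[1] + sad8x8[3]]
    sad16x16 = sad8x16[0] + sad8x16[1]
    return {"4x4": s, "4x8": sad4x8, "8x4": sad8x4, "8x8": sad8x8,
            "8x16": sad8x16, "16x8": sad16x8, "16x16": [sad16x16]}
```

The 16 + 8 + 8 + 4 + 2 + 2 + 1 = 41 results match the 41-bit motion vector outputs of the comparator stage.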
A.5 Signed digit adder tree stage (top) 
library IEEE; 
use IEEE.STD_LOGIC_1164.ALL; 
use IEEE.STD_LOGIC_ARITH.ALL; 
use IEEE.STD_LOGIC_UNSIGNED.ALL; 

entity sd_16op_tree is 
    Port ( clk : in  STD_LOGIC; 
           -- rst : in STD_LOGIC; 
           p   : in  STD_LOGIC_VECTOR (15 downto 0); 
           n   : in  STD_LOGIC_VECTOR (15 downto 0); 
           neg : out STD_LOGIC; 
           pos : out STD_LOGIC); 
end sd_16op_tree; 

architecture Behavioral of sd_16op_tree is 
    component olFA_tree_16op 
        Port ( clk   : in  STD_LOGIC; 
               x1    : in  STD_LOGIC; 
               x2    : in  STD_LOGIC; 
               x3    : in  STD_LOGIC; 
               x4    : in  STD_LOGIC; 
               x5    : in  STD_LOGIC; 
               x6    : in  STD_LOGIC; 
               x7    : in  STD_LOGIC; 
               x8    : in  STD_LOGIC; 
               x9    : in  STD_LOGIC; 
               x10   : in  STD_LOGIC; 
               x11   : in  STD_LOGIC; 
               x12   : in  STD_LOGIC; 
               x13   : in  STD_LOGIC; 
               x14   : in  STD_LOGIC; 
               x15   : in  STD_LOGIC; 
               x16   : in  STD_LOGIC; 
               sum   : out STD_LOGIC; 
               carry : out STD_LOGIC); 
    end component; 

    component olSDA 
        Port ( clk : in  STD_LOGIC; 
               a   : in  STD_LOGIC; 
               b   : in  STD_LOGIC; 
               c   : in  STD_LOGIC; 
               d   : in  STD_LOGIC; 
               neg : out STD_LOGIC; 
               pos : out STD_LOGIC); 
    end component; 

    signal p_p, n_p: std_logic_vector(15 downto 0); 
    signal neg_p, pos_p: std_logic; 
    signal s1,s2,c1,c2: std_logic; 

begin 
    tree1: olFA_tree_16op port map (clk, p_p(0), p_p(1), 
        p_p(2), p_p(3), p_p(4), p_p(5), p_p(6), p_p(7), 
        p_p(8), p_p(9), p_p(10), p_p(11), p_p(12), 
        p_p(13), p_p(14), p_p(15), s1, c1); 
    tree2: olFA_tree_16op port map (clk, n_p(0), n_p(1), 
        n_p(2), n_p(3), n_p(4), n_p(5), n_p(6), n_p(7), 
        n_p(8), n_p(9), n_p(10), n_p(11), n_p(12), 
        n_p(13), n_p(14), n_p(15), s2, c2); 
    olSDA1: olSDA port map (clk, c1, c2, s1, s2, neg_p, pos_p); 

    process (clk) 
    begin 
        if (clk'event and clk = '1') then 
            p_p <= p; 
            n_p <= n; 
            neg <= neg_p; 





A.6 Absolute element 
library IEEE; 
use IEEE.STD_LOGIC_1164.ALL; 
use IEEE.STD_LOGIC_ARITH.ALL; 
use IEEE.STD_LOGIC_UNSIGNED.ALL; 

entity abs_stage is 
    Port ( clk        : in  STD_LOGIC; 
           msd        : in  STD_LOGIC; 
           ref_pixel  : in  STD_LOGIC; 
           curr_pixel : in  STD_LOGIC; 
           abs_pos    : out STD_LOGIC; 
           abs_neg    : out STD_LOGIC); 
end abs_stage; 

architecture Behavioral of abs_stage is 
    signal msd_pos, msd_neg : std_logic; 
    signal exchange: std_logic; 
begin 
    exchange <= not(msd_pos) and msd_neg; 

    process(clk) 
    begin 
        if (clk'event and clk = '1') then 
            if (msd = '1') then 
                msd_pos <= curr_pixel; 
                msd_neg <= ref_pixel; 
                if ((not(curr_pixel) and ref_pixel) = '1') then 
                    abs_pos <= ref_pixel; 
                    abs_neg <= curr_pixel; 
                else 
                    abs_pos <= curr_pixel; 
                    abs_neg <= ref_pixel; 
                end if; 
            else 
                if (exchange = '1') then 
                    abs_pos <= ref_pixel; 
                    abs_neg <= curr_pixel; 
                else 
                    abs_pos <= curr_pixel; 
                    abs_neg <= ref_pixel; 
                end if; 
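The visible part of the listing above can be mirrored by a short Python sketch. The pixel bits arrive MSB first; the current pixel drives the positive digit and the reference pixel the negative digit, and the two are exchanged when the leading bits are 0 (current) and 1 (reference). The sketch follows only the lines that survive in the listing, ignores the one-cycle output register, and uses helper names chosen for this example.

```python
def to_bits(value, width=8):
    """MSB-first bit list of an unsigned value."""
    return [(value >> (width - 1 - i)) & 1 for i in range(width)]

def abs_stage(curr_bits, ref_bits):
    """Return MSD-first (pos, neg) digits as produced by the visible logic."""
    exchange = curr_bits[0] == 0 and ref_bits[0] == 1
    return [(r, c) if exchange else (c, r)
            for c, r in zip(curr_bits, ref_bits)]

def sd_value(digits):
    value = 0
    for pos, neg in digits:
        value = 2 * value + (pos - neg)
    return value
```

When the leading bits differ, the digit stream represents the absolute difference of the two pixels. When they are equal, this sketch simply yields curr minus ref as a signed-digit number; how the complete design treats that case cannot be determined from the surviving lines.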





A.7 Absolute stage (top) 
library IEEE; 
use IEEE.STD_LOGIC_1164.ALL; 
use IEEE.STD_LOGIC_ARITH.ALL; 
use IEEE.STD_LOGIC_UNSIGNED.ALL; 

entity abs_stage_top is 
    Port ( clk        : in  STD_LOGIC; 
           msd        : in  STD_LOGIC; 
           ref_pixel  : in  STD_LOGIC_VECTOR (255 downto 0); 
           curr_pixel : in  STD_LOGIC_VECTOR (255 downto 0); 
           sd_num_pos : out STD_LOGIC_VECTOR (255 downto 0); 
           sd_num_neg : out STD_LOGIC_VECTOR (255 downto 0)); 
end abs_stage_top; 

architecture Behavioral of abs_stage_top is 
    component abs_stage 
        Port ( clk        : in  STD_LOGIC; 
               msd        : in  STD_LOGIC; 
               ref_pixel  : in  STD_LOGIC; 
               curr_pixel : in  STD_LOGIC; 
               abs_pos    : out STD_LOGIC; 
               abs_neg    : out STD_LOGIC); 
    end component; 

begin 
    abs_stage_generate: for i in 0 to 255 generate 
        abs_stage: abs_stage port map (clk, msd, ref_pixel(i), 
            curr_pixel(i), sd_num_pos(i), sd_num_neg(i)); 
    end generate; 
end Behavioral; 
A.8 Online comparator element 
library IEEE; 
use IEEE.STD_LOGIC_1164.ALL; 
use IEEE.STD_LOGIC_ARITH.ALL; 
use IEEE.STD_LOGIC_UNSIGNED.ALL; 

entity ol_comp is 
    Port ( clk    : in  STD_LOGIC; 
           rst    : in  STD_LOGIC; 
           p_pos  : in  STD_LOGIC; 
           p_neg  : in  STD_LOGIC; 
           q_pos  : in  STD_LOGIC; 
           q_neg  : in  STD_LOGIC; 
           result : out STD_LOGIC); 
end ol_comp; 

architecture Behavioral of ol_comp is 
    signal sign_p, sign_q: std_logic; 
    signal mag_p, mag_q: std_logic; 
    signal s_p, s_q: std_logic_vector(2 downto 0); 
    signal s_p0, s_p1, s_q0, s_q1: std_logic; 
    signal s_p2, s_q2: std_logic; 
    --signal set_p, set_q, reset_p, reset_q: std_logic; 

begin 
    -- initialization 
    sign_p <= p_neg; 
    sign_q <= q_neg; 
    mag_p <= p_pos or p_neg; 
    mag_q <= q_pos or q_neg; 

    -- concatenation of bit 2, bit 1 and bit 0 
    s_p <= s_p2 & s_p1 & s_p0; 
    s_q <= s_q2 & s_q1 & s_q0; 

    process (clk, rst) 
    begin 
        if (rst = '1') then 
            s_p0 <= '0'; 
            s_q0 <= '0'; 
            s_p1 <= '0'; 
            s_q1 <= '0'; 
            s_p2 <= '0'; 
            s_q2 <= '0'; 
            result <= '0'; 
        else 
            if (clk'event and clk = '1') then 
                -- bit 0 is always equal to mag of p and q 
                s_p0 <= mag_p; 
                s_q0 <= mag_q; 

                -- bit 1 always depends on bit 0 and sign 
                s_p1 <= s_p0 xor sign_p; 
                s_q1 <= s_q0 xor sign_q; 

                -- bit 1 depends on bit 0 and bit 2, bit 2 has higher priority 
                if ((s_p2 = '0' and s_q2 = '0') or 
                    (s_p2 = '1' and s_q2 = '1')) then 
                    s_p2 <= (not(sign_p) and s_p1) or 
                        (s_p1 and s_p0) or (not(s_q1 or s_q0) and sign_q); 
                    s_q2 <= (not(sign_q) and s_q1) or 
                        (s_q1 and s_q0) or (not(s_p1 or s_p0) and sign_p); 
                else 
                    s_p2 <= (not(sign_p) and s_p2) or 
                        (s_p2 and s_p0) or (not(s_q2 or s_q0) and sign_q); 
                    s_q2 <= (not(sign_q) and s_q2) or 
                        (s_q2 and s_q0) or (not(s_p2 or s_p0) and sign_p); 
                end if; 

                if (((s_p > s_q) and (s_p - s_q >= 2)) or 
                    ((s_q > s_p) and (s_q - s_p >= 2))) then 
                    result <= '1'; 
                else 
                    result <= '0'; 
                end if; 
            end if; 
        end if; 
    end process; 
end Behavioral; 
A.9 Comparator stage (top) 
library IEEE; 
use IEEE.STD_LOGIC_1164.ALL; 
use IEEE.STD_LOGIC_ARITH.ALL; 
use IEEE.STD_LOGIC_UNSIGNED.ALL; 

entity comp_stage_top is 
    Port ( clk           : in  STD_LOGIC; 
           rst           : in  STD_LOGIC; 
           SAD_4x4_pos   : in  STD_LOGIC_VECTOR (15 downto 0); 
           SAD_4x4_neg   : in  STD_LOGIC_VECTOR (15 downto 0); 
           SAD_4x8_pos   : in  STD_LOGIC_VECTOR (7 downto 0); 
           SAD_4x8_neg   : in  STD_LOGIC_VECTOR (7 downto 0); 
           SAD_8x4_pos   : in  STD_LOGIC_VECTOR (7 downto 0); 
           SAD_8x4_neg   : in  STD_LOGIC_VECTOR (7 downto 0); 
           SAD_8x8_pos   : in  STD_LOGIC_VECTOR (3 downto 0); 
           SAD_8x8_neg   : in  STD_LOGIC_VECTOR (3 downto 0); 
           SAD_8x16_pos  : in  STD_LOGIC_VECTOR (1 downto 0); 
           SAD_8x16_neg  : in  STD_LOGIC_VECTOR (1 downto 0); 
           SAD_16x8_pos  : in  STD_LOGIC_VECTOR (1 downto 0); 
           SAD_16x8_neg  : in  STD_LOGIC_VECTOR (1 downto 0); 
           SAD_16x16_pos : in  STD_LOGIC; 
           SAD_16x16_neg : in  STD_LOGIC; 
           mv_41_p       : out std_logic_vector(40 downto 0); 
           mv_41_n       : out std_logic_vector(40 downto 0); 
           stop          : out STD_LOGIC); 
end comp_stage_top; 

architecture Behavioral of comp_stage_top is 
    component ol_comp 
        Port ( clk    : in  STD_LOGIC; 
               rst    : in  STD_LOGIC; 
               p_pos  : in  STD_LOGIC; 
               p_neg  : in  STD_LOGIC; 
               q_pos  : in  STD_LOGIC; 
               q_neg  : in  STD_LOGIC; 
               result : out STD_LOGIC); 
    end component; 

    -- == Minimum motion vectors and temp motion vectors == 
    signal min_sad_4x4_p, min_sad_4x4_n: std_logic_vector(191 downto 0); 
    signal temp_sad_4x4_p, temp_sad_4x4_n: std_logic_vector(191 downto 0); 
    signal min_sad_4x8_p, min_sad_4x8_n: std_logic_vector(103 downto 0); 
    signal temp_sad_4x8_p, temp_sad_4x8_n: std_logic_vector(103 downto 0); 
    signal min_sad_8x4_p, min_sad_8x4_n: std_logic_vector(103 downto 0); 
    signal temp_sad_8x4_p, temp_sad_8x4_n: std_logic_vector(103 downto 0); 

    signal min_sad_8x8_p, min_sad_8x8_n: std_logic_vector(55 downto 0); 
    signal temp_sad_8x8_p, temp_sad_8x8_n: std_logic_vector(55 downto 0); 
    signal min_sad_8x16_p, min_sad_8x16_n: std_logic_vector(29 downto 0); 
    signal temp_sad_8x16_p, temp_sad_8x16_n: std_logic_vector(29 downto 0); 
    signal min_sad_16x8_p, min_sad_16x8_n: std_logic_vector(29 downto 0); 
    signal temp_sad_16x8_p, temp_sad_16x8_n: std_logic_vector(29 downto 0); 
    signal min_sad_16x16_p, min_sad_16x16_n: std_logic_vector(15 downto 0); 
    signal temp_sad_16x16_p, temp_sad_16x16_n: std_logic_vector(15 downto 0); 

    signal comp_result_4x4: std_logic_vector(15 downto 0); 
    signal comp_result_4x8: std_logic_vector(7 downto 0); 
    signal comp_result_8x4: std_logic_vector(7 downto 0); 
    signal comp_result_8x8: std_logic_vector(3 downto 0); 
    signal comp_result_8x16: std_logic_vector(1 downto 0); 
    signal comp_result_16x8: std_logic_vector(1 downto 0); 
    signal comp_result_16x16: std_logic; 

    signal sad_4x4_msd_p, sad_4x4_msd_n: std_logic_vector(15 downto 0); 
    signal sad_4x8_msd_p, sad_4x8_msd_n: std_logic_vector(7 downto 0); 
    signal sad_8x4_msd_p, sad_8x4_msd_n: std_logic_vector(7 downto 0); 
    signal sad_8x8_msd_p, sad_8x8_msd_n: std_logic_vector(3 downto 0); 
    signal sad_8x16_msd_p, sad_8x16_msd_n: std_logic_vector(1 downto 0); 
    signal sad_16x8_msd_p, sad_16x8_msd_n: std_logic_vector(1 downto 0); 
    signal sad_16x16_msd_p, sad_16x16_msd_n: std_logic; 

    signal cnt: std_logic_vector(3 downto 0); 
    signal min4x4: std_logic_vector(15 downto 0); 
    signal min4x8, min8x4: std_logic_vector(7 downto 0); 
    signal min8x8: std_logic_vector(3 downto 0); 
    signal min16x8, min8x16: std_logic_vector(1 downto 0); 
    signal min16x16: std_logic; 

begin 
    mv_41_p <= sad_4x4_msd_p & sad_4x8_msd_p & 
               sad_8x4_msd_p & sad_8x8_msd_p & 
               sad_8x16_msd_p & sad_16x8_msd_p & 
               sad_16x16_msd_p; 
    mv_41_n <= sad_4x4_msd_n & sad_4x8_msd_n & 
               sad_8x4_msd_n & sad_8x8_msd_n & 
               sad_8x16_msd_n & sad_16x8_msd_n & 
               sad_16x16_msd_n; 

    sad4x4: for i in 0 to 15 generate 
        sad_4x4_msd_p(i) <= min_sad_4x4_p(11+12*i) 
            when min4x4(i) = '0' else 
            temp_sad_4x4_p(11+12*i); 




    sad4x8: for i in 0 to 7 generate 
        sad_4x8_msd_p(i) <= min_sad_4x8_p(12+13*i) 
            when min4x8(i) = '0' 
            else temp_sad_4x8_p(12+13*i); 
        sad_4x8_msd_n(i) <= min_sad_4x8_n(12+13*i) 
            when min4x8(i) = '0' 
            else temp_sad_4x8_n(12+13*i); 
        sad_8x4_msd_p(i) <= min_sad_8x4_p(12+13*i) 
            when min8x4(i) = '0' 
            else temp_sad_8x4_p(12+13*i); 




    sad8x8: for i in 0 to 3 generate 
        sad_8x8_msd_p(i) <= min_sad_8x8_p(13+14*i) 
            when min8x8(i) = '0' 
            else temp_sad_8x8_p(13+14*i); 
        sad_8x8_msd_n(i) <= min_sad_8x8_n(13+14*i) 
            when min8x8(i) = '0' 
            else temp_sad_8x8_n(13+14*i); 
    end generate; 

    sad8x16: for i in 0 to 1 generate 
        sad_8x16_msd_p(i) <= min_sad_8x16_p(14+15*i) 
            when min8x16(i) = '0' 
            else temp_sad_8x16_p(14+15*i); 
        sad_8x16_msd_n(i) <= min_sad_8x16_n(14+15*i) 
            when min8x16(i) = '0' 
            else temp_sad_8x16_n(14+15*i); 
        sad_16x8_msd_p(i) <= min_sad_16x8_p(14+15*i) 
            when min16x8(i) = '0' 
            else temp_sad_16x8_p(14+15*i); 





    sad_16x16_msd_p <= min_sad_16x16_p(15) 
        when min16x16 = '0' 
        else temp_sad_16x16_p(15); 




for i in 0 to 15 generate 





    for i in 0 to 7 generate 
        ol_comp_4x8: ol_comp port map (clk, rst, SAD_4x8_pos(i), 
            SAD_4x8_neg(i), sad_4x8_msd_p(i), 
            sad_4x8_msd_n(i), comp_result_4x8(i)); 





    for i in 0 to 3 generate 
        ol_comp_8x8: ol_comp port map (clk, rst, SAD_8x8_pos(i), 
            SAD_8x8_neg(i), sad_8x8_msd_p(i), 
            sad_8x8_msd_n(i), comp_result_8x8(i)); 
    end generate; 

    ol_comp_16x8_8x16_generate: 
    for i in 0 to 1 generate 
        ol_comp_16x8: ol_comp port map (clk, rst, SAD_16x8_pos(i), 
            SAD_16x8_neg(i), sad_16x8_msd_p(i), 
            sad_16x8_msd_n(i), comp_result_16x8(i)); 
        ol_comp_8x16: ol_comp port map (clk, rst, SAD_8x16_pos(i), 
            SAD_8x16_neg(i), sad_8x16_msd_p(i), 
            sad_8x16_msd_n(i), comp_result_8x16(i)); 
    end generate; 

    ol_comp_16x16: ol_comp port map (clk, rst, SAD_16x16_pos, 
        SAD_16x16_neg, sad_16x16_msd_p, 
        sad_16x16_msd_n, comp_result_16x16); 

    stop <= comp_result_4x4(0) and comp_result_4x4(1) 
        and comp_result_4x4(2) and comp_result_4x4(3) 
        and comp_result_4x4(4) and comp_result_4x4(5) 
        and comp_result_4x4(6) and comp_result_4x4(7) 
        and comp_result_4x4(8) and comp_result_4x4(9) 
        and comp_result_4x4(10) and comp_result_4x4(11) 
        and comp_result_4x4(12) and comp_result_4x4(13) 
        and comp_result_4x4(14) and comp_result_4x4(15) 
        and comp_result_4x8(0) and comp_result_4x8(1) 
        and comp_result_4x8(2) and comp_result_4x8(3) 
        and comp_result_4x8(4) and comp_result_4x8(5) 
        and comp_result_4x8(6) and comp_result_4x8(7) 
        and comp_result_8x4(0) and comp_result_8x4(1) 
        and comp_result_8x4(2) and comp_result_8x4(3) 
        and comp_result_8x4(4) and comp_result_8x4(5) 
        and comp_result_8x4(6) and comp_result_8x4(7) 
        and comp_result_8x8(0) and comp_result_8x8(1) 
        and comp_result_8x8(2) and comp_result_8x8(3) 
        and comp_result_8x16(0) and comp_result_8x16(1) 
        and comp_result_16x8(0) and comp_result_16x8(1) 
        and comp_result_16x16; 
    process (clk, rst) 
    begin 
        if (rst = '1') then 
            cnt <= "0000"; 
            min4x4 <= (others => '0'); 
            min4x8 <= (others => '0'); 
            min8x4 <= (others => '0'); 
            min8x8 <= (others => '0'); 
            min8x16 <= (others => '0'); 
            min16x8 <= (others => '0'); 
            min16x16 <= '0'; 

            temp_sad_4x4_p <= (others => '0'); 
            temp_sad_4x4_n <= (others => '0'); 
            temp_sad_4x8_p <= (others => '0'); 
            temp_sad_4x8_n <= (others => '0'); 
            temp_sad_8x4_p <= (others => '0'); 
            temp_sad_8x4_n <= (others => '0'); 
            temp_sad_8x8_p <= (others => '0'); 
            temp_sad_8x8_n <= (others => '0'); 
            temp_sad_8x16_p <= (others => '0'); 
            temp_sad_8x16_n <= (others => '0'); 
            temp_sad_16x8_p <= (others => '0'); 
            temp_sad_16x8_n <= (others => '0'); 
            temp_sad_16x16_p <= (others => '0'); 
            temp_sad_16x16_n <= (others => '0'); 

            min_sad_4x4_p <= (others => '1'); 
            min_sad_4x4_n <= (others => '1'); 
            min_sad_4x8_p <= (others => '1'); 
            min_sad_4x8_n <= (others => '1'); 
            min_sad_8x4_p <= (others => '1'); 
            min_sad_8x4_n <= (others => '1'); 
            min_sad_8x8_p <= (others => '1'); 
            min_sad_8x8_n <= (others => '1'); 
            min_sad_8x16_p <= (others => '1'); 
            min_sad_8x16_n <= (others => '1'); 
            min_sad_16x8_p <= (others => '1'); 
            min_sad_16x8_n <= (others => '1'); 
            min_sad_16x16_p <= (others => '1'); 
            min_sad_16x16_n <= (others => '1'); 

        else 
            if (clk'event and clk = '1') then 
                cnt <= cnt + 1; 
                if (cnt = "0001") then 
                    for i in 0 to 15 loop 
                        if (comp_result_4x4(i) = '1') then 
                            min4x4(i) <= not(min4x4(i)); 
                        end if; 
                    end loop; 
                end if; 
                if (cnt = "0011") then 
                    for i in 0 to 7 loop 
                        if (comp_result_4x8(i) = '1') then 
                            min4x8(i) <= not(min4x8(i)); 
                        end if; 
                        if (comp_result_8x4(i) = '1') then 
                            min8x4(i) <= not(min8x4(i)); 




                    for i in 0 to 3 loop 
                        if (comp_result_8x8(i) = '1') then 
                            min8x8(i) <= not(min8x8(i)); 
                        end if; 
                    end loop; 
                end if; 
                if (cnt = "0111") then 
                    for i in 0 to 1 loop 
                        if (comp_result_16x8(i) = '1') then 
                            min16x8(i) <= not(min16x8(i)); 
                        end if; 
                        if (comp_result_8x16(i) = '1') then 
                            min8x16(i) <= not(min8x16(i)); 
                        end if; 
                    end loop; 
                end if; 
                if (cnt = "1001") then 
                    if (comp_result_16x16 = '1') then 
                        min16x16 <= not(min16x16); 
                    end if; 
                end if; 
                for i in 0 to 15 loop 
                    if (min4x4(i) = '0') then 
                        temp_sad_4x4_p(191-12*i downto 180-12*i) <= 
                            temp_sad_4x4_p(190-12*i downto 180-12*i) & sad_4x4_pos(i); 
                        temp_sad_4x4_n(191-12*i downto 180-12*i) <= 
                            temp_sad_4x4_n(190-12*i downto 180-12*i) & sad_4x4_neg(i); 
                        min_sad_4x4_p(191-12*i downto 180-12*i) <= 
                            min_sad_4x4_p(190-12*i downto 180-12*i) 
                            & temp_sad_4x4_p(191-12*i); 
                        min_sad_4x4_n(191-12*i downto 180-12*i) <= 
                            min_sad_4x4_n(190-12*i downto 180-12*i) 
                            & temp_sad_4x4_n(191-12*i); 
                    else 
                        min_sad_4x4_p(191-12*i downto 180-12*i) <= 
                            min_sad_4x4_p(190-12*i downto 180-12*i) & sad_4x4_pos(i); 
                        min_sad_4x4_n(191-12*i downto 180-12*i) <= 
                            min_sad_4x4_n(190-12*i downto 180-12*i) & sad_4x4_neg(i); 
                        temp_sad_4x4_p(191-12*i downto 180-12*i) <= 
                            temp_sad_4x4_p(190-12*i downto 180-12*i) 
                            & min_sad_4x4_p(191-12*i); 
                        temp_sad_4x4_n(191-12*i downto 180-12*i) <= 
                            temp_sad_4x4_n(190-12*i downto 180-12*i) 
                            & min_sad_4x4_n(191-12*i); 
                    end if; 
                end loop; 
                for i in 0 to 7 loop 
                    if (min4x8(i) = '0') then 
                        temp_sad_4x8_p(103-13*i downto 91-13*i) <= 
                            temp_sad_4x8_p(102-13*i downto 91-13*i) & sad_4x8_pos(i); 
                        temp_sad_4x8_n(103-13*i downto 91-13*i) <= 
                            temp_sad_4x8_n(102-13*i downto 91-13*i) & sad_4x8_neg(i); 
                        min_sad_4x8_p(103-13*i downto 91-13*i) <= 
                            min_sad_4x8_p(102-13*i downto 91-13*i) 
                            & temp_sad_4x8_p(103-13*i); 
                        min_sad_4x8_n(103-13*i downto 91-13*i) <= 
                            min_sad_4x8_n(102-13*i downto 91-13*i) 
                            & temp_sad_4x8_n(103-13*i); 
                    else 
                        min_sad_4x8_p(103-13*i downto 91-13*i) <= 
                            min_sad_4x8_p(102-13*i downto 91-13*i) & sad_4x8_pos(i); 
                        min_sad_4x8_n(103-13*i downto 91-13*i) <= 
                            min_sad_4x8_n(102-13*i downto 91-13*i) & sad_4x8_neg(i); 
                        temp_sad_4x8_p(103-13*i downto 91-13*i) <= 
                            temp_sad_4x8_p(102-13*i downto 91-13*i) 
                            & min_sad_4x8_p(103-13*i); 
                        temp_sad_4x8_n(103-13*i downto 91-13*i) <= 
                            temp_sad_4x8_n(102-13*i downto 91-13*i) 
                            & min_sad_4x8_n(103-13*i); 
                    end if; 
                    if (min8x4(i) = '0') then 
                        temp_sad_8x4_p(103-13*i downto 91-13*i) <= 
                            temp_sad_8x4_p(102-13*i downto 91-13*i) & sad_8x4_pos(i); 
                        temp_sad_8x4_n(103-13*i downto 91-13*i) <= 
                            temp_sad_8x4_n(102-13*i downto 91-13*i) & sad_8x4_neg(i); 
                        min_sad_8x4_p(103-13*i downto 91-13*i) <= 
                            min_sad_8x4_p(102-13*i downto 91-13*i) 
                            & temp_sad_8x4_p(103-13*i); 
                        min_sad_8x4_n(103-13*i downto 91-13*i) <= 
                            min_sad_8x4_n(102-13*i downto 91-13*i) 
                            & temp_sad_8x4_n(103-13*i); 
                    else 
                        min_sad_8x4_p(103-13*i downto 91-13*i) <= 
                            min_sad_8x4_p(102-13*i downto 91-13*i) & sad_8x4_pos(i); 
                        min_sad_8x4_n(103-13*i downto 91-13*i) <= 
                            min_sad_8x4_n(102-13*i downto 91-13*i) & sad_8x4_neg(i); 
                        temp_sad_8x4_p(103-13*i downto 91-13*i) <= 
                            temp_sad_8x4_p(102-13*i downto 91-13*i) 
                            & min_sad_8x4_p(103-13*i); 
                        temp_sad_8x4_n(103-13*i downto 91-13*i) <= 
                            temp_sad_8x4_n(102-13*i downto 91-13*i) 
                            & min_sad_8x4_n(103-13*i); 
                    end if; 
                end loop; 
                for i in 0 to 3 loop 
                    if (min8x8(i) = '0') then 
                        temp_sad_8x8_p(55-14*i downto 42-14*i) <= 
                            temp_sad_8x8_p(54-14*i downto 42-14*i) & sad_8x8_pos(i); 
                        temp_sad_8x8_n(55-14*i downto 42-14*i) <= 
                            temp_sad_8x8_n(54-14*i downto 42-14*i) & sad_8x8_neg(i); 
                        min_sad_8x8_p(55-14*i downto 42-14*i) <= 
                            min_sad_8x8_p(54-14*i downto 42-14*i) 
                            & temp_sad_8x8_p(55-14*i); 
                        min_sad_8x8_n(55-14*i downto 42-14*i) <= 
                            min_sad_8x8_n(54-14*i downto 42-14*i) 
                            & temp_sad_8x8_n(55-14*i); 
                    else 
                        min_sad_8x8_p(55-14*i downto 42-14*i) <= 
                            min_sad_8x8_p(54-14*i downto 42-14*i) & sad_8x8_pos(i); 
                        min_sad_8x8_n(55-14*i downto 42-14*i) <= 
                            min_sad_8x8_n(54-14*i downto 42-14*i) & sad_8x8_neg(i); 
                        temp_sad_8x8_p(55-14*i downto 42-14*i) <= 
                            temp_sad_8x8_p(54-14*i downto 42-14*i) 
                            & min_sad_8x8_p(55-14*i); 
                        temp_sad_8x8_n(55-14*i downto 42-14*i) <= 





                for i in 0 to 1 loop 
                    if (min8x16(i) = '0') then 
                        temp_sad_8x16_p(29-15*i downto 15-15*i) <= 
                            temp_sad_8x16_p(28-15*i downto 15-15*i) & sad_8x16_pos(i); 
                        temp_sad_8x16_n(29-15*i downto 15-15*i) <= 
                            temp_sad_8x16_n(28-15*i downto 15-15*i) & sad_8x16_neg(i); 
                        min_sad_8x16_p(29-15*i downto 15-15*i) <= 
                            min_sad_8x16_p(28-15*i downto 15-15*i) 
                            & temp_sad_8x16_p(29-15*i); 
                        min_sad_8x16_n(29-15*i downto 15-15*i) <= 
                            min_sad_8x16_n(28-15*i downto 15-15*i) 
                            & temp_sad_8x16_n(29-15*i); 
                    else 
                        min_sad_8x16_p(29-15*i downto 15-15*i) <= 
                            min_sad_8x16_p(28-15*i downto 15-15*i) & sad_8x16_pos(i); 
                        min_sad_8x16_n(29-15*i downto 15-15*i) <= 
                            min_sad_8x16_n(28-15*i downto 15-15*i) & sad_8x16_neg(i); 
                        temp_sad_8x16_p(29-15*i downto 15-15*i) <= 
                            temp_sad_8x16_p(28-15*i downto 15-15*i) 
                            & min_sad_8x16_p(29-15*i); 
                        temp_sad_8x16_n(29-15*i downto 15-15*i) <= 
                            temp_sad_8x16_n(28-15*i downto 15-15*i) 
                            & min_sad_8x16_n(29-15*i); 
                    end if; 
                    if (min16x8(i) = '0') then 
                        temp_sad_16x8_p(29-15*i downto 15-15*i) <= 
                            temp_sad_16x8_p(28-15*i downto 15-15*i) & sad_16x8_pos(i); 
                        temp_sad_16x8_n(29-15*i downto 15-15*i) <= 
                            temp_sad_16x8_n(28-15*i downto 15-15*i) & sad_16x8_neg(i); 
                        min_sad_16x8_p(29-15*i downto 15-15*i) <= 
                            min_sad_16x8_p(28-15*i downto 15-15*i) 
                            & temp_sad_16x8_p(29-15*i); 
                        min_sad_16x8_n(29-15*i downto 15-15*i) <= 
                            min_sad_16x8_n(28-15*i downto 15-15*i) 
                            & temp_sad_16x8_n(29-15*i); 
                    else 
                        min_sad_16x8_p(29-15*i downto 15-15*i) <= 
                            min_sad_16x8_p(28-15*i downto 15-15*i) & sad_16x8_pos(i); 
                        min_sad_16x8_n(29-15*i downto 15-15*i) <= 
                            min_sad_16x8_n(28-15*i downto 15-15*i) & sad_16x8_neg(i); 
                        temp_sad_16x8_p(29-15*i downto 15-15*i) <= 
                            temp_sad_16x8_p(28-15*i downto 15-15*i) 
                            & min_sad_16x8_p(29-15*i); 
                        temp_sad_16x8_n(29-15*i downto 15-15*i) <= 





                if (min16x16 = '0') then 
                    temp_sad_16x16_p(15 downto 0) <= 
                        temp_sad_16x16_p(14 downto 0) & sad_16x16_pos; 
                    temp_sad_16x16_n(15 downto 0) <= 
                        temp_sad_16x16_n(14 downto 0) & sad_16x16_neg; 
                    min_sad_16x16_p(15 downto 0) <= 
                        min_sad_16x16_p(14 downto 0) & temp_sad_16x16_p(15); 
                    min_sad_16x16_n(15 downto 0) <= 
                        min_sad_16x16_n(14 downto 0) & temp_sad_16x16_n(15); 
                else 
                    min_sad_16x16_p(15 downto 0) <= 
                        min_sad_16x16_p(14 downto 0) & sad_16x16_pos; 
                    min_sad_16x16_n(15 downto 0) <= 
                        min_sad_16x16_n(14 downto 0) & sad_16x16_neg; 
                    temp_sad_16x16_p(15 downto 0) <= 
                        temp_sad_16x16_p(14 downto 0) & min_sad_16x16_p(15); 
                    temp_sad_16x16_n(15 downto 0) <= 
                        temp_sad_16x16_n(14 downto 0) & min_sad_16x16_n(15); 
                end if; 
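The widths of the shift registers used in the comparator stage (192, 104, 56, 30 and 16 bits) are consistent with storing one SAD per block at the smallest width that holds the largest possible value for 8-bit pixels. The short Python sketch below reproduces these numbers; the reading of the register layout as "one field per block" is an interpretation of the slice expressions in the listing, and the function name is hypothetical.

```python
def sad_register_layout(block_pixels, block_count, pixel_max=255):
    """Return (bits per SAD, total register width) for one block size."""
    width = (pixel_max * block_pixels).bit_length()
    return width, width * block_count
```

For example, a 4x4 block has 16 pixels, so its SAD is at most 16 x 255 = 4080 and needs 12 bits, which gives 16 x 12 = 192 bits for all sixteen blocks.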




APPENDIX A. VHDL SOURCES 134 
A.10 MSB-first motion estimation processor 
library IEEE; 
use IEEE. STDXOGIC1164. ALL ； 
use IEEE. STDXOGIC"ARITH. ALL ； 
use IEEE. STD LOGIC UNSIGNED. ALL ； 
entity MSD SAD UNIT is 
Port ( elk : in STDXOGIC; 
rst : in STD LOGIC; 
curr.pixel : in STD LOGIC VECTOR (255 downto 0)； 
ref pixel : in STDLOGICVECTOR (255 downto 0) ； lo 
mv.addr : in STD LOGICVECTOR (5 downto 0)； 
mv.p: out STD LOGICVECTOR (40 downto 0); 
mv'n: out STD LOGIC VECTOR (40 downto 0); 
stop: out STD'LOGIC); 
end MSD.SAD.UNIT; 
architecture Behavioral of MSD SAD UNIT is 
component abs'stage'top 
Port ( elk : in STD LOGIC; 20 
msd : in STD'LOGIC; 
refpixel : in STD LOGIC VECTOR (255 downto 0)； 
currpixel : in STDLOGICVECTOR (255 downto 0)； 
sd.mim.pos ： out STD LOGIC VECTOR (255 downto 0)； 
sd.num.neg : out STDLOGICVECTOR (255 downto 0 ) ) ; 
end component ； 
component comp.stage.top 
Port ( elk : in STDXOGIC; 
rst : in STDXOGIC; 30 
SAD.4x4.pos : in STD LOGIC VECTOR (15 downto 0); 
SAD.4x4.neg : in STD LOGIC VECTOR (15 downto 0); 
SAD'4x8pos : in STD LOGIC VECTOR (7 downto 0)； 
SAD.4x8.neg : in STD LOGIC VECTOR (7 downto 0)； 
SAD.8x4.pos : in STD LOGIC VECTOR (7 downto 0)； 
SAD.8x4.neg : in STD LOGIC VECTOR (7 downto 0)； 
SAD-8x8pos : ill STD LOGIC VECTOR (3 downto 0)； 
SAD.8x8.neg : in STD LOGIC VECTOR (3 downto 0)； 
SAD.8xl6.pos : in STD LOGIC VECTOR (1 downto 0); 
SAD.8xl6.neg : in STD LOGIC VECTOR (1 downto 0) ； 40 
SAD.16x8.pos ： in STD LOGIC VECTOR (1 downto 0)； 
SAD.16x8.neg : in STD LOGIC VECTOR (1 downto 0); 
APPENDIX A. VHDL SOURCES 135 
SAD.16xl6.pos : in STD.LOGIC; 
SAD-16xl6'neg : in STD LOGIC; 
mv"41'p : out std.logic.vector (40 downto 0); 
mv"4rn : out std.logic.vector (40 downto 0); 
stop: out STD.LOGIC); 
end component ； 
50 
component sd.l6op.tree 
Port ( elk : in STD LOGIC; 
p : in STD LOGIC VECTOR (15 downto 0); 
n : in STD LOGIC VECTOR (15 downto 0); 
neg : out STD LOGIC; 
pos : out STD LOGIC)； 
end component ； 
component olSDFA'tree 
Port ( elk : in STDLOGIC; 60 
SAD.4x4.p : in STD LOGIC VECTOR (15 downto 0)； 
SAD.4x4-n : in STD LOGIC VECTOR (15 downto 0); 
oSAD"4x8p : out STD LOGIC VECTOR (7 downto 0)； 
oSAD_4x8.n : out STD LOGIC VECTOR (7 downto 0)； 
oSAD.8x4.p : out STD LOGIC VECTOR (7 downto 0)； 
oSAD.8x4.n : out STD.LOGIC.VECTOR (7 downto 0); 
oSAD"8x8'p : out STD LOGIC VECTOR (3 downto 0)； 
oSAD.8x8.n : out STD LOGIC VECTOR (3 downto 0)； 
oSAD.8xl6 p : out STD LOGIC VECTOR (1 downto 0)； 
oSAD'8xl6n : out STDLOGICVECTOR (1 downto 0); 70 
oSAD"16x8p : out STD.LOGIC.VECTOR (1 downto 0); 
oSAD'16x8'n : out STD LOGIC VECTOR (1 downto 0); 
oSAD-16xl6p : out STDLOGIC; 
oSAD'16xl6"n : out STD'LOGIC)； 
end component ； 
signal sd_num_pos, sd_num_neg : std_logic_vector (255 downto 0);
signal SAD_4x4_p, SAD_4x4_n : std_logic_vector (15 downto 0);
signal SAD_8x4_p, SAD_8x4_n : std_logic_vector (7 downto 0);
signal SAD_4x8_p, SAD_4x8_n : std_logic_vector (7 downto 0);
signal SAD_8x8_p, SAD_8x8_n : std_logic_vector (3 downto 0);
signal SAD_8x16_p, SAD_8x16_n : std_logic_vector (1 downto 0);
signal SAD_16x8_p, SAD_16x8_n : std_logic_vector (1 downto 0);
signal SAD_16x16_p, SAD_16x16_n : std_logic;
signal msd : std_logic;
begin
    stage1_absolute: abs_stage_top port map (clk, msd,
        ref_pixel, curr_pixel, sd_num_pos, sd_num_neg);
    tree_generate:
    for i in 0 to 15 generate
        adder_tree: sd_16op_tree port map (clk,
            sd_num_pos (16*(i+1)-1 downto 16*i),
            sd_num_neg (16*(i+1)-1 downto 16*i),
            SAD_4x4_p(i), SAD_4x4_n(i));
    end generate;
    SAD_merger: olSDFA_tree port map (clk, SAD_4x4_p,
        SAD_4x4_n, SAD_4x8_p, SAD_4x8_n,
        SAD_8x4_p, SAD_8x4_n, SAD_8x8_p,
        SAD_8x8_n, SAD_8x16_p, SAD_8x16_n,
        SAD_16x8_p, SAD_16x8_n, SAD_16x16_p,
        SAD_16x16_n);
    stage3_comparator: comp_stage_top port map (clk, rst,
        SAD_4x4_p, SAD_4x4_n, SAD_4x8_p, SAD_4x8_n,
        SAD_8x4_p, SAD_8x4_n, SAD_8x8_p, SAD_8x8_n,
        SAD_8x16_p, SAD_8x16_n, SAD_16x8_p,
        SAD_16x8_n, SAD_16x16_p, SAD_16x16_n,
        mv_p, mv_n, stop);
    process (clk)
    begin
        if (clk'event and clk = '1') then
            if (rst = '0') then
                msd <= '0';
            else
                msd <= '1';
            end if;




[1] Virtex-2 User Guide.
http://direct.xilinx.com/bvdocs/userguides/ug012.pdf.
[2] Virtex-5 XtremeDSP Design Consideration.
http://direct.xilinx.com/bvdocs/userguides/ug193.pdf.
[3] Draft ITU-T Recommendation H.263, Video coding for low
bit rate communication, Version 2. 1998.
[4] G. Bjontegaard and K. Lillevold. Context-adaptive VLC
Coding of coefficients. In JVT document JVT-C028, Fairfax,
May 2002.
[5] S. Bouchoux and E. Bourennane. Application based on
Dynamic Reconfiguration of Field-programmable Gate Arrays:
JPEG 2000 Arithmetic Decoder. SPIE Journal in Optical
Engineering, 44(10):107001-107006, Oct. 2005.
[6] C. Y. Chen, S. Y. Chien, Y. W. Huang, T. C. Chen, T. C.
Wang, and L. G. Chen. Analysis and Architecture Design
of Variable Block Size Motion Estimation for H.264/AVC.
IEEE Trans. on Circuits and Systems, 53(2):578-593, Feb.
2006.
[7] C. Y. Cho, S. Y. Huang, and J. S. Wong. An Embedded
Merging Scheme for H.264/AVC Motion Estimation.
In IEEE Int. Conf. on Image Processing, volume 3, pages
1016-1019, Sept. 2005.
BIBLIOGRAPHY 138 
[8] W. C. Chung. Implementing the H.264/AVC
Video Coding Standard on FPGAs. In
Xcell publication, pages 18-21, Sept. 2005.
www.xilinx.com/publications/solguides/be_01/xc_pdf/p18-21_be1-dsp4.pdf.
[9] K. Compton and S. Hauck. Reconfigurable computing:
survey of systems and software. ACM Computing Surveys,
34(2):171-210, June 2002.
[10] M. D. Ercegovac and T. Lang. Digital Arithmetic. Morgan
Kaufmann Publishers, 2004.
[11] M. D. Ercegovac and T. Lang. On-Line Arithmetic: A
Design Methodology and Applications. In Proc. IEEE
workshop on VLSI Signal Processing, pages 252-263, 1988.
[12] E. M. Fakhouri. Variable block-size motion estimation.
citeseer.ist.psu.edu/fakhouri97variable.html.
[13] A. Gersho and R. M. Gray. Vector Quantization and Signal
Compression. Kluwer Academic Publishers, 1992.
[14] B. Hedayati. Fpgas expand their roles as
best asic replacement.
http://www.xilinx-china.com/company/success/asic.htm.
[15] W. Hsu and H. Derin. Three-dimensional subband coding
of video. Proc. Int. Conf. Acoustics, Speech, and Signal
Processing (ICASSP), pages 1100-1103, Apr. 1988.
[16] ISO/IEC11172. Information technology - coding of moving
pictures and associated audio for digital storage media at up
to about 1.5 Mbit/s. (MPEG-1), 1993.
[17] ISO/IEC13818. Information technology - generic coding of
moving pictures and associated audio information. (MPEG-2),
1995.
[18] ISO/IEC14496-2. Amendment 1, Information technology -
coding of audio-visual objects - Part 2: Visual. 2001.
[19] ISO/IEC15444. Information technology - JPEG2000 image
coding system. 2000.
[20] V. Iverson, J. MacVeigh, and B. Reese. Real-time
H.264-AVC codec on Intel architectures. In Image Processing,
2004. ICIP '04. 2004 International Conference on,
volume 2, pages 757-760, Oct. 2004.
[21] J. R. Jain and A. K. Jain. Displacement Measurement and
its Application in Interframe Image Coding. IEEE Trans.
on Communication, 29(12):1799-1808, Dec. 1981.
[22] M. Kim, I. Hwang, and S. I. Chae. A fast VLSI Architecture
for Full-search Variable Block Size Motion Estimation in
MPEG-4 AVC/H.264. In Proc. of the ASP-DAC, volume 1,
pages 631-634, Jan. 2005.
[23] T. Koga, K. Iinuma, A. Hirano, Y. Iijima, and T. Ishiguro.
Motion Compensated Interframe Coding for Video
Conferencing. In Proc. of National Telecomm. Conf., pages
G5.3.1-G5.3.5, New Orleans, Nov. 1981.
[24] T. Kormarek and P. Pirsch. Array Architectures for Block
Matching Algorithms. IEEE Trans. on Circuits and Systems,
36(10):1301-1308, 1989.
[25] P. M. Kuhn, G. Diebel, S. Herrmann, A. Keil, H. Mooshofer,
A. Kaup, R. M. Mayer, and W. Stechele. Complexity and
PSNR comparison of several fast motion estimation
algorithms for MPEG-4. Proc. SPIE, 3460:486-489, 1998.
[26] H. T. Kung and C. E. Leiserson. Systolic arrays (for VLSI).
Sparse Matrix Proceedings, 1979.
[27] W. Lee, Y. Kim, R. J. Gove, and C. J. Read. Media Station
5000: Integrating Video and Audio. IEEE Trans. Multimedia,
1(2):50-61, 1994.
[28] M. Li. Arithmetic and Logic in Computer Systems. Wiley
Interscience, 2004.
[29] W. Li and E. Salari. Successive Elimination Algorithm
for Motion Estimation. IEEE Trans. Image Processing,
4(1):105-107, Jan. 1995.
[30] S. Lopez, F. Tobajas, A. Villar, V. de Armas, J. Lopez, and
R. Sarmiento. Low Cost Efficient Architecture for H.264
Motion Estimation. In Proc. of IEEE Int. Symp. on Circuits
and Systems, volume 1, pages 412-415, 2005.
[31] H. Loukil, F. Ghozzi, and A. Samet. Hardware
implementation of Block Matching Algorithm with FPGA technology.
In IEEE Int. Conf. on Microelectronics, volume 16, pages
542-546, 2004.
[32] H. S. Malvar, A. Hallapuro, M. Karczewicz, and L. Kerofsky.
Low-complexity Transform and Quantization in
H.264/AVC. IEEE Trans. on Video Technology,
13(7):620-644, July 2003.
[33] D. Marpe, H. Schwarz, and T. Wiegand. Context-Based
Adaptive Binary Arithmetic Coding in the H.264/AVC
Video Compression Standard. IEEE Trans. on Circuits and
Systems for Video Tech., 13(7):620-636, July 2003.
[34] M. Mohammadzadeh, M. Eshghi, and M. Azadfar. An
Optimized Systolic Array Architecture for Full Search Block
Matching Algorithm and its Implementation on FPGA
chips. In IEEE Int. Conf. NEWCAS, volume 3, pages
327-330, 2005.
[35] G. E. Moore. Cramming more components onto integrated
circuits. Electronics Magazine, 38(8), Apr. 1965.
[36] H. Nozaki, M. Motoyama, A. Shimbo, and S. Kawamura.
Implementation of RSA Algorithm Based on RNS
Montgomery Multiplication. Lecture Notes in Computer Science,
2162:364-376, May 2001.
[37] T. M. Oh, Y. R. Kim, W. G. Hong, and S. J. Ko. Partial
Norm Based Search Algorithm for Fast Motion Estimation.
Electron. Lett., 36(14):1195-1196, 2000.
[38] J. Olivares and J. Hormigo. Minimum Sum of Absolute
Differences Implementation in a Single FPGA Device. In
IEEE Int. Conf. on Field Programmable Logic, pages
986-990, 2004.
[39] C. Ou, C. F. Le, and W. J. Hwang. An efficient VLSI
architecture for H.264 Variable block size motion estimation.
IEEE Trans. on Signal Processing, 51(4):1291-1299, Nov.
2005.
[40] B. Parhami. Computer Arithmetic: Algorithms and
Hardware Designs. Oxford University Press, 2000.
[41] K. R. Rao and P. Yip. Discrete Cosine Transform.
Academic Press, 1990.
[42] I. E. G. Richardson. H.264 and MPEG-4 Video
Compression. John Wiley Publisher, 2003.
[43] D. Salomon. Data Compression: The Complete Reference.
Springer, 2004.
[44] M. A. Soderstrand, W. K. Jenkins, G. A. Jullien, and F. J.
Taylor, editors. Residue number system arithmetic: modern
applications in digital signal processing. IEEE Press,
Piscataway, NJ, USA, 1986.
[45] C. L. Su and C. W. Jen. Motion Estimation using On-line
Arithmetic. In Proc. of IEEE Intl. Symp. on Circuits and
System, volume 1, pages 683-686, 2000.
[46] C. L. Su and C. W. Jen. Motion Estimation using
MSD-first Processing. In Proc. of IEEE circuits, device and
systems, volume 150, pages 124-133, Apr. 2003.
[47] P. D. Symes. Digital Video Compression. McGraw-Hill,
2004.
[48] T. Wiegand Ed. Pattaya. Draft ITU-T Recommendation
H.264 and Draft ISO/IEC 14496-10 AVC. In JVC
of ISO/IEC and ITU-T SG16/Q.6 Doc. JVT-G050, Mar.
2003.
[49] J. Y. Tham, S. Ranganath, M. Ranganath, and A. A. Kassim.
A Novel Unrestricted Center-Biased Diamond Search
Algorithm for Block Motion Estimation. IEEE Trans. on
Circuits and Systems, 8(4):369-377, Aug. 1998.
[50] J. Villalba, J. Hormigo, J. M. Prades, and E. L. Zapata.
On-line Multioperand Addition Based on On-line Full Adders.
In IEEE Intl. Conf. on App. Specific systems, pages
322-327, July 2005.
[51] C. Wei and M. Z. Gang. A novel SAD Computing Hardware
Architecture for Variable-size Block Matching Estimation
and Its Implementation with FPGA. In Proc. of IEEE Int.
Symp. on Circuits and Systems, volume 1, pages 683-686,
2000.
[52] S. Wong, B. Stougie, and S. Cotofana. Alternatives in
FPGA-based SAD Implementations. In IEEE Int. Conf.
on Field Programmable Logic, pages 449-452, Dec. 2002.
[53] S. Wong, S. Vassiliadis, and S. Cotofana. A Sum of Absolute
Differences Implementation in FPGA Hardware. In Proc.
of 28th Euromicro Conf., pages 183-188, Sept. 2002.
[54] S. Y. Yap and J. V. McCanny. A VLSI Architecture
for Advanced Video Coding Motion Estimation. In Proc.
IEEE Intl. Conf. on application-specific systems, arch., and
processors, pages 293-301, June 2003.
[55] S. Y. Yap and J. V. McCanny. A VLSI Architecture for
Variable Block Size Video Motion Estimation. IEEE Trans.
on Circuits and Systems, 51(7):384-389, July 2004.
[56] S. Zhu and K. K. Ma. A New Diamond Search Algorithm for
Fast Block Matching Motion Estimation. In Proc. of Intl.
Conf. on Information Communication and Signal Processing
(ICICS), pages 292-296, Sept. 1997.
[57] S. Zhu and K. K. Ma. A New Diamond Search Algorithm
for Fast Block Matching Motion Estimation. IEEE Trans.









