Analogue implementation of motion estimation processors for digital video coding. by Panovic, M.
UNIVERSITY OF LONDON THESIS

Analogue Implementation of Motion Estimation 
Processors for 
Digital Video Coding
By
Mladen Panovic
A thesis submitted for the degree of 
Doctor of Philosophy
UCL
Department of Electronic & Electrical Engineering 
University College London 
Torrington Place, London 
WC1E 7JE, England
UMI Number: U593592
Published by ProQuest LLC 2013. Copyright in the Dissertation held by the Author.
Abstract
Digital video technology has been characterised by continued growth in the last decade. 
The uses of video coding are broad and diverse. Generally, however, they fall into two 
main categories: making efficient use of limited storage, or of limited transmission 
bandwidth. Video coding is essential for storing video on Digital Versatile Disc (DVD) 
and for exploiting the limited bandwidth available to real time applications such as 
videoconferencing, videotelephony and digital television.
The use of Differential Pulse Code Modulation (DPCM) enables reduction of temporal 
redundancy between successive frames by frame differencing. The efficiency of this 
method is enhanced by motion compensating the predicted frame before differencing. 
The use of block motion estimation for the implementation of motion compensated 
video compression has become the favoured technique in video coding standards such 
as H.26-(1,3) and MPEG-(1,2,4).
Motion estimation is the most demanding aspect of video coding, accounting for some 
50% of the total computation needed in the H.261 video coding standard. The use of 
Very Large Scale Integration (VLSI) digital implementations enables this processing 
rate to be realised in real time, which is vital for communication purposes. However, 
the power dissipation and implementation area of such implementations tend to be 
excessive.
The research described in this work attempts to provide an efficient Complementary 
Metal Oxide Semiconductor (CMOS) integrated circuit realisation of the motion 
estimation processor based on analogue circuit techniques. The goal is to use compact 
analogue computation circuits in place of large digital arithmetic units, with a 
corresponding reduction of implementation area and power dissipation. These 
reductions are essential for battery operated video applications. It is hoped that the 
work presented in this thesis, which explores and characterises the issues relating to 
the integrated circuit realisation of an analogue motion estimation processor, will 
provide the basis for future video coding standard implementations.
Acknowledgements
This thesis would not have been possible without the support of many people over the 
last three years.
I would like to thank Andreas Demosthenous, David Gamer and John Taylor for 
supervising this work. I would also like to express my gratitude towards Olujide 
Adeniran, Mike Brent and Iasonas Triantis.
Furthermore, the Engineering and Physical Sciences Research Council (EPSRC) 
supported this work under grant number GR/R64926.
Table of Contents
Chapter 1 - Introduction
1.1 Background
1.2 Overview of Thesis
1.3 Statement of Originality
Chapter 2 - Video Coding Fundamentals
2.1 Introduction
2.2 Digital Video Representation
2.2.1 Digitisation
2.2.2 Spatial Sampling
2.2.3 Temporal Sampling
2.2.4 Quantisation
2.2.5 Digital Video Formats
2.2.6 The Basis for Video Compression
2.2.7 Video Quality Measure
2.3 Spatial Redundancy
2.3.1 The Discrete Cosine Transform
2.3.2 Quantiser
2.3.3 Variable Length Coding
2.3.4 Vector Quantisation
2.4 Temporal Redundancy
2.4.1 Frame Differencing
2.4.2 Motion Compensated Prediction
2.5 Motion Compensation and Video Coding Standards
2.6 Conclusions
Chapter 3 - Motion Estimation
3.1 Introduction
3.2 Principles of Block Matching Motion Estimation
3.2.1 Block Matching Functions
3.3 Advanced Motion Compensation Techniques
3.3.1 Half-pixel Accuracy
3.3.2 Overlapping Block Motion Compensation
3.3.3 Unrestricted Motion Vectors
3.4 Motion Estimation Complexity
3.4.1 Computational Complexity
3.5 Fast Motion Estimation Algorithms
3.5.1 The Two Dimensional Logarithmic Search Algorithm
3.5.2 The Three-Step Search Algorithm
3.5.3 The One-Dimensional Logarithmic Search Algorithm
3.5.4 The Conjugate Direction Search Algorithm
3.5.5 Comparison of Fast Search Algorithm Methods
3.5.6 Hierarchical Block Motion Estimation
3.5.7 Phase Correlation Method
3.5.8 Pixel Decimation
3.5.9 Reduced Pixel Precision
3.6 Conclusions
Chapter 4 - Characterisation of Block Matching Algorithms using Analogue Techniques
4.1 Introduction
4.2 Realisation of Block Matching Algorithms using Analogue Techniques
4.2.1 Motivation for using Analogue Techniques
4.2.2 Non-ideal Analogue Characteristics
4.3 Characterisation of non-ideal Analogue Techniques on Performance
4.3.1 Reduced Pixel Precision
4.3.2 Offset Errors
4.3.3 Reduction of Block Match Comparison Resolution
4.4 Effects of Block Size on Block Matching Algorithm Performance
4.4.1 Reduced Pixel Precision
4.4.2 Offset Errors with Different Block Size
4.4.3 Block Size Performance
4.5 Comparison of Architectures Techniques
4.5.1 Architectures Performance with offset errors
4.6 Conclusions
Chapter 5 - Block Matching Motion Estimation Architectures
5.1 Introduction
5.2 Basic Architectures for Analogue Block Matching Motion Estimation
5.2.1 First Generation Architectures
5.2.2 First Generation Architectures Comparison
5.2.3 Second Generation Architectures
5.2.4 Second Generation Architectures Comparison
5.3 Improved Analogue Motion Estimation Processor
5.4 Block Matching Algorithm Implementation
5.4.1 Review of Analogue Block Matching Algorithms
5.4.2 Analogue Processing Element Structure
5.4.3 Analogue Block Matching Algorithms Comparison
5.5 Conclusions
Chapter 6 - Implementation of Block Matching Motion Estimation using Analogue Techniques
6.1 Introduction
6.2 Distance Metric
6.2.1 Analogue Processing Element Principle
6.2.2 Analogue Processing Element Realisation
6.2.3 Analogue Processing Element Small-signal Analysis
6.2.4 Analogue Processing Element Frequency Characteristics
6.2.5 Mobility Degradation and Short Channel Effects
6.2.6 Simulated Results
6.2.7 Measured Results
6.3 Analogue Memory
6.3.1 Sample and Hold with Dummy Switch
6.3.2 Sample and Hold with Transmission Gate
6.3.3 Rotary Shifter
6.4 Decision Circuit
6.4.1 High Accuracy Sample and Hold
6.4.2 Comparator
6.5 Conclusions
Chapter 7 - Experimental Results of Block Matching Analogue Motion Estimation Processor
7.1 Introduction
7.2 Analogue Motion Estimation Processor
7.2.1 Fabricated Analogue Motion Estimation Processor
7.2.2 Experimental Evaluation Method
7.2.3 Experimental Results
7.3 Improved Analogue Motion Estimation Processor
7.3.1 Fabricated Improved Analogue Motion Estimation Processor
7.3.2 Experimental Evaluation Method
7.3.3 Experimental Results
7.4 Motion Estimation Processor Performance
7.4.1 Power Dissipation and Motion Vector Computation Rate
7.4.2 Comparison of Motion Estimation Processors
7.5 Conclusions
Chapter 8 - Conclusion
8.1 Conclusions
8.2 Future Work
References
List of Tables
Table 2.1 Digital video formats.
Table 3.1 Showing a comparison of fast search algorithms complexity.
Table 5.1 Components and operations required for motion estimation processor 
using various architectures.
Table 5.2 Components and operations required for motion estimation processor 
using various architectures.
Table 5.3 Components and operations required for motion estimation processor 
using various architectures.
Table 5.4 Components and operations required for motion estimation processor 
using various architectures.
Table 5.5 The basic amplifier configurations and the ideal characteristics.
Table 6.1 Various parameters for the three squaring circuits.
Table 7.1 Showing area of various motion estimation processor sub-systems.
Table 7.2 The mean PSNR error for various sequences with different distance 
metrics and measured analogue motion estimation processor.
Table 7.3 Showing area of various improved motion estimation processor sub-systems.
Table 7.4 The mean PSNR error for various sequences with different distance 
metrics and measured improved analogue motion estimation processor.
Table 7.5 Showing the power usage breakdown for analogue motion estimation 
processor.
Table 7.6 Showing area of various motion estimation processor sub-systems.
Table 7.7 Showing comparison of various motion estimation processor 
implementations.
List of Figures
Figure 2.1 General block diagram of video digitiser system.
Figure 2.2 Spatial sampling scanning.
Figure 2.3 Colour subsampling showing the YCrCb components used for each pixel.
Figure 2.4 Intraframe compression and decompression.
Figure 2.5 Showing quantisation used for intraframe coding (a) and interframe 
coding (b) with the characteristic dead zone at centre.
Figure 2.6 The DCT coefficients are entropy coded in zigzag fashion.
Figure 2.7 Frame difference encoder.
Figure 2.8 Showing for “Susie” test image a difference error frame.
Figure 2.9 Showing for “Susie” test image a motion compensated difference 
error frame.
Figure 2.10 Showing movement of objects between frames.
Figure 2.11 Generic interframe encoder.
Figure 2.12 Showing block matching process.
Figure 2.13 Bidirectional coding. The prediction is made from the average of the 
best matches from the previous and future frames.
Figure 2.14 Showing sequence of I, B and P frames. The frames are encoded 
with reference to each other as indicated. This is known as a group of 
pictures (GOP).
Figure 3.1 Showing block motion estimation process.
Figure 3.2 The sub-pixel search positions are interpolated from the positions 
surrounding pixel A.
Figure 3.3 Showing for “Susie” test image an integer-pixel search motion 
compensated difference error frame.
Figure 3.4 Showing for “Susie” test image a half-pixel search motion 
compensated difference error frame.
Figure 3.5 Showing the use of Overlapping Block Motion Compensation 
(OBMC), with system of weighted average of current and four 
neighbouring blocks.
Figure 3.6 Showing the use of unrestricted motion vectors. The motion vector 
points outside the frame enabling better motion prediction.
Figure 3.7 Showing resource usages in video codec.
Figure 3.8 Showing number of blocks compared for a given search parameter w.
Figure 3.9 Two dimensional logarithmic search algorithm.
Figure 3.10 The three-step search algorithm.
Figure 3.11 One-dimensional logarithmic search algorithm.
Figure 3.12 The conjugate direction search algorithm.
Figure 3.13 Showing hierarchical block matching algorithm. The motion 
estimation is performed using a three level pyramid structure with 
each level a reduced resolution representation of the lower level.
Figure 3.14 Showing the block size and search window at each level using the 
three-level hierarchical block matching algorithm.
Figure 4.1 Simplified block diagram showing ME processor as part of video 
coder.
Figure 4.2 Block diagram showing distance measure metric.
Figure 4.3 Depicting the effect of input-offset error (εx) and output-offset error 
(εy) on the absolute-distance PE characteristic.
Figure 4.4 The mean PSNR error due to pixel precision reduction for various 
sequences. Upper, MSE distance metric with ideal 8-bit MAE 
reference superimposed (dashed). Lower, MAE distance metric.
Figure 4.5 The mean PSNR error due to input-offset for various sequences. 
Upper, MSE distance metric. Lower, MAE distance metric.
Figure 4.6 The mean PSNR error at reduced comparator accuracy for 
sequences using 8-bit pixel values. Upper, MSE distance metric, with 
as a reference 8-bit MAE using ideal comparator superimposed 
(dashed). Lower, MAE distance metric.
Figure 4.7 The mean PSNR error at reduced comparator accuracy for various 
sequences, using 6-bit pixel values and MSE distance metric - with as 
a reference 8-bit MAE using ideal comparator superimposed (dashed).
Figure 4.8 The mean PSNR error due to pixel precision reduction for various 
sequences using the MSE algorithm for various block sizes.
Figure 4.9 The mean PSNR error due to pixel precision reduction for various 
sequences, using the MAE algorithm for various block sizes.
Figure 4.10 The mean PSNR error due to input referred offset for different 
sequences using various blocks. The MSE algorithm has been used with 
8-bit pixel precision.
Figure 4.11 The mean PSNR error due to input referred offset for different 
sequences using various blocks.
Figure 4.12 Motion estimation processor using serial-parallel architecture.
Figure 4.13 Motion estimation processor using multiple-parallel architecture.
Figure 4.14 The mean PSNR error due to input referred offset for various sequences 
using multiple-parallel architecture. Upper, MSE distance metric. 
Lower, MAE distance metric.
Figure 5.1 Motion estimation processor using serial-parallel architecture.
Figure 5.2 Motion estimation processor using multiple-parallel architecture.
Figure 5.3 Motion estimation processor using parallel-parallel architecture.
Figure 5.4 Showing the computation path taken in the search window.
Figure 5.5 (a) For each match, though the positions are different, only one column 
or row actually contains new pixel values. Therefore as shown in (b) and 
(c) the bulk of the values can be reused by just updating new values and 
multiplexing the columns and/or rows to correspond.
Figure 5.6 Motion estimation processor using improved serial-parallel 
architecture.
Figure 5.7 Motion estimation processor using improved multiple-parallel 
architecture.
Figure 5.8 Showing the computation path taken in the search window.
Figure 5.9 A single TB is shown for the first row. (a) For each match, though the 
positions are different, only one row actually contains new pixel values. 
Therefore as shown in (b) and (c) the bulk of the values can be reused 
by just updating new values and multiplexing the columns and/or rows 
to correspond.
Figure 5.10 Proposed improved distance metric for serial-parallel architecture with 
input-offset error cancellation.
Figure 5.11 The basic S/H circuit incorporated in the system-level simulations.
Figure 5.12 Improved serial-parallel architecture: Error from mean-PSNR of 8-bit 
pixel MSE as a function of comparator resolution for various sequences 
using 8-bit pixel values. The standard deviation of the input-offset error 
was 10 mV. 8-bit pixel ideal MAE reference superimposed (dotted lines).
Figure 5.13 General distance metric structure.
Figure 5.14 Distance measure based on a VCVS.
Figure 5.15 Distance measure based on a CCCS.
Figure 5.16 Distance measure based on a CCVS.
Figure 5.17 Distance measure based on a VCCS.
Figure 6.1 Motion estimation processor showing sub-systems.
Figure 6.2 Linear transconductor and square-law function principle [92].
Figure 6.3 Practical realization of the circuit in Fig. 6.2 using source followers.
Figure 6.4 Practical realization of the circuit in Fig. 6.2 using complementary 
transistors.
Figure 6.5 Proposed practical realisation of the circuit in Fig. 6.2 using flipped 
voltage followers.
Figure 6.6 The source follower (a) and the flipped follower (b).
Figure 6.7 The small-signal equivalent circuit of the flipped follower.
Figure 6.8 (a) Flipped voltage follower driving the source of another transistor. 
(b) Equivalent small-signal model when the feedback loop is broken.
Figure 6.9 The important dimensions of a MOS transistor.
Figure 6.10 The drain current (ID) for gate-source voltage (VGS) for different L 
keeping the aspect ratio constant (0.8 μm technology BSIM3v3 model 
used with tox = 16.3 nm). The Sah model is superimposed for 
comparison.
Figure 6.11 The square root of drain current (ID) for gate-source voltage (VGS) for 
different L keeping the aspect ratio constant (0.8 μm technology 
BSIM3v3 model used with tox = 16.3 nm). The Sah model is 
superimposed for comparison.
Figure 6.12 The drain current (ID) for gate-source voltage (VGS) for different L 
keeping the aspect ratio constant (0.35 μm technology BSIM3v3 
model used with tox = 7.7 nm). The Sah model is superimposed for 
comparison.
Figure 6.13 The square root of drain current (ID) for gate-source voltage (VGS) for 
different L keeping the aspect ratio constant (0.35 μm technology 
BSIM3v3 model used with tox = 7.7 nm). The Sah model is 
superimposed for comparison.
Figure 6.14 Simulated THD of the three squaring circuits as a function of channel 
length (differential sinusoidal input: 1 kHz, 400 mVp-p).
Figure 6.15 Bode plot of the circuit in Fig. 6.8a for V1 = 100 mV and 
V2 = 400 mV.
Figure 6.16 Bode plot of the circuit in Fig. 6.8a for V1 = 400 mV and 
V2 = 100 mV.
Figure 6.17 Simulated time domain response of the closed-loop circuit in Fig. 
6.8. (a) Applied step-inputs V1 and V2. (b) Time response of the drain 
current of transistor M1.
Figure 6.18 Chip microphotograph of squarer circuit.
Figure 6.19 Measured plot of square-law transfer curve.
Figure 6.20 Squarer measurements. (a) Spectrum of output voltage (differential 
input: 1 kHz, 400 mVp-p). (b) Average THD as a function of input 
signal amplitude.
Figure 6.21 Input-offset voltage distribution histogram of the squarer circuit 
shown in Fig. 6.5.
Figure 6.22 Simple single transistor switch sample and hold.
Figure 6.23 The single transistor switch with dummy transistor sample and hold.
Figure 6.24 The transmission gate switch sample and hold.
Figure 6.25 Rotary Shifter (4 control lines, S0-S3).
Figure 6.26 Simplified diagram of sample and hold circuit.
Figure 6.27 Transistor level circuit diagram of sample and hold circuit.
Figure 6.28 High performance comparator.
Figure 6.29 Comparator preamplifier.
Figure 6.30 Dynamic Comparator.
Figure 7.1 Motion estimation processor integrated circuit.
Figure 7.2 Motion estimation processor microphotograph. The parts not 
highlighted are not part of the Motion estimation processor.
Figure 7.3 Motion estimation processor test procedure.
Figure 7.4 The PSNR error from 8-bit MSE for the Carphone sequence with 
various metrics and measured analogue motion estimation processor.
Figure 7.5 Detecting the patterns of motion in Carphone sequence: (a) Reference 
frame (b) Current frame (c) 8-bit MSE (d) 8-bit MAE (e) 6-bit MSE 
and (f) Measured analogue motion estimation processor.
Figure 7.6 The PSNR error from 8-bit MSE for the Foreman sequence with 
various metrics and measured analogue motion estimation processor.
Figure 7.7 Detecting the patterns of motion in Foreman sequence: (a) Reference 
frame (b) Current frame (c) 8-bit MSE (d) 8-bit MAE (e) 6-bit MSE 
and (f) Measured analogue motion estimation processor.
Figure 7.8 The PSNR error from 8-bit MSE for the Susie sequence with various 
metrics and measured analogue motion estimation processor.
Figure 7.9 Detecting the patterns of motion in Susie sequence: (a) Reference 
frame (b) Current frame (c) 8-bit MSE (d) 8-bit MAE (e) 6-bit MSE 
and (f) Measured analogue motion estimation processor.
Figure 7.10 The PSNR error from 8-bit MSE for the Trevor sequence with various 
metrics and measured analogue motion estimation processor.
Figure 7.11 Detecting the patterns of motion in Trevor sequence: (a) Reference 
frame (b) Current frame (c) 8-bit MSE (d) 8-bit MAE (e) 6-bit MSE 
and (f) Measured analogue motion estimation processor.
Figure 7.12 Improved ME processor integrated circuit.
Figure 7.13 Improved motion estimation processor microphotograph. The top area 
highlighted is essentially the circuit shown in Fig. 7.2, middle left the 
resistor load, middle right the decision circuit and lower, on chip 
decoupling capacitors.
Figure 7.14 Improved ME processor test procedure.
Figure 7.15 The PSNR error from 8-bit MSE for the Carphone sequence with 
various metrics and measured improved motion estimation processor.
Figure 7.16 Detecting the patterns of motion in Carphone sequence: (a) Reference 
frame (b) Current frame (c) 8-bit MSE (d) 8-bit MAE (e) 6-bit MSE 
and (f) Measured improved motion estimation processor.
Figure 7.17 The PSNR error from 8-bit MSE for the Foreman sequence with 
various metrics and measured improved motion estimation processor.
Figure 7.18 Detecting the patterns of motion in Foreman sequence: (a) Reference 
frame (b) Current frame (c) 8-bit MSE (d) 8-bit MAE (e) 6-bit MSE 
and (f) Measured improved motion estimation processor.
Figure 7.19 The PSNR error from 8-bit MSE for the Susie sequence with various 
metrics and measured improved motion estimation processor.
Figure 7.20 Detecting the patterns of motion in Susie sequence: (a) Reference 
frame (b) Current frame (c) 8-bit MSE (d) 8-bit MAE (e) 6-bit MSE 
and (f) Measured improved motion estimation processor.
Figure 7.21 The PSNR error from 8-bit MSE for the Trevor sequence with various 
metrics and measured improved motion estimation processor.
Figure 7.22 Detecting the patterns of motion in Trevor sequence: (a) Reference 
frame (b) Current frame (c) 8-bit MSE (d) 8-bit MAE (e) 6-bit MSE 
and (f) Measured improved motion estimation processor.
Figure 7.23 Showing the usage of time for calculating the best motion vector.
Glossary of Terms
BMA Block Matching Algorithm
COMP Comparator
CMOS Complementary Metal Oxide Semiconductor
DPCM Differential Pulse Code Modulation
DVD Digital Versatile Disc
DCT Discrete Cosine Transform
DM Distance Metric
GOP Group Of Pictures
ITU International Telecommunication Union
CCITT International Telegraph and Telephone Consultative Committee
IDCT Inverse Discrete Cosine Transform
JPEG Joint Photographic Experts Group
LTA Loser Takes All
MAE Mean Absolute Error
MSE Mean Square Error
MPEG Moving Picture Experts Group
PSNR Peak Signal-to-Noise Ratio
PE Processing Element
RLE Run Length Encoding
S/H Sample and Hold
SB Search Block
TB Template Block
TSS Three-Step Search
VLSI Very Large Scale Integration
Chapter 1
Introduction
1.1 Background
Digital video coding has become an essential technology for making use of the limited 
storage capacity and transmission bandwidth available. The use of video coding has 
become widespread in recent years, largely due to improvements in technology and the 
establishment of several international video coding standards. Examples include the 
communication applications of video telephony and video conferencing, and the DVD 
(Digital Versatile Disc) storage medium used in digital video cameras and 
entertainment videos.
Video coding techniques take advantage of visual redundancies in order to reduce the 
bit rate needed to represent the video information. In natural scenes, data redundancies 
arise from both spatial correlation within an image and temporal correlation between 
successive images, as exploited in intraframe and interframe coding respectively. The 
earliest interframe coding techniques were based on Differential Pulse Code Modulation 
(DPCM) and date back to the late 1950’s and 1960’s [1], [2]. DPCM works by encoding 
frame differences on a pixel by pixel basis. Frame differencing yields good compression 
only in static parts of the image sequence, so the overall compression is typically not 
high, particularly in sequences that contain a lot of motion. Therefore, with a 
constrained bit rate, the picture quality is usually much reduced. In the early 1980’s, 
using the technology available at the time, DPCM was standardised by the International 
Telegraph and Telephone Consultative Committee (CCITT) as the H.120 video coding 
standard. The DPCM predictive method of simple frame differencing is not efficient but 
can be improved with motion compensation. This compensates the frame for the motion 
that has occurred between frames before the differencing, thereby substantially 
improving the prediction.
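The contrast between plain frame differencing and motion compensated differencing can be illustrated with a minimal Python sketch. It is not taken from the thesis: the function names, the tiny 3×4 frames and the use of a whole-frame horizontal shift as the "motion compensation" are all assumptions made purely for illustration.

```python
def frame_difference(current, previous):
    """Pixel-by-pixel difference (the residual a DPCM-style coder would encode)."""
    return [[c - p for c, p in zip(cur_row, prev_row)]
            for cur_row, prev_row in zip(current, previous)]


def reconstruct(previous, residual):
    """Decoder side: add the residual back onto the previous frame."""
    return [[p + r for p, r in zip(prev_row, res_row)]
            for prev_row, res_row in zip(previous, residual)]


def shift_columns(frame, dx, fill=0):
    """Hypothetical whole-frame compensation: move every row dx pixels to the right."""
    if dx == 0:
        return [row[:] for row in frame]
    if dx > 0:
        return [[fill] * dx + row[:-dx] for row in frame]
    return [row[-dx:] + [fill] * (-dx) for row in frame]


def residual_energy(residual):
    """Sum of squared residual values; smaller means less information to encode."""
    return sum(r * r for row in residual for r in row)


# A single bright pixel moves one position to the right between frames.
previous = [[0, 0, 0, 0], [0, 9, 0, 0], [0, 0, 0, 0]]
current = [[0, 0, 0, 0], [0, 0, 9, 0], [0, 0, 0, 0]]

plain = frame_difference(current, previous)
compensated = frame_difference(current, shift_columns(previous, 1))

print(residual_energy(plain))        # 162
print(residual_energy(compensated))  # 0
```

In this toy case the moving pixel appears twice in the plain residual, whereas compensating the previous frame first leaves nothing to encode. Real sequences do not move as a rigid whole, which is why practical coders estimate motion per block rather than per frame.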
The use of motion compensation to improve video compression can be dated back to the 
late 1960’s with the publication of a paper presented by Rocca [3] at the Picture Coding 
Symposium. The early 1970’s saw the first attempt at commercial exploitation of the 
concept with a patent by Haskell and Limb [4]. Since then, digital video coding 
techniques using motion compensation have varied considerably; however, the concept 
remains essentially unchanged.
In the early 1980’s Jain and Jain [5] suggested an interframe coding technique with 
motion compensation based on the block translation model of motion between 
successive frames. The motion is estimated using the block matching motion estimation 
method. The technique divides the current frame into regular blocks and, for each 
block, searches for the most similar block within a search window of the intraframe 
coded reference frame. A motion vector then represents the displacement of the most 
similar block relative to the block in the current frame. The block translation model of 
motion is hardly accurate for describing all motion in natural scenes; however, it is 
computationally tractable and yields good compression. Largely due to these factors, 
by the end of the 1980’s the interframe coding technique with motion compensation 
using block matching motion estimation was used in the H.261 video coding standard 
introduced by the International Telecommunication Union (ITU), formerly known as 
CCITT. The H.261 standard was the culmination of international contributions aiming 
to specify a standard that could be implemented effectively with the technology 
available. The H.261 standard laid down the path that many subsequent standards, such 
as those specified by MPEG (Moving Picture Experts Group), have followed.
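The block matching search described above can be sketched in a few lines of Python. This is an illustrative exhaustive (full) search using the mean absolute error as the matching function, not the implementation studied in this thesis; the function names, the parameter names (`n` for the block size, `w` for the maximum displacement) and the synthetic 8×8 frames are assumptions.

```python
def mean_absolute_error(current, reference, top, left, ref_top, ref_left, n):
    """MAE between the n x n block of `current` at (top, left) and the
    n x n block of `reference` at (ref_top, ref_left)."""
    total = 0
    for y in range(n):
        for x in range(n):
            total += abs(current[top + y][left + x]
                         - reference[ref_top + y][ref_left + x])
    return total / (n * n)


def full_search(current, reference, top, left, n, w):
    """Try every displacement (dy, dx) with |dy|, |dx| <= w and return the
    motion vector with the lowest MAE, together with that MAE."""
    height, width = len(reference), len(reference[0])
    best_vector, best_cost = None, float("inf")
    for dy in range(-w, w + 1):
        for dx in range(-w, w + 1):
            ref_top, ref_left = top + dy, left + dx
            # Skip candidate blocks that fall outside the reference frame.
            if (ref_top < 0 or ref_left < 0
                    or ref_top + n > height or ref_left + n > width):
                continue
            cost = mean_absolute_error(current, reference, top, left,
                                       ref_top, ref_left, n)
            if cost < best_cost:
                best_vector, best_cost = (dy, dx), cost
    return best_vector, best_cost


# Synthetic test data: every reference pixel has a unique value, and the
# current frame is the reference moved one pixel down and one pixel right.
reference = [[8 * y + x for x in range(8)] for y in range(8)]
current = [[reference[y - 1][x - 1] if y > 0 and x > 0 else 0
            for x in range(8)] for y in range(8)]

vector, cost = full_search(current, reference, top=3, left=3, n=2, w=2)
print(vector, cost)  # (-1, -1) 0.0
```

The returned vector points from the block in the current frame to its best match in the reference frame, so content that moved down and to the right is found one pixel up and to the left. The nested displacement loops make the cost of the search grow with the square of `w`.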
Research since the early 1980’s has focused on improving or extending the block 
matching motion estimation method. Broadly speaking, publications have followed two 
directions: improving the motion estimation accuracy, or reducing the computation 
necessary to calculate the motion vectors. It is block matching motion estimation 
algorithms that are generally used in video coding standards. The most demanding 
aspect of video coding is motion estimation. The H.261 recommendation [6] suggests 
that the computation required for the entire codec is in the region of 1200 MOPS 
(Mega-operations-per-second) [7]. Motion estimation accounts for some 50% of the 
total computation needed. The use of Very Large Scale Integration (VLSI) 
implementations enables this processing rate to be realised in real time, which is vital 
for communication purposes. However, the implementation area and power dissipation 
of such implementations tend to be excessive.
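To give a feel for where a figure of this size comes from, the following Python sketch counts the operations of an exhaustive block matching search. Every parameter value is an assumption chosen for illustration rather than a figure from this thesis: a QCIF frame of 176×144 pixels, 16×16 blocks, a maximum displacement of 7 pixels, 30 frames per second, and three operations (subtract, absolute value, accumulate) per pixel comparison.

```python
def full_search_ops_per_second(width, height, block, w, fps, ops_per_pixel=3):
    """Rough operation count for exhaustive block matching.

    Assumes every block is compared against all (2w + 1)^2 candidate
    positions and that each pixel comparison costs `ops_per_pixel` operations.
    """
    blocks_per_frame = (width // block) * (height // block)
    candidates = (2 * w + 1) ** 2
    ops_per_block = candidates * block * block * ops_per_pixel
    return blocks_per_frame * ops_per_block * fps


# Assumed parameters: QCIF frame, 16x16 blocks, +/-7 pixel search, 30 frames/s.
ops = full_search_ops_per_second(width=176, height=144, block=16, w=7, fps=30)
print(ops / 1e6, "MOPS")  # 513.216 MOPS
```

Under these assumptions the search alone needs roughly 500 MOPS, which is of the same order as the 50% share of 1200 MOPS quoted above. The agreement is indicative only, since the result scales directly with the assumed frame size, search range and frame rate.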
Since the 1980’s, the approach towards video coding VLSI implementations has 
changed direction several times. Programmable architectures intended for general image 
processing rather than any specific application were proposed until the late 1980’s. The 
designs tended to be essentially slightly optimised digital signal processor architectures 
[8], with a number of processors working in parallel to meet the computational demand 
of the applications. The main drawbacks of this approach are excessive implementation 
area and power dissipation. To overcome this, dedicated architectures for video coding 
were proposed during the 1990’s [9] - [11]. Interest in this approach has declined 
somewhat, due to the long design period and the lack of flexibility to cater for 
improvements in ever evolving video standards. Research has therefore recently 
returned to programmable architecture approaches. However, these new 
architectures [12], [13] tend to concentrate on using specialised additional coprocessors 
for tasks that require large computational resources, such as motion estimation.
The research described in this work is much in line with current proposals that attempt 
to provide an efficient motion estimation coprocessor to supplement the main video 
coding processor. The goal is to use compact analogue computation circuits in place of 
large digital arithmetic units, with a corresponding reduction of implementation area 
and power dissipation. These reductions are essential for power- and area-constrained 
uses such as portable applications. The realisation of fully integrated analogue 
computation circuits has been studied for many years [14], [15], but it is clear from the 
literature that this technique has not been widely adopted for motion estimation 
processors. This is understandable, since the use of analogue techniques for motion 
estimation has not been greatly explored and characterised. Further, the system 
represents a somewhat challenging circuit design. It is hoped that the work presented in 
the following chapters, which explores the issues relating to the realisation of an 
analogue motion estimation processor in Complementary Metal Oxide Semiconductor 
(CMOS) technology, will provide the basis for future implementations.
1.2 Overview of Thesis
The remaining chapters of this thesis are organised as follows:
Digital video coding principles are described in Chapter 2. The chapter describes 
techniques used to compress individual images and sequences. Various coding 
standards are outlined including their structure and performance.
In Chapter 3, motion estimation algorithms are discussed. The matching functions used 
in motion estimation are outlined. The most important search methods are considered 
with particular regard to video coding standards.
The characterisation of analogue techniques for motion estimation is examined in 
Chapter 4. Numerous simulations are presented taking into account various limitations 
imposed by the use of analogue circuits.
The various analogue motion estimation processor architectures are described in 
Chapter 5, summarising performance with regard to processing rate and efficiency. An 
improved circuit technique for reducing degradation in performance due to circuit 
parameter variations is considered. Further simulations are provided. Finally a brief 
review of circuits used for matching functions is presented and a suitable CMOS circuit 
implementation is derived.
The circuits used in the various sub-systems of the analogue motion estimation 
processor are described in Chapter 6. Particular emphasis is placed on the circuit used to 
implement the matching function.
The main developments and measured results are described in Chapter 7. A comparison 
is made with recent digital implementations in the literature and an analogue motion 
estimation processor used as part of a much larger system.
Finally, Chapter 8 presents the conclusions of the thesis and recommendations for future 
directions of the research.
1.3 Statement of Originality
The following aspects can be identified as significant contributions of this thesis to the 
understanding of analogue circuit techniques with application to low power block 
motion estimation for video coding:
• The effect of limited accuracy and dynamic range available using CMOS 
analogue circuit techniques on motion estimation performance has been 
investigated. It has been shown that using the inherent square-law characteristics 
of the MOS transistor in saturation, the mean square error (MSE) computation 
can be conveniently and efficiently realised. It was found that a pixel accuracy 
of only 6 bits was required using the mean square error computation to achieve a 
similar performance to 8-bit pixels for the mean absolute error computation. 
Though both the mean square error and mean absolute error (MAE) computation 
result in large dynamic range, only a limited range is actually used for motion 
estimation in video sequences.
• It is shown in this work that the reduction in pixel accuracy has only a minor 
degradation effect on motion estimation performance for large block sizes used 
in video coding standards. However the degradation increases with reduction in 
block size. The moderate square-law conformance attainable from analogue 
circuits only marginally affects performance.
• The architecture introduced in this work computes the entire mean square error 
between blocks of pixels in parallel, while each successive block within a search 
window is considered in serial. The architecture makes use of redundancy to 
enable a similar computation rate to that obtained when calculating the entire 
search window in parallel. This has two main consequences, those being a 
significant reduction in implementation area and the flexibility to use any 
desired search window size.
• The use of analogue circuit techniques will introduce errors due to the inevitable 
variations in circuit parameters. The major sources contributing to the
degradation in motion estimation performance are identified and a circuit 
technique is introduced to significantly reduce such errors.
• Measured results obtained from fabricated integrated circuits are presented. It is 
shown that with the combined analogue circuit technique and architecture 
introduced, significant improvements in implementation size and power 
dissipation are achieved compared to conventional digital implementations.
A number of publications have resulted from the work presented in this thesis:
1. M. Panovic and A. Demosthenous, “Analogue Implementation of Mean 
Square Error Function For Use In Motion Estimation,” Proceedings Prep 
2003, Exeter, UK, April 2003, pp. 163-164.
2. M. Panovic and A. Demosthenous, “Effects of Reduced Pixel Precision on 
Video Motion Estimation,” Proceedings IEEE London Communication 
Symposium 2003, Sept., pp. 141-143.
3. M. Panovic and A. Demosthenous, “Versatile Analogue Motion Estimator 
Architecture,” Proceedings IEEE International Symposium on Image and 
Signal Processing and Analysis 2003, Sep., pp. 986-990.
4. M. Panovic and A. Demosthenous, “Compact CMOS Linear Transconductor 
and Four Quadrant Analogue Multiplier,” Proceedings IEEE International 
Symposium on Circuits and Systems 2004, May, vol. 1, pp. 685-688.
5. M. Panovic and A. Demosthenous, “A Compact Block-matching Cell for 
Analogue Motion Estimation Processors,” Proceedings IEEE International 
Symposium on Circuits and Systems 2004, May, vol. 2, pp. 229-232.
6. M. Panovic and A. Demosthenous, “A Compact Block-Matching Cell For 
Analogue Motion Estimation Processors,” accepted for publication in 
International Journal of Electronics.
Chapter 2
Video Coding Fundamentals
2.1 Introduction
Video coding techniques are used to compress data by taking advantage of 
redundancies in visual information. Data redundancy in image sequences tends to be 
caused by spatial redundancy within a frame and temporal redundancy between 
successive frames.
Compression techniques that use spatial correlation and temporal correlation are 
known as intraframe coding and interframe coding respectively. The two 
techniques are usually applied separately. After applying them, redundancy 
between compressed symbols can be further reduced by using entropy coding, which 
is based on information theory. The video coding techniques mentioned so far are all 
lossless. Further reduction may be obtained by selectively removing spatio-temporal 
visual information to which the human visual system is less sensitive; this 
technique is described as lossy. The unnecessary visual information is discarded prior 
to entropy coding.
The following sections will describe each of the coding methods in detail and outline 
the main video coding standards.
2.2 Digital Video Representation
A standard representation of video information is required to compress data. To that 
end, the most popular standard representations for digital video are considered in the 
following section.
2.2.1 Digitisation
The video signal is proportional to the light captured by the sensor and is therefore 
analogue. The process of digitising analogue video consists of three basic stages: 
spatial sampling, temporal sampling and quantisation. A block diagram 
outlining the process is shown in Fig. 2.1.
Figure 2.1 General block diagram of video digitiser system. (Blocks: red/green/blue 
filter, sensor, raster scan, clock, quantiser; output: digitised video.)
2.2.2 Spatial Sampling
Sampling a finite number of points over the viewing area or frame is referred to as 
spatial sampling. The points are sampled through a process called raster scanning. The 
scan can be progressive or interlaced, as illustrated in Fig. 2.2. The 
progressive scan, also known as non-interlaced, samples consecutive lines in a frame 
in order, from left to right and top to bottom. The interlaced scan divides the 
lines into odd and even sets and samples the odd lines in a frame first, followed 
by the even lines, again from left to right and top to bottom. 
The odd and even sets of lines each make a field. It should be noted that the odd and 
even fields are sampled and displayed at different times. Thus, for an interlaced scan, 
the period between fields is half the period between frames. The interlaced scan finds 
use in television and the progressive scan in film and computer systems.
(a) Progressive scan (b) Interlaced scan
Figure 2.2 Spatial sampling scanning.
24
2.2.3 Temporal Sampling
The limited response of the human visual system to temporal change can be used to 
create the perception of motion by showing at least 16 frames of video per second
[16]. This has formed the basis of motion picture systems using 24 frames per second 
(fps). Television uses higher temporal sampling rates of 25 fps and 30 fps for Phase 
Alternating Line (PAL) and National Television System Committee (NTSC) 
systems, respectively.
2.2.4 Quantisation
Following spatial and temporal sampling, the sequence of analogue video signals is 
digitised through the process of quantisation. This essentially involves representing a 
continuous signal by a finite number of levels, each referred to by a digital code. The 
process is generally known as pulse code modulation (PCM). In practice this is 
realised using an analogue-to-digital converter (ADC). The subject of quantisation 
will be considered in greater detail later in this work.
2.2.5 Digital Video Formats
Digital video formats are numerous; the only feature they have in common is the colour 
representation. A summary of the parameters for various formats is shown in Table 
2.1, including the Common Intermediate Format (CIF) and quarter CIF (QCIF) used for 
the H.26-(1,3) video conferencing standards. The Source Intermediate Format (SIF) is 
used in MPEG-1. The format for MPEG-2 is also the standard picture format for digital 
television.
Format   Frame size       Colour     Scan          Frame rate   Raw data
                          sampling                 (fps)        (mbps)
MPEG-2: high quality video distribution (DVD, SDTV), 4 - 8 mbps
         720 x 480/576    4:2:0      Interlaced    30/25        124
MPEG-1: intermediate quality video distribution (VCD, WWW), 1.5 mbps
SIF      352 x 240/288    4:2:0      Progressive   30/25        30
H.261, H.263: video conferencing over Internet, 128 - 384 kbps
CIF      352 x 288        4:2:0      Progressive   30           37
H.263: video telephony, 20 - 64 kbps
QCIF     176 x 144        4:2:0      Progressive   30           9.1
Table 2.1 Digital video formats.
The colour representation used is YCrCb. The Y component refers to luminance while 
Cb and Cr are the chrominance components. The components are defined as follows
[17]:
Y = 0.257R + 0.504G + 0.098B + 16
Cr = 0.439R - 0.368G - 0.071B + 128
Cb = -0.148R - 0.291G + 0.439B + 128
The components are derived from the three primary camera colours red (R), blue (B) 
and green (G), known as the RGB signal.
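As an illustration only, the three definitions above can be evaluated directly. The sketch below assumes 8-bit R, G and B inputs in the range 0 to 255 and rounds the results to integers; the function name rgb_to_ycrcb is hypothetical and not taken from this work.

```python
def rgb_to_ycrcb(r, g, b):
    """Convert 8-bit R, G, B values to (Y, Cr, Cb) using the coefficients quoted above."""
    y = 0.257 * r + 0.504 * g + 0.098 * b + 16
    cr = 0.439 * r - 0.368 * g - 0.071 * b + 128
    cb = -0.148 * r - 0.291 * g + 0.439 * b + 128
    # Rounding to the nearest integer is an assumption made for this sketch.
    return round(y), round(cr), round(cb)


black = rgb_to_ycrcb(0, 0, 0)
white = rgb_to_ycrcb(255, 255, 255)
```

With these coefficients, black and white carry no colour information, so both chrominance components sit at the mid-level offset of 128.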
The use of chrominance subsampling reduces spatial redundancy. Applications that 
require higher quality, such as video production, use 4:2:2 subsampling, in which the 
chrominance components are sampled, relative to the luminance component, only at 
every alternate pixel, as shown in Fig. 2.3 (a). Consumer applications mainly use 
4:2:0 subsampling, in which the chrominance components are sampled only at every 
alternate pixel of every alternate horizontal line, as shown in Fig. 2.3 (b).
Figure 2.3 Colour subsampling showing the YCrCb components used for each pixel: 
(a) 4:2:2, (b) 4:2:0.
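The two subsampling patterns can be sketched as simple decimation of a chrominance plane. This is a minimal illustration that assumes plain sample dropping with no prior filtering, which practical systems may not do; the function names are hypothetical.

```python
def subsample_422(chroma):
    """Keep the chrominance sample at every alternate pixel of every line."""
    return [row[::2] for row in chroma]


def subsample_420(chroma):
    """Keep the chrominance sample at every alternate pixel of every alternate line."""
    return [row[::2] for row in chroma[::2]]


# A 4 x 4 chrominance plane used purely as example data.
plane = [[10, 11, 12, 13],
         [20, 21, 22, 23],
         [30, 31, 32, 33],
         [40, 41, 42, 43]]
half_width = subsample_422(plane)      # 4 lines of 2 samples
quarter_size = subsample_420(plane)    # 2 lines of 2 samples
```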
2.2.6 The Basis for Video Compression
The rationale behind video compression becomes apparent by considering the 
application of QCIF format from Table 2.1 for video conferencing over the plain old 
telephone system (POTS) using a modem. Given that the uncompressed video is 
encoded in QCIF at 30 fps, the uncompressed bit rate is:
(176 • 144 + 2 • 88 • 72) • 8 bits • 30 fps = 9.1 mbps (mega-bits-per-second)
In order to transmit this format using a 56 kbps (kilo-bits-per-second) modem, a 
compression of 163:1 would be required, almost certainly resulting in a large reduction 
of quality.
Considering a more familiar application, that of MPEG-2 for SDTV, the 
uncompressed bit rate is:
(720 • 480 + 2 • 360 • 240) • 8 bits • 30 fps = 124 mbps
The transmission bandwidth of 4 mbps therefore requires a compression ratio of 31:1. 
To allow for soundtrack, text and control signals an even higher compression ratio is 
required.
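The arithmetic behind both figures can be checked with a short sketch. It assumes 4:2:0 sampling (two chrominance planes at half the width and half the height of the luminance plane) and 8 bits per sample, as in the expressions above; the function name raw_bit_rate is hypothetical.

```python
def raw_bit_rate(width, height, fps, bits_per_sample=8):
    """Uncompressed bit rate in bits per second for 4:2:0 video."""
    luma_samples = width * height
    chroma_samples = 2 * (width // 2) * (height // 2)
    return (luma_samples + chroma_samples) * bits_per_sample * fps


qcif_rate = raw_bit_rate(176, 144, 30)    # about 9.1 mbps
sdtv_rate = raw_bit_rate(720, 480, 30)    # about 124 mbps

# Compression ratios needed for a 56 kbps modem and a 4 mbps channel.
qcif_ratio = qcif_rate / 56_000
sdtv_ratio = sdtv_rate / 4_000_000
```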
2.2.7 Video Quality Measure
Digital video processing will invariably introduce distortions. Therefore it is 
necessary to define an objective assessment to measure the amount of distortion 
between the original and processed video signal. The measure generally used in the 
literature is the peak signal-to-noise ratio (PSNR) [18], which is expressed in 
decibels (dB). The PSNR is 
defined as follows:
$$\mathrm{PSNR} = 10 \log_{10} \frac{(\text{peak-to-peak value of original data})^2}{\mathrm{MSE}} \qquad (2.1)$$
where the MSE is defined as:
$$\mathrm{MSE} = \frac{1}{M \times N} \sum_{i} \sum_{j} \left[ f(i,j) - g(i,j) \right]^2 \qquad (2.2)$$
where f(i,j) is the original frame and g(i,j) is the reconstructed frame. The frame size 
is M by N pixels, i and j refer to pixel coordinates, and the sums are taken over all 
pixels in the frame.
It should be noted that the PSNR does not always correlate well with the perceived 
distortion. It is quite possible to have an image that has been processed in different 
ways albeit with the same PSNR, nevertheless the resulting images are quite different 
in appearance. However, as already mentioned the PSNR is almost exclusively used. 
Therefore, to enable comparison with previous work the PSNR will be used 
throughout this work. Distortion measures that attempt to more accurately assess 
video quality can be found in [19] and [20].
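A minimal sketch of the PSNR computation of equations (2.1) and (2.2) is given below. It assumes frames stored as lists of rows and a peak-to-peak value of 255 for 8-bit data; both are assumptions made for illustration, and the function names are hypothetical.

```python
import math


def mse(original, reconstructed):
    """Mean square error between two frames of equal size (equation 2.2)."""
    rows, cols = len(original), len(original[0])
    total = sum((original[i][j] - reconstructed[i][j]) ** 2
                for i in range(rows) for j in range(cols))
    return total / (rows * cols)


def psnr(original, reconstructed, peak=255):
    """PSNR in dB (equation 2.1); returns infinity for identical frames."""
    error = mse(original, reconstructed)
    if error == 0:
        return math.inf
    return 10 * math.log10(peak ** 2 / error)


# Example data: two of the four pixels differ by one grey level.
frame = [[100, 100], [100, 100]]
noisy = [[101, 99], [100, 100]]
quality_db = psnr(frame, noisy)
```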
2.3 Spatial Redundancy
The strong correlation among neighbouring pixels within a picture can be exploited to 
reduce spatial redundancy using intraframe coding as shown in Fig. 2.4. The image is 
divided into blocks which are then transform coded using the Discrete Cosine 
Transform (DCT). The transform coding maps the pixels into the transform domain as 
coefficients corresponding to spatial frequency components. Since the energy of most 
natural scenes is concentrated at low frequencies, the insignificant coefficients may be 
eliminated using a quantisation stage. The accuracy of quantisation applied to each 
coefficient is chosen to produce the least visual artefacts. Finally, entropy coding is 
applied to the coefficients. The reverse process is used to recover the original image.
Figure 2.4 Intraframe compression and decompression. (Intraframe encoder: image 
source data, 8 x 8 pixel block, discrete cosine transform, quantiser, entropy encoder, 
compressed image data. Intraframe decoder: compressed image data, entropy decoder, 
dequantiser, inverse discrete cosine transform, 8 x 8 pixel block, reconstructed 
image data.)
The general intraframe coding principle outlined so far is common to many image and 
video standards. The process is used by the JPEG (Joint Photographic Experts Group) 
standard for compression of still images as well as in video standards such as 
H.26-(1,3) and MPEG-(1,2,4).
2.3.1 The Discrete Cosine Transform
The DCT block transform code has been successfully used by JPEG and as a result 
adopted by H.26-(1,3) and MPEG-(1,2,4). The DCT is used to analyse spatial 
correlation in an image. The transform breaks down 8 x 8 pixel blocks of the image 
into components of different spatial frequencies. Thus, a block of 64 pixels is 
transformed into 64 coefficients. The original image can be recovered by the decoder 
using the Inverse Discrete Cosine Transform (IDCT). There are primarily two reasons 
for choosing the DCT:
• DCT coefficients are relatively uncorrelated [21], therefore they can be coded 
independently, simplifying entropy based algorithms for compressing 
coefficient values.
• DCT approximately decomposes image blocks into their underlying spatial 
frequencies. As a consequence the components may be quantised at a variable 
precision appropriate to human visual perception.
When the DCT is applied to natural images, the image energy is concentrated into a 
few coefficients and the rate-distortion performance is similar to the Karhunen-Loeve 
transform that is generally considered optimum [22]. A number of fast DCT 
transforms are available [23] enabling the realisation of efficient software 
implementation. The two dimensional DCT used by video coding standards can be 
expressed as:
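$$F(u,v) = \frac{C(u)\,C(v)}{4} \sum_{x=0}^{7} \sum_{y=0}^{7} f(x,y) \cos\left[\frac{(2x+1)u\pi}{16}\right] \cos\left[\frac{(2y+1)v\pi}{16}\right] \qquad (2.3)$$
where f(x,y) is a two dimensional array of samples, u and v are the horizontal and 
vertical frequency indices, and the constants C(u) and C(v) are given by:
$$C(u) = \frac{1}{\sqrt{2}} \ \text{if } u = 0, \qquad C(u) = 1 \ \text{if } u > 0$$
with C(v) defined in the same way.
To recover the original samples, the IDCT is performed using the following:
$$f(x,y) = \frac{1}{4} \sum_{u=0}^{7} \sum_{v=0}^{7} C(u)\,C(v)\,F(u,v) \cos\left[\frac{(2x+1)u\pi}{16}\right] \cos\left[\frac{(2y+1)v\pi}{16}\right] \qquad (2.4)$$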
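The forward and inverse transforms can be sketched by evaluating the double sums directly. This is a deliberately slow reference form written for clarity, not one of the fast algorithms of [23]; it assumes the usual 8 x 8 scaling of 1/4 with C(0) = 1/sqrt(2), and the function names dct2 and idct2 are hypothetical.

```python
import math

N = 8  # block size used by the video coding standards


def c(k):
    """Normalisation constant C(k)."""
    return 1 / math.sqrt(2) if k == 0 else 1.0


def basis(n, k):
    """Cosine basis term cos[(2n + 1) k pi / 16]."""
    return math.cos((2 * n + 1) * k * math.pi / (2 * N))


def dct2(block):
    """Forward 8 x 8 DCT by direct evaluation of the double sum."""
    return [[c(u) * c(v) / 4 * sum(block[x][y] * basis(x, u) * basis(y, v)
                                   for x in range(N) for y in range(N))
             for v in range(N)] for u in range(N)]


def idct2(coeffs):
    """Inverse 8 x 8 DCT by direct evaluation of the double sum."""
    return [[sum(c(u) * c(v) * coeffs[u][v] * basis(x, u) * basis(y, v)
                 for u in range(N) for v in range(N)) / 4
             for y in range(N)] for x in range(N)]


# A flat block has all of its energy in the single DC coefficient F(0, 0).
flat = [[100] * N for _ in range(N)]
flat_coeffs = dct2(flat)
```

For a flat block only the DC coefficient is non-zero, which is an extreme case of the energy compaction that makes the transform useful for compression.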
2.3.2 Quantiser
The DCT causes the majority of the image energy to be concentrated into the lower 
frequency components. The remaining energy is distributed over many coefficients, 
each containing a small amount. The compression is facilitated by varying the 
coarseness at which coefficients are quantised. Since human visual perception is 
partly characterised by being less sensitive to distortions at higher frequencies, coarser 
quantisation is applied to the higher frequency coefficients, forcing more coefficients 
to zero thus allowing greater compression.
The precision reduction obtained by quantisation is very important, since lower 
precision results in greater compression. However, the coarser the quantisation step, 
the worse the picture quality. To ensure an acceptable quality, the choice of 
quantisation value is set according to the spatial frequency of the coefficient and the 
human visual perception response to that frequency. The quantisation index for a 
particular frequency coefficient is stored in a look-up table. MPEG defines a default 
table [24] but allows custom tables appropriate to the material to be used. It is 
therefore essential that the decoder receives the look-up table, in order to dequantise 
the data correctly.
Quantisation rounding conventions
Quantisation is essentially a division by the quantisation value. For intraframe coding 
the fractional results are rounded to the nearest integer, with values midway 
rounded up. For interframe coding the results are always rounded down to the 
integer of smaller magnitude. The two methods are illustrated in Fig. 2.5. The wide 
interval around zero for interframe coding is known as a dead zone; this tends to cause 
more non-significant coefficients to become zero, increasing compression. The 
coefficients lost in the dead zone consist mainly of noise.
Figure 2.5 Showing quantisation used for intraframe coding (a) and interframe coding 
(b) with the characteristic dead zone at centre. (Horizontal axes: unquantised DCT 
coefficient.)
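The two rounding conventions can be sketched as follows. The sketch assumes that rounding is applied symmetrically about zero, so that midway values round up in magnitude for intraframe coding and interframe values are truncated towards zero; this symmetry is inferred from the dead zone in Fig. 2.5 rather than stated explicitly, and the function names are hypothetical.

```python
import math


def quantise_intra(coefficient, step):
    """Round to the nearest integer; midway values round up in magnitude."""
    magnitude = math.floor(abs(coefficient) / step + 0.5)
    return magnitude if coefficient >= 0 else -magnitude


def quantise_inter(coefficient, step):
    """Truncate towards zero, producing a dead zone of width 2 * step around zero."""
    return int(coefficient / step)


# Example: with a step of 8, a coefficient of 7 survives intraframe
# quantisation but falls into the interframe dead zone.
intra_level = quantise_intra(7, 8)
inter_level = quantise_inter(7, 8)
```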
2.3.3 Variable Length Coding
The function of the entropy coder is to losslessly encode a sequence of symbols using 
the shortest possible bitstream [25]. Since each symbol has a different probability of 
occurrence, by using shorter bit sequences to encode symbols with a greater 
probability and longer bit sequences for symbols that are less likely to occur, 
compression can be obtained. The optimum code length for a symbol, Ls, is given by:
$$L_s = \log_2 (1 / P_s) \qquad (2.5)$$
where Ps is the probability of a symbol s occurring. The entropy, which is defined as 
the average number of bits per symbol, is given by:
$$\text{entropy} = \sum_{s} P_s \log_2 (1 / P_s) \qquad (2.6)$$
where the sum is taken over all symbols s of the alphabet.
The most popular entropy coding methods used in video coding are Huffman coding 
and arithmetic coding. They are known as fixed-to-variable coding, meaning that each 
symbol is coded by a different length sequence of bits.
The Huffman coding method makes use of the fact that each symbol used to represent 
a coefficient value has a different probability of occurring. Therefore, smaller length 
sequences of bits are used to represent symbols that occur most often. For this method 
to be successful it is important to determine statistically the probability of symbols 
occurring and attribute bit sequence length accordingly.
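The relationship between symbol probabilities, the entropy and Huffman code lengths can be illustrated with a small sketch. The four-symbol alphabet and its probabilities are invented example data, and the function names are hypothetical; the code only derives code lengths and does not produce the bit patterns themselves.

```python
import heapq
import math


def entropy(probabilities):
    """Average number of bits per symbol, from the entropy definition above."""
    return sum(p * math.log2(1 / p) for p in probabilities.values() if p > 0)


def huffman_code_lengths(probabilities):
    """Return {symbol: code length in bits} for a Huffman code."""
    # Each heap entry is (probability, tie-breaker, symbols in this subtree).
    heap = [(p, i, [sym]) for i, (sym, p) in enumerate(probabilities.items())]
    heapq.heapify(heap)
    lengths = {sym: 0 for sym in probabilities}
    counter = len(heap)
    while len(heap) > 1:
        p1, _, syms1 = heapq.heappop(heap)
        p2, _, syms2 = heapq.heappop(heap)
        # Merging two subtrees adds one bit to every symbol they contain.
        for sym in syms1 + syms2:
            lengths[sym] += 1
        heapq.heappush(heap, (p1 + p2, counter, syms1 + syms2))
        counter += 1
    return lengths


probs = {"a": 0.5, "b": 0.25, "c": 0.125, "d": 0.125}
lengths = huffman_code_lengths(probs)
average_length = sum(probs[s] * lengths[s] for s in probs)
```

For this example the probabilities are negative powers of two, so each Huffman code length equals the optimum length of equation (2.5) and the average length equals the entropy.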
Arithmetic coding does not code one symbol at a time like Huffman coding, but codes 
a sequence of symbols at a time. The probability of a sequence of symbols occurring 
is determined and is represented by a bit sequence of certain length. The more likely 
the sequence of symbols the smaller the bit sequence used to represent it. This method 
tends to lead to higher compression than Huffman coding.
The disadvantage of these methods is that when long runs of unlikely symbols occur, 
say from an unusual scene, the data will take up more bandwidth than if it was not 
encoded. This may be remedied by the use of adaptive versions of arithmetic and 
Huffman coding. The coder determines from the image the likelihood of a symbol 
occurring and varies the bit sequence length accordingly. Using this method a table is 
built, which is stored with the image data so that the decoder may reconstruct the 
original symbols. Adaptive arithmetic coding is used by JPEG. For robustness,
Huffman coding is used for the MPEG standard and different tables are used 
depending on the type of scene to be compressed.
The DCT coefficients are entropy coded in zigzag fashion as is shown in Fig. 2.6. 
This is done to order the coefficients from low to high frequency. The low frequency 
components contain the bulk of the energy and are preserved during the quantisation 
stage, while the higher frequency coefficients are more coarsely quantised and tend to 
be zeros. Therefore, long runs of zeros are grouped together, so Run-Length Encoding 
(RLE) is more appropriate for this special case. RLE replaces sequences of identical 
symbols with a single symbol, the run-length (their number) and an indicator that 
represents how the symbol and the run-length should be interpreted.
Figure 2.6 The DCT coefficients are entropy coded in zigzag fashion. (The scan starts 
at the DC coefficient; horizontal axis: increasing horizontal frequency.)
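The zigzag scan and the run-length step can be sketched together. The sketch assumes the conventional zigzag order starting at the DC coefficient, and represents each run as a (number of preceding zeros, non-zero value) pair followed by an "EOB" end-of-block marker; this pairing is one common convention chosen for illustration, not necessarily the exact symbol format of any standard. All names are hypothetical.

```python
def zigzag_order(n=8):
    """Return the (row, column) positions of an n x n block in zigzag scan order."""
    def key(position):
        row, col = position
        diagonal = row + col
        # Odd diagonals are traversed downwards, even diagonals upwards.
        return (diagonal, row if diagonal % 2 else col)
    return sorted(((r, c) for r in range(n) for c in range(n)), key=key)


def run_length_encode(values):
    """Encode a sequence as (zero run, value) pairs followed by an "EOB" marker."""
    pairs, run = [], 0
    for value in values:
        if value == 0:
            run += 1
        else:
            pairs.append((run, value))
            run = 0
    pairs.append("EOB")  # trailing zeros are implied by the end-of-block marker
    return pairs


# Example quantised block: only four low frequency coefficients are non-zero.
block = [[0] * 8 for _ in range(8)]
block[0][0], block[0][1], block[1][0], block[1][1] = 50, -3, 2, 1
scanned = [block[r][c] for r, c in zigzag_order()]
encoded = run_length_encode(scanned)
```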
2.3.4 Vector Quantisation
Vector quantisation [26] works by segmenting an image into equally sized blocks of 
pixels. The blocks are represented by vectors called codewords, that are chosen from 
a finite sized codebook. The process is quite similar to the quantisation described in
section 2.3.2, however instead of scalar values, vectors are quantised. The larger the 
codebook, the better the distortion performance, although encoding becomes more 
computationally intensive. The encoder requires large computational resources while 
the decoder only requires a look-up table.
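The asymmetry between encoder and decoder can be seen in a minimal sketch. It assumes a squared-error distance and an exhaustive codebook search, and uses an invented three-entry codebook of 2 x 2 pixel blocks flattened into vectors; the names are hypothetical.

```python
def squared_distance(a, b):
    """Sum of squared differences between two vectors of equal length."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def vq_encode(vector, codebook):
    """Exhaustively search the codebook and return the index of the closest codeword."""
    return min(range(len(codebook)),
               key=lambda i: squared_distance(vector, codebook[i]))


def vq_decode(index, codebook):
    """Decoding is a single table look-up."""
    return codebook[index]


# Example codebook: dark, mid-grey and bright 2 x 2 blocks.
codebook = [(0, 0, 0, 0), (128, 128, 128, 128), (255, 255, 255, 255)]
index = vq_encode((120, 130, 125, 140), codebook)
approximation = vq_decode(index, codebook)
```

The encoder cost grows with the codebook size because every codeword is compared, whereas the decoder cost does not.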
During the late 1980s, fifteen block-based videoconferencing proposals were put 
forward to the ITU in the study period prior to the H.261 standard. The DCT was the 
basis of fourteen of them, while vector quantisation was the basis of the remaining 
one. At the same time JPEG chose the DCT due to the possibility of progressive image 
transmission. This, combined with a substantial effort into realising DCT integrated 
circuits [27], [28], most likely influenced the adoption of the DCT for H.261. The DCT 
is the basis of most video standards.
2.4 Temporal Redundancy
A video sequence typically contains highly correlated successive frames for scenes 
that have little motion. Techniques that reduce spatial redundancy operate only on 
individual images. Techniques that exploit temporal correlation to reduce temporal 
redundancy are known as interframe coding and are examined in this section.
2.4.1 Frame Differencing
Temporal redundancy may be exploited by simply differencing frames. This is an 
extension of the basic DPCM coding technique [29]. The frame differencing coder is 
shown in Fig. 2.7.
Figure 2.7 Frame difference encoder. (Blocks: frame encoder, frame decoder, frame 
store; signals: input, output.)
Interframe coding reduces temporal redundancy by coding the difference between 
successive frames rather than coding the whole frame. In parts of the image that are 
static, the differences will be zero. Therefore only the parts of the image having 
motion need be coded, as a consequence compressing the image data. A shortcoming 
of this approach occurs when large areas of the image contain motion. This effect can 
be significantly reduced if the motion can be estimated and the difference taken on the 
motion compensated image rather than on the original image. An error frame for the 
“Susie” test image sequence produced by simply differencing successive frames is 
shown in Fig. 2.8 and a motion compensated difference error frame in Fig. 2.9. It is 
clear that the motion compensated difference error frame has significantly less error.
Figure 2.8 Difference error frame for the “Susie” test image.
Figure 2.9 Motion compensated difference error frame for the “Susie” test image.
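Simple frame differencing can be sketched in a few lines. The sketch assumes frames stored as lists of rows of integer pixel values and ignores the encoding of the difference itself; the function names are hypothetical.

```python
def frame_difference(current, previous):
    """Pixel-by-pixel difference between the current and previous frames."""
    return [[c - p for c, p in zip(cur_row, prev_row)]
            for cur_row, prev_row in zip(current, previous)]


def reconstruct(previous, difference):
    """Recover the current frame by adding the difference to the previous frame."""
    return [[p + d for p, d in zip(prev_row, diff_row)]
            for prev_row, diff_row in zip(previous, difference)]


# Example data: only the bottom-right pixel changes between frames.
previous_frame = [[1, 2], [3, 4]]
current_frame = [[1, 2], [3, 9]]
error_frame = frame_difference(current_frame, previous_frame)
```

Static regions produce zeros in the error frame, which is what makes the difference cheaper to code than the frame itself.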
2.4.2 Motion Compensated Prediction
The technique used for removing temporal redundancy in video standards is motion 
compensated prediction. The concept is that although successive frames in a video 
sequence are related to each other, on a pixel to pixel basis they change quite a lot. 
This is shown in Fig. 2.10. The two frames shown in Fig. 2.10 appear to be closely 
related but it is only when placed on top of each other, that it is noticeable that a large 
number of pixels have changed. The change is usually caused because objects have 
moved between frames in natural scenes.
Figure 2.10 Showing movement of objects between frames. (Panels: Frame 1, Frame 2, 
Overlay.)
The DPCM encoder of Fig. 2.7 may be modified to incorporate motion compensation. 
This utilises predictive coding. The resulting generic interframe encoder shown in 
Fig. 2.11, is used by all the video coding standards including H.26-(1,3) and 
MPEG-(1,2,4). The first frame is intraframe encoded to reduce spatial redundancy and is 
known as an I frame. The intraframe coded frame is decoded and stored as a reference 
frame. The second frame is motion estimated with the reference frame held in the 
buffer and a motion compensated frame constructed. The motion is described by a set 
of motion vectors. The error frame resulting from the difference between the second 
frame and the motion compensated reference frame is intraframe coded. This is rather 
aptly known as a predicted frame or P frame. The original frame can be decoded by 
motion compensating the reference frame using the set of motion vectors and adding 
the P frame. Subsequent frames are motion compensated much in the same way, 
however the reference frame can be obtained either from an I or P frame.
Figure 2.11 Generic interframe encoder. (Blocks: frame encoder, frame decoder, frame 
buffer, motion estimation, motion compensation; signals: input, output, motion vector.)
There are many ways to estimate the motion between successive frames. The method 
generally used by video coding standards is block motion estimation. The idea is to 
take a frame and divide it into N x N blocks; the blocks are normally square in shape 
and equally sized. Each block A is searched for in the previous frame until a block B, 
which most closely matches A, is found (Fig. 2.12).
Figure 2.12 Showing block matching process. (Panels: Current Frame, Previous Frame.)
The motion vector is given by the displacement between A and B. This is used to 
reconstruct the current frame from the previous to produce a motion compensated 
frame. If no good match is found, then the block A is coded with intraframe 
techniques. The motion compensated frame is then differenced with the current frame 
to produce an error frame that is stored with a set of motion vectors.
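The block matching process can be sketched as an exhaustive search. The sketch assumes the mean absolute error as the matching function and a square search range of plus or minus search_range pixels around the block position; both are choices made for illustration, and the function and parameter names are hypothetical.

```python
def mean_absolute_error(current, previous, bx, by, dx, dy, n):
    """MAE between the n x n block at (bx, by) in the current frame and the
    block displaced by (dx, dy) in the previous frame."""
    total = 0
    for y in range(n):
        for x in range(n):
            total += abs(current[by + y][bx + x]
                         - previous[by + dy + y][bx + dx + x])
    return total / (n * n)


def full_search(current, previous, bx, by, n, search_range):
    """Return the motion vector (dx, dy) with the lowest MAE, and that MAE."""
    height, width = len(previous), len(previous[0])
    best_vector, best_error = None, float("inf")
    for dy in range(-search_range, search_range + 1):
        for dx in range(-search_range, search_range + 1):
            # Skip candidate blocks that fall outside the previous frame.
            if not (0 <= bx + dx <= width - n and 0 <= by + dy <= height - n):
                continue
            error = mean_absolute_error(current, previous, bx, by, dx, dy, n)
            if error < best_error:
                best_vector, best_error = (dx, dy), error
    return best_vector, best_error
```

As a usage example, if the content of the previous frame moves one pixel right and one pixel down, a block in the current frame is matched by the block displaced by (-1, -1) in the previous frame.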
In practical implementations such as H.26-(1,3) and MPEG-(1,2,4), certain constraints 
are placed on block motion compensation:
• The match for block A is limited within a search window. This is done 
primarily for two reasons. Firstly, comparing every block within a frame is 
computationally expensive, so it is necessary to limit the search. Secondly the 
code for each block includes a motion vector, so the motion vector itself could 
become significantly large to store if the displacement is large.
• The size of the block is determined by a number of factors. If the block is too 
small, say 2 x 2  pixels, then a large number of motion vectors will be 
necessary. The use of large blocks reduces the number of motion vectors, 
however large blocks result in visual artefacts since it is harder to find a good 
match.
In certain instances motion compensating a frame with respect to a previous frame 
alone may not have enough information to correctly compensate a frame. The solution 
is bi-directional motion compensation, which can use either future (backward motion 
estimation) or previous frames (forward motion estimation) or a combination of both. 
As an example of this effect Fig. 2.13 shows an occluded circle moving into view 
from behind a rectangle. In this situation it is not possible to make a prediction for 
frame 2 from either the future or previous frames. This is overcome by using 
information from both future and previous frames and interpolating the best matches 
from both. The overhead of sending two motion vectors is compensated by the 
compression gained. Such frames are referred to as B frames.
Figure 2.13 Bidirectional coding. The prediction is made from the average of the best 
matches from the previous and future frames. (Panels: Frame 1, Frame 2, Frame 3, with 
the bi-directional motion compensated block indicated.)
A B frame is coded with reference to past and future frames, as shown in Fig. 2.14. 
The resulting sequence of I, P and B frames is referred to as a group of pictures (GOP). 
The GOP can be of any length, where the length is defined as the number of frames 
between successive I frames. Since future frames are required, the frames are 
reordered so that all reference frames required to encode the B frame are available. 
The sequence is reordered as exemplified in Fig. 2.15. It should be noted that no frame 
is ever decoded with reference to a B frame. The encoding of a B frame requires 
additional frame buffers, which increases complexity and introduces extra delay. 
Therefore the distance between successive reference frames is kept small for real time 
encoding such as video conferencing. The B frame can be omitted altogether to reduce 
delay. Delay is not, however, an issue for non real time applications such as DVD 
video.
The I, P and B frames used in the GOP offer increasing levels of compression. The 
typically expected performance is shown below:
I frame 7:1
P frame 20:1
B frame 50:1
B frames offer the highest compression level obtainable. They are also the most 
complex to encode and increase the decoder memory requirements. A GOP 
usually consists of 13 frames. In this case an overall compression ratio of about 
26:1 can be achieved.
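The overall figure can be checked with a short calculation. The sketch assumes a 13-frame GOP made up of one I frame, four P frames and eight B frames (the I B B P pattern of Fig. 2.15 extended to 13 frames); this split is an assumption, as the text states only the GOP length. The function name is hypothetical.

```python
def gop_compression_ratio(frame_types, ratios):
    """Overall compression ratio of a GOP, given the ratio for each frame type."""
    # Each frame contributes 1 / ratio of an uncompressed frame.
    compressed_frames = sum(1 / ratios[frame_type] for frame_type in frame_types)
    return len(frame_types) / compressed_frames


ratios = {"I": 7, "P": 20, "B": 50}
gop = "IBBPBBPBBPBBP"    # assumed 13-frame pattern: 1 I, 4 P and 8 B frames
overall = gop_compression_ratio(gop, ratios)    # close to 26
```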
Figure 2.14 Showing sequence of I, B and P frames (numbered 1 to 7). The frames are 
encoded with reference to each other as indicated. This is known as a GOP.
(a) Original frame sequence
Frame Type:     I  B  B  P  B  B  I
Frame Number:   1  2  3  4  5  6  7
(b) Reordered frame sequence
Frame Type:     I  P  B  B  I  B  B
Frame Number:   1  4  2  3  7  5  6
Figure 2.15 Showing reordering of sequence of I, B and P frames, for purpose of 
encoding.
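The reordering in Fig. 2.15 can be sketched as a simple rule: each reference frame (I or P) is moved ahead of the B frames that precede it in display order. The sketch represents frames as (type, number) pairs; this representation and the function name coding_order are assumptions made for illustration.

```python
def coding_order(display_order):
    """Reorder frames so that every B frame follows both of its reference frames."""
    reordered, pending_b = [], []
    for frame in display_order:
        if frame[0] == "B":
            # Hold B frames back until the next reference frame has been emitted.
            pending_b.append(frame)
        else:
            reordered.append(frame)
            reordered.extend(pending_b)
            pending_b = []
    reordered.extend(pending_b)
    return reordered


display = [("I", 1), ("B", 2), ("B", 3), ("P", 4), ("B", 5), ("B", 6), ("I", 7)]
coded = coding_order(display)
```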
2.5 Motion Compensation and Video Coding Standards
The successful worldwide proliferation of any consumer communication system relies 
on international standardisation. This ensures interoperability, resulting in consumer 
confidence. The first application to be considered for digital video compression 
standard was video telephony. This was the H.261 standard published in 1990, 
followed by H.263 in 1995. These standards address only video compression. The 
MPEG-1, MPEG-2 and MPEG-4 standards also describe audio compression, and 
were published in 1993, 1995 and 1999 respectively by the ISO (International 
Standards Organisation). The various video coding standards are based on the generic 
motion compensation interframe encoder shown in Fig. 2.11. A brief overview of 
each o f these video standards is considered.
Interframe coding using only I and P frame motion compensation is used in 
H.261 [30]. The previously decoded frame is used to predict the current frame using 
blocks of 16 x 16 pixels. All three frame types, I, P and B, 
are used in MPEG-1 and MPEG-2 [31, 32]. The techniques used to estimate the motion 
vectors for all the above standards are similar. H.263 [33] and MPEG-4 [34] use 
an advanced prediction motion estimation mode, where 8 x 8 blocks are required for 
motion estimation. This will be described in more detail in section 3.3.2. The main 
advantage of this method is improved PSNR performance. However, coding four 
motion vectors from 8 x 8 blocks results in less compression than coding a single motion 
vector from a 16 x 16 block.
The H.261 standard was developed for video conferencing applications, while the 
improved H.263 was developed for both video conferencing and video telephony. As can be seen in 
Table 2.1, these standards are intended for limited bandwidth transmission. The initial 
aim of MPEG was to define a set of standards for three different compressed bit rate 
bandwidths (Table 2.1). MPEG-1 was intended for a bit rate of 1.5 Mbps and 
applications including VHS (analogue videotape recording format) quality video and 
audio on CD-ROM. The MPEG-2 standard was intended for SDTV broadcast 
purposes using a bandwidth of 4-8 Mbps. Finally, MPEG-3 was intended for high end applications 
such as High Definition Television (HDTV), with a bit rate of 40 Mbps. The MPEG-2 
standard was considered sufficient for HDTV, so MPEG-2 and MPEG-3 were 
merged. Recently, MPEG-4 has been finalised (1999), which is aimed at low bit rate 
applications.
2.6 Conclusions
In this chapter the basic principles that enable digital video coding have been 
described. The most important video coding standards have been outlined. It is quite 
apparent that visual data requires enormous bandwidth. Therefore compression 
through digital video coding is absolutely essential. The use of compression is 
fundamental to all digital video coding standards.
Chapter 3
Motion Estimation
3.1 Introduction
The motion estimation technique used in all video coding standards is the block 
matching algorithm (BMA) that is an essential part of motion compensation. The 
following sections explain motion estimation in detail, covering the technical 
difficulties associated with implementing the method.
3.2 Principles of Block Matching Motion Estimation
The block translational model, developed by Jain and Jain [5], is used in block matching 
motion estimation and is generally used in video coding standards. The model 
divides an image into non-overlapping blocks that are usually square. Each block in 
the current image is modelled as the translation of a similar block from the reference 
frame.
In video coding standards the translation is found using a BMA. To implement the BMA 
the current frame is divided into blocks of N x N pixels as shown in Fig. 3.1. Each 
block from the frame to be predicted is matched against blocks of the same size in the 
reference frame within a search window. The displacement of the best match yields 
the motion vector for a given block. The search window has sides of length N + 2w, where w 
is the maximum displacement from the block boundary. There are (2w + 1)^2 unique 
block positions within a search window. Evaluating every position is known as an exhaustive 
search.
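The exhaustive search described above can be sketched in a few lines. The code below is a minimal illustration rather than the implementation used in this work: the function names, the frame representation (lists of rows of integer pixel values) and the synthetic test frames are assumptions made for the example, and the mean absolute error is used as the matching function.

```python
import random

def mae(cur, ref, bx, by, dx, dy, n):
    """Mean absolute error between the n x n block of `cur` at (bx, by)
    and the block of `ref` displaced by (dx, dy)."""
    total = 0
    for y in range(n):
        for x in range(n):
            total += abs(cur[by + y][bx + x] - ref[by + dy + y][bx + dx + x])
    return total / (n * n)

def exhaustive_search(cur, ref, bx, by, n, w):
    """Evaluate every displacement with -w <= dx, dy <= w and return
    (motion_vector, error) for the best match."""
    height, width = len(ref), len(ref[0])
    best_mv, best_err = None, float("inf")
    for dy in range(-w, w + 1):
        for dx in range(-w, w + 1):
            x0, y0 = bx + dx, by + dy
            # Skip candidates that fall outside the reference frame.
            if x0 < 0 or y0 < 0 or x0 + n > width or y0 + n > height:
                continue
            err = mae(cur, ref, bx, by, dx, dy, n)
            if err < best_err:
                best_mv, best_err = (dx, dy), err
    return best_mv, best_err

# Synthetic example: the current frame is the reference frame displaced so
# that the true motion vector of interior blocks is (dx, dy) = (3, 2).
random.seed(1)
SIZE = 32
ref = [[random.randrange(256) for _ in range(SIZE)] for _ in range(SIZE)]
cur = [[ref[min(y + 2, SIZE - 1)][min(x + 3, SIZE - 1)] for x in range(SIZE)]
       for y in range(SIZE)]
mv, err = exhaustive_search(cur, ref, bx=8, by=8, n=8, w=4)
print(mv, err)
```

With w = 4 the inner test is executed for up to (2w + 1)^2 = 81 candidate positions, which is the count referred to in the text.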
[Figure 3.1: a block of N x N pixels in the predicted frame, the search window (SW) in the reference frame with length and width defined by 2w + N, and the motion vector (MV).]
Figure 3.1 Showing block motion estimation process.
The simplicity of the BMA has resulted in its general adoption for video coding 
standards. However, the block translational model does not take into account any 
scaling or rotation that often occurs, or the dissimilar motion of different objects, resulting 
in a reduction of prediction accuracy. Nevertheless, the technique results in good 
compression. The BMA has been used in video conferencing and telephony standards 
such as H.261 and H.263 and multimedia video standards such as MPEG-1, MPEG-2 and MPEG-4.
3.2.1 Block Matching Functions
Block matching functions are used to compare the similarity of blocks. The best 
match is obtained when the function is either maximised or minimised, depending on 
the matching function used. A number of matching functions have been proposed, 
varying in complexity and efficiency. The normalised cross correlation function (NCF) 
(3.1) gives accurate results [35]; however, the computational requirements are 
large. The performance is optimum only for Gaussian sources. The mean square error 
(MSE) function (3.2) generally offers superior results but also suffers from high 
complexity. The mean absolute error (MAE) function (3.3) provides a similar 
performance to MSE and, since the function does not involve multiplication, the 
complexity is much reduced. For this reason the MAE is favoured in video codecs 
[36]. A further matching function, known as the pixel difference 
classification (PDC) function (3.4), has been proposed [37]; it is computationally simpler but does not 
perform as well as the other matching functions.
Normalised Cross Correlation Function (NCF):
The block that maximises the function corresponds to the best match.
$$\mathrm{NCF}(i,j) = \frac{\sum_{m=1}^{N}\sum_{n=1}^{N} f(m,n)\,g(m+i,n+j)}{\sqrt{\sum_{m=1}^{N}\sum_{n=1}^{N} f^{2}(m,n)}\;\sqrt{\sum_{m=1}^{N}\sum_{n=1}^{N} g^{2}(m+i,n+j)}}, \quad -w \le i,j \le w \qquad (3.1)$$
Mean Square Error (MSE):
The block that minimises the function corresponds to the best match.
$$\mathrm{MSE}(i,j) = \frac{1}{N^{2}}\sum_{m=1}^{N}\sum_{n=1}^{N} \bigl(f(m,n) - g(m+i,n+j)\bigr)^{2}, \quad -w \le i,j \le w \qquad (3.2)$$
Mean Absolute Error (MAE):
The block that minimises the function corresponds to the best match.
$$\mathrm{MAE}(i,j) = \frac{1}{N^{2}}\sum_{m=1}^{N}\sum_{n=1}^{N} \bigl|f(m,n) - g(m+i,n+j)\bigr|, \quad -w \le i,j \le w \qquad (3.3)$$
In (3.1), (3.2) and (3.3) the function f(m,n) represents the current block at co-ordinates 
(m,n) and g(m + i, n + j) represents the corresponding block in the previous frame at 
new co-ordinates (m + i, n + j).
Pixel Difference Classification (PDC):
The block that maximises the function corresponds to the best match.
$$\mathrm{PDC}(i,j) = \sum_{m=1}^{N}\sum_{n=1}^{N} T(m,n,i,j), \quad -w \le i,j \le w \qquad (3.4)$$
$$T(m,n,i,j) = \begin{cases} 1, & \text{if } |f(m,n) - g(m+i,n+j)| < t \\ 0, & \text{otherwise.} \end{cases}$$
The function T(m,n,i,j) is the binary representation of the pixel difference, the 
value 1 or 0 corresponding to a matching or mismatching pixel respectively, and t is the 
threshold value.
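The four matching functions can be compared directly on a pair of blocks. The following sketch is illustrative: the function names are chosen for this example, each block is passed as a list of rows with the candidate displacement already applied, and the NCF assumes that neither block is entirely zero.

```python
from math import sqrt

def _pairs(f, g):
    """Yield co-located pixel pairs from two equally sized blocks."""
    for f_row, g_row in zip(f, g):
        yield from zip(f_row, g_row)

def ncf(f, g):
    """Normalised cross correlation (maximised by the best match).
    Assumes neither block is all zeros."""
    num = sum(a * b for a, b in _pairs(f, g))
    den = (sqrt(sum(a * a for a, _ in _pairs(f, g)))
           * sqrt(sum(b * b for _, b in _pairs(f, g))))
    return num / den

def mse(f, g):
    """Mean square error (minimised); one multiplication per pixel."""
    return sum((a - b) ** 2 for a, b in _pairs(f, g)) / (len(f) * len(f[0]))

def mae(f, g):
    """Mean absolute error (minimised); no multiplications."""
    return sum(abs(a - b) for a, b in _pairs(f, g)) / (len(f) * len(f[0]))

def pdc(f, g, t):
    """Pixel difference classification (maximised): the number of pixels
    whose absolute difference is below the threshold t."""
    return sum(1 for a, b in _pairs(f, g) if abs(a - b) < t)

f = [[10, 20], [30, 40]]   # current block
g = [[12, 20], [27, 44]]   # candidate block from the previous frame
print(mae(f, g), mse(f, g), pdc(f, g, t=3), ncf(f, f))
```

For these two blocks the pixel differences are -2, 0, 3 and -4, so the MAE is 9/4, the MSE is 29/4, and two pixels are classed as matching for t = 3.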
3.3 Advanced Motion Compensation Techniques
The BMA can produce block artefacts that are subjectively disconcerting, since the 
image quite visibly breaks into blocks. The block artefacts are caused by the block 
translational model failing to predict the motion accurately. This tends to occur when 
motion contains scaling or rotation or more than one direction of motion. Several 
techniques used in video coding standards to improve prediction are presented in this 
section.
3.3.1 Half-pixel Accuracy
The step size used in the BMA need not necessarily be an integer. To obtain a more 
accurate block match, a fractional pixel accuracy search can be used. This 
presents a difficulty when, as in most cases, the pixel 
samples are at integer accuracy. This is generally overcome by interpolating between the 
points available. Since half-pixel accuracy is used in H.263, MPEG-2 and MPEG-4, this 
particular instance of fractional pixel accuracy search will be considered further. It is 
generally accepted that the half-pixel accuracy search provides significant 
improvements in motion estimation accuracy over integer pixel search, particularly 
for low resolution video.
Interpolating eight sub-pixel positions from the available sample points, as shown in 
Fig. 3.2, produces the half-pixel accuracy. An effective method is the use of bilinear 
interpolation. The eight sub-pixel positions are calculated after the best match has been 
found by the integer pixel search. The eight interpolated positions for a single pixel can be 
described by:
Sub-pixels h, either side of pixel A, are calculated using
$$h = \frac{A + X}{2} \qquad (3.6)$$
and the top and lower sub-pixel positions v using
$$v = \frac{A + Y}{2} \qquad (3.7)$$
The remaining four corner sub-pixels c can be calculated by
$$c = \frac{A + X + Y + Z}{4} \qquad (3.8)$$
where X, Y and Z are the horizontally, vertically and diagonally adjacent integer pixels 
shown in Fig. 3.2. Integer division is used in all cases and the remainder is discarded.
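The interpolation of the eight half-pixel positions around one integer pixel can be sketched as follows. This is a minimal illustration: the function name, the dictionary keyed by half-pixel offsets and the small test frame are assumptions made for the example, and the pixel A is assumed not to lie on the frame border.

```python
def half_pixel_values(frame, x, y):
    """Return the eight half-pixel values around the integer pixel A at
    (x, y), keyed by the offset (sx, sy) in half-pixel units.
    (x, y) must be an interior pixel so that all neighbours exist."""
    a = frame[y][x]
    values = {}
    for sy in (-1, 0, 1):
        for sx in (-1, 0, 1):
            if sx == 0 and sy == 0:
                continue  # this is the integer pixel A itself
            x_n = frame[y][x + sx]        # horizontal neighbour X
            y_n = frame[y + sy][x]        # vertical neighbour Y
            z_n = frame[y + sy][x + sx]   # diagonal neighbour Z
            if sy == 0:
                values[(sx, sy)] = (a + x_n) // 2              # h
            elif sx == 0:
                values[(sx, sy)] = (a + y_n) // 2              # v
            else:
                values[(sx, sy)] = (a + x_n + y_n + z_n) // 4  # c
    return values

frame = [[10, 20, 30],
         [40, 50, 60],
         [70, 80, 90]]
sub = half_pixel_values(frame, 1, 1)
print(sub[(1, 0)], sub[(0, -1)], sub[(1, 1)])
```

Integer division (`//`) discards the remainder, as stated in the text. For non-negative pixel values this is the same as truncation.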
The use of half-pixel search increases the motion vector overhead and the 
complexity; however, this is considered worthwhile. The error reduction is evident 
from a comparison of the error frames of the integer-pixel (Fig. 3.3) and half-pixel 
(Fig. 3.4) accuracy search, for two consecutive frames of the “Susie” test video 
sequence.
[Figure 3.2 legend: centre integer pixel A; surrounding integer pixels X, Y and Z; sub-pixel positions h, v and c.]
Figure 3.2. The sub-pixel search positions are interpolated from the positions 
surrounding pixel A.
Figure 3.3 Showing for “Susie” test image an integer-pixel search motion 
compensated difference error frame.
Figure 3.4 Showing for “Susie” test image a half-pixel search motion compensated 
difference error frame.
3.3.2 Overlapping Block Motion Compensation
The use of block matching motion compensation can result in an image containing 
discernible block artefacts. This may be overcome to a certain extent with the use of 
Overlapping Block Motion Compensation (OBMC) [38], [39]. This essentially works 
by combining the motion estimates for the current and neighbouring blocks to derive 
the predicted value for a pixel.
The OBMC is applied to the luminance component. The block motion estimation is 
applied using blocks of 8 x 8 size as described in section 2.5, so that each block is 
associated with one motion vector. The OBMC neighbourhood consists of the four blocks 
adjoining the current block as shown in Fig 3.5. The neighbouring blocks are to the 
left and right of the current block and at the top and bottom. The prediction found for 
each pixel is the weighted average of the prediction for the current block and the 
predictions obtained using the motion vectors from the two adjacent neighbouring blocks 
within the current block's neighbourhood. This may be exemplified by considering the prediction for pixels in 
the upper right 4 x 4 quadrant as used in the H.263 video coding standard. The pixels 
in this quadrant are formed from the weighted average of the blocks predicted by the 
motion vectors for the current block and the neighbouring blocks to the top and right. 
The weighting given to the current block decreases away from its centre, while the 
weighting given to the blocks to the left and right increases with horizontal distance from the centre and 
that given to the upper and lower blocks increases with vertical distance from the centre. The exact weighting has been 
chosen to enable fast computation [40].
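The weighted average can be illustrated with a short sketch. The weights used below are hypothetical and chosen only so that the three weights sum to 8 for every pixel, which allows the division to be replaced by a shift; they are not the weights of the H.263 standard, and the rounding offset is likewise an assumption. Each argument is the 8 x 8 prediction of the current block's pixels obtained with the motion vector of the named block.

```python
def illustrative_weights(x, y):
    """Hypothetical weights (current, vertical, horizontal) summing to 8.
    The neighbour weights grow towards the block edges, as in the text."""
    w_vert = 2 if y in (0, 7) else 1
    w_horz = 2 if x in (0, 7) else 1
    return 8 - w_vert - w_horz, w_vert, w_horz

def obmc_block(p_cur, p_top, p_bottom, p_left, p_right):
    """Combine five 8 x 8 predictions into one overlapped prediction."""
    out = [[0] * 8 for _ in range(8)]
    for y in range(8):
        for x in range(8):
            # Each 4 x 4 quadrant uses its nearest vertical and horizontal
            # neighbour, e.g. top and right for the upper right quadrant.
            vert = p_top if y < 4 else p_bottom
            horz = p_left if x < 4 else p_right
            w_c, w_v, w_h = illustrative_weights(x, y)
            acc = w_c * p_cur[y][x] + w_v * vert[y][x] + w_h * horz[y][x]
            out[y][x] = (acc + 4) >> 3   # divide by 8 with rounding
    return out

def flat(value):
    """Helper for the example: an 8 x 8 block of constant value."""
    return [[value] * 8 for _ in range(8)]

result = obmc_block(flat(80), flat(160), flat(0), flat(0), flat(40))
print(result[0][7], result[3][3])
```

Because the weights sum to a power of two, the average needs only additions, small multiplications and a shift, which is one way in which a weighting can be made fast to compute.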
[Figure 3.5: the current predicted block and the top, right, left and lower neighbouring predicted blocks, with motion vectors MV1 to MV5. MV refers to the motion vector.]
Figure 3.5 Showing the use of Overlapping Block Motion Compensation (OBMC), 
with system of weighted average of current and four neighbouring blocks.
The increase in the number of motion vectors consequently requires more bits to 
encode. However, the improvement in prediction accuracy is significant. The 
improvement can be up to 1 dB when the OBMC is used with the standard BMA [22]. 
The performance is such that the technique is incorporated in H.263 as an advanced 
option.
3.3.3 Unrestricted Motion Vectors
In certain instances such as pixels near the edges of a frame, it is sometimes desirable 
for motion vectors to point outside the frame as shown in Fig 3.6, clearly resulting in 
more accurate prediction.
Figure 3.6 Showing the use of unrestricted motion vectors. The motion vector points 
outside the frame enabling better motion prediction.
3.4 Motion Estimation Complexity
Motion estimation is used at the video encoding stage to describe motion that has 
occurred between successive frames using motion vectors. As a result, the video 
encoding phase is significantly more computationally demanding than the video 
decoding, and the compression can be described as asymmetrical. As a consequence, for 
media applications such as MPEG-1 and MPEG-2, where only video decoding is required by 
the consumer, the video playback is much more easily implemented. However, for 
applications such as video conferencing and video telephony, the motion estimation 
processor used in the video codec is complex and requires special consideration.
Indeed the motion estimation process has been shown to account for up to 60 % of the 
computations required in video coding [36], [41]. The motion estimation is shown in 
relation to the other compression processes used in video coding in Fig. 3.7. It can be 
observed that the DCT/IDCT transforms, used to reduce spatial redundancy in the 
intraframe part of the video codec, use less than half the computational resource of the 
motion estimator. Entropy coding and other miscellaneous functions account for only 15 % 
of the total complexity. Such computational resource contributes significantly to the 
total power consumption of the video coding system. In practical realisations, 
there is usually a trade off between picture quality and power consumption.
[Figure 3.7: chart segments are Motion Estimation; DCT/IDCT; Quantiser; Entropy, Miscellaneous.]
Figure 3.7 Showing resource usages in video codec.
3.4.1 Computational Complexity
The complexity of motion estimation algorithms is defined by three main factors:
• Block matching function
• Search range
• Search method
The larger the search window the better the match will be, particularly for image 
sequences that have large motion. The disadvantage in this is that the computational 
requirements increase significantly as is shown in Fig. 3.8.
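The growth in complexity with the search range follows directly from the (2w + 1)^2 candidate positions of the exhaustive search. The sketch below tabulates this count and, as an additional assumption made for illustration, the number of absolute difference operations per block when every candidate is compared over all N x N pixels.

```python
def candidate_blocks(w):
    """Number of candidate positions in an exhaustive search of range w."""
    return (2 * w + 1) ** 2

def abs_diff_operations(n, w):
    """Absolute differences needed to match one n x n block exhaustively."""
    return n * n * candidate_blocks(w)

counts = {w: candidate_blocks(w) for w in (5, 10, 15, 20, 25)}
for w, count in counts.items():
    print(f"w = {w:2d}: {count:5d} blocks compared")
```

The count grows with the square of w, so doubling the search parameter roughly quadruples the number of blocks compared.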
Figure 3.8 Showing number of blocks compared for a given search parameter w.
The most significant factor that influences the complexity is the search method. The 
exhaustive search method has the highest complexity. The advantage of the 
exhaustive search method is that it guarantees that the best match is found. To 
overcome the complexity of the exhaustive search method, fast motion 
estimation algorithms have been developed. The high complexity of the motion 
estimation for both the H.261 and H.263 and the MPEG-1, MPEG-2 and MPEG-4 video standards results in an 
encoder that is many times more complex than the decoder, even if more efficient fast 
motion estimation algorithms are used.
3.5 Fast Motion Estimation Algorithms
In order to reduce the number of computational operations, several fast algorithms 
have been formulated, as will be shown in this section, at the cost of a somewhat 
reduced image prediction quality. The basic principle behind fast algorithms is to divide the search 
process into a number of sequential steps. The decision in choosing the direction of 
the next step is based on the current step result. Therefore, only a small number of 
search points are calculated at each step. Thus, the total number of search points is 
significantly reduced.
The fast search algorithms are based on the premise that the matching function is 
monotonic in any direction away from the optimal point. Indeed under such 
conditions a fast algorithm can converge to the global optimal point. However, in 
reality the monotonic matching function assumption is often not valid and as a result 
fast search algorithms are sub-optimal for predicting images. The monotonic 
matching function assumption is largely based on the block translational model used 
to describe motion. The motion is not necessarily always purely translational. The 
image is also subject to coding and noise. All these factors can cause the monotonic 
matching function assumption to be invalid.
The fast algorithms are performed in sequential order, consisting of a number of steps. 
Therefore, the initial and subsequent search directions are important since they may 
lead to a local minimum or maximum, depending on the matching function used. The 
sequential search, which can often be variable in length depending on the image, is 
not entirely suitable for parallel processing structures, since such parallel processing 
requires regularity to perform efficiently. Generally, a fast search algorithm begins 
with an approximate search, computing a number of initial points for the first step. 
The number of search points evaluated at each step is often the same and the distance 
between search points usually halves at each step. Therefore, the process is logarithmic, 
leading to significant reductions in computations compared with the 
exhaustive search algorithm. The procedure is repeated until the local optimum point 
is reached.
The fast algorithms used for block motion estimation can be summarised as follows:
• Fast search by reduction of motion vector candidates.
• Reduced search window resolution with hierarchical or multi-resolution 
approaches.
• Fast matching obtained by pixel decimation.
• Simplified matching function.
All these approaches reduce computational complexity and, since they are all 
independent, can be used in conjunction with each other. However, the fast search algorithms 
that reduce the number of candidate vectors within the search area are the most widely 
used.
3.5.1 The Two Dimensional Logarithmic Search Algorithm
In this method, proposed by Jain and Jain [5], five initial points are tried on the 
first step: one at the coordinates that the search window is centred on and four further 
points at coordinates (±w/2, 0) and (0, ±w/2) in a diamond pattern (Fig. 3.9), where w 
represents the search window size. The next step of the two dimensional 
logarithmic (TDL) search repeats the same diamond search pattern, but centred on 
the previous best match. The search range is halved if the best match is 
the start coordinate of the previous step; otherwise it is kept the same. The final step is 
reached only when the search range is reduced to one pixel, and then nine further 
points are examined before the best match is found. The number of steps and the total 
number of search points cannot be predetermined using this technique; only a best 
case and worst case can be specified.
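The procedure can be sketched as follows. This is an illustrative reading of the description above rather than a reference implementation: the function name and the `cost(dx, dy)` callback (which stands for any matching function to be minimised) are assumptions, ties are resolved in favour of the current centre, and the quadratic test cost is a synthetic example with a single minimum.

```python
def tdl_search(cost, w):
    """Two dimensional logarithmic search over displacements within +/- w.
    `cost(dx, dy)` returns the matching error to be minimised.
    Returns (motion_vector, number_of_distinct_points_evaluated)."""
    cache = {}

    def evaluate(point):
        if point not in cache:          # each point is costed only once
            cache[point] = cost(*point)
        return cache[point]

    def inside(point):
        return abs(point[0]) <= w and abs(point[1]) <= w

    centre, step = (0, 0), max(1, w // 2)
    while step > 1:
        cx, cy = centre
        pattern = [centre, (cx + step, cy), (cx - step, cy),
                   (cx, cy + step), (cx, cy - step)]
        best = min((p for p in pattern if inside(p)), key=evaluate)
        if best == centre:
            step //= 2                  # best match at the centre: halve range
        else:
            centre = best               # otherwise move, keeping the range
    # Final step: all nine points at a distance of one pixel.
    cx, cy = centre
    final = [(cx + dx, cy + dy) for dy in (-1, 0, 1) for dx in (-1, 0, 1)]
    best = min((p for p in final if inside(p)), key=evaluate)
    return best, len(cache)

# Synthetic monotonic cost with its minimum at (3, -2).
mv, points = tdl_search(lambda dx, dy: (dx - 3) ** 2 + (dy + 2) ** 2, w=8)
print(mv, points)
```

The number of points evaluated depends on the path taken, which is why only best and worst cases can be stated for this algorithm.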
[Figure 3.9: search points for steps 1 to 6, plotted over displacements from i - 8 to i + 8.]
Figure 3.9 Two dimensional logarithmic search algorithm.
3.5.2 The Three-Step Search Algorithm
The three-step search (TSS), first considered by Koga et al [42], provides a highly 
efficient and accurate method of finding the best match. For this reason, the TSS is 
the recommended method for the testing of software based H.261 for videophone 
applications [36]. The technique first tries the start coordinate (Fig 3.10) and eight 
further positions surrounding the start coordinate at a search step of w/2. 
The position that minimises the distortion measure is the start coordinate for the 
second step. For the second step the search step is halved to w/4 and eight 
further positions are tried; the best match so far is the start coordinate for the third 
step. The third and final step of the process halves the search step to w/8 and eight 
further positions are tried; the best match so far gives the displacement, and hence the 
motion vector. The TSS constantly divides the search step size by two; therefore the 
method is a logarithmic search.
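A minimal sketch of the search is given below. The function name and the `cost(dx, dy)` callback are assumptions made for illustration, and the loop simply continues halving until the step size reaches one pixel, which gives exactly three steps for w = 8 (steps of 4, 2 and 1).

```python
def three_step_search(cost, w):
    """Search with step sizes w/2, w/4, ... down to one pixel.
    `cost(dx, dy)` returns the matching error to be minimised.
    Returns (motion_vector, number_of_points_evaluated)."""
    centre = (0, 0)
    best_cost = cost(0, 0)
    evaluated = 1
    step = w // 2
    while step >= 1:
        cx, cy = centre
        best = centre
        for dy in (-step, 0, step):
            for dx in (-step, 0, step):
                if dx == 0 and dy == 0:
                    continue            # centre already evaluated
                c = cost(cx + dx, cy + dy)
                evaluated += 1
                if c < best_cost:
                    best_cost, best = c, (cx + dx, cy + dy)
        centre = best                   # best match so far starts next step
        step //= 2
    return centre, evaluated

# Synthetic monotonic cost with its minimum at (3, -2).
mv, points = three_step_search(lambda dx, dy: (dx - 3) ** 2 + (dy + 2) ** 2, w=8)
print(mv, points)
```

The number of points is fixed in advance: one start point plus eight per step, which is the regularity that distinguishes the TSS from the TDL search.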
[Figure 3.10: search points for steps 1 to 3, plotted over displacements from i - 8 to i + 8.]
Figure 3.10 The three-step search algorithm.
3.5.3 The One-Dimensional Logarithmic Search Algorithm
The one-dimensional logarithmic (ODL) search algorithm was proposed by Puri et al 
[43] and is perhaps the fastest of the known search algorithms. On the first step the 
method evaluates the coordinates that the search window is 
centred on and four further points at coordinates (±w/2, 0) and (0, ±w/2) in a diamond pattern 
(Fig. 3.11), where w represents the search window size. The two points that produce 
the best matches, one on each axis, are used as references 
for the next step, for which the search range is halved. The matching 
function is then applied to four points: the points to the left and to the right of the best 
vector in the x direction, and the points above and below the best vector in 
the y direction. The best match of the three on the x-axis and the best of 
the three on the y-axis are chosen. The final step is reached only when the search 
range is reduced to one pixel, and then the vectors from the two best matches are 
added to produce the final motion vector. The accuracy of the algorithm is limited, 
since it relies more than any other on the matching function being monotonic.
[Figure 3.11: search points for steps 1 to 3 and the best match, plotted over displacements from i - 8 to i + 8.]
Figure 3.11 One-dimensional logarithmic search algorithm.
3.5.4 The Conjugate Direction Search Algorithm
The conjugate direction search (CDS) was introduced by Srinivasan and Rao [44]. At 
every iteration of the search, two conjugate directions with a step size of one pixel are 
searched as shown in Fig. 3.12.
The first step searches along the x-axis in either direction until a best match 
is found. The second step continues from this point in the vertical direction, again 
in either direction, until a further best match is found. The final step of the 
process is to search along the vector connecting the best match from the second step 
and the initial central starting point. The search is conducted in either direction along 
the vector until the best match is found. The best match gives the displacement, and 
hence the motion vector. In the example shown in Fig. 3.12, the points along the vector 
correspond to integer coordinates; this, however, may not always be the case. In such 
circumstances, the nearest grid points along the vector are used.
[Figure 3.12: search points for stages 1 to 3, plotted over displacements from i - 8 to i + 8 and up to j + 8.]
Figure 3.12 The conjugate direction search algorithm.
3.5.5 Comparison of Fast Search Algorithm Methods
The complexity of the fast search motion estimation algorithms discussed so far is 
compared with the exhaustive method in Table 3.1 for different search window sizes. 
As can be seen, the fast algorithms reduce the number of search points quite 
considerably compared to the exhaustive method. The three-step method requires more 
operations but offers a better performance than the two dimensional logarithmic 
method. The number of search points and steps are fixed for the three-step method; 
this makes it particularly suitable for VLSI parallel computation implementations 
where structural regularity is important. For the other algorithms the number of search points and steps can 
vary greatly, although the total number of operations is reduced. 
They are therefore more appropriate for software implementation. The one dimensional 
logarithmic search algorithm reduces the search points to what is perhaps the 
minimum. However, the prediction performance is diminished.
Search method   Number of search points   Search parameter w
                                          4      8      16
Exhaustive      (2w + 1)^2                81     289    1089
TDL             2 + 7 log2 w              16     23     30
TSS             1 + 8 log2 w              17     25     33
ODL             1 + 4 log2 w              9      13     17
CDS             5 + 4 log2 w              13     17     21
Table 3.1 Showing a comparison of fast search algorithms complexity 
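The entries of Table 3.1 follow directly from the formulae in its second column. The short sketch below recomputes them; the dictionary and variable names are chosen for this example, and w is assumed to be a power of two so that log2 w is an integer.

```python
from math import log2

# Number of search points as a function of the search parameter w
# (formulae as listed in Table 3.1).
FORMULAE = {
    "Exhaustive": lambda w: (2 * w + 1) ** 2,
    "TDL": lambda w: 2 + 7 * log2(w),
    "TSS": lambda w: 1 + 8 * log2(w),
    "ODL": lambda w: 1 + 4 * log2(w),
    "CDS": lambda w: 5 + 4 * log2(w),
}

table = {name: [int(formula(w)) for w in (4, 8, 16)]
         for name, formula in FORMULAE.items()}
for name, row in table.items():
    print(f"{name:<11}{row[0]:>6}{row[1]:>6}{row[2]:>6}")
```

The exhaustive count grows quadratically with w, whereas each fast algorithm adds only a fixed number of points every time w doubles.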
The fast search motion estimation algorithms offer several advantages such as:
• The complexity compared to the exhaustive search is significantly reduced.
• The algorithms are suitable for software based real time applications such as 
video conferencing and video telephony.
The algorithms are not without certain disadvantages that include:
• The algorithms only search a small number of points within a search window; 
therefore, there is an increased probability of the algorithm producing a 
motion vector from a local minimum.
• The number of steps is variable for most fast search algorithms; this is not 
suitable for parallel computation where regularity is essential. The exception 
is the TSS, which combines regularity with good prediction 
performance.
3.5.6 Hierarchical Block Motion Estimation
Hierarchical block motion estimation, first introduced by Bierling [45], reduces the 
extremely large number of computations required by the exhaustive method. A 
number of multi-resolution frames, each half the resolution of the previous, are used in 
the process as shown in Fig. 3.13. The three level example shown in Fig. 3.14 uses 
variable sized blocks of 2 x 2 pixels, 4 x 4 pixels and 8 x 8 pixels, corresponding to 
successively increasing resolution, with search range parameter w of 4 pixels, 2 pixels 
and 1 pixel respectively. This particular method and its variants are known as 
hierarchical variable block size (VBS) motion estimation [46]-[48]. Motion estimation 
is performed on the lowest resolution frame first and the best match motion vector is 
then doubled in size and used as the start coordinate for the second level. The frames 
of the second level are twice the resolution of the first level. The process is repeated 
until the final level, which is at full resolution, is reached and the motion vector for the 
best match is obtained. The search range is the largest at the first level and is reduced 
by half at each successive level. A relatively small search window can be used at the 
later levels, as each starts with a good approximate initial estimate. The method can be 
used in conjunction with fast search algorithms; however, as a consequence the 
prediction performance is somewhat reduced.
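The three level scheme can be sketched as follows. The code is illustrative only: the function names are assumptions, the low-pass filter and downsampling are approximated by a simple 2 x 2 average, the sum of absolute differences is used as the matching function, and the synthetic test displacement is chosen as a multiple of four so that it is represented exactly at every level of the pyramid.

```python
import random

def downsample(frame):
    """Halve the resolution by averaging each 2 x 2 group of pixels."""
    h, w = len(frame) // 2, len(frame[0]) // 2
    return [[(frame[2 * y][2 * x] + frame[2 * y][2 * x + 1]
              + frame[2 * y + 1][2 * x] + frame[2 * y + 1][2 * x + 1]) // 4
             for x in range(w)] for y in range(h)]

def sad(cur, ref, bx, by, dx, dy, n):
    """Sum of absolute differences for one candidate displacement."""
    return sum(abs(cur[by + y][bx + x] - ref[by + dy + y][bx + dx + x])
               for y in range(n) for x in range(n))

def local_search(cur, ref, bx, by, n, start, w):
    """Exhaustive search of range w around the displacement `start`."""
    height, width = len(ref), len(ref[0])
    best, best_err = start, float("inf")
    for dy in range(start[1] - w, start[1] + w + 1):
        for dx in range(start[0] - w, start[0] + w + 1):
            x0, y0 = bx + dx, by + dy
            if x0 < 0 or y0 < 0 or x0 + n > width or y0 + n > height:
                continue
            err = sad(cur, ref, bx, by, dx, dy, n)
            if err < best_err:
                best, best_err = (dx, dy), err
    return best

def hierarchical_search(cur, ref, bx, by, n=8, ranges=(4, 2, 1)):
    """Coarse-to-fine search; ranges[0] applies to the lowest resolution."""
    pyramid = [(cur, ref)]
    for _ in range(len(ranges) - 1):
        c, r = pyramid[-1]
        pyramid.append((downsample(c), downsample(r)))
    mv = (0, 0)
    for level in range(len(ranges) - 1, -1, -1):   # coarsest level first
        scale = 2 ** level
        c, r = pyramid[level]
        w = ranges[len(ranges) - 1 - level]
        mv = local_search(c, r, bx // scale, by // scale, n // scale, mv, w)
        if level > 0:
            mv = (2 * mv[0], 2 * mv[1])            # double for the next level
    return mv

random.seed(1)
SIZE = 64
ref = [[random.randrange(256) for _ in range(SIZE)] for _ in range(SIZE)]
# Current frame: reference content displaced so the true vector is (8, -4).
cur = [[ref[(y - 4) % SIZE][(x + 8) % SIZE] for x in range(SIZE)]
       for y in range(SIZE)]
mv = hierarchical_search(cur, ref, bx=24, by=24)
print(mv)
```

With ranges of 4, 2 and 1 the search evaluates 81 + 25 + 9 candidates on blocks of 2 x 2, 4 x 4 and 8 x 8 pixels, while covering displacements of up to 21 pixels at full resolution.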
Hierarchical block motion estimation features several advantages such as:
• The whole searching area is covered, though at lower resolution.
• The complexity compared to the exhaustive search is significantly reduced.
The algorithm is not without certain disadvantages that include:
• The use of subsampling requires additional memory, and the subsampling 
requires filtering and therefore additional computations.
• The prediction accuracy may not necessarily be good when a scene contains 
many spatial details, since there is an increased probability of the algorithm 
producing a motion vector from a local minimum.
[Figure 3.13: the frames are low-pass filtered and downsampled by 2 at each level; motion estimation is performed at each of the three levels, yielding the motion vector.]
Figure 3.13. Showing hierarchical block matching algorithm. The motion estimation 
is performed using a three level pyramid structure with each level a reduced 
resolution representation of the lower level.
Figure 3.14 Showing the block size and search window at each level using the 
three-level hierarchical block matching algorithm.
3.5.7 Phase Correlation Method
To implement phase correlation [49] the reference frame is divided into overlapping 
reference blocks of L x L pixels. The phase correlation function (PCF) is then 
computed between each reference block and the corresponding block in the previous 
frame centred at the same coordinates. Assuming that the two blocks are related by a 
simple translation,
$$\psi_1(x) = \psi_2(x + d) \qquad (3.7)$$
Performing a Fourier transform and taking the normalised cross-power spectrum 
between ψ1(x) and ψ2(x) isolates the translation that has occurred as a phase term. The translation is 
obtained by taking the inverse Fourier transform, resulting in the PCF:
$$\mathrm{PCF}(x) = \delta(x + d) \qquad (3.8)$$
Since the two blocks are translations of each other, an impulse 
function results (3.8), which is located at a position exactly equal to the translation 
that has occurred. In practice several impulses may occur, corresponding to local 
minima; the maximum peak is identified as the translation of the reference block. The 
spatial information is lost by the transform to the frequency domain; however, the 
direction and speed of the motion determined from the PCF can be used to direct the 
block motion estimation algorithm to a limited range, vastly improving efficiency, 
particularly over large search ranges.
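The step from (3.7) to (3.8) follows from the Fourier shift theorem. A brief sketch is given below, assuming continuous Fourier transforms over the spatial variable x; the spatial frequency f and the transform symbols \Psi_1 and \Psi_2 are notation introduced here for illustration only.

```latex
\begin{align*}
\Psi_1(f) &= \int \psi_2(x + d)\, e^{-j 2\pi f \cdot x}\, dx
           = e^{\,j 2\pi f \cdot d}\, \Psi_2(f), \\[4pt]
\frac{\Psi_1(f)\, \Psi_2^{*}(f)}{\left| \Psi_1(f)\, \Psi_2^{*}(f) \right|}
          &= e^{\,j 2\pi f \cdot d}, \\[4pt]
\mathrm{PCF}(x) &= \int e^{\,j 2\pi f \cdot d}\, e^{\,j 2\pi f \cdot x}\, df
           = \delta(x + d).
\end{align*}
```

The normalisation removes the magnitude of the spectrum, so only the phase term that carries the displacement d remains, and its inverse transform is an impulse at the position given in (3.8).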
3.5.8 Pixel Decimation
Block matching is based on the block translational model. Thus, assuming that all 
pixels in a block move in the same way, matching only part of the pixels in the block can be 
sufficient to obtain a good estimate. Using pixel decimation, motion estimation 
complexity is reduced by considering fewer pixels in the matching function.
There are many ways pixel decimation can be achieved. A practical approach consists 
of adopting a fixed chequered flag pattern with subsampling factors ranging from two 
to eight. The saving in complexity is of the same order of magnitude. A more 
developed approach [50] applies two fixed patterns that consist of a cross and a plus. 
However, the use of such rigid subsampling patterns can lead to inaccurate image 
prediction, since details can be omitted in certain regions of the blocks to be matched. 
Alternating between patterns alleviates this problem by using all the 
pixels of the current block and of the search area. The use of alternating pixel 
decimation patterns was proposed by Liu and Zaccarin in [51]. In their example four 
4:1 subsampling block patterns are used. The patterns are used in such a manner that 
only one is used at each location of the search area, in a defined alternating 
order. The four final motion vectors are then refined with a full block matching and 
the best gives the motion vector. The computational complexity is in this case 
decreased by a factor of four.
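A 4:1 decimated matching function with alternating patterns can be sketched as below. This is a simplified illustration and not the exact scheme of [51]: the four patterns are taken to be the four possible phases of a 2:1 subsampling grid in each direction, and the rule that assigns a pattern to each search location is a hypothetical one chosen for the example.

```python
# The four 4:1 patterns: starting offsets (x, y) of a 2 x 2 subsampling grid.
PHASES = [(0, 0), (1, 0), (0, 1), (1, 1)]

def decimated_sad(cur_block, ref_block, phase):
    """Sum of absolute differences over one quarter of the pixels."""
    n = len(cur_block)
    px, py = phase
    return sum(abs(cur_block[y][x] - ref_block[y][x])
               for y in range(py, n, 2) for x in range(px, n, 2))

def pattern_for(dx, dy):
    """Hypothetical alternation rule: pick the pattern from the parity of
    the candidate displacement, so neighbouring locations use different
    patterns and together cover every pixel."""
    return PHASES[(dx % 2) + 2 * (dy % 2)]

cur_block = [[10] * 4 for _ in range(4)]
ref_block = [[7] * 4 for _ in range(4)]
partial = decimated_sad(cur_block, ref_block, pattern_for(0, 0))
full = sum(decimated_sad(cur_block, ref_block, phase) for phase in PHASES)
print(partial, full)
```

Each call touches one quarter of the pixels in the block, which is the source of the fourfold saving, while the four patterns taken together cover every pixel once.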
There are some advantages in using pixel decimation such as:
• All the block positions in the entire search area are covered by the matching 
function.
• The regularity of the algorithm is suitable for hardware realisations.
There are several disadvantages with the technique:
• The complexity is reduced typically by only four to eight times.
• Since the entire search area is covered, the memory is accessed many more times 
than with fast search algorithms.
• Scenes characterised by many spatial details increase the probability of the 
matching function getting stuck in a local minimum.
3.5.9 Reduced Pixel Precision
The exhaustive search combined with the Reduced Bit Mean Absolute Difference 
(RBMAD) distance measure was proposed by Baek et al [52] to reduce hardware 
complexity. The RBMAD reduces the number of bits used in the absolute difference 
calculation by bit truncation. This results in a reduction in VLSI 
implementation area and power consumption. Since the absolute difference processor and 
adder operate on a smaller number of bits, a higher operating speed is also 
possible.
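The effect of bit truncation on the distance measure can be sketched as follows. The function name and parameters are assumptions made for illustration: pixels are taken to be 8 bit values, and the least significant bits are discarded with a right shift before the absolute difference is formed.

```python
def rbmad(cur_block, ref_block, kept_bits=4, pixel_bits=8):
    """Mean absolute difference computed on the `kept_bits` most
    significant bits of each pixel (the remaining bits are truncated)."""
    shift = pixel_bits - kept_bits
    total = 0
    for cur_row, ref_row in zip(cur_block, ref_block):
        for c, r in zip(cur_row, ref_row):
            total += abs((c >> shift) - (r >> shift))
    return total / (len(cur_block) * len(cur_block[0]))

cur_block = [[255, 16], [128, 0]]
ref_block = [[240, 31], [100, 15]]
reduced = rbmad(cur_block, ref_block, kept_bits=4)
full = rbmad(cur_block, ref_block, kept_bits=8)
print(reduced, full)
```

With four bits kept, three of the four pixel pairs in this example become identical after truncation, which illustrates both the reduced word length and the loss of fine detail in the measure.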
There are several advantages in using reduced pixel precision such as:
• The hardware complexity is reduced and the operating speed increased with 
a small loss in prediction accuracy.
• The regularity of the algorithm is suitable for hardware realisations.
• The matching function can have a positive effect on the PSNR for certain 
scenes. This is possibly due to a filtering effect as a result of masking the LSBs 
(least significant bits) of the pixels used for the absolute distance.
There are some disadvantages with the technique:
• The complexity is reduced typically by only two times.
• For scenes characterised by many spatial details, the probability of the 
algorithm getting stuck in a local minimum is increased.
3.6 Conclusions
Various motion estimation techniques have been described and the performance of 
each assessed. The exhaustive search method provides the best results, albeit at the 
highest computational complexity. The TSS method and hierarchical block motion 
estimation offers the best performance o f the fast motion estimation methods. The fast 
search methods generally trade video quality for reduced computation.
The exhaustive search method lends itself well to integrated circuit implementation, 
where structural regularity is important. The method allows operations to be performed 
in parallel. This is particularly important for real time applications such as video 
conferencing and video telephony. Fast search methods that have a variable number 
of search steps are unsuitable. The TSS is therefore the most suitable of the fast search 
algorithms, since the number of search steps is fixed. However, the method, as with all 
fast algorithms, is susceptible to local minima, reducing PSNR due to false matches.
The reduced pixel precision algorithm using the exhaustive search is of particular 
interest, since good performance is obtained at reduced accuracy. This would seem to 
be the best candidate for an analogue implementation where accuracy is limited. It is 
this algorithm that shall be further explored in the following chapters.
Chapter 4
Characterisation of Block Matching Algorithms 
using Analogue Techniques
4.1 Introduction
The issues relating to the application of analogue techniques to BMA’s are varied. The 
choice of a particular BMA architecture, the complexity of implementation, and the 
effects of non-ideal analogue building blocks on motion estimation performance must 
all be considered.
In this chapter, BMA’s are examined with particular consideration given to design 
realisation. To simulate the analogue motion estimation processor, a C++ program has 
been specifically written, allowing various design parameters to be investigated. The 
degree of non-ideal analogue characteristics and their relation to overall motion 
estimation performance is explored to determine design limits and the choice of BMA 
implementation.
4.2 Realisation of Block Matching Algorithms using 
Analogue Techniques
4.2.1 Motivation for using Analogue Techniques
The motivations for using analogue integrated circuit techniques to implement BMA’s 
are varied. They include lower material cost and better power efficiency than digital 
integrated circuit techniques, and in most cases both. Such circuit techniques have 
proved useful for realising parallel processing systems used in neural networks 
[53]-[66] and vector quantisers in image compression [67]-[71].
The use of such proven analogue techniques for motion estimation would result in 
similar benefits for video coding applications. Several video coding standards, 
such as H.261, H.263 and MPEG-1, -2 and -4, only require the motion vectors and not 
the method by which they are found, so a high degree of flexibility regarding 
implementation is available. A “black box” situation therefore exists, where only the 
input and output interface of the motion estimation processor must be fully specified 
and not the internal workings, as shown in the simplified block diagram of Fig. 4.1.
[Figure 4.1: block diagram. Video capture supplies digital video data frames to the video coder; the motion estimation (ME) block outputs motion vectors (MV).]
Figure 4.1. Simplified block diagram showing motion estimation processor as part of 
video coder.
4.2.2 Non-ideal Analogue Characteristics
A limitation of any analogue implementation is random process variation, and 
hence variation in circuit parameters. Therefore, it is necessary to evaluate the 
analogue motion estimation processor with respect to inherent analogue non-ideal 
characteristics. The performance of the analogue motion estimation processor will be 
largely determined by several factors including:
• The distance metric used.
• How well the distance metric is implemented.
• The accuracy at which each match is compared.
Since the BMA is fundamental to the motion estimation processor operation, the 
BMA will be the topic of further discussion.
The various components of analogue MAE and MSE distance metric implementations 
are shown in Fig. 4.2. The MAE or MSE comprises a number of analogue processing 
elements (PEs) that perform the matching function. Each analogue PE takes either the 
absolute difference between pixels or the squared difference, depending on the BMA 
used. The use of analogue techniques will introduce non-ideal components into the 
matching functions (3.2) and (3.3). The non-ideal deviations of the PEs in Fig. 4.2 can 
be modelled with either an absolute-distance PE (4.1) or a square-distance PE (4.2):
absolute-distance PE: $d_m = (k + \varepsilon_x)\,\lvert (C - R) + \varepsilon_y \rvert + \varepsilon_z$   (4.1)
square-distance PE: $d_m = (k + \varepsilon_x)\,(C - R + \varepsilon_y)^2 + \varepsilon_z$   (4.2)
where $d_m$ is the distance measure and hence the PE output, $C$ and $R$ are pixel values, 
$\varepsilon_x$ is the variation in gain, $\varepsilon_y$ is the input-offset error and $\varepsilon_z$ is the output-offset 
error. The effect of the input-offset error and output-offset error on the PE is shown in 
Fig. 4.3. The terms $\varepsilon_x$, $\varepsilon_y$ and $\varepsilon_z$ will follow a normal distribution, with large standard 
deviation values for small transistor geometries in deep submicron CMOS processes [72], [73].
[Figure 4.2: block diagram. The pixel distance measures (dm) of a block feed a mean distance measure stage.]
Figure 4.2. Block diagram showing distance measure metric.
[Figure 4.3: plot of dm against |C - R| showing the ideal and non-ideal characteristics.]
Figure 4.3. Depicting the effect of input-offset error ($\varepsilon_y$) and output-offset error ($\varepsilon_z$) 
on the absolute-distance PE characteristic.
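The two PE models (4.1) and (4.2) can be written directly as functions. The sketch below is illustrative only: the function names are hypothetical, and the default nominal gain k = 1 is an assumption, since the text does not give a value for k.

```python
def abs_pe(c, r, k=1.0, eps_x=0.0, eps_y=0.0, eps_z=0.0):
    """Absolute-distance PE, equation (4.1). k = 1 is an assumed nominal gain."""
    return (k + eps_x) * abs((c - r) + eps_y) + eps_z


def sq_pe(c, r, k=1.0, eps_x=0.0, eps_y=0.0, eps_z=0.0):
    """Square-distance PE, equation (4.2). k = 1 is an assumed nominal gain."""
    return (k + eps_x) * ((c - r) + eps_y) ** 2 + eps_z


# Ideal PE: all error terms are zero.
print(abs_pe(100, 90))                         # 10.0
# An input offset shifts the point where the output reaches zero.
print(abs_pe(100, 100, eps_y=3.0))             # 3.0
# An output offset simply adds to the result.
print(sq_pe(100, 90, eps_y=2.0, eps_z=1.0))    # 145.0
```

The second call shows why the input offset matters for matching: two identical pixels no longer give a zero distance, and the size of the error differs from PE to PE.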
Investigations during the course of this research have shown that the BMA 
performance is mainly influenced by:
• Pixel precision,
• Input-offset of each cell and
• The accuracy at which matches are compared.
While factors that only affect the performance marginally include:
• The variation in gain between PE,
• Output-offset of each PE and
• Absolute or square-law conformance of the PE.
The variation in gain and output-offset of each PE follows a normal distribution. 
Therefore, taking the mean over the individual PEs results in an approximately 
constant offset, and a constant offset does not affect the comparison of successive 
matches. The absolute and square conformance will not significantly affect the 
performance either. These observations confirm previous results obtained by Tomasini 
et al. [74], who implemented the motion estimation as part of a larger video system, 
and the observations of Tuttle et al. [71], who worked with vector quantisers.
4.3 Characterisation of non-ideal Analogue Techniques on 
Performance
Characterising the effect of non-ideal analogue characteristics on the BMA is 
essential, since the image prediction is directly influenced by the BMA implementation. 
The use of analogue techniques will introduce inevitable variations. Therefore, it is 
necessary to simulate the analogue motion estimation processor with respect to 
inherent analogue non-ideal characteristics. The results presented in the following 
sections were obtained using a block size of 16 x 16 pixels and a search window of 
32 x 32 pixels for various sequences. The sequences consist of ‘Carphone’, with 
excessive motion of person and background; ‘Foreman’, with large motion of person 
and change of scenery to forecourt; ‘Susie’, with less motion and a close-up of a person 
with fixed background; and finally ‘Trevor’, having the least motion, with a person at a 
moderate distance and fixed background. The exhaustive search was used in all 
sequences. The PSNR was used to evaluate the objective quality of the motion 
estimated frames.
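The evaluation procedure described above (exhaustive search followed by a PSNR measurement) can be sketched as follows. This is not the C++ simulator used for the results; the function names, the frame representation as lists of rows, and the rule of skipping candidates outside the frame are all assumptions. The thesis settings correspond to n = 16 and w = 8.

```python
import math


def block_mae(cur, ref, x, y, dx, dy, n):
    """Mean absolute error between the n x n block of `cur` at (x, y)
    and the block of `ref` displaced by (dx, dy)."""
    total = 0
    for j in range(n):
        for i in range(n):
            total += abs(cur[y + j][x + i] - ref[y + dy + j][x + dx + i])
    return total / (n * n)


def full_search(cur, ref, x, y, n, w):
    """Exhaustive search over displacements -w..+w; returns (dx, dy, error).
    Candidates that fall outside the reference frame are skipped."""
    height, width = len(ref), len(ref[0])
    best = None
    for dy in range(-w, w + 1):
        for dx in range(-w, w + 1):
            if not (0 <= x + dx and x + dx + n <= width
                    and 0 <= y + dy and y + dy + n <= height):
                continue
            err = block_mae(cur, ref, x, y, dx, dy, n)
            if best is None or err < best[2]:
                best = (dx, dy, err)
    return best


def psnr(original, predicted, peak=255):
    """Peak signal-to-noise ratio in dB between two equal-sized frames."""
    count = len(original) * len(original[0])
    mse = sum((a - b) ** 2
              for row_a, row_b in zip(original, predicted)
              for a, b in zip(row_a, row_b)) / count
    return math.inf if mse == 0 else 10 * math.log10(peak ** 2 / mse)


# Toy 8 x 8 frames: the current frame is the reference shifted by one pixel.
ref = [[y * 8 + x for x in range(8)] for y in range(8)]
cur = [[ref[(y - 1) % 8][(x + 1) % 8] for x in range(8)] for y in range(8)]
print(full_search(cur, ref, 3, 3, 2, 2))
```

Replacing `block_mae` with a squared-error version gives the MSE variant; the search loop itself is unchanged.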
4.3.1 Reduced Pixel Precision
The pixel precision is important for analogue designs, since invariably the matching 
function circuit will be preceded by an analogue memory. Thus, pixel precision 
determines the complexity of the analogue memory. The mean PSNR (mean 
calculated over all frames in each sequence) for the 4 sequences was assessed at 
reduced pixel precisions for both matching functions. As seen in Fig. 4.4, the loss 
from the ideal 8-bit pixel precision increases similarly for each function, though 
negligibly down to 6-bit. The rapid increase below 6-bit is due to the correlation 
information being lost in the quantisation. Interestingly, the performance of 4 to 5-bit 
pixel MSE is similar to 8-bit pixel MAE, while the MAE for 4 to 5-bit pixel precision 
is up to 0.25 dB down on the 8-bit pixel MAE. This would suggest that the MSE can 
trade off over two bits of pixel precision and still achieve accuracy similar to 8-bit 
MAE. Since the inherent square-law characteristics of MOS transistors may readily be 
exploited in circuit design, this suggests that the MSE is well suited to analogue 
realisation.
[Figure 4.4: two plots of mean PSNR error against pixel accuracy (bits) for the Carphone, Foreman, Susie and Trevor sequences.]
Figure 4.4. The mean PSNR error due to pixel precision reduction for various 
sequences. Upper, MSE distance metric with ideal 8-bit MAE reference 
superimposed (dashed). Lower, MAE distance metric.
4.3.2 Offset Errors
Evaluating performance in the presence of input-offset, caused by device mismatch in 
the analogue PE, is essential in determining the suitability of a particular matching 
function for an analogue implementation. The offset can be reduced using calibration 
[69], [74]; however, this increases complexity and reduces operating speed. The offset 
is modelled as a normal distribution with standard deviation increasing from 2.5 mV to 
10.0 mV. These offset values are typical of sub-micron geometry transistors [72], [73]. 
The input range is 0-255 mV with 8-bit pixel precision. It can be seen that the 
performance for scenes with a high degree of motion, such as Carphone and Foreman, 
is improved by adding input offset to the MAE algorithm (Fig. 4.5). The 
improvement is largely due to smoothing of the MAE distance metric, which has been 
observed to enhance performance.
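A 0-255 mV input range with 8-bit pixels implies 1 mV per grey level, so an offset expressed in mV adds directly to the pixel difference. The sketch below shows one way such an experiment could be set up; it is an assumption about the method rather than the simulator used here, and the names `draw_offsets` and `noisy_sad` are hypothetical.

```python
import random


def draw_offsets(count, sigma_mv, rng):
    """One fixed input offset per PE, drawn from a normal distribution
    with zero mean and standard deviation `sigma_mv` (in mV)."""
    return [rng.gauss(0.0, sigma_mv) for _ in range(count)]


def noisy_sad(cur_block, ref_block, offsets):
    """Sum of absolute differences in which PE i adds its own input offset.
    Blocks are flat lists of pixel values (1 grey level = 1 mV assumed)."""
    return sum(abs((c - r) + e)
               for c, r, e in zip(cur_block, ref_block, offsets))


rng = random.Random(0)                       # fixed seed for repeatability
offsets = draw_offsets(16 * 16, 5.0, rng)    # 5.0 mV standard deviation
block = [128] * (16 * 16)                    # hypothetical flat grey block

ideal = noisy_sad(block, block, [0.0] * len(block))
with_offset = noisy_sad(block, block, offsets)
print(ideal, with_offset)
```

Even for two identical blocks the distance is no longer zero once offsets are present, which is how a true best match can lose to a neighbouring candidate.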
[Figure 4.5: two plots of mean PSNR error against input referred offset (mV) for the Carphone, Foreman, Susie and Trevor sequences.]
Figure 4.5. The mean PSNR error due to input-offset for various sequences. Upper, 
MSE distance metric. Lower, MAE distance metric.
4.3.3 Reduction of Block Match Comparison Resolution
The matching functions can have a wide dynamic range, but natural scenes only use 
part of this range. Since it is desirable to design analogue circuits with small dynamic 
range for low voltage operation, it is important to determine the comparison accuracy 
required to identify each best match block. As seen in Fig. 4.6, an accuracy of 12-bit 
for 8-bit pixel MAE is required and 14-bit for 8-bit pixel MSE to ensure less than 0.1 
dB error. An accuracy of 14-bit is required for the MSE using 6-bit pixel precision 
(Fig. 4.7) to perform similarly, or better in most sequences than 8-bit pixel MAE. 
Thus, using 6-bit pixel precision and the MSE function, a similar performance to 
MAE using 8-bit pixel precision can be achieved. As a result, the circuit precision 
(excluding the comparator) can be reduced by a factor of 4, greatly reducing the 
circuit requirements.
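The effect of limited comparison resolution can be illustrated by quantising the distance values before they are compared. The model below is an assumption (uniform quantisation of a sum of absolute differences over its full theoretical range) and may differ from the one used to produce Fig. 4.6 and Fig. 4.7; the candidate values are hypothetical.

```python
def quantise(value, full_scale, bits):
    """Map an integer `value` in 0..full_scale onto a `bits`-bit code."""
    return value * (1 << bits) // (full_scale + 1)


# Largest possible sum of absolute differences for a 16 x 16 block of
# 8-bit pixels (assumed full scale of the comparator input).
FULL_SCALE = 16 * 16 * 255

a, b = 1000, 1010          # two hypothetical candidate distances
for bits in (12, 8):
    print(bits, quantise(a, FULL_SCALE, bits), quantise(b, FULL_SCALE, bits))

# Dropping two bits of pixel precision (8-bit to 6-bit) corresponds to a
# factor of 2**2 = 4 in the number of levels to be resolved.
precision_ratio = 2 ** (8 - 6)
```

At 12 bits the two candidates still map to different codes, whereas at 8 bits they become indistinguishable and the comparator can no longer pick the better match.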
[Figure 4.6: two plots of mean PSNR error against comparator resolution (bits) for the Carphone, Foreman, Susie and Trevor sequences.]
Figure 4.6. The mean PSNR error at reduced comparator accuracy for sequences 
using 8-bit pixel values. Upper, MSE distance metric, with the 8-bit MAE using an 
ideal comparator superimposed as a reference (dashed). Lower, MAE distance metric.
[Figure 4.7: plot of mean PSNR error against comparator resolution (bits) for the Carphone, Foreman, Susie and Trevor sequences.]
Figure 4.7. The mean PSNR error at reduced comparator accuracy for various 
sequences, using 6-bit pixel values and MSE distance metric, with the 8-bit MAE 
using an ideal comparator superimposed as a reference (dashed).
4.4 Effects of Block Size on Block Matching Algorithm  
Performance
Generally it has been found that motion estimation performance increases as the 
block size is reduced. The block size used for H.261 and MPEG-1 and -2 is 16 x 16. 
For improved performance, 8 x 8 blocks are used in advanced low bit rate modes in 
standards such as H.263 and MPEG-4. These sizes are examined in relation to 
reduced pixel algorithms using search window sizes of 32 x 32 and 24 x 24 
respectively for various sequences. In addition, the 4 x 4 block size is included with 
a 12 x 12 search window.
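The block and window sizes above fix the maximum displacement and the number of candidate positions. The short calculation below assumes the block is centred in a square search window; the function names are hypothetical.

```python
def search_parameter(block, window):
    """Maximum displacement w for a square block centred in a square window."""
    return (window - block) // 2


def candidate_positions(block, window):
    """Number of distinct block positions inside the window, (2w + 1)^2."""
    return (window - block + 1) ** 2


for block, window in [(16, 32), (8, 24), (4, 12)]:
    print(block, window, search_parameter(block, window),
          candidate_positions(block, window))
```

The 16 x 16 and 8 x 8 cases both give w = 8 and 289 candidate positions, while the 4 x 4 case gives w = 4 and 81 positions. The value w = 8 is the search window parameter used for the architecture comparison in Chapter 5.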
4.4.1 Reduced Pixel Precision
The reduction in pixel accuracy results in significant power and size savings. The 
pixel precision therefore determines the complexity of the motion estimation. The 
reduction relaxes analogue circuit requirements with regard to accuracy and dynamic 
range. The effect of reduced pixel precision with block size is shown for the MSE 
matching function in Fig. 4.8 and the MAE matching function in Fig. 4.9.
However, with smaller block sizes the mean PSNR error introduced increases 
significantly with reduction in pixel precision, as can be seen in Fig. 4.8 and 4.9. With 
8 x 8 blocks and 4-bit pixel precision the mean PSNR error is 0.5 - 1 dB, and with 
4 x 4 blocks it is between 1.1 and 2.3 dB. This compares rather unfavourably with 
16 x 16 blocks, where the increase in mean error is 0.3 dB at most. The results would 
suggest that reduced pixel precision is only an effective way of reducing system 
complexity with a large block size such as 16 x 16.
[Figure 4.8: four panels (4-bit, 5-bit, 6-bit and 7-bit pixel precision) plotting mean PSNR error against block size (4, 8, 16) for the Carphone, Foreman, Susie and Trevor sequences.]
Figure 4.8 The mean PSNR error due to pixel precision reduction for various
sequences using the MSE algorithm for various block sizes.
[Figure 4.9: four panels (4-bit, 5-bit, 6-bit and 7-bit pixel precision) plotting mean PSNR error against block size (4, 8, 16) for the Carphone, Foreman, Susie and Trevor sequences.]
Figure 4.9 The mean PSNR error due to pixel precision reduction for various
sequences, using the MAE algorithm for various block sizes.
4.4.2 Offset Errors with Different Block Size
It has been found that the mean PSNR error introduced by input-offset tends to 
dominate, while the contribution of the other non-ideal characteristics is minor. 
Evaluating performance with input-offset, caused by device mismatch in the analogue 
PE, is therefore essential in choosing the block size of the matching function for an 
analogue implementation. The offset is modelled as a normal distribution with 
standard deviation increasing from 2.5 mV to 10.0 mV. The input range is 0-255 mV 
with 8-bit pixel precision. It can be seen in Fig. 4.10 and Fig. 4.11 that scenes with 
a high degree of motion, or lower resolution, tend to perform better in the presence of 
input-offset than scenes with more spatial detail such as Susie. Generally, however, 
the prediction performance worsens as the block size is reduced.
4.4.3 Block Size Performance
The performance of reduced pixel precision MAE and MSE algorithms has been 
investigated with different block sizes. The performance of the algorithms with 
input-offset introduced has been assessed with particular relevance to analogue 
motion estimation processor implementations. It has been found that, contrary to the 
general improvement in PSNR found when using smaller block sizes with full pixel 
precision algorithms, the performance is much degraded with reduced pixel precision.
The performance with smaller block sizes is much worsened for both the MSE and MAE 
matching functions. With smaller block sizes it is more probable that neighbouring 
blocks are quite similar, so once spatial detail is lost the matching function 
cannot resolve the differences between blocks, leading to poor image prediction. 
Therefore some important observations that were apparently unnoticed in [52] may 
be noted:
• The reduced pixel algorithms only represent a viable low complexity 
alternative for large block sizes such as 16 x 16. This is applicable to both 
digital and analogue implementations.
• The prediction improvement found with using small block sizes for full 
precision matching algorithms does not hold for reduced pixel precision 
matching algorithms.
[Figure 4.10: four panels (2.5 mV, 5.0 mV, 7.5 mV and 10.0 mV standard deviation) plotting mean PSNR error against block size (4, 8, 16) for the Carphone, Foreman, Susie and Trevor sequences.]
Fig. 4.10 The mean PSNR error due to input-offset for different sequences using 
various blocks. The MSE algorithm has been used with 8-bit pixel precision.
[Figure 4.11: four panels (2.5 mV, 5.0 mV, 7.5 mV and 10.0 mV standard deviation) plotting mean PSNR error against block size (4, 8, 16) for the Carphone, Foreman, Susie and Trevor sequences.]
Fig. 4.11 The mean PSNR error due to input-offset for different sequences using
various blocks.
4.5 Comparison of Architectures Techniques
The performance and complexity of the analogue motion estimation processor will 
also be influenced by the architecture used. Architectures that calculate the distance 
metric (DM) in parallel using analogue components are examined. There are three 
main approaches to identifying the best-match reference block. The simplest, shown 
in Fig. 4.12, utilises a single DM block and works through the search area one 
location at a time, using a 2-input comparator (COMP) and a sample-and-hold (S/H) to 
store the most recent best match. This will be referred to as a serial-parallel 
architecture. The computation rate can be increased using multiple DM matching 
blocks as shown in Fig. 4.13. The search area is worked through a row (or column) at 
a time using an n-input loser-takes-all (LTA) circuit. This will be referred to as a 
multiple-parallel architecture. Finally, a parallel-parallel architecture could be 
employed, calculating the entire search area in parallel [74], utilising $x^2$ DM blocks 
and an $x^2$-input LTA. The x represents the number of pixels for a side of a square 
block. The architecture has also been used in vector quantisers [70], [71]. This can be 
viewed as an extension of the multiple-parallel architecture. A more detailed 
description of the architectures will be given in Chapter 5.
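[Fig. 4.12: block diagram. A single DM block feeds a comparator (COMP) and sample-and-hold (S/H), producing the motion vector.]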
Fig. 4.12 Motion estimation processor using serial-parallel architecture.
[Fig. 4.13: block diagram. DM blocks 1 to n feed an n-input LTA, producing the motion vector.]
Fig. 4.13 Motion estimation processor using multiple-parallel architecture.
4.5.1 Architectures Performance with offset errors
The parameters influencing the performance of the motion estimation will be similar 
for both architectures. However in the case of the multiple-parallel architecture, the 
random input-offsets will be different for each of the multiple distance metrics used. 
In the case of the serial-parallel architecture, output-offset errors will cause no 
degradation in performance since all matching values will be subjected to the same 
DC offset for each subsequent match, assuming a normal distribution.
The performance of both architectures with increasing input-offset is shown in Fig. 
4.5 and Fig. 4.14 for the serial-parallel and multiple-parallel architecture respectively. 
The results were obtained using a block size of 16 x 16 pixels and search window of 
32 x 32 pixels for the various sequences and using 17 DM’s for the multiple-parallel 
architecture. Clearly the serial-parallel architecture performance is better than 
multiple-parallel for both distance metrics. Systems that employ the multiple-parallel 
are more prone to input-offset errors degrading performance [71], [74] particularly the 
MAE metric [70].
[Fig. 4.14: two plots of PSNR error against input referred offset (mV) for the Carphone, Foreman, Susie and Trevor sequences.]
Fig. 4.14 The PSNR error due to input-offset for various sequences using 
multiple-parallel architecture. Upper, MSE distance metric. Lower, MAE distance metric.
In the case of the multiple-parallel architecture, since the matching function values 
to be compared will originate from different distance metric blocks, one might expect 
them to have very different DC offsets. However, since the output-offset errors in 
each distance metric block follow a normal distribution, the distance metric values, 
although computed from different distance metric blocks, will in fact be subjected to a 
nearly identical DC offset. This will not hold for blocks separated by large distances 
much greater than 1 mm [72]. However, with careful layout strategy, the same pixel 
locations for each different block can be clustered together, ensuring the gradient for 
each block is actually the same. Further system-level simulations (not shown) 
confirmed that output-offset errors cause insignificant degradation in either 
architecture.
4.6 Conclusions
In this chapter, BMA’s have been examined with the degree of non-ideal analogue 
characteristics explored. The relation of analogue characteristics to overall motion 
estimation performance is investigated to determine the choice of BMA 
implementation and design constraints.
It has been shown that the MSE can trade off over two bits of pixel precision and 
still achieve accuracy similar to 8-bit MAE. Therefore, since the inherent square-law 
characteristics of MOS transistors may readily be exploited in circuit design, the MSE 
is the most suitable matching function for analogue realisation.
The simulations of reduced pixel algorithms have shown that this technique only 
represents a viable low complexity alternative for large block sizes such as 16 x 16. 
The prediction improvement found by using small block sizes for full precision 
matching algorithms does not hold for reduced pixel precision matching algorithms. 
This observation imposes a large block size on any implementation, analogue or 
digital, using this technique. It has been found that systems that employ the 
multiple-parallel architecture are more prone to input-offset errors degrading 
performance. Therefore, the serial-parallel architecture is the most suitable in terms of 
performance for an analogue implementation.
Chapter 5
Block Matching Motion Estimation Architectures
5.1 Introduction
The design of the analogue BMA is key to the successful implementation of low power 
motion estimation processors. As a result of the design restrictions discussed in 
Chapter 4, the design of the BMA poses a demanding challenge.
In this chapter, issues relating to the design of a block matching motion estimation using 
analogue techniques in CMOS technology are discussed. Various motion estimation 
processor architectures are compared in terms of complexity. Particular consideration is 
given to the design realisation of the distance metric used in the BMA. The degree of 
non-ideal analogue characteristics and their relation to overall motion estimation 
performance is considered to determine design limits and the choice of BMA 
implementation.
5.2 Basic Architectures for Analogue Block Matching Motion 
Estimation
The complexity and performance of the motion estimation is largely determined by the 
architecture used. Architectures that perform the distance metric in parallel using 
analogue techniques are examined. There are three main approaches to implement the 
BMA as described in section 4.5. The simplest way is using the serial-parallel 
architecture with a single distance metric block. The computation rate can be increased 
using a multiple-parallel architecture with multiple distance metric blocks. Finally the 
parallel-parallel architecture calculates the entire search area in parallel. The parallel- 
parallel architecture results in the highest possible computation rate.
There are numerous ways that the architectures can be implemented. The following 
sections describe the operation of such implementations. The first generation
architectures have been adapted from vector quantisers. As such these structures do not 
implement the motion estimation in the most efficient way. The second generation 
architectures are specifically designed for motion estimation, and therefore implement 
the BMA more efficiently.
5.2.1 First Generation Architectures
The architectures in this section refer to block matching strategies that have been 
applied to motion estimation from vector quantisers. In essence motion estimation and 
vector quantiser implementations are similar in as much as they both require a BMA. 
Therefore, it would seem reasonable to employ BMA implementation techniques that 
have been successfully employed in vector quantisers.
Serial-Parallel Architecture
The first architecture to consider will be the serial-parallel architecture. In this 
architecture the distance metric is calculated in parallel and each successive position 
within the search window is compared in serial fashion. A block diagram representing 
the method is shown in Fig. 5.1. This works by first loading the TB (template block) 
and SB (search block) memories with pixel values corresponding to the current frame 
and reference frame respectively. These memories are typically composed of sample 
and hold circuits. When the TB and SB memories are filled, the DM performs the 
matching function between the blocks in parallel. The value is compared with the 
previous value held in the S/H using COMP and, if less, the S/H is updated (initially 
with the first distance metric value) and the current SB memory address is stored. The 
process is repeated until every unique position within the search window is 
exhausted. The SB with the least distance yields the motion vector (MV).
[Figure 5.1: block diagram. DAC, TB and SB memories, DM, S/H with update, COMP producing the MV, and digital I/O.]
Figure 5.1 Motion estimation processor using serial-parallel architecture.
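The compare-and-hold loop described above can be sketched in software. This is a behavioural illustration only, not a circuit model: `distance` stands in for the DM block, and the names used are hypothetical.

```python
def serial_parallel_search(distance, positions):
    """Visit positions one at a time, keeping the smallest distance seen.

    `distance(pos)` stands in for the DM block; `held` plays the role of
    the S/H and the strict `<` comparison plays the role of COMP.
    """
    held = None          # S/H contents (empty before the first match)
    best = None          # stored SB address
    for pos in positions:
        value = distance(pos)
        if held is None or value < held:
            held, best = value, pos
    return best, held


# Hypothetical distance values for a 2 x 2 set of search positions.
table = {(0, 0): 9, (1, 0): 4, (0, 1): 7, (1, 1): 4}
order = [(0, 0), (1, 0), (0, 1), (1, 1)]
print(serial_parallel_search(table.get, order))
```

Because the update only happens on a strictly smaller value, the first of two equal minima is retained.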
Multiple-Parallel Architecture
The next architecture to consider is the multiple-parallel architecture. In this 
architecture multiple distance metrics are calculated in parallel for each row within the 
search window. A block diagram representing the method is shown in Fig. 5.2. This 
works by first loading the TB and multiple SB memories with pixel values 
corresponding to the current frame and reference frame respectively. When the TB and 
multiple SB memories are filled, the distance metrics perform the calculation between 
the blocks in parallel. The distance metrics are compared using an LTA and the 
minimum value is stored in the S/H. For the first row the S/H is simply updated with 
the minimising distance metric and the SB address stored. For the successive rows the 
S/H is only updated if a distance metric produces a value less than that already held. 
Therefore, the SB with the least distance yields the motion vector.
[Figure 5.2: block diagram. DAC, TB, SB memories 1 to n, DM blocks 1 to n, S/H with input multiplexed from the losing DM, (n+1)-input LTA producing the MV, and digital I/O.]
Figure 5.2 Motion estimation processor using multiple-parallel architecture.
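The row-at-a-time operation can be sketched in the same behavioural style. The function name and the data layout (a list of rows of distance values) are assumptions, and the held value competing with each row minimum is an interpretation of the (n+1)-input LTA shown in Fig. 5.2.

```python
def multiple_parallel_search(rows):
    """`rows[y][x]` is the distance at search position (x, y).

    Each row is evaluated at once; the row minimum (LTA) then competes
    with the value already held in the S/H.
    """
    held, best = None, None
    for y, row in enumerate(rows):
        x = min(range(len(row)), key=row.__getitem__)   # LTA over the row
        if held is None or row[x] < held:
            held, best = row[x], (x, y)
    return best, held


# Hypothetical distance values for a 3 x 3 set of search positions.
rows = [[9, 4, 6],
        [7, 4, 8],
        [5, 3, 9]]
print(multiple_parallel_search(rows))
```

Only one comparison against the held value is needed per row, rather than one per position as in the serial-parallel case.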
Parallel-Parallel Architecture
The final architecture to be considered will be the parallel-parallel architecture shown in 
Fig. 5.3. In this architecture all the distance metrics are calculated in parallel for the 
whole search window. All values are compared using a LTA and the minimum value 
yields the motion vector.
[Figure 5.3: block diagram. DAC, TB, SB memories 1 to n, DM blocks 1 to n, n-input LTA producing the MV, and digital I/O.]
Figure 5.3 Motion estimation processor using parallel-parallel architecture.
5.2.2 First Generation Architectures Comparison
The preceding descriptions of the architectures indicate that there are substantial 
differences in terms of speed and complexity. The main operations of the architectures 
are quantified in Table 5.1.
The serial-parallel architecture has the least complexity requiring only a single distance 
metric, comparator and sample and hold. In addition the TB and SB will be the same 
size as the distance metric. However, for each block match the whole SB needs to be 
updated, therefore a large number of operations are expended. The multiple-parallel 
architecture makes use of the fact that each SB in a row of the search window is 
overlapping and only changes by one pixel position. Though shown logically in Fig. 5.2 
as individual SB memories for clarity, the distance metrics share memory locations. 
This results in a substantial reduction of memory load operations. Further the LTA uses 
a tree structure that is known to provide both speed and accuracy [75] though increasing
the number of operations. The multiple-parallel architecture has been suggested as the 
best all round solution [76].
The parallel-parallel architecture requires the least amount of memory load operations 
resulting in the highest possible operation speed. The architecture makes use of the fact 
that each SB in a row or column of the search window is overlapping and only changes 
by one pixel position vertically or horizontally for each match [74]. Therefore, the 
parallel distance metrics can share many memory locations. However, a large number of 
comparators is required to implement the LTA.
Architectures       LTA Size      LTA Stages¹          COMP            Memory Loads
Serial-parallel     2             1                    1               $N^2 + N^2(2w+1)^2$
Multiple-parallel   $2w$          $\log_2(2w)$         $2w-1$          $N^2 + (2w+1)(N+2w)N$
Parallel-parallel   $(2w+1)^2$    $\log_2 (2w+1)^2$    $(2w+1)^2-1$    $N^2 + (N+2w)^2$
¹ The result is rounded up to the next integer.
Table 5.1 Components and operations required for motion estimation processor using 
various architecture.
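The LTA columns of Table 5.1 can be evaluated for a given search window parameter. The sketch below assumes a binary tree of 2-input comparators, which is consistent with the stage and comparator counts in the table; the function name is hypothetical.

```python
def lta_requirements(inputs):
    """Return (stages, comparators) for a tree LTA with `inputs` inputs."""
    stages = (inputs - 1).bit_length()   # ceil(log2(inputs)) for inputs >= 1
    comparators = inputs - 1
    return stages, comparators


w = 8   # search window parameter
lta_size = {
    "serial-parallel": 2,
    "multiple-parallel": 2 * w,
    "parallel-parallel": (2 * w + 1) ** 2,
}
for name, size in lta_size.items():
    print(name, size, lta_requirements(size))
```

For the parallel-parallel case the LTA has 289 inputs; log2(289) is a little over 8, so rounding up gives 9 stages.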
The various approaches are compared in Table 5.2 using a 16 x 16 block size with 
search window parameter w = 8. The comparators and DM’s are assumed to be powered 
down when not in use, as was done in [74]. Therefore the power dissipation will be 
similar whatever architecture is used. As can be seen, the serial-parallel architecture 
requires the largest number of clocks to complete the BMA, operating at 2 % of the 
speed of the parallel-parallel architecture. The serial-parallel architecture however has 
the least complexity, since only one comparator and DM is used, in contrast to the 
parallel-parallel architecture using 288 comparators and 289 DM’s. The 
multiple-parallel architecture provides an intermediate performance that is nearer to 
the serial-parallel architecture in terms of overall performance. Although its area usage 
is large, the parallel-parallel architecture has been adopted for the majority of vector 
quantiser [70], [71] and motion estimation [74] BMA implementations. This is mainly 
due to the fact that it is roughly 50 times faster than the serial-parallel architecture.
Architectures       LTA Size   LTA Stages   COMP   Memory Loads   Clocks   Relative Speed (%)
Serial-parallel     2          1            1      74240          74817    2
Multiple-parallel   16         4            15     8960           9045     15
Parallel-parallel   289        9            288    1345           1354     100
Table 5.2 Components and operations required for motion estimation processor using 
various architecture.
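The relative speed column follows from the clock counts in Table 5.2, taking the fastest architecture as 100 %. A small check, with the clock counts copied from the table:

```python
clocks = {
    "serial-parallel": 74817,
    "multiple-parallel": 9045,
    "parallel-parallel": 1354,
}

fastest = min(clocks.values())
relative = {name: round(100 * fastest / count)
            for name, count in clocks.items()}
print(relative)
```

The rounded percentages reproduce the 2, 15 and 100 listed in the table.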
5.2.3 Second Generation Architectures
The architectures described in this section make use of the overlapping SB’s in the 
search window to enable more efficient memory loading, thereby increasing operating 
speed and reducing hardware complexity.
Serial-Parallel Architecture
To understand how the serial-parallel architecture can be modified to make use of the 
overlapping SB’s, it is necessary to examine the block-matching process. The motion 
estimation follows the path shown in Fig. 5.4. The TB is compared with every unique 
block within the search window of the reference frame. As the SB is moved over the 
search window for each match, only one column or row actually contains new pixel 
values, as can be seen in Fig. 5.5 (a). This property may be taken advantage of as 
shown in Fig. 5.5 (b) and (c). The SB in Fig. 5.5 (b) is initially filled with the pixel 
values from position 1 in the search window. The SB column that is no longer valid is 
updated for position 2 (Fig. 5.5 (c)) and the columns are multiplexed to correspond 
properly with the TB. This process of updating and multiplexing columns and rows to 
correspond is repeated for all possible positions.
Figure 5.4 Showing the computation path taken in the search window. [Diagram labels: 2 x 2 TB in current frame; 4 x 4 search window within previous frame.]
Figure 5.5 (a) For each match, though the positions are different, only one column or 
row actually contains new pixel values. Therefore, as shown in (b) and (c) the bulk of the 
values can be reused by just updating new values and multiplexing the columns and/or 
rows to correspond. [Diagram labels: (a) positions 1 and 2 in the search window; (b) SB and TB at position 1; (c) SB with updated column and TB at position 2, with column / row multiplexing.]
The serial-parallel architecture shown in Fig. 5.1 can be modified to incorporate the 
reusable pixels by adding multiplexers between the TB and SB memories and the 
distance metric as can be seen in Fig 5.6. The second generation serial-parallel 
architecture operates in a similar fashion to the first generation serial-parallel 
architecture differing only by updating the pixels that have changed (either a row or
column) at each block match and multiplexing either or both rows and columns to 
correspond correctly.
Figure 5.6 Motion estimation processor using second generation serial-parallel 
architecture. [Block labels: DAC, TB, MUX, DM, SB, SH, COMP, Update, MV, Digital I/O.]
Multiple-Parallel Architecture
The multiple-parallel architecture shown in Fig. 5.2 can be modified to make use of the 
overlapping SB’s, much in the same way as applied to the serial-parallel architecture. 
The second generation multiple-parallel architecture is shown in Fig. 5.7.
Figure 5.7 Motion estimation processor using second generation multiple-parallel 
architecture. [Block labels: DAC, TB, SH (input multiplexed from losing DM), SB 1 to SB (n), MUX, DM 1 to DM (n), (n+1) input LTA, MV, Digital I/O.]
The template blocks are compared with every unique block within the search window of 
the previous frame, one row at a time following the path shown in Fig. 5.8. As the SB’s 
are moved over the search window, only one row actually contains new pixel values as 
can be seen in Fig. 5.9 (a). This property may be taken advantage of as is shown in Fig. 
5.9 (b) and (c). The SB in Fig. 5.9 (b) is initially filled with the pixel values from 
position 1 in the search window. The SB row that is no longer valid is updated for 
position 2 (Fig. 5.9 (c)) and the rows are multiplexed to correspond properly with the 
TB. This process of updating and multiplexing rows to correspond is repeated until the 
whole search window is covered.
Figure 5.8 Showing the computation path taken in the search window. [Diagram labels: TB1, TB2, TB3 (multiple 2 x 2 TB in current frame); 4 x 4 search window within previous frame.]
Figure 5.9 A single TB is shown for the first row. (a) For each match, though the 
positions are different, only one row actually contains new pixel values. Therefore as 
shown in (b) and (c) the bulk of the values can be reused by just updating new values 
and multiplexing the columns and/or rows to correspond. [Diagram labels: search window; TB1; SB; changed row; row multiplexing; positions 1 and 2.]
5.2.4 Second Generation Architectures Comparison
The architectures described indicate that significant improvements in terms of speed and 
complexity result from reusing pixel values. The main operations of the architectures 
are quantified in Table 5.3. The serial-parallel architecture has the least complexity 
requiring only a single distance metric, comparator and sample and hold. In addition the 
TB and SB will be the same size as the distance metric. The use of multiplexers allows 
pixels to be reused. The improved multiple-parallel architecture makes use of the fact 
that each SB in a row of the search window is overlapping and only changes by one 
pixel position. This, combined with reusing pixels through multiplexers, results in a 
substantial reduction of memory load operations over the first generation 
multiple-parallel architecture.
Architecture        LTA Size      LTA Stages (1)       COMP             Memory Loads
Serial-parallel     2             1                    1                $N^2 + (N+2w)^2$
Multiple-parallel   $2w+1$        $\log_2(2w+1)$       $2w$             $N^2 + (N+2w)^2$
Parallel-parallel   $(2w+1)^2$    $\log_2 (2w+1)^2$    $(2w+1)^2 - 1$   $N^2 + (N+2w)^2$
(1) The result is always rounded to the larger integer.
Table 5.3 Components and operations required for motion estimation processor using 
various architectures.
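The component counts of Table 5.3 can be evaluated for a given search parameter w with a short script. This is a minimal sketch that covers only the LTA size, LTA stages and comparator columns; the function and key names are assumptions made for the example, and the rounding follows the table footnote.

```python
import math


def components(w):
    """LTA size, LTA stages and comparator count for search parameter w,
    following the second generation expressions of Table 5.3."""
    per_row = 2 * w + 1          # candidate positions in one row
    total = per_row ** 2         # candidate positions in the whole window
    return {
        "serial-parallel": {"lta_size": 2, "lta_stages": 1, "comp": 1},
        "multiple-parallel": {
            "lta_size": per_row,
            # Footnote: the result is rounded up to the next integer.
            "lta_stages": math.ceil(math.log2(per_row)),
            "comp": 2 * w,
        },
        "parallel-parallel": {
            "lta_size": total,
            "lta_stages": math.ceil(math.log2(total)),
            "comp": total - 1,
        },
    }


for name, counts in components(8).items():
    print(name, counts)
```

With w = 8 this gives 17, 5 and 16 for the multiple-parallel architecture and 289, 9 and 288 for the parallel-parallel architecture.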
The various approaches are compared in Table 5.4 using a 16 x 16 block size with 
search window parameter w = 8 as used previously. As can be seen, the serial-parallel 
architecture requires the largest number of clocks to complete the BMA. However, using 
the second generation architecture, the operating speed is enhanced to 70 % of that of the 
parallel-parallel architecture. The second generation multiple-parallel architecture
provides only marginally better speed, at 76 % of that of the parallel-parallel 
architecture, but requires more comparators and DM’s and therefore uses significantly 
more area. This compares rather unfavourably with the serial-parallel architecture, 
so the multiple-parallel architecture will not be considered further.
It can be seen from Table 5.4 that the second generation serial-parallel architecture 
performance reaches 70 % that of the parallel-parallel architecture at a fraction of the 
hardware complexity. The innovative step of reusing pixel values through 
multiplexing results in a serial-parallel architecture with similar speed performance to 
that of the parallel-parallel architecture. This technique was completely overlooked by [74], 
resulting in a system that was many times more complicated than necessary. Furthermore, 
unlike the multiple-parallel or parallel-parallel architectures, the serial-parallel architecture 
allows for a variable search window size, adding a large degree of flexibility.
Architecture        LTA Size  LTA Stages  COMP  Memory Loads  Clocks  Speed Efficiency (%)
Serial-parallel            2           1     1          1345    1922                    70
Multiple-parallel         17           5    16          1345    1747                    76
Parallel-parallel        289           9   288          1345    1354                   100
Table 5.4 Components and operations required for motion estimation processor using 
various architectures.
5.3 Improved Analogue Motion Estimation Processor
The inevitable input-offset present in the analogue PE’s is the most significant 
detrimental effect in relation to PSNR performance. Ensuring a large enough input 
range to the analogue PE reduces the input-offset effect, as was done by Tomasini et 
al [74] and in neural network processors [59] - [66] and vector quantisers [70]. With the 
reduction of power supply voltage that accompanies technology scaling, this option is not 
viable for deep sub-micron processes. Input-offset calibration has been used in previous 
works to reduce this error, for example by Tuttle et al [71] and by Cauwenberghs and 
Pedroni [69], both for vector quantiser applications. However, the techniques were applied 
to each individual analogue PE, increasing complexity by a substantial amount. 
Moreover, the techniques are only partially successful since the input error is stored on 
capacitors for each individual circuit and is therefore subject to clock feedthrough 
errors. A simple solution is very desirable in all applications. The solution used here seeks to 
minimise the input-offset inherent in the analogue motion estimation processors while 
maintaining the second generation serial-parallel architecture. Therefore, attention has 
focused on improvements at the circuit level.
The serial-parallel architecture can be improved by the input-offset error cancellation 
approach shown in Fig. 5.10. This works by sampling the current DM value using S/H1 
with the reference and current block pixels applied to each of the analogue PE’s inputs as 
shown in Fig. 5.10. The subsequent DM value is obtained by swapping the reference and 
current block pixel values at each of the analogue PE’s inputs using the crossover switch 
and storing the result in S/H2. Summing the outputs of S/H1 and S/H2 approximately 
cancels the error. The procedure is repeated for the next match using S/H3 and S/H4, and 
comparing the two results with a comparator identifies the best-match reference block so 
far.
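A simple numerical model illustrates why swapping the inputs removes the signal-dependent part of the offset error. The sketch assumes an idealised square-difference PE whose output is (a - b + offset)^2, with a fixed input-referred offset per PE; this model, the halving of the sum and all names are assumptions for illustration and are not taken from the circuit described here.

```python
import random


def pe_output(a, b, offset):
    """Idealised square-difference PE with an input-referred offset (assumed model)."""
    return (a - b + offset) ** 2


def dm(first, second, offsets):
    """Distance metric: sum of all PE outputs for one input ordering."""
    return sum(pe_output(a, b, o) for a, b, o in zip(first, second, offsets))


def dm_cancelled(cur, ref, offsets):
    """Average of the two measurements taken with the PE inputs swapped."""
    return 0.5 * (dm(cur, ref, offsets) + dm(ref, cur, offsets))


random.seed(0)
n = 256                                        # 16 x 16 block
cur = [random.randint(0, 255) for _ in range(n)]
ref = [random.randint(0, 255) for _ in range(n)]
offsets = [random.gauss(0.0, 10.0) for _ in range(n)]

ideal = sum((c - r) ** 2 for c, r in zip(cur, ref))
residual = sum(o ** 2 for o in offsets)        # independent of the pixel data

print("single measurement error:", dm(cur, ref, offsets) - ideal)
print("swapped measurement error:", dm_cancelled(cur, ref, offsets) - ideal)
print("sum of squared offsets:   ", residual)
```

In this model the cross term 2(a - b)·offset has opposite signs in the two measurements and cancels, leaving only the sum of squared offsets. That residual is the same for every candidate block, so it does not change which block gives the smallest distance.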
Fig. 5.10 Proposed improved distance metric for serial-parallel architecture with input- 
offset error cancellation. [Block labels: reference block pixels, crossover switch, current block pixels, DM, sample & hold address, S/H1 to S/H4, COMP.]
The simple sample and hold circuit in Fig. 5.11 was used for evaluating the technique. 
This is subject to a non-linear clock feedthrough error with both DC and 
signal-dependent components. The clock feedthrough error can be reduced using the 
techniques described in sections 6.3.1 and 6.3.2; however, to demonstrate the efficiency 
of the technique, the simple sample and hold circuit is used. The sample and hold output 
characteristic was modelled using a best fit transfer function and incorporated into 
simulations. The results of system-level simulations using the MSE function on the same 
sequences evaluated in Chapter 4, with a block size of 16 x 16 pixels and a search 
window of 32 x 32 pixels, are shown in Fig. 5.12. The equivalent accuracy of the S/H 
circuits over the range of the DM values was about 5-bit, the standard deviation of the 
input-offset error of each analogue PE was 10 mV (cell input range of 0-255 mV), and 
various comparator resolutions were evaluated for a reduced pixel precision of 8-bit. As 
can be seen in Fig. 5.12, using a comparator resolution of 12 to 14-bit, the new 
architecture performs similarly to 8-bit pixel MAE. The proposed technique offers a 
substantial reduction in complexity over other offset cancellation techniques, which 
apply calibration to each cell individually.
Since the DM computation only uses a small portion of the processing time, the rest 
being taken loading pixel memory, the motion vector processing rate is not significantly 
affected. Alternatively, the analogue DM circuits can be made to operate at twice the 
rate used previously, therefore maintaining exactly the same vector processing rate.
Fig. 5.11 The basic S/H circuit incorporated in the system-level simulations. [Circuit labels: clock, input, output.]
Fig. 5.12 Improved serial-parallel architecture: Error from mean-PSNR of 8-bit pixel 
MSE as a function of comparator resolution for various sequences using 8-bit 
pixel values. The standard deviation of the input-offset error was 10 mV. 8-bit 
pixel ideal MAE reference superimposed (dotted lines). [Plot: horizontal axis, resolution (bits); curves for Carphone, Foreman, Susie and Trevor.]
5.4 Block Matching Algorithm Implementation
In this section various circuit structures are assessed with regard to implementing the 
BMA. In particular previous BMA implementations are considered. The classification 
that these structures belong to is considered and the suitability for design of an efficient 
motion estimation processor examined.
5.4.1 Review of Analogue Block Matching Algorithms
Block matching algorithms find use in systems where the similarity between groups of 
values needs to be measured. Such systems include template classifiers used in neural 
networks [60], vector quantisers in image processing [71] and motion estimation in 
video processing [74]. As such, numerous circuit configurations have been proposed to 
enable this type of signal processing. The technology invariably used is CMOS. The 
reasons for using CMOS are varied. The VLSI possibility for both digital and analogue 
implementations reduces material cost. Another factor is the low power operation 
possible using CMOS.
The distance metrics lend themselves to parallel computation very easily. Examining the 
MAE (4.1) and MSE (4.2) matching functions reveals that many simple computations, such as 
square-difference or absolute-difference operations, are performed on corresponding 
pixels between blocks. These may be performed using simple analogue circuits, with 
each analogue PE operating between two pixels. These circuits can operate 
in parallel. The results are summed and divided by the block size. Summation, particularly 
in current mode, is efficient since the currents add linearly into a single node by 
Kirchhoff’s current law. The division can be achieved using a current mirror with 
dimensions between each branch in ratio to the division desired. Alternatively, if the 
current is summed into a linear load such as a resistor, the corresponding output voltage 
can be scaled to the division required. A generalisation of the computation required is 
shown in Fig. 5.13.
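The per-pixel operation, summation and division described above can be sketched in software as follows. Equations (4.1) and (4.2) are not reproduced in this section, so the sketch assumes the usual definitions of mean absolute error and mean square error over the block; the exhaustive search and all function names are illustrative only.

```python
def mae(cur, ref):
    """Mean absolute error between two equally sized blocks (lists of rows)."""
    n = len(cur) * len(cur[0])
    return sum(abs(c - r) for cr, rr in zip(cur, ref) for c, r in zip(cr, rr)) / n


def mse(cur, ref):
    """Mean square error between two equally sized blocks (lists of rows)."""
    n = len(cur) * len(cur[0])
    return sum((c - r) ** 2 for cr, rr in zip(cur, ref) for c, r in zip(cr, rr)) / n


def full_search(tb, window, metric):
    """Compare the template block with every block position in the window and
    return (smallest distance, (x, y)) of the best match."""
    n = len(tb)
    best = None
    for y in range(len(window) - n + 1):
        for x in range(len(window[0]) - n + 1):
            sb = [row[x:x + n] for row in window[y:y + n]]
            d = metric(tb, sb)
            if best is None or d < best[0]:
                best = (d, (x, y))
    return best


# 4 x 4 search window; the template is the 2 x 2 block at x = 1, y = 2.
window = [[4 * r + c for c in range(4)] for r in range(4)]
tb = [row[1:3] for row in window[2:4]]
print(full_search(tb, window, mae))
print(full_search(tb, window, mse))
```

Each term of the sums corresponds to one analogue PE; the division by the block size is the scaling that the current mirror or resistor load would provide.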
Figure 5.13 General distance metric structure. [Diagram labels: analogue PE (processing element) computing f(C_i - R_i); distance measure.]
5.4.2 Analogue Processing Element Structure
The analogue implementation of the BMA is based on the parallel operation of many 
analogue PE’s, much like that used in many digital implementations. However unlike 
the digital implementations where the summation of the PE’s is an arduous task, the 
analogue summation is simple. In the case of analogue PE’s using output currents, 
simply summing into a single node results in the required calculation. With analogue 
PE’s using output voltages, summation can be achieved using SC (switched capacitor) 
techniques [70]. The analogue PE is comprised of an input and an output. This 
conforms to a two-port device. Therefore, a brief overview of two-port or amplifier 
fundamentals is considered.
Amplifier fundamentals
An amplifier is a two-port device that comprises an input and an output. The two-port 
device is characterised such that the output is the product of the transfer function and 
input:
$$V_o(s) = H(s) \cdot V_i(s) \tag{5.1}$$
The most familiar are linear amplifiers, such as operational amplifiers. This type has an 
output that is linearly proportional to the input. However, this need not necessarily be 
the case. Examples of nonlinear input-output relationships are the quadratic amplifiers 
used in measuring equipment [77] and log or antilog amplifiers used in absorbance 
measurements [78].
Amplifier Classifications
There are four possible amplifier configurations corresponding to the voltage and 
current combinations at the input and output:
Voltage-to-voltage converter (V-V). This is known as a voltage amplifier with gain in 
units of V/V. This may be described as a voltage controlled voltage source (VCVS).
Voltage-to-current converter (V-I). This is known as a transconductance amplifier with 
gain in units of A/V. This may be described as a voltage controlled current source 
(VCCS).
Current-to-current converter (I-I). This is known as a current amplifier with gain in units 
of A/A. This may be described as a current controlled current source (CCCS).
Current-to-voltage converter (I-V). This is known as a transimpedance amplifier with 
gain in units of V/A. This may be described as a current controlled voltage source 
(CCVS).
The ideal characteristics of these amplifier configurations are summarised in Table 5.5. 
The measure of circuit quality is how closely it approximates the ideal parameters detailed 
in Table 5.5.
Input  Output  Amplifier type    Gain        Input Resistance  Output Resistance
V_i    V_o     Voltage           H(s) · V/V  ∞                 0
I_i    I_o     Current           H(s) · A/A  0                 ∞
V_i    I_o     Transconductance  H(s) · A/V  ∞                 ∞
I_i    V_o     Transimpedance    H(s) · V/A  0                 0
Table 5.5 The basic amplifier configurations and the ideal characteristics.
5.4.3 Analogue Block Matching Algorithms Comparison
A comparison of the various BMA implementations is considered. The BMA 
implementations are examined according to input-output classification.
Voltage amplifier processing element
In this configuration the distance metric is based on the voltage amplifier or VCVS. The 
general structure of distance metrics based on this method is shown in Fig. 5.14.
The inputs and output are voltages. Therefore, the pixel values can be stored using 
sample and hold techniques.
Figure 5.14 Distance measure based on a VCVS. [Diagram labels: V-V; reference pixel and current pixel held in S/H (sample & hold) memory; Vout.]
Only a small number of circuits based on VCVS have appeared largely due to the 
difficulty of summing voltages using continuous time techniques. An example of a 
continuous time squarer function was proposed by Seriki and Newcomb [79]. This 
circuit performs a square-difference function with voltage inputs and output. The use of 
switched capacitor (SC) techniques overcomes many of the difficulties of using this 
configuration. However, the operating speed is much lower than continuous time 
techniques. An efficient vector quantiser based on SC techniques is described in [70] by 
Cauwenberghs and Pedroni. The BMA implementation is based on the MAE. The 
circuit is characterised by moderate accuracy and large input-offset. Nevertheless useful 
results were obtained.
Current amplifier processing element
In this configuration the distance metric is based on the current amplifier or CCCS. The 
general structure of distance metrics based on this method is shown in Fig. 5.15.
The inputs and output are currents. Therefore the pixel values can be stored using 
switched current (SI) techniques [80], [81]. The output greatly benefits from the ease 
with which currents may be summed.
Figure 5.15 Distance measure based on a CCCS. [Diagram labels: reference pixel and current pixel held in SI (switched current) memory; Iout.]
A number of circuits are based on CCCS including the current squarer proposed by Bult 
and Wallinga [82]. The single input current squarer can be extended to n inputs as 
suggested by Landolt et al [83]. Another example of CCCS is the absolute function 
proposed by Chen et al [84] and the squarer proposed by Haung et al [85]. These 
circuits were used to form a current mode classifier for matching applications in neural 
networks [60].
Transimpedance amplifier processing element
In this configuration the distance metric is based on the transimpedance amplifier or 
CCVS. The general structure of distance metrics based on this method is shown in Fig. 
5.16. The inputs are currents and the output is a voltage. Therefore, the pixel values can be 
stored using SI techniques. The voltage output is useful where a number of voltages 
must be compared. A distance measure circuit was proposed by Montanari et al for use 
in multilevel flash memory [86]. This is perhaps the least useful configuration for 
implementing the BMA since voltage summation is difficult unless SC techniques are 
used.
Figure 5.16 Distance measure based on a CCVS. [Diagram labels: I-V; current pixel held in SI (switched current) memory; Vout.]
Transconductance amplifier processing element
In this configuration the distance metric is based on the transconductance amplifier. The 
general structure of distance metrics based on this method is shown in Fig. 5.17. The 
inputs are voltages and the output is a current. Therefore, the pixel values can be stored using 
sample and hold techniques. The output summation implicit in the BMA can easily be 
achieved using currents.
Figure 5.17 Distance measure based on a VCCS. [Diagram labels: V-I; reference pixel and current pixel held in S/H (sample & hold) memory; Iout.]
The vast majority of circuits used in matching functions are based on VCCS. Many of 
these circuits are used in neural networks; as such, a number of circuits based on this 
configuration perform a Gaussian function and are sometimes called bump circuits [59], 
[62], [63], [87].
A rather modest performance implementation of both the MAE and MSE distance 
metric has been used in the neural network classifier proposed by Gopalan and Titus 
[65]. Another neural network classifier based on the MSE distance metric was proposed 
by Cilingiroglu and Aksin [64], with adjustable output current. The circuit is compact 
and efficient, however the inputs must be differentially applied. This is not suitable for 
the serial-parallel architecture proposed in 5.2.2, since the use of differential signals not 
only doubles the memory requirement but also doubles the complexity of the 
multiplexers. An efficient MSE distance metric circuit was proposed by Tuttle et al as 
part of a vector quantiser [71]. This dynamic circuit featured calibration to alleviate 
device mismatch. However once again the inputs were differentially applied and 
therefore not suitable. The folded Gilbert multiplier [88] was used by Fang et al [61] to 
form a MSE distance metric circuit used for image compression. However, to achieve a 
large linear range the input transistors must have channel lengths larger than the width, 
slowing the operation speed significantly.
Several squarer circuits have been proposed that would lend themselves to the 
formation of a MSE distance metric [87], [89] and [90]. However, the MSE distance 
metric used by Tomasini et al as part of the motion estimator [74], offers the best 
overall performance. The squarer circuit operates by producing the square-difference 
current at the output from two pixels represented by voltages at the inputs. Therefore, 
differential signals are not required and the output currents can conveniently be 
summed. The square-difference circuit used by Tomasini et al was initially proposed by 
Seevinck and Wassenaar [91]. The principle upon which this circuit is based, was laid 
down by Nedungadi and Viswanathan [92]. It is this circuit principle that will form the 
foundation of the MSE distance metric introduced in this work and considered in detail 
in the next chapter.
5.5 Conclusions
It has been shown that the serial-parallel architecture described operates at high motion 
vector computation rate. The motion vector throughput is achieved by efficiently 
reusing pixel values. Therefore the number of memory loads is the same as for the 
parallel-parallel architecture. Furthermore, the search window can be varied to any size, 
adding an additional degree of application flexibility. The complexity and power 
dissipation are the lowest of all the architectures.
The input-offset is one of the most important limitations of analogue approaches 
towards distance metric implementations in many applications including motion 
estimation. A simple technique has been described to cancel the input-offset error 
inherent in analogue circuits. The simulated results suggest the suitability of the 
approach.
A brief overview of the circuits suitable for distance measure has been described. 
Various circuit configurations have been considered. The transconductance circuit 
approach was found to be most suitable. The voltage input allows the pixel values to be 
temporarily stored in simple sample and hold circuits. The current mode output allows 
the summation required in distance metrics to be effectively obtained by feeding into a 
resistor load.
Chapter 6
Implementation of Block Matching Motion 
Estimation using Analogue Techniques
6.1 Introduction
The analogue motion estimation processor is constituted by a number of sub-systems. 
These are mainly comprised of many basic analogue elements. An overview of the 
analogue motion estimation processor is given in Fig. 6.1. The design of each sub­
system will be examined in detail in the following sections.
Figure 6.1 Motion estimation processor showing sub-systems. [Block labels: memory address, column address, row address, reference block pixels, current block pixels, SB, TB, rotary shifters, DM, SH, COMP, decision circuit, update MV.]
6.2 Distance Metric
The distance metric shown in Fig. 6.1 is the most important sub-system in the analogue 
motion estimation processor. Since the overall performance in terms of power 
dissipation and operating speed is largely determined by the distance metric, particular 
attention must be given to the design. The distance metric is made up of many analogue 
PE circuits operating in parallel.
6.2.1 Analogue Processing Element Principle
In section 5.5 the circuit type identified for the analogue PE was transconductance 
operation. The configuration requires a square-difference circuit with voltage inputs and 
current output. This will be referred to either as a square-law function or simply as a 
squarer circuit for the remainder of the work. The principle proposed by Nedungadi and 
Viswanathan [92] will be elaborated in the following section for use as an analogue PE 
in the MSE matching function.
The inherent square-law characteristic of the saturation region drain current shown in 
(6.1) is utilised to realise a linear transconductor and square-law function. This is also 
known as Sah’s model [93]:
$$I_D = k\,(V_{GS} - V_T)^2 \tag{6.1}$$
where $I_D$ is the drain current, $V_{GS}$ is the gate-to-source voltage, $V_T$ is the threshold 
voltage, $k = \mu C_{ox} W / 2L$ is the transconductance parameter, $W$ and $L$ are the channel 
width and length respectively, $\mu$ is the carrier mobility, and $C_{ox}$ is the gate oxide 
capacitance per unit area.
The linear transconductor (6.2a) and square-law function (6.2b) are based on the following 
quadratic relations:
$$V_o = (V_i + V_x)^2 - (V_x - V_i)^2 = 4 V_x V_i \tag{6.2a}$$
$$V_o = (V_i + V_x)^2 + (V_x - V_i)^2 = 2 V_i^2 + 2 V_x^2 \tag{6.2b}$$
Additionally, a four-quadrant multiplier based on the quarter-square technique can be 
obtained using the following relation:
(6.2c)
A. Linear Transconductor
The circuit principle in Fig. 6.2 may be conveniently used to obtain a linear 
transconductor by subtracting the drain currents $I_{D1}$ and $I_{D2}$, or a square-law function by 
summing $I_{D1}$ and $I_{D2}$. Given that to the first order a MOS transistor in saturation may be 
modelled by (6.1) and neglecting channel length modulation, it can be seen by 
inspection that $V_{GS1}$ and $V_{GS2}$ are given by:
$$V_{GS1} = V_i + V_x + V_T \tag{6.3a}$$
and
$$V_{GS2} = V_x + V_T - V_i \tag{6.3b}$$
where $V_i = V_1 - V_2$. The differential output current is then given by:
$$I_{D1} - I_{D2} = k\left[(V_{GS1} - V_T)^2 - (V_{GS2} - V_T)^2\right] \tag{6.4}$$
Combining (6.3) and (6.4), the $V_T$ terms cancel resulting in:
$$I_{D1} - I_{D2} = k\left\{(V_i + V_x)^2 - (V_x - V_i)^2\right\} \tag{6.5}$$
This further reduces to:
$$I_{D1} - I_{D2} = 4 k V_x V_i \tag{6.6}$$
which describes a linear transconductor.
Figure 6.2 Linear transconductor and square-law function principle [92]. [Diagram labels: floating voltage sources $V_x + V_T$.]
B. Square-Law Function
Alternatively, by summing the drain currents of M1 and M2 a square-law function 
circuit is obtained:
$$I_{D1} + I_{D2} = 2k\,(V_i^2 + V_x^2) \tag{6.7}$$
The linear transconductor has been shown for completeness to exemplify the circuit 
principle. Since the MSE requires a square-difference circuit, the linear transconductor 
will not be considered any further. However, further details regarding the linear 
transconductor and application of the square-law function circuit to form a four- 
quadrant multiplier can be found in [95].
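The two results can be checked numerically with the ideal square-law model of (6.1). The sketch below assumes gate-source voltages of V_i + V_x + V_T and V_x + V_T - V_i for M1 and M2, which is what (6.4) and (6.5) imply; the device values are arbitrary illustrative numbers, not design values from this work.

```python
def drain_current(v_gs, v_t, k):
    """Ideal saturation-region drain current of (6.1); zero below threshold."""
    v_ov = v_gs - v_t
    return k * v_ov ** 2 if v_ov > 0 else 0.0


def pair_currents(v_i, v_x, v_t, k):
    """Drain currents of M1 and M2 for differential input v_i (assumed biasing)."""
    i_d1 = drain_current(v_i + v_x + v_t, v_t, k)
    i_d2 = drain_current(v_x + v_t - v_i, v_t, k)
    return i_d1, i_d2


# Illustrative values only: k in A/V^2, voltages in volts.
k, v_x, v_t, v_i = 100e-6, 0.5, 0.7, 0.2
i_d1, i_d2 = pair_currents(v_i, v_x, v_t, k)

print("difference:", i_d1 - i_d2, "expected 4*k*Vx*Vi =", 4 * k * v_x * v_i)
print("sum:       ", i_d1 + i_d2, "expected 2*k*(Vi^2 + Vx^2) =", 2 * k * (v_i ** 2 + v_x ** 2))
```

The identities hold only while both transistors conduct, that is for |V_i| < V_x in this model; outside that range one current is zero and the sum is no longer a pure square law.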
6.2.2 Analogue Processing Element Realisation
The following considers the various realisations that have been used to implement 
simple floating voltage sources with application to the circuit principle in Fig. 6.2.
A. Source Follower
It is possible to elaborate many methods of implementing the floating voltage sources 
depicted in Fig. 6.2, but perhaps the simplest is by the use of source followers as shown 
in Fig. 6.3.
Figure 6.3 Practical realization of the circuit in Fig. 6.2 using source followers.
The $V_{GS}$ drop of transistors M3 and M4 provides the dc terms $V_x + V_T$ shown in Fig. 6.2. 
This can be modelled with:
$V_{GS3,4} = V_T + \ldots$ (6.8)
where $I_B$ is the bias current of each source follower. From (6.8), it follows that the $V_x$ 
part is given by:
(6.9)
To maintain the input voltage range while ensuring adequately low output resistance for 
the source followers, the optimum gate ratio between the source follower transistors 
M3, M4 and the input transistors M1, M2 should be in the region of n = 3 as suggested 
in [96]. The square-law function is then given by:
(6.10)
The square-law conformance range is defined by the values of $V_i$ where either $I_{D1}$ or $I_{D2}$ 
becomes zero. Using (6.8) and (6.9) it can be seen that the square-law conformance 
range of $V_i$ is given by:
(6.11)
Much effort has been put into the practical realization of the circuit principle in 
Fig. 6.2 using source followers and its applications [96], [97], while optimisation 
techniques have been explored in [95]. Improvements in linearity have been obtained; 
however, the efficiency is characteristically low since the source follower operates in 
class A and is at best only 25% power efficient.
B. Class AB Output Stage -  Using CMOS Pair
Many of the problems associated with using source followers have been overcome by 
extending the circuit principle from single channel devices to complementary [91] or 
“CMOS pairs” as shown in Fig. 6.4.
Figure 6.4 Practical realization of the circuit in Fig. 6.2 using complementary 
transistors.
Using this method the dc floating voltages need only drive the transistor gates, avoiding 
the problems associated with the output resistance of the source follower. The CMOS 
pairs are shown in Fig. 6.4 as M5, M6 and M7, M8. The structure in Fig. 6.4 is 
immediately recognised as the familiar class AB output stage and is therefore 
characterised by high output current efficiency of up to 50%.
The equivalent threshold voltage and transconductance parameter for the CMOS pair 
are given by:
$$V_{Teq} = V_{Tn} + V_{Tp} \tag{6.12}$$
$$k_{eq} = \frac{k_n k_p}{\left(\sqrt{k_n} + \sqrt{k_p}\right)^2} \tag{6.13}$$
Optimum performance is obtained [91] when $k_n = k_p$. Assuming like channel 
transistors have the same W/L ratio, the square-law function is then given by:
$$I_{D1} + I_{D2} = 2 k_{eq} V_i^2 + 2 I_B \tag{6.14}$$
From (6.8) and (6.9), the square-law conformance range of $V_i$ is:
$$-\sqrt{I_B / k_{eq}} < V_i < \sqrt{I_B / k_{eq}} \tag{6.15}$$
However it is evident from (6.12) that a higher supply voltage is necessary to 
compensate for the additional $V_T$ and from (6.13) the area used will be greater than 
single transistor implementations.
C. Flipped Voltage Follower
Finally an approach introduced by Peluso et al [98] based on the later named flipped 
follower [99] is proposed (Fig. 6.5) that offers similar simplicity and low voltage 
operation as the source follower implementation with the same current efficiency as the 
CMOS pair implementation. The dc floating voltage sources in Fig. 6.2 can be 
efficiently realised using flipped voltage followers.
Figure 6.5 Proposed practical realisation of the circuit in Fig. 6.2 using flipped voltage 
followers.
These are formed in Fig. 6.5 by transistors M3, M5 and M4, M6 and the dc current 
sources $I_B$. The flipped voltage follower is characterised by having a similar voltage 
transfer function to the source follower but with much lower output impedance as a 
result of using feedback. Provided that M1 to M4 have the same W/L ratio, the 
square-law function is given by:
$$I_{D1} + I_{D2} = 2 k V_i^2 + 2 I_B \tag{6.16}$$
The square-law conformance range is defined by the values of $V_i$ where either $I_{D1}$ or $I_{D2}$ 
becomes zero. Since (6.8) and (6.9) hold for the flipped voltage follower, the square-law 
conformance range of $V_i$ is given by:
$$-\sqrt{I_B / k} < V_i < \sqrt{I_B / k} \tag{6.17}$$
Unlike the source follower implementation, the signal currents through the output 
transistors M1 and M2 in Fig. 6.5 can be up to 100% that of the combined bias currents 
$I_B$, therefore the circuit operates in class AB and has an output current efficiency of up 
to 50% [100]. Therefore the squarer circuit utilising the flipped follower will be used as 
the analogue PE described in section 4.2. The additional $2I_B$ DC offset term is 
subtracted using a simple current mirror and the currents summed into a resistor load to 
provide the distance metric.
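The way the PE currents combine into a distance metric voltage can be sketched behaviourally as follows. The sketch takes the PE output as 2kV_i^2 + 2I_B within a conformance range of |V_i| < sqrt(I_B/k), subtracts the 2I_B term per PE and sums the result into a resistor. The component values, the function names and the decision to raise an error outside the conformance range are assumptions made for illustration.

```python
import math


def pe_current(v_i, k, i_b):
    """Output current of one squarer PE, 2*k*Vi^2 + 2*Ib.

    Raises ValueError outside the square-law conformance range."""
    limit = math.sqrt(i_b / k)
    if abs(v_i) >= limit:
        raise ValueError(f"|Vi| = {abs(v_i):.3f} V exceeds conformance limit {limit:.3f} V")
    return 2 * k * v_i ** 2 + 2 * i_b


def distance_voltage(cur, ref, k, i_b, r_load):
    """Sum the PE currents, with the 2*Ib offset of each PE removed as a
    current mirror would, and convert the total to a voltage across r_load."""
    total = 0.0
    for c, r in zip(cur, ref):
        total += pe_current(c - r, k, i_b) - 2 * i_b
    return total * r_load


# Illustrative values: pixel voltages in volts, k in A/V^2, Ib in A, load in ohms.
v = distance_voltage([0.1, 0.2], [0.15, 0.1], k=100e-6, i_b=10e-6, r_load=10e3)
print(v)
```

With these numbers the conformance limit is about 0.316 V, so pixel differences within a 0 to 255 mV input range stay inside the square-law region.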
6.2.3 Analogue Processing Element Small-signal Analysis
The source follower shown in Fig 6.6 (a) is characterised by high input resistance and 
low output resistance with ideally unity voltage gain transfer, assuming the load 
$R_L \to \infty$ and transistor output resistance $r_{o1} \to \infty$. The transistor transconductance 
parameter is defined by $g_{m1}$ and the body effect by $g_{mb1}$.
$$\frac{V_o}{V_i} = \frac{g_{m1}}{g_{m1} + g_{mb1}} \tag{6.18}$$
The output resistance of a source follower can be approximated under the same 
conditions as before by:
$$R_o = \frac{1}{g_{m1} + g_{mb1}} \tag{6.19}$$
However, since the MOS transistor usually has much lower transconductance than a bipolar transistor, the output resistance is not low enough for many applications. This situation is particularly pronounced where resistive loads are to be driven. It may be worked around by simply increasing the W/L ratio and bias current of the source follower, but this is undesirable since the area and power dissipation are proportionately increased. The use of negative feedback offers a way to reduce the output resistance without increasing area or power dissipation. To that end the flipped follower is sometimes used, as shown in Fig. 6.6 (b). The corresponding small-signal model is shown in Fig. 6.7.
Figure 6.6 The source follower (a) and the flipped follower (b).
Figure 6.7 The small-signal equivalent circuit of the flipped follower.
A. Flipped follower small-signal analysis - output resistance
The small-signal output resistance of the flipped follower of Fig. 6.6 (b) is found using the small-signal equivalent circuit of Fig. 6.7. The current source shown in Fig. 6.6 (b) is treated as non-ideal and is represented by the conductance $g_{oB}$. The output resistance is found by setting the input voltage $V_i$ to zero and finding the current $i_o$ that flows into the output with the voltage $V_o$ applied at the output. $V_1$ is the voltage developed across the conductance $g_{oB}$ of the non-ideal current source. Using Kirchhoff's current law (KCL), the current at the output is found as:
$$i_o = \frac{V_o}{r_{o2}} + g_{m2}V_1 + V_1 g_{oB} \qquad (6.20)$$
Setting $V_i$ to zero and using KCL at the drain of M1 gives,
$$V_1 g_{oB} + \frac{V_1 - V_o}{r_{o1}} - g_{m1}V_o - g_{mb1}V_o = 0 \qquad (6.21)$$
Solving (6.21) for $V_1$, then substituting into (6.20) and rearranging gives,
$$R_o = \left.\frac{V_o}{i_o}\right|_{V_i=0} = r_{o2} \parallel \frac{r_{o1} + 1/g_{oB}}{[1 + (g_{m1}+g_{mb1})r_{o1}][1 + g_{m2}/g_{oB}]} \qquad (6.22)$$
Assuming $I_B$ is an ideal current source, so that $g_{oB} \to 0$, and also that $r_{o2} \to \infty$ and $(g_{m1}+g_{mb1})r_{o1} \gg 1$,
$$R_o \approx \frac{1}{g_{m2}r_{o1}(g_{m1}+g_{mb1})} \qquad (6.23)$$
B. Flipped follower small-signal analysis - voltage gain
The voltage gain of the flipped follower of Fig. 6.6 (b) may be found using the small-signal equivalent circuit of Fig. 6.7. The analysis is performed with the output unloaded, so that $i_o = 0$. Therefore KCL at the output gives,
$$\frac{V_o}{r_{o2}} + g_{m2}V_1 + V_1 g_{oB} = 0 \qquad (6.24)$$
KCL at the drain of M1 gives,
$$V_1 g_{oB} + \frac{V_1 - V_o}{r_{o1}} + g_{m1}(V_i - V_o) - g_{mb1}V_o = 0 \qquad (6.25)$$
Solving (6.24) for $V_1$, substituting into (6.25) and rearranging gives,
$$\left.\frac{V_o}{V_i}\right|_{i_o=0} = \frac{g_{m1}r_{o1}}{1 + (g_{m1}+g_{mb1})r_{o1} + \dfrac{r_{o1} + 1/g_{oB}}{r_{o2}(1 + g_{m2}/g_{oB})}} \qquad (6.26)$$
With an ideal current source $I_B$, $g_{oB} \to 0$ and
$$\left.\frac{V_o}{V_i}\right|_{i_o=0} = \frac{g_{m1}r_{o1}}{1 + (g_{m1}+g_{mb1})r_{o1} + \dfrac{1}{r_{o2}g_{m2}}} \qquad (6.27)$$
The main conclusions when comparing the flipped follower to the source follower are:
• The output resistance of the source follower (6.19) has been lowered by approximately a factor of $g_{m2}r_{o1}$ by using negative feedback in the flipped follower (6.23).
• The gain is less than unity in the case of the source follower (6.18) and slightly lower still for the flipped follower (6.27).
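The comparison can be checked numerically. The sketch below assumes the flipped follower output resistance takes the form $r_{o2} \parallel (r_{o1} + 1/g_{oB}) / \{[1 + (g_{m1}+g_{mb1})r_{o1}][1 + g_{m2}/g_{oB}]\}$ and that its gain with an ideal current source is $g_{m1}r_{o1} / [1 + (g_{m1}+g_{mb1})r_{o1} + 1/(r_{o2}g_{m2})]$. All small-signal values in `PARAMS` are hypothetical and chosen only for illustration.

```python
def parallel(r_a, r_b):
    """Parallel combination of two resistances."""
    return r_a * r_b / (r_a + r_b)


def source_follower_rout(gm1, gmb1):
    return 1.0 / (gm1 + gmb1)


def source_follower_gain(gm1, gmb1):
    return gm1 / (gm1 + gmb1)


def flipped_follower_rout(gm1, gmb1, gm2, ro1, ro2, g_ob):
    """Output resistance with a non-ideal current source of conductance g_ob > 0."""
    inner = (ro1 + 1.0 / g_ob) / ((1.0 + (gm1 + gmb1) * ro1) * (1.0 + gm2 / g_ob))
    return parallel(ro2, inner)


def flipped_follower_gain(gm1, gmb1, gm2, ro1, ro2):
    """Voltage gain with an ideal current source (g_ob -> 0)."""
    return gm1 * ro1 / (1.0 + (gm1 + gmb1) * ro1 + 1.0 / (ro2 * gm2))


# Hypothetical small-signal parameters (assumptions, not taken from the thesis).
PARAMS = dict(gm1=200e-6, gmb1=40e-6, gm2=200e-6, ro1=500e3, ro2=500e3)
G_OB = 1e-9   # S, nearly ideal current source

R_SF = source_follower_rout(PARAMS["gm1"], PARAMS["gmb1"])
R_FF = flipped_follower_rout(g_ob=G_OB, **PARAMS)
IMPROVEMENT = R_SF / R_FF   # expected to be close to gm2 * ro1
```

With these assumed values the improvement factor lands close to $g_{m2}r_{o1}$, and the flipped follower gain is marginally below that of the source follower.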
6.2.4 Analogue Processing Element Frequency Characteristics
The use of feedback offers many benefits, such as the low output resistance seen in the flipped follower. However, the frequency behaviour of the circuit must be investigated to derive the conditions for stability. Fig. 6.8a shows a flipped voltage follower (M3, M4, $I_B$) driving the source of transistor M1 (the case in Fig. 6.5). Capacitors $C_1$ and $C_2$ represent the total parasitic capacitances at nodes “a” and “b”, given by
(6.28)
(6.29)
where $C_{gs}$ is the gate-source capacitance, $C_{gd}$ is the drain-gate capacitance, $C_{db}$ is the drain-bulk capacitance, $C_{sb}$ is the source-bulk capacitance, and the numbers refer to the corresponding transistors in Fig. 6.8a. In order to carry out the ac analysis, it is convenient to break the feedback loop between the gate of M3 and the drain of M4, and to apply the input voltage $V_i$ at the gate of M3. The gate voltages of M1 and M4 are considered at signal ground and the output voltage $V_o$ is taken at the drain of M4. The resulting small-signal ac equivalent circuit is shown in Fig. 6.8b, which includes the body effect of M1 and M4. Straightforward analysis results in the following transfer function
$$\frac{V_o}{V_i} = \frac{g_{m3}(g_{m4}+g_{mb4}+g_{o4})}{g_{o4}g_a + g_B(g_a+g_{m4}+g_{mb4}+g_{o4}) + [g_{o4}C_1 + C_2(g_a+g_{m4}+g_{mb4}+g_{o4})]s + C_1C_2s^2} \qquad (6.30)$$
where $g_a = g_{o3}+g_{m1}+g_{mb1}+g_{o1}$, $g_{mb}$ is the bulk transconductance, $g_o$ is the output conductance, and the numbers refer to the corresponding transistors in Fig. 6.8a. Using the dominant-pole approximation [78], the two poles are estimated to be
$$\omega_{p1} \approx \frac{g_{o4}g_a + g_B(g_a+g_{m4}+g_{mb4}+g_{o4})}{g_{o4}C_1 + C_2(g_a+g_{m4}+g_{mb4}+g_{o4})} \qquad (6.31)$$
$$\omega_{p2} \approx \frac{g_{o4}C_1 + C_2(g_a+g_{m4}+g_{mb4}+g_{o4})}{C_1C_2} \qquad (6.32)$$
and $\omega_{p1} \ll \omega_{p2}$. To ensure stability for all feedback conditions down to unity-gain feedback, with a phase margin of at least 45°, the non-dominant pole $\omega_{p2}$ must be at least equal to the gain-bandwidth product [101]. That is,
$$\frac{g_{o4}C_1 + C_2(g_a+g_{m4}+g_{mb4}+g_{o4})}{C_1C_2} \geq \frac{g_{m3}(g_{m4}+g_{mb4}+g_{o4})}{g_{o4}C_1 + C_2(g_a+g_{m4}+g_{mb4}+g_{o4})} \qquad (6.33)$$
Ignoring the output conductance terms, as these are much smaller than the transconductance terms, the condition for stability becomes
$$\frac{C_2}{C_1} \geq \frac{g_{m3}(g_{m4}+g_{mb4})}{(g_{m1}+g_{mb1}+g_{m4}+g_{mb4})^2} \qquad (6.34)$$
which can be easily satisfied for all expected drain current conditions, as shown in section 6.2.6.
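The dominant-pole reasoning can be sanity-checked numerically. The sketch below assumes the open-loop transfer function has a second-order denominator $a_0 + a_1 s + a_2 s^2$ with $a_0 = g_{o4}g_a + g_B(g_a + g_{m4} + g_{mb4} + g_{o4})$, $a_1 = g_{o4}C_1 + C_2(g_a + g_{m4} + g_{mb4} + g_{o4})$ and $a_2 = C_1C_2$, and a dc numerator $g_{m3}(g_{m4} + g_{mb4} + g_{o4})$. All conductance and capacitance values are hypothetical placeholders, not extracted from the fabricated circuit.

```python
import cmath


def denominator_coeffs(g_a, g_b, gm4, gmb4, go4, c1, c2):
    """Coefficients (a0, a1, a2) of the assumed denominator a0 + a1*s + a2*s**2."""
    g4 = gm4 + gmb4 + go4
    a0 = go4 * g_a + g_b * (g_a + g4)
    a1 = go4 * c1 + c2 * (g_a + g4)
    a2 = c1 * c2
    return a0, a1, a2


def dominant_pole_estimates(a0, a1, a2):
    """Dominant-pole approximation: w_p1 ~ a0/a1 and w_p2 ~ a1/a2 (rad/s)."""
    return a0 / a1, a1 / a2


def exact_pole_magnitudes(a0, a1, a2):
    """Magnitudes of the exact roots of a2*s**2 + a1*s + a0, smallest first."""
    disc = cmath.sqrt(a1 * a1 - 4.0 * a2 * a0)
    roots = ((-a1 + disc) / (2.0 * a2), (-a1 - disc) / (2.0 * a2))
    return sorted(abs(r) for r in roots)


def meets_45_degree_condition(gm3, gm4, gmb4, go4, a1, a2):
    """True when the non-dominant pole is at or above the gain-bandwidth product."""
    gbw = gm3 * (gm4 + gmb4 + go4) / a1   # dc gain times w_p1
    return a1 / a2 >= gbw


# Hypothetical element values (assumptions for illustration only).
GM, GMB, GO, G_B = 200e-6, 40e-6, 2e-6, 2e-6
G_A = GO + GM + GMB + GO            # g_o3 + g_m1 + g_mb1 + g_o1
C1, C2 = 50e-15, 100e-15
A0, A1, A2 = denominator_coeffs(G_A, G_B, GM, GMB, GO, C1, C2)
```

For well-separated poles the estimates agree closely with the exact roots, which is the condition under which the dominant-pole approximation is meaningful.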
Figure 6.8 (a) Flipped voltage follower driving the source of another transistor. (b) Equivalent small-signal model when the feedback loop is broken.
6.2.5 Mobility Degradation and Short Channel Effects
The most familiar parameters that change with improvements in MOS technology are the vertical and horizontal dimensions of the channel. The vertical reduction refers in particular to the gate oxide thickness and the horizontal to the channel length, as shown in Fig. 6.9. Numerous benefits, such as increased transistor cut-off frequency, greater transistor density for the same area and lower material cost, result from the successive reductions in device dimensions with each subsequent MOS technology. The improvements in speed can be mostly attributed to the reduction of input capacitance. However, particularly with sub-micron processes, deviations are seen from the first-order derivations based on the classic square-law interpretation of MOS characteristics in the saturation region (6.1).
Therefore, second-order models have been applied to better model the MOS characteristics. The mobility degradation due to the reduction in vertical dimension is the most severe cause of deviations in long channel devices, since it cannot be worked around. Mobility degradation due to reduced channel length is the most pronounced short channel effect. Strategies to reduce or avoid other factors, such as channel length modulation and the body effect, have been developed in circuit design. The body effect, also known as the back-gate effect, can be avoided by placing the transistor in a well separate from the rest of the substrate and connecting its source to that well [78]. Channel length modulation has been reduced using cascode structures or feedback in circuit design [102].
Figure 6.9 The important dimensions of a MOS transistor. [Diagram labels: gate, SiO2, $t_{ox}$, source and drain n+ regions, n channel, p- substrate, electric field.]
A. Mobility degradation
The reduction of the vertical dimension of transistors results in mobility degradation. Mobility degradation is caused by velocity saturation, the condition whereby increasing the electric field no longer increases the carrier speed. This situation occurs in technologies that have large electric fields, including those with short channel lengths, which can therefore produce large electric fields.
Mobility degradation can be modelled [103] using:
$$\mu_{EFF} = \frac{\mu_0}{1 + \theta(V_{GS} - V_T)} \qquad (6.35)$$
where $\mu_0$ is the zero-field mobility, $\mu_{EFF}$ is the effective mobility and $\theta$ is inversely proportional to the oxide thickness $t_{ox}$. Rearranging (6.35) as follows,
$$\frac{\mu_0}{\mu_{EFF}} = 1 + \theta(V_{GS} - V_T) \qquad (6.36)$$
and substituting (6.35) into (6.1) gives,
$$I_D = k\,\frac{(V_{GS} - V_T)^2}{1 + \theta(V_{GS} - V_T)} \qquad (6.37)$$
The effect of mobility degradation (6.37) on the saturation drain current, compared with Sah's model (6.1), is shown in Fig. 6.10 as a function of $V_{GS}$. As can be seen, the Sah model closely approximates the BSIM3v3 model (Berkeley Short-channel IGFET Model) for up to 500 mV of overdrive voltage ($V_{GS} - V_T$). The departure from the Sah model is particularly pronounced for high values of $V_{GS}$. This is largely due to the mobility reduction present with the 16.7 nm gate oxide thickness used in the 0.8 μm technology. The mobility reduction effect is determined by the vertical dimension for transistors with larger L. Therefore, increasing L from 1.2 μm to 9.6 μm while keeping the aspect ratio the same is seen to have little effect.
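To make the size of the departure concrete, the sketch below compares the square-law (Sah) current $k(V_{GS} - V_T)^2$ with the same expression divided by $1 + \theta(V_{GS} - V_T)$. The values of `K` and `THETA` are hypothetical and are not fitted to the BSIM3v3 curves of Fig. 6.10.

```python
K = 320e-6     # A/V^2, hypothetical transconductance parameter (assumption)
THETA = 0.1    # 1/V, hypothetical mobility degradation coefficient (assumption)


def sah_current(v_ov, k=K):
    """Square-law saturation current for overdrive v_ov = V_GS - V_T."""
    return k * v_ov * v_ov if v_ov > 0.0 else 0.0


def degraded_current(v_ov, k=K, theta=THETA):
    """Square-law current reduced by the factor 1 + theta * v_ov."""
    if v_ov <= 0.0:
        return 0.0
    return sah_current(v_ov, k) / (1.0 + theta * v_ov)


def relative_deviation(v_ov, theta=THETA):
    """Fractional shortfall of the degraded current relative to the Sah model."""
    return theta * v_ov / (1.0 + theta * v_ov)


if __name__ == "__main__":
    for v_ov in (0.25, 0.5, 1.0, 2.0, 4.0):
        print(f"{v_ov:4.2f} V overdrive: {100 * relative_deviation(v_ov):5.1f} % below Sah")
```

The shortfall grows monotonically with overdrive, which is why limiting the overdrive keeps the device closer to square-law behaviour.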
Figure 6.10 The drain current ($I_D$) versus gate-source voltage ($V_{GS}$) for different L, keeping the aspect ratio constant (0.8 μm technology, BSIM3v3 model used with $t_{ox}$ = 16.3 nm). The Sah model is superimposed for comparison. [Plot: horizontal axis $|V_{GS}|$ (Volts), 0 to 5; BSIM3v3 curves for W/L = 128/9.6, 64/4.8, 32/2.4 and 16/1.2 μm; Sah model curve for 16/1.2 μm.]
The square-law conformance is degraded by mobility reduction. This is particularly pronounced at high values of $V_{GS}$, as predicted by (6.37), and can be clearly seen by taking the square root of the drain current, as shown in Fig. 6.11. The linearity approximates the Sah model for up to 500 mV of overdrive voltage. However, unlike the Sah model, which predicts a large range of linearity as found in long channel device technologies (L > 10 μm), the mobility reduction causes degradation in square-law conformity even for large L.
The following observations regarding mobility degradation can be made:
• A large overdrive leads to significant departure from Sah’s model.
• Square-law conformity decreases with increased overdrive.
Figure 6.11 The square root of drain current ($I_D$) versus gate-source voltage ($V_{GS}$) for different L, keeping the aspect ratio constant (0.8 μm technology, BSIM3v3 model used with $t_{ox}$ = 16.3 nm). The Sah model is superimposed for comparison. [Plot: horizontal axis $|V_{GS}|$ (Volts), 0 to 5; BSIM3v3 curves for W/L = 128/9.6, 64/4.8, 32/2.4 and 16/1.2 μm; Sah model curve for 16/1.2 μm.]
The mobility degradation is significant and cannot be ignored. Therefore mobility degradation must be considered in relation to the analogue PE. It can be shown that the mobility degradation is equivalent to a series source resistor $R_S$ [91] represented by:
(6.38)
The equations (6.3a) and (6.3b) can be modified to include the voltage drops across $R_S$ as follows,
(6.39a)
and
(6.39b)
The summed drain currents of M1 and M2 are given by,
$$I_{D1} + I_{D2} = k[(V_{GS1} - V_T)^2 + (V_{GS2} - V_T)^2] \qquad (6.40)$$
Substituting (6.39a) and (6.39b) into (6.40), neglecting higher order terms and assuming $R_S V_x \ll 1$, then,
(6.41)
The square-law conformance range including mobility degradation is given by:
(6.42)
The derivation shows that the error introduced is given by
(6.43)
The effect of mobility reduction is seen from (6.42) and (6.43) to reduce the range of square-law conformance. The effects predicted by (6.42) and (6.43) will become more pronounced as vertical dimensions decrease with the scaling of technology.
B. Short Channel Effects
Mobility reduction is the most prominent short channel effect for MOS transistors with relatively short channels [104]. The reduction of both the vertical and horizontal dimensions of transistors results in mobility degradation for short channel devices. The model represented by (6.37) only took into account mobility degradation due to the reduction of the vertical dimension. As such, mobility degradation will be considered for short channel devices using models that take L into account.
In recent years several empirical models that attempt to model velocity saturation have been developed [78], [105], [106]. The model developed by Chen et al. [107] is the most popular and is given by:
$$I_{DSAT} = W v_{SAT} C_{ox}(V_{GS} - V_T - V_{DSAT}) = W v_{SAT} C_{ox}\frac{(V_{GS} - V_T)^2}{(V_{GS} - V_T) + E_{SAT}L_{EFF}} \qquad (6.44)$$
where $v_{SAT}$ is the saturation velocity, $V_{DSAT}$ the drain saturation voltage, $L_{EFF}$ the effective length, and the electric field $E_{SAT}$ is given by:
$$E_{SAT} = \frac{2v_{SAT}}{\mu_{EFF}} \qquad (6.45)$$
Indeed it can be shown that as the limit condition $L \to 0$ is approached, (6.44) becomes the velocity-saturation limited current given by,
$$I_{DSAT} = W v_{SAT} C_{ox}(V_{GS} - V_T) \qquad (6.46)$$
It can be noted that (6.46) is independent of channel length L and varies linearly with overdrive voltage. This can be contrasted with (6.1), which varies quadratically, as found in long channel devices.
The effect of mobility degradation for short channel devices (6.44) on the saturation drain current, compared with Sah's model (6.1), is shown in Fig. 6.12 as a function of $V_{GS}$. As can be seen, the Sah model moderately approximates the BSIM3v3 model for up to 500 mV of overdrive voltage. The departure from the Sah model is particularly pronounced for high values of $V_{GS}$. This is partly due to the mobility reduction present with the 7.7 nm gate oxide thickness used in the 0.35 μm technology. Increasing the horizontal dimension, with L increasing from 0.6 μm to 9.6 μm while keeping the aspect ratio the same, is seen to steadily improve the approximation to the Sah model, since the larger horizontal dimension reduces the mobility degradation effect. It can be seen that with decreasing L, the drain saturation current characteristic tends toward linear, as predicted by (6.46). Therefore velocity saturation is strongly dependent on the distance L. This result for the short channel 0.35 μm technology can be contrasted with the 0.8 μm technology, where it was shown that mobility reduction due to the vertical dimension was the most prominent cause of deviation from the Sah model and increasing L made only a marginal difference.
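The two limits of the velocity-saturation model can be illustrated numerically. The sketch below assumes the saturation current has the form $W v_{SAT} C_{ox}(V_{GS}-V_T)^2 / [(V_{GS}-V_T) + E_{SAT}L_{EFF}]$ with $E_{SAT} = 2v_{SAT}/\mu_{EFF}$. The saturation velocity and mobility are typical textbook magnitudes used as assumptions; only the 7.7 nm oxide thickness comes from the text.

```python
EPS_OX = 3.9 * 8.854e-12   # F/m, permittivity of SiO2
T_OX = 7.7e-9              # m, oxide thickness quoted for the 0.35 um process
C_OX = EPS_OX / T_OX       # F/m^2
V_SAT = 1.0e5              # m/s, assumed saturation velocity
MU_EFF = 0.04              # m^2/(V s), assumed effective mobility
E_SAT = 2.0 * V_SAT / MU_EFF


def i_dsat(v_ov, w, l_eff):
    """Saturation current with velocity saturation, v_ov = V_GS - V_T."""
    return w * V_SAT * C_OX * v_ov * v_ov / (v_ov + E_SAT * l_eff)


def square_law(v_ov, w, l_eff):
    """Long-channel limit: (mu * C_ox / 2) * (W / L) * v_ov**2."""
    return 0.5 * MU_EFF * C_OX * (w / l_eff) * v_ov * v_ov


def velocity_limited(v_ov, w):
    """Short-channel limit: independent of L and linear in v_ov."""
    return w * V_SAT * C_OX * v_ov


if __name__ == "__main__":
    for l_eff in (0.6e-6, 1.2e-6, 9.6e-6):
        w = l_eff * 16 / 1.2   # keep the aspect ratio used in the figures
        print(l_eff, i_dsat(1.0, w, l_eff) / square_law(1.0, w, l_eff))
```

Under these assumptions the ratio to the square-law current approaches one as L grows, while for very small L the current approaches the linear, length-independent limit.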
Figure 6.12 The drain current ($I_D$) versus gate-source voltage ($V_{GS}$) for different L, keeping the aspect ratio constant (0.35 μm technology, BSIM3v3 model used with $t_{ox}$ = 7.7 nm). The Sah model is superimposed for comparison. [Plot: horizontal axis $|V_{GS}|$ (Volts), 0 to 5; BSIM3v3 curves for W/L = 128/9.6, 64/4.8, 32/2.4, 16/1.2 and 8/0.6 μm; Sah model curve for 16/1.2 μm.]
The square-law conformance is degraded by mobility degradation for the 0.35 μm technology. This is particularly pronounced at high values of $V_{GS}$, as predicted by (6.41), and can be clearly seen by taking the square root of the drain current, as shown in Fig. 6.13. The linearity only moderately approximates the Sah model for up to 500 mV of overdrive voltage. Using large values of L will reduce velocity saturation.
The following observations regarding velocity saturation can be made:
• A large overdrive leads to significant departure from Sah’s model.
• Square-law conformity decreases with increased overdrive.
• Mobility degradation for short channel devices is highly dependent on L, tending towards a linear dependence of drain saturation current on overdrive voltage.
Figure 6.13 The square root of drain current ($I_D$) versus gate-source voltage ($V_{GS}$) for different L, keeping the aspect ratio constant (0.35 μm technology, BSIM3v3 model used with $t_{ox}$ = 7.7 nm). The Sah model is superimposed for comparison. [Plot: horizontal axis $|V_{GS}|$ (Volts), 0 to 5; BSIM3v3 curves for W/L = 128/9.6, 64/4.8, 32/2.4, 16/1.2 and 8/0.6 μm; Sah model curve for 16/1.2 μm.]
C. Design Strategy
While design strategies such as feedback and cascode have proved useful in reducing 
short channel effects such as channel length modulation, the mobility reduction presents 
a limitation on analogue design. However, useful circuits can be realised by limiting 
overdrive voltage and therefore better approximating square-law characteristics in 
applications where required.
Simulations have shown that only a moderate square-law conformance is required for 
the MSE matching function used in motion estimation [74]. Therefore, for the purpose 
of realising an analogue PE the limitations imposed by mobility reduction are not 
critical.
6.2.6 Simulated Results
To evaluate each of the square function circuits presented in section 6.2.2, the total harmonic distortion (THD) was simulated with a 400 mVp-p, 1 kHz differential sinusoidal input. A comparison of the various realisations as a function of L is shown in Fig. 6.14. The W to L ratio of each transistor shown in Table 6.1 was kept constant. As expected, the THD of the flipped voltage follower implementation is better than that of the source follower implementation, because of the significant difference in output impedances. However, the THD of the CMOS pair implementation is not as low as that of the flipped voltage follower implementation. This can be attributed to the fact that, for the process used, the square-law behaviour of the nMOS and pMOS transistors is not the same. In both the source follower and CMOS pair implementations, the THD improves with increasing L, at the expense of more area and lower operating speed. For the flipped voltage follower implementation, virtually no improvement in THD is achieved by increasing L. Therefore an L of 1.2 μm was chosen for compactness and high speed. The lack of improvement is due to the mobility degradation effect described in section 6.2.5: it was seen in Fig. 6.11 that increasing L only improved the square-law conformity marginally, because the mobility degradation is caused by the vertical dimension, so increasing the horizontal dimension has little effect. A comparison of the three squaring circuits, using an L of 1.2 μm, is summarised in Table 6.1. The geometries in Table 6.1 were chosen to give similar operating ranges for equations 6.11, 6.15 and 6.17. Additionally, the geometries of the flipped follower implementation using an L of 1.2 μm give a reasonable level of transistor matching [72].
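The link between mobility degradation and distortion in a squarer can be sketched with a simple behavioural model. The code below is not the transistor-level simulation reported here: it assumes each output transistor follows $k v^2/(1 + \theta v)$ with a fixed quiescent overdrive, drives the pair with a sinusoid, and measures the unwanted harmonics relative to the wanted second harmonic using a direct DFT. The values of `K` and `theta` are hypothetical.

```python
import cmath
import math

K = 320e-6    # A/V^2, hypothetical transconductance parameter (assumption)
I_B = 20e-6   # A, bias current quoted in Table 6.1


def squarer_output(v_i, theta, k=K, i_b=I_B):
    """Summed current of the two output devices for differential input v_i."""
    v_ov = math.sqrt(i_b / k)   # assumed fixed quiescent overdrive
    total = 0.0
    for v in (v_ov + v_i, v_ov - v_i):
        if v > 0.0:
            total += k * v * v / (1.0 + theta * v)
    return total


def harmonic_amplitudes(samples, n_harmonics):
    """Amplitudes of harmonics 1..n_harmonics of one sampled period (direct DFT)."""
    n = len(samples)
    amps = []
    for h in range(1, n_harmonics + 1):
        acc = sum(x * cmath.exp(-2j * math.pi * h * i / n)
                  for i, x in enumerate(samples))
        amps.append(2.0 * abs(acc) / n)
    return amps


def squarer_thd(amplitude, theta, n=256):
    """Ratio of unwanted harmonics to the wanted second harmonic."""
    samples = [squarer_output(amplitude * math.sin(2.0 * math.pi * i / n), theta)
               for i in range(n)]
    amps = harmonic_amplitudes(samples, 8)
    wanted = amps[1]   # second harmonic of the input frequency
    unwanted = math.sqrt(sum(a * a for j, a in enumerate(amps) if j != 1))
    return unwanted / wanted
```

With `theta = 0` the output contains only dc and the second harmonic; a non-zero `theta` introduces mainly a fourth-harmonic term, in line with the even symmetry of the squarer.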
Figure 6.14 Simulated THD of the three squaring circuits as a function of channel length (differential sinusoidal input: 1 kHz, 400 mVp-p). [Plot: THD versus L (μm), 1.2 to 5.4; curves for the source follower, CMOS pair and flipped voltage follower implementations.]
Parameter                               Squaring Circuit
                                        Fig. 6.3    Fig. 6.4    Fig. 6.5
Minimum power supply (V)                1.7         3.5         2
Differential input range (mV)           ~ ±300
Bias current $I_B$ (μA)                 80          20          20
Transistor dimensions (μm/μm)
  M1                                    16/1.2      48/1.2      16/1.2
  M2                                    16/1.2      48/1.2      16/1.2
  M3                                    48/1.2      19/1.2      16/1.2
  M4                                    48/1.2      19/1.2      16/1.2
  M5                                    -           48/1.2      16/1.2
  M6                                    -           19/1.2      16/1.2
  M7                                    -           48/1.2      -
  M8                                    -           19/1.2      -
THD % (400 mVp-p, 1 kHz sinusoid)       0.9         0.95        0.25
-3 dB bandwidth (MHz)                   >1000       150         175
Table 6.1 Various parameters for the three squaring circuits.
It is well known that a two-pole amplifier remains stable no matter how much negative feedback is employed [108]. In the case of the squarer shown in Fig. 6.5 the feedback is unity. The open-loop response of the circuit in Fig. 6.8a ($I_B$ = 20 μA realised by an nMOS simple current mirror, all W/L = 16 μm/1.2 μm) was investigated for the following two extreme situations: 1) $V_1$ = 100 mV and $V_2$ = 400 mV (dc drain current of M1 is almost zero), and 2) $V_2$ = 100 mV and $V_1$ = 400 mV (dc drain current of M1 is approaching $4I_B$). The output is taken from a 7 kΩ load. The corresponding Bode plots are shown in Figs. 6.15 and 6.16, where it is confirmed that the circuit is stable in both situations, with phase margins of 56° and 83°, respectively. In the closed-loop configuration, the time-domain response of the drain current of M1 in Fig. 6.8a to step inputs (Fig. 6.17a) is shown in Fig. 6.17b. The overshoot in the lower level of the current is 6%, as expected from the open-loop phase margin in Fig. 6.15.
Figure 6.15 Bode plot of the circuit in Fig. 6.8a for $V_1$ = 100 mV and $V_2$ = 400 mV.
Figure 6.16 Bode plot of the circuit in Fig. 6.8a for $V_1$ = 400 mV and $V_2$ = 100 mV.
Figure 6.17 Simulated time domain response of the closed-loop circuit in Fig. 6.8. (a) Applied step inputs $V_1$ and $V_2$. (b) Time response of the drain current of transistor M1.
6.2.7 Measured Results
The circuits based on the flipped voltage follower were fabricated using the Austria Micro Systems (AMS) 0.8 μm double-poly CMOS technology [109]. A microphotograph of the squarer circuit is shown in Fig. 6.18. The die area for the squarer is 34.6 μm × 78.4 μm. The various bias currents $I_B$ (20 μA) were realised using simple nMOS mirrors of the same geometry as all other transistors (16 μm/1.2 μm). Furthermore, the various drain currents were terminated in off-chip precision-matched 2 kΩ resistors, and the input voltages were applied differentially (common-mode of 250 mV). The frequency plots and THD measurements were obtained directly using an SR760 FFT spectrum analyser. For the following measurements, the circuits were operated from a single 2 V power supply.
The dc transfer curve of the squaring circuit is shown in Fig. 6.19. The maximum deviation from the ideal square-law characteristic over the input signal range was calculated to be 1.25%. Fig. 6.20a shows the spectrum of the output voltage for a 400 mVp-p, 1 kHz differential sinusoidal input. The spectrum shows predominantly fourth harmonic distortion, about 45 dB below the desired second harmonic. The harmonic at 1 kHz is attributed to transistor mismatches, and it was observed that by slightly offsetting $V_{in}$ this unwanted term almost disappeared. Fig. 6.20b shows the average (over 10 chips) THD as a function of input signal amplitude. The average maximum THD remains under 1.5% for the entire input range. The moderate transistor matching accounts for the deviation from simulated performance attainable with such small devices. The input-offset voltage was measured for a number of squarer circuits, as shown in Fig. 6.21. To gain a complete understanding of the input-offset voltage mismatch, a large number of samples would be necessary. However, limited resources allowed only 50 to be fabricated, over 10 chips in the same fabrication run. Nevertheless some indication of the input-offset is given, suggesting a standard deviation of 7.4 mV. Superior performance in terms of mismatch and THD could generally be achieved using much larger devices, as in [91], [92].
Figure 6.18 Chip microphotograph of squarer circuit.
Figure 6.19 Measured plot of square-law transfer curve. [Plot: output (mV) versus differential input (mV), -300 to 300.]
Figure 6.20 Squarer measurements. (a) Spectrum of output voltage (differential input: 1 kHz, 400 mVp-p). (b) Average THD as a function of input signal amplitude.
Figure 6.21 Input-offset voltage distribution histogram of the squarer circuit shown in Fig. 6.5.
6.3 Analogue Memory
The Analogue Memory shown in Fig. 6.1 comprises an SB block and a TB block. The accuracy of the pixel values of the blocks to be matched is largely determined by the performance of the circuits used in this sub-system. Since the distance metric circuits are transconductors, voltage sample and hold circuits must be used. Such sample and hold circuits range from the simplest, comprising a single transistor switch followed by a capacitor and a buffer, to complex circuits using many transistors and more than one capacitor. The former, though only moderately accurate, has the advantage of simplicity, with corresponding compactness, low power dissipation and high operating speed. The latter usually features high precision, but at the expense of larger implementation area, higher power dissipation and lower operating speed.
The investigations described in chapter 4 showed that 6-bit pixel precision using the MSE block matching algorithm is sufficient to give performance similar to 8-bit pixel precision using the MAE block matching algorithm. Therefore, since low power dissipation is required, simpler sample and hold circuits with a source follower as the voltage buffer will be adopted for implementation of the analogue motion estimator. The circuits are described in this section.
6.3.1 Sample and Hold with Dummy Switch
Typically two non-ideal effects associated with the single transistor switch limit the use of the simple sample and hold circuit shown in Fig. 6.22. These effects are charge injection and charge feedthrough. The two effects result in a signal-dependent hold step, limiting the accuracy of the sample and hold. Many studies have been undertaken to characterise and minimise the effects [110]. The most widely used method to reduce charge feedthrough errors is the use of a dummy switch, shown in Fig. 6.23. The technique proposed by McCreary and Gray [111] is fully characterised by Eichenberger and Guggenbuhl [112].
The technique is based on the idea that if the width of MD is half that of M1 then, provided the clocks are fast and the clock of MD changes slightly after that of M1, the charges will cancel. Since the complementary clock of MD will be generated using an inverter, the slight delay can be effectively realised. The arrangement shown in Fig. 6.23 improves the accuracy about five times compared to the simple sample and hold shown in Fig. 6.22.
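The half-width argument can be illustrated with a first-order charge model. The sketch below assumes the channel charge of the switch is $W L C_{ox}(V_G - V_{signal} - V_T)$, that a fraction `split` of it reaches the hold capacitor at turn-off, and that the dummy transistor absorbs its own full channel charge when it turns on. All device and voltage values are hypothetical, and overlap-capacitance feedthrough is ignored.

```python
C_OX = 2.1e-3     # F/m^2, assumed gate capacitance per unit area
WIDTH = 2.0e-6    # m, hypothetical switch width
LENGTH = 0.8e-6   # m, hypothetical switch length
C_HOLD = 1.0e-12  # F, hypothetical hold capacitor
V_GATE = 2.0      # V, assumed clock high level
V_T = 0.8         # V, assumed threshold voltage


def channel_charge(width, v_signal):
    """Magnitude of the inversion-layer charge of a conducting nMOS switch."""
    return width * LENGTH * C_OX * max(V_GATE - v_signal - V_T, 0.0)


def hold_step(v_signal, split=0.5, dummy_width=0.0):
    """Voltage step on the hold capacitor after the switch opens.

    split is the fraction of the switch channel charge that lands on the hold
    node; dummy_width is the width of the dummy transistor (0 means none).
    """
    injected = split * channel_charge(WIDTH, v_signal)   # pulls the node down
    absorbed = channel_charge(dummy_width, v_signal)     # dummy takes it back
    return (absorbed - injected) / C_HOLD


def improvement(v_signal, split):
    """Ratio of hold steps without and with a half-width dummy switch."""
    return hold_step(v_signal, split) / hold_step(v_signal, split, WIDTH / 2.0)
```

With an exactly even split the half-width dummy cancels the step in this model; with an uneven split (for example `split = 0.6`) a residual remains, so the improvement is finite.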
Figure 6.22 Simple single transistor switch sample and hold.
Figure 6.23 The single transistor switch with dummy transistor sample and hold.
The use of the single transistor switch with dummy transistor results in adequate performance for use in the analogue motion estimation processor. However, the input signal is restricted to a range of about 0 to 500 mV. To use higher voltages the transmission gate switch must be used.
6.3.2 Sample and Hold with Transmission Gate
The signal-dependent hold step caused by non-ideal effects can be minimised using the transmission gate switch, as shown in Fig. 6.24. This configuration results in a far wider dynamic range than is possible with the single transistor arrangements. The technique uses complementary transistors of the same size, so the charge injection due to each transistor will cancel when switched off. However, since the characteristics of the two transistor types are somewhat dissimilar, exactly complementary turn-off is not possible over the entire dynamic range. Nevertheless, when operated in the region midway between the supplies, adequate performance for use in the analogue motion estimation processor can be realised.
Figure 6.24 The transmission gate switch sample and hold.
6.3.3 Rotary Shifter
The rotary shifters reorder the columns or rows of pixels from the memories, either separately or simultaneously. Two separate circuits do this, one for the rows on the SB side of the MSE block, the other for the columns on the TB side of the MSE distance metric. They are separated on each side to balance equally the errors introduced by the rotary shifter, which is comprised of analogue switches based on transmission gates.
The circuit used to multiplex the different pixel columns or rows is the barrel shifter, also known as a rotary shifter, shown in Fig. 6.25. This structure, based on pass transistors to multiplex data, has been used for many years in microprocessors. It has also found application as a video crosspoint switch in analogue video processing. In the analogue motion estimation processor, however, analogue signals are multiplexed using transmission gates.
Figure 6.25 Rotary Shifter (4 control lines, S0-S3).
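The reordering performed by a rotary shifter can be sketched behaviourally as a cyclic rotation selected by one-hot control lines. In the sketch below the function name, the one-hot encoding check and the rotation direction (control line S_k routes input D[(i + k) mod n] to output i) are assumptions made for illustration; the actual wiring of Fig. 6.25 may differ.

```python
def rotary_shift(data, select):
    """Cyclically rotate `data` according to a one-hot `select` vector.

    data   : list of column (or row) values, D0..D(n-1)
    select : list of n control bits S0..S(n-1), exactly one of which is 1
    Returns the outputs B0..B(n-1), where B[i] = D[(i + k) % n] for active S_k.
    """
    n = len(data)
    if len(select) != n:
        raise ValueError("one control line per input is required")
    if sorted(select) != [0] * (n - 1) + [1]:
        raise ValueError("select must be one-hot")
    k = select.index(1)
    return [data[(i + k) % n] for i in range(n)]


if __name__ == "__main__":
    columns = ["D0", "D1", "D2", "D3"]
    for k in range(4):
        select = [1 if i == k else 0 for i in range(4)]
        print(select, rotary_shift(columns, select))
```

Each control line closes one diagonal of switches, so only one signal path is active per output at any time.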
6.4 Decision Circuit
This section describes the circuits used in the decision circuit (shown in Fig. 6.1), which consists of a high accuracy sample and hold circuit based on the Miller hold capacitance scheme and a comparator.
6.4.1 High Accuracy Sample and Hold
The sample and hold used for the decision circuit is based on the Miller hold capacitance scheme described in [113] and [114] and shown in Fig. 6.26. In the sample phase, the switches are closed. Hence $V_1 = V_o$ and the input signal sees an equivalent capacitance $C_{eq}$ of $C_1 + C_2$, neglecting parasitic capacitances. During the hold phase, the feedback capacitor across amplifier A becomes:
$$C_{eq} = \frac{C_1 C_2}{C_1 + C_2} \qquad (6.47)$$
The Miller multiplication effect due to the amplifier A now makes the effective hold capacitance at the output (parasitic capacitances contribute slightly to $C_{eq}$):
$$C_{hold} = (A + 1)C_{eq} \qquad (6.48)$$
This value is typically much larger than the capacitance charged during the sampling mode; therefore much smaller switches can be used.
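A short calculation shows the scale of the Miller multiplication. The sketch below assumes the hold-phase feedback capacitance is the series combination of $C_1$ and $C_2$ and that the effective hold capacitance is $(A + 1)$ times that value. The 2 pF capacitors follow the values shown in Fig. 6.27; the amplifier gain `GAIN` is a hypothetical figure for a CMOS inverter biased at its switching point.

```python
C1 = 2e-12    # F, capacitor value shown in Fig. 6.27
C2 = 2e-12    # F, capacitor value shown in Fig. 6.27
GAIN = 50.0   # assumed magnitude of the inverter voltage gain (hypothetical)


def series_capacitance(c_a, c_b):
    """Series combination of two capacitors."""
    return c_a * c_b / (c_a + c_b)


def miller_hold_capacitance(c_a, c_b, gain):
    """Effective hold capacitance after Miller multiplication by (gain + 1)."""
    return (gain + 1.0) * series_capacitance(c_a, c_b)


C_EQ = series_capacitance(C1, C2)
C_HOLD = miller_hold_capacitance(C1, C2, GAIN)

if __name__ == "__main__":
    print(f"C_eq   = {C_EQ * 1e12:.1f} pF")
    print(f"C_hold = {C_HOLD * 1e12:.1f} pF")
```

Under the assumed gain, the effective hold capacitance is tens of picofarads although only picofarad-scale capacitors are charged through the switches.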
Figure 6.26 Simplified diagram of sample and hold circuit.
The transistor level implementation is shown in Fig. 6.27 using a CMOS inverter as the 
amplifier. This operates just at its switching point (by the shorting action of switches). 
This results in a compact, higher speed implementation than would have been possible 
with the compensated operational amplifier used in [113].
Figure 6.27 Transistor level circuit diagram of sample and hold circuit. [Component values: M1 and M2, 20/0.8 μm; M3, 48.4/2 μm; M4, 21.6/2 μm; C1 = C2 = 2 pF.]
6.4.2 Comparator
A high performance comparator is often made up of three stages: a preamplifier, a decision circuit that often employs positive feedback, and an output buffer that interfaces with digital logic, as shown in Fig. 6.28.
Figure 6.28 High performance comparator. [Block diagram: preamplifier, positive feedback decision circuit (latch), buffer.]
The preamplifier stage amplifies the input to improve the comparator sensitivity and isolates the inputs from the switching noise generated by the decision stage. The decision stage is based on a positive feedback circuit, which generally has several millivolts of input-offset and suffers from charge injection and clock kickback errors. Therefore the preamplifier input-offset and gain largely determine the accuracy of the comparator. The simple differential amplifier circuit shown in Fig. 6.29 was used as the preamplifier stage shown in Fig. 6.28.
Figure 6.29 Comparator preamplifier. [Transistor dimensions: M1 and M2, 15/3 μm; M3 to M6, 200/1.2 μm.]
However, recall that the improved motion estimation architecture described in section 5.3 requires the summation of SH1 and SH2 to be compared with the summation of SH3 and SH4 of Fig. 5.11. The voltage of the DMs being compared is composed of the components $V_1^+$, $V_1^-$ and $V_2^+$, $V_2^-$. Therefore, applying SH1 to $V_1^+$, SH2 to $V_1^-$, SH3 to $V_2^+$ and SH4 to $V_2^-$ of Fig. 6.29, it can be seen that the voltage-to-current behaviour of the differential pairs allows the convenient current summation of the required signals in the diode-connected transistors M1 and M2.
The circuit of Fig. 6.30 was used as the decision stage and buffer shown in Fig. 6.28. The decision circuit uses a dynamic circuit [115], [116] that dissipates zero current when the clock is low. In this state the regenerative latch, comprising transistors M3A, M3B, M4A and M4B, is reset. The low voltage at the gates of pMOS transistors M5A, M5B, M6A and M6B switches them fully on, allowing a low impedance path between $V_{DD}$ and the output of the latch and input transistors M2A and M2B.
Figure 6.30 Dynamic Comparator.
It has been found that the comparator settles faster by using this technique. Taking the
clock high switches on transistor M1; the transistors M2A and M2B then behave as a
differential amplifier, and the gain provided by the differential transistors is further
amplified by the positive feedback latch, which is now active. As can be seen in Fig. 6.30,
simple inverters are used to buffer the decision from the latch and interface to the
correct digital level at the output.
6.5 Conclusions
In this chapter the design considerations of analogue circuits suitable for an analogue 
motion estimation processor realisation have been reviewed. The circuits that constitute 
the analogue motion estimator have been detailed and described.
The chapter presents both simulated and measured results for the distance measure 
circuit used as an analogue PE. The analogue PE forms part of the distance metric that 
is one of the most critical components in the motion estimator in regard to power 
dissipation and block matching accuracy. The results indicate the circuit is more than 
satisfactory to meet the system requirements.
The circuits surrounding the distance metric block are based on well known and
characterised designs. The simplicity of these circuits ensures good performance, low
power dissipation and small implementation area.
Chapter 7
Experimental Results of Block Matching 
Analogue Motion Estimation Processor
7.1 Introduction
The results of a set of fabricated analogue motion estimation processors are presented
in this chapter to evaluate the architecture and circuit design presented in Chapter 6.
The full custom layout was designed using Cadence design tools at UCL and fabricated in
double poly, double metal 0.8 µm n-well CMOS technology using the AMS foundry
process. The design of the PCB and the testing were undertaken at UCL.
In the following sections the analogue motion estimation processor performance is 
evaluated and measured results are presented.
7.2 Analogue Motion Estimation Processor
7.2.1 Fabricated Analogue Motion Estimation Processor
To evaluate the analogue motion estimator, the test structure shown in Fig. 7.1 was
fabricated. The analogue motion estimator is designed for 4 x 4 block sizes. This
integrated circuit essentially consists of the entire motion estimation processor except
the block comparison stage, which was implemented off chip. The sub-systems shown are
implemented using the circuits described in Chapter 6. In particular the analogue
memories SB and TB were based on the transmission gate (using minimum geometry
transistors) sample and hold circuit described in section 6.3.2, with 0.5 pF capacitors
and an nMOS source follower used as a buffer. The source follower of Fig. 6.6 (a) was
used, based on nMOS transistors with the current source implemented with a simple
current mirror. A geometry of 16/1.6 µm was used for all source follower transistors.
Figure 7.1 Motion estimation processor integrated circuit (SB and TB memories, rotary
shifter and DM, with analogue pixel input, digital control of the memory, column and row
addresses, and analogue DM output).
A breakdown of the area usage of the various parts of the motion estimation processor
integrated circuit shown in Fig. 7.1 is given in Table 7.1. The area usage has been
minimised by careful layout. The digital memory address decoder area has not been
included since the addressing is only one dimensional for this test structure. Significant
improvements in area efficiency would result from two dimensional addressing [108].
Sub-system        Area (mm²)
DM                0.10
SB/TB Memory      0.07
Rotary Shifter    0.05
Whole System      0.63
Table 7.1 Showing area of various motion estimation processor sub-systems.
The microphotograph of the analogue motion estimation processor is shown in Fig. 7.2.
The various sub-systems are highlighted. As can be seen, the careful layout and the
structural regularity inherent in the architecture result in a compact implementation.
Figure 7.2 Motion estimation processor microphotograph. The highlighted parts are the
memory address decoders, SB memory, rotary shifters, distance metric, TB memory, and the
column and row address decoders. The parts not highlighted are not part of the processor.
7.2.2 Experimental Evaluation Method
To evaluate the analogue motion estimation processor performance, the procedure
shown in Fig. 7.3 was used. The Personal Computer (PC) offers a software development
environment, software for processing images and an interface to external hardware.
Using the Industry Standard Architecture (ISA) bus of a PC, the digital control inputs
of the integrated circuit can therefore be driven directly from software. The output
current from the DM sub-system is fed to an off chip resistor. The voltage developed
across the resistor is proportional to the MSE between blocks. The pixel values are
converted from digital to analogue using an external 6-bit DAC and the analogue values
obtained from each block match are read back into the computer using an external
12-bit ADC. The test procedure offers a high degree of flexibility since the chip is
directly under the control of the software.
Figure 7.3 Motion estimation processor test procedure (C++ program on a PC, ISA bus,
6-bit DAC and 12-bit ADC on the PCB, and the integrated circuit of Fig. 7.1).
7.2.3 Experimental Results
The same test sequences used previously were used to evaluate the fabricated analogue
motion estimation processor shown in Fig. 7.1. However, the resolution of the
sequences is reduced to QCIF format. The scaled system used 4 x 4 blocks and an 8 x 8
search window. The use of 4 x 4 blocks and the lower resolution QCIF format relaxes
the comparator requirement to 12 bits. Two experiments were performed to evaluate the
motion estimation processor. The first concerned the PSNR efficiency and the second
the analysis of the patterns of motion. The PSNR error of the fabricated analogue motion
estimation processor relative to 8-bit precision MSE was evaluated, with 8-bit precision
MAE and 6-bit precision MSE as references. The patterns of motion were analysed to
give an insight into the processor performance
compared with 8-bit precision MSE, 6-bit precision MSE and 8-bit precision MAE.
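The reference metrics named above can be illustrated with a minimal software sketch of an exhaustive search for one 4 x 4 block in an 8 x 8 search window. It is not the code used for the experiments: the function name `full_search`, its arguments, the symmetric placement of the window around the block and the truncation of pixel precision by discarding least significant bits are all assumptions made for illustration.

```python
import numpy as np

def full_search(ref, cur, top, left, block=4, window=8, metric="mse", bits=8):
    """Exhaustive block matching for one block; returns ((dy, dx), cost).

    Hypothetical helper: the window is assumed to be centred on the block,
    and reduced pixel precision is modelled by dropping the low-order bits.
    """
    shift = 8 - bits                          # bits=6 drops the two LSBs
    ref_q = ref.astype(np.int64) >> shift
    cur_q = cur.astype(np.int64) >> shift
    target = cur_q[top:top + block, left:left + block]
    margin = (window - block) // 2            # 8 x 8 window, 4 x 4 block: +/-2 pixels
    best = None
    for dy in range(-margin, margin + 1):
        for dx in range(-margin, margin + 1):
            y, x = top + dy, left + dx
            # Skip candidates that fall outside the reference frame.
            if y < 0 or x < 0 or y + block > ref.shape[0] or x + block > ref.shape[1]:
                continue
            diff = ref_q[y:y + block, x:x + block] - target
            if metric == "mse":
                cost = float(np.mean(diff ** 2))
            else:                             # "mae"
                cost = float(np.mean(np.abs(diff)))
            if best is None or cost < best[1]:
                best = ((dy, dx), cost)
    return best

# Usage: a synthetic frame shifted by a known displacement.
rng = np.random.default_rng(0)
reference = rng.integers(0, 256, size=(16, 16)).astype(np.uint8)
current = np.roll(reference, shift=(1, -2), axis=(0, 1))
print(full_search(reference, current, 6, 6, metric="mse", bits=8))
print(full_search(reference, current, 6, 6, metric="mae", bits=6))
```

Running the same search with `bits=8` and `bits=6` on real frames is one way to reproduce the kind of comparison between 8-bit and 6-bit MSE made in this section.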
The mean PSNR error over all the frames for various sequences has been evaluated for
the analogue motion estimation processor as shown in Table 7.2. The error is relative to
using the 8-bit MSE distance metric in all cases. For reference, the mean error PSNR
over all the frames using the MAE and 6-bit MSE distance metrics has also been
included. As can be seen, the mean error PSNR for 6-bit MSE is only slight (at most
0.0017 dB), while the error is between 0.1086 and 0.1356 dB for MAE in all sequences.
The analogue implementation error is between 0.1228 and 0.3116 dB. The error with the
analogue implementation varies over the frames for the various sequences. The remainder
of this section will examine the sequences individually.
Sequence    Mean 8-bit MAE    Mean 6-bit MSE    Mean Analogue
Carphone    0.1356            0.0009            0.1490
Foreman     0.1206            0.0012            0.1228
Susie       0.1336            0.0017            0.3116
Trevor      0.1086            0.0009            0.2266
Table 7.2. The mean PSNR error (dB) for various sequences with different distance
metrics and measured analogue motion estimation processor.
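As an illustration of how a mean PSNR error of this kind can be formed, the sketch below computes the PSNR of a predicted frame and then averages the per-frame difference between a reference metric and a candidate metric. The function names and the sign convention (reference PSNR minus candidate PSNR) are assumptions; the thesis states only that the error is taken relative to the 8-bit MSE distance metric.

```python
import numpy as np

def psnr(original, predicted, peak=255.0):
    """PSNR in dB between an original frame and its motion compensated prediction."""
    err = np.mean((original.astype(np.float64) - predicted.astype(np.float64)) ** 2)
    if err == 0:
        return float("inf")
    return float(10.0 * np.log10(peak ** 2 / err))

def mean_psnr_error(psnr_reference, psnr_candidate):
    """Mean over frames of (reference PSNR - candidate PSNR), in dB (assumed convention)."""
    ref = np.asarray(psnr_reference, dtype=np.float64)
    cand = np.asarray(psnr_candidate, dtype=np.float64)
    return float(np.mean(ref - cand))

# Usage with made-up per-frame PSNR values (not measured data).
frame = np.zeros((4, 4), dtype=np.uint8)
print(psnr(frame, frame + 1))
print(mean_psnr_error([30.0, 31.0], [29.9, 30.8]))
```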
The Carphone sequence has the highest motion of all sequences tested. There is motion
in part of the background. The PSNR error from 8-bit MSE is shown in Fig. 7.4. As can
be seen, the 6-bit MSE performs much better than the 8-bit MAE. The measured
performance varies from chip to chip. The upper and lower measured results indicate
the spread. As can be seen, the analogue motion estimator performance is comparable to
the 8-bit MAE.
The pattern of motion for this sequence is quite similar for the 8-bit and 6-bit MSE, as
can be seen in Fig. 7.5. The results of Table 7.2 suggest that the 8-bit and 6-bit MSE
tend to predict the motion more accurately than MAE or the analogue motion
estimator. There are only subtle differences for the 8-bit MAE, occurring in the limited
motion regions such as the car interior. This would seem to account for the less accurate
prediction given by the 8-bit MAE. The patterns of motion for the measured results also
show erroneously detected motion in these regions, leading to increased PSNR error.
Figure 7.4 The PSNR error from 8-bit MSE for the Carphone sequence with various
metrics and measured analogue motion estimation processor (plotted against frame
number for 8-bit MAE, 6-bit MSE and the measured results, upper and lower indicating
the spread).
Figure 7.5 Detecting the patterns of motion in Carphone sequence: (a) Reference frame
(b) Current frame (c) 8-bit MSE (d) 8-bit MAE (e) 6-bit MSE and (f) Measured
analogue motion estimation processor.
The Foreman sequence has large amounts of motion. There is some motion in part of
the background and, later in the sequence, the scene changes from a talking person to a
garden scene. The PSNR error from 8-bit MSE is shown in Fig. 7.6. The 6-bit MSE
performs much better than the 8-bit MAE. The spread is less pronounced in this
sequence. As can be seen, the analogue motion estimator performance is mostly better
than the 8-bit MAE.
The pattern of motion for this sequence shows similarity for the 8-bit and 6-bit MSE, with
some difference using the 8-bit MAE, as is shown in Fig. 7.7. The areas of limited
motion with poor definition, such as the background, show some differences for the
analogue motion estimator. The pattern of motion is quite chaotic in these regions. This
is due to the analogue motion estimator not being able to accurately predict the motion
where there is poor definition between the background and the objects that have motion.
Figure 7.6 The PSNR error from 8-bit MSE for the Foreman sequence with various
metrics and measured analogue motion estimation processor (plotted against frame
number for 8-bit MAE, 6-bit MSE and the measured results, upper and lower indicating
the spread).
Figure 7.7 Detecting the patterns of motion in Foreman sequence: (a) Reference frame
(b) Current frame (c) 8-bit MSE (d) 8-bit MAE (e) 6-bit MSE and (f) Measured
analogue motion estimation processor.
The Susie sequence has limited amounts of motion confined to the person. The person is
close up to the camera. There is no motion in the background and it is almost uniform.
The PSNR error from 8-bit MSE is shown in Fig. 7.8. The 6-bit MSE performs better
than the 8-bit MAE. The spread is most pronounced in this sequence. The analogue
motion estimator performance is slightly worse than the 8-bit MAE for most of the
sequence, the exceptions being where there is a high degree of motion.
There are some differences in the pattern of motion for this sequence between the 8-bit
and 6-bit MSE as well as the 8-bit MAE, as can be seen in Fig. 7.9. However, the 8-bit
MAE is distinctively different, particularly in the areas of limited motion. In these
regions the 8-bit MAE seems to erroneously detect motion where the MSE does not.
The pattern of motion for the measured results is even more chaotic in these regions.
This is caused by the high amount of input-offset in the analogue PEs. As was
shown in section 4.3.2, input-offset tends to affect performance quite noticeably in
sequences with limited motion. This also accounts for the large spread in PSNR
performance.
Figure 7.8 The PSNR error from 8-bit MSE for the Susie sequence with various metrics
and measured analogue motion estimation processor (plotted against frame number for
8-bit MAE, 6-bit MSE and the measured results, upper and lower indicating the spread).
Figure 7.9 Detecting the patterns of motion in Susie sequence: (a) Reference frame (b)
Current frame (c) 8-bit MSE (d) 8-bit MAE (e) 6-bit MSE and (f) Measured analogue
motion estimation processor.
The Trevor sequence has limited amounts of motion. The background is static and
non-uniform. The motion is mainly confined to the person sitting at some distance from the
camera. The PSNR error from 8-bit MSE is shown in Fig. 7.10. The 6-bit MSE performs
significantly better than the 8-bit MAE. The spread is quite pronounced in this
sequence. As can be seen, the analogue motion estimator performance is not as good as
the 8-bit MAE for most of the sequence. The pattern of motion for this sequence shows
similarity for the 8-bit and 6-bit MSE, with some difference using the 8-bit MAE, as is
shown in Fig. 7.11. This sequence contains limited motion and therefore, like the Susie
sequence, tends to perform worse than sequences with large motion. The pattern of
motion is distinctively different for the analogue motion estimation processor compared
to the MSE and MAE metrics.
The 6-bit MSE using 4 x 4 blocks is seen to perform similarly to 8-bit MSE using the
lower resolution QCIF format for all the sequences tested. This can be contrasted with
the degraded performance found using the higher resolution CIF format in section 4.4.
Figure 7.10 The PSNR error from 8-bit MSE for the Trevor sequence with various
metrics and measured analogue motion estimation processor (plotted against frame
number for 8-bit MAE, 6-bit MSE and the measured results, upper and lower indicating
the spread).
Figure 7.11 Detecting the patterns of motion in Trevor sequence: (a) Reference frame
(b) Current frame (c) 8-bit MSE (d) 8-bit MAE (e) 6-bit MSE and (f) Measured
analogue motion estimation processor.
7.3 Improved Analogue Motion Estimation Processor
The input-offset variation between the analogue PEs has the most significant detrimental
effect on PSNR performance. As was shown in the previous section, the input-offset is
largely responsible for the degraded PSNR in sequences with limited motion, such as
Susie, and for the variations in PSNR found in all sequences. The improved
motion estimation architecture described in section 5.3 reduces this effect. The
following section evaluates this analogue technique.
7.3.1 Fabricated Improved Analogue Motion Estimation 
Processor
The test structure shown in Fig. 7.12 was fabricated to evaluate the improved analogue
motion estimator. The improved analogue motion estimator is designed for 4 x 4 block
sizes. This integrated circuit consists of the entire motion estimation processor,
including the block comparison decision circuit. The sub-systems shown are implemented
using the circuits described in Chapter 6.
Figure 7.12 Improved motion estimation processor integrated circuit (SB and TB
memories, rotary shifters, crossover switch, DM, sample and holds SH1 to SH4, and the
decision circuit with comparator inputs +V1, −V1, +V2 and −V2, with analogue pixel
input, digital control and digital output).
In particular the analogue memories SB and TB were based on the dummy switch
sample and hold circuit described in section 6.3.1, with 0.5 pF capacitors and a pMOS
source follower (with the transistor in a separate well to eliminate body effect) used as
a buffer. The source follower of Fig. 6.6 (a) was used with the current source implemented
with a simple current mirror. A geometry of 16/1.6 µm was used for all source follower
transistors.
The breakdown of area usage for the improved motion estimation processor integrated
circuit of Fig. 7.12 is shown in Table 7.3. The various parts are the same size as before
with the addition of the decision circuit. A digitally selectable variable on chip resistor
allowed the dynamic range of the DM to be adjusted according to the scene, thereby
increasing the effective accuracy of the comparator. An effective overall comparison
accuracy of 12 bits was maintained for the measured results, as in section 7.2. The
fabrication required a minimum silicon die area for each chip. Therefore the
unused area was filled with additional on chip power supply decoupling capacitors.
Sub-system          Area (mm²)
DM                  0.10
SB/TB Memory        0.07
Rotary Shifter      0.05
Decision Circuit    0.24
Whole System        1.45
Table 7.3 Showing area of various improved motion estimation processor sub-systems.
The microphotograph of the improved motion estimation processor is shown in Fig. 7.13.
The various sub-systems are highlighted. As can be seen, the careful layout and the
structural regularity inherent in the architecture result in a compact implementation.
Figure 7.13 Improved motion estimation processor microphotograph. The top area
highlighted is essentially the circuit shown in Fig. 7.2, middle left the resistor load,
middle right the decision circuit and lower, on chip decoupling capacitors.
7.3.2 Experimental Evaluation Method
To evaluate the improved analogue motion estimation processor performance, the
procedure shown in Fig. 7.14 was used. A similar set up to that described in section
7.2.2 was used, except that an additional First In First Out (FIFO) buffer and a 10 MHz
clock were used to test the circuit at the full operating rate. The FIFO takes the low
clock rate data from the ISA bus and clocks it out at the much higher rate. The higher
rate would be required for a full scale motion estimation processor using 16 x 16 blocks
and a 24 x 24 search window.
Figure 7.14 Improved motion estimation processor test procedure (C++ program on a
PC, ISA bus, FIFO with fast clock and 6-bit DAC on the PCB, and the integrated circuit
of Fig. 7.12).
7.3.3 Experimental Results
The same QCIF test sequences used previously (section 7.2) were used to evaluate the
fabricated analogue motion estimation processor shown in Fig. 7.12. The scaled system
used 4 x 4 blocks and an 8 x 8 search window. The same two experiments as in section
7.2.3 were performed to evaluate the motion estimation processor. The small variation
in analogue motion estimation processor performance seems to be dependent on the
decision circuit accuracy, in particular the comparator accuracy.
The mean PSNR error over all the frames for various sequences has been evaluated for 
the analogue motion estimation processor as shown in Table 7.4. For reference, the 
mean error PSNR over all the frames using the MAE and 6-bit MSE distance metrics 
has also been included. The error is relative to using the 8-bit MSE distance metric in all 
cases.
Sequence    Mean 8-bit MAE    Mean 6-bit MSE    Mean Analogue
Carphone    0.1356            0.0009            0.1126
Foreman     0.1206            0.0012            0.1141
Susie       0.1336            0.0017            0.2440
Trevor      0.1086            0.0009            0.1935
Table 7.4. The mean PSNR error (dB) for various sequences with different distance
metrics and measured improved analogue motion estimation processor.
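The change between the two fabricated processors can be read directly from the "Mean Analogue" columns of Table 7.2 and Table 7.4. The short sketch below simply tabulates the difference for each sequence; the dictionary names are illustrative and the values are copied from the two tables.

```python
# Mean analogue PSNR error (dB), copied from Table 7.2 and Table 7.4.
first_processor = {"Carphone": 0.1490, "Foreman": 0.1228, "Susie": 0.3116, "Trevor": 0.2266}
improved_processor = {"Carphone": 0.1126, "Foreman": 0.1141, "Susie": 0.2440, "Trevor": 0.1935}

# Reduction in mean PSNR error for each sequence, rounded to the table precision.
reduction_db = {
    name: round(first_processor[name] - improved_processor[name], 4)
    for name in first_processor
}

for name, value in reduction_db.items():
    print(f"{name}: mean PSNR error reduced by {value:.4f} dB")
```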
As was previously shown in section 7.2.3, the mean error PSNR for 6-bit MSE is only
slight, while the error is between 0.1086 and 0.1356 dB for MAE in all sequences. The
improved analogue implementation error is between 0.1126 and 0.2440 dB.
This represents a moderate improvement over the analogue implementation presented in
section 7.2.3. The error with the improved analogue implementation varies over the
frames for the various sequences. However, this variation is substantially less than was
seen in the analogue implementation presented in section 7.2.3. The remainder of this
section will examine the sequences individually.
The Carphone sequence has the highest motion of all sequences tested. The measured
performance is shown in Fig. 7.15. As can be seen, the analogue motion estimator
performance is comparable to the 8-bit MAE. The pattern of motion shown in Fig. 7.16
for this sequence is similar to the results found in section 7.2.3. This is because the
analogue circuits are the same and the decision circuit (section 6.4) is of a similar
accuracy to the 12-bit ADC used in section 7.2.2.
Figure 7.15 The PSNR error from 8-bit MSE for the Carphone sequence with various
metrics and measured improved motion estimation processor (plotted against frame
number for 8-bit MAE, 6-bit MSE and the measured results, upper and lower indicating
the spread).
Figure 7.16 Detecting the patterns of motion in Carphone sequence: (a) Reference frame
(b) Current frame (c) 8-bit MSE (d) 8-bit MAE (e) 6-bit MSE and (f) Measured
improved motion estimation processor.
The Foreman sequence has large amounts of motion. The PSNR error from 8-bit MSE 
is shown in Fig 7.17. As can be seen the analogue motion estimator performance is 
mostly better than the 8-bit MAE. The patterns of motion shown in Fig. 7.18 for this 
sequence show similarity once again to the previous results.
Figure 7.17 The PSNR error from 8-bit MSE for the Foreman sequence with various
metrics and measured improved motion estimation processor (plotted against frame
number for 8-bit MAE, 6-bit MSE and the measured results, upper and lower indicating
the spread).
Figure 7.18 Detecting the patterns of motion in Foreman sequence: (a) Reference frame
(b) Current frame (c) 8-bit MSE (d) 8-bit MAE (e) 6-bit MSE and (f) Measured
improved motion estimation processor.
The Susie sequence was found previously to be the most sensitive to input-offsets,
resulting in reduced performance. The improved motion estimation processor
implementation error PSNR can be seen in Fig. 7.19. The spread is reduced; however,
the error in PSNR has not been improved. Therefore a higher resolution comparator is
needed. The patterns of motion are quite similar to before, as shown in Fig. 7.20.
Nevertheless the error PSNR is comparable to 8-bit MAE.
Figure 7.19 The PSNR error from 8-bit MSE for the Susie sequence with various
metrics and measured improved motion estimation processor (plotted against frame
number for 8-bit MAE, 6-bit MSE and the measured results, upper and lower indicating
the spread).
Figure 7.20 Detecting the patterns of motion in Susie sequence: (a) Reference frame (b)
Current frame (c) 8-bit MSE (d) 8-bit MAE (e) 6-bit MSE and (f) Measured improved
motion estimation processor.
Similar observations regarding the Trevor sequence can be made as before. The PSNR
error is shown in Fig. 7.21. The patterns of motion reveal similar characteristics to
those observed with the previous implementation.
In summary, the improved implementation has significantly reduced the spread in
PSNR compared to that found previously. The results however were not as good as 6-bit
MSE; this is largely due to the limitations of the analogue circuits used, in particular
the comparator accuracy. Nevertheless the PSNR was found to be similar to the 8-bit pixel
MAE as used in digital implementations. The motion estimation performance suggests
the suitability of the approach for video coding.
Figure 7.21 The PSNR error from 8-bit MSE for the Trevor sequence with various
metrics and measured improved motion estimation processor (plotted against frame
number for 8-bit MAE, 6-bit MSE and the measured results, upper and lower indicating
the spread).
Figure 7.22 Detecting the patterns of motion in Trevor sequence: (a) Reference frame
(b) Current frame (c) 8-bit MSE (d) 8-bit MAE (e) 6-bit MSE and (f) Measured
improved motion estimation processor.
7.4 Motion Estimation Processor Performance
7.4.1 Power Dissipation and Motion Vector Computation Rate
The measured results of the previous sections indicate that a full scale analogue motion 
estimation processor is possible for 16 x 16 blocks with a search window of 24 x 24 and 
QCIF frames at 15 fps. The performance is suitable for the H.261 video coding 
standard. Such a system will be considered in relation to recent digital and analogue 
implementations.
The percentage of time for each event during the computation of the best motion vector
is shown in Fig. 7.23. The calculation is worked out as follows using τ = 100 ns, which
is the clock period for 10 MHz. The exhaustive search is applied, therefore 289 MSE
computations are required within the search window. The initial TB and SB loading of
512 pixel memory locations takes:
512 · τ = 51.2 µs
Updating 16 SB pixel memory locations at each of the 288 updates takes:
16 · 288 · τ = 460.8 µs
Computing the matching function, requiring 4 sample and holds per computation, takes:
4 · 289 · τ = 115.6 µs
The total period for calculating the best motion vector is 627.6 µs. Since there are 16
pixel memory updates followed by 4 sample and holds between comparisons, the time
between comparisons is:
(16 + 4) · τ = 2 µs
It follows that a 500 kHz comparator is needed. This means that a relatively slow, high
precision comparator, such as a high precision auto zero comparator, may be used. This
would result in more accurate comparisons between block matches, leading to improved
PSNR.
Figure 7.23 Showing the usage of time for calculating the best motion vector (TB & SB
load 8.2%, TB & SB update 73.4%, computing 18.4%).
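The time budget above can be reproduced with a few lines of arithmetic. The following is a minimal sketch that uses only the figures quoted in the text (a 100 ns clock period, 512 initial loads, 16 SB updates per step, 4 sample and holds per computation and 289 candidate positions); the variable names are illustrative.

```python
TAU_NS = 100            # clock period for 10 MHz, in nanoseconds
CANDIDATES = 289        # MSE computations in the exhaustive search

load_ns = 512 * TAU_NS                       # initial TB and SB loading
update_ns = 16 * (CANDIDATES - 1) * TAU_NS   # 16 SB updates before each later candidate
compute_ns = 4 * CANDIDATES * TAU_NS         # 4 sample and holds per candidate
total_ns = load_ns + update_ns + compute_ns

# Share of the total time taken by each event, as plotted in Fig. 7.23.
shares = {
    "TB & SB load": 100.0 * load_ns / total_ns,
    "TB & SB update": 100.0 * update_ns / total_ns,
    "Computing": 100.0 * compute_ns / total_ns,
}

# Time between successive comparisons and the resulting comparator rate.
comparison_period_ns = (16 + 4) * TAU_NS
comparator_rate_hz = 1e9 / comparison_period_ns

print(f"total = {total_ns / 1000:.1f} us")
for name, share in shares.items():
    print(f"{name}: {share:.1f} %")
print(f"comparator rate = {comparator_rate_hz / 1e3:.0f} kHz")
```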
The estimated power dissipation for such a system would be about 11 mW. A 
breakdown of the power usage for the analogue motion estimation processor is shown in 
Table 7.5. As can be seen, the analogue part of the processor accounts for the majority 
of the power usage. The DM circuits are operated only during the MSE computation 
much as was done by Tomasini et al [74], resulting in significant power dissipation 
reduction. This is implemented by switching the bias.
Digital section                    Analogue section
Subsystem         Power            Subsystem           Power
Memory Address    100 µW           DM                  6.2 mW
MUX Address       16 µW            TB & SB Memory      3.1 mW
                                   Decision Circuit    1.3 mW
Table 7.5 Showing the power usage breakdown for analogue motion estimation
processor.
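As a check on the estimate of about 11 mW, the entries of Table 7.5 can be summed. The sketch below does this with all values expressed in microwatts; the data structure and names are assumptions made for illustration, and the figures are those of the table.

```python
# Power figures from Table 7.5, in microwatts.
POWER_UW = {
    "digital": {"Memory Address": 100.0, "MUX Address": 16.0},
    "analogue": {"DM": 6200.0, "TB & SB Memory": 3100.0, "Decision Circuit": 1300.0},
}

# Total per section and overall, converted to milliwatts.
section_totals_mw = {name: sum(parts.values()) / 1000.0 for name, parts in POWER_UW.items()}
total_mw = sum(section_totals_mw.values())
analogue_share = section_totals_mw["analogue"] / total_mw

print(f"digital  = {section_totals_mw['digital']:.3f} mW")
print(f"analogue = {section_totals_mw['analogue']:.3f} mW")
print(f"total    = {total_mw:.3f} mW, analogue share = {100 * analogue_share:.1f} %")
```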
The estimated area for the analogue motion estimation processor would be about 4 mm².
A breakdown of the area usage is shown in Table 7.6. As can be seen, the DM circuits
and the SB and TB memories account for most of the area usage. The decision circuit is
the same size as before. The digital address decoding would be based on two dimensional
addressing implemented with standard library C²MOS (clocked CMOS) logic, resulting
in significantly reduced power dissipation and area usage.
The scaled motion estimation processor described previously was designed using
transistors much larger than are necessary with the improved architecture. A future
implementation using digital geometry transistors would operate at a similar clock rate,
but with bias currents reduced by an order of magnitude, since parasitic capacitances
are reduced. Power dissipation would consequently be significantly reduced.
Sub-system          Area (mm²)
DM                  1.58
SB/TB Memory        1.05
Rotary Shifter      0.86
Decision Circuit    0.24
Whole System        ≈4.00
Table 7.6 Showing area of various motion estimation processor sub-systems.
7.4.2 Comparison of Motion Estimation Processors
The digital approach presented by Chen [117] uses a similar 0.8 µm CMOS process to
the implementation described in this work. To enable real time video encoding with
moderate power dissipation and area, the TSS algorithm described in section 3.4.2 was
used. The motion vector rate is just over a seventh of that of the analogue motion
estimation processor described in section 7.4.1. The power dissipation is over twenty
times more and the area is over ten times as much, as shown in Table 7.7. The analogue
technique offers a low power alternative to digital in similar technology.
The combination of two of the 0.8 µm analogue processors described in
section 7.4.1 uses a fifth of the power of a recent 0.25 µm process digital motion
estimator [118] for the same motion vector rate at just over twice the area. The area
reduction over the analogue implementation of Tomasini et al [74] is largely due to the
improved serial-parallel architecture used. The previous analogue implementation
unnecessarily calculated the whole search window in parallel, resulting in a processor
many times larger than necessary, as is shown in Table 7.7. However, the 0.7 µm
technology used and the motion vector rate are similar to this work.
Parameter                         Circuit in:
                                  [117]         [118]         [74]        This work
Year                              1998          2004          1996        2004
Type                              Digital       Digital       Analogue    Analogue
CMOS technology                   0.8 µm        0.25 µm       0.7 µm      0.8 µm
Power supply (V)                  5             2.8           5           3
Power Distance Metric(s) (mW)     Not Stated    Not Stated    8           6.2
Power (mW)                        350           153           100         11
Block size (pixels)               8 x 8         16 x 16       8 x 8       16 x 16
Motion vectors/sec                60169         781280        481140      429165
Distance Metric(s) (mm²)          Not Stated    Not Stated    24          1.58
Core size (mm²)                   40.71         3.13          150         ≈4
Table 7.7 Showing comparison of various motion estimation processor
implementations.
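The ratios discussed in this section can be recomputed from Table 7.7. The sketch below is illustrative only: the dictionary layout and names are assumptions, the core size of this work is taken as its approximate value of 4 mm², and the figures are otherwise copied from the table.

```python
# Power (mW), motion vector rate and core size (mm^2) copied from Table 7.7.
designs = {
    "[117]": {"power_mw": 350.0, "rate": 60169, "core_mm2": 40.71},
    "[118]": {"power_mw": 153.0, "rate": 781280, "core_mm2": 3.13},
    "[74]": {"power_mw": 100.0, "rate": 481140, "core_mm2": 150.0},
    "this work": {"power_mw": 11.0, "rate": 429165, "core_mm2": 4.0},  # core size approximate
}

def ratio(numerator, denominator, key):
    """Ratio of one design's figure to another's for the given table entry."""
    return designs[numerator][key] / designs[denominator][key]

# Comparison with the 0.8 um digital design [117].
power_vs_117 = ratio("[117]", "this work", "power_mw")
area_vs_117 = ratio("[117]", "this work", "core_mm2")
rate_vs_117 = ratio("this work", "[117]", "rate")

# Two analogue processors compared with the 0.25 um digital design [118].
pair_power_vs_118 = 2 * designs["this work"]["power_mw"] / designs["[118]"]["power_mw"]
pair_area_vs_118 = 2 * designs["this work"]["core_mm2"] / designs["[118]"]["core_mm2"]
pair_rate_vs_118 = 2 * designs["this work"]["rate"] / designs["[118]"]["rate"]

print(f"[117] power / this work = {power_vs_117:.1f}")
print(f"[117] area / this work  = {area_vs_117:.1f}")
print(f"this work rate / [117]  = {rate_vs_117:.2f}")
print(f"two processors vs [118]: power {pair_power_vs_118:.2f}, "
      f"area {pair_area_vs_118:.2f}, rate {pair_rate_vs_118:.2f}")
```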
7.5 Conclusions
In this chapter the results of fabricated test structures are presented and the performance 
evaluated. It has been shown that the distance metric implemented with analogue 
techniques results in large variations in PSNR performance when used for motion 
estimation. A fabricated motion estimation processor based on the improved 
architecture with the cancellation technique described in section 5.3 provides PSNR 
performance similar to that of digital implementations using the MAE matching function.
The performance of the improved architecture indicates that a relatively slow high 
precision comparator can be used with autozero techniques. The use of such a 
comparator would enable very precise comparisons of the successive blocks resulting in 
improved PSNR performance.
Finally, using the results from the fabricated test structures, the estimated performance 
of a full-scale analogue motion estimation processor suitable for the H.261 standard has 
been presented. The predicted full-scale analogue motion estimation processor was 
compared with previous digital and analogue implementations. The results indicate that 
an analogue implementation in a 0.8 µm CMOS process can attain performance similar to 
that of a digital implementation in a 0.25 µm CMOS process.
Chapter 8
Conclusion
8.1 Conclusions
This thesis has described the design of an analogue motion estimation processor 
intended for the motion compensation used in video coding. The analogue motion 
estimation processor was fabricated using a 0.8 µm double-poly, double-metal CMOS 
technology. The system features digital input and output interfaces and an efficient 
analogue computation core. The scaled motion estimation processor indicates that a 
substantial reduction in power dissipation and implementation area is possible by 
replacing digital arithmetic with compact analogue circuits.
The characterisation necessary to understand the limitations of analogue circuit 
techniques has been carried out. The results indicate that, although analogue circuits 
tend to be limited by factors such as dynamic range, accuracy and parameter 
variation, video performance comparable to digital implementations is possible 
provided certain criteria are met. Most notably, the use of the MSE matching function 
allows up to two bits of pixel precision to be truncated while providing performance 
similar to that of the MAE used in most digital implementations. This has two favourable 
consequences for analogue circuit design. The first is that the MOS transistor is a 
square-law device, which can be exploited to make a compact and 
power-efficient distance metric. The second is that the two-bit reduction in pixel 
precision represents a fourfold reduction in circuit accuracy requirements, which is 
highly desirable for analogue circuit design.
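
The relationship between truncated pixels and the two matching functions can be illustrated with a small software model. The sketch below is a minimal illustration and not the circuit described in this thesis. The function names and the toy 2 x 2 blocks are hypothetical, and truncation is modelled as a right shift of 8-bit pixel values.

```python
def truncate(block, dropped_bits=2):
    """Drop the least significant bits of every 8-bit pixel in a block."""
    return [[pixel >> dropped_bits for pixel in row] for row in block]


def mae(block_a, block_b):
    """Mean absolute error between two equally sized blocks."""
    diffs = [abs(a - b) for ra, rb in zip(block_a, block_b) for a, b in zip(ra, rb)]
    return sum(diffs) / len(diffs)


def mse(block_a, block_b):
    """Mean squared error between two equally sized blocks."""
    diffs = [(a - b) ** 2 for ra, rb in zip(block_a, block_b) for a, b in zip(ra, rb)]
    return sum(diffs) / len(diffs)


def best_match(reference, candidates, metric):
    """Index of the candidate block with the smallest distance."""
    scores = [metric(reference, candidate) for candidate in candidates]
    return scores.index(min(scores))


# Hypothetical toy data: candidate 1 is the closest match to the reference.
REFERENCE = [[100, 120], [140, 160]]
CANDIDATES = [
    [[10, 20], [30, 40]],
    [[101, 118], [143, 158]],
    [[120, 100], [160, 140]],
]

# Full-precision MAE versus MSE on pixels truncated from 8 to 6 bits.
full_precision_choice = best_match(REFERENCE, CANDIDATES, mae)
truncated_choice = best_match(
    truncate(REFERENCE), [truncate(c) for c in CANDIDATES], mse
)

# Dropping two bits reduces the number of levels to resolve by a factor of four.
level_reduction = (2 ** 8) // (2 ** 6)
```

In this toy case both searches select the same candidate. This demonstrates the mechanism only and says nothing about PSNR on real sequences, which is what the measured results address.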
The MAE and MSE matching functions indicate that a large range of values is 
possible. As already mentioned, analogue circuits have limited dynamic range. This 
potentially presents a problem when using analogue techniques for distance metric 
implementation. However the work presented in this thesis suggests that for motion 
estimation only a limited range is actually used. This eases the precision required when 
the values obtained from the matching functions are compared. The variation in circuit
parameters is another very important consideration in analogue motion estimation 
processor design. The most significant effect is the input offset present in the 
analogue PEs, which are used to measure the distance between pixels. It 
has been found that the effect is most significant in video sequences with low 
definition between the moving objects and the stationary background, where the 
correlation information is lost. The use of a global cancellation technique 
significantly reduces the effect of analogue PE input offset. The technique was 
introduced in this work and is applicable to other matching applications. The method 
has been shown to be effective even with large amounts of offset. This is particularly 
important since it addresses the problem of increased circuit parameter variation and 
reduced supply voltage in more recent CMOS technologies.
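
A toy model can show why input offset matters most when the blocks being compared differ only slightly. The sketch below rests on stated assumptions. It models each PE as a squared difference with an additive input offset expressed in grey levels, uses hypothetical four-pixel blocks, and reuses the same PEs for every candidate. It does not reproduce the cancellation technique itself.

```python
def pe_output(a, b, offset=0.0):
    """Squared difference with an additive input offset (assumed PE model)."""
    return (a - b + offset) ** 2


def block_distance(reference, candidate, offsets):
    """Sum of PE outputs over a block, one PE per pixel position."""
    return sum(pe_output(a, b, os) for a, b, os in zip(reference, candidate, offsets))


def best_candidate(reference, candidates, offsets):
    """Index of the candidate with the smallest measured distance."""
    distances = [block_distance(reference, c, offsets) for c in candidates]
    return distances.index(min(distances))


IDEAL = [0, 0, 0, 0]      # offset-free PEs
OFFSETS = [2, -2, 2, -2]  # hypothetical per-PE offsets in grey levels

# Low-contrast case: the candidates differ from each other by one grey level.
LOW_REF = [50, 52, 51, 53]
LOW_CANDIDATES = [[50, 52, 51, 53], [51, 51, 52, 52]]

# High-contrast case: the wrong candidate differs by 100 grey levels.
HIGH_REF = [50, 150, 50, 150]
HIGH_CANDIDATES = [[50, 150, 50, 150], [150, 50, 150, 50]]

low_ideal = best_candidate(LOW_REF, LOW_CANDIDATES, IDEAL)
low_offset = best_candidate(LOW_REF, LOW_CANDIDATES, OFFSETS)
high_offset = best_candidate(HIGH_REF, HIGH_CANDIDATES, OFFSETS)
```

With these particular numbers the offsets change the selected candidate only in the low-contrast case. The values were chosen to make the effect visible and are not measured data.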
The architecture described has a number of innovative features. The block matching 
motion estimation algorithm used in video coding has a large amount of pixel 
memory redundancy. The architecture exploits this property to eliminate the 
redundancy, thereby increasing the motion vector computation rate. This, combined with 
the analogue PE input-offset cancellation technique, ensures that the architecture is 
scalable to modern deep sub-micron processes. The simulated and measured results 
indicate that the analogue motion estimation processor is a viable alternative to digital 
implementations, particularly where power dissipation is at a premium, such as in 
portable applications.
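
The scale of the pixel memory redundancy in full-search block matching can be estimated with a short calculation. The sketch below is a rough illustration under stated assumptions. It takes the 16 x 16 block size from Table 7.7, assumes a search range of ±7 pixels purely for the example, and counts reads of search-window pixels only. The function names are hypothetical.

```python
def naive_pixel_reads(block_size, search_range):
    """Search-window reads if every candidate block is fetched from scratch."""
    candidates = (2 * search_range + 1) ** 2
    return candidates * block_size ** 2


def unique_search_pixels(block_size, search_range):
    """Count distinct search-window pixels by enumerating every candidate."""
    seen = set()
    for dy in range(-search_range, search_range + 1):
        for dx in range(-search_range, search_range + 1):
            for y in range(block_size):
                for x in range(block_size):
                    seen.add((y + dy, x + dx))
    return len(seen)


BLOCK_SIZE = 16    # from Table 7.7
SEARCH_RANGE = 7   # assumed for illustration only

naive = naive_pixel_reads(BLOCK_SIZE, SEARCH_RANGE)
unique = unique_search_pixels(BLOCK_SIZE, SEARCH_RANGE)
reuse_factor = naive / unique
```

Under these assumptions each search-window pixel is needed 64 times on average. An architecture that holds pixels locally and reuses them across candidates can therefore avoid most memory accesses.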
8.2 Future Work
The almost periodic rate at which the minimum feature size in CMOS 
technology shrinks ensures that the motion vector processing rate of digital motion 
estimators increases while their implementation area falls. It is well known that 
analogue circuits are generally not designed using deep sub-micron transistor 
dimensions. This is partly due to significant departures from the ideal square-law 
characteristic and partly due to substantial device variation.
The difficulty caused by device variation is overcome to a certain extent by using 
large transistor dimensions. However, the gains in size and power dissipation 
described in this work will become increasingly insubstantial when compared to very
deep sub-micron technologies. Fortunately, the analogue PE calibration 
technique allows the circuits to be realised using minimum geometry. The square-law 
characteristic is not very important for realising the MSE matching function. 
Therefore the distance metric block can be scaled in much the same way as digital 
circuits are with new technologies. The other analogue subsystems, such as the memory, 
row/column multiplexer, sample-and-holds and comparator, will also reduce in size.
The half-pixel accuracy used in MPEG-2 and H.263 presents some additional 
complexity. However, with the addition of more sample-and-hold and averaging 
circuits, the architecture can be extended to support it. The same is true of the 
overlapping motion compensation used in MPEG-4 and in the optional advanced mode of 
H.263. These are interesting features that point towards an application-based future 
direction for the project.
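
The averaging required for half-pixel accuracy can be sketched in software. The example below is a minimal illustration that assumes half-pixel samples are plain averages of the two or four nearest integer-position pixels. The function name is hypothetical, and the rounding rules laid down by the standards are deliberately not modelled.

```python
def half_pixel_grid(frame):
    """Interpolate a frame onto a half-pixel grid by neighbour averaging.

    Even output coordinates are the original pixels. Odd coordinates are
    averages of the two (edge) or four (centre) surrounding pixels.
    """
    height, width = len(frame), len(frame[0])
    grid = [[0.0] * (2 * width - 1) for _ in range(2 * height - 1)]
    for y in range(2 * height - 1):
        for x in range(2 * width - 1):
            y0, x0 = y // 2, x // 2
            y1, x1 = y0 + (y % 2), x0 + (x % 2)
            total = frame[y0][x0] + frame[y0][x1] + frame[y1][x0] + frame[y1][x1]
            grid[y][x] = total / 4.0
    return grid


# Hypothetical 2 x 2 frame used only to show the interpolated values.
EXAMPLE_FRAME = [[10, 20], [30, 40]]
EXAMPLE_GRID = half_pixel_grid(EXAMPLE_FRAME)
```

Each interpolated sample needs only additions and a fixed scaling, which is consistent with the observation that extra sample-and-hold and averaging circuits suffice.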
It is clear from the work presented in this thesis that the design of an analogue motion 
estimation processor is time-consuming, and each new technology will certainly 
require an almost complete redesign and custom layout. However, the reduction in power 
dissipation is an order of magnitude, and the approach therefore represents a viable 
solution for portable applications with tight power requirements.
References
[1] R. E. Graham, “Predictive quantization of television signals,” in IRE Wescon 
Conv. Rec., Part 4, pp. 146-156, Aug. 1958.
[2] J. B. O’Neal, Jr., “Predictive quantizing systems (differential pulse code 
modulation) for the transmission of television signals,” Bell Syst. Tech. J., vol. 
45, pp. 689-721, May-June 1966.
[3] F. Rocca, “Television bandwidth compression utilizing frame-to-frame 
correlation and movement compensation,” in 1969 Symp. Picture Bandwidth 
Compression. 1969. New York: Gordon and Breach.
[4] B. G. Haskell and J. O. Limb, “Predictive video encoding using measured 
subjective velocity,” 1972 Jan., no. 3 632 865: US Patent.
[5] J. R. Jain and A. K. Jain, “Displacement measurement and its application in 
interframe image coding,” IEEE Trans. Comm., 1981. COM-29: p. 1799-1808.
[6] CCITT. Recommendation H.261. Dec. 1990. "Line transmission of non-telephone 
signals. Video codec for audiovisual services at p x 64 kbit/s".
[7] K. Guttag et al. "A single-Chip Multiprocessor For Multimedia: The MVP," 
IEEE Computer Graphics and Applications, pp. 53-64. Nov. 1992.
[8] Konstantinides, V. Bhaskaran. “Monolithic Architectures for Image Processing 
and Compression,” IEEE Computer Graphics & Applications. Nov. 1992.
[9] H. Fujiwara et al. "An All-ASIC Implementation of Low Bit-Rate Video 
Decoder," IEEE Trans. on Circuits and Systems. Jun. 1992.
[10] P.A. Ruetz et al. "A High-Performance Full-Motion Compression Chip Set," 
IEEE Trans. on Circuits and Systems. Jun. 1992.
[11] I. Tamitani et al. "An Encoder/Decoder Chip Set for the MPEG Video Standard," 
IEEE ICASSP-92, CS Press, Los Alamitos, Calif., 1992.
[12] D. Bursky. "Improved DSP ICs Eye New Horizons," Electronics Design, Nov. 11, 
1993.
[13] P. Pirsch, N. Demassieux, W. Gehrke. "VLSI Architectures for Video 
Compression - A Survey," Proceedings of the IEEE. Vol. 83, No. 2. Feb. 1995.
[14] E. Vittoz, “Analog VLSI signal processing: Why, where and how?,” Analog 
Integrated Circuits and Signal Processing, pp. 27-44, 1994.
[15] P. Kinget, M. Steyaert, Analog VLSI Integration of Massive Parallel Processing 
Systems, Kluwer Academic Publishers, The Netherlands, 1997.
[16] D. T. Hoang, J. S. Vitter, Efficient Algorithms for MPEG Video Compression, 
Wiley, 2002.
[17] A. N. Netravali, B. G. Haskell, Digital Pictures -  Representation, Compression 
and Standards, 2nd ed. New York: Plenum Press, 1995.
[18] A. K. Jain, “Image Data Compression: A Review,” Proceedings of the IEEE, 
Vol. 69, 1981, pp. 349-389.
[19] Recommendation ITU-R BT.500 (Revised), “Methodology for the subjective 
assessment of the quality of television pictures”
[20] T. K. Tan, M. Ghanbari, D. E. Pearson, “An objective measurement tool for 
MPEG video quality”, Signal Processing, 1, pp. 279-294, 1998.
[21] N. Ahmed, T. Natarajan and K. R. Rao, “Discrete cosine transform,” IEEE Trans. 
Computer, 1974, pp. 90-93.
[22] Y. Wang, J. Ostermann and Y-Q Zhang, Video Processing and Communication, 
Prentice-Hall 2002.
[23] W. H. Chen et al., “A fast computational algorithm for discrete cosine transform,” 
IEEE Trans. Commun., pp. 1004-1009, Sep. 1977.
[24] J. L. Mitchell et al, MPEG Video Compression Standard, Chapman & Hall 1996.
[25] D. G. Hoffman et al, Coding Theory and Cryptography: The Essentials, Marcel 
Dekker Inc., 2000.
[26] A. Gersho, R. M. Gray, Vector Quantisation and Signal Compression, Kluwer 
Academic Publishers, Boston, 1992
[27] T. Baji, et al. “A 20ns CMOS DSP core for video-signal processing,” in ISSCC 
Dig. Tech. Papers, pp. 156-157, Feb. 1988.
[28] K. Kikuchi, et al. “A Single-Chip 16-bit 25-ns Real-Time Video/Image Signal 
Processor,” IEEE J. Solid-State Circuits, vol. 24, pp. 1662-1667, Dec. 1989.
[29] C. C. Cutler, “Differential quantization of communication signals,” U.S. Patent 
2 605 361, July 29, 1952.
[30] ITU-T Recommendation H.261. Video Codec for audiovisual services at p x 64 
kbit/s, Geneva, Aug. 1990.
[31] ISO/IEC 11172, Information Technology—Coding of Moving Pictures and 
Associated Audio— for Digital Storage Media at up to about 1.5 Mbit/s, 1993.
[32] ISO/IEC Committee Draft 13818-2, Information Technology— Generic Coding 
of Moving Pictures and Associated Audio Information: Video, 1995.
[33] ITU, Draft Recommendation H.263—Video Coding for Low Bitrate 
Communication, Geneva, Nov. 1995.
[34] ISO/IEC Committee Draft 14496-X, Information Technology - Coding of 
audio-visual objects, 1999.
[35] B. Furht, J. Greenberg and R. Westwater, Motion Estimation Algorithms for Video 
Compression, 1997, Kluwer Academic Publishers.
[36] M. Ghanbari, Video Coding: An Introduction to Standard Codecs, The Institution 
of Electrical Engineers, London, 1999.
[37] H. Gharavi, M. Mills, “Blockmatching Motion Estimation Algorithms -  New 
Results,” IEEE Transactions on Circuits and Systems, Vol. 37, No. 5 May 1990.
[38] S. Nogaki, M. Ohta, “An overlapping block motion compensation for high 
quality motion picture coding,” IEEE Int. Conf. Circuits and Systems, pp. 184-187, 
May 1992.
[39] M. T. Orchard, G. J. Sullivan, “Overlapping block motion compensation: An 
estimation-theoretic approach,” IEEE Trans. Image Process., pp. 693-699, 1994.
[40] ITU, Recommendation H.263—Video Coding for Low bit rate communication, 
1998.
[41] A. C. Downton, “Speed-up trend analysis for H.261 and model-based image 
coding algorithms and parallel-pipeline model,” Signal Processing, Image 
Commun. pp. 489-502, 1995.
[42] T. Koga, et al. “Motion compensated interframe coding for video conferencing,” 
in Proc. NTC 81. 1981. New Orleans.
[43] A. Puri et al, “An efficient block-matching algorithm for motion compensated 
coding,” in Proc. IEEE ICASSP’87, pp. 25.4.1-25.4.4, 1987.
[44] R. Srinivasan, K. R. Rao, “Predictive coding based on efficient motion 
estimation,” IEEE Inter. Conf. On Commun. pp. 521-526, May 1984.
[45] M Bierling, “Displacement estimation by hierarchical block matching”, 
November 1988, Proc. SPIE VCIP 1988, Cambridge, MA.
[46] A. Puri, H. M. Hang, D. L. Schilling, “Interframe coding with variable block-size 
motion compensation,” in GLOBECOM’87, pp. 65-69, Nov. 1987.
[47] M. H. Chan, Y. B. Yu, A. G. Constantinides, “Variable size block matching 
motion compensation with application to video coding,” Proc. IEE, Pt. I, vol. 
137, pp. 205-212, Aug. 1990.
[48] G. J. Sullivan, R. L. Baker, “Rate-distortion optimized motion compensation for 
video compression using fixed or variable size blocks,” in GLOBECOM'91, 
(Phoenix, Arizona), pp. 85-90, Dec. 1991.
[49] C. D. Kuglin, D. C. Hines, “The phase correlation image alignment method,” 
Proc. IEEE Int. Conf. Cybern. Soc. San Francisco, Sep. 1975.
[50] Y. Fok, O. Au. “Novel fast motion estimation in feature subspace,” In ICIP 1995, 
(September).
[51] B. Liu, A. Zaccarin, “New Fast Algorithms for the Estimation of Block Motion 
Vectors,” IEEE Trans. on Circuits and Systems for Video Technology, pp. 148-157, 
Vol. 3, No. 2, April 1993.
[52] Y. Baek, H. Oh, H. Lee, “Block-matching criterion for efficient VLSI 
implementation for motion estimation,” IEE Electronics Letters, pp. 1184-1185, 
vol. 32, no. 13, Jun 1996.
[53] P. Kinget and M. Steyaert, “A programmable analogue CMOS chip for high 
speed image processing based on cellular neural networks,” in Proceedings of 
Custom Integrated Circuits Conference (San Diego), pp. 570-573, 1994.
[54] P. Kinget and M. Steyaert, “A programmable analogue cellular neural networks 
CMOS chip for high speed image processing,” IEEE J. Solid-State Circuits, vol. 
30, pp. 235-243, March 1995.
[55] P. Kinget and M. Steyaert, “An analog parallel array processor for real-time 
sensor signal processing,” in Digest of Technical Papers IEEE International 
Solid-State Circuits Conference, pp. 92-93, Feb. 1996.
[56] J. Cruz and L. Chua, “A CNN chip for connected component detection,” IEEE 
Transactions on Circuits and Systems vol. 38, pp. 810-817, July 1991.
[57] S. Espejo et al., “Smart-pixel cellular neural networks in analog current mode 
CMOS technology,” IEEE J. Solid-State Circuits, vol. 29, pp. 895-905, Aug.
1994.
[58] H. Kobayashi, J. White and A. Abidi, “An analog CMOS network for gaussian 
convolution with embedded image sensing,” in Digest of Technical Papers IEEE 
International Solid-State Circuits Conference, pp. 216-217, Feb. 1990.
[59] J. Madrenas et al, “A CMOS Analog Circuit for Gaussian Functions,” IEEE 
Transactions on Circuits and Systems-II: Analog and Digital Signal Processing, 
Jan. 1996, Vol. 43, No. 1, pp. 70-74.
[60] Bin-Da Liu, Chuen-Yau Chen, Ju-Ying Tsao, “A Modular Current-Mode 
Classifier Circuit for Template Matching Applications,” IEEE Transactions on 
Circuits and Systems-II: Analog and Digital Signal Processing, pp. 145-151, 
Vol. 47, No. 2, February 2000.
[61] Wai-Chi Fang, Bing J. Sheu, Oscal T.-C. Chen and Joongho Choi, “A VLSI 
Neural Processor for Image Data Compression Using Self-Organization 
Networks,” IEEE Transactions on Neural Networks, May 1992, Vol. 3, No. 3, 
pp. 506-518.
[62] Shang-Yi Lin, Ren-Jiun Huang and Tzi-Dar Chiueh, “A Tunable 
Gaussian/Square Function Computation Circuit for Analog Neural Networks,” 
IEEE Transactions on Circuits and Systems-II: Analog and Digital Signal 
Processing, March 1998, Vol. 45, No. 3, pp. 441-446.
[63] S. Churcher, A. F. Murray and H. M. Reekie, “Programmable Analogue VLSI For 
Radial Basis Function Networks”, Electronics Letters, 1993, Vol. 29, No. 18, pp. 
1603-1605.
[64] U. Çilingiroğlu and D. Y. Aksin, “A 4-Transistor Euclidean Distance Cell For 
Analog Classifiers”, IEEE Int. Conf. Circuits and Systems, Vol. 1, pp. 84-87, 
June 1998.
[65] A. Gopalan and A. H. Titus, “A New Wide Range Euclidean Distance Circuit for 
Neural Network Hardware Implementations,” IEEE Transactions on Neural 
Networks, Vol. 14, No. 5, Sept. 2003.
[66] S. Collins, G. F. Marshall and D. R. Brown, “An Analogue Radial Basis Function 
Circuit using a Compact Euclidean Distance Calculator,” in Proc. IEEE Int. 
Conf. Circuits and Systems, pp. 233-236, 1994.
[67] V. Pedroni, “Highly linear high-density vector quantizer and vector-matrix 
multiplier,” IEE Electronics Letters, vol. 30, pp. 945-946, Jun. 1994.
[68] F. J. Kub, K. K. Moon, I. A. Mack, and F. M. Long, “Programmable analog 
vector-matrix multipliers,” IEEE J. Solid-State Circuits, vol. 25, pp. 207-214, 
Feb. 1990.
[69] V. Pedroni, “Error-Compensated Analog Cells for Vector Multiplication and 
Vector Quantizer,” IEEE Trans. Circuits Syst. II, vol. 48, pp. 511-519, No. 5, 
May. 2001.
[70] G. Cauwenberghs and V. Pedroni, “A low-power CMOS analog vector 
quantizer,” IEEE J. Solid-State Circuits, vol. 32, pp. 1278-1283, Aug. 1997.
[71] G. T. Tuttle, S. Fallahi, and A. A. Abidi, “An 8b CMOS vector A/D converter,” 
in IEEE Int. Solid-State Circuits Conf. Dig. Tech. Papers, Feb. 1993, pp. 38-39.
[72] M. Pelgrom, A. Duinmaijer and A. Welbers, “Matching properties of MOS 
transistors,” IEEE J. Solid-State Circuits, vol. 24, pp. 1433-1439, Oct 1989.
[73] J. Bastos et al. “Mismatch characterization of small size MOS transistors,” in 
Proceedings of IEEE International Conference on Microelectronic Test 
Structures, pp. 271-276, March 1995.
[74] A. Tomasini et al, “B/W adaptive image grabber with analog motion vector 
estimator at 0.3GOPS,” IEEE Int. Solid-State Circuits Conf. (ISSCC’96), San 
Francisco, CA, pp. 94-95, Feb. 1996.
[75] A. Demosthenous, S. Smedley and J. Taylor, “A CMOS Analog Winner-Take- 
All Network for Large-Scale Applications,” IEEE Transactions on Circuits and 
Systems-I: Fundamental Theory and Applications, March 1998, Vol. 45, No. 3, pp. 
300-304.
[76] A. Demosthenous, J. Taylor and G. Morrison, “An Analogue Approach to the 
Design of Motion Estimators for Digital Video Encoding,” IEEE Int. Conf. 
Circuits and Systems, pp. 675-678, May 2000.
[77] E. Seevinck, R. F. Wassenaar, and H. C. K. Wong, “A wide-band technique for 
vector summation and rms-to-dc conversion,” IEEE J. Solid-State Circuits, vol. 
SC-19, pp. 311-318, 1984.
[78] P. R. Gray, P. J. Hurst, S. H. Lewis and R. G. Meyer, Analysis and Design of 
Analog Integrated Circuits, 4th Edition, Wiley, 2001.
[79] O. A Seriki and R. W Newcomb, “Direct-Coupled MOS Squaring Circuit”, IEEE 
J. Solid-State Circuits, August 1979, SC-14, pp. 766-768.
[80] D. W. J. Groeneveld et al, “A Self-Calibrating Technique for Monolithic High- 
Resolution D/A Converters,” IEEE J. Solid-State Circuits, Dec. 1989, Vol. 24, 
No. 6, pp. 1517-1522.
[81] J. B. Hughes, N. C. Bird and L. C. Macbeth, “Switched currents - a new technique 
for analogue sampled-data signal processing,” in Proc. IEEE Int. Conf. Circuits 
and Systems, pp. 1584-1587, 1989.
[82] K. Bult, and H. Wallinga, “A class of analog CMOS circuits based on the square- 
law characteristic of an MOS transistor in saturation,” IEEE J. Solid-State 
Circuits, June 1987, SC-22, No.3, pp. 357-365.
[83] O. Landolt, E. Vittoz and P. Heim, “CMOS Selfbiased Euclidean Distance 
Computing Circuit with High Dynamic Range,” IEE Electronics Letters, vol. 28, 
pp. 352-353, Feb. 1992.
[84] Chuen-Yau Chen, Chun-Yueh Huang, Ju-Ying Tsao and Bin-Da Liu, “A 
Current-Mode Circuit for Euclidean Distance Calculation,” IEEE Proceedings of 
Technical Papers, International Symposium on VLSI Technology, Systems and 
Applications, June 1997.
[85] Chun-Yueh Huang, Chuen-Yau Chen and Bin-Da Liu, “Current-mode fuzzy 
linguistic hedge circuit for adaptive fuzzy logic controllers,” IEE Electronics 
Letters, vol. 31, pp. 1517-1518, Aug. 1995.
[86] D. Montanari, J. Van Houdt, G. Groeseneken, H. E. Maes, “Novel level-identifying 
circuit for flash multilevel memories,” IEEE J. Solid-State Circuits, vol. 33, pp. 
1090-1095, June 1998.
[87] G. Marshall and S. Collins, “A Compact Analogue Radial Basis Function 
Circuit,” IEE Artificial Neural Networks, Conference publication No. 409, June
1995.
[88] J. Babanezhad and G. C. Temes, “A 20-V four-quadrant CMOS analog 
multiplier,” IEEE J. Solid-State Circuits, 1985, vol. SC-20, pp. 1158-1168.
[89] M. P. Craven, B. R. Hayes-Gill and K. M. Curtis, “Two Quadrant Analogue 
Squarer Circuit Based On MOS Square-Law Characteristic,” Electronics Letters, 
1991, Vol. 27, No. 25, pp. 2307-2308.
[90] G. Giustolisi, G. Palmisano and G. Palumbo, “1.5 V power supply voltage 
Squarer,” Electronics Letters, 1997, Vol. 33, No. 13, pp. 1134-1136.
[91] E. Seevinck and R. F. Wassenaar, “A versatile CMOS linear 
transconductor/square-law function circuit,” IEEE J. Solid-State Circuits, SC-22, 
pp. 366-377, June 1987.
[92] Nedungadi and T. R. Viswanathan, “Design of linear CMOS transconductance 
elements,” IEEE Trans. Circuits Syst., vol. 31, pp. 891-894, Oct. 1984.
[93] C. T. Sah, “Characteristics of the Metal-Oxide-Semiconductor Transistor,” IEEE 
Trans. Electron. Devices, Vol. ED-11, pp. 324-345, July 1964.
[94] M. Panovic and A. Demosthenous, “Compact CMOS Linear Transconductor and 
Four Quadrant Analogue Multiplier,” Proc. ISCAS 2004, Vancouver, Canada, 
May 2004, vol. 1, pp. 685-688.
[95] K. Kimura, “Analysis of an MOS four-quadrant analog multiplier using two- 
input squaring circuits with source followers,” IEEE Trans. Circuits Syst.-I: 
Fundamental Theory and Applications, vol. 41, pp. 72-75, Jan. 1994.
[96] A. P. Nedungadi and R. L. Geiger, “High-frequency voltage-controlled 
continuous-time filter using linearized CMOS integrators,” Electron. Lett., vol. 
22, pp. 729-731, Jun. 1986.
[97] H.-J. Song, and C-K. Kim, “A MOS four-quadrant analog multiplier using simple 
two-input squaring circuits with source followers,” IEEE J. Solid-State Circuits, 
vol. 25, pp. 841-848, June 1990.
[98] V. Peluso, P. Vancorenland, M. Steyaert and W. Sansen, “900mV Differential 
class AB OTA for switched opamps application,” Electron. Lett., 1997, Vol. 33, 
No. 17, pp. 1455-1456.
[99] J. Ramirez-Angulo, R. G. Carvajal, A. Torralba, J. Galan, A. P. Vega-Leal, and J. 
Tombs, “The flipped voltage follower: a useful cell for low-voltage low-power 
circuit design”, in Proc. 2002 IEEE Int. Symp. Circuits Syst., Phoenix, AZ, May 
2002, pp. 615-618.
[100] V. Peluso, P. Vancorenland, A. M. Marques, M. Steyaert and W. Sansen, “A 
900-mV Low-Power ΔΣ A/D Converter with 77-dB Dynamic Range,” IEEE J. 
Solid-State Circuits, vol. 33, pp. 1887-1897, Dec. 1998.
[101] K. R. Laker and W. M. Sansen, Design of Analog Integrated Circuits and 
Systems, 1994.
[102] Z. Wang, “Analytical determination of output resistance and DC matching errors 
in MOS current mirrors,” IEE Proceedings. Vol. 137, No. 5, Oct. 1990.
[103] Y. Tsividis, Operation and Modelling of the MOS Transistor, Electrical & 
Electronic Engineering Series, New York: McGraw Hill, 1988.
[104] G. W. Taylor, “Velocity-saturated characteristics of short channel MOSFETs,” 
AT&T Bell Lab. Tech. Journal, vol 63, 1984, pp. 1325-1404.
[105] K. Y. Toh, Ping-Keung Ko, and R. Meyer, “An engineering model for short 
channel MOS devices”, IEEE J. Solid State Circuits, vol 23, 1988, pp. 950-958.
[106] T. Sakurai and A. R. Newton, “Alpha power law MOSFET model and its 
applications to CMOS inverter delay and other formulas,” IEEE J. Solid State 
Circuits, vol 25, 1990, pp. 584-594.
[107] K. Chen et al, “An accurate semi-empirical saturation drain current model for 
LDD N-MOSFET”, IEEE Electron Device Lett., vol 17, 1996, pp. 145.
[108] J. Millman and C. C Halkias, Integrated Electronics, McGraw Hill, 1971.
[109] AustriaMicroSystems (AMS) Int. AG, “0.8 µm CMOS Process Parameters,” 
Doc. 9933006, Rev. B, April 1997.
[110] J. Shieh, M. Patil and B. L. Sheu, “Measurement and Analysis of Charge 
Injection in MOS Analog Switches,” IEEE J. Solid State Circuits, vol 22, Apr. 
1987, pp. 277-281.
[111] J. McCreary and P. R. Gray, “All-MOS Charge Redistribution Analog-to-Digital 
Conversion Techniques-Part 1,” IEEE J. Solid State Circuits, vol SC-10, Dec. 
1975, pp. 371-379.
[112] C. Eichenberger and W. Guggenbuhl, “On Charge Injection in Analog MOS 
Switches and Dummy Switch Compensation Techniques,” IEEE Transactions on 
Circuits and Systems, Feb. 1990, Vol.37, No. 2, pp. 256-264.
[113] P. J. Lim, B. A. Wooley, “A High-speed sample-and-hold technique using a 
Miller hold capacitance,” IEEE J. Solid State Circuits, vol. 26, pp. 643-651, April 
1991.
[114] M.-J. Chen, Y.-B. Gu, J.-Y. Huang, W.-C. Shen, T. Wu, and P.-C. Hsu, “A 
compact high-speed Miller-capacitance-based sample-and-hold Circuit,” IEEE 
Transactions on Circuits and Systems-I, vol. 45, pp. 198-201, Feb. 1998.
[115] Steven R. Norsworthy, Richard Schreier, Gabor C. Temes, Delta-Sigma Data 
Converters: Theory, Design and Simulation, IEEE Press, 1997.
[116] B. Razavi and B. A. Wooley, “Design techniques for high-speed, high-resolution 
comparators,” IEEE J. Solid State Circuits, vol. 27, pp. 1916-1926, Dec. 1992.
[117] Thou-Ho Chen, “A cost-effective three-step hierarchical search block-matching 
chip for motion estimation,” IEEE Journal of Solid State Circuits, Vol. 33, No. 8, 
pp. 1253-1258, Aug. 1998.
[118] Y.-K. Lai and L.-F. Chen, “A performance-driven configurable motion estimator 
for full-search block-matching algorithm,” Proc. ISCAS 2004, Vancouver, 
Canada, May 2004, vol. 2, pp. 233-236.
